
James E. Lumpp, Jr., Samuel A. Fineberg, Wayne G. Nation, Thomas L. Casavant, Edward C. Bronson, Howard Jay Siegel, Pierre H. Pero, Thomas Schwederski, and Dan C. Marinescu

Programming parallel machines is very difficult. First, generating an algorithm requires the programmer to assimilate the interactions of multiple threads of control. Second, synchronization and communication among the threads must be addressed to avoid contention and deadlock. Then, once the program is executing on the parallel system and does not function correctly or performs poorly, the debugging¹ of multiple threads is a complicated problem [21]. Additionally, debugging software is an activity that requires systematic attention to detail. Success is a function of the experienced individual involved and the tools employed. The ability to efficiently debug software requires the wisdom to know what questions to ask, the ability to analyze the answers received, and the knowledge to formulate the best next question. To aid in this interactive process, the programmer needs information about the run-time behavior of the program.

The debugging process is actually the classic scientific method: formulate a hypothesis (e.g., possible cause of an error encountered), generate an experiment (e.g., instrument the program or specify actions to be taken by the monitoring tools), collect data (e.g., execute the program with debugging statements inserted or with breakpoints set), analyze results (e.g., review traces of execution or check aspects of process state), test the hypothesis, and, if necessary, formulate a new hypothesis. It may be possible for tools to direct the user toward possible sources of contention in data access or possible logic errors in conditionals, thereby guiding the user toward the cause of the errors.

¹ In this article, debugging refers to the process of modifying a program to execute both correctly and effectively; e.g., the identification of and compensation for contention for a shared resource is considered debugging.

104 November 1991/Vol.34, No.11/COMMUNICATIONS OF THE ACM

It also may be possible to display the information to the user in a form that is easily assimilated, making the debugging process more efficient. No tool set, however, will ever automatically debug programs. It will always be the user's job to formulate possible causes of errors or to decide which of the possible performance problems to examine. A fundamental goal of the tool designer is to make the experimentation and analysis steps of the process as powerful and efficient as possible. If coding aids accomplish nothing else, they should aid the programmer in posing questions, gathering information, and analyzing data. Therefore, the quantity and quality of information provided is important.

For serial programs, debugging information is fairly easy to obtain and can be gained solely with (intrusive) software probes. Since there is a single thread of control, it can usually be delayed arbitrarily while the state of the program is observed. Breakpoints can be set and the contents of program variables checked to obtain information about the instantaneous state of the program. Utilities exist that aid the programmer in setting breakpoints and in examining the state of the program. Tools are also available for profiling serial programs to obtain statistics on the execution time of various sections of the program. These techniques are invaluable in the debugging phase of programming for serial computers, which often consumes the bulk of program development time.

These techniques, however, cannot be extended directly to the parallel case due to the existence of multiple threads of control in parallel programs [22]. Since parallel programs often depend on the synchronization of threads of control (e.g., the order of access to shared resources is important), changing the flow of the computation to gather status information, such as the values of program variables, can greatly affect the execution of the program. This phenomenon is called intrusion by the monitoring system. It becomes necessary to provide some architectural support in the form of hardware instrumentation to aid debugging efforts. A software probe of one thread can supply only local information and cannot affect other threads (e.g., stop them to examine system state). In addition, if the monitoring system affects the relative timing of different threads, the execution time of the program may increase, or (worse yet) deadlock situations could be created or masked, and the debugging of the parallel program may become impossible. An example would be the debugging of a program in which the ordering of two events in different nodes is important and the monitoring changes the ordering of these two events.
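The reordering effect described above can be illustrated with a small timing model. This is only a sketch under simplifying assumptions: the function name `event_order`, the abstract time units, and the probe cost are all hypothetical and are not taken from CAPS or PASM.

```python
def event_order(work_a, work_b, probe_cost_a=0.0):
    """Return the names of two events in the order in which they occur.

    Node A reaches its event after `work_a` time units plus any time
    spent in a monitoring probe; node B reaches its event after
    `work_b` time units. All values are abstract, illustrative units.
    """
    time_a = work_a + probe_cost_a
    time_b = work_b
    events = [(time_a, "A"), (time_b, "B")]
    return [name for _, name in sorted(events)]


# Without monitoring, node A's event precedes node B's.
unmonitored = event_order(work_a=10.0, work_b=12.0)

# A probe on node A that costs 5 time units reverses the ordering,
# so the monitored run no longer behaves like the unmonitored one.
monitored = event_order(work_a=10.0, work_b=12.0, probe_cost_a=5.0)

print(unmonitored)  # ['A', 'B']
print(monitored)    # ['B', 'A']
```

If the program's correctness depends on A preceding B, the monitored run could expose a failure that never occurs in production, or hide one that does.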

COMPUTING PRACTICES

This work describes a system designed to aid in software development for a PArtitionable SIMD/MIMD (PASM) prototype, which is capable of operation in both single instruction stream multiple data stream (SIMD) and multiple instruction stream multiple data stream (MIMD) modes. A primary goal in the design of Coding Aid for the PASM System (CAPS) was to provide users with the ability to monitor their programs with maximum flexibility and minimum intrusion, given the existing system hardware, while providing information on a wide range of program attributes. Rigorous limits were placed on the cost and complexity of the system, however, to have a working, usable tool in a period of about three months for a cost of under $1,000. CAPS integrates hardware support and software tools to provide a remote execution and program debugging/monitoring environment for the PASM prototype, designed and constructed at Purdue University [30, 31]. CAPS is the current generation of monitoring hardware and software for PASM. This includes specialized hardware added to the prototype and software servers, running on PASM and the user's workstation, to facilitate the transfer and presentation of information to the programmer. CAPS is currently used to assist development of application and system software for PASM, as well as in experimental system evaluation [4, 5, 10, 11] and to drive higher-level visualization tools [15].

An environment of this type is useful because it allows the machine to be accessed from a remote site. Also, on a partitionable machine, multiuser access is permitted with users working at separate remote sites using different parts of the machine concurrently. Downloading application code and the development of code may be integrated into the same remote environment. The remote machine used for program development may have software tools not available on the target machine. Support for run-time monitoring of parallel programs is also crucial in the debugging phase of parallel software development. Information gained through the actual execution of a program on the parallel machine is invaluable in correcting and optimizing the operation of programs. Finally, the addition of such an environment to an existing machine can be inexpensive, as shown by the implementation presented here.

CAPS consists of a set of dedicated I/O channels and associated hardware and software that facilitate bidirectional information flow between the individual nodes of PASM and a workstation providing the user interface. The information is specified by source-level program statements (code) added to the user's program that transmit messages through dedicated I/O channels. Once the data is sent from the nodes, it is combined into a single stream and sent through a high-bandwidth Local Area Network (LAN) to a workstation where it is used to debug and analyze the execution of programs.

Some degree of hardware instrumentation is necessary to keep intrusion to a reasonable level. A "reasonable" level, however, depends on the program being monitored and the goal of the monitoring. CAPS was designed to keep the hardware enhancements to the architecture at minimal cost. The techniques used can be applied to a broad class of parallel machines, including the PASM parallel processing system, and almost any multiprocessor system. The individual processors of these machines must be provided with the capability to transmit debug/trace information over an I/O channel to a location where the information can be collected and forwarded to a remote site.

Background

Run-time support for monitoring is crucial in debugging multiprocessor software. Static support tools can aid the programmer in coding the program more efficiently or make the task of programming easier, but run-time support provides information on the actual interaction of the program and the parallel machine. This allows the programmer to optimize the program's performance, and facilitates the location of errors in the program. Additionally, interactive I/O with the nodes of the system allows their individual states to be examined. CAPS provides dynamic support and interactive I/O for the PASM parallel processing system.

A number of systems support software development by providing run-time monitoring support. As early as 1975, McDaniel proposed a kernel instrumentation for distributed environments [20]. Some distributed systems designed in the early 1980s included some sort of performance monitoring facility, usually a software monitor (e.g., [7] for the V kernel and [25] for DEMOS/MP). Systems for interactive debugging of a distributed computational environment were also designed [27].

Currently, several instrumented systems are in different stages of completion. Among these are CARAT at the University of Massachusetts and the Distributed Computer Testbed at Honeywell. The real-time monitoring systems at The Ohio State University can be based on either an Ethernet or Hyperchannel [35]. IPS at the University of Wisconsin aids in guiding the user to the sources of inefficiencies [23]. The resource monitoring system (REMS) at the National Bureau of Standards and the performance monitoring facilities for RP3 [6] at IBM feature substantial hardware support. The Faust project for Cedar at the University of Illinois includes hardware and operating system support for time-stamping significant events occurring during program execution. Instant Replay at Carnegie-Mellon University allows the replay of a program from trace files [17]. The high-level debugging environment at the University of Massachusetts is a sophisticated environment based on the Event-Based Behavioral Abstraction (EBBA) paradigm [2, 3]. Parasight at Encore is an example of a software monitor for shared-memory parallel processing systems [1]. Also, the SEECUBE package allows the visualization of communication in parallel programs on the NCube, a hypercube system [8].

Several systems have also been designed to provide interactive I/O with the nodes of the parallel machine. The most common method of interactive I/O among the global system bus machines (e.g., Sequent Balance or ELXSI [33]) is by means of a separate I/O unit on the system bus. This I/O unit handles all user I/O between the CPUs and a number of terminals connected to the system. All I/O passes over the system bus between the CPUs and I/O unit, increasing bus traffic and interfering with the computation unit. The FLEX is an example of a distributed memory global system bus machine that can be configured with an I/O unit for each CPU [12]. No device for concentrating data into a single stream exists. The BIO-link of a BBN Butterfly processing node can interface to an external device [34]. Again, no instrumentation is included to combine user I/O from all processing nodes. The NCube machine has an internal channel, permitting a processing node to send/receive information to/from the system's host processor [13], and the Intel iPSC/860 has dedicated I/O nodes for access to mass storage [14]. This information passes through separate I/O processing nodes. Future systems will most likely provide more advanced support for monitoring.

All of these systems provide user-directed I/O with either a single processor or a group of processors through dedicated I/O units. These systems, however, do not support a coherent user environment that combines code development tools, graphics, and interactive I/O with all processors on a sophisticated workstation as is the case with CAPS.

PASM Overview

PASM is a dynamically reconfigurable architectural design for a thousand-processor machine where: (1) the processors may be partitioned to form cooperating or independent submachines of various sizes, (2) each submachine can operate in SIMD or MIMD mode, and switch between the two with instruction-level granularity and generally negligible overhead (this is referred to as mixed-mode parallel computation [11]), and (3) the processors communicate through a multistage cube network that can be software configured to provide different connection topologies [30, 31]. A 30-processor prototype has been constructed and is in use in the Parallel Processing Laboratory at the Purdue University School of Electrical Engineering. The prototype utilizes Motorola MC68000 processors. Figure 1 shows a block diagram of the basic components of the prototype.

Figure 1. Block diagram of the prototype of the PASM parallel processing system. Legible component labels: PCU, MSS, MMS, IOP, and CS. Legend: Direct Memory Access Link, GPIB, SIMD Instruction Broadcast Bus, Parallel Port Link.

The Parallel Computation Unit (PCU) contains N = 2^n processing elements (PEs) (numbered from 0 to N - 1) connected by an Extra Stage Cube interconnection network (a fault-tolerant variation of the multistage cube network) [30]. Each PE is a sophisticated microprocessor/memory pair that performs the actual SIMD and MIMD operations. The PE memory modules are used by the processors for data storage in SIMD mode and both data and instruction storage in MIMD mode. The Micro Controllers (MCs) are a set of Q = 2^q processors numbered from 0 to Q - 1. The MCs act as the control units for the PEs in SIMD mode, sending instructions via the SIMD instruction broadcast bus, and coordinate the activities of the PEs in MIMD mode through a general-purpose interface bus (GPIB) (see Figure 1). Each MC controls a fixed set of N/Q PEs. PASM is being designed for N = 1024 and Q = 32. The prototype has N = 16 PEs and Q = 4 MCs.
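The relationship between N, Q, and the size of each MC group can be sketched in a few lines. The text states only that each MC controls a fixed set of N/Q PEs; the specific assignment used below (PE i belongs to MC i mod Q) is an assumption inferred from the sample partition described in this article, in which MC 0 is grouped with PEs 0, 4, 8, and 12. The function name `pes_of_mc` is hypothetical.

```python
def pes_of_mc(mc, n_pes=16, n_mcs=4):
    """List the PEs in the group of Micro Controller `mc`.

    Assumption (inferred, not stated as a rule in the text): PE i is
    controlled by MC (i mod Q). Defaults are the prototype's N = 16
    and Q = 4.
    """
    if not 0 <= mc < n_mcs:
        raise ValueError(f"MC number must be in 0..{n_mcs - 1}")
    return [pe for pe in range(n_pes) if pe % n_mcs == mc]


prototype_group = pes_of_mc(0)
print(prototype_group)  # [0, 4, 8, 12]

# Full design point: N = 1024 PEs and Q = 32 MCs gives N/Q = 32 PEs per MC.
full_group = pes_of_mc(0, n_pes=1024, n_mcs=32)
print(len(full_group))  # 32
```

Under this assumption every group has exactly N/Q members, which matches the group sizes given in the text for both the prototype and the full design.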

The System Control Unit (SCU) is responsible for the overall coordination of the other components of PASM. In the prototype, the SCU connects to an Ethernet-based LAN that includes several dozen minicomputers, ranging in performance from 1 to 12 Mflops, and over a hundred Sun Workstations™ on the Engineering Computer Network (ECN) at Purdue University. The Memory Management System (MMS) is composed of multiple processors that control secondary storage and file transfer to/from the PEs. The Memory Storage System (MSS) provides multiple secondary storage devices for the PEs. The Control Storage (CS) provides secondary storage space for the MCs and the SCU. One of the MMS processors of the prototype, the I/O Processor (IOP), is responsible for interfacing to external I/O devices and distributing data to the MSS. The SCU communicates with the MCs and the I/O Processor through individual parallel port connections. As previously mentioned, there are a total of 30 processors in the PASM prototype: 16 PE CPUs, four MC CPUs, four MMS CPUs, one CPU associated with each of the four MSS disks and the CS, and the SCU CPU.

User's View of CAPS

The process of monitoring a program's execution begins with deciding which aspects of the program's behavior to examine and then instrumenting the application code (either manually or automatically) with monitoring statements. These statements send messages through dedicated I/O channels to the monitoring system. The data sent from the nodes are collected and can be presented to the user in a number of forms including textual or graphical. The display can potentially provide information on each node or on the system as a whole. The physical interface to the user is through a high-resolution graphics workstation running an interactive terminal screen control program called X-windows [26].

When a user instruments a program, the goal is to gain information about the occurrence of events of interest during the execution of the program. In this article, an event is any change of system state. For example, a thread of control reaching a predefined point in the computation and a change in some memory location (program variable) are both events. Events of interest are therefore a subset of the events that occur during the execution of the program, chosen because knowledge of their occurrence conveys particularly useful information to the user. The ability to monitor events occurring in the program in this way allows the user to monitor arbitrary attributes of the program.

From these events of interest, the user can obtain information on more global issues relating to program execution. Some events are known, a priori, to be of interest in almost all applications (e.g., the beginning and end of program execution or certain calls made to the operating system). In this case, the operating system could send information about the computation through the dedicated I/O channels or the compiler could insert extra executable statements into the application to indicate event occurrence.

Regardless of whether the monitoring statements are inserted by the user or by the system, both are intrusive measures because the computation is interrupted for the time it takes to mark the event. The user-intrusive approach (monitoring statements added manually) requires the most programmer effort, but is quite flexible. It can be as simple as print statements scattered throughout the code. It also can involve sophisticated logging operations with automatic occurrence time recording added to the user program. This log of events can be stored in memory and reviewed on the workstation once program execution has completed. At the workstation, data records from all PEs can be analyzed by the user either in their raw form or after processing by the workstation to present a graphic timing diagram of the program execution or other meaningful displays. With this information the programmer can detect errors, bottlenecks in the code, and network conflicts, and make appropriate code modifications.
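The logging style described above (time-stamped event records kept per PE, then reviewed together afterward) can be sketched as follows. This is a minimal illustration, not the CAPS record format: the names `EventRecord`, `EventLog`, `mark`, and `merge_logs` are hypothetical, and timestamps are passed in explicitly so the example is deterministic, whereas a real probe would read a clock.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class EventRecord:
    """One logged event; records sort by (timestamp, pe)."""
    timestamp: int
    pe: int
    label: str = field(compare=False)


class EventLog:
    """In-memory event log kept by a single PE (hypothetical structure)."""

    def __init__(self, pe):
        self.pe = pe
        self.records = []

    def mark(self, timestamp, label):
        # Records are assumed to be appended in increasing time order.
        self.records.append(EventRecord(timestamp, self.pe, label))


def merge_logs(logs):
    """Merge per-PE logs (each already time-ordered) into one timeline."""
    return list(heapq.merge(*(log.records for log in logs)))


log0 = EventLog(pe=0)
log0.mark(5, "stage 1 start")
log0.mark(20, "stage 1 end")

log4 = EventLog(pe=4)
log4.mark(7, "stage 1 start")
log4.mark(18, "stage 1 end")

timeline = merge_logs([log0, log4])
for rec in timeline:
    print(rec.timestamp, f"PE{rec.pe}", rec.label)
```

A merged timeline of this kind is the raw material for a timing diagram; it is only meaningful across PEs if their timestamps come from comparable clocks, which this sketch simply assumes.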

The system-intrusive (monitoring statements added automatically) approach releases the programmer from the chore of inserting monitoring probes at the cost of some flexibility. The user is limited to studying only the events recognized by the operating system. Events such as system calls, network accesses, and SIMD/MIMD mode switches can be logged and analyzed as described earlier. Currently in CAPS, the debugging and monitoring code is inserted manually by the programmer as macros that are expanded by the compiler.


A major advantage of a system like CAPS is that it allows the programmer to use some sophisticated features of a graphics workstation to process I/O coming from the parallel system. The workstation's windowing capability allows textual debugging information to be displayed at several interactive virtual terminals, each connected to a different processor. The information can also be used to generate graphic displays summarizing collected data.

The graphics workstation in the current CAPS implementation is a Sun Workstation [32] running X-windows. These windows allow interactive I/O between the workstation and any processor on the PASM system. For textual output, each PASM processor can have its own window on the workstation. Each window can be adjusted to any size and windows may overlap if necessary. A window can act as a terminal, allowing access to the processor's resident monitors, which are based on Motorola's MVMEBUG software [24]. Monitor features include printing memory and register contents, setting program breakpoints, and disassembly of segments of memory. The programmer may also send displayed information to disk or review information that has scrolled past the screen using standard X-windows utilities.

In a typical debugging session, the CAPS server on a Sun opens windows to PASM's SCU, the host where the parallel program is being developed, and to the processors of interest on the PASM system. The window to the host machine is opened through a standard Unix remote login procedure. CAPS is invoked on the workstation and opens windows to the desired PASM processors. The SCU window can be opened via a remote login from the Sun and provides access to Unix System V on the SCU. In this configuration, the programmer can execute programs on PASM and use program output to help debug the software. The PEs' resident monitors also can help detect errors. The monitoring information can be used to make changes in the source code on the host, then the program can be recompiled/assembled, reloaded, and reexecuted on PASM, all within the CAPS environment. This procedure can be carried out on any workstation capable of running X-windows attached to the LAN, as well as remotely from any Internet site with the same capabilities.

The sample screen of a Sun Workstation during a typical programming and debugging session is shown in Figure 2. Five windows are opened to a 4-PE machine partition or MC group consisting of MC 0 and PEs 0, 4, 8, and 12. This partition is running a mixed-mode SIMD/MIMD 8-point fast Fourier transform program. The window entitled "PASM Parallel Processing System" is open to the SCU, and displays the operations necessary to download the program code and begin execution of the program on the 4-PE partition. Another window is open to a VAX 11/780 on the ECN LAN and is being used for program development. Additional windows are open for communication to the Sun Workstation. One is being used as the workstation console and another is the CAPS control window. It can be seen from the CAPS control window that the user has requested two MC groups, mcg0 and mcg1 (the windows of mcg1 are iconified). It also shows that another user has reserved a partition containing MC group three and PE group two, as seen by the circles and slashes placed over the processors in those groups (a PE group is simply the corresponding MC group with no window for the MC). The interface is rather flexible and can easily be tailored to specific debugging tasks. In this example, some windows are partially covered by portions of other windows. During the programming and debugging session, each window can be moved, resized, and iconified (and restored) depending on the needs of the user. Virtually any number of windows is supported.

CAPS also can be used to drive graphical display tools such as Parallel Animated Debugging and Simulation Environment (PARADISE) [15]. PARADISE acts as an environment for generating views to visualize the interaction between parallel applications and parallel systems. The user introduces definitions of various aspects of a model of the system and then can visually simulate the interaction of the elements of the model with simulation functions provided by the environment. In this way, run-time traces can be used to drive graphic representations of the execution to animate the model being observed. This greatly aids in the analysis and comprehension of the trace information for debugging and performance enhancements.

Design Alternatives

In general, regardless of the implementation details, remote access and execution environments consist of a channel between the parallel machine and a data concentrator. The goal is to collect the information from each node and transfer it to a central location that is, in this case, the screen of the user's workstation. The data concentrator combines the information coming from the parallel machine and transmits it on a high-bandwidth channel to the remote site. Similarly, data input from the remote site is returned to the appropriate processor of the parallel machine.
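The two directions of flow just described, concentration of node output into one stream and distribution of remote input to the addressed node, can be modeled in a few lines. This is a toy sketch only: the class `Concentrator`, its methods, and the `(node, data)` tagging scheme are assumptions made for illustration and do not describe the CAPS hardware or its protocol.

```python
from collections import defaultdict, deque


class Concentrator:
    """Toy model of a data concentrator/distributor (hypothetical API)."""

    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.upstream = deque()  # single combined stream toward the remote site
        self.inboxes = {node: deque() for node in range(n_nodes)}

    def _check(self, node):
        if not 0 <= node < self.n_nodes:
            raise ValueError(f"unknown node {node}")

    def from_node(self, node, data):
        """Concentrate: tag output from one node and append it to the stream."""
        self._check(node)
        self.upstream.append((node, data))

    def from_remote(self, node, data):
        """Distribute: route remote input to the addressed node only."""
        self._check(node)
        self.inboxes[node].append(data)


def demultiplex(stream):
    """At the remote site, split the combined stream into per-node windows."""
    windows = defaultdict(list)
    for node, data in stream:
        windows[node].append(data)
    return dict(windows)


conc = Concentrator(n_nodes=4)
conc.from_node(0, "stage 1 done")
conc.from_node(2, "stage 1 done")
conc.from_node(0, "stage 2 done")
conc.from_remote(2, "print registers")

windows = demultiplex(conc.upstream)
print(windows)
print(list(conc.inboxes[2]))
```

Tagging each item with its source node is what lets a single stream be split back into one window per processor at the workstation.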

Concentrating data streams from many processors and transmitting this combined stream to a remote site can present a serious technical challenge when considering even moderate-sized parallel systems. For example, a 1024-PE system where each PE produces a stream of data at 9,600 bps (approximately one character per millisecond) will saturate a typical 10 Mbit/second LAN. Also, the workstation at the remote site must receive this stream and place individual characters into appropriate windows at an approximate rate of one character/microsecond, a true challenge for state-of-the-art workstations. Even then, the user would be presented with the equivalent of about 400 pages of text each second, clearly a challenge to the user's senses. The technical challenges encountered, however, are not rooted in the design of gigabit/second optical LANs and super-workstations. The challenge is in the creation of architectural support for monitoring with minimal intrusion to the program's execution. As discussed later in this section, the problem of saturating the monitoring subsystem (including the user) can be avoided through informed usage based on characteristics of parallel programs and the process of debugging.

Figure 2. The screen of the Sun Workstation showing nine windows: five opened to a machine partition with one MC and four PEs, one opened to the SCU, one opened to a VAX, one opened to the Sun Workstation, and the CAPS window manager.
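The saturation figures quoted above can be checked with simple arithmetic. The PE count, per-PE rate, LAN rate, and pages-per-second figure come from the text; the 10 bits per character (8 data bits plus start and stop bits) is an assumption, chosen because it reproduces the stated "approximately one character per millisecond".

```python
PES = 1024
BITS_PER_SECOND_PER_PE = 9_600
LAN_BITS_PER_SECOND = 10_000_000
PAGES_PER_SECOND = 400          # figure quoted in the text
BITS_PER_CHAR = 10              # assumption: 8 data bits + start + stop bit

aggregate_bps = PES * BITS_PER_SECOND_PER_PE
lan_utilization = aggregate_bps / LAN_BITS_PER_SECOND

chars_per_second_per_pe = BITS_PER_SECOND_PER_PE / BITS_PER_CHAR
aggregate_chars_per_second = PES * chars_per_second_per_pe

# Page size implied by "about 400 pages of text each second".
implied_chars_per_page = aggregate_chars_per_second / PAGES_PER_SECOND

print(aggregate_bps)               # 9830400 bits/s, about 98% of the LAN
print(lan_utilization)             # 0.98304
print(chars_per_second_per_pe)     # 960.0, roughly one character per millisecond
print(aggregate_chars_per_second)  # 983040.0, roughly one character per microsecond
print(implied_chars_per_page)      # 2457.6
```

The raw payload alone consumes about 98% of the nominal LAN capacity before any protocol overhead is counted, which is why the text treats the link as saturated.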

A continuum of possible monitoring systems exists with respect to their level of intrusion on the executing program. In this analysis of the design alternatives, the relative levels of intrusion are compared with the goal of reaching a design with minimum intrusion, given the existing hardware and time and cost constraints. The level of intrusion is closely related to the amount of hardware support available from the monitoring system. In this work, however, the amount of hardware support was not the only factor affecting the level of intrusion. For CAPS, many points on the continuum were considered as the data path for debugging information, ranging from software-only instrumentation using the PASM control hierarchy [28], to elaborate hardware-intensive implementations. This section touches on some of the features of the alternatives considered and the relative trade-offs of each.

In all implementations considered, monitoring code is inserted into the user's program to send information through the dedicated I/O channels. This is an unavoidable source of intrusion for these implementations. Once the information is sent to the dedicated channel, the path it follows and the means of data concentration dictate any additional amount of intrusion. The channels from each processor to the point of data concentration can be external (to the parallel processing system), internal, or some combination of both. An external channel is physically attached to the processor and is isolated from the hardware of the parallel machine. An internal channel transmits information through data paths already embedded within the architecture of the parallel machine. A hybrid channel may use some internal channels as well as external channels to form the path that routes the information to a remote site. Ideally, the data paths used by the debug/trace statements are isolated from the rest of the parallel machine. This way, no part of the computational hardware is affected by the overhead of the monitoring environment. This form of intrusion can be avoided by intelligent decisions about the placement of the monitor hardware support.

One hardware solution considered (the Annex approach) uses an N-to-1 concentrator component that combines N serial ports connected to PASM's Parallel Computation Unit (PCU) into a serial stream of data received and transmitted over a LAN. This hardware acts as a distributor of data for input from the windows of the workstation to the processors of the PCU and as a concentrator of data from the PCU processors back to the workstation. A single piece of hardware (manufactured by the Encore Computer Corporation under the name Annex-UX terminal server [9]) performs these functions. This design has the lowest degree of intrusion because no part of the PASM control hierarchy is involved. The only intrusion comes from the statements added to the user program to mark events.

One important disadvantage of having the monitoring system comprised of completely disjoint paths from the parallel system stems from the added complexity of scheduling the new resources of the monitoring system, that is, partitioning PASM for multiple users. If the data paths are integrated into the architecture of the parallel system, the operating system can be used to allocate the monitoring channels with the rest of the system's resources. Completely disjoint data paths would require additional effort to control the complete system (monitor and target systems combined). For these reasons, along with initial studies that indicated several alternate designs would perform well and could be put into operation in a matter of weeks for a fraction of the cost, the Annex solution was not chosen.

At the other end of the continuum was an approach that required no additional hardware. This embedded approach uses the parallel I/O capabilities of the control hierarchy of PASM itself. In this solution, the parallel data paths from PE to MC and from MC to SCU shuttle data packets between the PCU and the SCU. From the SCU, the LAN channel is accessible to send the packets to the remote site. This approach had the highest amount of intrusion with parallel computation because the monitoring/debugging information is passed along the same path as program control information. This would reduce the effective bandwidth of the data paths used by the user's program, and would also incur more overhead on the MCs, further interfering with the execution of the user's program.

The implementation chosen was a hybrid of the Annex approach and the embedded approach using the PASM hierarchy. Since the SCU does not take part in the actual execution of the parallel program, no intrusion occurs from the use of its LAN channel. Also, the path chosen between the PEs and the SCU does not include any paths dedicated to parallel control. A new board, the System Monitoring Module (SMM), was added to the backplane of the I/O Processor (IOP). The SMM is capable of combining signals from the PCU and forwarding them to the IOP. Since the IOP is not a part of the PCU, it also can serve as an I/O channel without added intrusion. From the IOP, the data is passed to the SCU without using any paths dedicated to the PCU. Once received by the SCU, the information is sent over a LAN to the monitoring workstation. The operation of the SMM approach chosen will be discussed in detail in the next section.

Another important consideration in the design was its scalability and ultimate limitations. When combining debugging/tracing information from N processors, where N is arbitrarily large, there is some value of N that will ultimately overwhelm the bandwidth of the system. Even if all issues of hardware and software scalability were overcome and the environment could handle an arbitrarily large number of processors, the application programmer may not be able to use all the information provided. It is impractical in most debugging settings to expect a user to assimilate information from 1,024 processors simultaneously.

Each design alternative considered degrades differently as the PCU becomes larger. The embedded approach begins degradation of system performance through intrusion immediately, with both the MCs and SCU acting as potential bottlenecks for the data flow. The amount of intrusion also rises because the MCs take part in the parallel program's execution and are further burdened by the monitoring support they provide. The level of intrusion of the hardware solutions, however, is not affected by scaling the PCU, because the data paths for these approaches do not include any paths used by the PCU. The strength of the embedded approach lies in its requirement of no additional architectural support. As the size of the PCU increases, the Annex solution would only require additional terminal servers with independent connections to the LAN. The only possible bottleneck would be the LAN or the workstation that would receive the data. Although expensive ($5,000 per 32 PEs), this approach is attractive due to its ease of implementation. The scaling limitations of the SMM approach, as well as considerations for scaling to extremely large numbers of processors, will be discussed further in the section on architectural support.

The most efficient way to avoid bottlenecks in the system is through informed usage based on observations of the properties of parallel programs and the process of debugging these programs. Consider some general characteristics of parallel programs and their programmers. A prudent approach to programming parallel systems is to initially write applications for a subset (partition) of the available processors and/or a reduced data set, increasing the number of PEs or amount of data after debugging is complete. By writing programs that can execute on partitions of varying size, the programmer gains two advantages. First, the programmer can debug and test code on a small number of processors (although it is possible that increasing the number of PEs and/or data set size could introduce new errors). Second, the program will run on whatever size partition is available at a later time. The partition size may be determined by the user's data set or machine usage at run time. Also, most parallel programs have only a small number (often one) of unique processes distributed as identical copies on a large number of processors. When debugging such programs, the programmer can choose a representative set of the processors to monitor initially. As testing continues, this set of monitored processors can change depending on errors encountered in the code or on the events the programmer wishes to monitor. Thus, it should be possible to limit the number of processors that must be monitored to a size manageable for both the environment and the programmer. Exceptions obviously exist and include some applications where contention grows nonlinearly with system size.

Architectural Support for CAPS

Each CPU board in the PASM prototype has a serial port intended for terminal I/O with the CPU's resident monitor to allow for program debugging and control. CAPS uses these I/O channels. Each serial port in the prototype is connected to the SMM, which is controlled by the I/O Processor. Together they act as the data concentrator. Although the SMM is custom hardware, it can be used as part of any parallel programming environment in which the monitoring data is derived from a standard serial or parallel port connection on each processor.

A block diagram showing how the architectural support for CAPS is integrated into PASM is shown in Figure 3. The CAPS system on the PASM prototype functions in the following manner. The I/O Processor constantly monitors each serial port of the SMM for incoming data from any of the PASM CPUs. Once a PASM CPU sends a character out of its own serial port, the associated port on the SMM receives and stores the character. The I/O Processor reads the PASM CPU's transmitted character and forms a 2-byte packet. The first byte of the packet contains information indicating which of the PASM CPUs sent the character. The second byte of the packet is the 7-bit ASCII character sent. The I/O Processor sends this packet to the SCU via the I/O Processor-SCU parallel port connection. A process running on the SCU reads the packets from its parallel port connection and sends the packets out onto the LAN channel to the Sun Workstation. Data input through the windows on the Sun are similarly packetized and returned to the appropriate PASM CPU, i.e., Sun to SCU, SCU to IOP, IOP to SMM, SMM to PASM CPU.

Figure 3. Block diagram of the architectural support for CAPS

The data concentrator (SMM-IOP pair) is necessary because no other component of PASM (e.g., SCU or IOP) has the number of ports required to bring all CPU serial connections together. The IOP controls the SMM rather than the SCU because the ports must be serviced in real time to avoid loss of data. The SCU, running Unix System V, is not capable of servicing that number of ports without neglecting its other activities or losing data, while the IOP is not running Unix. The SCU, however, is capable of handling the single stream of packetized data from the IOP. When the SCU is unable to service the IOP-SCU parallel port, the IOP buffers packets in its local memory.
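The 2-byte packets described above can be illustrated with a short sketch. The article does not give the bit layout of the first byte, so the code below simply assumes it holds the sending CPU's index; the names `make_packet`, `parse_packet`, and `demultiplex` are hypothetical, and a Python `deque` stands in for the IOP's local packet buffer.

```python
from collections import deque
from typing import Dict, Tuple


def make_packet(cpu_id: int, char: str) -> bytes:
    """Build a 2-byte packet: [sending CPU index, 7-bit ASCII character].

    Assumption: the first byte is the plain CPU index; the real encoding
    used by the IOP is not specified in the article.
    """
    if not 0 <= cpu_id <= 0xFF:
        raise ValueError("cpu_id must fit in one byte")
    code = ord(char)
    if code > 0x7F:
        raise ValueError("character must be 7-bit ASCII")
    return bytes([cpu_id, code])


def parse_packet(packet: bytes) -> Tuple[int, str]:
    """Split a 2-byte packet back into (cpu_id, character)."""
    if len(packet) != 2:
        raise ValueError("packet must be exactly two bytes")
    return packet[0], chr(packet[1] & 0x7F)


def demultiplex(stream: bytes) -> Dict[int, str]:
    """Route a concatenated packet stream into one text buffer per CPU,
    much as the workstation would route characters to per-CPU windows."""
    windows: Dict[int, str] = {}
    for i in range(0, len(stream) - 1, 2):
        cpu_id, ch = parse_packet(stream[i:i + 2])
        windows[cpu_id] = windows.get(cpu_id, "") + ch
    return windows


# Interleaved characters from two CPUs, buffered until the link is serviced.
buffer = deque()
for cpu_id, ch in [(3, "o"), (7, "h"), (3, "k"), (7, "i")]:
    buffer.append(make_packet(cpu_id, ch))
stream = b"".join(buffer)
print(demultiplex(stream))
```

Tagging every character with its source is what lets a single serialized stream carry traffic for many windows; the price is the doubling of the byte rate on the IOP-SCU link.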

Consider the required data rates. Ten bits of data are transmitted for each ASCII character sent between a CPU and the SMM: seven bits for the ASCII character, one parity bit, one start bit, and one stop bit. If all 30 processors send data at the same speed (e.g., 9,600 bps), a data rate of approximately 28K characters/second results. Because each character received causes the formation of a 2-byte packet, the IOP-SCU parallel port connection must be capable of twice this rate (approximately 56K bytes/second). This rate is far below the throughput provided by the parallel port connections. The LAN channel is also capable of this rate.
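The arithmetic behind these figures can be checked with a minimal sketch. The function names are hypothetical; the constants (10 bits per character, 9,600 bps, 30 CPUs, 2-byte packets) come from the text.

```python
# Serial framing from the text: 7 data bits + parity + start + stop.
BITS_PER_CHAR = 7 + 1 + 1 + 1


def chars_per_second(num_cpus: int, line_rate_bps: int = 9600) -> int:
    """Aggregate character rate when every CPU transmits continuously."""
    return num_cpus * line_rate_bps // BITS_PER_CHAR


def packet_bytes_per_second(num_cpus: int, line_rate_bps: int = 9600,
                            packet_size: int = 2) -> int:
    """Byte rate on the IOP-SCU link: one 2-byte packet per character."""
    return packet_size * chars_per_second(num_cpus, line_rate_bps)


print(chars_per_second(30))         # 28800 characters/second
print(packet_bytes_per_second(30))  # 57600 bytes/second
```

The exact values are 28,800 characters/second and 57,600 bytes/second, which the text rounds to roughly 28K and 56K.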

In addition, the I/O Processor must be capable of these rates as well. When an average single-byte memory access is conservatively estimated at one microsecond, 35 memory accesses are permitted per transferred ASCII character. This speed is easily attainable by efficient assembly language programming of the required task. The worst-case scenario of 30 CPUs sending simultaneously, however, is very unlikely to be sustained. Further, the I/O Processor accesses machine instructions two bytes at a time, while the serial ports and parallel port are restricted to single-byte accesses. Thus, 35 accesses per transferred character is a conservative figure.

While all processors are accessible to the application programmer through CAPS, only the Parallel Computation Unit and MCs are of use. Therefore, for applications programmers, the worst-case scenario mentioned above reduces to 20 sending CPUs. The remaining processors are dedicated to support services for the Parallel Computation Unit and MCs under operating system control. These service processors are linked to the SMM so systems programmers can also take full advantage of CAPS.

Data transmitted from the Sun back through the SMM originates as keyboard input at a maximum rate of several characters per second. This data rate is negligible compared to data rates to the Sun from the CPUs, and was therefore omitted from the analysis.

When considering the scalability of this system, the initial performance bottleneck is the IOP-SMM pair. The IOP's task of multiplexing data coming from multiple I/O channels must be done in real time and is limited by the rate at which it can service the SMM's ports. Timing measurements taken from the IOP-SMM pair show that, on average, 30 memory accesses for instructions and data are made for each character transmitted. Recall that with 30 processors transmitting data, up to 35 memory accesses are allowed for each transmitted character before the IOP begins to miss characters. This indicates that, while the current design will support traffic generated by an N = 16 PASM system, an N = 32 PASM system (with either four or eight MCs) will saturate the IOP-SMM pair if all processors are sending data. Multiple IOP-SMM pairs can be used with larger systems to alleviate this weak point, as the NCube system does by using multiple I/O processing nodes. Again, even if the hardware bottlenecks were overcome, the user could not use all the data at once. Therefore, for the purpose of program development and debugging, it is reasonable to accept a maximum number of processors being simultaneously monitored, even if this number is a fraction of the total.
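The saturation argument can be reproduced numerically. This is only a sketch under the article's stated assumptions (one microsecond per memory access, 9,600 bps lines, 10 bits per character, a measured 30 accesses per character); the function names are hypothetical, and the count of 36 CPUs for an N = 32 system is a lower bound that counts only 32 PEs and four MCs.

```python
import math


def access_budget_per_char(num_cpus: int, line_rate_bps: int = 9600,
                           bits_per_char: int = 10,
                           access_time_s: float = 1e-6) -> float:
    """Memory accesses the IOP can afford per character when all
    num_cpus CPUs transmit continuously."""
    chars_per_s = num_cpus * line_rate_bps / bits_per_char
    return 1.0 / (chars_per_s * access_time_s)


def max_sending_cpus(accesses_per_char: float, line_rate_bps: int = 9600,
                     bits_per_char: int = 10,
                     access_time_s: float = 1e-6) -> int:
    """Largest number of continuously sending CPUs one IOP can service."""
    per_cpu_chars_per_s = line_rate_bps / bits_per_char
    return math.floor(1.0 / (accesses_per_char * access_time_s
                             * per_cpu_chars_per_s))


print(round(access_budget_per_char(30), 1))  # budget with 30 senders
print(max_sending_cpus(30))                  # limit at 30 accesses/char
print(access_budget_per_char(36) < 30)       # N = 32 lower bound saturates
```

With 30 senders the budget is about 34.7 accesses per character, just above the measured 30. At 36 or more senders the budget falls below 30, which matches the conclusion that an N = 32 system would saturate a single IOP-SMM pair.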

The SCU will also become a bottleneck in any expanded system. During the execution of the parallel program, the IOP can be dedicated to supporting monitoring. The SCU is running Unix System V, however, and must timeshare this function with its other duties. Experience has shown that the current SCU can become a bottleneck if multiple users are logged in and performing various tasks. It will become necessary to provide real-time support for the transfer of data from the IOP to the LAN in the form of a dedicated interface to the LAN, bypassing the SCU. Thus, any larger PASM system would require an IOP-SMM pair to function much like the Annex-UX terminal server. Without a LAN interface, the SMM costs approximately $300 compared to $5,000 for a 32-port Annex-UX terminal server. The addition of a LAN interface to the IOP would put its total cost near that of the 32-port Annex-UX terminal server, making the adoption of the Annex approach increasingly attractive. Also, scalability with multiple independent terminal servers is possible. The advantage of the IOP-SMM approach, however, is that the resources of the monitoring system are under the control of the SCU and can be allocated with other system resources. For the remainder of this discussion, the term terminal server will be used to refer to either an IOP-SMM pair or an Annex-UX terminal server.

The user interface is another potential bottleneck. In textual form it is inconceivable for a user to assimilate the data from more than a few processors in real time, and it is a laborious task to go through extensive program traces after execution. The alternative is improved graphical representations of computation and automatic identification of inefficiencies.

Consider the characteristics and/or limitations of an expanded system as described with multiple terminal servers.

1) I/O would still be possible from all processors because there are multiple terminal servers.

2) With I/O intensive tasks, it may be possible to visually monitor only a subset of the active processors due to the bandwidth bottleneck at the LAN or workstation.

3) With tasks running on many processors, it may not be possible to convey useful information on all the processors to the user with a single workstation.

Using multiple terminal servers would permit a large number of processors to be monitored. However, with only a single interface to the user, only a limited subset of the processors of interest could be actively monitored simultaneously.


Debugging that is I/O intensive or where large bursts of information can be generated by the executing program will be a problem for an expanded CAPS. In such cases, a trade-off will exist between the number of processors the programmer wishes to monitor and the detail of the information the programmer wishes to obtain. For most expected applications, information gathered simultaneously on the subset of processors of interest will meet the user's needs.

For extremely large numbers of processors (massively parallel), new forms of parallel I/O must be created. It will no longer be possible to have any point in the flow of information where there is serialization, or for the programmer to gain more than the most general information on the execution in real time. Most information on the execution will be gained afterward from the examination of trace files or from graphical representations of collections of data.

Finally, consider the case where multiple users on separate workstations are using the system. If each programmer is monitoring a number of processors that are manageable by their workstation, there is the possibility of performance degradation at the LAN. The combined number of debugging/tracing messages from each user's set of monitored processors can saturate the path if the number of users is large enough. With multiple terminal servers, the LAN becomes the potential bottleneck for I/O traveling from PEs through the terminal servers to the Sun. To ease this situation, the terminal servers buffer data when the LAN becomes saturated. As mentioned previously, I/O traveling from the Sun(s) back to the PEs originates as keyboard input at a negligible data rate. Also, program downloading uses different paths within PASM and does not utilize the path used for interactive monitoring and debugging. The end effect with multiple users, however, is that each user experiences a higher latency with interactive I/O. The amount of latency caused by a given number of users depends on factors such as I/O required by each user and the software overhead of the terminal servers. This overhead is difficult to quantify.

Next-Generation Support

This section describes plans for the next generation of debugging support for PASM. The ultimate goal for this research is the construction of a completely nonintrusive environment with a high-level user interface. The user interface will rely heavily on the graphics capabilities of high-resolution workstations to aid the programmer in visualizing the parallel computation. The environment also will automatically identify causes of inefficiencies or contention and point them out to the user. For the monitoring of the execution of the program to be nonintrusive, the monitoring system must provide substantial hardware support for the identification of events without any modification to the user's original source code.

Work toward the development of hardware capable of nonintrusive monitoring is already well underway [18, 19]. The event-action paradigm is a model of the underlying principles of the monitoring process. From this model, a layered architectural model has been developed and applied to the design of a nonintrusive monitoring system. This sophisticated future hardware must be capable of identifying events and tracking the state of the user's program through only a physical connection to the CPU buses of the nodes of the parallel system.

The proposed monitoring system includes a Central Monitoring Facility that acts as the user interface (graphic workstation). The Central Monitoring Facility also will be responsible for the coordination and synchronization of the Special-Purpose Hardware Monitoring Units that are replicated at each node of the parallel system. Additionally, if the network cannot be simulated in software or if it exhibits nondeterministic behavior, network monitoring hardware will be included. Finally, the components of the monitoring system will be interconnected with a high-bandwidth interconnection (e.g., Ethernet) and support hardware for synchronization, including a clock line to provide a locally available view of global time.

The Special-Purpose Hardware Monitoring Units contain fast comparison logic that compares bus signal patterns with groups of patterns of interest in an event memory. Comparison with these signals allows the identification of program-level events such as variables changing value. The Special-Purpose Hardware Monitoring Units will also analyze predicates involving these program-level events such as: "Is the value of the variable zero?" Finally, the Special-Purpose Hardware Monitoring Units and the Central Monitoring Facility work together to evaluate predicates spanning several nodes of the system or concerning the system as a whole such as: "Does variable A equal zero in all nodes?" Each Special-Purpose Hardware Monitoring Unit will include a processor and a high-speed controller to facilitate coordination between Special-Purpose Hardware Monitoring Units and the identification of predicates.
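The division of labor described above, with local pattern matching in each unit and global predicate evaluation across units, can be mimicked in software. The sketch below is purely illustrative and is not the proposed hardware design: the class `MonitoringUnit`, the function `global_predicate`, the bus address `0x1000`, and the representation of the event memory as an address-to-name table are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Tuple


class MonitoringUnit:
    """Software stand-in for one Special-Purpose Hardware Monitoring Unit."""

    def __init__(self, watched: Dict[str, int]) -> None:
        # Hypothetical event memory: bus address -> watched variable name.
        self.event_memory = {addr: name for name, addr in watched.items()}
        self.values: Dict[str, int] = {}

    def observe_bus_write(self, address: int,
                          data: int) -> Optional[Tuple[str, int]]:
        """Compare a bus write with the event memory. Returns a
        (variable, value) event if a watched variable changed value."""
        name = self.event_memory.get(address)
        if name is None:
            return None  # pattern not of interest
        changed = self.values.get(name) != data
        self.values[name] = data
        return (name, data) if changed else None

    def local_predicate(self, name: str,
                        predicate: Callable[[int], bool]) -> bool:
        """Evaluate a predicate on this node, e.g. 'is the value zero?'"""
        return name in self.values and predicate(self.values[name])


def global_predicate(units: List[MonitoringUnit], name: str,
                     predicate: Callable[[int], bool]) -> bool:
    """Role of the central facility: combine the local answers."""
    return all(u.local_predicate(name, predicate) for u in units)


units = [MonitoringUnit({"A": 0x1000}) for _ in range(4)]
for u in units:
    u.observe_bus_write(0x1000, 0)
print(global_predicate(units, "A", lambda v: v == 0))  # A is zero everywhere
units[2].observe_bus_write(0x1000, 5)
print(global_predicate(units, "A", lambda v: v == 0))  # node 2 now differs
```

Because each unit only watches bus traffic, the monitored program needs no added statements, which is the property that makes this style of monitoring nonintrusive.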

To gain a global ordering of events in the system, it is necessary to have a locally available idea of global time. The ability to record the time of occurrence of events (time-stamp) is critical to analyzing code execution in parallel machines, but is difficult to do with physical clocks [16]. Without time stamps, rebuilding a picture of the execution from the marked events across multiple processors is difficult because, while the events marked on an individual processor are ordered, the events marked across processors are not. The relative ordering of events across processors must be deduced from synchronization points, network accesses, and SIMD/MIMD mode switches. To accurately time-stamp events, a global system clock that allows simultaneous access must be present at each processor. Such a global system clock is a necessary but potentially expensive component. Each PASM PE has a 32-bit timer that can be clocked by a single clock line distributed through the machine at a resolution as small as 125 nanoseconds. It is possible to clear and start all timers in SIMD mode so their values proceed identically. Therefore, PASM has support for a relatively inexpensive global system clock that permits simultaneous access by all PEs.
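A short sketch shows how such synchronized timers could be used to rebuild a global ordering from per-PE event logs. The log format, the function names, and the assumption that no timer wraps around during the traced interval are illustrative choices, not details from the article; only the 32-bit width and the 125-nanosecond resolution come from the text.

```python
import heapq
from typing import Dict, List, Tuple

TICK_NS = 125    # finest timer resolution given in the text
TIMER_BITS = 32  # width of each PE's timer


def wrap_period_seconds() -> float:
    """Time until a 32-bit timer counting 125 ns ticks wraps around."""
    return (2 ** TIMER_BITS) * TICK_NS * 1e-9


def merge_event_logs(
        logs: Dict[int, List[Tuple[int, str]]]) -> List[Tuple[int, int, str]]:
    """Merge per-PE logs of (tick, event), each already in local order,
    into one list of (tick, pe_id, event) ordered by global time.

    Assumes all timers were cleared and started together and that none
    wrapped during the trace. Ties are broken by PE number.
    """
    streams = [[(tick, pe, ev) for tick, ev in events]
               for pe, events in sorted(logs.items())]
    return list(heapq.merge(*streams))


logs = {
    0: [(10, "send"), (40, "barrier")],
    1: [(25, "recv"), (40, "barrier")],
}
for tick, pe, event in merge_event_logs(logs):
    print(tick * TICK_NS, "ns", "PE", pe, event)
print(wrap_period_seconds())
```

At the finest resolution the timer wraps after roughly 537 seconds, so longer traces would need some additional wrap-handling convention that the article does not describe.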

In addition, work is continuing toward improved user interfaces. Graphical user interfaces show much promise in conveying information on the execution of parallel programs. It may be possible for a programmer to assimilate information about a greater number of nodes with greater detail if the information can be presented in a sophisticated graphical format.

Conclusion

This work shows that a small amount of additional hardware can be used to implement a useful remote access and debugging environment. This environment provides remote access to a parallel machine for multiple users and integrates system features such as downloading code, code development, interactive I/O, and run-time monitoring of programs with sophisticated workstation windowing capabilities.

The implementation of the CAPS environment on the PASM prototype shows that the amount of extra hardware necessary is small and a somewhat low degree of system intrusion can be maintained. The added hardware necessary for the CAPS environment implemented on the PASM prototype costs on the order of $300. Of course, the LAN and workstation are not included in this cost.

It is likely that the cost of this system on other parallel machines would be higher because the LAN interface-IOP pair may not exist. The general idea of using a terminal server, however, to combine multiple streams of low-bandwidth I/O into a single high-bandwidth channel that can be transmitted to a workstation or other display device can be extended to almost any parallel system. By using channels external to the parallel system's data path, the level of intrusion can be minimized and the intrusion can be localized to the processor sending debugging information. While serial channels may not exist in a parallel system, they can be added to the nodes as memory-mapped I/O devices. Therefore, a system similar to CAPS can be used with equal utility on many parallel systems.

The scalability of CAPS was discussed and it was argued that the technical challenges do not lie in the development of super-workstations and high-bandwidth channels, but instead pivot on the design of architectural support that intrudes minimally on the execution of a parallel program. Since a user cannot possibly assimilate the masses of data from thousands of processors or even the amount of data that a LAN's bandwidth can provide, it was shown how discriminate usage of an expanded CAPS system could enable a user to monitor a program.

Pure hardware approaches to the problem of monitoring multiprocessor systems are not the ultimate solution. A modest amount of specialized hardware combined with execution trace analysis techniques to compensate for intrusion appears promising. Work to develop more sophisticated program monitoring and tracing tools is continuing. These tools will provide more informative graphic displays of program execution to aid in the debugging, testing, and study of parallel algorithms and architectures.

Acknowledgments

A preliminary version of portions of this material was presented at the 1989 Workshop on Experiences with Distributed and Multiprocessor Systems. We thank the reviewers of this manuscript for their many insightful and useful suggestions.

References

1. Aral, Z. and Gertner, I. Parasight: A high-level debugger/profiler architecture for shared-memory multiprocessors. In Proceedings of the ACM 1988 International Conference on Supercomputing (July 1988), pp. 131-139.

2. Bates, P. Distributed debugging tools for heterogeneous distributed systems. In Proceedings of the Eighth International Conference on Distributed Computing Systems (June 1988), pp. 308-316.

3. Bates, P.C. and Wileden, J.C. High-level debugging of distributed systems: The behavioral abstraction approach. J. Syst. Softw. 3, 4 (Apr. 1983), 255-264.

4. Berg, T.B., Kim, S.D. and Siegel, H.J. Limitations imposed on mixed-mode performance of optimized phases due to temporal juxtaposition. J. Parallel and Distributed Comput., To be published Oct. 1991.

5. Brantley, W., McAuliffe, K. and Ngo, T. RP3 performance monitoring hardware. In Instrumentation for Future Parallel Computing Systems, M. Simmons, Ed., Addison Wesley, Reading, Mass., 1989, pp. 35-48.

6. Bronson, E.C., Casavant, T.L. and Jamieson, L.H. Experimental application-driven architecture analysis of an SIMD/MIMD parallel processing system. IEEE Trans. Parallel Distributed Syst. 1, 2 (Apr. 1990), 195-205.

7. Cheriton, D.R. and Zwaenepoel, W. The distributed V Kernel and its performance for diskless workstations. In Proceedings of the 9th Symposium on Operating Systems (Oct. 1985), pp. 128-139.

8. Couch, A.L. Seecube user's manual. Tech. Rep. CS Department, Tufts University, 1987.

9. Encore Computer Corporation, ANNEX Hardware Installation Guide, and ANNEX User's Guide, Documents 716-02887 and 716-02886, Encore Computer Corporation, 1986.

10. Fineberg, S.A., Casavant, T.L., Schwederski, T. and Siegel, H.J. Non-deterministic instruction time experiments on the PASM system prototype. In Proceedings of the 1988 International Conference on Parallel Processing (Aug. 1988), pp. 444-451.

11. Fineberg, S.A., Casavant, T.L. and Siegel, H.J. Experimental analysis of a mixed-mode parallel architecture using bitonic sequence sorting. J. Parallel and Distributed Comput. 11 (Mar. 1991), pp. 239-251.

12. Flexible Computer Corporation, The Flex/32 MultiComputer: System Overview, Report No. 030-0000-002, Flexible Computer Corporation, 1985.

13. Hayes, J.P., Mudge, T.N., Stout, Q.F. and Colley, S. Architecture of a hypercube supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing (Aug. 1986), pp. 653-660.

14. Intel Corporation, iPSC/2 and iPSC/860 User's Guide, Document No. 311532-006, Intel Corporation, 1990.

15. Kohl, J.A. and Casavant, T.L. Use of PARADISE: A meta-tool for visualizing parallel systems. In Proceedings of the Fifth International Parallel Processing Symposium (IPPS), To be published, Apr. 1991.

16. Lamport, L. Time, clocks and ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.

17. LeBlanc, T.J. and Mellor-Crummey, J.M. Debugging parallel programs with instant replay. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 471-482.

18. Lumpp, J.E., Jr., Casavant, T.L., Siegel, H.J. and Marinescu, D.C. Specification and identification of events for debugging and performance monitoring of distributed multiprocessor systems. In Proceedings of the 10th International Conference on Distributed Computing Systems (June 1990), pp. 476-483.

19. Marinescu, D.C., Lumpp, J.E., Jr., Casavant, T.L. and Siegel, H.J. Models for monitoring and debugging tools for parallel and distributed software. J. Parallel and Distributed Comput. 9, 2 (June 1990), 171-184.

20. McDaniel, G. Metric: A kernel instrumentation system for distributed environments. In Proceedings of the 6th Symposium on Operating Systems Principles (Nov. 1975), pp. 93-99.

21. McDowell, C.E. and Helmbold, D.P. Debugging concurrent programs. ACM Comput. Surv. 21, 4 (Dec. 1989), 593-622.

22. Miller, B.P. and Choi, J.D. Breakpoints and halting in distributed programs. In Proceedings of the Eighth International Conference on Distributed Computing Systems (June 1988), pp. 316-323.

23. Miller, B.P. and Yang, C.Q. IPS: An interactive and automatic performance measurement tool for parallel and distributed programs. In Proceedings of the Seventh International Conference on Distributed Computing Systems (Sept. 1987), pp. 482-489.

24. Motorola. VMEbug: Debugging Package User's Manual. MVMEbuglD3, Motorola, Inc., 1983.

25. Powell, M.L. and Miller, B.P. Process migration in DEMOS/MP. In Proceedings of the 9th Symposium on Operating Systems (Oct. 1983), pp. 110-119.

26. Scheifler, R.W. and Gettys, J. The X window system. ACM Trans. Graph. 5 (Apr. 1986), 79-109.

27. Schiffenbauer, R.D. Interactive debugging in a distributed computational environment. Tech. Rep. MIT/LCS/TR 264, Massachusetts Institute of Technology, 1981.

28. Schwederski, T., Nation, W.G., Siegel, H.J. and Meyer, D.G. Design and implementation of the PASM prototype control hierarchy. In Proceedings of the Second International Supercomputing Conference (May 1987), pp. 418-427.

29. Siegel, H.J. Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, Second Edition, McGraw-Hill, New York, N.Y., 1990.

30. Siegel, H.J., Armstrong, J.B. and Watson, D.W. Mapping tasks onto the PASM reconfigurable parallel processing system. In Proceedings of the 1990 Parallel Computing Workshop, sponsored by the Computer and Information Science Department at the Ohio State University (Mar. 1990), pp. 13-24.

31. Siegel, H.J., Nation, W.G. and Allemang, M.D. The organization of the PASM reconfigurable parallel processing system. In Proceedings of the 1990 Parallel Computing Workshop, sponsored by the Computer and Information Science Department at the Ohio State University (Mar. 1990), pp. 1-12.

32. Sun Microsystems. Sun System Overview, Sun Microsystems Manual, Sun Microsystems, Inc., 1986.

33. Taylor, G.S. Arithmetic on the ELXSI system 6400. In Proceedings of the 6th Symposium on Computer Arithmetic (May 1983), pp. 110-115.

34. Thomas, B., Gurwitz, R., Goodhue, J., Allen, D. and Beeler, M. Butterfly parallel processor overview. BBN Rep. 6148, BBN Advanced Computers, Inc., 1986.

35. van Tilborg, A., Ed., Workshop on Instrumentation for Distributed Computing Systems, IEEE Computer Society and ACM, 1987.

About the Authors:

JAMES E. LUMPP, JR. is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Iowa. His research interests include parallel processing, computer architecture, operating systems, and execution monitoring.

SAMUEL A. FINEBERG is currently enrolled in the Ph.D. program in the Department of Electrical and Computer Engineering at the University of Iowa. His research interests include parallel computer architecture and performance evaluation.

THOMAS L. CASAVANT is currently an assistant professor on the faculty of the Department of Electrical and Computer Engineering at the University of Iowa. His research interests include parallel processing, computer architecture, programming environments for parallel computers, and performance analysis.

Authors' Present Address: Parallel Processing Laboratory, Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242.

WAYNE G. NATION is with the IBM Corporation, Systems Technology Division, Endicott, New York. His research interests include interconnection networks, parallel processing, and parallel processor prototyping. Author's Present Address: IBM Corporation, Systems Technology Division, Endicott, NY 13760.

EDWARD C. BRONSON is currently with the Department of Biology at Purdue University. His research interests include computer architectures and algorithms, parallel and distributed processing, speech analysis and recognition, signal processing, DNA pattern recognition, and computer analysis of the high-order structure of the cellular genome. Author's Present Address: Hansen Life Sciences Research Building, Purdue University, West Lafayette, IN 47907.

HOWARD JAY SIEGEL is a professor and coordinator of the Parallel Processing Laboratory in the School of Electrical Engineering at Purdue University. His current research focuses on interconnection networks and the use and design of the unique PASM reconfigurable parallel computer system.

PIERRE H. PERO is employed as the systems engineer for the Parallel Processing Laboratory in the Purdue University School of Electrical Engineering, where he contributed to the design and construction of the PASM parallel processing system. His research interests are parallel processing and microcomputers.

Authors' Present Address: Parallel Processing Laboratory, School of Electrical Engineering, Purdue University, West Lafayette, IN 47907.

DAN C. MARINESCU is associate professor in the Computer Science Department at Purdue University. His research areas are parallel processing and scientific computing, Petri nets, distributed systems and networking, performance evaluation, and real-time systems. Author's Present Address: Department of Computer Science, Purdue University, West Lafayette, IN 47907.

THOMAS SCHWEDERSKI is with the Institute for Microelectronics Stuttgart, Germany, where he heads the Custom Processor and Test Department. His research interests are VLSI design, parallel processing, and interconnection networks for parallel computers and ATM. Author's Present Address: Institute for Microelectronics Stuttgart, Allmandring 30a, D-7000 Stuttgart 80, Germany.

CR Categories and Subject Descriptors: C.4 [Performance of Systems]: design studies; C.5.m [Computer System Implementation]: Miscellaneous; D.2.5 [Software Engineering]: Testing and Debugging; D.2.6 [Software Engineering]: Programming Environments--interactive.

Additional Key Words and Phrases: Design, experimentation, instrumentation.

This work is supported by the National Science Foundation under grant numbers CCR-8702115, CCR-8704826, CCR-8809600, CCR-8846388, and CDA-9015696; by the NSF Software Engineering Research Center (SERC); by the SDI under ARO contract DDAAL03-86K-0106; by the Naval Ocean Systems Center under the High Performance Computing Block, ONT; and by the Office of Naval Research under grant number N00014-90-J-1937.

Sun Workstation is a registered trademark of Sun Microsystems Inc.

Annex is a registered trademark of Encore Computer Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© ACM 0002-0782/91/1100-104 $1.50

COMMUNICATIONS OF THE ACM/November 1991/Vol.34, No. 11 117