CAPS: A Coding Aid for PASM

James E. Lumpp, Jr., Samuel A. Fineberg, Wayne G. Nation, Thomas L. Casavant, Edward C. Bronson, Howard Jay Siegel, Pierre H. Pero, Thomas Schwederski, and Dan C. Marinescu

Programming parallel machines is very difficult. First, generating an algorithm requires the programmer to assimilate the interactions of multiple threads of control. Second, synchronization and communication among the threads must be addressed to avoid contention and deadlock. Then, once the program is executing on the parallel system and does not function correctly or performs poorly, the debugging¹ of multiple threads is a complicated problem [21]. Additionally, debugging software is an activity that requires systematic attention to detail. Success depends on the experience of the individual involved and the tools employed. The ability to efficiently debug software requires the wisdom to know what questions to ask, the ability to analyze the answers received, and the knowledge to formulate the best next question. To aid in this interactive process, the programmer needs information about the run-time behavior of the program.

The debugging process is actually the classic scientific method: formulate a hypothesis (e.g., possible cause of an error encountered), generate an experiment (e.g., instrument the program or specify actions to be taken by the monitoring tools), collect data (e.g., execute the program with debugging statements inserted or with breakpoints set), analyze results (e.g., review traces of execution or check aspects of process state), test the hypothesis, and, if necessary, formulate a new hypothesis. It may be possible for tools to direct the user toward possible sources of contention in data access or possible logic errors in conditionals, thereby guiding the user toward the cause of the errors.

¹ In this article, debugging refers to the process of modifying a program to execute both correctly and effectively, e.g., the identification of and compensation for contention for a shared resource is considered debugging.

104 November 1991/Vol.34, No.11/COMMUNICATIONS OF THE ACM



It also may be possible to display the information to the user in a form that is easily assimilated, making the debugging process more efficient. No tool set, however, will ever automatically debug programs. It will always be the user's job to formulate possible causes of errors or to decide which of the possible performance problems to examine. A fundamental goal of the tool designer is to make the experimentation and analysis steps of the process as powerful and efficient as possible. If coding aids accomplish nothing else, they should aid the programmer in posing questions, gathering information, and analyzing data. Therefore, the quantity and quality of information provided is important.

For serial programs, debugging information is fairly easy to obtain and can be gained solely with (intrusive) software probes. Since there is a single thread of control, it can usually be delayed arbitrarily while the state of the program is observed. Breakpoints can be set and the contents of program variables checked to obtain information about the instantaneous state of the program. Utilities exist that aid the programmer in setting breakpoints and in examining the state of the program. Tools are also available for profiling serial programs to obtain statistics on the execution time of various sections of the program. These techniques are invaluable in the debugging phase of programming for serial computers, which often consumes the bulk of program development time.

These techniques, however, cannot be extended directly to the parallel case due to the existence of multiple threads of control in parallel programs [22]. Since parallel programs often depend on the synchronization of threads of control (e.g., the order of access to shared resources is important), the change in the flow of the computation needed to gather status information, such as the values of program variables, can greatly affect the execution of the program. This phenomenon is called intrusion by the monitoring system. It becomes necessary to provide some architectural support in the form of hardware instrumentation to aid debugging efforts. A software probe of one thread can supply only local information and cannot affect other threads (e.g., stop them to examine system state). In addition, if the monitoring system affects the relative timing of different threads, the execution time of the program may increase, or (worse yet) deadlock situations could be created or masked, and the debugging of the parallel program may become impossible. An example would be debugging a program in which the ordering of two events in different nodes is important and the monitoring changes the ordering of these two events.
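The reordering effect described above can be illustrated with a small sketch. The two-node timeline, the event names, and the per-probe delay below are all hypothetical values chosen for illustration; they are not measurements from PASM.

```python
# Hypothetical timeline: every time, name, and probe cost here is an assumption.

def event_order(events, probe_cost=None):
    """Return event names ordered by the time at which each event occurs.

    events: list of (node, name, time_us, probes_before_event) tuples.
    probe_cost: dict mapping node -> microseconds added per software probe.
    """
    probe_cost = probe_cost or {}
    timed = []
    for node, name, time_us, probes in events:
        delay = probe_cost.get(node, 0) * probes  # intrusion added on this node
        timed.append((time_us + delay, name))
    return [name for _, name in sorted(timed)]

events = [
    ("node0", "write_shared", 100, 3),  # node 0 executes 3 probes before its write
    ("node1", "read_shared", 120, 0),   # node 1 is not instrumented
]

unmonitored = event_order(events)
monitored = event_order(events, probe_cost={"node0": 10})
print(unmonitored)  # write happens before read
print(monitored)    # probes delay the write past the read
```

With these assumed numbers, 30 microseconds of probe overhead on one node is enough to swap the order of the write and the read, so the monitored run no longer exhibits the behavior being debugged.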

This work describes a system designed to aid in software development for a PArtitionable SIMD/MIMD (PASM) prototype, which is capable of operating in both single instruction stream multiple data stream (SIMD) and multiple instruction stream multiple data stream (MIMD) modes. A primary goal in the design of the Coding Aid for the PASM System (CAPS) was to provide users with the ability to monitor their programs with maximum flexibility and minimum intrusion, given the existing system hardware, while providing information on a wide range of program attributes. Rigorous limits were placed on the cost and complexity of the system, however, to have a working, usable tool in a period of about three months for a cost of under $1,000. CAPS integrates hardware support and software tools to provide a remote execution and program debugging/monitoring environment for the PASM prototype, designed and constructed at Purdue University [30, 31]. CAPS is the current generation of monitoring hardware and software for PASM. This includes specialized hardware added to the prototype and software servers, running on PASM and the user's workstation, to facilitate the transfer and presentation of information to the programmer. CAPS is currently used to assist development of application and system software for PASM, as well as in experimental system evaluation [4, 5, 10, 11] and to drive higher-level visualization tools [15].

An environment of this type is useful because it allows the machine to be accessed from a remote site. Also, on a partitionable machine, multiuser access is permitted, with users working at separate remote sites using different parts of the machine concurrently. Downloading application code and the development of code may be integrated into the same remote environment. The remote machine used for program development may have software tools not available on the target machine. Support for run-time monitoring of parallel programs is also crucial in the debugging phase of parallel software development. Information gained through the actual execution of a program on the parallel machine is invaluable in correcting and optimizing the operation of programs. Finally, the addition of such an environment to an existing machine can be inexpensive, as shown by the implementation presented here.

CAPS consists of a set of dedicated I/O channels and associated hardware and software that facilitate bidirectional information flow between the individual nodes of PASM and a workstation providing the user interface. The information is specified by source-level program statements (code) added to the user's program that transmit messages through dedicated I/O channels. Once the data is sent from the nodes, it is combined into a single stream and sent through a high-bandwidth Local Area Network (LAN) to a workstation where it is used to debug and analyze the execution of programs.

Some degree of hardware instrumentation is necessary to keep intrusion to a reasonable level. A "reasonable" level, however, depends on the program being monitored and the goal of the monitoring. CAPS was designed to keep the cost of the hardware enhancements to the architecture minimal. The techniques used can be applied to a broad class of parallel machines, including the PASM parallel processing system and almost any multiprocessor system. The individual processors of these machines must be provided with the capability to transmit debug/trace information over an I/O channel to a location where the information can be collected and forwarded to a remote site.

Background

Run-time support for monitoring is crucial in debugging multiprocessor software. Static support tools can aid the programmer in coding the program more efficiently or make the task of programming easier, but run-time support provides information on the actual interaction of the program and the parallel machine. This allows the programmer to optimize the program's performance, and facilitates the location of errors in the program. Additionally, interactive I/O with the nodes of the system allows their individual states to be examined. CAPS provides dynamic support and interactive I/O for the PASM parallel processing system.

A number of systems support software development by providing run-time monitoring support. As early as 1975, McDaniel proposed a kernel instrumentation for distributed environments [20]. The work on some distributed systems designed in the early 1980s included some sort of performance monitoring facility, usually a software monitor (e.g., [7] for the V kernel and [25] for DEMOS/MP). Systems for interactive debugging of a distributed computational environment were also designed [27].

Currently, several instrumented systems are in different stages of completion. Among these are CARAT at the University of Massachusetts and the Distributed Computer Testbed at Honeywell. The real-time monitoring systems at The Ohio State University can be based on either an Ethernet or Hyperchannel [35]. IPS at the University of Wisconsin aids in guiding the user to the sources of inefficiencies [23]. The resource monitoring system (REMS) at the National Bureau of Standards and the performance monitoring facilities for RP3 [6] at IBM feature substantial hardware support. The Faust project for Cedar at the University of Illinois includes hardware and operating system support for time-stamping significant events occurring during program execution. Instant Replay at Carnegie-Mellon University allows the replay of a program from trace files [17]. The high-level debugging environment at the University of Massachusetts is a sophisticated environment based on the Event-Based Behavioral Abstraction (EBBA) paradigm [2, 3]. Parasight at Encore is an example of a software monitor for shared-memory parallel processing systems [1]. Also, the SEECUBE package allows the visualization of communication in parallel programs on the NCube, a hypercube system [8].

Several systems have also been designed to provide interactive I/O with the nodes of the parallel machine. The most common method of interactive I/O among the global system bus machines (e.g., Sequent Balance or ELXSI [33]) is by means of a separate I/O unit on the system bus. This I/O unit handles all user I/O between the CPUs and a number of terminals connected to the system. All I/O passes over the system bus between the CPUs and I/O unit, increasing bus traffic and interfering with the computation unit. The FLEX is an example of a distributed memory global system bus machine that can be configured with an I/O unit for each CPU [12]. No device for concentrating data into a single stream exists. The BIO-link of a BBN Butterfly processing node can interface to an external device [34]. Again, no instrumentation is included to combine user I/O from all processing nodes. The NCube machine has an internal channel, permitting a processing node to send/receive information to/from the system's host processor [13], and the Intel iPSC/860 has dedicated I/O nodes for access to mass storage [14]. This information passes through separate I/O processing nodes. Future systems will most likely provide more advanced support for monitoring.

All of these systems provide user-directed I/O with either a single processor or a group of processors through dedicated I/O units. These systems, however, do not support a coherent user environment that combines code development tools, graphics, and interactive I/O with all processors on a sophisticated workstation, as is the case with CAPS.

PASM Overview

PASM is a dynamically reconfigurable architectural design for a thousand-processor machine where: (1) the processors may be partitioned to form cooperating or independent submachines of various sizes, (2) each submachine can operate in SIMD or MIMD mode, and switch between the two with instruction-level granularity and generally negligible overhead (this is referred to as mixed-mode parallel computation [11]), and (3) the processors communicate through a multistage cube network that can be software configured to provide different connection topologies [30, 31]. A 30-processor prototype has been constructed and is in use in the Parallel Processing Laboratory at the Purdue University School of Electrical Engineering. The prototype utilizes Motorola MC68000 processors. Figure 1 shows a block diagram of the basic components of the prototype.

Figure 1. Block diagram of the prototype of the PASM parallel processing system (link types in the legend: Direct Memory Access Link, GPIB, SIMD Instruction Broadcast Bus, Parallel Port Link)

The Parallel Computation Unit (PCU) contains N = 2^n processing elements (PEs) (numbered from 0 to N - 1) connected by an Extra Stage Cube interconnection network (a fault-tolerant variation of the multistage cube network) [30]. Each PE is a sophisticated microprocessor/memory pair that performs the actual SIMD and MIMD operations. The PE memory modules are used by the processors for data storage in SIMD mode and both data and instruction storage in MIMD mode. The Micro Controllers (MCs) are a set of Q = 2^q processors numbered from 0 to Q - 1. The MCs act as the control units for the PEs in SIMD mode, sending instructions via the SIMD instruction broadcast bus, and coordinate the activities of the PEs in MIMD mode through a general-purpose interface bus (GPIB) (see Figure 1). Each MC controls a fixed set of N/Q PEs. PASM is being designed for N = 1024 and Q = 32. The prototype has N = 16 PEs and Q = 4 MCs.
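The fixed assignment of N/Q PEs to each MC can be sketched as follows. The modulo rule used here is an assumption inferred from the single example given in this article (the MC group of MC 0 in the prototype consists of PEs 0, 4, 8, and 12); the function name is hypothetical.

```python
def mc_group(mc, n_pes, n_mcs):
    """Return the PE numbers assumed to be controlled by micro controller `mc`.

    Assumption: PE i belongs to MC (i mod Q), which reproduces the
    prototype's MC 0 group of PEs 0, 4, 8, and 12.
    """
    return [pe for pe in range(n_pes) if pe % n_mcs == mc]

prototype_group = mc_group(0, n_pes=16, n_mcs=4)   # prototype: N = 16, Q = 4
full_design_group = mc_group(0, n_pes=1024, n_mcs=32)  # design target: N = 1024, Q = 32

print(prototype_group)         # [0, 4, 8, 12]
print(len(full_design_group))  # N/Q = 32 PEs per MC
```

Under this assumption every MC controls exactly N/Q PEs, four in the prototype and 32 in the full design.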

The System Control Unit (SCU) is responsible for the overall coordination of the other components of PASM. In the prototype, the SCU connects to an Ethernet-based LAN that includes several dozen minicomputers, ranging in performance from 1 to 12 Mflops, and over a hundred Sun Workstations™ on the Engineering Computer Network (ECN) at Purdue University. The Memory Management System (MMS) is composed of multiple processors that control secondary storage and file transfer to/from the PEs. The Memory Storage System (MSS) provides multiple secondary storage devices for the PEs. The Control Storage (CS) provides secondary storage space for the MCs and the SCU. One of the MMS processors of the prototype, the I/O Processor (IOP), is responsible for interfacing to external I/O devices and distributing data to the MSS. The SCU communicates with the MCs and the I/O Processor through individual parallel port connections. As previously mentioned, there are a total of 30 processors in the PASM prototype: 16 PE CPUs, four MC CPUs, four MMS CPUs, one CPU associated with each of the four MSS disks and the CS, and the SCU CPU.

User's View of CAPS

The process of monitoring a program's execution begins with deciding which aspects of the program's behavior to examine and then instrumenting the application code (either manually or automatically) with monitoring statements. These statements send messages through dedicated I/O channels to the monitoring system. The data sent from the nodes are collected and can be presented to the user in a number of forms, including textual or graphical. The display can potentially provide information on each node or on the system as a whole. The physical interface to the user is through a high-resolution graphics workstation running an interactive terminal screen control program called X-windows [26].

When a user instruments a program, the goal is to gain information about the occurrence of events of interest during the execution of the program. In this article, an event is any change of system state. For example, a thread of control reaching a predefined point in the computation or a change in some memory location (program variable) are both events. Therefore, events of interest are subsets of the events that occur during the execution of the program, with knowledge of their occurrence conveying particularly useful information to the user. The ability to monitor events occurring in the program in this way allows the user to monitor arbitrary attributes of the program.

From these events of interest, the user can obtain information on more global issues relating to program execution. Some events are known, a priori, to be of interest in almost all applications (e.g., the beginning and end of program execution or certain calls made to the operating system). In this case, the operating system could send information about the computation through the dedicated I/O channels or the compiler could insert extra executable statements into the application to indicate event occurrence.

Regardless of how the monitoring statements are inserted, whether by the user or by the system, both approaches are intrusive because the computation is interrupted for the time it takes to mark the event. The user-intrusive approach (monitoring statements added manually) requires the most programmer effort, but is quite flexible. It can be as simple as print statements scattered throughout the code. It also can involve sophisticated logging operations with automatic occurrence time recording added to the user program. This log of events can be stored in memory and reviewed on the workstation once program execution has completed. At the workstation, data records from all PEs can be analyzed by the user either in their raw form or processed by the workstation to present a graphic timing diagram of the program execution or other meaningful displays. With this information the programmer can detect errors, bottlenecks in the code, and network conflicts, and make appropriate code modifications.
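The logging approach described above (events recorded in memory with an automatically captured occurrence time, then combined across PEs after execution) might be sketched as follows. This is an illustrative sketch, not the CAPS macros: the `EventLog` class, the `mark` method, and the shared counter standing in for a clock are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count
from typing import Callable

@dataclass
class EventLog:
    """Per-PE in-memory log; each record is (time, pe, label)."""
    pe: int
    clock: Callable[[], int]  # assumed monotonic time source
    records: list = field(default_factory=list)

    def mark(self, label):
        # The occurrence time is captured automatically when the event is marked.
        self.records.append((self.clock(), self.pe, label))

def merged_timeline(logs):
    """Merge the per-PE logs (each already time-ordered) into one timeline."""
    return list(heapq.merge(*(log.records for log in logs)))

ticks = count()                 # stand-in for a common time base
clock = lambda: next(ticks)

log0, log4 = EventLog(0, clock), EventLog(4, clock)
log0.mark("start")
log4.mark("start")
log4.mark("butterfly done")
log0.mark("butterfly done")

timeline = merged_timeline([log0, log4])
for time, pe, label in timeline:
    print(f"t={time} PE{pe}: {label}")
```

The merged list is the kind of raw record sequence from which a timing diagram could be drawn. A real system would also have to account for clock differences between PEs, which this sketch sidesteps by assuming a single shared counter.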

The system-intrusive approach (monitoring statements added automatically) releases the programmer from the chore of inserting monitoring probes at the cost of some flexibility. The user is limited to studying only the events recognized by the operating system. Events such as system calls, network accesses, and SIMD/MIMD mode switches can be logged and analyzed as described earlier. Currently in CAPS, the debugging and monitoring code is inserted manually by the programmer as macros that are expanded by the compiler.


A major advantage of a system like CAPS is that it allows the programmer to use some sophisticated features of a graphics workstation to process I/O coming from the parallel system. The workstation's windowing capability allows textual debugging information to be displayed at several interactive virtual terminals, each connected to a different processor. The information can also be used to generate graphic displays summarizing collected data.

The graphics workstation in the current CAPS implementation is a Sun Workstation [32] running X-windows. These windows allow interactive I/O between the workstation and any processor on the PASM system. For textual output, each PASM processor can have its own window on the workstation. Each window can be adjusted to any size and windows may overlap if necessary. A window can act as a terminal, allowing access to the processor's resident monitors, which are based on Motorola's MVMEBUG software [24]. Monitor features include printing memory and register contents, setting program breakpoints, and disassembly of segments of memory. The programmer may also send displayed information to disk or review information that has scrolled past the screen using standard X-windows utilities.

In a typical debugging session, the CAPS server on a Sun opens windows to PASM's SCU, the host where the parallel program is being developed, and to the processors of interest on the PASM system. The window to the host machine is opened through a standard Unix remote login procedure. CAPS is invoked on the workstation and opens windows to the desired PASM processors. The SCU window can be opened via a remote login from the Sun and provides access to Unix System V on the SCU. In this configuration, the programmer can execute programs on PASM and use program output to help debug the software. The PEs' resident monitors also can help detect errors. The monitoring information can be used to make changes in the source code on the host, then the program can be recompiled/assembled, reloaded, and reexecuted on PASM, all within the CAPS environment. This procedure can be carried out on any workstation capable of running X-windows attached to the LAN, as well as remotely from any Internet site with the same capabilities.

The sample screen of a Sun Workstation during a typical programming and debugging session is shown in Figure 2. Five windows are opened to a 4-PE machine partition or MC group consisting of MC 0 and PEs 0, 4, 8, and 12. This partition is running a mixed-mode SIMD/MIMD 8-point fast Fourier transform program. The window entitled "PASM Parallel Processing System" is open to the SCU, and displays the operations necessary to download the program code and begin execution of the program on the 4-PE partition. Another window is open to a Vax 11/780 on the ECN LAN and is being used for program development. Additional windows are open for communication to the Sun Workstation. One is being used as the workstation console and another is the CAPS control window. It can be seen from the CAPS control window that the user has requested two MC groups, mcg0 and mcg1 (the windows of mcg1 are iconified). It also shows that another user has reserved a partition containing MC group three and PE group two, as seen by the circles and slashes placed over the processors in those groups (a PE group is simply the corresponding MC group with no window for the MC). The interface is rather flexible and can easily be tailored to specific debugging tasks. In this example, some windows are partially covered by portions of other windows. During the programming and debugging session, each window can be moved, resized, and iconified (and restored) depending on the needs of the user. Virtually any number of windows is supported.

CAPS also can be used to drive graphical display tools such as the Parallel Animated Debugging and Simulation Environment (PARADISE) [15]. PARADISE acts as an environment for generating views to visualize the interaction between parallel applications and parallel systems. The user introduces definitions of various aspects of a model of the system and then can visually simulate the interaction of the elements of the model with simulation functions provided by the environment. In this way, run-time traces can be used to drive graphic representations of the execution to animate the model being observed. This greatly aids in the analysis and comprehension of the trace information for debugging and performance enhancements.

Design Alternatives

In general, regardless of the implementation details, remote access and execution environments consist of a channel between the parallel machine and a data concentrator. The goal is to collect the information from each node and transfer it to a central location that is, in this case, the screen of the user's workstation. The data concentrator combines the information coming from the parallel machine and transmits it on a high-bandwidth channel to the remote site. Similarly, data input from the remote site is returned to the appropriate processor of the parallel machine.
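The role of the data concentrator can be sketched in a few lines. The sketch assumes each chunk of output is tagged with the identifier of the node that produced it, so that the remote site can route it to the right window; the function names and the round-robin interleaving policy are illustrative assumptions, not a description of the CAPS hardware.

```python
from collections import defaultdict
from itertools import zip_longest

def concentrate(node_outputs):
    """Interleave per-node output chunks into one stream of (node_id, chunk) frames.

    Assumption: a simple round-robin over the nodes; real hardware may differ.
    """
    frames = []
    for round_chunks in zip_longest(*node_outputs.values()):
        for node_id, chunk in zip(node_outputs.keys(), round_chunks):
            if chunk is not None:  # this node had nothing left to send
                frames.append((node_id, chunk))
    return frames

def demultiplex(frames):
    """At the remote site, route each frame to the window of its source node."""
    windows = defaultdict(list)
    for node_id, chunk in frames:
        windows[node_id].append(chunk)
    return dict(windows)

outputs = {"pe0": ["stage 1", "stage 2"], "pe4": ["stage 1"]}
stream = concentrate(outputs)
windows = demultiplex(stream)
print(stream)
print(windows)
```

Tagging each frame with its source is what lets a single combined stream be split back into per-processor windows; input travelling in the other direction can be routed the same way, using the tag as the destination.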

Concentrating data streams from many processors and transmitting this combined stream to a remote site can present a serious technical challenge when considering even moderate-sized parallel systems. For example, a 1024-PE system where each PE produces a stream of data at 9,600 bps (approximately one character per millisecond) will saturate a typical 10 Mbit/second LAN. Also, the workstation at the remote site must receive this stream and place individual characters into appropriate windows at an approximate rate of one character/microsecond, a true challenge for state-of-the-art workstations. Even then, the user is presented with the equivalent of about 400 pages of text each second, clearly a challenge to the user's senses. The technical challenges encountered, however, are not rooted in the design of gigabit/second optical LANs and super-workstations. The challenge is in the creation of architectural support for monitoring with minimal intrusion to the program's execution. As discussed later in this section, the problem of saturating the monitoring subsystem (including the user) can be avoided through informed usage based on characteristics of parallel programs and the process of debugging.

Figure 2. The screen of the Sun Workstation showing nine windows: five opened to a machine partition with one MC and four PEs, one opened to the SCU, one opened to a Vax, one opened to the Sun Workstation, and the CAPS window manager
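The figures quoted in the saturation example can be checked with a short calculation. The PE count, per-PE rate, LAN rate, and pages-per-second figure come from the text; the assumption of 10 bits per character (8 data bits plus start and stop bits on a serial line) is ours.

```python
PES = 1024
BPS_PER_PE = 9_600
LAN_BPS = 10_000_000
BITS_PER_CHAR = 10      # assumption: 8 data bits + start bit + stop bit
PAGES_PER_SECOND = 400  # figure quoted in the text

aggregate_bps = PES * BPS_PER_PE                     # total offered load
lan_utilization = aggregate_bps / LAN_BPS            # fraction of the LAN consumed
chars_per_second = aggregate_bps / BITS_PER_CHAR     # characters arriving at the workstation
chars_per_page = chars_per_second / PAGES_PER_SECOND # implied page size

print(f"aggregate load: {aggregate_bps / 1e6:.2f} Mbit/s")
print(f"LAN utilization: {lan_utilization:.1%}")
print(f"characters per second: {chars_per_second:,.0f}")
print(f"implied characters per page: {chars_per_page:,.0f}")
```

Under these assumptions the aggregate load is about 9.83 Mbit/s, roughly 98% of the raw LAN capacity before any protocol overhead, and the workstation sees close to one million characters per second, which matches the one character per microsecond quoted above.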

A continuum of possible monitoring systems exists with respect to their level of intrusion on the executing program. In this analysis of the design alternatives, the relative levels of intrusion are compared with the goal of reaching a design with minimum intrusion, given the existing hardware and time and cost constraints. The level of intrusion is closely related to the amount of hardware support available from the monitoring system. In this work, however, the amount of hardware support was not the only factor affecting the level of intrusion. For CAPS, many points on the continuum were considered as the data path for debugging information, ranging from software-only instrumentation using the PASM control hierarchy [28], to elaborate hardware-intensive implementations. This section touches on some of the features of the alternatives considered and the relative tradeoffs of each.

In all implementations considered, monitoring code is inserted into the user's program to send information through the dedicated I/O channels. This is an unavoidable source of intrusion for these implementations. Once the information is sent to the dedicated channel, the path it follows and the means of data concentration dictate any additional amount of intrusion. The channels from each

CONSOLE <bla ckm aihjel>/n ome/hitchcock2/jel[2] 17

Ed - Dual VAX11/780

PE stage 2 butter f ly opecat ior

move.1 #FU tife÷b su, aO ; load move.w #(b'stt-low)12, (aO) ; wor

-; ~;'ansfer (even, 512 ; the MC fetch unit r

xl~a b end ; move on b stt: move.t (L~phyl PE),d2 ; physi

j u (L÷buttwid) ; Initialize stage hop ; evade MC68000 p jmp OxSO000O ; Jump to the beg nod ; evade MC68000 p

b_end:

. . . . . . . . . . . . . . . pROCESSING ELEMENT 0 (0) . . . . . . . . . . . . . . . FFT input (wrp{4,8) order)

ceal imag teal imag O) 0.5000000 0.000000 4) 0.500000 0.000000

burrer hy initialization stage t: WO butter#y stage 2: WO butterfly 1.000000 0.000000 stage 3: WO butterfly 1.000000 0.000000

cube ~nction iniflalization cube(O)desflna~on: 4 cube(1) desflnaflon: 8

CAPS

meg1 pGgl mmlB mlul

meg3 pegs lop

seahotse,ecn.purdue.edu connected Io bfeckrnalLeng.uiowa,edu using log directoqy I h o m e l h t i c h c o c k 2 , / j a l / x m o n / I o g / i e l

m c O

2 Point Complex Decimation-in-Time Fast Fourier Transform Using 4 PE Edward C. Bronson

a C Langauge wogram

alg_calO: SIMO/MIMD spit_out(): SIMD/MIMO main():

PE stage 1 butterfly: SIMD/MIMD PE cube(t) intereonnection function: SIMD/MIMD PE stage 2 butte0"fly: SIMD/MIMD PE cube(O) Interconnection function: SlMD/MIMD PE stage 3 butterfly:

wrap_up{): ~ PASM Parallel Processing Syatem

PE init ial ization complete, Proceed, Ed? I ] Pasta: I~eset tango

~ magO has been reset, [ ] '4 pasm: C<lmagO FFT_8.4

reset: . . . . . . . . . . . . . . . PROCt mcg0 has been reset,

FFTi send: FFT_g.4.DMP real imJ sending FFT_ti.4.DMP to pugo via/devlpasm/mcO.

sending FFT_8.4.DMP to mcO via Idev/pasmtmcO. 1) 2,8300000 0.0~ sending FFT 8.4.DMP to IuO via/dev/pasm/cs. bu send command done.

stage 1: WO butterly start: PEs O, 4, 8, 12 stage 2: Wti butlerfly star flng pego at 800000 via Idevlpasm/mcO, stage 3: W2 butterfly start command done.

start: MC 0 i starting mc0 at 100000 via /dev/pasmlmc0.

cut> Pasta: 17 cube(O) desflnatio~: 0

. . . . . . . . . . . . . . . PROCESSING ELEMENT 2 (8) . . . . . . . . . . . . . . . FFT input (wrp(4.8) order)

real imag real imag 2) 0,5000000 0.000000 6) 0,500000 0,000000

butterfly initializa flon stage 1: WO butterfly stage 2: W2 butterfly t .OOOOOO -1.0beO00 stage 3: Wl butterRy 1.707107 ~.707107

cube function initialization cube(O)destinaUon: 12 cube(I)desflnatiorl: 0

. . . . . . . . . . . . . . . PROCESSING ELEMENT 3 (12) . . . . . . . . . . . . . . . FFT input (wrp{4,8) order)

real imag real imag 3) -1,8300000 0.000000 7) -1.830000 0.000000

butterfly initiaJization stage 1: WO butterfly stage 2: W2 butterfly 0,000000 -1.00000O stage 3: W3 butterfly -0.707107 -0,707107

cube func~on irdfialization CUbu(O) des~rmbon: 8 CUbe(l) desflnation: 4

11 O November 1991/Vol.34, No.ll/COMMUNICATIONS OF T H E A C M

Page 8: A CODINGhj/journals/38.pdf · is executing on the parallel system and does not function correctly or performs poorly, the debugging 1 of multiple threads is a complicated problem

processor to the point of data con- centrat ion can be external (to the parallel processing system), inter- nal, or some combination of both. An external channel is physically at- tached to the processor and is iso- lated from the hardware of the par- allel machine. An internal channel transmits information through data paths already embedded within the architecture of the parallel ma- chine. A hybrid channel may use some internal channels as well as external channels to form the path that routes the information to a remote site. Ideally, the data paths used by the debug/trace statements are isolated from the rest of the parallel machine. This way, no part of the computat ional hardware is affected by the overhead of the moni tor ing environment. This form of intrusion can be avoided by intelligent decisions about the placement of the moni tor hardware support .

One hardware solution considered (the Annex approach) uses an N-to-1 concentrator component that combines N serial ports connected to PASM's Parallel Computation Unit (PCU) into a serial stream of data received and transmitted over a LAN. This hardware acts as a distributor of data for input from the windows of the workstation to the processors of the PCU and as a concentrator of data from the PCU processors back to the workstation. This single piece of hardware (manufactured by the Encore Computer Corporation under the name Annex-UX terminal server [9]) performs these functions. This design has the lowest degree of intrusion because no part of the PASM control hierarchy is involved. The only intrusion comes from the statements added to the user program to mark events.

One important disadvantage of having the monitoring system comprised of completely disjoint paths from the parallel system stems from the added complexity of scheduling the new resources of the monitoring system, that is, partitioning PASM for multiple users. If the data paths are integrated into the architecture of the parallel system, the operating system can be used to allocate the monitoring channels with the rest of the system's resources. Completely disjoint data paths would require additional effort to control the complete system (monitor and target systems combined). For these reasons, along with initial studies that indicated several alternate designs would perform well and could be put into operation in a matter of weeks for a fraction of the cost, the Annex solution was not chosen.

At the other end of the continuum was an approach that required no additional hardware. This embedded approach uses the parallel I/O capabilities of the control hierarchy of PASM itself. In this solution, the parallel data paths from PE to MC and from MC to SCU shuttle data packets between the PCU and the SCU. From the SCU, the LAN channel is accessible to send the packets to the remote site. This approach had the highest amount of intrusion on the parallel computation because the monitoring/debugging information is passed along the same path as program control information. This would reduce the effective bandwidth of the data paths used by the user's program, and would also incur more overhead on the MCs, further interfering with the execution of the user's program.

The implementation chosen was a hybrid of the Annex approach and the embedded approach using the PASM hierarchy. Since the SCU does not take part in the actual execution of the parallel program, no intrusion occurs from the use of its LAN channel. Also, the path chosen between the PEs and the SCU does not include any paths dedicated to parallel control. A new board, the System Monitoring Module (SMM), was added to the backplane of the I/O Processor. The SMM is capable of combining signals from the PCU and forwarding them to the IOP. Since the IOP is not a part of the PCU, it also can serve as an I/O channel without added intrusion. From the IOP, the data is passed to the SCU without using any paths dedicated to the PCU. Once received by the SCU, the information is sent over a LAN to the monitoring workstation. The operation of the SMM approach chosen will be discussed in detail in the next section.

Another important consideration in the design was its scalability and ultimate limitations. When combining debugging/tracing information from N processors, where N is arbitrarily large, there is some value of N that will ultimately overwhelm the bandwidth of the system. Even if all issues of hardware and software scalability were overcome and the environment could handle an arbitrarily large number of processors, the application programmer may not be able to use all the information provided. It is impractical in most debugging settings to expect a user to assimilate information from 1,024 processors simultaneously.

Each design alternative considered degrades differently as the PCU becomes larger. The embedded approach begins degradation of system performance through intrusion immediately, with both the MCs and SCU acting as potential bottlenecks for the data flow. The amount of intrusion also rises because the MCs take part in the parallel program's execution and are further burdened by the monitoring support they provide. The level of intrusion of the hardware solutions, however, is not affected by scaling the PCU, because the data paths for these approaches do not include any paths used by the PCU. The strength of the embedded approach lies in its requirement of no additional architectural support. As the size of the PCU increases, the Annex solution would only require additional terminal servers with independent connections to the LAN. The only possible bottleneck would be the LAN or the workstation that would receive the data. Although expensive ($5,000 per 32 PEs), this approach is attractive due to its ease of implementation. The scaling limitations of the SMM approach, as well as considerations for scaling to extremely large numbers of processors, will be discussed further in the section on architectural support.

The most efficient way to avoid bottlenecks in the system is through informed usage based on observations of the properties of parallel programs and the process of debugging these programs. Consider some general characteristics of parallel programs and their programmers. A prudent approach to programming parallel systems is to initially write applications for a subset (partition) of the available processors and/or a reduced data set, increasing the number of PEs or amount of data after debugging is complete. By writing programs that can execute on partitions of varying size, the programmer gains two advantages. First, the programmer can debug and test code on a small number of processors (although it is possible that increasing the number of PEs and/or data set size could introduce new errors). Second, the program will run on whatever size partition is available at a later time. The partition size may be determined by the user's data set or machine usage at run time. Also, most parallel programs have only a small number (often one) of unique processes distributed as identical copies on a large number of processors. When debugging such programs, the programmer can choose a representative set of the processors to monitor initially. As testing continues, this set of monitored processors can change depending on errors encountered in the code or on the events the programmer wishes to monitor. Thus, it should be possible to limit the number of processors that must be monitored to a size manageable for both the environment and the programmer. Exceptions obviously exist and include some applications where contention grows nonlinearly with system size.

Architectural Support for CAPS

Each CPU board in the PASM prototype has a serial port intended for terminal I/O with the CPU's resident monitor to allow for program debugging and control. CAPS uses these I/O channels. Each serial port in the prototype is connected to the SMM, which is controlled by the I/O Processor. Together they act as the data concentrator. Although the SMM is custom hardware, it can be used as part of any parallel programming environment in which the monitoring data is derived from a standard serial or parallel port connection on each processor.

Figure 3. Block diagram of the architectural support for CAPS

A block diagram showing how the architectural support for CAPS is integrated into PASM is shown in Figure 3. The CAPS system on the PASM prototype functions in the following manner. The I/O Processor constantly monitors each serial port of the SMM for incoming data from any of the PASM CPUs. Once a PASM CPU sends a character out of its own serial port, the associated port on the SMM receives and stores the character. The I/O Processor reads the PASM CPU's transmitted character and forms a 2-byte packet. The first byte of the packet contains information indicating which of the PASM CPUs sent the character. The second byte of the packet is the 7-bit ASCII character sent. The I/O Processor sends this packet to the SCU via the I/O Processor-SCU parallel port connection. A process running on the SCU reads the packets from its parallel port connection and sends the packets out onto the LAN channel to the Sun Workstation. Data input through the windows on the Sun are similarly packetized and returned to the appropriate PASM CPU, i.e., Sun to SCU, SCU to IOP, IOP to SMM, SMM to PASM CPU.
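The 2-byte packet format described above can be illustrated with a short sketch. The article states only that the first byte identifies the sending CPU and the second carries the 7-bit ASCII character; the exact encoding of the first byte is not given, so the plain CPU index used here, along with the function names `make_packet` and `demultiplex`, is an assumption made purely for illustration.

```python
from typing import Dict


def make_packet(cpu_id: int, char: str) -> bytes:
    """Form a 2-byte packet: [source CPU byte, 7-bit ASCII character].

    Assumption: the first byte is simply the CPU index (0-255).
    """
    if not 0 <= cpu_id <= 0xFF:
        raise ValueError("cpu_id must fit in one byte")
    code = ord(char)
    if code > 0x7F:
        raise ValueError("character must be 7-bit ASCII")
    return bytes([cpu_id, code])


def demultiplex(stream: bytes) -> Dict[int, str]:
    """Split a concatenated packet stream into per-CPU text.

    This mimics what the workstation side must do to route each
    character to the window belonging to its source CPU.
    """
    if len(stream) % 2 != 0:
        raise ValueError("stream must contain whole 2-byte packets")
    windows: Dict[int, str] = {}
    for i in range(0, len(stream), 2):
        cpu_id, code = stream[i], stream[i + 1]
        windows[cpu_id] = windows.get(cpu_id, "") + chr(code & 0x7F)
    return windows
```

Interleaved characters from different CPUs are separated again on the receiving side, which is why each character must carry its own source byte.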

The data concentrator (SMM-IOP pair) is necessary because no other component of PASM (e.g., SCU or IOP) has the number of ports required to bring all CPU serial connections together. The IOP, rather than the SCU, controls the SMM because the ports must be serviced in real time to avoid loss of data. The SCU, running Unix System V, is not capable of servicing that number of ports without neglecting its other activities or losing data, while the IOP is not running Unix. The SCU, however, is capable of handling the single stream of packetized data from the IOP. When the SCU is unable to service the IOP-SCU parallel port, the IOP buffers packets in its local memory.

Consider the required data rates. Ten bits of data are transmitted for each ASCII character sent between a CPU and the SMM: seven bits for the ASCII character, one parity bit, one start bit, and one stop bit. If all 30 processors send data at the same speed (e.g., 9,600 bps), a data rate of approximately 28K characters/second results. Because each character received causes the formation of a 2-byte packet, the IOP-SCU parallel port connection must be capable of twice this rate (approximately 56K bytes/second). This rate is far below the throughput provided by the parallel port connections. The LAN channel is also capable of this rate.

The I/O Processor must be capable of these rates as well. When an average single-byte memory access is conservatively estimated at one microsecond, 35 memory accesses are permitted per transferred ASCII character. This speed is easily attainable by efficient assembly language programming of the required task. The worst-case scenario of 30 CPUs sending simultaneously, however, is very unlikely to be sustained. Further, the I/O Processor accesses machine instructions two bytes at a time, while the serial ports and parallel port are restricted to single-byte accesses. Thus, 35 accesses per transferred character is a conservative figure.
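The arithmetic behind these figures can be checked with a few lines. This is a sketch using only the numbers quoted in the text (10 bits per character, 9,600 bps, 30 CPUs, one microsecond per memory access); the function names are illustrative. The exact results are 28,800 characters/second and 57,600 bytes/second, which the text rounds to 28K and 56K.

```python
BITS_PER_CHAR = 10  # 7 data bits + 1 parity + 1 start + 1 stop


def chars_per_second(n_cpus: int, bps: int = 9600) -> float:
    """Aggregate character rate arriving at the SMM."""
    return n_cpus * bps / BITS_PER_CHAR


def packet_bytes_per_second(n_cpus: int, bps: int = 9600) -> float:
    """Each character becomes a 2-byte packet on the IOP-SCU link."""
    return 2 * chars_per_second(n_cpus, bps)


def access_budget(n_cpus: int, bps: int = 9600,
                  access_time_us: float = 1.0) -> float:
    """Memory accesses available to the IOP per transferred character."""
    microseconds_per_char = 1e6 / chars_per_second(n_cpus, bps)
    return microseconds_per_char / access_time_us


if __name__ == "__main__":
    print(chars_per_second(30))         # aggregate characters/second
    print(packet_bytes_per_second(30))  # bytes/second on the parallel port
    print(access_budget(30))            # accesses per character
```

With 30 senders the budget works out to roughly 34.7 accesses per character, consistent with the figure of 35 used in the text.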

While all processors are accessible to the application programmer through CAPS, only the Parallel Computation Unit and MCs are of use. Therefore, for applications programmers, the worst-case scenario mentioned above reduces to 20 sending CPUs. The remaining processors are dedicated to support services for the Parallel Computation Unit and MCs under operating system control. These service processors are linked to the SMM so systems programmers can also take full advantage of CAPS.

Data transmitted from the Sun back through the SMM originates as keyboard input at a maximum rate of several characters per second. This data rate is negligible compared to data rates to the Sun from the CPUs, and was therefore omitted from the analysis.

When considering the scalability of this system, the initial performance bottleneck is the IOP-SMM pair. The IOP's task of multiplexing data coming from multiple I/O channels must be done in real time and is limited by the rate at which it can service the SMM's ports. Timing measurements taken from the IOP-SMM pair show that, on average, 30 memory accesses for instructions and data are made for each character transmitted. Recall that with 30 processors transmitting data, up to 35 memory accesses are allowed for each transmitted character before the IOP begins to miss characters. This indicates that, while the current design will support traffic generated by an N = 16 PASM system, an N = 32 PASM system (with either four or eight MCs) will saturate the IOP-SMM pair if all processors are sending data. Multiple IOP-SMM pairs can be used with larger systems to alleviate this weak point, as the NCube system does by using multiple I/O processing nodes. Again, even if the hardware bottlenecks were overcome, the user could not use all the data at once. Therefore, for the purpose of program development and debugging, it is reasonable to accept a maximum number of processors being simultaneously monitored, even if this number is a fraction of the total.
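The saturation argument can be made concrete with a small calculation. The sketch below assumes the figures quoted in the text (30 measured memory accesses per character, one microsecond per access, 9,600 bps with 10 bits per character) and estimates how many CPUs can transmit simultaneously before the IOP falls behind. The function names are illustrative, and the total CPU count of an N = 32 system is not stated in this section, so the value 36 used in the usage note is only a hypothetical example.

```python
import math


def max_senders(accesses_per_char: float = 30.0, bps: int = 9600,
                access_time_us: float = 1.0) -> int:
    """Largest number of simultaneously sending CPUs the IOP can service."""
    chars_per_sec_per_cpu = bps / 10          # 10 bits per character
    service_time_us = accesses_per_char * access_time_us
    return math.floor(1e6 / (service_time_us * chars_per_sec_per_cpu))


def can_sustain(n_senders: int, **kwargs) -> bool:
    """True if the IOP keeps up when n_senders CPUs transmit at once."""
    return n_senders <= max_senders(**kwargs)
```

Under these assumptions `max_senders()` evaluates to 34, so the prototype's 30 CPUs fit within the limit, while a hypothetical configuration with 36 simultaneous senders would not.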

The SCU will also become a bottleneck in any expanded system. During the execution of the parallel program, the IOP can be dedicated to supporting monitoring. The SCU is running Unix System V, however, and must timeshare this function with its other duties. Experience has shown that the current SCU can become a bottleneck if multiple users are logged in and performing various tasks. It will become necessary to provide real-time support for the transfer of data from the IOP to the LAN in the form of a dedicated interface to the LAN, bypassing the SCU. Thus, any larger PASM system would require an IOP-SMM pair to function much like the Annex-UX terminal server. Without a LAN interface, the SMM costs approximately $300 compared to $5,000 for a 32-port Annex-UX terminal server. The addition of a LAN interface to the IOP would put its total cost near that of the 32-port Annex-UX terminal server, making the adoption of the Annex approach increasingly attractive. Also, scalability with multiple independent terminal servers is possible. The advantage of the IOP-SMM approach, however, is that the resources of the monitoring system are under the control of the SCU and can be allocated with other system resources. For the remainder of this discussion, the term terminal server will be used to refer to either an IOP-SMM pair or an Annex-UX terminal server.

The user interface is another potential bottleneck. In textual form it is inconceivable for a user to assimilate the data from more than a few processors in real time, and it is a laborious task to go through extensive program traces after execution. The alternative is improved graphical representations of computation and automatic identification of inefficiencies.

Consider the characteristics and/or limitations of an expanded system as described with multiple terminal servers.

1) I/O would still be possible from all processors because there are multiple terminal servers.

2) With I/O intensive tasks, it may be possible to visually monitor only a subset of the active processors due to the bandwidth bottleneck at the LAN or workstation.

3) With tasks running on many processors, it may not be possible to convey useful information on all the processors to the user with a single workstation.

Using multiple terminal servers would permit a large number of processors to be monitored. However, with only a single interface to the user, only a limited subset of the processors of interest could be actively monitored simultaneously.


Debugging that is I/O intensive or where large bursts of information can be generated by the executing program will be a problem for an expanded CAPS. In such cases, a trade-off will exist between the number of processors the programmer wishes to monitor and the detail of the information the programmer wishes to obtain. For most expected applications, information gathered simultaneously on the subset of processors of interest will meet the user's needs.

For extremely large numbers of processors (massively parallel), new forms of parallel I/O must be created. It will no longer be possible to have any point in the flow of information where there is serialization, or for the programmer to gain more than the most general information on the execution in real time. Most information on the execution will be gained afterward from the examination of trace files or from graphical representations of collections of data.

Finally, consider the case where multiple users on separate workstations are using the system. If each programmer is monitoring a number of processors that is manageable by their workstation, there is the possibility of performance degradation at the LAN. The combined number of debugging/tracing messages from each user's set of monitored processors can saturate the path if the number of users is large enough. With multiple terminal servers, the LAN becomes the potential bottleneck for I/O traveling from PEs through the terminal servers to the Sun. To ease this situation, the terminal servers buffer data when the LAN becomes saturated. As mentioned previously, I/O traveling from the Sun(s) back to the PEs originates as keyboard input at a negligible data rate. Also, program downloading uses different paths within PASM and does not utilize the path used for interactive monitoring and debugging. The end effect with multiple users, however, is that each user experiences a higher latency with interactive I/O. The amount of latency caused by a given number of users depends on factors such as I/O required by each user and the software overhead of the terminal servers. This overhead is difficult to quantify.

Next-Generation Support

This section describes plans for the next generation of debugging support for PASM. The ultimate goal for this research is the construction of a completely nonintrusive environment with a high-level user interface. The user interface will rely heavily on the graphics capabilities of high-resolution workstations to aid the programmer in visualizing the parallel computation. The environment also will automatically identify causes of inefficiencies or contention and point them out to the user. For the monitoring of the execution of the program to be nonintrusive, the monitoring system must provide substantial hardware support for the identification of events without any modification to the user's original source code.

Work toward the development of hardware capable of nonintrusive monitoring is already well underway [18, 19]. The event-action paradigm is a model of the underlying principles of the monitoring process. From this model, a layered architectural model has been developed and applied to the design of a nonintrusive monitoring system. This sophisticated future hardware must be capable of identifying events and tracking the state of the user's program through only a physical connection to the CPU buses of the nodes of the parallel system.

The proposed monitoring system includes a Central Monitoring Facility that acts as the user interface (graphic workstation). The Central Monitoring Facility also will be responsible for the coordination and synchronization of the Special-Purpose Hardware Monitoring Units that are replicated at each node of the parallel system. Additionally, if the network cannot be simulated in software or if it exhibits nondeterministic behavior, network monitoring hardware will be included. Finally, the components of the monitoring system will be interconnected with a high-bandwidth interconnection (e.g., Ethernet) and support hardware for synchronization, including a clock line to provide a locally available view of global time.

The Special-Purpose Hardware Monitoring Units contain fast comparison logic that compares bus signal patterns with groups of patterns of interest in an event memory. Comparison with these signals allows the identification of program-level events such as variables changing value. The Special-Purpose Hardware Monitoring Units will also analyze predicates involving these program-level events, such as: "Is the value of the variable zero?" Finally, the Special-Purpose Hardware Monitoring Units and the Central Monitoring Facility work together to evaluate predicates spanning several nodes of the system or concerning the system as a whole, such as: "Does variable A equal zero in all nodes?" Each Special-Purpose Hardware Monitoring Unit will include a processor and a high-speed controller to facilitate coordination between Special-Purpose Hardware Monitoring Units and the identification of predicates.
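The division of labor between local and global predicates can be modeled in software. The sketch below is only a conceptual model, not the proposed hardware: the class `MonitoringUnit` and the function `holds_on_all_nodes` are hypothetical names, and the bus-level event detection is replaced by an explicit `on_write` call. Each unit evaluates a predicate against its own node's state, and a central function combines the per-node answers.

```python
from typing import Callable, Dict, Iterable

Predicate = Callable[[int], bool]


class MonitoringUnit:
    """Hypothetical software stand-in for one per-node monitoring unit."""

    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.variables: Dict[str, int] = {}

    def on_write(self, name: str, value: int) -> None:
        """Record a program-level event: a variable changing value."""
        self.variables[name] = value

    def evaluate(self, name: str, predicate: Predicate) -> bool:
        """Local predicate, e.g. 'Is the value of the variable zero?'"""
        return name in self.variables and predicate(self.variables[name])


def holds_on_all_nodes(units: Iterable[MonitoringUnit], name: str,
                       predicate: Predicate) -> bool:
    """Global predicate, e.g. 'Does variable A equal zero in all nodes?'"""
    return all(unit.evaluate(name, predicate) for unit in units)
```

In this model the central facility only sees one boolean per node, which reflects the idea that raw bus traffic stays local and only predicate results need to cross the interconnection.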

To gain a global ordering of events in the system, it is necessary to have a locally available idea of global time. The ability to record the time of occurrence of events (time-stamp) is critical to analyzing code execution in parallel machines, but is difficult to do with physical clocks [16]. Without time stamps, rebuilding a picture of the execution from the marked events across multiple processors is difficult because, while the events marked on an individual processor are ordered, the events marked across processors are not. The relative ordering of events across processors must be deduced from synchronization points, network accesses, and SIMD/MIMD mode switches. To accurately time-stamp events, a global system clock that allows simultaneous access must be present at each processor. Such a global system clock is a necessary but potentially expensive component. Each PASM PE has a 32-bit timer that can be clocked by a single clock line distributed through the machine at a resolution as small as 125 nanoseconds. It is possible to clear and start all timers in SIMD mode so their values proceed identically. Therefore, PASM has support for a relatively inexpensive global system clock that permits simultaneous access by all PEs.
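Once every PE stamps its events from identically running timers, a global ordering follows from a simple merge of the per-processor logs. The sketch below assumes each log is a list of `(tick, label)` pairs already ordered by tick, as events on a single processor are; the function names and the log layout are illustrative assumptions. It also computes how long a 32-bit timer runs at the 125-nanosecond resolution before wrapping, assuming the counter is not extended in software.

```python
import heapq
from typing import Dict, List, Tuple

TICK_NS = 125     # finest timer resolution quoted in the text
TIMER_BITS = 32   # width of each PE timer


def seconds_until_wrap() -> float:
    """Time before a free-running 32-bit timer at 125 ns per tick wraps."""
    return (2 ** TIMER_BITS) * TICK_NS / 1e9


def merge_event_logs(
    logs: Dict[int, List[Tuple[int, str]]]
) -> List[Tuple[int, int, str]]:
    """Merge per-PE logs into one list of (tick, pe, label) in global order.

    Each per-PE log must already be sorted by tick. Events with equal
    ticks are ordered by PE number, an arbitrary tie-break.
    """
    streams = [
        [(tick, pe, label) for tick, label in events]
        for pe, events in logs.items()
    ]
    return list(heapq.merge(*streams))
```

At the finest resolution the timer wraps after roughly 537 seconds, so a trace tool built this way would need to account for wraparound on longer runs.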

In addition, work is continuing toward improved user interfaces. Graphical user interfaces show much promise in conveying information on the execution of parallel programs. It may be possible for a programmer to assimilate information about a greater number of nodes with greater detail if the information can be presented in a sophisticated graphical format.

Conclusion

This work shows that a small amount of additional hardware can be used to implement a useful remote access and debugging environment. This environment provides remote access to a parallel machine for multiple users and integrates system features such as downloading code, code development, interactive I/O, and run-time monitoring of programs with sophisticated workstation windowing capabilities.

The implementation of the CAPS environment on the PASM prototype shows that the amount of extra hardware necessary is small and a somewhat low degree of system intrusion can be maintained. The added hardware necessary for the CAPS environment implemented on the PASM prototype costs on the order of $300. Of course, the LAN and workstation are not included in this cost.

It is likely that the cost of this system on other parallel machines would be higher because the LAN interface-IOP pair may not exist. The general idea of using a terminal server to combine multiple streams of low-bandwidth I/O into a single high-bandwidth channel that can be transmitted to a workstation or other display device, however, can be extended to almost any parallel system. By using channels external to the parallel system's data path, the level of intrusion can be minimized and the intrusion can be localized to the processor sending debugging information. While serial channels may not exist in a parallel system, they can be added to the nodes as memory-mapped I/O devices. Therefore, a system similar to CAPS can be used with equal utility on many parallel systems.

The scalability of CAPS was discussed and it was argued that the technical challenges do not lie in the development of super-workstations and high-bandwidth channels, but instead pivot on the design of architectural support that intrudes minimally on the execution of a parallel program. Since a user cannot possibly assimilate the masses of data from thousands of processors or even the amount of data that a LAN's bandwidth can provide, it was shown how discriminate usage of an expanded CAPS system could enable a user to monitor a program.

Pure hardware approaches to the problem of monitoring multiprocessor systems are not the ultimate solution. A modest amount of specialized hardware combined with execution trace analysis techniques to compensate for intrusion appears promising. Work to develop more sophisticated program monitoring and tracing tools is continuing. These tools will provide more informative graphic displays of program execution to aid in the debugging, testing, and study of parallel algorithms and architectures.

Acknowledgments

A preliminary version of portions of this material was presented at the 1989 Workshop on Experiences with Distributed and Multiprocessor Systems. We thank the reviewers of this manuscript for their many insightful and useful suggestions.

References 1. Aral, Z. and Gertner, I. Parasight: A

high-level debugger/profiler archi- tecture for share-memory mul- tiprocessors. In Proceedings of the ACM 1988 International Conference on Supercomputing (July 1988), pp. 131- 139.

2. Bates, P. Distributed debugging tools for heterogeneous distributed systems. In Proceedings of the Eighth International Conference on Distributed Computing Systems (June 1988), pp. 308-316.

3. Bates, P.C. and Wileden, J.C. High-level debugging of distributed systems: The behavioral abstraction approach. J. Syst. Softw. 3, 4 (Apr. 1983), 255-264.

4. Berg, T.B., Kim, S.D. and Siegel, H.J. Limitations imposed on mixed-mode performance of optimized phases due to temporal juxtaposition. J. Parallel and Distributed Comput., to be published, Oct. 1991.

5. Brantley, W., McAuliffe, K. and Ngo, T. RP3 performance monitoring hardware. In Instrumentation for Future Parallel Computing Systems, M. Simmons, Ed., Addison-Wesley, Reading, Mass., 1989, pp. 35-48.

6. Bronson, E.C., Casavant, T.L. and Jamieson, L.H. Experimental application-driven architecture analysis of an SIMD/MIMD parallel processing system. IEEE Trans. Parallel Distributed Syst. 1, 2 (Apr. 1990), 195-205.

7. Cheriton, D.R. and Zwaenepoel, W. The distributed V Kernel and its performance for diskless workstations. In Proceedings of the 9th Symposium on Operating Systems (Oct. 1985), pp. 128-139.

8. Couch, A.L. Seecube user's manual. Tech. Rep. CS Department, Tufts University, 1987.

9. Encore Computer Corporation, ANNEX Hardware Installation Guide, and ANNEX User's Guide, Documents 716-02887 and 716-02886, Encore Computer Corporation, 1986.

10. Fineberg, S.A., Casavant, T.L., Schwederski, T. and Siegel, H.J. Non-deterministic instruction time experiments on the PASM system prototype. In Proceedings of the 1988 International Conference on Parallel Processing (Aug. 1988), pp. 444-451.

11. Fineberg, S.A., Casavant, T.L. and Siegel, H.J. Experimental analysis of a mixed-mode parallel architecture using bitonic sequence sorting. J. Parallel and Distributed Comput. 11 (Mar. 1991), pp. 239-251.

12. Flexible Computer Corporation, The Flex/32 MultiComputer: System Overview, Report No. 030-0000-002, Flexible Computer Corporation, 1985.

13. Hayes, J.P., Mudge, T.N., Stout, Q.F. and Colley, S. Architecture of a hypercube supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing (Aug. 1986), pp. 653-660.

14. Intel Corporation, iPSC/2 and iPSC/860 User's Guide, Document No. 311532-006, Intel Corporation, 1990.

15. Kohl, J.A. and Casavant, T.L. Use of PARADISE: A meta-tool for visualizing parallel systems. In Proceedings of the Fifth International Parallel Processing Symposium (IPPS), to be published, Apr. 1991.

16. Lamport, L. Time, clocks and ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558-565.

17. LeBlanc, T.J. and Mellor-Crummey, J.M. Debugging parallel programs with instant replay. IEEE Trans. Comput. C-36, 4 (Apr. 1987), 471-482.

18. Lumpp, J.E., Jr., Casavant, T.L., Siegel, H.J. and Marinescu, D.C. Specification and identification of events for debugging and performance monitoring of distributed multiprocessor systems. In Proceedings of the 10th International Conference on Distributed Computing Systems (June 1990), pp. 476-483.

19. Marinescu, D.C., Lumpp, J.E., Jr., Casavant, T.L. and Siegel, H.J. Models for monitoring and debugging tools for parallel and distributed software. J. Parallel and Distributed Comput. 9, 2 (June 1990), 171-184.

20. McDaniel, G. Metric: A kernel instrumentation system for distributed environments. In Proceedings of the 6th Symposium on Operating Systems Principles (Nov. 1975), pp. 93-99.

21. McDowell, C.E. and Helmbold, D.P. Debugging concurrent programs. ACM Comput. Surv. 21, 4 (Dec. 1989), 593-622.

22. Miller, B.P. and Choi, J.D. Breakpoints and halting in distributed programs. In Proceedings of the Eighth International Conference on Distributed Computing Systems (June 1988), pp. 316-323.

23. Miller, B.P. and Yang, C.Q. IPS: An interactive and automatic performance measurement tool for parallel and distributed programs. In Proceedings of the Seventh International Conference on Distributed Computing Systems (Sept. 1987), pp. 482-489.

24. Motorola. VMEbug: Debugging Package User's Manual. MVMEbuglD3, Motorola, Inc, 1983.

25. Powell, M.L. and Miller, B.P. Process migration in DEMOS/MP. In Proceedings of the 9th Symposium on Operating Systems (Oct. 1983), pp. 110-119.

26. Scheifler, R.W. and Gettys, J. The X window system. ACM Trans. Graph. 5 (Apr. 1986), 79-109.

27. Schiffenbauer, R.D. Interactive debugging in a distributed computational environment. Tech. Rep. MIT/LCS/TR 264, Massachusetts Institute of Technology, 1981.

28. Schwederski, T., Nation, W.G., Siegel, H.J. and Meyer, D.G. Design and implementation of the PASM prototype control hierarchy. In Proceedings of the Second International Supercomputing Conference (May 1987), pp. 418-427.

29. Siegel, H.J. Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies, Second Edition, McGraw-Hill, New York, N.Y., 1990.

30. Siegel, H.J., Armstrong, J.B. and Watson, D.W. Mapping tasks onto the PASM reconfigurable parallel processing system. In Proceedings of the 1990 Parallel Computing Workshop, sponsored by the Computer and Information Science Department at the Ohio State University (Mar. 1990), pp. 13-24.

31. Siegel, H.J., Nation, W.G. and Allemang, M.D. The organization of the PASM reconfigurable parallel processing system. In Proceedings of the 1990 Parallel Computing Workshop, sponsored by the Computer and Information Science Department at the Ohio State University (Mar. 1990), pp. 1-12.

32. Sun Microsystems. Sun System Overview, Sun Microsystems Manual, Sun Microsystems, Inc., 1986.

33. Taylor, G.S. Arithmetic on the ELXSI system 6400. In Proceedings of the 6th Symposium on Computer Arithmetic (May 1983), pp. 110-115.

34. Thomas, B., Gurwitz, R., Goodhue, J., Allen, D. and Beeler, M. Butterfly parallel processor overview. BBN Rep. 6148, BBN Advanced Computers, Inc., 1986.

35. van Tilborg, A., Ed., Workshop on Instrumentation for Distributed Computing Systems, IEEE Computer Society and ACM, 1987.

About the Authors:

JAMES E. LUMPP, JR. is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of Iowa. His research interests include parallel processing, computer architecture, operating systems, and execution monitoring.

SAMUEL A. FINEBERG is currently enrolled in the Ph.D. program in the Department of Electrical and Computer Engineering at the University of Iowa. His research interests include parallel computer architecture and performance evaluation.

THOMAS L. CASAVANT is currently an assistant professor on the faculty of the Department of Electrical and Computer Engineering at the University of Iowa. His research interests include parallel processing, computer architecture, programming environments for parallel computers, and performance analysis.

Authors' Present Address: Parallel Processing Laboratory, Department of Electrical and Computer Engineering, University of Iowa, Iowa City, IA 52242.

WAYNE G. NATION is with the IBM Corporation, Systems Technology Division, Endicott, New York. His research interests include interconnection networks, parallel processing, and parallel processor prototyping. Author's Present Address: IBM Corporation, Systems Technology Division, Endicott, NY 13760.

EDWARD C. BRONSON is currently with the Department of Biology at Purdue University. His research interests include computer architectures and algorithms, parallel and distributed processing, speech analysis and recognition, signal processing, DNA pattern recognition, and computer analysis of the high-order structure of the cellular genome. Author's Present Address: Hansen Life Sciences Research Building, Purdue University, West Lafayette, IN 47907.

HOWARD JAY SIEGEL is a professor and coordinator of the Parallel Processing Laboratory in the School of Electrical Engineering at Purdue University. His current research focuses on interconnection networks and the use and design of the unique PASM reconfigurable parallel computer system.

PIERRE H. PERO is employed as the systems engineer for the Parallel Processing Laboratory in the Purdue University School of Electrical Engineering, where he contributed to the design and construction of the PASM parallel processing system. His research interests are parallel processing and microcomputers.

Authors' Present Address: Parallel Processing Laboratory, School of Electrical Engineering, Purdue University, West Lafayette, IN 47907.

DAN C. MARINESCU is an associate professor in the Computer Science Department at Purdue University. His research areas are parallel processing and scientific computing, Petri nets, distributed systems and networking, performance evaluation, and real-time systems. Author's Present Address: Department of Computer Science, Purdue University, West Lafayette, IN 47907.

THOMAS SCHWEDERSKI is with the Institute for Microelectronics Stuttgart, Germany, where he heads the Custom Processor and Test Department. His research interests are VLSI design, parallel processing, and interconnection networks for parallel computers and ATM. Author's Present Address: Institute for Microelectronics Stuttgart, Allmandring 30a, D-7000 Stuttgart 80, Germany.

CR Categories and Subject Descriptors: C.4 [Performance of Systems]: design studies; C.5.m [Computer System Implementation]: Miscellaneous; D.2.5 [Software Engineering]: Testing and Debugging; D.2.6 [Software Engineering]: Programming Environments - interactive.

Additional Key Words and Phrases: Design, experimentation instrumentation.

This work is supported by the National Science Foundation under grant numbers CCR-8702115, CCR-8704826, CCR-8809600, CCR-8846388, and CDA-9015696; by the NSF Software Engineering Research Center (SERC); by the SDI under ARO contract DDAAL03-86K-0106; by the Naval Ocean Systems Center under the High Performance Computing Block, ONT; and by the Office of Naval Research under grant number N00014-90-J-1937.

Sun Workstation is a registered trademark of Sun Microsystems Inc.

Annex is a registered trademark of Encore Computer Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© ACM 0002-0782/91/1100-104 $1.50
