

Future Generation Computer Systems 8 (1992) 337-347 337 North-Holland

A configuration approach to parallel programming

Jeff Magee and Naranker Dulay

Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, UK

Abstract

Magee, J. and N. Dulay, A configuration approach to parallel programming, Future Generation Computer Systems 8 (1992) 337-347.

This paper advocates a configuration approach to parallel programming for distributed memory multicomputers, in particular, arrays of transputers. The configuration approach prescribes the rigorous separation of the logical structure of a program from its component parts. In the context of parallel programs, components are processes which communicate by exchanging messages. The configuration defines the instances of these processes which exist in the program and the paths by which they are interconnected.

The approach is demonstrated by a toolset (Tonic) which embodies the configuration paradigm. A separate configuration language is used to describe both the logical structure of the parallel program and the physical structure of the target multicomputer. Different logical to physical mappings can be obtained by applying different physical configurations to the same logical configuration. The toolset has been developed from the Conic system for distributed programming. The use of the toolset is illustrated through its application to the development of a parallel program to compute Mandelbrot sets.

Keywords. Parallel programming environment; parallel programming language; configuration language; transputers.

1. Introduction

The work described in this paper arose from our interest in applying the principles embodied in Conic [8,10] to a programming environment for multicomputers [1]. The shortcomings we perceived in existing programming environments for multicomputers based on transputer arrays provided additional motivation. The modifications necessary to Conic to enable its efficient use in the transputer environment led to naming the variant Tonic, for obvious reasons.

Correspondence to: Jeff Magee, Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, UK.

Conic is a toolkit for constructing distributed systems. It provides two languages: the first, a declarative configuration language used to describe the structure of a logical node in terms of its constituent process types, process instances and process interconnections; and the second, a programming language used to program individual process types. The programming language is Pascal augmented with message passing primitives. Distributed systems are constructed in Conic by dynamically assigning instances of logical nodes to physical nodes and interconnecting these instances. Conic embodies the configuration approach in rigorously separating the logical structure of a distributed program from the components which implement its computational function. The differences between Tonic and Conic arise from the characteristic differences between parallel and distributed programs. We see these as being:

Objective. Distributed programs can be considered as consisting of a number of logically distinct entities which intercommunicate to achieve some overall goal - typically access to geographically distributed resources. Parallel programs are logically one entity where the constituents co-operate to achieve some computational goal - the overall objective of performing the computation in parallel being speedup.

0376-5075/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved

Failure. Failure of one component of a distributed program generally requires continued operation, albeit in degraded mode, whereas failure of one component of a parallel computation can generally be allowed to cause termination of the overall computation. In distributed environments, the longevity of execution, together with the probability of communication and node failure, means that software development toolkits must provide programming abstractions to deal with such failures. This is not the case in multicomputers, where we can assume reliable communication and a low probability of node failure during the execution of programs which have as their primary objective speedup rather than continuous execution.

Evolution. A large class of critical distributed programs execute perpetually and for economic or safety reasons require the facility to be modified and updated on-line. The Conic toolkit supports this requirement through its ability to dynamically configure running systems. On-line evolution is not a requirement for parallel programs, which run for relatively short periods, completing when a result has been computed.

Heterogeneity. Distributed programs are generally designed such that components of the program may run on computers with different processor types. Programming environments for distributed systems deal with this hardware heterogeneity by providing multiple code generators and message datatype conversion facilities. Distributed memory multicomputers, typified by transputer networks, provide hardware homogeneity, although they are usually hosted by a computer with a different processor type. However, programming environments for multicomputers should optimise for hardware homogeneity.

In summary, Tonic is optimised to support the development of parallel programs for distributed memory multicomputers where the primary objective is speedup. Tonic does not support dynamic configuration and assumes reliable processors and reliable inter-processor communication. It inherits from Conic the configuration approach in providing a separate language to define logical structure, and extends the Conic configuration facilities by also applying this language to describing the physical structure of the target multicomputer. This physical configuration description is used to drive the logical to physical mapping process.

Currently, the most commonly used toolset for developing parallel programs for transputer based multicomputers is the Occam language [6] and the Transputer Development System (TDS) [7]. While these are efficient tools for developing embedded applications, they have drawbacks when used for developing application programs for the current generation of transputer based multicomputers such as the Meiko Computing Surface and the Supernode. The drawbacks are primarily concerned with the flexibility permitted in mapping an arbitrary network of communicating Occam processes to the hardware topology of interconnected transputers. The developer must take into account the limit of four links per transputer and laboriously place logical channels onto physical channels. Explicit multiplexor and demultiplexor processes must be provided where it is necessary to map more than one logical communication channel onto a physical link. Changing the number of processors on which an application runs requires recompilation.

More recent toolsets such as CStools [13] address the problem of making the logical structure independent from the underlying hardware structure through the use of configuration descriptors (termed 'par files'); however, these descriptions do not permit runtime parameterisation of the number of processors or flexible logical to physical mapping unless the user resorts to the underlying library routines. The Helios operating system [5] allows a user to describe the hardware configuration (Resource Map) separately from the logical configuration (CDL). These descriptions use different notations and are limited to compile time parameterisation. Further, the application programmer has little control over the logical to physical mapping. Helios programmers are limited to Unix style I/O for inter-process communication.

In the following section, the Tonic facilities for developing parallel programs are illustrated through the development of a program to compute the Mandelbrot Set. The facilities provided for mapping this program to different hardware topologies are described in Section 3. Section 4 overviews the implementation of Tonic and provides some performance data. Finally, Section 5 evaluates the approach.

2. Program construction in Tonic (logical structure)

The following overviews the programming features offered by Tonic for parallel programming through the example of a program to generate Mandelbrot sets. The program generates a 512 by 512 pixel image where the colour of each pixel is represented by an 8-bit quantity. This quantity is computed as the number of iterations (up to a maximum of 255) of the calculation z = z*z + c before |z| > 2, where z is a complex variable and c a complex constant. If the maximum is reached, c is assumed to be in the Mandelbrot Set; otherwise the number of iterations indicates how 'close' c is to the set. The simplistic approach to parallelising this is to divide the image into the same number of chunks as there are processors and hand each chunk to a processor for computation. Since some image areas, far outside the set, require much less computation than others, this approach leads to poor load balancing and thus poor performance. A more sophisticated approach employs a work allocator or supervisor process to hand out smaller chunks to worker or slave processes [12]. A slave process computes a chunk and hands it back to the supervisor for display and then gets another chunk to compute until none are left. In the following, chunks are the size of one horizontal line of pixels. The logical structure of the program is shown in Fig. 1.
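The per-pixel calculation described above can be sketched in Python. This is an illustrative translation, not the paper's Pascal code; the 255-iteration cap and escape radius 2 follow the text, and the function name is ours.

```python
def mand_count(cx, cy, max_iter=255):
    """Iterate z = z*z + c from z = c; return the iteration count
    (capped at max_iter) at which |z| first exceeds 2."""
    zx, zy = cx, cy
    for i in range(1, max_iter + 1):
        xx, yy, xy = zx * zx, zy * zy, zx * zy
        if xx + yy > 4.0:              # |z|^2 > 4  <=>  |z| > 2
            return i
        zx, zy = xx - yy + cx, xy + xy + cy
    return max_iter                    # treated as inside the set

# one 9-sample 'line' across the x-range of mandgen's default region
row = [mand_count(-2.0 + j * 0.5, 0.0) for j in range(9)]
```

Points far from the set escape in one or two iterations while points inside consume the full 255, which is exactly the load imbalance the supervisor/slave scheme addresses.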

Fig. 1. Logical structure of the Mandelbrot generator program.

The types of message exchanged between components of the program, together with program-wide constants, are defined in a definitions unit (cf. Modula-2 modules) as shown below.

1 define mandbrot: Xmax, Ymax, mandmsg, mandp;
2 const Xmax=512; Ymax=512;
3 type mandp=^mandmsg;
4      mandmsg=record
5        lineno: integer;
6        linebuf: packed array [1..Xmax] of char;
7      end;
8 end.

The program unit shown below is the definition of the slave process type - in Tonic, processes are task modules. Tasks¹ communicate with the outside world by sending messages to exitports and receiving messages from entryports. A task has no direct knowledge of which other tasks it will be connected to. This configuration independence greatly facilitates reuse. In this case, the slave task sends a computed line of pixel colour values (type mandmsg) to its exitport result (line 4). The communication primitive used is a send-wait (line 23), which sends a request message to the exitport and then suspends the task waiting for a reply. The first message from a slave to the supervisor has a zero line number, indicating that the message does not include a computed line - it is merely a request for the first line to compute. Subsequent messages overload a request for a new line with the results of computing the last line.

1 task module slave (x0,y0,d0:real);
2 use
3   mandbrot: Xmax,Ymax,mandmsg;
4 exitport
5   result: mandmsg reply integer;
6 var

¹ The terms task and process are used interchangeably throughout the paper.


7   M: mandmsg; x1,y1,delta:real; i:integer;
8
9 function mandcalc (cx,cy:real):integer;
10 var i:integer; zx,zy,xx,xy,yy,t:real;
11 begin
12   i:=0; zx:=cx; zy:=cy;
13   repeat
14     xy:=zx*zy; xx:=zx*zx; yy:=zy*zy; zy:=xy+xy+cy;
15     zx:=xx-yy+cx; t:=xx+yy; i:=i+1;
16   until (t>4.0) or (i=256);
17   mandcalc:=i;
18 end;
19
20 begin
21   M.lineno:=0; delta:=d0/Xmax;
22   loop
23     send M to result wait M.lineno;
24     x1:=x0;
25     y1:=y0-(M.lineno-1)*delta;
26     for i:=1 to Xmax do begin
27       M.linebuf[i]:=chr(mandcalc(x1,y1)); x1:=x1+delta;
28     end;
29   end;
30 end.

The supervisor component shown in Fig. 1 is implemented by two tasks as shown in Fig. 2.

Fig. 2. Supervisor composite component.

The supervisor composite component is described in the Tonic configuration language by the group module below. Note that the interface to a group is defined in an identical way to task module interfaces. Thus tasks can be replaced by groups and vice-versa at any point during program development without affecting the rest of the program. The configuration description consists of four parts:
(i) the use clause (line 2) specifies the message and component types used to construct the group,
(ii) the interface to the component in terms of entry and exitports (line 6),
(iii) the instances of component types from which the component is constructed (line 8), and
(iv) the interconnections between instances of these components (line 11).

Since the master is critical to the overall performance of the program, its associated pragma (line 10) indicates that this task instance should be run as a high priority transputer process with its workspace in on-chip memory if possible.

1 group module supervisor;
2 use
3   mandbrot: mandmsg;
4   display;
5   master;
6 entryport
7   result: mandmsg reply integer;
8 create
9   display;
10  master <PRI=0, MEM=ONCHIP>;
11 link
12   display.out to master.out;
13   result to master.result;
14 end.

The program for task master is given below. This task allocates lines to be computed to slave tasks through replies to the entryport result (line 18) and stores computed lines in the array bufs. Since lines take different times to compute, computed lines will not be received by master in line order. The guard on the receive from the entryport out (line 25) ensures that the display task receives lines in the correct order. The select statement is similar to that provided by Ada [3]. One of the set of eligible receive statements is selected for execution together with the computational statements associated with it. A receive is eligible if the associated guard is true (or there is no guard) and a message is queued to the entryport on which the receive is being performed. It should be noted that the master task does not have or need information on the number of slave tasks that are connected to it. In the following, this will allow us to simply parameterise the overall program with the number of slave tasks.

1 task module master;
2 use
3   mandbrot:Xmax,Ymax,mandmsg,mandp;
4 entryport
5   result:mandmsg reply integer;
6   out:signaltype reply mandp;
7 var
8   written,allocated:integer; current:mandp;
9   bufs:array[0..Ymax] of mandp;
10 begin
11   for written:=0 to Ymax do bufs[written]:=nil;
12   written:=0; allocated:=0; new(current);
13   loop
14     select
15       receive current^ from result
16         => allocated:=allocated+1;
17            if allocated<=Ymax then
18              reply allocated to result;
19            with current^ do
20              if lineno<>0 then begin
21                bufs[lineno-1]:=current; new(current);
22              end;
23     or
24       when bufs[written]<>nil
25       receive signal from out reply bufs[written]
26         => written:=written+1;
27     end;
28   end;
29 end.
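The ordering discipline that master enforces (buffer results that arrive out of line order, release them to display strictly in order) can be mimicked in a few lines of Python; the function and names are ours, not Tonic's.

```python
def ordered_release(arrivals, total):
    """Buffer lines that arrive out of order and yield them in line
    order, as master's guard 'when bufs[written] <> nil' does."""
    bufs = {}
    written = 1
    for lineno, data in arrivals:
        bufs[lineno] = data
        while written in bufs:      # guard: the next line is ready
            yield bufs.pop(written)
            written += 1

# lines 2, 1, 3 arrive in that order; display still sees 1, 2, 3
out = list(ordered_release([(2, "b"), (1, "a"), (3, "c")], 3))
```

The same decoupling is why display never needs to know how many slaves exist or how fast each one runs.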

The remaining task display, listed below, exists to decouple I/O latencies to the display from the response time of master to requests for lines. Note that this is the only task in the program that terminates. The Tonic termination model simply states that when any one task terminates (correctly or erroneously) the entire program is terminated. In this, Tonic differs considerably from its predecessor, which allowed continued operation in the presence of failures. This decision is consistent with the different characteristics of parallel and distributed programs identified in the introduction.

1 task module display;
2 use
3   mandbrot:Xmax,Ymax,mandmsg,mandp;
4 exitport
5   out:signaltype reply mandp;
6 var
7   current:mandp; i,j:integer; output:text;
8 begin
9   for i:=1 to Ymax do begin
10    send signal to out wait current;
11    for j:=1 to Xmax do
12      write(output,current^.linebuf[j]);
13  end;
14 end.

The final step in developing the Mandelbrot program is to describe the overall configuration structure of slave and supervisor components together with an abstract description of how we wish these to be executed on the multicomputer. At this stage, we merely indicate a mapping of components to an abstract machine which consists of maxprocessor identical processors. We are not concerned with the physical details of how these processors are interconnected. In fact, we assume that they are fully interconnected or, as termed in the following, globally interconnected. The configuration description for the program mandgen is given below.

{default parameter values}
1 group module mandgen (x:real=-2.0; y:real=2.0; d:real=4.0);
2 use
3   execpar;
4   supervisor;
5   slave;
6 create
7   execpar;
8 create
9   supervisor;
10 create forall k:[1..maxprocessor] at (k)
11   slave[k](x,y,d) <MEM=ONCHIP>;
12 link forall k:[1..maxprocessor]
13   slave[k].result to supervisor.result;
14 end.

The replicator forall is used to declare vectors of components (line 10) or links (line 12). The at clause (line 10) specifies the processor at which the component instance is to be located. Any integer expression may follow the keyword at to denote the processor. Components with no at clause are by default allocated to the processor to which their parent group is allocated (in this case 1). More precisely, the rules governing allocation to processor numbers are:
(1) Processors are numbered from 1 to maxprocessor.
(2) Any component instance can be allocated to a processor by at.
(3) The default allocation is to the parent group (i.e. at not used).
(4) The top-level group is conceptually allocated to processor 1.
These rules mean that an at clause can appear at any level of a configuration description. For example, the configuration description for the parallel executive component execpar (below) allocates an instance of the component exec800 to each processor. Execpar provides I/O to the host, error reporting and inter-processor communication. It should be noted that maxprocessor is not a compile-time constant - it is initialised at run-time as described in the next section.

1 group module execpar (buffers:integer=4);
2 use
3   exec800;
4 create forall k:[1..maxprocessor] at (k)
5   exec800[k](buffers);
6 end.
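The four allocation rules above can be restated as a small resolver. This Python fragment is our illustrative paraphrase of the rules, not part of Tonic.

```python
def resolve_processor(at_clause, parent_processor):
    """Rules (2)-(4): an explicit 'at' wins; otherwise a component
    inherits its parent group's processor; the top-level group's
    parent is conceptually processor 1."""
    if at_clause is not None:        # rule (2): explicit at-clause
        return at_clause
    if parent_processor is None:     # rule (4): top-level group
        return 1
    return parent_processor          # rule (3): default to parent

# execpar-style placement: exec800[k] carries 'at (k)', so each
# instance lands on its own processor, numbered per rule (1)
placements = [resolve_processor(k, None) for k in range(1, 5)]
```

An un-annotated child, by contrast, simply runs wherever its parent group was placed.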

This section has demonstrated how parallel programs are constructed in Tonic. The mandgen program includes examples of request-reply communication and of one-to-one and many-to-one communication (many slave exitports to one supervisor entryport). Tonic also includes unidirectional synchronous communication primitives, a mechanism for responding to requests in a different order to that in which they were received, and a forwarding facility. Description of these is beyond the scope of the present paper. The next section describes how the Mandelbrot program can be executed on many different hardware configurations.

Fig. 3. Meiko computing surface.

3. Logical to physical mapping

The previous section described the logical structure of the Mandelbrot program mandgen. This logical structure is annotated with a mapping of components to an abstract machine consisting of maxprocessor globally interconnected identical processors. In this section, we outline how this abstract machine is realised on the actual hardware. In our case the target hardware is a Meiko Computing Surface consisting of a SPARC based host (running Unix) and 32 T800 transputers, each with 4 Megabytes of memory (Fig. 3). Via a utility provided by Meiko (svcds), a user can reserve a variable number of transputers and set up inter-transputer links before downloading an application program. Svcds provides a bidirectional message passing interface from the host to link 0 of one of the reserved transputers.

The Tonic configuration language is used to describe the desired physical topology of interconnected transputers. This physical configuration description is used to drive the logical to physical mapping process. Figure 4 depicts the configuration view of an individual T800 transputer.

The T800 transputer type is represented by a Tonic task module. The task has no code since it is never executed. It serves only to provide an interface specification to configuration descriptions.

  task module t800;
  exitport
    linkout[0..3]:byte;
  entryport
    linkin[0..3]:byte;
  begin {never executed}
  end.

Fig. 4. Configuration view of T800 transputer.

Using this definition, we can now describe physical topologies of transputers. The following group module describes a pipeline where link 1 of each transputer is connected to link 0 of its successor in the pipeline. The pragma (line 7) associates an integer processor identifier with each T800 instance. These processor identifiers are used during the mapping process. A component in the logical configuration will execute at the processor whose identity corresponds to that specified by the component's at clause. If the pragma is omitted, a default numbering scheme is applied. In the interest of conciseness, it is only necessary to define either linkout[m] to linkin[n] or linkout[n] to linkin[m] to specify a hardware connection between two transputers.

1 group module pipeline (length:integer);
2 entryport
3   linkin:byte;
4 use
5   t800;
6 create forall k:[1..length]
7   t800[k] <PID=k>;
8 link forall k:[1..length-1]
9   t800[k].linkout[1] to t800[k+1].linkin[0];
10 link
11   linkin to t800[1].linkin[0];
12 end.
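The pipeline wiring just shown (link 1 of each transputer to link 0 of its successor, the host on link 0 of the first) can be tabulated in the AJ[processor, link] adjacency form that Section 4 describes. This Python construction is our own sketch; the -1 "unused link" convention is an assumption, not taken from the paper.

```python
def pipeline_adjacency(length, maxlink=4):
    """Build AJ[i][j]: the processor wired to link j of processor i
    (processor 0 is the host; -1 marks an unused link)."""
    AJ = [[-1] * maxlink for _ in range(length + 1)]
    AJ[0][0] = 1                # host link to linkin[0] of t800[1]
    AJ[1][0] = 0
    for k in range(1, length):
        AJ[k][1] = k + 1        # linkout[1] of t800[k] ...
        AJ[k + 1][0] = k        # ... to linkin[0] of t800[k+1]
    return AJ

AJ = pipeline_adjacency(4)      # a pipeline of 4 transputers
```

Each transputer uses at most two of its four links here, which is why a pipeline is trivially legal under the four-links-per-transputer constraint noted in the introduction.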

The following description of a ternary tree of transputers illustrates some of the more powerful features of the Tonic configuration language - namely guards and recursion. In this example we have omitted pragmas to explicitly associate identities with processors but relied on the default assignment of identities supplied by the underlying system. This definition of a ternary tree is of limited usefulness since it only generates configurations of 1 processor (depth = 0), 4 processors (depth = 1), 13 processors (depth = 2), etc. In practice, we use a definition of ternarytree which generates balanced ternary trees for any number of processors.

1 group module ternarytree(depth:integer);
2 entryport
3   linkin:byte;
4 use
5   t800;
6 create
7   root:t800;
8 link
9   linkin to root.linkin[0];
10 when depth > 0
11 create forall k:[1..3]
12   child[k]:ternarytree(depth-1);
13 when depth > 0
14 link forall k:[1..3]
15   root.linkout[k] to child[k].linkin;
16 end.
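The quoted configuration sizes (1, 4, 13, ...) follow directly from the recursion above: one root plus three subtrees of depth-1, i.e. the geometric sum (3^(depth+1) - 1)/2. A quick check in Python (our sketch, mirroring the group module's recursion):

```python
def ternarytree_size(depth):
    """Processors in a complete ternary tree: one root plus three
    recursive subtrees, matching the 'when depth > 0' guard."""
    if depth == 0:
        return 1
    return 1 + 3 * ternarytree_size(depth - 1)

sizes = [ternarytree_size(d) for d in range(4)]
```

This is exactly why the definition only yields certain processor counts, motivating the balanced variant mentioned in the text.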

To complete the hardware configuration description, the connection between the host processor and target transputer system must be specified as shown below for both the pipeline and ternary tree topologies. Line 9 specifies the connection between the host, represented by the component gin, and the first transputer in the pipeline (or ternarytree). Gin is the engine which performs most of the work of providing the Globally INterconnected abstract machine required to execute the logical configuration. Its implementation is described in outline in the next section. The name was irresistible.

1 group module pipe (length:integer);
2 use
3   gin;
4   pipeline;
5 create
6   gin;
7   pipeline(length);
8 link
9   gin.linkout to pipeline.linkin;
10 end.

Fig. 5. Mandgen mapped to a pipeline of 4 transputers.

1 group module ttree(depth:integer);
2 use
3   gin;
4   ternarytree;
5 create
6   gin;
7   ternarytree(depth);
8 link
9   gin.linkout to ternarytree.linkin;
10 end.

The above group modules pipe and ttree compile into the host executable files pipe and ttree. The program mandgen described in the previous section compiles into the target executable file mandgen. To execute the Mandelbrot program on a pipeline of four processors, the user types the following command on the host. The logical to physical mapping is depicted in Fig. 5.

pipe 4 mandgen -2.0 2.0 4.0 | pixdisp ²

Similarly, to execute the Mandelbrot program on a ternary tree of depth 2 (13 processors), the user types the command:

ttree 2 mandgen -2.0 2.0 4.0 | pixdisp

Note that ttree and pipe can be applied to any application program; they are not specific to the Mandelbrot example.

4. Implementation and performance

² Pixdisp is a program executing on the Unix host which reads from its standard input and displays the bytes read as coloured pixels in an X window.

In this section, we give an overview of how Tonic programs are executed on the Meiko Computing Surface. The latter part of the section discusses performance.

4.1 Tonic configuration language

Each group module in a configuration description compiles into a procedure to elaborate the structure of that group at run-time. The set of these elaboration procedures, when executed at run-time, generates a directed graph in which the nodes are task instances and the arcs are inter-task links. Group modules are not represented in this graph; it is a flat representation of the hierarchical configuration structure [4]. Consequently, no penalty is paid at run-time for using hierarchically structured configuration descriptions.

4.2 Bootstrapping the transputer network

When a physical configuration description is executed on the host (e.g. ttree), the graph structure generated is passed to Gin. This graph represents the desired configuration of transputers required to execute the logical configuration. Gin performs the following sequence of actions:
(1) The graph is checked to ensure that it represents a legal transputer configuration. That is, all transputer connections are one-to-one, processor identifiers are in a contiguous range, and the graph is fully connected (so that there is a path to boot every processor).
(2) Gin then computes a minimum depth spanning tree for the graph. This identifies which transputer links will be used for bootstrapping. The complete graph is recorded in an adjacency matrix AJ[0..maxprocessor, 0..maxlink] where maxlink = 3 and AJ[i,j] is the identity of the processor to which link j of processor i is connected. Processor 0 represents the host. The matrix is marked with those links which will be booted.
(3) Using the svcds utility, gin grabs the required number of processors and interconnects them to conform to the graph generated by the physical configuration description.
(4) In the next stage, gin bootstraps the first transputer by sending the application program (e.g. mandgen.800) to its link 0. Once the program starts executing, gin sends it four further pieces of information:


(a) its processor identifier (in the range 1..maxprocessor),
(b) the maximum number of processors maxprocessor,
(c) the adjacency matrix AJ,
(d) the command arguments represented as strings (Unix argc & argv). For the example, these would be mandgen.800 2.0 2.0 4.0.

(5) At this stage, Gin has finished the bootstrapping process and becomes a server which services I/O requests from the application program. It runs until the program running on the network either reports an error or terminates.
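The legality check in step (1) is straightforward to sketch, assuming the adjacency matrix convention of step (2): AJ[i][j] is the processor reached via link j of processor i, with -1 marking an unused link, and processor 0 being the host. This is an illustrative reconstruction, not Gin's actual code.

```python
from collections import deque

MAXLINK = 3          # transputers have four links, numbered 0..3
UNUSED = -1

def check_configuration(AJ):
    """Check that AJ represents a legal transputer configuration:
    connections are one-to-one (symmetric), processor identifiers
    form a contiguous range 0..n-1, and every processor is reachable
    from the host (processor 0) so that it can be booted."""
    n = len(AJ)      # contiguous identifiers 0..n-1 by construction
    # One-to-one: if link j of i reaches k, some link of k must reach i.
    for i in range(n):
        for j in range(MAXLINK + 1):
            k = AJ[i][j]
            if k != UNUSED and i not in AJ[k]:
                return False
    # Full connectivity: breadth-first search from the host.
    seen, frontier = {0}, deque([0])
    while frontier:
        i = frontier.popleft()
        for k in AJ[i]:
            if k != UNUSED and k not in seen:
                seen.add(k)
                frontier.append(k)
    return len(seen) == n

# Host (0) connected to processors 1 and 2, which are also connected.
AJ = [[1, 2, UNUSED, UNUSED],
      [0, 2, UNUSED, UNUSED],
      [0, 1, UNUSED, UNUSED]]
print(check_configuration(AJ))    # True
```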

The application program loaded into each transputer continues the bootstrap process. When started, it receives the items (a) to (d) listed in step (4) above. The program then examines the entry in the adjacency matrix corresponding to its processor identifier. If an entry AJ[self, j] is marked to be bootstrapped, the program sends its code to outgoing link j to bootstrap the processor to which link j is connected. It then sends the processor identifier AJ[self, j], maxprocessor, AJ and the command arguments to complete the bootstrap. Note that exactly the same code is loaded into each processor; the only value which a processor receives to distinguish it from others is its processor identifier. After the first level of the boot spanning tree has been bootstrapped, booting continues in parallel until the leaves of the tree have been bootstrapped.
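The parallel boot can be simulated in a few lines: a breadth-first search from the host yields the minimum-depth spanning tree of step (2), and each booted processor then boots its marked children, so whole levels of the tree come up in parallel. Again a sketch under the adjacency matrix convention above, not Gin's actual implementation.

```python
from collections import deque

UNUSED = -1

def mark_boot_tree(AJ):
    """Mark the links of a minimum-depth (breadth-first) spanning tree
    rooted at the host (processor 0).  boot[i][j] is True if processor i
    should send the code down its link j during bootstrap."""
    n = len(AJ)
    boot = [[False] * len(AJ[0]) for _ in range(n)]
    seen, frontier = {0}, deque([0])
    while frontier:
        i = frontier.popleft()
        for j, k in enumerate(AJ[i]):
            if k != UNUSED and k not in seen:
                seen.add(k)
                boot[i][j] = True      # i boots k over link j
                frontier.append(k)
    return boot

def boot_levels(AJ, boot):
    """Group processors by the level at which they are booted;
    each level boots in parallel once its parents are running."""
    levels, current = [], [0]
    while current:
        levels.append(current)
        current = [AJ[i][j] for i in current
                   for j, marked in enumerate(boot[i]) if marked]
    return levels

# Host 0 boots 1 and 2 in parallel; 1 then boots 3.
AJ = [[1, 2, UNUSED, UNUSED],
      [0, 3, UNUSED, UNUSED],
      [0, UNUSED, UNUSED, UNUSED],
      [1, UNUSED, UNUSED, UNUSED]]
print(boot_levels(AJ, mark_boot_tree(AJ)))    # [[0], [1, 2], [3]]
```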

4.3 Initialisation

After completing bootstrapping, each transputer executes the same initialisation code. This code first calculates a minimum distance routing table from the adjacency matrix. The routing table is used by execpar at execution time. Initialisation proceeds by invoking the group elaboration procedures. Each transputer node thus has a copy of the complete logical configuration graph. However, the kernel (which is part of execpar) only instantiates tasks which correspond to its processor identifier, i.e. for which an at clause in the logical configuration specified 'this' processor. In addition to instantiating tasks, the kernel creates data structures to implement both local and remote inter-task communication. These communication data structures contain initialised transputer channel words. The Tonic communication primitives are implemented using the transputer communication instructions in, out, alt, etc.

                                          4 byte request,   100 byte request,
                                          4 byte reply      4 byte reply
  Intra-processor                              17uS              19uS
  Inter-processor                             174uS             241uS
  + time per additional
    intermediate processor                   14.4uS             210uS

Fig. 6. Request-reply times.
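The minimum-distance routing table mentioned above can be derived from the adjacency matrix by a breadth-first search: every destination inherits the outgoing link of the neighbour through which it was first reached. This is an illustrative sketch, not the actual initialisation code.

```python
from collections import deque

UNUSED = -1

def routing_table(AJ, self_id):
    """For processor self_id, compute route[dest] = the outgoing link
    to use on a minimum-distance path to dest, given adjacency matrix
    AJ (AJ[i][j] is the processor on link j of processor i)."""
    n = len(AJ)
    route = [UNUSED] * n
    seen, frontier = {self_id}, deque()
    # Seed the search with our own links.
    for j, k in enumerate(AJ[self_id]):
        if k != UNUSED and k not in seen:
            seen.add(k)
            route[k] = j
            frontier.append(k)
    # Breadth-first expansion: anything reached through k shares
    # k's first hop.
    while frontier:
        k = frontier.popleft()
        for m in AJ[k]:
            if m != UNUSED and m not in seen:
                seen.add(m)
                route[m] = route[k]
                frontier.append(m)
    return route

# Pipeline 0 - 1 - 2: from node 0, both 1 and 2 are reached via link 0.
AJ = [[1, UNUSED, UNUSED, UNUSED],
      [0, 2, UNUSED, UNUSED],
      [1, UNUSED, UNUSED, UNUSED]]
print(routing_table(AJ, 0))    # [-1, 0, 0]
```

Because every node holds the full adjacency matrix, each transputer can build this table locally without remote communication, which is what allows initialisation to proceed in parallel.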

Loading the entire code for the application at each processor has the disadvantage of wasting storage when task types are loaded but not instantiated. However, this scheme has the following major advantages: (1) It permits the bootstrap to proceed in parallel, which considerably reduces startup time for large numbers of processors. (2) Since each node has a complete copy of the logical configuration graph, the setup of inter-task communication associations at initialisation time does not require remote communication; the initialisation of each transputer proceeds in parallel. Again this reduces application startup time. Tonic applications typically take less than a second to start up.

4.4 Performance

Figure 6 shows the times required for a request-reply message exchange. The time is measured from the moment the sending task initiates the exchange by a send-wait to the moment the reply message completes the exchange. The receiving task executes a receive followed by a reply.

The times above are for one-to-one request-reply communication, i.e. one exitport connected to one entryport. Where the communication is n-to-1 (n exitports connected to 1 entryport, as in the Mandelbrot example) the time for an individual intra-processor request-reply is T + 5*n uS (for n > 1), where T is the time for a one-to-one communication. This is because many-to-one communication is implemented using the transputer alt instruction. The receiver task waits on a set of transputer channels representing the set of exitports connected to it. Consequently, the time to receive from an entryport is proportional to the number of exitports connected to it. This represents a considerable performance penalty for large fan-in configurations. For example, with 32 processors, the mandgen program has a 32 to 1 connection to the supervisor's result entryport. Consequently, the performance penalty is 160uS per communication. We are currently re-implementing intra-processor communication using critical regions rather than transputer communication channels to make the receive time independent of the fan-in factor.

Fig. 7. Mandgen program speedup (speedup plotted against number of processors, 3 to 31, for the ttree and pipe mappings, with linear speedup shown for comparison).
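The fan-in cost model quoted above can be written down directly: an intra-processor request-reply on an entryport with n connected exitports costs T + 5n uS for n > 1. Using the 4-byte figure T = 17 uS from Fig. 6, this reproduces the 160 uS penalty quoted for the 32-to-1 connection:

```python
def fanin_request_reply_us(T, n):
    """Intra-processor request-reply time (in microseconds) for an
    entryport with n connected exitports: T for one-to-one, plus a
    5 uS-per-exitport cost arising from the transputer alt instruction."""
    return T if n <= 1 else T + 5 * n

T = 17                 # uS, 4-byte request / 4-byte reply (Fig. 6)
penalty = fanin_request_reply_us(T, 32) - T
print(penalty)         # 160 uS extra for the 32-to-1 connection
```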

Figure 7 shows the speedup for the Mandelbrot program example plotted against the number of processors, using a balanced ternary tree hardware configuration (ttree) and a pipeline hardware configuration (pipe). Speedup(n) is measured as the time for 1 processor (i.e. 1 slave & supervisor) divided by the time for n processors (i.e. n slaves & 1 supervisor). The 1 processor case, with a supervisor and a slave running on the same processor, is only marginally slower than the sequential program since slave-supervisor communication is local. Not surprisingly, the ternary tree mapping outperforms the pipeline mapping. The average processing rate for the 32 processor ternary tree mapping is 26 Mflop/s, or 0.8 Mflop/s per transputer. The overall time with 32 processors to complete the computation with the parameters (2.0, 2.0, 4.0) was 3.3 seconds.
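The quoted figures are mutually consistent, as a quick calculation shows: 26 Mflop/s spread over 32 transputers is about 0.8 Mflop/s each, and at that rate the 3.3-second run implies roughly 86 Mflop of work (the work figure is our inference, not stated in the paper):

```python
total_rate = 26.0      # Mflop/s, 32-processor ternary tree mapping
processors = 32
runtime = 3.3          # seconds, for parameters (2.0, 2.0, 4.0)

per_processor = total_rate / processors   # rate per transputer
work = total_rate * runtime               # implied total work, Mflop

print(f"{per_processor:.1f}")   # 0.8  Mflop/s per transputer
print(f"{work:.1f}")            # 85.8 Mflop
```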

5. Discussion and conclusions

The paper has presented a configuration approach to the construction of parallel programs in which the functional behaviour of individual components is specified by a programming language and the overall parallel program structure is specified by a configuration language. The configuration specification declares the instances of component types and their interconnections. Component instances execute in parallel. This logical configuration is annotated with a mapping to an abstract machine which consists of maxprocessor identical, globally interconnected processors. The physical configuration of the real machine is specified using the same configuration language. This physical description drives the logical to physical mapping process. Both the logical and physical configuration specifications are considerably more flexible than those provided by existing systems [7,5,13]. Running a Tonic program on different physical configurations with different numbers of processors and different inter-processor connections requires no re-compilation. This facilitates both portability and experimentation with different logical to physical mappings. The configuration approach is similar to that of the MUPPET [14] system, which uses a graphical notation to express configurations. However, MUPPET does not clearly separate logical from physical configurations and is limited in the mappings which can be expressed. Tonic also has a graphical notation for expressing configurations and an associated display tool [9]. However, for describing regular structures, we find the power of the textual language more useful.

In concurrent programming terms, Tonic falls into the class of languages which support the process/message passing paradigm [2]. The best known of these languages are Occam [7] and Ada [3]. Unlike these languages, Tonic incorporates a separate language to describe the structure of concurrent programs in terms of task instances and inter-task message paths (links). These languages also differ in the time at which the program's process structure is fixed. Occam defines the process structure statically at compile time,

Tonic at instantiation/initialisation time, and Ada dynamically at run-time. In fairness, it should be noted that Ada semantics do not permit an efficient distributed memory implementation. We are currently experimenting with an approach which would allow changes to the process structure at run-time while retaining a strict separation between programming and configuration [11]. This will permit a limited, but efficient, form of process migration to facilitate dynamic load balancing.

We regard Tonic as a prototype implementation to validate the configuration approach to parallel programming. Its application is restricted by the reliance on one specific programming language: Pascal + message passing. Currently, we are engaged in the development of a configuration language (Darwin) [11] and associated tools which will permit the configuration approach to be applied to parallel programs composed of components written in commonly available languages such as C & Fortran.

Despite the limitations expressed above, Tonic is a practical and efficient tool for developing parallel programs. It hides many of the irrelevant details about the underlying hardware which currently harass parallel programmers. Typically, application programmers select a physical configuration from a library rather than programming their own. The library currently includes pipeline, ring, mesh, torus, binary tree, ternary tree, cube connected cycles and WK-Recursive physical topologies. The toolset is used by both research and undergraduate students.

Acknowledgements

The authors would like to acknowledge discussions with our colleagues in the Parallel and Distributed Systems Group during the formulation of these ideas. We gratefully acknowledge the SERC under grant GR/G31079, and the CEC in the REX Project (2080), for their financial support.

References

[1] W.C. Athas and C.L. Seitz, Multicomputers: message-passing concurrent computers, IEEE Comput. 21 (8) (Aug. 1988) 9-24.

[2] H. Bal, J. Steiner and A. Tanenbaum, Programming languages for distributed computing systems, ACM Comput. Surveys 21 (3) (Sept. 1989) 261-322.

[3] Department of Defense, U.S.A., Reference manual for the Ada programming language, ANSI/MIL-STD- 1815A, DoD, Washington D.C. Jan 1983.

[4] N. Dulay, A configuration language for distributed programming, Ph.D. Thesis, Dept. of Computing, Imperial College, February 1990.

[5] Distributed Software Ltd, The Helios parallel programming tutorial, 670 Aztec West, Bristol, January 1990.

[6] Inmos Ltd. OCCAM 2 reference manual, Prentice Hall, 1988.

[7] Inmos Ltd. Transputer development system, Prentice Hall, 1988.

[8] J. Kramer and J. Magee, Dynamic configuration for distributed systems, IEEE Trans. Software Engrg. SE-11 (4) (Apr. 1985) 424-436.

[9] J. Kramer, J. Magee and K. Ng, Graphical configuration programming, IEEE Comput. 22 (10) (1989) 53-65.

[10] J. Magee, J. Kramer and M. Sloman, Constructing distributed systems in Conic, IEEE Trans. Software Engrg. SE-15 (6) (Jun. 1989) 663-675.

[11] J. Magee, J. Kramer, M. Sloman and N. Dulay, An overview of the REX software architecture, Proc. 2nd IEEE Workshop on Future Trends of Distributed Computing Systems, Cairo, Egypt (Sep. 1990) 396-402.

[12] J.N. Magee and S.C. Cheung, Parallel algorithm design for workstation clusters, Software-Practice and Experience 21 (Mar. 1991) 235-250.

[13] Meiko Ltd, CS tools documentation guide, 650 Aztec West, Bristol, 1989.

[14] H. Muhlenbein, Th. Schneider and S. Streitz, Network programming with MUPPET, J. Parallel Distributed Comput. 5 (1988) 641-653.

Jeff Magee graduated from Queens University, Belfast with a degree in electrical engineering in 1973. After working with the British Post Office on the design and development of System X, he returned to college, where he received the M.Sc. and Ph.D. degrees in computing science from Imperial College, London, in 1978 and 1984, respectively.

He is currently a Senior Lecturer in the Department of Computing at Imperial College. His research interests include parallel algorithms, distributed operating systems and tool support for the design and development of parallel and distributed programs. Dr. Magee is a member of the I.E.E.

Naranker Dulay graduated from Manchester University with a B.Sc. in Computer Science in 1979 and was awarded a Ph.D. in computing from Imperial College, London in 1990. He is currently a Lecturer in the Department of Computing at Imperial College. His research interests lie in the areas of languages, compilers, algorithms, and architectures for distributed and parallel computing. Dr. Dulay is a member of the BCS.