

Future Generation Computer Systems 8 (1992) 337-347 337 North-Holland

A configuration approach to parallel programming

Jeff Magee and Naranker Dulay

Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, UK

Abstract

Magee, J. and N. Dulay, A configuration approach to parallel programming, Future Generation Computer Systems 8 (1992) 337-347.

This paper advocates a configuration approach to parallel programming for distributed memory multicomputers, in particular, arrays of transputers. The configuration approach prescribes the rigorous separation of the logical structure of a program from its component parts. In the context of parallel programs, components are processes which communicate by exchanging messages. The configuration defines the instances of these processes which exist in the program and the paths by which they are interconnected.

The approach is demonstrated by a toolset (Tonic) which embodies the configuration paradigm. A separate configuration language is used to describe both the logical structure of the parallel program and the physical structure of the target multicomputer. Different logical to physical mappings can be obtained by applying different physical configurations to the same logical configuration. The toolset has been developed from the Conic system for distributed programming. The use of the toolset is illustrated through its application to the development of a parallel program to compute Mandelbrot sets.

Keywords. Parallel programming environment; parallel programming language; configuration language; transputers.

1. Introduction

The work described in this paper arose from our interest in applying the principles embodied in Conic [8,10] to a programming environment for multicomputers [1]. The shortcomings we perceived in existing programming environments for multicomputers based on transputer arrays provided additional motivation. The modifications necessary to Conic to enable its efficient use in the transputer environment led to naming the variant Tonic, for obvious reasons.

Correspondence to: Jeff Magee, Department of Computing, Imperial College of Science, Technology and Medicine, 180 Queen's Gate, London SW7 2BZ, UK.

Conic is a toolkit for constructing distributed systems. It provides two languages: the first, a declarative configuration language used to describe the structure of a logical node in terms of its constituent process types, process instances and process interconnections; and the second, a programming language used to program individual process types. The programming language is Pascal augmented with message passing primitives. Distributed systems are constructed in Conic by dynamically assigning instances of logical nodes to physical nodes and interconnecting these instances. Conic embodies the configuration approach in rigorously separating the logical structure of a distributed program from the components which implement its computational function. The differences between Tonic and Conic arise from the characteristic differences between parallel and distributed programs. We see these as being:

Objective. Distributed programs can be considered as consisting of a number of logically distinct entities which intercommunicate to achieve some overall goal - typically access to geographically distributed resources. Parallel programs are logically one entity where the constituents co-operate to achieve some computational goal - the overall objective of performing the computation in parallel being speedup.

0376-5075/92/$05.00 © 1992 - Elsevier Science Publishers B.V. All rights reserved

Failure. Failure of one component of a distributed program generally requires continued operation, albeit in degraded mode, whereas failure of one component of a parallel computation can generally be allowed to cause termination of the overall computation. In distributed environments, the longevity of execution, together with the probability of communication and node failure, means that software development toolkits must provide programming abstractions to deal with such failures. This is not the case in multicomputers, where we can assume reliable communication and a low probability of node failure during the execution of programs which have as their primary objective speedup rather than continuous execution.

Evolution. A large class of critical distributed programs execute perpetually and for economic or safety reasons require the facility to be modified and updated on-line. The Conic toolkit supports this requirement through its ability to dynamically configure running systems. On-line evolution is not a requirement for parallel programs, which run for relatively short periods, completing when a result has been computed.

Heterogeneity. Distributed programs are generally designed such that components of the program may run on computers with different processor types. Programming environments for distributed systems deal with this hardware heterogeneity by providing multiple code generators and message datatype conversion facilities. Distributed memory multicomputers, typified by transputer networks, provide hardware homogeneity, although they are usually hosted by a computer with a different processor type. However, programming environments for multicomputers should optimise for hardware homogeneity.

In summary, Tonic is optimised to support the development of parallel programs for distributed memory multicomputers where the primary objective is speedup. Tonic does not support dynamic configuration and assumes reliable processors and reliable inter-processor communication. It inherits from Conic the configuration approach in providing a separate language to define logical structure, and extends the Conic configuration facilities by also applying this language to describing the physical structure of the target multicomputer. This physical configuration description is used to drive the logical to physical mapping process.

Currently, the most commonly used toolset for developing parallel programs for transputer based multicomputers is the Occam language [6] and the Transputer Development System (TDS) [7]. While these are efficient tools for developing embedded applications, they have drawbacks when used for developing application programs for the current generation of transputer based multicomputers such as the Meiko Computing Surface and the Supernode. The drawbacks are primarily concerned with the flexibility permitted in mapping an arbitrary network of communicating Occam processes to the hardware topology of interconnected transputers. The developer must take into account the limit of four links per transputer and laboriously place logical channels onto physical channels. Explicit multiplexor and demultiplexor processes must be provided where it is necessary to map more than one logical communication channel onto a physical link. Changing the number of processors on which an application runs requires recompilation.

More recent toolsets such as CStools [13] address the problem of making the logical structure independent from the underlying hardware structure through the use of configuration descriptors (termed 'par files'); however, these descriptions do not permit runtime parameterisation of the number of processors or flexible logical to physical mapping unless the user resorts to the underlying library routines. The Helios operating system [5] allows a user to describe the hardware configuration (Resource Map) separately from the logical configuration (CDL). These descriptions use different notations and are limited to compile time parameterisation. Further, the application programmer has little control over the logical to physical mapping. Helios programmers are limited to Unix style I/O for inter-process communication.

In the following section, the Tonic facilities for developing parallel programs are illustrated through the development of a program to compute the Mandelbrot Set. The facilities provided for mapping this program to different hardware topologies are described in Section 3. Section 4 overviews the implementation of Tonic and provides some performance data. Finally, Section 5 evaluates the approach.

2. Program construction in Tonic (logical structure)

The following overviews the programming features offered by Tonic for parallel programming through the example of a program to generate Mandelbrot sets. The program generates a 512 by 512 pixel image where the colour of each pixel is represented by an 8-bit quantity. This quantity is computed as the number of iterations (up to a maximum of 255) of the calculation z = z*z + c before |z| > 2, where z is a complex variable and c a complex constant. If the maximum is reached, c is assumed to be in the Mandelbrot Set; otherwise the number of iterations indicates how 'close' c is to the set. The simplistic approach to parallelising this is to divide the image into the same number of chunks as there are processors and hand each chunk to a processor for computation. Since some image areas, far outside the set, require much less computation than others, this approach leads to poor load balancing and thus poor performance. A more sophisticated approach employs a work allocator or supervisor process to hand out smaller chunks to worker or slave processes [12]. A slave process computes a chunk and hands it back to the supervisor for display and then gets another chunk to compute until none are left. In the following, chunks are the size of one horizontal line of pixels. The logical structure of the program is shown in Fig. 1.
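The per-pixel calculation described above can be sketched in Python. This is an illustrative translation, not the paper's Pascal code; the 255-iteration cap and escape radius 2 follow the text, and the function name is ours.

```python
def mand_count(cx, cy, max_iter=255):
    """Iterate z = z*z + c from z = c; return the iteration count
    (capped at max_iter) at which |z| first exceeds 2."""
    zx, zy = cx, cy
    for i in range(1, max_iter + 1):
        xx, yy, xy = zx * zx, zy * zy, zx * zy
        if xx + yy > 4.0:              # |z|^2 > 4  <=>  |z| > 2
            return i
        zx, zy = xx - yy + cx, xy + xy + cy
    return max_iter                    # treated as inside the set

# one 9-sample 'line' across the x-range of mandgen's default region
row = [mand_count(-2.0 + j * 0.5, 0.0) for j in range(9)]
```

Points far from the set escape in one or two iterations while points inside consume the full 255, which is exactly the load imbalance the supervisor/slave scheme addresses.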

Fig. 1. Logical structure of the Mandelbrot generator program.

The types of message exchanged between components of the program, together with program-wide constants, are defined in a definitions unit (cf. Modula-2 modules) as shown below.

1 define mandbrot: Xmax, Ymax, mandmsg, mandp;
2 const Xmax=512; Ymax=512;
3 type mandp=^mandmsg;
4      mandmsg=record
5        lineno: integer;
6        linebuf: packed array [1..Xmax] of char;
7      end;
8 end.

The program unit shown below is the definition of the slave process type - in Tonic, processes are task modules. Tasks¹ communicate with the outside world by sending messages to exitports and receiving messages from entryports. A task has no direct knowledge of which other tasks it will be connected to. This configuration independence greatly facilitates reuse. In this case, the slave task sends a computed line of pixel colour values (type mandmsg) to its exitport result (line 4). The communication primitive used is a send-wait (line 23), which sends a request message to the exitport and then suspends the task waiting for a reply. The first message from a slave to the supervisor has a zero line number, indicating that the message does not include a computed line - it is merely a request for the first line to compute. Subsequent messages overload a request for a new line with the results of computing the last line.

1 task module slave (x0,y0,d0:real);
2 use
3   mandbrot: Xmax,Ymax,mandmsg;
4 exitport
5   result: mandmsg reply integer;
6 var

¹ The terms task and process are used interchangeably throughout the paper.


7   M: mandmsg; x1,y1,delta:real; i:integer;
8
9 function mandcalc (cx,cy:real):integer;
10 var i:integer; zx,zy,xx,xy,yy,t:real;
11 begin
12   i:=0; zx:=cx; zy:=cy;
13   repeat
14     xy:=zx*zy; xx:=zx*zx; yy:=zy*zy; zy:=xy+xy+cy;
15     zx:=xx-yy+cx; t:=xx+yy; i:=i+1;
16   until (t>4.0) or (i=256);
17   mandcalc:=i;
18 end;
19
20 begin
21   M.lineno:=0; delta:=d0/Xmax;
22   loop
23     send M to result wait M.lineno;
24     x1:=x0;
25     y1:=y0-(M.lineno-1)*delta;
26     for i:=1 to Xmax do begin
27       M.linebuf[i]:=chr(mandcalc(x1,y1)); x1:=x1+delta;
28     end;
29   end;
30 end.

The supervisor component shown in Fig. 1 is implemented by two tasks as shown in Fig. 2.

Fig. 2. Supervisor composite component.

The supervisor composite component is described in the Tonic configuration language by the group module below. Note that the interface to a group is defined in an identical way to task module interfaces. Thus tasks can be replaced by groups and vice-versa at any point during program development without affecting the rest of the program. The configuration description consists of four parts:
(i) the use clause (line 2) specifies the message and component types used to construct the group,
(ii) the interface to the component in terms of entry and exitports (line 6),
(iii) the instances of component types from which the component is constructed (line 8), and
(iv) the interconnections between instances of these components (line 11).

Since the master is critical to the overall performance of the program, its associated pragma (line 10) indicates that this task instance should be run as a high priority transputer process with its workspace in on-chip memory if possible.

1 group module supervisor;
2 use
3   mandbrot: mandmsg;
4   display;
5   master;
6 entryport
7   result: mandmsg reply integer;
8 create
9   display;
10  master <PRI=0, MEM=ONCHIP>;
11 link
12   display.out to master.out;
13   result to master.result;
14 end.

The program for task master is given below. This task allocates lines to be computed to slave tasks through replies to the entryport result (line 18) and stores computed lines in the array bufs. Since lines take different times to compute, computed lines will not be received by master in line order. The guard on the receive from the entryport out (line 25) ensures that the display task receives lines in the correct order. The select statement is similar to that provided by Ada [3]. One of the set of eligible receive statements is selected for execution together with the computational statements associated with it. A receive is eligible if the associated guard is true (or there is no guard) and a message is queued to the entryport on which the receive is being performed. It should be noted that the master task does not have or need information on the number of slave tasks that are connected to it. In the following, this will allow us to simply parameterise the overall program with the number of slave tasks.

1 task module master;
2 use
3   mandbrot:Xmax,Ymax,mandmsg,mandp;
4 entryport
5   result:mandmsg reply integer;
6   out:signaltype reply mandp;
7 var
8   written,allocated:integer; current:mandp;
9   bufs:array[0..Ymax] of mandp;
10 begin
11   for written:=0 to Ymax do bufs[written]:=nil;
12   written:=0; allocated:=0; new(current);
13   loop
14     select
15       receive current^ from result
16         => allocated:=allocated+1;
17            if allocated<=Ymax then
18              reply allocated to result;
19            with current^ do
20              if lineno<>0 then begin
21                bufs[lineno-1]:=current; new(current);
22              end;
23     or
24       when bufs[written]<>nil
25       receive signal from out reply bufs[written]
26         => written:=written+1;
27     end;
28   end;
29 end.
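The ordering discipline that master enforces (buffer results that arrive out of line order, release them to display strictly in order) can be mimicked in a few lines of Python; the function and names are ours, not Tonic's.

```python
def ordered_release(arrivals, total):
    """Buffer lines that arrive out of order and yield them in line
    order, as master's guard 'when bufs[written] <> nil' does."""
    bufs = {}
    written = 1
    for lineno, data in arrivals:
        bufs[lineno] = data
        while written in bufs:      # guard: the next line is ready
            yield bufs.pop(written)
            written += 1

# lines 2, 1, 3 arrive in that order; display still sees 1, 2, 3
out = list(ordered_release([(2, "b"), (1, "a"), (3, "c")], 3))
```

The same decoupling is why display never needs to know how many slaves exist or how fast each one runs.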

The remaining task display, listed below, exists to decouple I/O latencies to the display from the response time of master to requests for lines. Note that this is the only task in the program that terminates. The Tonic termination model simply states that when any one task terminates (correctly or erroneously) the entire program is terminated. In this, Tonic differs considerably from its predecessor, which allowed continued operation in the presence of failures. This decision is consistent with the different characteristics of parallel and distributed programs identified in the introduction.

1 task module display;
2 use
3   mandbrot:Xmax,Ymax,mandmsg,mandp;
4 exitport
5   out:signaltype reply mandp;
6 var
7   current:mandp; i,j:integer; output:text;
8 begin
9   for i:=1 to Ymax do begin
10    send signal to out wait current;
11    for j:=1 to Xmax do
12      write(output,current^.linebuf[j]);
13  end;
14 end.

The final step in developing the Mandelbrot program is to describe the overall configuration structure of slave and supervisor components together with an abstract description of how we wish these to be executed on the multicomputer. At this stage, we merely indicate a mapping of components to an abstract machine which consists of maxprocessor identical processors. We are not concerned with the physical details of how these processors are interconnected. In fact, we assume that they are fully interconnected or, as termed in the following, globally interconnected. The configuration description for the program mandgen is given below.

{default parameter values}
1 group module mandgen (x:real=-2.0; y:real=2.0; d:real=4.0);
2 use
3   execpar;
4   supervisor;
5   slave;
6 create
7   execpar;
8 create
9   supervisor;
10 create forall k:[1..maxprocessor] at (k)
11   slave[k](x,y,d) <MEM=ONCHIP>;
12 link forall k:[1..maxprocessor]
13   slave[k].result to supervisor.result;
14 end.

The replicator forall is used to declare vectors of components (line 10) or links (line 12). The at clause (line 10) specifies the processor at which the component instance is to be located. Any integer expression may follow the keyword at to denote the processor. Components with no at clause are by default allocated to the processor to which their parent group is allocated (in this case 1). More precisely, the rules governing allocation to processor numbers are:
(1) Processors are numbered from 1 to maxprocessor.
(2) Any component instance can be allocated to a processor by at.
(3) The default allocation is to the parent group (i.e. at not used).
(4) The top-level group is conceptually allocated to processor 1.
These rules mean that an at clause can appear at any level of a configuration description. For example, the configuration description for the parallel executive component execpar (below) allocates an instance of the component exec800 to each processor. Execpar provides I/O to the host, error reporting and inter-processor communication. It should be noted that maxprocessor is not a compile-time constant - it is initialised at run-time as described in the next section.

1 group module execpar (buffers:integer=4);
2 use
3   exec800;
4 create forall k:[1..maxprocessor] at (k)
5   exec800[k](buffers);
6 end.
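The four allocation rules above can be restated as a small resolver. This Python fragment is our illustrative paraphrase of the rules, not part of Tonic.

```python
def resolve_processor(at_clause, parent_processor):
    """Rules (2)-(4): an explicit 'at' wins; otherwise a component
    inherits its parent group's processor; the top-level group's
    parent is conceptually processor 1."""
    if at_clause is not None:        # rule (2): explicit at-clause
        return at_clause
    if parent_processor is None:     # rule (4): top-level group
        return 1
    return parent_processor          # rule (3): default to parent

# execpar-style placement: exec800[k] carries 'at (k)', so each
# instance lands on its own processor, numbered per rule (1)
placements = [resolve_processor(k, None) for k in range(1, 5)]
```

An un-annotated child, by contrast, simply runs wherever its parent group was placed.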

This section has demonstrated how parallel programs are constructed in Tonic. The mandgen program includes examples of request-reply communication and of one-to-one and many-to-one communication (many slave exitports to one supervisor entryport). Tonic also includes unidirectional synchronous communication primitives, a mechanism for responding to requests in a different order to that in which they were received, and a forwarding facility. Description of these is beyond the scope of the present paper. The next section describes how the Mandelbrot program can be executed on many different hardware configurations.

Fig. 3. Meiko computing surface.

3. Logical to physical mapping

The previous section described the logical structure of the Mandelbrot program mandgen. This logical structure is annotated with a mapping of components to an abstract machine consisting of maxprocessor globally interconnected identical processors. In this section, we outline how this abstract machine is realised on the actual hardware. In our case the target hardware is a Meiko Computing Surface consisting of a SPARC based host (running Unix) and 32 T800 transputers, each with 4 Megabytes of memory (Fig. 3). Via a utility provided by Meiko (svcds), a user can reserve a variable number of transputers and set up inter-transputer links before downloading an application program. Svcds provides a bidirectional message passing interface from the host to link 0 of one of the reserved transputers.

The Tonic configuration language is used to describe the desired physical topology of interconnected transputers. This physical configuration description is used to drive the logical to physical mapping process. Figure 4 depicts the configuration view of an individual T800 transputer.

The T800 transputer type is represented by a Tonic task module. The task has no code since it is never executed. It serves only to provide an interface specification to configuration descriptions.

  task module t800;
  exitport
    linkout[0..3]:byte;
  entryport
    linkin[0..3]:byte;
  begin {never executed}
  end.

Fig. 4. Configuration view of T800 transputer.

Using this definition, we can now describe physical topologies of transputers. The following group module describes a pipeline where link 1 of each transputer is connected to link 0 of its successor in the pipeline. The pragma (line 7) associates an integer processor identifier with each T800 instance. These processor identifiers are used during the mapping process. A component in the logical configuration will execute at the processor whose identity corresponds to that specified by the component's at clause. If the pragma is omitted, a default numbering scheme is applied. In the interest of conciseness, it is only necessary to define either linkout[m] to linkin[n] or linkout[n] to linkin[m] to specify a hardware connection between two transputers.

1 group module pipeline (length:integer);
2 entryport
3   linkin:byte;
4 use
5   t800;
6 create forall k:[1..length]
7   t800[k] <PID=k>;
8 link forall k:[1..length-1]
9   t800[k].linkout[1] to t800[k+1].linkin[0];
10 link
11   linkin to t800[1].linkin[0];
12 end.
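The pipeline wiring just shown (link 1 of each transputer to link 0 of its successor, the host on link 0 of the first) can be tabulated in the AJ[processor, link] adjacency form that Section 4 describes. This Python construction is our own sketch; the -1 "unused link" convention is an assumption, not taken from the paper.

```python
def pipeline_adjacency(length, maxlink=4):
    """Build AJ[i][j]: the processor wired to link j of processor i
    (processor 0 is the host; -1 marks an unused link)."""
    AJ = [[-1] * maxlink for _ in range(length + 1)]
    AJ[0][0] = 1                # host link to linkin[0] of t800[1]
    AJ[1][0] = 0
    for k in range(1, length):
        AJ[k][1] = k + 1        # linkout[1] of t800[k] ...
        AJ[k + 1][0] = k        # ... to linkin[0] of t800[k+1]
    return AJ

AJ = pipeline_adjacency(4)      # a pipeline of 4 transputers
```

Each transputer uses at most two of its four links here, which is why a pipeline is trivially legal under the four-links-per-transputer constraint noted in the introduction.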

The following description of a ternary tree of transputers illustrates some of the more powerful features of the Tonic configuration language - namely guards and recursion. In this example we have omitted pragmas to explicitly associate identities with processors but relied on the default assignment of identities supplied by the underlying system. This definition of a ternary tree is of limited usefulness since it only generates configurations of 1 processor (depth = 0), 4 processors (depth = 1), 13 processors (depth = 2), etc. In practice, we use a definition of ternarytree which generates balanced ternary trees for any number of processors.

1 group module ternarytree(depth:integer);
2 entryport
3   linkin:byte;
4 use
5   t800;
6 create
7   root:t800;
8 link
9   linkin to root.linkin[0];
10 when depth > 0
11 create forall k:[1..3]
12   child[k]:ternarytree(depth-1);
13 when depth > 0
14 link forall k:[1..3]
15   root.linkout[k] to child[k].linkin;
16 end.
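The quoted configuration sizes (1, 4, 13, ...) follow directly from the recursion above: one root plus three subtrees of depth-1, i.e. the geometric sum (3^(depth+1) - 1)/2. A quick check in Python (our sketch, mirroring the group module's recursion):

```python
def ternarytree_size(depth):
    """Processors in a complete ternary tree: one root plus three
    recursive subtrees, matching the 'when depth > 0' guard."""
    if depth == 0:
        return 1
    return 1 + 3 * ternarytree_size(depth - 1)

sizes = [ternarytree_size(d) for d in range(4)]
```

This is exactly why the definition only yields certain processor counts, motivating the balanced variant mentioned in the text.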

To complete the hardware configuration description, the connection between the host processor and target transputer system must be specified as shown below for both the pipeline and ternary tree topologies. Line 9 specifies the connection between the host, represented by the component gin, and the first transputer in the pipeline (or ternarytree). Gin is the engine which performs most of the work of providing the Globally INterconnected abstract machine required to execute the logical configuration. Its implementation is described in outline in the next section. The name was irresistible.

1 group module pipe (length:integer);
2 use
3   gin;
4   pipeline;
5 create
6   gin;
7   pipeline(length);
8 link
9   gin.linkout to pipeline.linkin;
10 end.

Fig. 5. Mandgen mapped to a pipeline of 4 transputers.

1 group module ttree(depth:integer);
2 use
3   gin;
4   ternarytree;
5 create
6   gin;
7   ternarytree(depth);
8 link
9   gin.linkout to ternarytree.linkin;
10 end.

The above group modules pipe and ttree compile into the host executable files pipe and ttree. The program mandgen described in the previous section compiles into the target executable file mandgen. To execute the Mandelbrot program on a pipeline of four processors, the user types the following command on the host. The logical to physical mapping is depicted in Fig. 5.

pipe 4 mandgen -2.0 2.0 4.0 | pixdisp ²

Similarly, to execute the Mandelbrot program on a ternary tree of depth 2 (13 processors), the user types the command:

ttree 2 mandgen -2.0 2.0 4.0 | pixdisp

Note that ttree and pipe can be applied to any application program; they are not specific to the Mandelbrot example.

4. Implementation and performance

² Pixdisp is a program executing on the Unix host which reads from its standard input and displays the bytes read as coloured pixels in an X window.

In this section, we give an overview of how Tonic programs are executed on the Meiko Computing Surface. The latter part of the section discusses performance.

4.1 Tonic configuration language

Each group module in a configuration description compiles into a procedure to elaborate the structure of that group at run-time. The set of these elaboration procedures, when executed at run-time, generates a directed graph in which the nodes are task instances and the arcs are inter-task links. Group modules are not represented in this graph; it is a flat representation of the hierarchical configuration structure [4]. Consequently, no penalty is paid at run-time for using hierarchically structured configuration descriptions.

4.2 Bootstrapping the transputer network

When a physical configuration description is executed on the host (e.g. ttree), the graph structure generated is passed to Gin. This graph represents the desired configuration of transputers required to execute the logical configuration. Gin performs the following sequence of actions:
(1) The graph is checked to ensure that it represents a legal transputer configuration. That is, all transputer connections are one-to-one, processor identifiers are in a contiguous range, and the graph is fully connected (so that there is a path to boot every processor).
(2) Gin then computes a minimum depth spanning tree for the graph. This identifies which transputer links will be used for bootstrapping. The complete graph is recorded in an adjacency matrix AJ[0..maxprocessor, 0..maxlink] where maxlink = 3 and AJ[i,j] is the identity of the processor to which link j of processor i is connected. Processor 0 represents the host. The matrix is marked with those links which will be booted.
(3) Using the svcds utility, gin grabs the required number of processors and interconnects them to conform to the graph generated by the physical configuration description.
(4) In the next stage, gin bootstraps the first transputer by sending the application program (e.g. mandgen.800) to its link 0. Once the program starts executing, gin sends it four further pieces of information:


(a) its processor identifier (in the range 1..maxprocessor),
(b) the maximum number of processors maxprocessor,
(c) the adjacency matrix AJ,
(d) the command arguments represented as strings (Unix argc & argv). For the example, these would be mandgen.800 2.0 2.0 4.0.

(5) At this stage, Gin has finished the bootstrapping process and becomes a server which services I/O requests from the application program. It runs until the program running on the network either reports an error or terminates.
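The legality check in step (1) is straightforward to sketch, assuming the adjacency matrix convention of step (2): AJ[i][j] is the processor reached via link j of processor i, with -1 marking an unused link, and processor 0 being the host. This is an illustrative reconstruction, not Gin's actual code.

```python
from collections import deque

MAXLINK = 3          # transputers have four links, numbered 0..3
UNUSED = -1

def check_configuration(AJ):
    """Check that AJ represents a legal transputer configuration:
    connections are one-to-one (symmetric), processor identifiers
    form a contiguous range 0..n-1, and every processor is reachable
    from the host (processor 0) so that it can be booted."""
    n = len(AJ)      # contiguous identifiers 0..n-1 by construction
    # One-to-one: if link j of i reaches k, some link of k must reach i.
    for i in range(n):
        for j in range(MAXLINK + 1):
            k = AJ[i][j]
            if k != UNUSED and i not in AJ[k]:
                return False
    # Full connectivity: breadth-first search from the host.
    seen, frontier = {0}, deque([0])
    while frontier:
        i = frontier.popleft()
        for k in AJ[i]:
            if k != UNUSED and k not in seen:
                seen.add(k)
                frontier.append(k)
    return len(seen) == n

# Host (0) connected to processors 1 and 2, which are also connected.
AJ = [[1, 2, UNUSED, UNUSED],
      [0, 2, UNUSED, UNUSED],
      [0, 1, UNUSED, UNUSED]]
print(check_configuration(AJ))    # True
```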

The application program loaded into each transputer continues the bootstrap process. When started, it receives the items (a) to (d) listed in step (4) above. The program then examines the entry in the adjacency matrix corresponding to its processor identifier. If an entry AJ[self, j] is marked to be bootstrapped, the program sends its code to outgoing link j to bootstrap the processor to which link j is connected. It then sends the processor identifier AJ[self, j], maxprocessor, AJ and the command arguments to complete the bootstrap. Note that exactly the same code is loaded into each processor; the only value which a processor receives to distinguish it from others is its processor identifier. After the first level of the boot spanning tree has been bootstrapped, booting continues in parallel until the leaves of the tree have been bootstrapped.
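The parallel boot can be simulated in a few lines: a breadth-first search from the host yields the minimum-depth spanning tree of step (2), and each booted processor then boots its marked children, so whole levels of the tree come up in parallel. Again a sketch under the adjacency matrix convention above, not Gin's actual implementation.

```python
from collections import deque

UNUSED = -1

def mark_boot_tree(AJ):
    """Mark the links of a minimum-depth (breadth-first) spanning tree
    rooted at the host (processor 0).  boot[i][j] is True if processor i
    should send the code down its link j during bootstrap."""
    n = len(AJ)
    boot = [[False] * len(AJ[0]) for _ in range(n)]
    seen, frontier = {0}, deque([0])
    while frontier:
        i = frontier.popleft()
        for j, k in enumerate(AJ[i]):
            if k != UNUSED and k not in seen:
                seen.add(k)
                boot[i][j] = True      # i boots k over link j
                frontier.append(k)
    return boot

def boot_levels(AJ, boot):
    """Group processors by the level at which they are booted;
    each level boots in parallel once its parents are running."""
    levels, current = [], [0]
    while current:
        levels.append(current)
        current = [AJ[i][j] for i in current
                   for j, marked in enumerate(boot[i]) if marked]
    return levels

# Host 0 boots 1 and 2 in parallel; 1 then boots 3.
AJ = [[1, 2, UNUSED, UNUSED],
      [0, 3, UNUSED, UNUSED],
      [0, UNUSED, UNUSED, UNUSED],
      [1, UNUSED, UNUSED, UNUSED]]
print(boot_levels(AJ, mark_boot_tree(AJ)))    # [[0], [1, 2], [3]]
```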

4.3 Initialisation

After completing bootstrapping, each transputer executes the same initialisation code. This code first calculates a minimum distance routing table from the adjacency matrix. The routing table is used by execpar at execution time. Initialisation proceeds by invoking the group elaboration procedures. Each transputer node thus has a copy of the complete logical configuration graph. However, the kernel (which is part of execpar) only instantiates tasks which correspond to its processor identifier, i.e. for which an at clause in the logical configuration specified 'this' processor. In addition to instantiating tasks, the kernel creates data structures to implement both local and remote inter-task communication. These communication data structures contain initialised transputer channel words. The Tonic communication primitives are implemented using the transputer communication instructions in, out, alt, etc.

                                          4 byte request,   100 byte request,
                                          4 byte reply      4 byte reply
  Intra-processor                              17uS              19uS
  Inter-processor                             174uS             241uS
  + time per additional
    intermediate processor                   14.4uS             210uS

Fig. 6. Request-reply times.
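The minimum-distance routing table mentioned above can be derived from the adjacency matrix by a breadth-first search: every destination inherits the outgoing link of the neighbour through which it was first reached. This is an illustrative sketch, not the actual initialisation code.

```python
from collections import deque

UNUSED = -1

def routing_table(AJ, self_id):
    """For processor self_id, compute route[dest] = the outgoing link
    to use on a minimum-distance path to dest, given adjacency matrix
    AJ (AJ[i][j] is the processor on link j of processor i)."""
    n = len(AJ)
    route = [UNUSED] * n
    seen, frontier = {self_id}, deque()
    # Seed the search with our own links.
    for j, k in enumerate(AJ[self_id]):
        if k != UNUSED and k not in seen:
            seen.add(k)
            route[k] = j
            frontier.append(k)
    # Breadth-first expansion: anything reached through k shares
    # k's first hop.
    while frontier:
        k = frontier.popleft()
        for m in AJ[k]:
            if m != UNUSED and m not in seen:
                seen.add(m)
                route[m] = route[k]
                frontier.append(m)
    return route

# Pipeline 0 - 1 - 2: from node 0, both 1 and 2 are reached via link 0.
AJ = [[1, UNUSED, UNUSED, UNUSED],
      [0, 2, UNUSED, UNUSED],
      [1, UNUSED, UNUSED, UNUSED]]
print(routing_table(AJ, 0))    # [-1, 0, 0]
```

Because every node holds the full adjacency matrix, each transputer can build this table locally without remote communication, which is what allows initialisation to proceed in parallel.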

Loading the entire code for the application at each processor has the disadvantage of wasting storage when task types are loaded but not instantiated. However, this scheme has the following major advantages: (1) It permits the bootstrap to proceed in parallel, which considerably reduces startup time for large numbers of processors. (2) Since each node has a complete copy of the logical configuration graph, the setup of inter-task communication associations at initialisation time does not require remote communication; the initialisation of each transputer proceeds in parallel. Again this reduces application startup time. Tonic applications typically take less than a second to start up.

4.4 Performance

Figure 6 shows the times required for a request-reply message exchange. The time is measured from the moment the sending task initiates the exchange by a send-wait to the moment the reply message completes the exchange. The receiving task executes a receive followed by a reply.

The times above are for one-to-one request-reply communication, i.e. one exitport connected to one entryport. Where the communication is n-to-1 (n exitports connected to 1 entryport, as in the Mandelbrot example) the time for an individual intra-processor request-reply is T + 5*n uS (for n > 1), where T is the time for a one-to-one communication. This is because many-to-one communication is implemented using the transputer alt instruction. The receiver task waits on a set of transputer channels representing the set of exitports connected to it. Consequently, the time to receive from an entryport is proportional to the number of exitports connected to it. This represents a considerable performance penalty for large fan-in configurations. For example, with 32 processors, the mandgen program has a 32 to 1 connection to the supervisor's result entryport. Consequently, the performance penalty is 160uS per communication. We are currently re-implementing intra-processor communication using critical regions rather than transputer communication channels to make the receive time independent of the fan-in factor.

Fig. 7. Mandgen program speedup (speedup plotted against number of processors, 3 to 31, for the ttree and pipe mappings, with linear speedup shown for comparison).
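The fan-in cost model quoted above can be written down directly: an intra-processor request-reply on an entryport with n connected exitports costs T + 5n uS for n > 1. Using the 4-byte figure T = 17 uS from Fig. 6, this reproduces the 160 uS penalty quoted for the 32-to-1 connection:

```python
def fanin_request_reply_us(T, n):
    """Intra-processor request-reply time (in microseconds) for an
    entryport with n connected exitports: T for one-to-one, plus a
    5 uS-per-exitport cost arising from the transputer alt instruction."""
    return T if n <= 1 else T + 5 * n

T = 17                 # uS, 4-byte request / 4-byte reply (Fig. 6)
penalty = fanin_request_reply_us(T, 32) - T
print(penalty)         # 160 uS extra for the 32-to-1 connection
```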

Figure 7 shows the speedup for the Mandelbrot program example plotted against the number of processors, using a balanced ternary tree hardware configuration (ttree) and a pipeline hardware configuration (pipe). Speedup(n) is measured as the time for 1 processor (i.e. 1 slave & supervisor) divided by the time for n processors (i.e. n slaves & 1 supervisor). The 1 processor case, with a supervisor and a slave running on the same processor, is only marginally slower than the sequential program since slave-supervisor communication is local. Not surprisingly, the ternary tree mapping outperforms the pipeline mapping. The average processing rate for the 32 processor ternary tree mapping is 26 Mflop/s, or 0.8 Mflop/s per transputer. The overall time with 32 processors to complete the computation with the parameters (2.0, 2.0, 4.0) was 3.3 seconds.
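The quoted figures are mutually consistent, as a quick calculation shows: 26 Mflop/s spread over 32 transputers is about 0.8 Mflop/s each, and at that rate the 3.3-second run implies roughly 86 Mflop of work (the work figure is our inference, not stated in the paper):

```python
total_rate = 26.0      # Mflop/s, 32-processor ternary tree mapping
processors = 32
runtime = 3.3          # seconds, for parameters (2.0, 2.0, 4.0)

per_processor = total_rate / processors   # rate per transputer
work = total_rate * runtime               # implied total work, Mflop

print(f"{per_processor:.1f}")   # 0.8  Mflop/s per transputer
print(f"{work:.1f}")            # 85.8 Mflop
```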

5. Discussion and conclusions

The paper has presented a configuration approach to the construction of parallel programs in which the functional behaviour of individual components is specified by a programming language and the overall parallel program structure is specified by a configuration language. The configuration specification declares the instances of component types and their interconnections. Component instances execute in parallel. This logical configuration is annotated with a mapping to an abstract machine which consists of maxprocessor identical, globally interconnected processors. The physical configuration of the real machine is specified using the same configuration language. This physical description drives the logical to physical mapping process. Both the logical and physical configuration specifications are considerably more flexible than those provided by existing systems [7,5,13]. Running a Tonic program on different physical configurations with different numbers of processors and different inter-processor connections requires no re-compilation. This facilitates both portability and experimentation with different logical to physical mappings. The configuration approach is similar to that of the MUPPET [14] system, which uses a graphical notation to express configurations. However, MUPPET does not clearly separate logical from physical configurations and is limited in the mappings which can be expressed. Tonic also has a graphical notation for expressing configurations and an associated display tool [9]. However, for describing regular structures, we find the power of the textual language more useful.

In concurrent programming terms, Tonic falls into the class of languages which support the process/message passing paradigm [2]. The best known of these languages are Occam [7] and Ada [3]. Unlike these languages, Tonic incorporates a separate language to describe the structure of concurrent programs in terms of task instances and inter-task message paths (links). These languages also differ in the time at which the program's process structure is fixed. Occam defines the process structure statically at compile time,

Tonic at instantiation/initialisation time, and Ada dynamically at run-time. In fairness, it should be noted that Ada semantics do not permit an efficient distributed memory implementation. We are currently experimenting with an approach which would allow changes to the process structure at run-time while retaining a strict separation between programming and configuration [11]. This will permit a limited, but efficient, form of process migration to facilitate dynamic load balancing.

We regard Tonic as a prototype implementation to validate the configuration approach to parallel programming. Its application is restricted by the reliance on one specific programming language: Pascal + message passing. Currently, we are engaged in the development of a configuration language (Darwin) [11] and associated tools which will permit the configuration approach to be applied to parallel programs composed of components written in commonly available languages such as C & Fortran.

Despite the limitations expressed above, Tonic is a practical and efficient tool for developing parallel programs. It hides many of the irrelevant details about the underlying hardware which currently harass parallel programmers. Typically, application programmers select a physical configuration from a library rather than programming their own. The library currently includes pipeline, ring, mesh, torus, binary tree, ternary tree, cube connected cycles and WK-Recursive physical topologies. The toolset is used by both research and undergraduate students.

Acknowledgements

The authors would like to acknowledge discussions with our colleagues in the Parallel and Distributed Systems Group during the formulation of these ideas. We gratefully acknowledge the SERC under grant GR/G31079, and the CEC in the REX Project (2080), for their financial support.

References

[1] W.C. Athas and C.L. Seitz, Multicomputers: message-passing concurrent computers, IEEE Comput. 21 (8) (Aug. 1988) 9-24.

[2] H. Bal, J. Steiner and A. Tanenbaum, Programming languages for distributed computing systems, ACM Comput. Surveys 21 (3) (Sept. 1989) 261-322.

[3] Department of Defense, U.S.A., Reference manual for the Ada programming language, ANSI/MIL-STD- 1815A, DoD, Washington D.C. Jan 1983.

[4] N. Dulay, A configuration language for distributed programming, Ph.D. Thesis, Dept. of Computing, Imperial College, February 1990.

[5] Distributed Software Ltd, The Helios parallel programming tutorial, 670 Aztec West, Bristol, January 1990.

[6] Inmos Ltd. OCCAM 2 reference manual, Prentice Hall, 1988.

[7] Inmos Ltd. Transputer development system, Prentice Hall, 1988.

[8] J. Kramer and J. Magee, Dynamic configuration for distributed systems, IEEE Trans. Software Engrg. SE-11 (4) (Apr. 1985) 424-436.

[9] J. Kramer, J. Magee and K. Ng, Graphical configuration programming, IEEE Comput. 22 (10) (1989) 53-65.

[10] J. Magee, J. Kramer and M. Sloman, Constructing distributed systems in Conic, IEEE Trans. Software Engrg. SE-15 (6) (Jun. 1989) 663-675.

[11] J. Magee, J. Kramer, M. Sloman and N. Dulay, An overview of the REX software architecture, Proc. 2nd IEEE Workshop on Future Trends of Distributed Computing Systems, Cairo, Egypt (Sep. 1990) 396-402.

[12] J.N. Magee and S.C. Cheung, Parallel algorithm design for workstation clusters, Software-Practice and Experience 21 (Mar. 1991) 235-250.

[13] Meiko Ltd, CS tools documentation guide, 650 Aztec West, Bristol, 1989.

[14] H. Muhlenbein, Th. Schneider and S. Streitz, Network programming with MUPPET, J. Parallel Distributed Comput. 5 (1988) 641-653.

Jeff Magee graduated from Queens University, Belfast with a degree in electrical engineering in 1973. After working with the British Post Office on the design and development of System X, he returned to college, where he received the M.Sc. and Ph.D. degrees in computing science from Imperial College, London, in 1978 and 1984, respectively.

He is currently a Senior Lecturer in the Department of Computing at Imperial College. His research interests include parallel algorithms, distributed operating systems and tool support for the design and development of parallel and distributed programs. Dr. Magee is a member of the I.E.E.

Naranker Dulay graduated from Manchester University with a B.Sc. in Computer Science in 1979 and was awarded a Ph.D. in computing from Imperial College, London in 1990. He is currently a Lecturer in the Department of Computing at Imperial College. His research interests lie in the areas of languages, compilers, algorithms, and architectures for distributed and parallel computing. Dr. Dulay is a member of the BCS.