
Eden Trace Viewer: A Tool to Visualize Parallel Functional Program Executions

Pablo Roldán Gómez∗†

[email protected]

† Universidad Complutense de Madrid, Facultad de Informática, E-28040 Madrid, Spain

Jost Berthold∗

[email protected]

∗ Philipps-Universität Marburg, Fachbereich Mathematik und Informatik, D-35032 Marburg, Germany

PRELIMINARY VERSION submitted to PACT 2005 conference

Abstract: This paper presents the Eden Trace Viewer (EdenTV), a tool which generates graphic representations of parallel program executions. The tracing tool was conceived to analyze the behavior of programs written in the parallel functional language Eden. With the high abstraction offered by parallel functional languages, tools for the analysis of the runtime behaviour are all the more important. We describe the concepts of Trace Generation, Trace Interpretation and Trace Representation in the EdenTV, giving brief technical details about its implementation. The functionality of the EdenTV is illustrated by two examples, emphasizing the utility of the EdenTV to analyze and improve the performance of parallel programs. An area of future work is to improve and optimize the tool in future versions, as well as to adapt it for other parallel functional languages.

I. INTRODUCTION

Due to the high complexity and nondeterministic interaction in parallel systems, optimizing parallel programs is a very complex process. Parallel functional languages “eliminate many of the most unpleasant burdens of parallel programming [...]” (foreword of [5]) because they remain abstract about the operational properties of a program. Algorithmic skeletons ([2], [16]) capture typical patterns of parallelism, which can often be expressed in the language itself (offering concepts like higher-order functions or demand-driven evaluation, which makes infinite data structures possible), but parallel functional languages “[leave] a large gap for the implementation to bridge” [5].

In the area of parallel programming, and especially optimization, the high abstraction offered by functional languages makes parallel programming much less error-prone, but on the other hand, it is very difficult to know or guess what really happens inside the parallel machine during a program execution. The programmer has few possibilities to analyze and optimize the behavior of a program by only knowing the runtime or speedup achieved. In order to improve the performance of parallel functional programs productively, we need more information about the execution.

The common way to analyze the performance of a program is to use profiling tools. Specific information about the behavior of a program is collected during the execution and often written to files for subsequent study. After the execution, we have a file with data about which events have occurred, when and where. That is what we call trace generation. Afterwards, the collected information is statistically processed by analysis software (trace interpretation) and the result can be presented to the programmer graphically or textually (trace representation).

Toolkits exist to support all steps of this procedure, but for high-level languages, the profiling tools often need to implement specific language concepts not present in standard profiling and monitoring tools. We opted to use parts of the well-established Pablo Toolkit [17] and its Self-Defining Data Format (SDDF) [1] for trace generation, and to customize only the subsequent steps in special-purpose software. Trace files in a specialized SDDF are generated by an instrumented runtime system which uses the Pablo Trace Library. The generated files are then loaded into the Eden Trace Viewer (EdenTV), our interpretation and visualization tool programmed in Java, which produces interactive bar diagrams for all computational units of the parallel functional language Eden. Because SDDF contains portable, semantically enriched data (following the same concept as XML), EdenTV can be used on any OS and hardware which supports Java.

To understand the target language for this tool and its concepts, Section II gives a short description of Eden, with the basics about language concepts and runtime system (RTS). The next section of the paper (III) explains the components of our profiling system in the three steps mentioned above: First, we look at the trace generation and describe the data which is stored during the execution of a program. Then we describe the trace interpretation step, and how trace data is stored by the EdenTV in order to build the graphics. The last part of the section covers the representation of the stored data and the style of the EdenTV graphics.

Section IV describes features of the EdenTV and shows two example analyses. Section V points to related work, and the last section (VI) concludes and discusses extensions and improvements of EdenTV as future work.

II. EDEN LANGUAGE AND IMPLEMENTATION CONCEPTS

The target language of the tool is Eden [11]. This language extends the purely functional language Haskell [13] with syntactic constructs for explicitly defining and creating parallel processes in machines of a closed parallel system. The programmer has direct control over process granularity, data distribution and communication topology [8], but does not have to manage synchronization and data exchange between processes, which is done by the runtime system through implicit communication channels, transparent to the programmer.

A. Coordination Constructs

The two essential coordination constructs of Eden are process abstraction and instantiation. Functions can be embedded into process abstractions, and a process instantiation operator starts a process by supplying input to the abstraction.

process :: (Trans a, Trans b) => (a -> b) -> Process a b

embeds functions of type a -> b into process abstractions of type Process a b, where the context (Trans a, Trans b) states that both a and b are types belonging to the Trans class of transmissible values. A process abstraction process (\x -> e) defines the behavior of a process with parameter x as input and expression e as output.

A process instantiation uses the predefined infix operator

( # ) :: (Trans a, Trans b) => Process a b -> a -> b

to provide a process abstraction with actual input parameters. Processes are distinguished from functions by their operational property to be executed remotely, while their denotational meaning remains unchanged as compared to the underlying function.

Haskell, the computation language of Eden, uses lazy evaluation, i.e. data is only evaluated on demand, if it is needed for the overall result. Eden deviates from this principle: Eden processes are eagerly instantiated, and they evaluate and send their output eagerly. This aims at increasing the degree of parallelism and at speeding up the distribution of the computation. Local garbage collection detects unnecessary results and stops the remote threads evaluating them.

B. Process Communication

In general, Eden processes do not share data with each other and are encapsulated units of computation. All data is communicated eagerly via (internal) channels, avoiding the need for global memory management and data request messages, but possibly duplicating data.

Evaluation of the expression (process funct) # arg leads to the creation of a new process for evaluating the application of the function funct to the argument arg. The argument is evaluated by new concurrent threads in the parent process and sent to the new child process, which, in turn, fully evaluates and sends back the result of the function application. Both use implicit 1:1 communication channels established between child and parent process on process instantiation.

Communication is encoded in the type class Trans (Transmissible) shown in the type signatures above. As a matter of principle, every transmitted value is evaluated to normal form (NF) prior to transmission. Furthermore, the class uses overloaded communication functions for lists, which are transmitted as streams, element by element, and for tuples, which are evaluated component-wise by concurrent threads in the same process. An Eden process can thus contain a variable number of threads during its lifetime, supplying input to child processes or concurrently evaluating results.

As communication channels are normally connections between parent and child process, the communication topologies are hierarchical. In order to support other topologies, Eden provides additional language constructs to create channels dynamically, which is another source of concurrent threads in the sender process.

C. Runtime System

Eden is implemented on the basis of the Glasgow Haskell Compiler GHC [12], a mature and efficient Haskell implementation. While the compiler frontend is almost unchanged, Eden uses a modified runtime system (RTS), where kernel parts are shared with GUM, the implementation of Glasgow Parallel Haskell (GpH) [20].

The parallel RTS uses suitable middleware (currently PVM [15]) to manage parallel execution and communication. In addition, the implementation of Eden relies essentially on the GHC implementation of concurrency. Concurrent threads, modelled in GHC by Thread State Objects (TSOs), are the basic unit for the implementation, so the central task for profiling is to keep track of their execution. TSOs in GHC are scheduled round-robin, depending on the memory consumption, which leads to the possible state transitions shown in Figure 1.

[Figure 1: state diagram with the states Runnable, Running, Blocked and Finished, and the transitions new thread, run thread, suspend thread, block thread, deblock thread and kill thread]

Fig. 1. Thread state transition diagram in Eden

A thread can be in one of the following states: runnable, running, blocked and finished. The first state of a thread after its creation is runnable, i.e. it can be executed, but the RTS has not chosen it yet. When the RTS schedules it for execution, the thread goes to state running, and possibly back to state runnable when descheduled. If a running thread needs a value which is not available, it blocks on a placeholder node in the graph heap. When the missing value arrives, the thread goes back to state runnable again. Finally, a thread finishes after evaluating and sending its output, or when its output is not required any more (detected by a local or remote garbage collection).
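The transitions of Fig. 1 can be written down as a small state machine. The following Java sketch is only an illustration of the diagram: the type and method names are hypothetical and are taken neither from the Eden RTS (which is written in C) nor from the EdenTV sources.

```java
/** Thread states of Fig. 1 (hypothetical model, not RTS or EdenTV code). */
enum ThreadState { RUNNABLE, RUNNING, BLOCKED, FINISHED }

/** Transition labels of Fig. 1; "new thread" is modelled by the initial state. */
enum ThreadEvent { RUN_THREAD, SUSPEND_THREAD, BLOCK_THREAD, DEBLOCK_THREAD, KILL_THREAD }

final class ThreadStateMachine {
    /** A newly created thread is runnable but not yet scheduled. */
    static final ThreadState INITIAL = ThreadState.RUNNABLE;

    /** Returns the successor state, or throws if the event is not allowed in the given state. */
    static ThreadState next(ThreadState state, ThreadEvent event) {
        switch (event) {
            case RUN_THREAD:
                if (state == ThreadState.RUNNABLE) return ThreadState.RUNNING;
                break;
            case SUSPEND_THREAD:
                if (state == ThreadState.RUNNING) return ThreadState.RUNNABLE;
                break;
            case BLOCK_THREAD:
                if (state == ThreadState.RUNNING) return ThreadState.BLOCKED;
                break;
            case DEBLOCK_THREAD:
                if (state == ThreadState.BLOCKED) return ThreadState.RUNNABLE;
                break;
            case KILL_THREAD:
                // A thread can be killed in any state except finished.
                if (state != ThreadState.FINISHED) return ThreadState.FINISHED;
                break;
        }
        throw new IllegalStateException(event + " is not allowed in state " + state);
    }
}
```

Rejecting illegal transitions, as this sketch does, would also be one way to detect inconsistent or truncated trace data early.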

The channels between processes are modelled by inports and outports in the RTS, both administered by respective tables to ensure the 1:1 condition. A channel is thus a connection from an outport to exactly one inport. An Eden process, as a purely conceptual unit, consists of a number of concurrent threads which share a common graph heap (as opposed to processes, which only communicate via channels). A process table keeps track of the thread count in a process and its number of open ports.

The Eden RTS does not support the migration of threads or processes to other machines during execution, so every thread is located on exactly one machine during its lifetime. The first task for profiling is thus to keep track of the thread state transitions within a process, and of communication between processes.

III. IMPLEMENTATION OF THE EDENTV

As explained in Section I, the typical steps of profiling are trace generation, trace interpretation and trace representation. The EdenTV separates these steps as much as possible, so that single parts can easily be maintained and modified for other purposes. The overall aim of EdenTV is a post-mortem analysis, thus the steps are also temporally separated.

A. First Step: Trace Generation

To profile the execution of an Eden program, we must collect information about the behavior of machines, processes, threads and messages by writing selected events into a trace file. Usually, a trace event is considered as an instance of the execution of a specific statement or instruction in an application, on a specific machine. In our case, an event indicates the creation or a state transition of a unit of computation: machine, process or thread. Figure 2 contains the events selected for the instrumentation, which have self-explanatory names.

Start Machine    End Machine
New Process      Kill Process
New Thread       Kill Thread
Run Thread       Suspend Thread
Block Thread     Deblock Thread
Send Message     Receive Message

Fig. 2. Events to capture during the execution of an Eden program

As already mentioned in Section I, we use the Pablo Toolkit [17] and its Self-Defining Data Format (SDDF) [1] developed at the University of Illinois. The Pablo Trace Capture Library provides an API of tracing methods for Fortran and C that can be called from the target code to generate trace files. The Eden RTS is instrumented with these calls to give us the necessary information about the execution. This is important, because it means that we do not have to change the Eden program itself to obtain the traces. The instrumentation is just a matter of placing library functions properly in the RTS source code.

    #201: "Block Thread" {
        int "timeStamp"[]
        double "Seconds"
        int "Event ID"
        int "Machine Number"
        int "Process ID"
        int "Thread ID"
        int "Inport ID"
        int "Block Reason"
    };;

    #217: "Send Message" {
        int "timeStamp"[]
        double "Seconds"
        int "Event ID"
        int "Machine Number"
        int "Sending Process ID"
        int "Outport ID"
        int "Receiving Machine";
        int "Receiving Process"
        int "Inport ID"
        int "Tag of the message"
    };;

Fig. 3. Examples of SDDF data record structures

The SDDF is a flexible file meta-format that specifies both data record structures and data record instances [1]. This meta-format is not based on predefined layouts and allows new event types to be added easily. Thanks to this, we can define our own event family to meet the requirements of our instrumentation. SDDF events are specified by data record structures, which are written at the beginning of each SDDF file. Data record instances in the file contain the trace event information itself, and identifiers linking them to their corresponding data record structures. This is the same idea that underlies the XML format, but in the case of the SDDF both definition structures and data instances can be found in the same file.

By capturing an event we obtain specific data depending on its type. This data is saved in the trace file as data record instances according to the Pablo SDDF, in ASCII representation (two examples of SDDF data record descriptors are shown in Figure 3). Each event is characterized by an integer EventID appearing at the beginning of each SDDF data record instance. In order to localize the events, SDDF record instances contain a unique tuple with the machine, process and thread numbers in which the event has occurred. A field called Seconds gives the point in time during the execution at which the corresponding event occurred.

Not all events require the same values to be saved in the SDDF records. For example, for the event "Send Message" we should save integers representing the sending and the receiving processes respectively. For the event "Block Thread" (Figure 3), we need different information. Thus, according to the nature of each captured event, a different type of SDDF record instance is written into the trace file.
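To make the difference between the two record types concrete, the fields of Fig. 3 can be mirrored by one class per event on the viewer side. This is a sketch under assumptions: all class and field names are hypothetical, the "timeStamp" array is left out, and using the record tags 201 and 217 as event identifiers is a guess, not something confirmed by the SDDF definitions.

```java
/** Hypothetical base class holding the fields common to both records of Fig. 3. */
abstract class TraceEvent {
    final double seconds;   // "Seconds": time of the event during the execution
    final int machine;      // "Machine Number": machine on which the event occurred

    TraceEvent(double seconds, int machine) {
        this.seconds = seconds;
        this.machine = machine;
    }

    /** Tag of the matching SDDF data record structure (assumed to identify the event type). */
    abstract int recordTag();
}

/** Mirrors the "Block Thread" record: a thread blocks on an inport for some reason. */
final class BlockThreadEvent extends TraceEvent {
    final int process, thread, inport, blockReason;

    BlockThreadEvent(double seconds, int machine, int process, int thread,
                     int inport, int blockReason) {
        super(seconds, machine);
        this.process = process;
        this.thread = thread;
        this.inport = inport;
        this.blockReason = blockReason;
    }

    @Override int recordTag() { return 201; }
}

/** Mirrors the "Send Message" record: sender and receiver are identified, but no thread. */
final class SendMessageEvent extends TraceEvent {
    final int sendingProcess, outport, receivingMachine, receivingProcess, inport, messageTag;

    SendMessageEvent(double seconds, int machine, int sendingProcess, int outport,
                     int receivingMachine, int receivingProcess, int inport, int messageTag) {
        super(seconds, machine);
        this.sendingProcess = sendingProcess;
        this.outport = outport;
        this.receivingMachine = receivingMachine;
        this.receivingProcess = receivingProcess;
        this.inport = inport;
        this.messageTag = messageTag;
    }

    @Override int recordTag() { return 217; }
}
```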

The instrumentation of the RTS has been carried out mostly in a separate module, which contains the overloaded trace functions and initializations. To cover all the events, only four existing source code files of the RTS had to be modified. The instrumented Eden RTS is implemented as an additional compiler way¹. The compiler itself (code generation) is not modified at all. In order to use the profiling version, the program only has to be linked to a different RTS. The Pablo Trace Library is efficiently implemented. Measurements have shown that the performance of Eden programs is hardly affected at all by writing the event files.

¹ Ways are a concept of GHC which allows selecting between different compiler and RTS versions when compiling a program.

B. Second Step: Trace Interpretation

To read and interpret the trace files produced by the instrumentation of the Eden RTS, we model the logical units of the execution by the object hierarchy shown in Figure 4.

[Figure 4: class diagram with the classes EdenObject, TraceData, EdenMachine, EdenProcess, EdenThread and EdenMessage, connected by <<contains>> relations]

Fig. 4. Object hierarchy of the EdenTV

For the EdenTV, logical units are machines, processes, threads and messages, abstracted by the class EdenObject. The four concrete subclasses derived from the class EdenObject are EdenMachine, EdenProcess, EdenThread and EdenMessage. Each class defines fields to identify the objects and saves information from which the activity and behavior of each logical unit can be rebuilt.

The most important class for the reconstruction of the execution is the class TraceData. An instance of this class represents the whole execution of an Eden program, containing arrays of machines, processes, threads and messages. The complete information collected from one trace file is accessible in an object of this class. The TraceData class uses a parser class to obtain the data from the trace file and fills the mentioned arrays from it. We model a class for each event of the instrumentation as well. Since each event needs different fields to be saved, new subclasses of the event class were defined for each different event of Figure 2.

In short, the constructor of the class TraceData creates, updates and arranges the objects of the class EdenObject (machines, processes, threads and messages) from the event objects that the parser class produces while reading the trace file. The creation of an object of the class TraceData from a trace file is the trace interpretation part of the EdenTV. The following and last step is to represent the information collected in the TraceData instance graphically.
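The interpretation step can be pictured as folding a time-ordered event list into per-unit activity intervals. The Java sketch below does this for threads only. It is a simplified illustration with hypothetical names (MiniTraceData, ThreadTraceEvent, StateSegment), not the actual TraceData class, and it assumes that events arrive sorted by time stamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical thread event: time stamp, thread id and event name from Fig. 2. */
final class ThreadTraceEvent {
    final double seconds;
    final int threadId;
    final String name;

    ThreadTraceEvent(double seconds, int threadId, String name) {
        this.seconds = seconds;
        this.threadId = threadId;
        this.name = name;
    }
}

/** One interval during which a thread stayed in a single state. */
final class StateSegment {
    final double start;
    final double end;
    final String state;

    StateSegment(double start, double end, String state) {
        this.start = start;
        this.end = end;
        this.state = state;
    }
}

final class MiniTraceData {
    /** Completed segments per thread id, in order of first appearance. */
    final Map<Integer, List<StateSegment>> segments = new LinkedHashMap<>();

    /** Interprets the events; a segment is closed whenever the next event of its thread arrives. */
    MiniTraceData(List<ThreadTraceEvent> events) {
        Map<Integer, String> currentState = new HashMap<>();
        Map<Integer, Double> stateSince = new HashMap<>();
        for (ThreadTraceEvent e : events) {
            String previous = currentState.get(e.threadId);
            if (previous != null) {
                segments.computeIfAbsent(e.threadId, k -> new ArrayList<>())
                        .add(new StateSegment(stateSince.get(e.threadId), e.seconds, previous));
            }
            String next = stateAfter(e.name);
            if (next == null) {               // "Kill Thread": the thread has finished
                currentState.remove(e.threadId);
                stateSince.remove(e.threadId);
            } else {
                currentState.put(e.threadId, next);
                stateSince.put(e.threadId, e.seconds);
            }
        }
        // Threads that never receive "Kill Thread" keep an open, unrecorded last segment.
    }

    /** State a thread is in after the given event, or null if the thread is finished. */
    private static String stateAfter(String eventName) {
        switch (eventName) {
            case "New Thread":
            case "Suspend Thread":
            case "Deblock Thread": return "Runnable";
            case "Run Thread":     return "Running";
            case "Block Thread":   return "Blocked";
            case "Kill Thread":    return null;
            default: throw new IllegalArgumentException("unknown event: " + eventName);
        }
    }
}
```

Each resulting segment corresponds to one colored piece of a thread bar in the diagrams described next.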

C. Third Step: Trace Representation

In the space-time diagrams generated by the EdenTV, machines, processes and threads are represented by horizontal bars, where the horizontal axis of the diagram is the execution time. Messages between processes or machines are optionally represented by grey arrows which start from the sending unit bar and point at the receiving unit bar. The representation of messages is very important for programmers, since it lets them observe hot spots and inefficiencies in the communication during the execution.

TABLE I
COLOR CODES FOR BAR GRAPHICS IN EDENTV

Color    Machine                          Process                  Thread
Blue     Idle (no processes)              Idle (no threads)        n/a
Yellow   System Time (threads runnable)   Runnable (one thread)    Runnable (this thread)
Green    Running (one process)            Running (one thread)     Running (this thread)
Red      Blocked (all processes)          Blocked (all threads)    Blocked (this thread)

EdenTV produces six diagrams: All Machines, All Processes, All Threads, Processes In Machines, Threads In Processes and Threads In Machines. The diagram Processes In Machines is similar to the diagram All Processes, but the processes are arranged by machines and not by process ID. The same holds for the diagrams Threads In Processes and Threads In Machines, which are like the diagram All Threads, but arranged by processes and machines respectively. For this reason, this section only gives a description of the first three diagrams.

The diagram bars have segments in different colors, which indicate the activities of the respective logical unit in a period during the execution. These color codes are shown in Table I.

The diagram All Machines (Figure 5) shows the activity of all machines that have taken part in the execution. A machine is displayed as idle if it has no process to execute. It is in state system if at least one thread in it is runnable, but no thread is running (garbage collection, receiving messages, other users' processes). The machine is in state running if one thread is running in it. When all threads in the machine are blocked, it is in state blocked.

A similar diagram, All Processes (Figure 6(a)), shows the activity of all processes. The same four colors are used to differentiate the states of the processes. The conditions for the states system, running and blocked are exactly the same for a process as for a machine. A process is in state idle if the process does not contain any threads².

In general, the following relation holds for the thread count inside one process or machine:

0 ≤ Runnable Threads + Blocked Threads ≤ Total Threads

Color codes for processes and machines are assigned by comparing these numbers. The first condition checked by the EdenTV is whether the machine or process is in state idle (Total Processes or Total Threads, respectively, equals 0). A strict inequality on the right implies that a thread is running. Otherwise the state is either runnable or blocked, depending on the number of runnable threads (the state is runnable if Runnable Threads > 0). This context information for all units is the basis of the graphical representations.
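The order of these checks can be sketched as a single function. The names below are hypothetical and not taken from the EdenTV sources. For machines, the idle test uses the process count instead of the thread count, and the runnable case is displayed as "System Time" (Table I); the sketch covers the process case.

```java
/** Display states of a process bar, following Table I (hypothetical names). */
enum UnitState { IDLE, RUNNING, RUNNABLE, BLOCKED }

final class UnitStateClassifier {
    /**
     * Classifies a process from its thread counts, in the order described in the text:
     * idle first, then running, then runnable or blocked.
     */
    static UnitState classify(int totalThreads, int runnableThreads, int blockedThreads) {
        if (runnableThreads < 0 || blockedThreads < 0
                || runnableThreads + blockedThreads > totalThreads) {
            throw new IllegalArgumentException("counts violate 0 <= runnable + blocked <= total");
        }
        if (totalThreads == 0) {
            return UnitState.IDLE;                       // blue
        }
        if (runnableThreads + blockedThreads < totalThreads) {
            return UnitState.RUNNING;                    // green: some thread is neither runnable nor blocked
        }
        return runnableThreads > 0
                ? UnitState.RUNNABLE                     // yellow
                : UnitState.BLOCKED;                     // red: every thread is blocked
    }
}
```

For example, a process with three threads, one of them runnable and two blocked, is classified as runnable, because no thread is currently running.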

For the diagram All Threads (Figure 6(b)) the color coding is similar, but the colors of the bar fragments correspond to the state of the thread itself. Note that a thread cannot be in state idle.

² An idle process normally does not exist conceptually, but it does exist in the implementation, so it was modelled in the representation as well.

Fig. 5. All Machines graphic

Fig. 6. Diagrams of EdenTV: (a) All Processes graphic, (b) All Threads graphic

To implement the diagram construction, a class GraphicEdenPanel was defined. The instances of this class contain a TraceData instance which is processed to generate a diagram. Since each diagram is generated independently, six subclasses of GraphicEdenPanel were created for the six different diagrams.

IV. USING EDENTV

A. Functionality and Mode of Use

When a trace file generated by the instrumented RTS is opened, the gathered information is presented in the bar diagrams shown in the previous section as well as in textual form. The text field on the right gives detailed statistics about the overall computation and details for each unit (machine, process or thread) shown in the graphic representation.

The bar diagrams presented by EdenTV can be zoomed in order to get a closer view of the activities at critical points during the execution. Furthermore, the user can choose to see the messages sent and received by each machine or process during the execution of an Eden program. Messages are not shown by default, because showing the messages often hides other details in the diagrams, and because of the resource consumption in the viewer when handling bigger trace files. Additionally, an interactive mode provides extra information when a bar in the diagram is clicked. This viewer mode is resource-consuming as well and therefore turned off by default. It provides the information for the event immediately preceding the point in time which has been clicked, i.e. the cause of the current state of the clicked unit.

Fig. 7. Zoomed View with Messages and Interactive mode.
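Finding the event that precedes a clicked point in time amounts to a floor lookup in a time-ordered structure. The following Java sketch shows one possible way to do this with java.util.TreeMap; the class EventTimeline and its methods are hypothetical, and nothing is implied about how EdenTV stores its events internally. The sketch also assumes at most one event per time stamp for a unit.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-unit timeline mapping event times (in seconds) to event names. */
final class EventTimeline {
    private final TreeMap<Double, String> events = new TreeMap<>();

    /** Records an event; a later event with the same time stamp replaces the earlier one. */
    void record(double seconds, String eventName) {
        events.put(seconds, eventName);
    }

    /**
     * Returns the name of the last event at or before the clicked time,
     * or null if the click lies before the first event of this unit.
     */
    String causeAt(double clickedSeconds) {
        Map.Entry<Double, String> entry = events.floorEntry(clickedSeconds);
        return entry == null ? null : entry.getValue();
    }
}
```

A click at 2.0 seconds on a thread that was blocked at 1.2 seconds and deblocked at 3.4 seconds would then report "Block Thread" as the cause of the displayed state.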

The described options are situated on the lower right of the panel below the text field. Some other options only control the appearance, such as the separation between the displayed bars and the bar style. A more important feature is the ability to leave out units from the display if their lifetime was less than an adjustable minimum (Minimum Time).

B. Examples: Tuning Eden Programs with EdenTV

1) Warshall: Lazy Evaluation vs. Parallelism: Using a lazy functional computation language, a crucial issue, for Eden as well as for other Haskell-based parallelism, is to start the evaluation of needed subexpressions early enough and to fully evaluate them for later use. While this has led to Eden's eager communication in the language design described earlier, evaluation strategies [19] are another well-established means of starting parallel evaluation.

However, the basic choice to either evaluate a final result to WHNF or completely (NF) sometimes does not offer enough control to optimize an algorithm; strategies must then be applied to certain subresults. On the sparse basis of runtime measurements, such an optimization would be rather cumbersome. The EdenTV, accompanied by code inspection, makes such inefficiencies obvious, as in the following example.

Example: The program measured here is a parallel implementation of Warshall's algorithm to compute shortest paths for all nodes of a graph from the adjacency matrix (adapted from [14]). The minimum distances from one node to every other are continuously updated and successively passed through a ring structure linking all processes, starting with distances from the first node. Implementing this algorithm is very simple, given a suitable ring skeleton. The task reduces to defining the function each ring process should apply to its input from the parent and from the ring neighbour.

The trace visualizations in Fig. 8 show the All Machines view of the EdenTV for two different Warshall programs on 16 processors of a Beowulf cluster, the input graph consisting of 500 nodes.

The first, straightforward version of the program shows bad performance, due to the inherent data dependence in the algorithm. The trace viewer makes it obvious that the first phase of the algorithm runs on several machines, but completely sequentially. Only the second phase of the algorithm really runs in parallel on all machines. The length of this second phase depends on the position in the ring: the first ring process (which started working first) must do the most work in the end, thereby dominating runtime.

Viewing the communication, one can see that the nodes communicate as intended, but do not process the received data. The reason for this bad performance is demand-driven evaluation: Data received from a ring neighbour is passed to the next node unchanged, until the node sends its own rows, which are updated with all ring data received before. Only when the node's updated own rows are passed into the ring does their evaluation begin. Before this inherent demand, the processes only accumulate data from the ring communication, but do not proceed with updating their rows with the distances received from the ring neighbour. Inserting a normal form evaluation strategy (rnf) for an intermediate result into the respective node function dramatically improves runtime. We still see the impact of the data dependency, leading to a short wait phase passing through the ring, but the optimized version shows good speedup and load balance. A crucial issue for such optimizations is, however, to identify the displayed units from their behaviour, without a link to the program's source code.

Runtimes: 154.2 sec. (above) and 40.7 sec. (below)

Fig. 8. Warshall algorithm, first (above) and optimized version (below) (Beowulf Cluster, Heriot Watt University, Edinburgh, 16 machines)

2) Mandelbrot: Irregularity and Cost of Load Balancing: Parallel skeletons (higher-order functions implementing common patterns of parallel computation [2], [16], [8]) provide an easy parallelization of existing programs by replacing common higher-order functions by parallel equivalents. A prime example is the well-known higher-order function map, which applies a function to all elements of a list, and can be parallelized in different ways. In the case of map, problems may arise when the tasks to solve exhibit irregular complexity. Irregularity can lead to uneven load distribution when tasks are statically distributed, and the machine with the most complex tasks will dominate the runtime of a parallel computation.

By using a feedback from process results to process inputs,work can also be distributed dynamically in a parallelmapimplementation, thereby adapting to the current load andspeed of the different machines. But the dynamic distributionnecessarily has more overhead than a static one.

Example: We compare these different work distributionschemes using a program which generates Mandelbrot Setvisualizations by applying a uniform, but potentially irregularcomputation to a set of coordinates. The underlying computa-tion scheme,map, is parallelized by two skeletons presentedin [8].

1) Farm: the tasks are statically distributed to a set of workers, one for each available machine. Each worker solves a fixed set of tasks.

2) Workpool: the main machine distributes only a few initial tasks among the others (worker machines). Each time a worker has finished one task, the main machine sends a new one, i.e. subsequent tasks are distributed dynamically on request.
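The difference between the two distribution schemes can be made concrete with a small cost model. The Python sketch below is an illustration under stated assumptions, not the Eden skeletons from [8]: the function names are hypothetical, the farm is assumed to distribute tasks round-robin, communication is assumed to be free, and the task costs are made-up numbers.

```python
import heapq


def farm_makespan(costs, workers):
    """Static distribution: worker i gets tasks i, i + workers, i + 2*workers, ..."""
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)


def workpool_makespan(costs, workers):
    """Dynamic distribution: the next task goes to the worker that is idle first.

    Communication cost is ignored in this idealized model.
    """
    free_at = [0.0] * workers
    heapq.heapify(free_at)
    for cost in costs:
        idle_time = heapq.heappop(free_at)
        heapq.heappush(free_at, idle_time + cost)
    return max(free_at)


# Irregular task costs: two expensive tasks among cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
static_time = farm_makespan(costs, workers=4)
dynamic_time = workpool_makespan(costs, workers=4)
```

With these costs both expensive tasks land on the same worker in the static scheme, whereas the dynamic scheme spreads them. The model deliberately leaves out the overhead of the dynamic scheme, which is what the measurements below expose.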

From previous results in older measurements ([8]), we could expect that the dynamic distribution in the workpool skeleton

a.- Farm Skeleton (Runtime: 8.6 sec.)

b.- Workpool Skeleton (Runtime: 11.5 sec.)

Fig. 9. All Processes diagrams for the Mandelbrot program (Beowulf Cluster, Heriot-Watt University, Edinburgh, 16 machines)

performs better than a static distribution, since the tasks have highly irregular complexity. On the other hand, the dynamic load balancing requires additional, time-critical communication and prevents some optimization in the communication system.

The program was executed on 16 machines of a Beowulf cluster, with a problem size of 1000×1000 pixels. As we can see in the diagrams shown in Fig. 9, the workpool skeleton does not perform as well as expected. Runtime is longer for the workpool skeleton, where EdenTV reveals that the master node is a serious bottleneck: it receives requests from all other 15 machines and thus spends much time receiving messages (system time). One of the worker machines seems to work slower than the others, most likely due to other tasks on the respective machine. The other machines have spent a lot of time in blocked state, waiting for new work assigned by the heavily loaded master node. We see that all machines stop at the same time, i.e. the system is well synchronized, but has a severe bottleneck.
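How a single master can turn into a bottleneck is easy to reproduce in a toy model. The following Python sketch rests on assumptions that are not taken from the measurements: the master handles one request at a time, every request costs a fixed `master_cost`, and the task costs are invented. The function name `simulate_workpool` is hypothetical.

```python
import heapq


def simulate_workpool(costs, workers, master_cost):
    """Return (makespan, total time workers spent blocked waiting for work).

    Assumed model: an idle worker asks the master for a task; the master
    serves requests strictly one after another, each taking `master_cost`.
    """
    idle_since = [(0.0, w) for w in range(workers)]
    heapq.heapify(idle_since)
    master_free = 0.0
    blocked = 0.0
    finish = 0.0
    for cost in costs:
        asked, worker = heapq.heappop(idle_since)
        # The worker waits until the master is free and has handled the request.
        served = max(asked, master_free) + master_cost
        blocked += served - asked
        master_free = served
        done = served + cost
        finish = max(finish, done)
        heapq.heappush(idle_since, (done, worker))
    return finish, blocked


costs = [8, 1, 1, 1, 8, 1, 1, 1]
ideal = simulate_workpool(costs, workers=4, master_cost=0.0)
loaded = simulate_workpool(costs, workers=4, master_cost=2.0)
```

In this model a busy master doubles the overall runtime and most of the added time shows up as blocked time on the workers, which is the qualitative pattern visible in the workpool trace.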

The trace of the farm skeleton shows an obvious load imbalance among the workers. The main machine (bottom bar) merges all results at the end of the computation, leading to a longer sequential end phase. Both versions are far below the theoretical optimum of using all processors in parallel all of the time.

The analysis using the EdenTV has revealed why the Farm skeleton has better performance than the Workpool skeleton for this program and the problem size investigated here. By increasing the number of tasks a worker receives initially, we gain only slightly better performance. Increasing granularity by combining several tasks would lead to fewer tasks in total and contradicts the intended load balancing properties. A better way is to parallelize the work distribution as well, and to use more than one main machine in a “hierarchy of workpools”.

V. RELATED WORK

The EdenTV follows a rather general concept of tracing tools: trace information is gathered during execution, saved in files, and afterwards processed by a tool which is aware of specific language concepts and the relationship between units of computation.
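The interpretation step of this general scheme can be sketched in a few lines. The Python example below is a minimal illustration, not EdenTV code (which is written in Java) and not the SDDF format: the event layout `(timestamp, machine, new_state)`, the state names and the function `build_timelines` are assumptions made for the sketch.

```python
from collections import defaultdict


def build_timelines(events, end_time):
    """Turn state-change events into per-machine intervals (start, end, state).

    `events` is an iterable of (timestamp, machine, new_state) tuples in
    any order; `end_time` closes the last interval of every machine.
    """
    timelines = defaultdict(list)
    current = {}  # machine -> (start of current state, state)
    for timestamp, machine, state in sorted(events):
        if machine in current:
            start, old_state = current[machine]
            timelines[machine].append((start, timestamp, old_state))
        current[machine] = (timestamp, state)
    for machine, (start, state) in current.items():
        timelines[machine].append((start, end_time, state))
    return dict(timelines)


events = [
    (0.0, 1, "running"),
    (2.0, 1, "blocked"),
    (0.5, 2, "running"),
    (3.0, 1, "running"),
]
timelines = build_timelines(events, end_time=4.0)
```

Each interval list corresponds to one horizontal bar of a Gantt-style diagram, with one coloured segment per state.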

A rather simple (yet insufficient) way to obtain information about the program’s parallelism would be to trace the behaviour of the communication subsystem. The Eden RTS currently uses PVM, which comes with a built-in trace capability (using Pablo SDDF as well). Tracing PVM-specific and user-defined actions would be possible and visualisation could be carried out by XPVM [9]. But the weak points of such an off-the-shelf solution are obvious: PVM tracing yields only information about spawned subtasks and concrete messages between the machines. Internal buffering and threads in the RTS would remain invisible, unless user-defined events are used. We therefore opted not to bind ourselves to PVM tracing. Switching to a different message passing system would otherwise be very hard, although one can expect every suitable system to come with comparable features.

Comparable to the tracing included in XPVM, many efforts have been made in the past to create standard tools for trace analysis and representation (e.g. Pablo Analysis GUI [17] or the ParaGraph Suite [6], to mention only two essential results). These tools have interesting features EdenTV does not yet include, like stream-based online trace analysis, execution replay, and a wide range of standard diagrams. But the aim of EdenTV was a specific visualisation of logical Eden units, which needed a more customized solution.

The more closely related tracing concepts for parallel functional languages have collectively influenced the EdenTV in tracing and representation.

Using the trace format and tools of the Granularity Simulator GranSim [10], one can not only simulate programs in the parallel functional language GpH, but also obtain information about real parallel program executions. Due to the different language concept, GranSim does not trace any kind of communication and offers bird’s-eye views of the execution rather than concentrating on the single computation units. The EdenTV diagrams are inspired by the per-processor view of GranSim, commonly known as Gantt diagrams, or space-time diagrams when combined with message display [3].

The Eden derivative of GranSim, Paradise [7], was a pure simulator, but it offers an interesting feature not yet present in EdenTV: respecting the explicit coordination concept of Eden, Paradise offers to label instantiated processes in the visualisation, thereby linking the trace information to the program source code.

Last but not least, the direct predecessor of EdenTV was the work by Ralf Freitag [4], which was entirely based on the Pablo Analysis GUI (not available any more).

VI. CONCLUSION AND FUTURE WORK

The Eden Trace Viewer combines an instrumented runtime system, which generates trace information, with an interactive GUI, which interprets this information and represents it in interactive diagrams. These three steps are common to many profiling solutions and are separated as much as possible in the EdenTV implementation. Its usefulness for the analysis and optimization of parallel program performance is due to its view of concrete parallel executions. All important aspects of parallel execution are shown in the diagrams, which can be explored in great detail.

Specific language concepts of the parallel functional language Eden are present in all parts of the tool, but the described concepts of design and implementation can be applied to target other parallel languages as well. The code of the EdenTV is written in Java and can be easily extended to support a larger set of events and to generate new diagram types. All languages using logical units like processes, threads and messages in a way similar to Eden can be analyzed by such a tool. With minor extensions to the instrumentation carried out for Eden, trace files for GpH programs could be generated in SDDF by the shared RTS. As the instrumentation is cleanly separated from other parts and uses a well-established library, only little work should be necessary for this task.

Once the trace files contain new event types, appropriate new objects can be introduced into the code of the EdenTV to parse, interpret and represent these new files. Logical units other than Eden units can be represented in new diagrams. [18] explains in detail where such modifications would be located, as well as reasonable optimizations, which could not be integrated in the code due to time limitations. As another extension, the EdenTV could show an analysis while the program still runs, using stream-based trace file processing.

Acknowledgements. The authors thank Rita Loogen for her helpful comments improving this paper, and the colleagues from Heriot-Watt University, Edinburgh, for the opportunity to work with their Beowulf cluster.

REFERENCES

[1] R. A. Aydt. The Pablo Self-Defining Data Format. Technical report, University of Illinois, Department of Computer Science, 1992. http://www-pablo.cs.uiuc.edu.

[2] M. I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Research Monographs in Parallel and Distributed Computing. The MIT Press, Cambridge, MA, 1989.

[3] I. Foster. Designing and Building Parallel Programs. Addison-Wesley, 1995. http://www.mcs.anl.gov/dbpp/.

[4] R. Freitag. Entwicklung eines Werkzeugs zur Analyse des Laufzeitverhaltens paralleler funktionaler Programme. Master’s thesis, Philipps-Universität Marburg, Germany, 1999. In German.

[5] K. Hammond and G. Michaelson, editors. Research Directions in Parallel Functional Programming. Springer-Verlag, 1999.

[6] M. T. Heath and J. A. Etheridge. Visualizing the performance of parallel programs. IEEE Software, 8(5), 1991.

[7] F. Hernandez, R. Pena, and F. Rubio. From GranSim to Paradise. In Trends in Functional Programming 1 (SFP’99). Intellect, 2000.

[8] U. Klusik, R. Loogen, S. Priebe, and F. Rubio. Implementation Skeletons in Eden: Low-Effort Parallel Programming. In Implementation of Functional Languages: IFL’00, volume 2011 of LNCS, Aachen, Germany, 2000. Springer.

[9] J. A. Kohl and G. A. Geist. The PVM 3.4 tracing facility and XPVM 1.1. In Proceedings of HICSS-29, pages 290–299. IEEE Computer Society Press, 1996.

[10] H. W. Loidl. GranSim User’s Guide. Technical report, University of Glasgow, Department of Computer Science, 1996.

[11] R. Loogen, Y. Ortega, and R. Pena. Parallel Functional Programming in Eden. Journal of Functional Programming, 2005. To appear.

[12] S. Peyton Jones, C. Hall, K. Hammond, W. Partain, and P. Wadler. The Glasgow Haskell Compiler: a Technical Overview. In JFIT’93, March 1993.

[13] S. Peyton Jones and J. Hughes. Haskell 98: A Non-strict, Purely Functional Language, 1999. Available at http://www.haskell.org/.

[14] M. Plasmeijer and M. van Eekelen. Functional Programming and Parallel Graph Rewriting. Addison-Wesley, Reading, Massachusetts, USA, 1993.

[15] Parallel Virtual Machine Reference Manual, Version 3.2. University of Tennessee, August 1993.

[16] F. A. Rabhi and S. Gorlatch, editors. Patterns and Skeletons for Parallel and Distributed Computing. Springer, 2002.

[17] D. A. Reed, R. D. Olson, R. A. Aydt, T. Madhyastha, T. Birkett, J. Jensen, B. A. A. Nazief, and B. K. Totty. Scalable performance environments for parallel systems. Technical Report 1673, Dep. of Computer Sc., University of Illinois, Illinois, 1991.

[18] P. Roldan. Eden Trace Viewer: Ein Werkzeug zur Visualisierung paralleler funktionaler Programme. Master’s thesis, Philipps-Universität Marburg, Germany, 2004. In German.

[19] P. Trinder, K. Hammond, H.-W. Loidl, and S. Peyton Jones. Algorithm + Strategy = Parallelism. J. of Functional Programming, 8(1), January 1998.

[20] P. W. Trinder, K. Hammond, J. S. Mattson Jr., A. S. Partridge, and S. L. Peyton Jones. GUM: A portable parallel implementation of Haskell. In PLDI, 1996.