
Synchronizing the timestamps of concurrent events in traces of hybrid MPI/OpenMP applications

Daniel Becker a1,b,c, Markus Geimer c2, Rolf Rabenseifner d3, and Felix Wolf a4,b,c

a German Research School for Simulation Sciences, Laboratory for Parallel Programming, 52062 Aachen, Germany
  {1d.becker, 4f.wolf}@grs-sim.de
b RWTH Aachen University, Department of Computer Science, 52056 Aachen, Germany
c Forschungszentrum Jülich, Jülich Supercomputing Centre, 52428 Jülich, Germany
  2 [email protected]
d University of Stuttgart, High Performance Computing Center, 70550 Stuttgart, Germany
  3 [email protected]

Abstract—Event traces are helpful in understanding the performance behavior of parallel applications since they allow the in-depth analysis of communication and synchronization patterns. However, the absence of synchronized clocks on most cluster systems may render the analysis ineffective because inaccurate relative event timings may misrepresent the logical event order and lead to errors when quantifying the impact of certain behaviors or confuse the users of time-line visualization tools by showing messages flowing backward in time. In our earlier work, we have developed a scalable algorithm that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications. Since hybrid programming, the combination of MPI and OpenMP in a single application, is becoming more popular on clusters in response to rising numbers of cores per chip and widening shared-memory nodes, we present an extended version of the algorithm that in addition to message-passing event semantics also preserves and restores shared-memory event semantics.

I. INTRODUCTION

Due to the availability of inexpensive commodity components produced in large quantities, clusters now represent the majority of parallel computing systems, exhibiting a vast diversity in terms of architecture, interconnect technology, and software environment. As a common trend that can be observed in response to the proliferation of multicore processors with their rising numbers of cores per chip, the shared-memory nodes most clusters are composed of are becoming much wider. At the same time, the memory-per-core ratio is expected to shrink in the long run. To utilize the available memory more efficiently, many code developers now resort to using OpenMP for node-internal work sharing, while employing MPI for parallelism among different nodes. This has the advantage that (i) the extra memory needed to maintain separate private address spaces (e.g., for ghost cells or communication buffers) is no longer needed, (ii) the effort to copy data between these address spaces can be reduced, and (iii) the number of external MPI links per node can be kept at a minimum to improve scalability. The use of such inherently different programming models in a complementary manner is usually referred to as hybridization. While potentially improving efficiency and scalability, hybridization usually comes at the price of increased programming complexity. To ameliorate unfavorable effects of hybrid parallelization on programmer productivity, developers therefore depend even more on powerful and robust software tools that help them find errors in and tune the performance of their codes.

One technique widely used by cluster tools is event tracing, with a broad spectrum of applications ranging from performance analysis [1], performance prediction [2] and modeling [3] to debugging [4]. Recording time-stamped runtime events in event traces is especially helpful for understanding the parallel performance behavior, because it enables the postmortem analysis of communication and synchronization patterns. For instance, time-line browsers such as Vampir [1] allow these patterns to be visually explored, while the Scalasca toolset [5] scans traces automatically for wait states that occur when processes or threads fail to reach synchronization points in a timely manner, for example, as a result of unevenly distributed workloads. Usually, events are recorded along with the time of their occurrence to measure the temporal distance between them and/or to establish a total event order. Obviously, measuring the time between concurrent events necessitates either a global clock or well-synchronized processor-local clocks. While some custom-built clusters such as IBM Blue Gene offer sufficiently accurate global clocks, most commodity clusters provide only processor-local clocks that are either entirely non-synchronized or synchronized only within disjoint partitions (e.g., an SMP node). Moreover, external software synchronization via NTP [6] is usually not accurate enough for the purpose of event tracing. Assuming that potentially different drifts of local clocks remain constant over time, linear offset interpolation can be applied to map local onto global timestamps. However, given that in reality the drift of realistic clocks is usually time dependent, the error of timestamps derived in this way can easily lead to a misrepresentation of the logical event order imposed by the semantics of the underlying communication substrate [7]. This may lead to errors when quantifying the impact of certain behaviors or confuse the users of time-line visualization tools by showing messages flowing backward in time.


In our earlier work [8], we have introduced a scalable algorithm for synchronizing timestamps that eliminates inconsistent inter-process timings postmortem in traces of pure MPI applications. This algorithm, the most recent version of the controlled logical clock (CLC) [9], restores the consistency of inter-process event timings based on happened-before relations imposed by point-to-point and collective MPI event semantics. Scalability is ensured by performing the corrections for individual processes in parallel while replaying the original communication recorded in the trace. However, the algorithm is unsuitable for hybrid applications because it neither restores nor preserves happened-before relations in shared-memory event semantics. Most important, the restoration of MPI event semantics may introduce violations of OpenMP event semantics even though those were not violated in the original trace.

To remove these limitations, we describe in this paper how the controlled logical clock is extended to correctly handle traces from hybrid applications. Our work makes the following specific contributions:

• Identification of happened-before relations in OpenMP constructs and library calls as well as their integration into the existing algorithmic framework
• Extension of the parallel replay mechanism so that it can replay traces from hybrid codes
• Integration of the enhanced algorithm and replay engine into the Scalasca performance analysis software

The remainder of this article is organized as follows: After reviewing related work in Section II, we introduce the pure MPI version of the CLC algorithm and describe its limitations in Section III. In Section IV, we present the extensions necessary to correctly synchronize the timestamps of events that occur during the execution of OpenMP constructs. Then, we describe the hybrid parallelization of the extended algorithm and its implementation within Scalasca in Section V. In Section VI, we evaluate the new scheme with respect to its accuracy, showing that the collaterally introduced deviations of local interval lengths remain within acceptable limits, and with respect to its scalability. Finally, in Section VII, we summarize our results and outline future work.

II. RELATED WORK

In this section, we review several approaches for avoiding or correcting inconsistent timestamps, most of them usually applied postmortem. Network-based synchronization protocols aim at synchronizing distributed clocks before reading them. The clocks query the global time from reference clocks, which are often organized in a hierarchy of servers. For instance, NTP [6] uses widely accessible and already synchronized primary time servers. Secondary time servers and clients can query time information via both private networks and the Internet. To reduce network traffic, the time servers are accessed only at regular intervals to adjust the local clock. Jumps are avoided by changing the drift while leaving the actual time unmodified. Unfortunately, varying network latencies limit the accuracy of NTP to about one millisecond, compared to the few microseconds required to guarantee the correct total event order of event traces taken on clusters equipped with modern interconnect technology.

Time differences among distributed clocks can be characterized in terms of their relative offset and drift (i.e., the rate at which the offset changes over time). In a simple model assuming different but constant drifts, the global time can be established by measuring offsets to a designated master clock using Cristian’s probabilistic remote clock reading technique [10]. After estimating the drift, the local time can be mapped onto the global (i.e., master) time via linear interpolation. Offset values among participating clocks are measured either at program initialization [11], [12] or at initialization and finalization [13], and are subsequently used as parameters of the linear correction function. So as not to perturb the program, offset measurements in between are usually avoided, although a recent approach proposes periodic offset measurements during global synchronization operations while limiting the effort required in each step by resorting to indirect measurements across several hops [14]. While linear offset interpolation might prove satisfactory for short runs (or interpolation intervals), measurement errors and time-dependent drifts may create inaccuracies and violated happened-before relations during longer runs [7]. Additionally, repeated drift adjustments caused by NTP may impede linear interpolation, as they deliberately introduce non-constant drifts.
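
As an illustration of this two-point linear correction, the following minimal sketch maps a local timestamp onto the master time line from two offset measurements taken at initialization and finalization; all names and values are illustrative and not part of any tool described here.

// Two-point linear offset interpolation (illustrative sketch).
#include <cstdio>

struct OffsetMeasurement {
    double local_time;  // local clock value at the moment of the measurement
    double offset;      // master_time - local_time at that moment
};

// Map a local timestamp onto the interpolated master (global) time line,
// assuming the drift is constant between the two measurements.
double to_global_time(double t, OffsetMeasurement init, OffsetMeasurement fini) {
    double drift = (fini.offset - init.offset) /
                   (fini.local_time - init.local_time);
    return t + init.offset + drift * (t - init.local_time);
}

int main() {
    OffsetMeasurement init{0.0,   2.0e-3};  // 2.0 ms ahead at initialization
    OffsetMeasurement fini{600.0, 2.6e-3};  // 2.6 ms ahead at finalization
    std::printf("global(300.0) = %.6f\n", to_global_time(300.0, init, fini));
    return 0;
}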

If linear interpolation alone turns out to be inadequate to achieve the desired level of accuracy, error estimation allows the retroactive correction of clock values in event traces after assessing synchronization errors among all distributed clock pairs. First, difference functions among clock values are calculated from the differences between clock values of receive events and clock values of send events (plus the minimum message latency). Second, because for each clock pair two difference functions exist, a medial smoothing function can be found and used to correct local clock values. Regression analysis and convex hull algorithms have been proposed by Duda [15] to determine the smoothing function. Using a minimal spanning tree algorithm, Jézéquel [16] adapted Duda’s algorithm for arbitrary processor topologies. In addition, Hofmann [17] improved Duda’s algorithm using a simple minimum/maximum strategy and further proposed that the execution time should be divided into several intervals to compensate for different clock drifts in long-running applications. Using a graph-theory algorithm to calculate the shortest paths, Hofmann and Hilgers [18] simplified Jézéquel’s algorithm for handling multiprocessor topologies. Biberstein et al. [19] rewrote Hofmann’s and Hilgers’s algorithm for use on the Cell BE architecture using a short and intelligible notation. Their version solves the clock-condition problem only for short intervals (i.e., without splitting into sub-intervals for handling a non-linear drift of the physical clocks). Babaoglu and Drummond [20], [21] have shown that clock synchronization is possible at minimal cost if the application makes a full message exchange between all processors at sufficiently short intervals. However, jitter in message latency, nonlinear relations between message latency and message length, and one-sided communication topologies limit the usefulness of error estimation approaches.

In contrast, logical synchronization uses happened-before relations among send and receive pairs to synchronize distributed clocks. Lamport introduced a discrete logical clock [22], with each clock being represented by a monotonically increasing software counter. As local clocks are incremented after every local event and the updated values are exchanged at synchronization points, happened-before relations can be exploited to further validate and synchronize distributed clocks. If a receive event appears before its corresponding send event, that is, if a clock condition violation occurs, the receive event is shifted forward in time according to the clock value exchanged. As an enhancement of Lamport’s discrete logical clock, Fidge [23], [24] and Mattern [25] proposed a vector clock. In their scheme, each processor maintains a vector representing all processor-local clocks. While the local clock is advanced after each local event as before, the vector is updated after receiving a message using an element-wise maximum operation between the local vector and the remote vector that has been sent along with the message.
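
The following sketch restates the two update rules in code form (illustrative only): a scalar Lamport counter, and a vector clock that is merged with an element-wise maximum on receive.

// Logical-clock update rules (illustrative sketch).
#include <algorithm>
#include <cstddef>
#include <vector>

struct LamportClock {
    long c = 0;
    void local_event()       { ++c; }
    long send()              { ++c; return c; }            // value piggybacked on the message
    void receive(long c_msg) { c = std::max(c, c_msg) + 1; }
};

struct VectorClock {
    std::vector<long> v;   // one entry per process
    std::size_t me;        // own process index
    VectorClock(std::size_t nprocs, std::size_t rank) : v(nprocs, 0), me(rank) {}
    void local_event()       { ++v[me]; }
    std::vector<long> send() { ++v[me]; return v; }         // vector sent along with the message
    void receive(const std::vector<long>& remote) {
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = std::max(v[i], remote[i]);               // element-wise maximum
        ++v[me];
    }
};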

Finally, Rabenseifner’s controlled logical clock (CLC) algorithm [9], [26], recently extended and parallelized by Becker et al. [8], retroactively corrects clock condition violations in event traces of message-passing applications by shifting message events in time while trying to preserve the length of intervals between local events. The algorithm restores the clock condition using happened-before relations derived from both point-to-point and collective MPI event semantics. Starting from the parallel MPI version of the algorithm, this paper describes its hybridization, retaining its good accuracy and scalability characteristics.

III. CONTROLLED LOGICAL CLOCK

Because we focus on a timestamp synchronization method to be used within the Scalasca trace-analysis framework, we first describe the Scalasca event model as the algorithm’s foundation. The information Scalasca records for an individual event includes at least the timestamp, the location (e.g., the process or thread) causing the event, and the event type. Depending on the type, additional information may be supplied. The event model distinguishes between programming-model independent events, such as entering and exiting code regions, and events related to MPI and OpenMP operations. MPI-related events include events representing point-to-point operations, such as sending and receiving messages, and an event representing the completion of collective MPI operations. OpenMP-related events, which are fashioned according to the POMP event model [27], include events that represent the creation and termination of a team of threads, leaving a parallel or barrier region, and acquiring or releasing lock variables. A fork event indicates that the master thread creates a team of threads (i.e., workers), and a join event indicates that the team of threads is terminated. In addition, the collective OpenMP exit event indicates that the program leaves either a parallel or a barrier region. Furthermore, a lock-acquisition event indicates that a lock variable is set, whereas a lock-release event indicates that this variable is unset. Nested parallelism and tasking are not yet supported in POMP, although an extension for tasking is already in preparation [28]. Event sequences recorded for typical MPI and OpenMP operations are given in Table I.

TABLE I
EVENT SEQUENCES RECORDED FOR TYPICAL MPI AND OPENMP OPERATIONS.

Operation name                  Event sequence
MPI
  MPI_Send()                    (enter, send, exit)
  MPI_Recv()                    (enter, receive, exit)
  MPI_Allreduce()               (enter, MPI collective exit)
                                for each participating process
OpenMP
  parallel construct            (fork, enter, OpenMP collective exit, join)
                                for the participating master thread;
                                (enter, OpenMP collective exit)
                                for each participating worker thread
  implicit and explicit barrier (enter, OpenMP collective exit)
                                for each participating thread
  omp_set_lock                  (enter, lock-acquisition, exit)
  omp_unset_lock                (enter, lock-release, exit)
  critical construct            (lock-acquisition, enter, exit, lock-release)
  atomic construct              (enter, exit)

In preparation for its hybridization, we now briefly recapitulate the basic principles of the CLC algorithm in the remainder of this section, where we explain how it is currently used to synchronize the timestamps of pure MPI applications.
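
To make the event model above more concrete, the following minimal sketch shows one way such event records could be represented; the type and field names are illustrative and do not reflect Scalasca’s actual data structures.

// Illustrative representation of time-stamped trace events.
#include <cstdint>
#include <vector>

enum class EventType {
    Enter, Exit,                        // programming-model independent
    Send, Receive, MpiCollectiveExit,   // MPI-related
    Fork, Join, OmpCollectiveExit,      // OpenMP-related (POMP event model)
    LockAcquisition, LockRelease
};

struct Event {
    double        timestamp;  // local clock value at the time of the event
    std::uint32_t process;    // MPI rank on which the event occurred
    std::uint32_t thread;     // OpenMP thread within that rank
    EventType     type;
    // type-specific attributes (communicator, tag, region, ...) omitted
};

using LocalTrace = std::vector<Event>;  // events of one thread, in trace order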

In general, clock errors may have both quantitative and qualitative effects. The first category includes changing the absolute position of an event in the trace or the distance between two consecutive events. As shown in a recent study [7], effects of the second category, which manifest as a change of the logical event order, are also very common as soon as an application is traced for more than a few minutes. If an event e happened before another event e′, the happened-before relation e → e′ between both events requires that their respective timestamps C(e) and C(e′) satisfy the clock condition [22]:

∀ e, e′ : e → e′ =⇒ C(e) < C(e′).    (1)

The equation given above can be refined by requiring a temporal minimum distance (i.e., latency) between the two events – if its amount is known. While the errors of single timestamps are hard to assess, obvious violations of the clock condition between events with a logical happened-before relation, such as sending and receiving a message, can be easily detected and offer a toehold to increase the fidelity of inter-process timings. If the clock condition is violated for a send-receive event pair, the receive event is corrected (i.e., moved forward in time). To preserve the length of intervals between local events, events following or immediately preceding the corrected event are also adjusted. These adjustments are called forward and backward amortization, respectively.
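
One way to state this refinement for a pair of events with a known minimum latency, writing l_min for that latency, is the following sketch consistent with Equation (1); the notation is ours, and the paper introduces this quantity only later in the Figure 1 discussion:

\forall\, e, e' : e \rightarrow e' \;\Longrightarrow\; C(e') \geq C(e) + l_{\min}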

[Fig. 1. Backward and forward amortization in the controlled logical clock algorithm. (a) Inconsistent event trace: clock condition violation in a point-to-point communication pair. (b) Locally corrected event trace: the timestamp of the violating receive event is advanced to restore the clock condition. (c) Forward-amortized event trace: event E4 following the receive event is adjusted to preserve the length of the interval between the two events. (d) Backward-amortized event trace: event E2 preceding the receive event is advanced to smooth the jump.]

Figure 1 illustrates the different steps of the CLC algorithm using a simple example consisting of two processes exchanging a single message. The subfigures show the time lines of the two processes along with their send (S) or receive (R) event, each of them enclosed by two other events (Ei). Figure 1(a) shows the initial event trace based on the measured timestamps with insufficiently synchronized local clocks. It exhibits a violation of the clock condition by having the receive event appear earlier than the matching send event. To restore the clock condition, R is moved forward in time to be l_min ahead of S (Figure 1(b)), with l_min being the minimum message latency. Because the distance between R and E4 is now too short, E4 is adjusted during the forward amortization to preserve the length of the interval between the two events (Figure 1(c)). However, the jump discontinuity introduced by adjusting R affects not only events later than R but also events earlier than R. This is corrected during the backward amortization, which shifts E2 closer to the new position of R (Figure 1(d)). As can be seen in this example, the algorithm only moves events forward in time.
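
The following greatly simplified sketch illustrates the violation detection and the forward amortization just described for one process-local event sequence; it is not the actual CLC algorithm, which controls the amount of amortization much more carefully, and all names are hypothetical.

// Simplified forward pass: restore the clock condition and propagate the shift.
#include <cstddef>
#include <vector>

struct LocalEvent {
    double timestamp;           // (possibly pre-synchronized) timestamp
    bool   is_receive;          // true for logical receive events
    double matching_send_time;  // timestamp of the matching logical send (receives only)
};

void correct_forward(std::vector<LocalEvent>& trace, double l_min) {
    double shift = 0.0;  // current forward-amortization offset
    for (std::size_t i = 0; i < trace.size(); ++i) {
        double corrected = trace[i].timestamp + shift;
        if (trace[i].is_receive) {
            double earliest = trace[i].matching_send_time + l_min;
            if (corrected < earliest) {         // clock-condition violation
                shift += earliest - corrected;  // jump propagated to later local events
                corrected = earliest;           // receive moved to l_min after the send
            }
        }
        trace[i].timestamp = corrected;
    }
    // Backward amortization (smoothing the events *preceding* each jump,
    // cf. Figure 1(d)) is omitted in this sketch.
}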

Moreover, happened-before relations also exist among the constituent events of collective MPI operations. The algorithm considers a single collective message-passing operation as being composed of multiple point-to-point operations, taking the semantics of the different flavors of such operations into account (e.g., 1-to-N, N-to-1, ...). For instance, let us consider an N-to-1 operation such as gather, where one root process receives data from N other processes. Given that the root process is not allowed to exit the operation before it has received data from the last process to enter the operation, the clock condition must be observed between the enter events of all sending processes and the exit event of the receiving root process. Because the algorithm synchronizes the timestamps of concurrent events using happened-before relations, the respective “receive” event is put forward in time whenever the matching “send” event appears too late in the trace to satisfy the clock condition. In reference to the fact that this method is based on logical clocks, the send and receive event types assigned during this mapping are called the logical event types, as opposed to the actual event types (e.g., enter, collective exit) specified in the event trace. The logical event type can usually be derived from the name of the MPI operation and the role a process plays in it.
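
As an illustration of this mapping, the following sketch derives the logical role of an event from the flavor of the collective operation and the role of the process; the enumerations and the function are hypothetical, not code from the paper or from Scalasca.

// Deriving logical event types for collective MPI operations (illustrative).
enum class LogicalRole { LogicalSend, LogicalReceive, None };
enum class CollectiveFlavor { OneToN, NToOne, NToN };

LogicalRole logical_role(CollectiveFlavor flavor, bool is_enter, bool is_root) {
    switch (flavor) {
    case CollectiveFlavor::NToOne:   // e.g., gather: all enters "send", root exit "receives"
        if (is_enter) return LogicalRole::LogicalSend;
        return is_root ? LogicalRole::LogicalReceive : LogicalRole::None;
    case CollectiveFlavor::OneToN:   // e.g., broadcast: root enter "sends", all exits "receive"
        if (is_enter) return is_root ? LogicalRole::LogicalSend : LogicalRole::None;
        return LogicalRole::LogicalReceive;
    case CollectiveFlavor::NToN:     // e.g., allreduce, barrier
        return is_enter ? LogicalRole::LogicalSend : LogicalRole::LogicalReceive;
    }
    return LogicalRole::None;
}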

Although the CLC algorithm removes residual inconsistencies (i.e., those left after applying linear offset interpolation) in event traces of MPI applications postmortem, it is limited by two factors. First, it does not account for direct violations of shared-memory event semantics in the original trace. Although rare, instances of such violations have been reported [7]. Second and more important, the algorithm does not preserve happened-before relations in shared-memory operations, because the constituent events of such constructs are currently treated as internal events not involved in happened-before relations with events of other threads. Thus, the restoration of message-passing semantics may introduce violations of shared-memory event semantics even though they were not violated in the original trace. For this reason, the current CLC algorithm is not suitable for hybrid cluster applications that use MPI and OpenMP in combination.

The potential implications of isolated corrections based on MPI event semantics for the semantics of OpenMP events are exemplified in Figure 2 using the time lines of three threads. Shown is a violated message exchange between a send and receive pair followed by the execution of an OpenMP parallel region. Here, the execution of the OpenMP parallel region by two threads is enclosed by a fork (F) and a join (J) event of the master thread. Whereas in Figure 2(a) the point-to-point event order is violated, the parallel regions appear clearly after the worker has been forked. However, while in Figure 2(b) the logical point-to-point event order is restored, now one thread enters the parallel region before it has been forked, which is impossible. In other words, the algorithm detects and corrects the clock condition violation in the point-to-point message exchange, while the subsequent forward amortization introduces a new violation as a result of the algorithm not accounting for event semantics in shared-memory operations.

[Fig. 2. Violations of OpenMP event semantics in the wake of restoring MPI event semantics. (a) Inconsistent point-to-point event semantics followed by consistent shared-memory fork semantics: all threads enter the parallel region after they have been forked. (b) The correction of inconsistent point-to-point event semantics may lead to inconsistent shared-memory fork semantics: here, one thread enters the parallel region before it has been forked.]

To address these limitations, Section IV describes the algorithmic extensions required to restore and preserve not only point-to-point and collective message-passing but also shared-memory event semantics, which are relevant for hybrid codes. Since rapidly increasing parallelism demands that this correction scales to large numbers of processes and threads, Section V shows how the hybrid version is parallelized and integrated into the scalable Scalasca trace-analysis framework.

IV. EXTENSIONS FOR OPENMP

Like in the case of MPI, happened-before relations exist among the constituent events of OpenMP regions (i.e., executed constructs). To enforce these relations alongside those implied by the MPI standard, we treat the events involved as logical point-to-point communication events. Thus, a happened-before relation between two events in an OpenMP region is modeled as the exchange of a logical message between the two events. Depending on the temporal dependencies among the events characterizing an OpenMP region, an event can be mapped either onto a logical send or onto a logical receive event. Once those mappings are defined, our earlier algorithmic framework [8] can essentially be reused. This is why we do not repeat the formulas here and, instead, concentrate on the identification of happened-before relations in OpenMP. Table II lists all OpenMP regions currently supported by our event model where we can identify happened-before relations and divides them into groups with very similar logical communication patterns, which are depicted in Figure 3. Although not yet provided by our implementation, we also consider tasking, whose integration into our event model is already in progress. Note that the algorithmic extensions cover neither thread migration between cores nor shared-memory event semantics imposed by cluster-wide OpenMP implementations (e.g., Intel Cluster OpenMP [29]), because in those cases additional communication may be used, introducing further constraints that are currently ignored by the algorithm.

1) Team creation and termination: Figure 3(a) shows the time-line visualization of three threads executing a parallel region. The master thread creates a team of threads (at fork event F) whose members subsequently enter the parallel region (at enter events Ei). In such a situation, the master thread sends a logical message to all worker threads. The fork event of the master thread is considered a logical send event, whereas all the enter events of the corresponding parallel region, one from each worker in the team, are considered logical receive events.

After each thread has left the parallel region (at OpenMP collective exit events OXi), this team of threads is terminated, as indicated by the join event (J) of the master thread. Here, the master thread receives logical messages from all worker threads. The join event (J) of the master thread is considered a logical receive event. All OpenMP collective exit events (OXi) of the corresponding parallel region, one from each worker in the team, are considered logical send events.
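
Under simplifying assumptions, the two relations just described can be enforced for a single parallel region as sketched below; eps stands for an assumed minimum temporal distance between a logical send and its matching receive, and the function names are hypothetical.

// Fork/join happened-before relations of one parallel region (illustrative).
#include <algorithm>
#include <vector>

// Restore C(fork) < C(enter_i) for every thread entering the region.
void enforce_fork(double fork_time, std::vector<double>& enter_times, double eps) {
    for (double& t : enter_times)
        t = std::max(t, fork_time + eps);         // advance violating enter events
}

// Restore C(exit_i) < C(join) for every OpenMP collective exit of the region.
void enforce_join(const std::vector<double>& exit_times, double& join_time, double eps) {
    for (double t : exit_times)
        join_time = std::max(join_time, t + eps); // advance a violating join event
}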

2) Barrier: OpenMP barrier constructs are similar to MPI barriers and therefore adhere to the same execution semantics, which require that no thread is allowed to exit a barrier region before the last thread has entered it. Such a situation is illustrated in Figure 3(b). All threads in the team are at the same time sender and receiver. All enter events (Ei) are considered logical send events and all OpenMP collective exit events (OXi) are considered logical receive events. This situation shows up in explicit and implicit barrier regions.

TABLE II
CLASSIFICATION OF OPENMP REGIONS.

Category           OpenMP region
Team creation      begin of parallel region
Team termination   end of parallel region
Barrier            explicit barrier region
                   implicit barrier (if executed) at the end of
                     parallel region
                     loop region (i.e., for, do)
                     single region
                     workshare region
                     sections region
Locking            omp_set_lock
                   omp_unset_lock
                   critical region
Tasking            task region
                   taskwait region


[Fig. 3. Happened-before relations in OpenMP regions visualized as arrows representing logical messages. (a) Team creation and termination: the execution of a parallel region is chronologically enclosed by fork and join events. (b) OpenMP barrier: a thread is allowed to exit a barrier region only after the last thread has entered it. (c) OpenMP lock sequence: a thread can acquire a lock only after it has been released – if it was acquired earlier. (d) OpenMP task sequence: a task can only be resumed after it has been suspended.]

3) Locking: Figure 3(c) visualizes two threads competing for a lock variable. First, one thread acquires (LA1) and releases (LR) the lock variable. Then, the other thread locks (LA2) the same variable after the lock was released by the first thread. In such a situation, two different happened-before relations exist. First, the thread represented by the upper time line is allowed to release the lock only after it has been acquired (LA1 → LR). Since these two events occur on the same time line, this relation is trivially enforced. Second, the thread represented by the lower time line can only acquire the lock once it has been released by the other thread (LR → LA2). Given that a lock variable can be owned only by one thread at a time, the releasing thread sends a logical message to the next thread acquiring the lock. The lock-acquisition event is considered a logical receive event, whereas the lock-release event of the thread releasing the lock is considered a logical send event. Since at program start none of the locks is occupied, the first thread acquiring a given lock does not need to wait for a preceding lock-release event.

As Scalasca models the execution of critical constructs with lock events, the above-mentioned happened-before relations also exist in critical regions. Given that a critical construct restricts the execution of a structured block to a single thread at a time, a lock-acquisition event is recorded before a thread enters the critical region, whereas a lock-release event is recorded after the thread leaves the critical region. Because either the same unspecified name or a user-defined name is used to identify a critical region, the event model provides lock identifiers representing the name of a critical region. In addition, similar happened-before relations are also found in atomic and flush constructs, but the source-code instrumentation applied by Scalasca does not allow events inside these regions to be recorded [27], although this would be necessary to determine when a thread enters or leaves such a region. For instance, an atomic construct ensures that a specific storage location is updated atomically. Similar to critical constructs, it would be necessary to record when a thread executes inside the atomic construct. However, the execution of an atomic construct is restricted to statements that can be calculated atomically, which prevents the insertion of tracing calls. The flush directive, which does not have any code attached to it, is even more restrictive in this regard. Since we cannot record events inside such regions, these constructs are currently ignored by the algorithm.

Given that our current event model does not provide event attributes such as a sequence count indicating the logical order of lock events, this order can only be derived from the timestamps as they are recorded in the trace. Assuming that on most systems the thread-local clocks within a team are synchronized, these timestamps provide a reasonably reliable sequence indicator. However, as on some systems this assumption cannot be maintained, the timestamp alone may be insufficient to determine the correct precedence order of lock events and to detect violations of this order in the course of MPI-related CLC corrections. Nevertheless, on all systems the algorithm can preserve the event order as found in the original trace. Hence, the original order of lock events is determined once at program start and subsequently used when the roles of logical senders and receivers are determined.
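
The following sketch illustrates the idea of freezing the original acquisition order of one lock variable and then enforcing the release-to-acquisition relation along that order; all names are hypothetical, and the real logic is folded into the parallel replay described in Section V.

// Enforcing the precedence order of lock events (illustrative).
#include <algorithm>
#include <cstddef>
#include <vector>

struct LockUse {
    int    thread;         // thread owning the lock during this use
    double acquire_time;   // timestamp of the lock-acquisition event
    double release_time;   // timestamp of the matching lock-release event
};

// Freeze the precedence order once, based on the original timestamps.
void freeze_original_order(std::vector<LockUse>& uses) {
    std::stable_sort(uses.begin(), uses.end(),
                     [](const LockUse& a, const LockUse& b) {
                         return a.acquire_time < b.acquire_time;
                     });
}

// The release of use k acts as a logical send, the acquisition of use k+1
// as the matching logical receive; eps is an assumed minimum distance.
void enforce_lock_order(std::vector<LockUse>& uses, double eps) {
    for (std::size_t k = 1; k < uses.size(); ++k) {
        double earliest = uses[k - 1].release_time + eps;
        uses[k].acquire_time = std::max(uses[k].acquire_time, earliest);
        uses[k].release_time = std::max(uses[k].release_time, uses[k].acquire_time);
    }
}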

4) Tasking: Figure 3(d) shows the time-line visualization of two threads executing an untied task. One thread creates a task at the task-begin event (TB) and subsequently suspends this task at the task-suspend event (TS). Afterward, a thread different from the one that executed the task before it was suspended resumes the task at the task-resume event (TR) and finally terminates the task at the task-termination event (TT). Note that a task may be suspended at any point, not only at implied scheduling points, although some compilers respect scheduling points even for untied tasks. The task-suspend event (TS) is considered a logical send event, whereas the task-resume event (TR) is considered a logical receive event. Note that this happened-before relation is naturally fulfilled for tied tasks and does not demand any correction. Finally, taskwait regions impose further happened-before relations (not shown) between the exit events of the child tasks created by the surrounding task region and the exit event of the taskwait region, which can be handled accordingly.

[Fig. 4. Parallel trace-analysis workflow in Scalasca. Gray rectangles denote programs and white rectangles with the upper right corner turned down denote files. Stacked symbols indicate multiple instances of programs or files running or being processed in parallel.]

V. HYBRID PARALLELIZATION

Scalasca, the performance analysis toolset for which our algorithm has been primarily designed, uses event traces to identify wait states that occur, for example, as a result of an unevenly distributed workload and that may harm performance especially at larger scales [5]. The basic analysis workflow is depicted in Figure 4. To ensure scalability of the wait-state search, the traces are scanned in parallel using as many processors as have been used to execute the target application itself. After loading the process-local traces into the potentially distributed main memory of the machine, Scalasca traverses them simultaneously while replaying the original communication recorded in the trace to exchange information relevant to the search. To increase the accuracy of this analysis on clusters without a global clock, Scalasca applies the parallel CLC algorithm after the traces have been loaded and before the wait-state analysis takes place. To optimize the fidelity of the correction, the timestamps first undergo a pre-synchronization step, which performs linear offset interpolation based on offset measurements taken during initialization and finalization of the target application. As an alternative to the wait-state search, the corrected traces can also be rewritten and visualized using an external time-line browser.

Just like the wait-state analysis, the CLC algorithm requires comparing events involved in the same communication operation, which is why it follows a similar parallelization strategy, adopting a variation of the parallel-replay idea. Since trace processing capabilities (i.e., processors and memory) grow proportionally with the number of application processes, we can achieve good scalability on large processor configurations. During the replay, sending and receiving processes exchange the information needed to synchronize the event timestamps. However, different from the wait-state search, which requires only a forward replay, the CLC algorithm performs the replay in both directions – forward and backward. The backward replay, during which the roles of sender and receiver are reversed, is needed because the backward amortization requires knowledge of receiver timestamps on the sender side. To make the parallel CLC implementation applicable to realistic traces from hybrid codes, we

• enabled the replay engine to deal with hybrid traces,
• added logic for the proper identification of logical senders and receivers among OpenMP-related events, and
• facilitated the exchange of timestamps between the threads responsible for these events as a prerequisite for the synchronization.

The parallel CLC algorithm is, again like the wait-state analysis, implemented on top of PEARL [30], a parallel library that offers higher-level abstractions to read and analyze large volumes of trace data, including random access to individual events, links between related events, functionality to transfer and access remote events, and replay support. In the pure MPI case, the usage model of the library assumes a one-to-one mapping between analysis (i.e., correction) and target-application processes. That is, for every process of the target application, one correction process responsible for the trace data of this application process is created. Data exchange during the replay is accomplished via MPI. The basic idea behind our new hybrid scheme was again to mirror the process and thread structure of the target application and to make the CLC implementation a hybrid program in its own right. To this end, our trace-access library was extended such that the events of every application thread, including new OpenMP-specific event types, can be accessed and processed by a dedicated analysis thread. Now, the one-to-one correspondence between the executions of the target application and the CLC algorithm exists on two levels: processes and threads. The resulting parallel processing scheme becomes a hybrid parallel replay of the target application. Note that the current usage model is restricted in that it supports only MPI calls on the master thread (i.e., MPI funneled mode) and only a fixed number of threads per process.
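
A minimal sketch of what such a mirrored hybrid structure can look like, assuming MPI funneled mode and a fixed thread count per process; it is illustrative only and does not reflect the PEARL API.

// Hybrid analysis skeleton: one process per application process, one
// analysis thread per application thread, MPI used in funneled mode.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

static void correct_local_trace(int rank, int thread) {
    // ... scan this thread's events, classify logical senders/receivers,
    // and apply forward/backward amortization (omitted in this sketch) ...
    std::printf("rank %d, thread %d: local trace processed\n", rank, thread);
}

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int num_app_threads = 4;        // fixed thread count of the target run
    omp_set_num_threads(num_app_threads);

    #pragma omp parallel
    {
        correct_local_trace(rank, omp_get_thread_num());
        // Inter-process exchanges of the replay happen outside the parallel
        // region or on the master thread only (funneled mode).
    }

    MPI_Finalize();
    return 0;
}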

To synchronize the timestamps, each thread scans the event trace for clock-condition violations and applies forward and backward amortization, as introduced earlier. During the forward amortization, events belonging to OpenMP regions may be classified as logical senders or receivers according to their role in these regions. Events indicating the creation or termination of a team of threads and events indicating the acquisition and release of lock variables are easily classified based on their event type as specified in the trace. For events related to entering or leaving parallel or barrier regions, the logical event type is derived from the region name (e.g., parallel, barrier) and the role (e.g., master thread) a particular thread plays therein.

Furthermore, we defined functions to exchange and compare timestamps between threads during the different replay phases – mostly representing different flavors of reduction operations. The required communication pattern depends on the type of OpenMP regions whose event timestamps are to be synchronized. As mentioned earlier, in the absence of more precise order attributes the logical event order of lock events is currently derived from their relative timings as recorded in the trace, which may be inaccurate only in the rare cases where the thread-local clocks within a team exhibit significant errors. For this purpose, the chronological event order of lock events is determined in advance and stored in a global data structure before the actual replay is applied. During the replay of lock operations, timestamps are exchanged between the threads competing for the same lock – similar to the replay of point-to-point messages.
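
As an illustration of such a reduction-style exchange, the following sketch determines the latest enter timestamp of a team during the forward replay of an OpenMP barrier and advances every collective-exit timestamp to lie after it; it is a simplified stand-in for the actual implementation, and eps is an assumed minimum distance.

// Forward replay of one OpenMP barrier, called from within a parallel region
// by every analysis thread with its own enter/exit timestamps (illustrative).
#include <algorithm>

void replay_openmp_barrier(double& my_enter, double& my_exit, double eps) {
    static double latest_enter;             // shared among the team

    #pragma omp single
    latest_enter = my_enter;                 // initialize with one thread's value
    // (implicit barrier at the end of the single construct)

    #pragma omp critical
    latest_enter = std::max(latest_enter, my_enter);

    #pragma omp barrier                      // all enter timestamps contributed

    my_exit = std::max(my_exit, latest_enter + eps);   // restore the clock condition

    #pragma omp barrier                      // keep the shared value valid until all have read it
}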

VI. EXPERIMENTAL EVALUATION

In this section, we evaluate the accuracy and scalability of the parallel controlled logical clock algorithm when applied to traces of hybrid codes and also give evidence of the frequency and the extent of clock condition violations in such traces. We ran our experiments on the Nicole cluster of the Jülich Supercomputing Centre. This cluster consists of 32 compute nodes, each with two quad-core AMD Opteron processors running at 2.4 GHz. The individual compute nodes of the Nicole cluster are linked with an InfiniBand network. The measured MPI inter-node latency was 4.5 µs; the measured MPI intra-node latency was 1.5 µs. Unless stated otherwise, all numbers presented in this section represent the average across at least three measurements.

The application PEPC, a parallel tree code for the rapid computation of long-range Coulomb forces in n-body particle systems [31], served as the first test case. In the course of this evaluation study, the original parallel processing scheme, an MPI implementation of the Barnes-Hut tree algorithm [32] according to the Warren-Salmon hashed oct-tree structure [33], was enriched with shared-memory parallelism within the solver and integrator part. Applying a strong scaling strategy, a fixed overall number of particles (i.e., 524288) with 100 solver iterations was configured, resulting in approximately ideal speedup behavior [34]. In our test configurations, the runtime was approximately 30, 15, and 7.5 min. Given that tracing the full run would consume a prohibitively large amount of storage space, selective tracing was applied so that the solver and integrator part were traced only during iteration 50. This mimics the common practice of tracing only pivotal points that warrant a more detailed analysis.

A hybrid version of the Jacobi solver, which originally comes with the OpenMP Source Code Repository of the Parallel Computing Group at La Laguna University, was used as a second test case [35]. This benchmark solves the Poisson equation on a rectangular grid assuming uniform discretization in each direction and Dirichlet boundary conditions. The original benchmark, a pure OpenMP implementation, had been combined with MPI-based parallelism. Following a strong scaling strategy, a fixed matrix size of 2000 × 2000 was configured. To emulate a run long enough so that drift deviations may have a noticeable effect, we inserted sleep statements immediately before and after the main computational phase so that it was carried out ten minutes after initialization and ten minutes before finalization, resulting in a total execution time of roughly twenty minutes.

TABLE III
EXECUTION CONFIGURATIONS AND THE DISTRIBUTION OF REVERSED MESSAGES IN THE ORIGINAL TRACE AND VIOLATED MESSAGES DETECTED DURING THE TIMESTAMP SYNCHRONIZATION WITH RESPECT TO THE PROGRAMMING-MODEL SEMANTICS THEY VIOLATE.

                                    PEPC                    Jacobi
# CPUs                        64    128    256      32     64    128    256
# processes                   16     32     64      16     32     64    128
# threads                      4      4      4       2      2      2      2
Distribution of reversed messages in the original trace [%]
MPI                          100    100    100     100    100    100    100
OpenMP                         0      0      0       0      0      0      0
Violated messages detected during synchronization [%]
MPI                           44     46     42      81     84     81     40
OpenMP                        55     53     57      18     15     18     59

Table III lists the investigated execution configurations along with the distribution of reversed logical messages with respect to the programming-model semantics they violate (i.e., MPI or OpenMP). Apparently, in the original trace only violations of MPI event semantics occurred. However, to preserve the logical event order in the corrected trace, OpenMP event semantics were temporarily violated and subsequently restored by the hybrid version of the CLC algorithm. While the pure MPI version that does not account for OpenMP event semantics would leave these violations unnoticed, the hybrid CLC algorithm recognizes such situations and preserves the correct order of OpenMP events in the synchronized event stream. After applying the algorithm, the traces were free of any clock-condition violations.

[Fig. 5. Percentage of (logical) MPI messages with the order of send and receive events being reversed in the original trace; reversed messages [%] over # threads (# messages). (a) PEPC on Nicole: 64 (0.3M), 128 (0.8M), 256 (2.5M). (b) Jacobi on Nicole: 32 (0.3M), 64 (1.1M), 128 (2.9M), 256 (17.9M).]


[Fig. 6. Relative deviation of the event distance: percentage of execution time consumed by intervals with deviation above threshold. (a) PEPC on Nicole. (b) Jacobi on Nicole.]

Moreover, Figure 5 shows the frequency of reversed messages as a percentage of the total number of messages. Given that none of the OpenMP event semantics was violated in the original trace, the numbers only refer to point-to-point messages and logical messages that can be derived by mapping collective MPI communication onto point-to-point communication. Although the graphs suggest that the number of violations decreases as the number of processors is increased, such a relationship was not generally confirmed in other studies [8]. In the case of PEPC, the drop in the overall runtime might offer an explanation, though. The extent of these clock condition violations can be assessed by the average and maximum displacement errors (i.e., the time the receive event appears earlier than the send event) of logical message events in backward order, as seen in the original trace. The PEPC traces exhibit an average error of 21.7 µs and a maximum error of 531.0 µs, whereas the Jacobi traces exhibit an average error of 3.5 µs and a maximum error of 98.0 µs. The numbers demonstrate that clock-condition violations may appear frequently and that individual violations can be large in absolute terms.

To assess the collateral error inflicted on local timings while applying the CLC algorithm, we determined the relative deviation of local interval lengths, considering two different types of intervals:

• intervals between an event and the first event on the same node, which is referred to as the event position, and
• intervals between adjacent thread-local events (i.e., intervals between an event and its immediate successor), which is referred to as the event distance.

For the Jacobi experiments, only the middle section of the trace between the sleep statements was considered. The maximum relative deviation of the event position across all PEPC and Jacobi measurements was negligible. The maximum absolute deviation of the event position was 535.78 µs for PEPC and 102.67 µs for Jacobi, roughly corresponding to the respective maximum displacement error observed. Moreover, Figure 6 shows the relative deviation of the event distance across different numbers of processors for both test applications. Each bar indicates the percentage of execution time consumed by intervals in a certain error class. All numbers represent the maximum across three measurements. It can be seen that in spite of very small averages, deviations of occasionally more than 100% are still possible, but the aggregate time consumed by those deviations is very small and their influence on performance analysis results will usually be negligible.

On identical configurations, the timestamp synchronization was a factor of 2–3 slower than the equivalent uninstrumented execution of PEPC, which we hope to optimize in future versions of our implementation. To evaluate the scaling behavior of the hybrid synchronization method, Figure 7 shows a comparison to the Scalasca wait-state analysis and the uninstrumented PEPC solver. The numbers for each configuration are normalized with respect to the execution time of PEPC in the 64-thread configuration. The results demonstrate that the parallel timestamp synchronization, the wait-state analysis, and the execution of PEPC itself exhibit roughly equivalent scaling behavior, which was to be expected due to the replay-based nature of the two trace processing mechanisms.

[Fig. 7. Normalized execution time of the parallel timestamp synchronization on Nicole: time consumed [%] over # threads (64, 128, 256) for PEPC, the timestamp synchronization, and the wait-state analysis.]

VII. CONCLUSION

Event traces of parallel applications on clusters may suffer from inaccurate timestamps in the absence of synchronized clocks. As a consequence, the analysis of such traces may yield wrong quantitative and qualitative results, among other effects confusing the users of time-line visualizations with messages flowing backward in time. Because linear offset interpolation can account for such deficiencies only for very short runs, the CLC algorithm retroactively synchronizes timestamps in event traces and restores the correct logical event order. It does so in a scalable manner by replaying the traces in parallel. In this paper, we extended the CLC algorithm, which was previously designed for pure MPI applications, to also cover hybrid applications that use MPI and OpenMP in combination. The major contribution was the identification of happened-before relations in OpenMP to be taken into account by the algorithm and the hybridization of the parallel replay mechanism. Finally, the hybrid CLC version was integrated into the Scalasca performance-analysis toolset. Our experimental evaluation showed that the good accuracy and scalability characteristics of the pure MPI version were retained in the hybrid version.

In the future, we plan to adapt our algorithm to more advanced OpenMP features such as nested parallelism and tasking as those features are successively integrated into the POMP event model used by Scalasca. Moreover, we want to further increase the accuracy of the CLC algorithm by improving the preceding pre-synchronization via linear clock-offset interpolation, which currently rests on only two offset measurements taken during program initialization and finalization. With low-overhead offset measurements periodically taken during globally synchronizing operations, as introduced by Doleschal et al. [14], the linear interpolation can better account for drift deviations, reducing the number of violations our algorithm needs to correct in the first place and further improving the overall quality of the event timestamps.

REFERENCES

[1] W. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, and K. Solchenbach, “Vampir: Visualization and analysis of MPI resources,” Supercomputer, vol. 12, no. 1, pp. 69–80, January 1996.
[2] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris, “DiP: A parallel program development environment,” in Proc. of the European Conference on Parallel Computing, ser. LNCS 1124. Lyon, France: Springer, August 1996, pp. 665–674.
[3] G. Rodríguez, R. M. Badia, and J. Labarta, “Generation of simple analytical models for message passing applications,” in Proc. of the European Conference on Parallel Computing, ser. LNCS 3149. Pisa, Italy: Springer, August–September 2004, pp. 183–188.
[4] S. Huband and C. McDonald, “A preliminary topological debugger for MPI programs,” in Proc. of the 1st IEEE/ACM International Symposium on Cluster Computing and the Grid. Brisbane, Australia: IEEE, 2001, pp. 422–429.
[5] M. Geimer, F. Wolf, B. J. Wylie, and B. Mohr, “A scalable tool architecture for diagnosing wait states in massively parallel applications,” Parallel Computing, vol. 35, no. 7, pp. 375–388, July 2009.
[6] D. L. Mills, “Network Time Protocol (Version 3),” The Internet Engineering Task Force – Network Working Group, March 1992, RFC 1305.
[7] D. Becker, R. Rabenseifner, and F. Wolf, “Implications of non-constant clock drifts for the timestamps of concurrent events,” in Proc. of the IEEE Cluster Conference. Tsukuba, Japan: IEEE, 2008, pp. 59–68.
[8] D. Becker, R. Rabenseifner, F. Wolf, and J. C. Linford, “Scalable timestamp synchronization for event traces of message-passing applications,” Parallel Computing, Special Issue EuroPVM/MPI 2007, vol. 35, no. 12, pp. 595–607, December 2009.
[9] R. Rabenseifner, “The controlled logical clock – a global time for trace based software monitoring of parallel applications in workstation clusters,” in Proc. of the 5th EUROMICRO Workshop on Parallel and Distributed Processing. London, UK: IEEE, January 1997, pp. 477–484.
[10] F. Cristian, “Probabilistic clock synchronization,” Distributed Computing, vol. 3, no. 3, pp. 146–158, 1989.
[11] T. H. Dunigan, “Hypercube clock synchronization,” Oak Ridge National Laboratory, TN, USA, Technical Report ORNL TM-11744, February 1991.
[12] ——, “Hypercube clock synchronization,” ORNL TM-11744 (updated), September 1994, www.epm.ornl.gov/∼dunigan/clock.ps.
[13] E. Maillet and C. Tron, “On efficiently implementing global time for performance evaluation on multiprocessor systems,” Journal of Parallel and Distributed Computing, vol. 28, pp. 84–93, 1995.
[14] J. Doleschal, A. Knüpfer, M. S. Müller, and W. Nagel, “Internal timer synchronization for parallel event tracing,” in Proc. of the 15th European PVM/MPI Users’ Group Meeting, ser. LNCS 5205. Dublin, Ireland: Springer, September 2008, pp. 202–209.
[15] A. Duda, G. Harrus, Y. Haddad, and G. Bernard, “Estimating global time in distributed systems,” in Proc. of the 7th International Conference on Distributed Computing Systems. Berlin, Germany: IEEE, September 1987, pp. 299–306.
[16] J.-M. Jézéquel, “Building a global time on parallel machines,” in Proc. of the 3rd International Workshop on Distributed Algorithms, ser. LNCS 392. Nice, France: Springer, 1989, pp. 136–147.
[17] R. Hofmann, “Gemeinsame Zeitskala für lokale Ereignisspuren,” in Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen. Aachen: Springer, September 1993, pp. 333–345.
[18] R. Hofmann and U. Hilgers, “Theory and tool for estimating global time in parallel and distributed systems,” in Proc. of the 6th Euromicro Workshop on Parallel and Distributed Processing. Madrid, Spain: IEEE, January 1998, pp. 173–179.
[19] M. Biberstein, Y. Harel, and A. Heilper, “Clock synchronization in Cell BE traces,” in Proc. of the 14th Euro-Par Conference, ser. LNCS 5168. Las Palmas de Gran Canaria, Spain: Springer, 2008, pp. 3–12.
[20] O. Babaoglu and R. Drummond, “(Almost) no cost clock synchronization,” Cornell University, Technical Report TR86-791, 1986.
[21] R. Drummond and O. Babaoglu, “Low-cost clock synchronization,” Distributed Computing, vol. 6, no. 4, pp. 193–203, July 1993.
[22] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” Communications of the ACM, vol. 21, no. 7, pp. 558–565, 1978.
[23] C. J. Fidge, “Timestamps in message-passing systems that preserve partial ordering,” Australian Computer Science Communications, vol. 10, no. 1, pp. 56–66, February 1988.
[24] ——, “Partial orders for parallel debugging,” ACM SIGPLAN Notices, vol. 24, no. 1, pp. 183–194, January 1989.
[25] F. Mattern, “Virtual time and global states of distributed systems,” in Proc. of the International Workshop on Parallel and Distributed Algorithms. Château de Bonas, France: Elsevier Science Publishers B. V., Amsterdam, October 1989, pp. 215–226.
[26] R. Rabenseifner, “Die geregelte logische Uhr, eine globale Uhr für die tracebasierte Überwachung paralleler Anwendungen,” Ph.D. dissertation, University of Stuttgart, Stuttgart, March 2000.
[27] B. Mohr, A. Malony, S. Shende, and F. Wolf, “Design and prototype of a performance tool interface for OpenMP,” The Journal of Supercomputing, vol. 23, no. 1, pp. 105–128, August 2002.
[28] D. Lorenz, B. Mohr, C. Rössel, D. Schmidl, and F. Wolf, “How to reconcile event-based performance analysis with tasking in OpenMP,” in Proc. of the 6th International Workshop on OpenMP (IWOMP), ser. LNCS. Tsukuba, Japan: Springer, June 2010, (accepted).
[29] J. P. Hoeflinger, “Extending OpenMP to clusters,” Intel Corporation, 2005, cache-www.intel.com/cd/00/00/28/58/285865 285865.pdf.
[30] M. Geimer, F. Wolf, A. Knüpfer, B. Mohr, and B. J. N. Wylie, “A parallel trace-data interface for scalable performance analysis,” in Proc. of the Workshop on State-of-the-Art in Scientific and Parallel Computing, ser. LNCS 4699. Umeå, Sweden: Springer, June 2006, pp. 398–408.
[31] S. Pfalzner and P. Gibbon, Many-Body Tree Methods in Physics. Cambridge University Press, October 1996.
[32] J. E. Barnes and P. Hut, “A hierarchical O(N log N) force-calculation algorithm,” Nature, vol. 324, no. 6096, pp. 446–449, December 1986. [Online]. Available: http://dx.doi.org/10.1038/324446a0
[33] M. S. Warren and J. K. Salmon, “A parallel hashed oct-tree n-body algorithm,” in Proc. of the Conference on High Performance Networking and Computing. Portland, Oregon, USA: ACM, 1993, pp. 12–21.
[34] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proc. of the AFIPS Joint Computer Conferences. Atlantic City, NJ, USA: ACM, 1967, pp. 483–485.
[35] A. J. Dorta, C. Rodríguez, F. de Sande, and A. González-Escribano, “The OpenMP source code repository,” in Proc. of the 13th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. Lugano, Switzerland: IEEE, February 2005, pp. 244–250.