
Synchronizing the timestamps of concurrent events in traces of hybrid MPI/OpenMP applications

D. Becker, M. Geimer, R. Rabenseifner, and F. Wolf | Laboratory for Parallel Programming | September 21, 2010

Cluster systems

• Cluster systems represent the majority of today's supercomputers
  – Availability of inexpensive commodity components
• Vast diversity
  – Architecture
  – Interconnect technology
  – Software environment
• Message-passing and shared-memory programming models for communication and synchronization

Event tracing

• Application areas
  – Performance analysis
    • Time-line visualization
    • Wait-state analysis
  – Performance modeling
  – Performance prediction
  – Debugging
• Events are recorded at runtime to enable post-mortem analysis of dynamic program behavior
• Each event includes at least a timestamp, a location, and an event type (see the sketch after the figure)

[Figure: per-process event streams for Send, Recv, and Barrier operations, built from enter (E), exit (X), send (S), receive (R), and collective exit (MX) events; events are recorded at runtime, written to local traces, and optionally merged.]
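A minimal sketch of such an event record in C; the field and type names are illustrative assumptions, not the actual trace format:

    #include <stdint.h>

    /* Event types named on this slide: enter, exit, send, receive,
     * and collective exit. */
    typedef enum { ENTER, EXIT, SEND, RECV, COLL_EXIT } event_type_t;

    typedef struct {
        double       timestamp; /* local clock time at which the event occurred */
        uint32_t     location;  /* process/thread that produced the event */
        event_type_t type;      /* what happened */
    } event_t;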

Problem: Non-synchronized clocks

The processor-local clocks of a cluster are usually not synchronized and drift apart, so a message can appear to be received before it was sent, a so-called clock condition violation.


Outline

Clock synchronization

• Lamport, Mattern, Fidge, Rabenseifner: restore and preserve logical correctness
• Dunigan, Maillet, Tron, Doleschal: measure offset values and determine an interpolation function (sketched below)
• Duda, Hofman, Hilgers: determine a medial smoothing function based on send/receive differences
• Mills: query time from reference clocks synchronized at regular intervals
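As a rough illustration of the offset-measurement approach, the sketch below linearly interpolates between two offsets measured against a master clock; the two-point scheme and all names are assumptions for illustration, and real schemes measure and fit more carefully:

    /* Offset of the local clock against a master clock, measured at program
     * start (t0) and end (t1); in between, assume linear drift. */
    double interpolated_offset(double t,
                               double t0, double offset0,
                               double t1, double offset1)
    {
        double drift = (offset1 - offset0) / (t1 - t0);
        return offset0 + drift * (t - t0);
    }

    /* Corrected timestamp: local time minus the interpolated offset. */
    double correct_timestamp(double t, double t0, double o0,
                             double t1, double o1)
    {
        return t - interpolated_offset(t, t0, o0, t1, o1);
    }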

Controlled logical clock

[Figure: a receive event whose timestamp lies before that of its matching send event violates the clock condition; the controlled logical clock advances such receive events to the send timestamp plus the minimum message latency µ_min.]
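A minimal sketch of the core CLC receive rule under that reading; the full algorithm also amortizes the resulting jump over neighboring events (see the amortization slides), and the function name is illustrative:

    /* Enforce the clock condition for one message: the receive must not
     * appear earlier than its matching send plus the minimum latency. */
    double clc_receive(double recv_ts, double send_ts, double mu_min)
    {
        if (recv_ts < send_ts + mu_min)
            return send_ts + mu_min; /* advance the violating receive */
        return recv_ts;              /* already consistent */
    }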

MPI semantics

[Figure: MPI collective operations decomposed into logical messages between their enter (E) and collective exit (MX) events, shown for several collective flavors.]

Limitations of the CLC algorithm

• Neither restores nor preserves the clock condition under OpenMP event semantics
• May introduce violations at locations that were previously intact

[Figure: correcting an MPI message (S, R) can shift events past an omp_barrier, introducing a new violation in the OpenMP synchronization.]

Collective communication

• Consider OpenMP constructs as composed of multiple logical messages
• Define logical send/receive pairs for each flavor

[Figure: an omp_barrier modeled as logical messages between its enter (E) and OpenMP exit (OX) events.]

OpenMP semantics

[Figure: logical send/receive pairs for the OpenMP flavors: parallel constructs with fork (F) and join (J) events, barriers with enter (E) and OpenMP exit (OX) events, and tasking.]

Happened-before relation

• An operation may have multiple logical receive and send events
• Multiple receives are used to synchronize multiple clocks
• The latest send event is the relevant send event (see the sketch below)

[Figure: happened-before relations between enter (E), collective exit (MX), and OpenMP exit (OX) events.]
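A sketch of how these rules combine for an omp_barrier, assuming the barrier is treated as an N-to-N logical message in which every barrier exit (OX) is a logical receive and the latest barrier enter (E) in the team is the relevant logical send; all names are illustrative:

    #include <stddef.h>

    /* The latest send event is the relevant send event. */
    static double latest_ts(const double *ts, size_t n)
    {
        double latest = ts[0];
        for (size_t i = 1; i < n; i++)
            if (ts[i] > latest)
                latest = ts[i];
        return latest;
    }

    /* Apply the CLC receive rule to every thread's barrier exit. */
    void correct_omp_barrier(double *exit_ts, const double *enter_ts,
                             size_t nthreads, double mu_min)
    {
        double send_ts = latest_ts(enter_ts, nthreads);
        for (size_t i = 0; i < nthreads; i++)
            if (exit_ts[i] < send_ts + mu_min)
                exit_ts[i] = send_ts + mu_min;
    }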

Parallelization

• Correct local traces in parallel
  – Keep the whole trace in memory
  – Exploit distributed memory and processing capabilities
• Replay communication (sketched below)
  – Traverse the trace in parallel
  – Exchange data at synchronization points
  – Use an operation of the same type
    • MPI functions
    • OpenMP constructs
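A minimal sketch of re-enacting a recorded point-to-point message during replay so the receiver can apply the CLC rule; the function names are illustrative, not the tool's actual API:

    #include <mpi.h>

    /* Sender side: re-send the (already corrected) send timestamp. */
    void replay_send(double send_ts, int peer, int tag)
    {
        MPI_Send(&send_ts, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);
    }

    /* Receiver side: obtain the sender's timestamp and enforce the
     * clock condition on the local receive event. */
    void replay_recv(double *recv_ts, int peer, int tag, double mu_min)
    {
        double send_ts;
        MPI_Recv(&send_ts, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (*recv_ts < send_ts + mu_min)
            *recv_ts = send_ts + mu_min;
    }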

Forward replay

[Figure: three threads/processes traverse their traces forward in parallel, exchanging timestamp data at each omp_barrier they recorded.]

Backward amortization

• Avoid new violations
• Do not advance a send event farther than its matching receive

[Figure: message pairs (S, R) constraining how far the amortization may advance each send.]

Backward replay

• Data from the sender side is needed
• Communication direction
  – Communication proceeds in the backward direction
  – Roles of sender and receiver are inverted (see the sketch below)
• Traversal direction
  – Start at the end of the trace
  – Avoid deadlocks

[Figure: during backward replay the original receivers act as senders and the original senders act as receivers, so every sender learns the timestamp of its matching receive.]
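A sketch of the inverted exchange, assuming each original sender must learn its matching (corrected) receive timestamp to bound its own correction; names are illustrative:

    #include <mpi.h>

    /* Original receiver: report the corrected receive timestamp back. */
    void backward_replay_recv_side(double recv_ts, int peer, int tag)
    {
        MPI_Send(&recv_ts, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD);
    }

    /* Original sender: learn the matching receive timestamp and cap the
     * amortized send time so that no new violation is introduced. */
    double backward_replay_send_side(double send_ts, int peer, int tag,
                                     double mu_min)
    {
        double recv_ts;
        MPI_Recv(&recv_ts, 1, MPI_DOUBLE, peer, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        double limit = recv_ts - mu_min;
        return (send_ts > limit) ? limit : send_ts;
    }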

Piece-wise correction

[Figure: deviations from LC_i^b plotted over the amortization interval around a jump of size ∆t_R at a corrected receive event, with legend:
• LC_i^b: controlled logical clock without jump discontinuities
• LC_i' - LC_i^b: controlled logical clock with jump discontinuities
• LC_i^A' - LC_i^b: linear interpolation for backward amortization
• LC_i^A - LC_i^b: piecewise linear interpolation for backward amortization
The correction applied to a send event is bounded above by min(LC_k'(corresponding receive event) - µ - LC_i^b).]
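A minimal sketch of backward amortization under that reading: events in the interval before the corrected receive are advanced by a linearly growing offset, and send events are capped by their matching receive minus µ_min. The actual algorithm fits a piecewise linear function below all such limits at once; this sketch clamps each send individually, and all names are illustrative:

    typedef struct {
        double ts;      /* current (possibly corrected) timestamp   */
        int    is_send; /* non-zero if this is a send event         */
        double limit;   /* for sends: matching receive ts - mu_min  */
    } ev_t;

    /* Spread the jump 'delta' at t_jump linearly over [t_begin, t_jump]. */
    void amortize(ev_t *ev, int n, double t_begin, double t_jump, double delta)
    {
        for (int i = 0; i < n; i++) {
            if (ev[i].ts <= t_begin || ev[i].ts > t_jump)
                continue;
            double corr = delta * (ev[i].ts - t_begin) / (t_jump - t_begin);
            double t = ev[i].ts + corr;
            if (ev[i].is_send && t > ev[i].limit)
                t = ev[i].limit; /* never overtake the matching receive */
            ev[i].ts = t;
        }
    }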

Experimental evaluation

• Nicole cluster
  – JSC@FZJ
  – 32 compute nodes
  – 2 quad-core Opterons per node running at 2.4 GHz
  – InfiniBand
• Applications
  – PEPC (4 threads per process)
  – Jacobi solver (2 threads per process)
• Evaluation focused on the frequency of clock condition violations and on the accuracy and scalability of the correction
• A significant percentage of messages was violated (up to 5%)
• After correction, all traces were free of clock condition violations

Accuracy of the algorithm

• Event position
  – Absolute deviations correspond to the values of the clock condition violations
  – Relative deviations are negligible
• Event distance
  – Larger relative deviations are possible
  – Impact on analysis results is negligible

The correction changed the length of local intervals only marginally.

Synchronizing hybrid codes

• Only MPI semantics were violated in the original trace
• Roughly half of the corrections correspond to OpenMP semantics
• The algorithm preserved OpenMP semantics

[Figure: corrected MPI messages (S, R) interleaved with omp_barrier constructs; the barriers remain intact after correction.]


Scalability


Summary

Outlook

• Exploit knowledge of MPI-internal messaging inside collective operations using PERUSE
• Leverage periodic offset measurements at global synchronization points


Thanks!
