1
Blue Gene Simulator
Gengbin Zheng [email protected]
Gunavardhan Kakulapati [email protected]
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
2
Overview
Blue Gene Emulator
Blue Gene Simulator
Timing correction schemes
Performance and results
3
Emulation on a Parallel Machine
[Figure: simulating (host) processors, each hosting BG/C nodes and their hardware threads]
4
Blue Gene Emulator: functional view
[Figure: one Blue Gene/C node, showing communication threads, worker threads, inBuffer, affinity and non-affinity message queues, and CorrectionQ]
5
Blue Gene Emulator: functional view
[Figure: Converse scheduler and Converse Q, with two emulated nodes, each showing communication threads, worker threads, inBuff, affinity and non-affinity message queues, and CorrectionQ]
6
What is capable …
Blue Gene API support
Blue Gene Charm++
– Structured Dagger
Trace Projections
7
Emulator to Simulator
Emulator:
– Study programming model and application development
Simulator:
– Performance prediction capability
– Models communication latency based on network model
– Doesn’t model memory access on chip, or network contention
8
Simulator
Parallel performance is hard to model
– Communication subsystem
    Out-of-order messages
    Communication/computation overlap
– Event dependencies
Parallel Discrete Event Simulation
– Emulation program executes in parallel with event time stamp correction.
– Exploit inherent determinacy of application
9
How to simulate?
Time stamping events
– Per thread timer (sharing one physical timer)
– Time stamp messages
    Calculate communication latency based on network model
Parallel event simulation
– When a message is sent out, calculate the predicted arrival time for the destination bluegene-processor
– When a message is received, update current time: currTime = max(currTime, recvTime)
– Time stamp correction
10
Time Stamping messages and threads
Thread Timer: curT
Message sent: RecvT(msg) = curT + Latency
Message scheduled: curT = max(curT, RecvT(msg))
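The two rules above can be sketched in a few lines of Python. This is only an illustration, not the simulator's code: the names `Message`, `ThreadTimer`, `send`, and `schedule` are hypothetical, and the `work` parameter (time spent executing the handler) is an assumption added to make the example complete.

```python
from dataclasses import dataclass


@dataclass
class Message:
    payload: str
    recv_time: float = 0.0  # predicted arrival time, RecvT(msg)


class ThreadTimer:
    """Per-thread virtual clock (curT on the slide)."""

    def __init__(self, cur_t: float = 0.0):
        self.cur_t = cur_t

    def send(self, msg: Message, latency: float) -> Message:
        # Message sent: RecvT(msg) = curT + Latency
        msg.recv_time = self.cur_t + latency
        return msg

    def schedule(self, msg: Message, work: float = 0.0) -> None:
        # Message scheduled: curT = max(curT, RecvT(msg))
        self.cur_t = max(self.cur_t, msg.recv_time)
        # Assumption: running the handler advances the clock by `work`.
        self.cur_t += work


sender = ThreadTimer(cur_t=10.0)
receiver = ThreadTimer(cur_t=4.0)
m = sender.send(Message("hello"), latency=5.0)
receiver.schedule(m, work=2.0)
```

Here the receiver's clock jumps from 4 to the predicted arrival time 15 before the handler runs.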
11
Need for timestamp correction
Time stamp correction is needed for out-of-order messages.
Out-of-order delivery can occur:
– A message arrives late while some other message updates the thread time to the future.
– So the late message executes in the context of the future, although its predicted time is earlier.
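A small sketch of the problem, assuming each handler takes one unit of time (the function name `run_in_arrival_order` and the `work` parameter are hypothetical, chosen for illustration):

```python
def run_in_arrival_order(recv_times, work=1.0):
    """Execute messages in physical arrival order.

    Returns a list of (predicted recv time, actual start time) pairs.
    """
    cur_t = 0.0
    trace = []
    for recv_t in recv_times:
        cur_t = max(cur_t, recv_t)  # curT = max(curT, RecvT(msg))
        trace.append((recv_t, cur_t))
        cur_t += work               # assumption: fixed handler cost
    return trace


# The message predicted for t=3 physically arrives after the one predicted for t=10.
trace = run_in_arrival_order([10.0, 3.0])
late_recv, late_start = trace[1]
```

The late message is stamped with a start time of 11 although it was predicted to arrive at 3, which is the error the correction step has to undo.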
12
Parallel correction algorithm
Sort message execution by receive time.
Adjust time stamps when needed.
Use correction messages to inform the change in event startTime.
Send out correction messages following the path the message was sent.
The events already in the timeline may have to move.
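A single-timeline sketch of these steps follows. It is a simplification under stated assumptions: events are plain dictionaries with hypothetical keys `name`, `recv`, `dur`, and `start`, and instead of sending real correction messages the function just returns the names of the events whose start time changed.

```python
def correct_timeline(events):
    """Re-sort one timeline by receive time and recompute start times.

    Returns (sorted timeline, names of events whose start time changed).
    """
    timeline = sorted(events, key=lambda e: e["recv"])
    changed = []
    cur_t = 0.0
    for e in timeline:
        new_start = max(cur_t, e["recv"])
        if new_start != e["start"]:
            e["start"] = new_start
            # In the parallel algorithm, a correction message would now follow
            # each message this event sent, to adjust its receive time.
            changed.append(e["name"])
        cur_t = new_start + e["dur"]
    return timeline, changed


events = [
    {"name": "M1", "recv": 10.0, "dur": 1.0, "start": 10.0},
    {"name": "M2", "recv": 3.0, "dur": 1.0, "start": 11.0},  # executed late
]
timeline, changed = correct_timeline(events)
```

After the call, M2 sits before M1 on the timeline with a start time of 3, and only M2 needs to propagate corrections.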
13
Timestamps Correction
[Figure: messages M1 to M8 on the ExecutionTimeLine, positioned by RecvTime]
14
Timestamps Correction
[Figure: messages M1 to M8 on the ExecutionTimeLine, positioned by RecvTime]
15
Timestamps Correction
[Figure: two ExecutionTimeLines with messages M1 to M8 positioned by RecvTime, and a correction message]
16
Timestamps Correction
[Figure: ExecutionTimeLines with messages M1 to M7 positioned by RecvTime, showing Correction Message (M4) and further correction messages]
17
Linear-order correction
Works only when
– Programs have no alternate orders of execution possible
– Messages are processed in the same order for multiple executions
– E.g., MPI programs with no wildcard recvs, structured-dagger code with no “overlap” or “forall”.
18
Reasons:
Correction algorithm breaks dependency logic
– Only based on receive time
– Cases:
    When an event depends on several messages (the last message triggers the computation)
    Message buffered until some condition holds
Example for invalid correction scheme: Jacobi-1D
19
20
Solution
Use structured dagger to retrieve dependence information.
As the program runs, form a chain of bluegene logs preserving the dependency information.
Bluegene logs for entry functions and structured dagger functions
21
Timestamp correction scheme
Every event has a list of backward and forward dependents.
An event cannot start till its backward dependents have finished.
Define effRecvTime = max(recvTime, endOfBackDeps)
An event can start only after its effRecvTime.
startTime = max(effRecvTime, timeline.last.endTime)
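The two formulas can be written out as a minimal sketch. The `Event` class, its field names, and the `place` helper are hypothetical; an empty timeline is assumed to end at time 0.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    name: str
    recv_time: float
    duration: float
    back_deps: List["Event"] = field(default_factory=list)
    start_time: float = 0.0

    @property
    def end_time(self) -> float:
        return self.start_time + self.duration


def eff_recv_time(ev: Event) -> float:
    # effRecvTime = max(recvTime, endOfBackDeps)
    end_of_back_deps = max((d.end_time for d in ev.back_deps), default=0.0)
    return max(ev.recv_time, end_of_back_deps)


def place(ev: Event, timeline: List[Event]) -> None:
    # startTime = max(effRecvTime, timeline.last.endTime)
    last_end = timeline[-1].end_time if timeline else 0.0
    ev.start_time = max(eff_recv_time(ev), last_end)
    timeline.append(ev)


timeline: List[Event] = []
a = Event("a", recv_time=2.0, duration=3.0)
b = Event("b", recv_time=1.0, duration=1.0, back_deps=[a])
place(a, timeline)
place(b, timeline)
```

Event `b` was received at time 1 but cannot start before its backward dependent `a` ends at time 5.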
22
Timestamp correction scheme
Timeline is not sorted on the recvTime of the event as in the previous case.
Timeline is sorted based on the effRecvTime.
Steps to process a correction message
– Find the earliest updated event due to the message
– Cut the timeline from that event
– Calculate new effRecvTimes from then on.
– Reinsert into the timeline in the order of effRecvTime
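The four steps can be sketched for a single timeline as follows. This is an illustration under several assumptions, not the simulator's implementation: events are dictionaries with hypothetical keys `recv`, `dur`, `start`, and `deps`; all dependencies are local to this timeline; and an event is reinserted only once its backward dependents have been placed.

```python
def end_time(ev):
    return ev["start"] + ev["dur"]


def eff_recv_time(name, events):
    ev = events[name]
    dep_end = max((end_time(events[d]) for d in ev["deps"]), default=0.0)
    return max(ev["recv"], dep_end)


def apply_correction(timeline, events, name, new_recv):
    """timeline: event names in execution order; events: name -> dict."""
    # Step 1: find the earliest event that the corrected time can affect.
    threshold = min(new_recv, events[name]["start"])
    events[name]["recv"] = new_recv
    cut = next(i for i, n in enumerate(timeline)
               if events[n]["start"] >= threshold)
    # Step 2: cut the timeline from that event.
    head, pending = timeline[:cut], timeline[cut:]
    # Steps 3 and 4: recompute effRecvTimes and reinsert in that order.
    while pending:
        placed = set(head)
        ready = [n for n in pending
                 if all(d in placed for d in events[n]["deps"])]
        nxt = min(ready, key=lambda n: eff_recv_time(n, events))
        last_end = end_time(events[head[-1]]) if head else 0.0
        events[nxt]["start"] = max(eff_recv_time(nxt, events), last_end)
        head.append(nxt)
        pending.remove(nxt)
    return head


events = {
    "A": {"recv": 0.0, "dur": 2.0, "start": 0.0, "deps": []},
    "B": {"recv": 5.0, "dur": 2.0, "start": 5.0, "deps": []},
    "C": {"recv": 6.0, "dur": 1.0, "start": 7.0, "deps": []},
    "D": {"recv": 0.0, "dur": 1.0, "start": 8.0, "deps": ["C"]},
}
# A correction message says C's receive time is really 1.0, not 6.0.
new_timeline = apply_correction(["A", "B", "C", "D"], events, "C", 1.0)
```

In this run C moves ahead of B, and D, which depends on C, moves with it.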
23
Non-linear order correction scheme
The new scheme:
– Takes into account the event dependencies
– Works even when messages can be received in different orders in different runs.
– Requires all the dependencies to be captured using structured dagger.
But the timing correction is very slow.
Several optimizations possible.
24
Optimizations to online correction scheme
Overwrite old corrections:
– An event can get multiple correction messages.
– Reduce the number of corrections
– Same scheme if correction message arrives earlier than the message itself
Use multisend
– Messages destined to same real processor but different events can be sent collectively.
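One way to picture "overwrite old corrections" is a buffer that keeps only the newest correction per event. The class below is a hypothetical sketch: it assumes a later correction for the same event always supersedes an earlier one, and it interprets the "arrives earlier than the message itself" case as holding the correction until the message shows up.

```python
class CorrectionBuffer:
    """Keeps only the most recent pending correction for each event."""

    def __init__(self):
        self.pending = {}   # event id -> latest corrected recvTime
        self.received = 0   # total correction messages seen

    def add(self, event_id, new_recv_time):
        self.received += 1
        # Overwrite any older correction for the same event.
        self.pending[event_id] = new_recv_time

    def on_message_arrival(self, event_id, recv_time):
        # If a correction arrived before the message itself, use it.
        return self.pending.pop(event_id, recv_time)

    def drain(self):
        # Hand over all pending corrections for processing.
        batch, self.pending = self.pending, {}
        return batch


buf = CorrectionBuffer()
buf.add("e1", 12.0)
buf.add("e1", 9.0)   # supersedes the first correction for e1
buf.add("e2", 4.0)   # arrives before message e2 does
pending_count = len(buf.pending)
e2_time = buf.on_message_arrival("e2", 7.0)
e3_time = buf.on_message_arrival("e3", 7.0)
batch = buf.drain()
```

Three correction messages arrive, but only one timeline update per event remains to be processed.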
25
More optimizations
Prioritize messages based on their predicted recvTime.
Lazy processing
– Process correction messages periodically.
– Allows corrections to be overwritten.
Batch processing
– Process many correction messages at a time
– Many events will be affected
– Choose the earliest and reinsert in the order of effRecvTime.
Ability to start corrections in the middle
– Can ignore the startup events for timing correction
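Prioritizing by predicted recvTime amounts to a priority queue keyed on that time. A minimal sketch with Python's standard `heapq` module follows; the class name `RecvTimeQueue` is hypothetical, and the FIFO tie-breaking for equal times is an assumption.

```python
import heapq
import itertools


class RecvTimeQueue:
    """Message queue that releases the smallest predicted recvTime first."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker: FIFO among equal times

    def push(self, recv_time, msg):
        heapq.heappush(self._heap, (recv_time, next(self._seq), msg))

    def pop(self):
        recv_time, _, msg = heapq.heappop(self._heap)
        return recv_time, msg

    def __len__(self):
        return len(self._heap)


q = RecvTimeQueue()
q.push(10.0, "a")
q.push(3.0, "b")
q.push(3.0, "c")
order = [q.pop() for _ in range(len(q))]
```

Executing messages in this order reduces the number of events that run "in the future" and later need correction.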
26
Timing correction still very slow.
Observations:
– Don’t let the execution go far ahead of the correction wave.
– A large difference means many wrong events to be corrected.
– Closely following the execution wave also may not help.
A new scheme
– Similar to the one used for GVT (global virtual time)
27
GVT-like scheme
Use heartbeat
– Periodically broadcast asking for gvt
Gvt
– Is the time after which the events are invalid due to pending corrections
– Compute the gvt as the minimum of predicted recvTimes of all correction messages and startTimes of all affected events.
Use a parameter “leash”. Execution of the program cannot go beyond “gvt + leash”.
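A sketch of the gvt computation and the leash check, assuming each real processor reports a local minimum and the global value is the minimum over all of them. The function names and the convention that "no pending corrections" yields infinity are assumptions made for illustration.

```python
def local_gvt(correction_recv_times, affected_start_times):
    """Minimum over pending corrections and affected events on one processor."""
    candidates = list(correction_recv_times) + list(affected_start_times)
    # Assumption: with nothing pending, this processor imposes no limit.
    return min(candidates) if candidates else float("inf")


def may_execute(event_start_time, gvt, leash):
    # Execution cannot go beyond gvt + leash.
    return event_start_time <= gvt + leash


# One heartbeat round over two processors (made-up numbers).
proc0 = local_gvt(correction_recv_times=[40.0, 55.0], affected_start_times=[48.0])
proc1 = local_gvt(correction_recv_times=[], affected_start_times=[62.0])
gvt = min(proc0, proc1)

leash = 20.0
ok_58 = may_execute(58.0, gvt, leash)
ok_61 = may_execute(61.0, gvt, leash)
```

A small leash keeps execution close to the correction wave; a large one lets execution run ahead at the cost of more events to correct.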
28
Projections before correction
29
Projections after correction
30
Correctness of the scheme (using Jacobi1D)
31
Predicted time vs latency factor
32
Predicted speedup
33
More work
Ongoing work
– Make sure gvt scheme is correct
Future work
– The presented scheme is on-line correction
– Explore the off-line (post-mortem) correction scheme using generated traces.