dist-gem5: Distributed Simulation of
Computer Clusters
Illinois: Mohammad Alian, Prof. Nam Sung Kim
ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap
Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA
1 Oct 2017
▪ Introduction
▪ dist-gem5 architecture
▪ Packet forwarding; Synchronization; Checkpointing; Network simulation
▪ Validation; Speedup
-- 10:15 AM to 10:45AM -- Break --
▪ Getting started with dist-gem5
▪ Prerequisites; Compiling; Running example script
▪ Launch script walk through; Checkpointing/restoring
▪ Network modeling
▪ Debugging
▪ Preparing benchmarks
▪ Apache bench
▪ Demo
Programme
▪ Definition
▪ A cluster of computers that communicate and interact with each other by passing messages over
the network to process given tasks.
▪ Examples
▪ Datacenters, supercomputers
Distributed computer systems
The IBM Blue Gene/P supercomputer "Intrepid"
at Argonne National Laboratory runs 164,000 processor
cores in 40 racks/cabinets connected by a high-speed 3-
D torus network.
A Google datacenter
▪ To maximize performance and/or energy-efficiency, we must capture the intricate
interplay amongst computers and their HW/SW sub-systems, especially due to
communications and interactions w/ each other by passing messages over the network
Exploring and optimizing distributed computer systems
[Figure: clients and servers exchange requests and responses over a network. Plot: BW(rx), BW(tx), U(core) and F(core) over time (s); axes: utilization and frequency (GHz)]
Using physical computers
▪ Advantage
▪ Fast evaluations for large-scale distributed computer systems
▪ Disadvantage
▪ Limited design space exploration (unable to explore distributed computer systems based on future
processor and computer sub-systems architectures that have not been developed yet)
Using queuing-theoretic models
▪ Advantage
▪ Simple and fast evaluations for large-scale distributed computer systems
▪ Disadvantage
▪ Inaccurate/misleading evaluations (unable to capture complex interplay b/w HW/SW sub-systems of
computers)
Past methods exploring distributed computer systems [1]
Using existing (full-system) simulators
▪ Advantage
▪ More flexible design space exploration than physical computer systems
▪ More precise evaluation than queuing-theoretic models
▪ Disadvantage
▪ gem5: limited scalability w/ slow evaluation (legacy gem5)
▪ Not flexible (SST + gem5)
▪ Proprietary and limited to x86 (COTSON)
Past methods exploring distributed computer systems [2]
▪ Evaluating performance and power dissipation of a distributed system
▪ Complex interplay among system components at scale
▪ Demanding a full-system, cycle-level simulator which is fast enough to simulate a large-
scale computer system
▪ Enabling distributed simulation:
▪ Simulation of a distributed computer system w/ many simulation hosts
dist-gem5
[Figure labels: scale, OS, ISAs, caches, memory, network, devices, cores, performance, power]
▪ Product of excellent synergistic collaboration b/w industry and academia
▪ Integrating the best features of the concurrently developed multi-gem5 from ARM and pd-gem5 from U.
of Illinois for fast and deterministic simulations of distributed computer systems
History of dist-gem5 development
[Figure: pd-gem5 (U. of Illinois) and multi-gem5 (ARM Research) merge into dist-gem5]
[Best Paper Finalist] M. Alian, et al., “dist-gem5: Distributed Simulation of Computer Clusters,” IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017.
M. Alian, et al., “pd-gem5: Simulation Infrastructure for Parallel/Distributed Computer Systems,” IEEE Computer Architecture Letters, vol. 15, no. 1, 2016.
Example of research w/ dist-gem5
Datacenter power management algorithm
▪ Desired P/C-state governor
▪ react to change in core utilization in a timely manner
▪ Approaches
▪ predict changes in core utilization
▪ core utilization is highly correlated w/ network activity
▪ Hide P/C-state transition latency
▪ overlap P/C-state transition w/ packet reception
and processing
[Figure: BW(rx) vs. U(core) plot; packet receive path with labels NIC, PCIe, RC, MC, DMA, DRAM, rx_desc_ring, interrupt handler, SoftIRQ, network stack, skb, copy to user, CPU]
[Nominated for the Best Paper Award] M. Alian, et al., “NCAP: Network-Driven, Packet
Context-Aware Power Management for Client-Server Architecture,” IEEE International
Symposium on High-Performance Computer Architecture (HPCA), February 2017.
Other promising research directions
▪ Exploring HW/SW cross-layer approaches for datacenter computers and their sub-
systems
▪ Exploiting information from network HW/SW layers as hints for efficient management of computer
resources (e.g., prefetching pages from slow to fast memory in a hybrid memory system)
▪ Off-loading simple data-intensive operations to network interface cards (NICs)
▪ Developing efficient evaluation methodologies for large-scale distributed computer
systems
▪ Exploring systematic hybrid evaluation approaches that judiciously mix queuing-theoretic modeling
and dist-gem5-based simulation to efficiently evaluate VERY large-scale distributed
computer systems (e.g., obtaining detailed parameters for a queuing-theoretic analytical model using
dist-gem5)
Michigan m5 + Wisconsin GEMS = gem5
“The gem5 simulator is a modular platform for computer-system architecture research,
encompassing system-level architecture as well as processor microarchitecture.”
What is gem5?
Level of detail
▪ HW Virtualization
▪ No or very limited timing
▪ The same Host/guest ISA
▪ Functional mode
▪ No timing, chain basic blocks of instructions
▪ Can add cache models for warming
▪ Timing mode
▪ Single time for execute and memory lookup
▪ Advanced on bundle
▪ Detailed mode
▪ Full out-of-order, in-order CPU models
▪ Hit-under-miss, reordering, …
[Figure: spectrum of simulation detail vs. speed]
▪ Cycle accurate (1–50 KIPS): RTL simulation
▪ Approximately timed (0.2–3 MIPS): gem5
▪ Loosely timed (50–200 MIPS): Qemu
▪ HW virt. (GIPS): gem5 + kvm
▪ Use cases along the spectrum: µarch exploration, HW validation, perf. validation, high-level perf./power, architecture exploration, SW dev
When not to use gem5
▪ Performance validation
▪ gem5 is not a cycle-accurate microarchitecture model!
▪ This typically requires more accurate models such as RTL simulation.
▪ Commercial products such as ARM CycleModels operate in this space.
▪ Core microarchitecture exploration
▪ Only do this if you have a custom, detailed, CPU model!
▪ gem5’s core models were not designed to replace more accurate microarchitectural models.
▪ To validate functional correctness or test bleeding-edge ISA improvements
▪ gem5 is not as rigorously tested as commercial products.
▪ New (ARMv8.0+) or optional instructions are sometimes not implemented
▪ Commercial products such as ARM FastModels offer better reliability in this space.
Why gem5?
▪ Runs real workloads
▪ Analyze workloads that customers use and care about
▪ … including complex workloads such as Android
▪ Comprehensive model library
▪ Memory and I/O devices
▪ Full OS, Web browsers
▪ Clients and servers
▪ Rapid early prototyping
▪ New ideas can be tested quickly
▪ System-level impact can be quantified
▪ System-level insights
▪ Enables us to study complex memory-system interactions
▪ Can be wired to custom models
[Screenshots: Ubuntu (Linux 4.x), Android Nougat]
• Some timing
• Caches
• No BPs
• Fast
• Some timing
• Caches
• Limited BPs
• Fast
• Full timing
• Caches
• Branch predictors
• Slow
• No timing
• No caches
• No BP
• Really fast
CPU models overview
BaseCPU
BaseKvmCPU TraceCPU BaseSimpleCPU
AtomicSimpleCPU
TimingSimpleCPU
DerivO3CPU MinorCPU
X86KvmCPU
ArmV8KvmCPU
Discrete event based simulation
▪ Discrete: Handles time in discrete steps
▪ Each step is a tick
▪ Usually 1 THz in gem5 (1 tick = 1 ps)
▪ Simulator skips to the next event on the timeline
[Figure: event timeline; MyObj::startup() schedules events, and the simulator calls each event handler at its scheduled time]
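The skip-to-next-event behaviour can be illustrated with a toy event queue. This is a minimal sketch in plain Python, not gem5 code: the `EventQueue` class, its `schedule()` and `run()` methods, and the `startup()` function are hypothetical names chosen for this example.

```python
import heapq

class EventQueue:
    """Toy discrete-event queue; ticks are integers (1 tick = 1 ps at 1 THz)."""

    def __init__(self):
        self.cur_tick = 0
        self._heap = []   # entries: (tick, sequence number, handler)
        self._seq = 0     # tie-breaker keeps same-tick events in insertion order

    def schedule(self, tick, handler):
        if tick < self.cur_tick:
            raise ValueError("cannot schedule an event in the past")
        heapq.heappush(self._heap, (tick, self._seq, handler))
        self._seq += 1

    def run(self):
        # Jump straight to the next scheduled tick; idle ticks cost nothing.
        while self._heap:
            tick, _, handler = heapq.heappop(self._heap)
            self.cur_tick = tick
            handler(self)

log = []

def startup(q):
    # A startup routine schedules the first events; handlers are called later.
    q.schedule(1000, lambda q: log.append(("first", q.cur_tick)))
    q.schedule(5000, lambda q: log.append(("second", q.cur_tick)))

q = EventQueue()
q.schedule(0, startup)
q.run()
print(log)
```

The loop never visits ticks 1 to 999; simulated time advances only when an event is popped.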
Simulating a distributed system
Distributed gem5 Simulation – high level view
▪ gem5 processes modeling full systems run in parallel on a cluster of host machines
▪ Packet forwarding engine
▪ Forward packets among the simulated systems
▪ Synchronize the distributed simulation
▪ Simulate network topology
[Figure: host machines #1, #2 and #3 each run a gem5 process modeling simulated systems #1, #2 and #3, connected by packet forwarding]
Core components
▪ Packet forwarding
▪ Distributed checkpointing
▪ Synchronization
▪ Simulated network
dist-gem5 architecture – packet forwarding
[Figure: physical hosts #1, #2 and #3, each with a physical NIC, are attached to ports 1 to 3 of a physical switch. gem5 #1 and gem5 #2 each model a simulated system with a simulated NIC; gem5 #3 models the simulated switch with simulated ports 0 and 1]
▪ Simulated packets are embedded into host TCP/IP packets
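Embedding a simulated frame in a host TCP stream needs some framing so the receiver can find message boundaries. The sketch below shows one possible length-prefixed layout; the header fields (send tick, payload length) and the names `HEADER`, `encapsulate()` and `decapsulate()` are assumptions for illustration and do not describe the actual dist-gem5 wire format.

```python
import struct

# Hypothetical header: send tick (uint64) + payload length (uint32), network byte order.
HEADER = struct.Struct("!QI")

def encapsulate(send_tick, frame):
    """Prefix a simulated Ethernet frame with the assumed header."""
    return HEADER.pack(send_tick, len(frame)) + frame

def decapsulate(stream):
    """Return (send_tick, frame, remaining bytes); frame is None if incomplete."""
    if len(stream) < HEADER.size:
        return None, None, stream
    send_tick, length = HEADER.unpack_from(stream)
    end = HEADER.size + length
    if len(stream) < end:
        return None, None, stream
    return send_tick, stream[HEADER.size:end], stream[end:]

# Two messages back to back, as they might appear in a TCP byte stream.
stream = encapsulate(100, b"abc") + encapsulate(200, b"de")
tick1, frame1, rest = decapsulate(stream)
tick2, frame2, rest2 = decapsulate(rest)
print(tick1, frame1, tick2, frame2, rest2)
```

Because TCP is a byte stream, a partial read simply returns `None` for the frame and leaves the buffer untouched until more bytes arrive.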
Asynchronous processing of incoming messages
▪ Simulation thread (main thread)
▪ Process/insert events in the event queue
▪ In case of send pkt event, encapsulate the simulated
Ethernet packet in a message and send it out
▪ Receiver thread
▪ Created for each gem5 process
▪ Waits for incoming packets
▪ Creates a recv pkt event and inserts it into the event queue
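[Figure: inside a gem5 process on a physical host, the simulation thread handles send pkt events from the eventQ, and the receiver thread inserts recv pkt events for messages arriving at the physical NIC]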
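The division of labour between the two threads can be sketched with the standard library. Here a `queue.Queue` stands in for the host socket and a lock-protected list stands in for the event queue; all names are hypothetical and the real implementation is C++ inside gem5.

```python
import queue
import threading

incoming = queue.Queue()         # stands in for the host TCP socket
event_q = []                     # (receive tick, frame) events for the simulation thread
event_q_lock = threading.Lock()

def receiver_loop():
    while True:
        msg = incoming.get()     # blocks until a message arrives
        if msg is None:          # sentinel: shut the receiver down
            break
        recv_tick, frame = msg
        with event_q_lock:       # the event queue is shared with the simulation thread
            event_q.append((recv_tick, frame))

receiver = threading.Thread(target=receiver_loop, daemon=True)
receiver.start()

# Messages may arrive in any wall-clock order.
incoming.put((2000, b"frame-B"))
incoming.put((1500, b"frame-A"))
incoming.put(None)
receiver.join()

with event_q_lock:
    pending = sorted(event_q)    # the simulation thread processes events in tick order
print(pending)
```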
▪ What is the correct tick for the receive event?
▪ st : send tick
▪ lat: simulated link latency
▪ bw: simulated link bandwidth (bytes/tick)
▪ size: simulated packet size (bytes)
▪ rt: receive tick
rt = st + lat + size / bw
▪ Accurate simulation
▪ rt >= curTick() when the receiver gem5 gets the real message encapsulating the simulated packet
▪ receiver gem5 can schedule the receive event for the simulated NIC
Simulation accuracy and packet forwarding
[Figure: event queue from head to tail, ordered by time; the receive frame event is inserted after curTick]
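As a worked example of the receive-tick formula, the sketch below assumes 1 tick = 1 ps and a 10 Gb/s link, i.e. 1/800 bytes per tick. The function names are hypothetical, and rounding the serialization time up to a whole tick is an assumption of this sketch.

```python
from fractions import Fraction
import math

def receive_tick(st, lat, size, bw):
    """rt = st + lat + size / bw, with bw in bytes per tick."""
    # Exact rational arithmetic avoids float rounding; ceil() is this sketch's choice.
    return st + lat + math.ceil(Fraction(size) / Fraction(bw))

def can_schedule(rt, cur_tick):
    """The receiver can schedule the event only if rt is not already in its past."""
    return rt >= cur_tick

# 10 Gb/s = 1.25e9 bytes/s = 1/800 bytes per 1 ps tick.
bw = Fraction(1, 800)
rt = receive_tick(st=5_000_000, lat=1_000_000, size=1500, bw=bw)
print(rt)
print(can_schedule(rt, cur_tick=6_000_000), can_schedule(rt, cur_tick=8_000_000))
```

A 1500-byte frame takes 1,200,000 ticks to serialize, so the frame sent at tick 5,000,000 over a 1 µs link is due at tick 7,200,000. A receiver that has already reached tick 8,000,000 cannot schedule it accurately.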
Need for synchronization
• Receiver gem5 can run ahead of
sender gem5
✓Physical host mismatch
✓Different events to be processed
• Slow down the receiver gem5 to
ensure simulation accuracy
• Quantum-based synchronization
[Figure: wall-clock timelines of gem5#0 and gem5#1 showing send time, simulated network delay, expected delivery time and recv time; the packet arrives late]
Accurate packet forwarding
• quantum: interval for periodic synchronization in simulated time
• Sync-event flushes inter gem5 communication channels
• If quantum ≤ simulated link delay:
✓expected delivery tick falls inside the next quantum
• Optimal quantum size for accurate forwarding == simulated link delay
[Figure: wall-clock timelines of gem5#0 and gem5#1 divided into quanta by global syncs, showing send time, packet arrival wall clock time, simulated network delay and expected delivery time]
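The quantum condition can be checked numerically. The sketch below uses hypothetical helper names and ignores serialization time, which would only push the delivery tick later. It assumes the sync at the end of the sender's quantum flushes the message to the receiver.

```python
def next_sync_tick(tick, quantum):
    """First periodic sync boundary strictly after `tick`."""
    return (tick // quantum + 1) * quantum

def delivered_in_time(send_tick, link_delay, quantum):
    # The receiver holds the message by the sync that ends the sender's quantum,
    # so accuracy needs the expected delivery tick at or after that sync.
    return send_tick + link_delay >= next_sync_tick(send_tick, quantum)

# quantum == link delay: every send tick is safe.
safe = all(delivered_in_time(s, 1000, 1000) for s in range(0, 5000, 7))
# quantum > link delay: a packet sent early in the quantum is due before the sync.
unsafe = delivered_in_time(100, 500, 1000)
print(safe, unsafe)
```

Shrinking the quantum below the link delay stays accurate but adds syncs, which is why the slide names quantum == link delay as the optimum.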
▪ Simulation progress gets stopped at each sync
tick in each gem5 process
▪ Simulated compute node
▪ Sends out ‘sync request’ message
▪ Waits until ‘sync ack’ message comes back
▪ Simulated switch
▪ Waits until it receives a ‘sync request’ message
▪ Sends out ‘sync ack’ message
Compute nodes, switch and synchronization
[Figure: three compute node gem5 processes connected to one Ethernet switch gem5 process]
▪ A vanilla global gem5 event is scheduled at each sync tick in each gem5 process
▪ A global gem5 event is a transparent thread barrier (in case of multiple simulation threads)
▪ dist-gem5 global sync is prepared to work with multi-queue/multi-threaded gem5 simulations
The global sync event
▪ The process() method in a compute node
▪ sends out ‘sync request’ messages for each
simulated link
▪ waits on a condition variable to get notified
about completion by the receiver thread
▪ The process() method in a switch
▪ waits for completion notification from the
receiver thread
▪ sends out ‘sync ack’ messages for each
simulated link
▪ Receiver thread keeps processing incoming messages while simulation thread is blocked
▪ creates receive events in the event queue for simulated Ethernet frames
▪ in a compute node: notifies blocked simulation thread when ‘sync ack’ messages arrive
▪ in a switch: notifies blocked simulation thread when ‘sync request’ messages arrive
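The request/ack handshake behaves like a barrier built on a condition variable. The sketch below models it with threads inside one Python process; the `SyncChannel` class and its methods are hypothetical, and shared counters stand in for the ‘sync request’ and ‘sync ack’ messages that dist-gem5 exchanges between processes.

```python
import threading

class SyncChannel:
    """Counts 'sync request' messages at the switch and releases nodes on 'sync ack'."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.cond = threading.Condition()
        self.requests = 0
        self.epoch = 0     # number of completed global syncs

    def node_sync(self):
        # Compute node: send 'sync request', then block until 'sync ack'.
        with self.cond:
            my_epoch = self.epoch
            self.requests += 1
            self.cond.notify_all()
            self.cond.wait_for(lambda: self.epoch > my_epoch)

    def switch_sync(self):
        # Switch: wait for a request from every node, then ack them all.
        with self.cond:
            self.cond.wait_for(lambda: self.requests == self.num_nodes)
            self.requests = 0
            self.epoch += 1
            self.cond.notify_all()

ROUNDS, NODES = 4, 3
chan = SyncChannel(NODES)
completed = [0] * NODES

def node(i):
    for _ in range(ROUNDS):
        chan.node_sync()
        completed[i] += 1

def switch():
    for _ in range(ROUNDS):
        chan.switch_sync()

threads = [threading.Thread(target=node, args=(i,)) for i in range(NODES)]
threads.append(threading.Thread(target=switch))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(chan.epoch, completed)
```

No node can start round k+1 before every node has requested round k, which is the execution-barrier property the global sync relies on.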
▪ Checkpoint support for dist-gem5 relies on the mainline gem5 checkpoint support
▪ Each gem5 process of a dist-gem5 run creates its own checkpoint
▪ dist-gem5 adds an extra co-ordination layer to ensure correctness
▪ No in-flight message may exist among gem5 processes when the distributed checkpoint is taken
Distributed checkpointing
▪ Mainline gem5 checkpoint: m5 checkpoint pseudo inst → exitSimLoop() → drain() → serialize() → drainResume() → simulate()
▪ dist-gem5 checkpoint co-ordination: m5 checkpoint pseudo inst → global sync → exitSimLoop() → drain() → global sync → serialize() → drainResume() → simulate()
Distributed checkpointing (cont.)
▪ Checkpoint can only be initiated at a periodic global sync
▪ Simplifying implementation without sacrificing usability
| Checkpoint flavour | collaborative checkpoint | immediate checkpoint |
|---|---|---|
| Condition | all compute nodes signal intent | at least one compute node signals intent |
| Example use case | Instrumented MPI application source code to take a checkpoint at the MPI_Barrier() before ROI | Taking a checkpoint from the bootscript before starting an MPI application (i.e. before calling ‘mpirun’) |
Checkpoint @ global sync
▪ In practical use cases a distributed checkpoint is taken “near” an application barrier (e.g.
MPI_Barrier() or mpirun)
▪ We want to take the checkpoint when all processes hit the barrier in the application code =>
desired application state can be captured even if we allow checkpoint writes only at global sync
▪ At a global sync
▪ A compute node gem5 process can signal its intention to take a checkpoint
▪ ‘m5 checkpoint’ pseudo instruction => ‘need checkpoint’ meta info in the next ‘sync request’ message
▪ The switch gem5 process can command all gem5 processes to write a checkpoint
▪ ‘write checkpoint’ meta info in the ‘sync ack’ message => exitSimLoop() in all gem5 processes
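The switch-side decision for the two checkpoint flavours reduces to an all/any test over the ‘need checkpoint’ flags carried by the sync requests of one global sync. The function below is a hypothetical sketch of that rule only; how a flavour is selected and how the flags are stored are not specified in the slides.

```python
def switch_decision(need_checkpoint_flags, flavour):
    """Return True if the 'sync ack' should carry 'write checkpoint'."""
    if flavour == "collaborative":
        # Every compute node must have signalled intent.
        return all(need_checkpoint_flags)
    if flavour == "immediate":
        # One signalling compute node is enough.
        return any(need_checkpoint_flags)
    raise ValueError(f"unknown flavour: {flavour}")

flags = [True, False, True]   # one flag per compute node, from this sync's requests
print(switch_decision(flags, "collaborative"), switch_decision(flags, "immediate"))
```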
Writing a checkpoint
▪ Distributed checkpoint can start only at a global sync
▪ Draining may require a different number of ticks in each gem5
▪ After drain completes, we flush out in-flight messages with an extra global sync
▪ Global sync implements both an execution and a data (message) barrier
[Figure: wall-clock timelines of gem5#0 and gem5#1; the dist checkpoint starts at a global sync, draining takes d0 and d1 ticks, an extra global sync follows, then each process writes its checkpoint. q: sync quantum ticks; d0, d1: drain ticks; remaining intervals q – d0 and q – d1]
Restoring from a checkpoint
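[Figure: wall-clock timelines of two gem5 processes restoring from a checkpoint written after draining d0 and d1 ticks; an extra global sync aligns the ticks (d’), after which periodic global syncs occur every q’ ticks. q’: sync quantum ticks; d0, d1: drain ticks]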
▪ Checkpoint might be written at
different ticks in different gem5
processes
▪ An additional global sync to align
the ticks:
d0 + d’ = d1
▪ Global sync delivers the max tick
value to all gem5 processes
▪ Periodic global sync always happens
at the same tick in every gem5
▪ Global sync period may change at
restore
▪ Same checkpoint can be used to
explore different network link
latency/bandwidth effects
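The tick alignment at restore can be written out as arithmetic: the global sync delivers the maximum tick, and each process advances by the difference. `align_ticks()` is a hypothetical helper for illustration.

```python
def align_ticks(checkpoint_ticks):
    """Return (common start tick, extra ticks d' each process must advance)."""
    target = max(checkpoint_ticks)      # the global sync delivers the max tick
    return target, [target - t for t in checkpoint_ticks]

# Two processes checkpointed after draining d0 = 300 and d1 = 750 ticks past a sync.
target, extra = align_ticks([1_000_300, 1_000_750])
print(target, extra)
```

For the first process d0 + d’ = 300 + 450 = 750 = d1, matching the relation on the slide; the process that drained longest advances by zero.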
▪ User is allowed to change simulated link parameters when restoring from a checkpoint
▪ Same checkpoint can be used to explore different network link latency/bandwidth effects
▪ Global sync period may change at restore (if the simulated link latency changes)
▪ Checkpoint may contain simulated packets to get received in the future
▪ Receive ticks for such packets need to be adjusted to reflect the change of the simulated link
parameters
Restoring from a checkpoint (cont.)
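One plausible way to adjust pending receive ticks is to shift each by the change in link latency. The slides do not say how dist-gem5 computes the adjustment, so the rule and the function name below are assumptions, and the sketch covers a latency change only (not a bandwidth change).

```python
def adjust_receive_ticks(pending, old_lat, new_lat):
    """Shift each pending packet's receive tick by the change in link latency."""
    # Assumed rule: rt' = rt - old_lat + new_lat (latency change only).
    return [(rt - old_lat + new_lat, frame) for rt, frame in pending]

# Packets stored in the checkpoint with receive ticks computed for a 1 µs link.
pending = [(1_200_000, b"a"), (1_450_000, b"b")]
adjusted = adjust_receive_ticks(pending, old_lat=1_000_000, new_lat=2_000_000)
print(adjusted)
```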
dist-gem5 architecture – network modeling
[Figure: 64 servers (#0 to #63) in groups of eight, each group attached to one of eight top of rack switches (#0 to #7), which all connect to one aggregate switch; annotation: “simulate in one gem5 process”]
▪ Configurable baseline Ethernet switch model
▪ Port number, delay, bandwidth, buffer size
Configurable network model
[Figure: on one physical host, a gem5 process simulates the aggregate switch and top of rack switches #0 to #7 as simulated etherSwitch objects with simulated ports p0 to p8, joined by simulated etherLinks; distEtherLinks attach to ports p0 to p7. Inside an etherSwitch: MAC table, input ports IPORT#0 to IPORT#n, in-order queues In-orderQ#0 to In-orderQ#n, output ports OPORT#0 to OPORT#n]
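The role of the MAC table in an Ethernet switch can be sketched with textbook learning-switch behaviour. This is not the gem5 etherSwitch implementation: `ToySwitch` is a hypothetical class, and it leaves out the delay, bandwidth and buffer-size parameters listed above.

```python
class ToySwitch:
    """Textbook learning switch: learn source ports, flood unknown destinations."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}   # MAC address -> port it was last seen on

    def forward(self, in_port, src_mac, dst_mac):
        """Return the list of output ports for a frame arriving on in_port."""
        self.mac_table[src_mac] = in_port            # learn the sender's port
        out = self.mac_table.get(dst_mac)
        if out is None:                              # unknown destination: flood
            return [p for p in range(self.num_ports) if p != in_port]
        return [out] if out != in_port else []

sw = ToySwitch(num_ports=3)
first = sw.forward(0, "A", "B")    # B unknown, so flood to the other ports
reply = sw.forward(1, "B", "A")    # A was learned on port 0
second = sw.forward(0, "A", "B")   # B is now known on port 1
print(first, reply, second)
```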
Deterministic simulation
![Page 45: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/45.jpg)
Deterministic execution issues
▪ We assume that a single compute node gem5 simulation is deterministic
▪ Ordering and speed of dist-gem5 messages vary in the real world:
▪ Speed of gem5 processes (relative to each other) may vary
▪ Communication speed among gem5 processes may vary
▪ Global sync guarantees deterministic packet forwarding
▪ sync quantum <= simulated link latency
▪ global sync is a message barrier
![Page 46: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/46.jpg)
Global sync and deterministic packet forwarding
[Figure: timelines of gem5#0, gem5#1 and gem5#2 against wall clock time, divided into quanta by global syncs. A packet sent at send tick#1 is due at receive tick#1, n ticks later; likewise for send tick#2 and receive tick#2. Arrows mark message delivery in wall clock time.]
q: global sync period in ticks (quantum)
n: simulated link latency in ticks
▪ The receive tick for a simulated packet may not fall within the same quantum in which the message is received
▪ A message is always sent and received within a single quantum
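A minimal sketch of why `sync quantum <= simulated link latency` is sufficient, assuming global syncs fall on ticks that are multiples of q. The helper names and tick values are illustrative and are not gem5 code. A packet sent at tick t is due at t + n, and the barrier that closes the sending quantum comes no later than t + q, so with q <= n the message has always crossed a barrier by the time the receiver reaches the receive tick.

```python
def quantum_end(tick, q):
    """Tick of the global sync (barrier) that closes the quantum containing `tick`."""
    return (tick // q + 1) * q


def delivered_in_time(send_tick, q, n):
    """True if the barrier closing the send quantum is no later than the receive tick."""
    receive_tick = send_tick + n  # n: simulated link latency in ticks
    return quantum_end(send_tick, q) <= receive_tick


n = 1000                  # link latency in ticks (illustrative value)
q_ok, q_bad = 1000, 1500  # quantum <= n versus quantum > n

# With q <= n every send tick is covered by a barrier before its receive tick.
always_in_time = all(delivered_in_time(t, q_ok, n) for t in range(5 * q_ok))

# With q > n some packets would be due before the next barrier.
late_ticks = [t for t in range(2 * q_bad) if not delivered_in_time(t, q_bad, n)]

print(always_in_time, len(late_ticks))
```

With q larger than n, packets sent early in a quantum are due before the next barrier, which is the case the constraint rules out.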
![Page 47: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/47.jpg)
Validation and speedup
[Best Paper Finalist] M. Alian, et al., “dist-gem5: Distributed Simulation of Computer Clusters,”
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017.
![Page 48: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/48.jpg)
Methodology – simulation techniques
▪ For example, simulating a cluster w/ 7 nodes and 1 network switch:
[Figure: three ways to map the simulated systems (system#0 to system#6) and the switch onto quad-core physical hosts.
single-threaded-gem5: one gem5 process (gem5#0) on one host simulates all seven systems and the switch.
parallel-gem5: eight gem5 processes (gem5#0 to gem5#6 for system#0 to system#6, gem5#7 for the switch) run on one host.
dist-gem5: the gem5 processes are spread over two hosts, four processes per host.]
![Page 49: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/49.jpg)
Methodology – experimental setup
▪ Focus on off-chip network performance using network intensive applications
▪ iperf, memcached, httperf, tcptest, netperf, NAS parallel benchmark
▪ Verification/validation against:
▪ Single-threaded-gem5
▪ Physical cluster
▪ 4 node cluster w/ AMD A10-5800K
▪ Speedup comparison against:
▪ Single-threaded-gem5
▪ Parallel-gem5

| category | gem5 configuration |
|---|---|
| O3 core | 4 cores; 4 way superscalar |
| memory | 8GB DDR3 1600 MHz |
| network | Intel GbE NIC; 1 μs link latency |
| OS | Linux Ubuntu 14.04 (Kernel 4.3) |
![Page 50: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/50.jpg)
Validation – network latency and bandwidth
▪ iperf (left) and memcached (right)
▪ Follows the behavior of the physical setup
▪ 17.5% lower response time for memcached
[Figure, left: latency (ms) vs. bandwidth (Gbps, 0.4 to 1.1) for dist-gem5 and phys.]
[Figure, right: latency (ms) vs. memcached distribution percentile (1 to 95) for dist-gem5 and phys.]
![Page 51: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/51.jpg)
Speedup – simulation time reduction
▪ Running httperf on each simulated node, sending a fixed number of requests to a unique simulated node (Apache server)
▪ Compared with single-threaded-gem5
▪ dist-gem5 simulating 63 nodes on 16 physical hosts is
▪ 83.1× faster than single-threaded-gem5
▪ 12.8× faster than parallel-gem5
[Figure: speedup (normalized to single-threaded-gem5) vs. number of simulated nodes. Bar labels:]

| Number of simulated nodes | 3 | 7 | 15 | 31 | 63 |
|---|---|---|---|---|---|
| dist-gem5 | 2.7 | 6.3 | 21.8 | 36.0 | 83.1 |
| parallel-gem5 | 2.7 | 3.7 | 6.6 | 6.0 | 6.5 |

speedup of parallel-gem5 saturates!
![Page 52: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/52.jpg)
Scalability – simulation time vs. simulated cluster size
▪ Simulation time increase for simulating 63 vs. 3 nodes:
▪ 57.3× for single-threaded-gem5
▪ 23.9× for parallel-gem5
▪ 1.9× for dist-gem5
[Figure: normalized simulation time (log scale, 1.0 to 100.0) vs. number of simulated nodes for dist-gem5, parallel-gem5 and single-threaded-gem5. Data labels for single-threaded-gem5: 1.0, 2.6, 9.4, 25.0, 57.3. The remaining labels (1.4, 1.9, 1.9, 3.9, 11.2, 23.9) belong to the dist-gem5 and parallel-gem5 series.]
dist-gem5 scales well!
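As a rough cross-check of the speedup and scalability numbers, assuming both plots come from the same httperf runs: if s(N) is a simulator's speedup over single-threaded-gem5 at N nodes, its time growth from 3 to 63 nodes should be about 57.3 × s(3) / s(63). The sketch below plugs in the reported values. The parallel-gem5 speedups (2.7 at 3 nodes, 6.5 at 63 nodes) are read from the bar labels of the speedup plot and are an assumption; the variable names are illustrative.

```python
# Reported time growth of single-threaded-gem5 when going from 3 to 63 nodes.
single_growth = 57.3

# Speedup over single-threaded-gem5 at 3 and 63 simulated nodes.
speedup_3 = {"dist-gem5": 2.7, "parallel-gem5": 2.7}
speedup_63 = {"dist-gem5": 83.1, "parallel-gem5": 6.5}


def time_growth(simulator):
    """Estimated simulation time growth (63 vs. 3 nodes) for one simulator."""
    return single_growth * speedup_3[simulator] / speedup_63[simulator]


dist_growth = time_growth("dist-gem5")
parallel_growth = time_growth("parallel-gem5")
dist_vs_parallel = speedup_63["dist-gem5"] / speedup_63["parallel-gem5"]

print(f"dist-gem5 growth:     {dist_growth:.1f}x")
print(f"parallel-gem5 growth: {parallel_growth:.1f}x")
print(f"dist vs. parallel:    {dist_vs_parallel:.1f}x")
```

The estimates land close to the reported 1.9×, 23.9× and 12.8×; small differences are expected because the inputs are rounded.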
![Page 53: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/53.jpg)
Synchronization overhead
▪ Sweep synchronization quantum size
▪ The number of HTTP requests remains nearly constant
▪ Maximum 2.6% variance
▪ Almost the same amount of work is done at each quantum size
▪ Simulation time improvement
▪ 4.9% from 0.5 μs to 1 μs
▪ 15.7% from 0.5 μs to 128 μs
[Figure: normalized simulation time and number of requests (K req) vs. synchronization quantum size (0.5 to 128 μs).]
dist-gem5 synchronization is efficient!
![Page 54: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/54.jpg)
![Page 55: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/55.jpg)
![Page 56: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/56.jpg)
Getting started with gem5 FS mode
![Page 57: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/57.jpg)
Download and build gem5
▪ Guest architecture
▪ Several architectures in the source tree.
▪ Most common ones are:
▪ ARM
▪ NULL – Used for trace-driven simulation
▪ X86
▪ Optimization level:
▪ debug: Debug symbols, no/few optimizations
▪ opt: Debug symbols + most optimizations
▪ fast: No symbols + even more optimizations
dist-gem5 currently supports ARM. We have tested x86 and the patches are on their way
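The two choices above combine into a single SCons build target of the form `build/<ISA>/gem5.<variant>`. A minimal sketch of composing that target and the build command; the helper name and the job count are illustrative.

```python
GUEST_ISAS = ("ARM", "NULL", "X86")   # architectures named on the slide
VARIANTS = ("debug", "opt", "fast")   # optimization levels named on the slide


def scons_target(isa, variant):
    """Return the SCons target path for a guest ISA and optimization level."""
    if isa not in GUEST_ISAS:
        raise ValueError(f"unknown ISA: {isa}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    return f"build/{isa}/gem5.{variant}"


# Command to run from the top of the gem5 source tree; -j sets parallel jobs.
build_cmd = ["scons", scons_target("ARM", "opt"), "-j4"]
print(" ".join(build_cmd))
```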
![Page 58: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/58.jpg)
59
Example disk images
▪ Example kernels and disk images can be downloaded from gem5.org/Download
▪ This includes pre-compiled boot loaders
▪ Set the M5_PATH variable to point to the extracted directory
▪ Most example scripts try to find files using M5_PATH
▪ Kernels/boot loaders/device trees in ${M5_PATH}/binaries
▪ Disk images in ${M5_PATH}/disks
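A small sketch of the lookup convention described above, assuming only the two subdirectories named on the slide. The function name and the example path are hypothetical.

```python
import os


def m5_search_dirs(env=None):
    """Return the directories where kernels/boot loaders and disk images are expected."""
    env = os.environ if env is None else env
    root = env.get("M5_PATH")
    if not root:
        raise KeyError("M5_PATH is not set")
    return {
        "binaries": os.path.join(root, "binaries"),  # kernels, boot loaders, device trees
        "disks": os.path.join(root, "disks"),        # disk images
    }


# Example with an explicit environment instead of the real one.
dirs = m5_search_dirs({"M5_PATH": "/opt/gem5-images"})
print(dirs["binaries"])
print(dirs["disks"])
```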
![Page 59: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/59.jpg)
60
Running an example script
▪ Simulates an arm64 system with 4 cores
▪ Uses a functional ‘atomic’ CPU model
▪ Runs the script on the simulated system after Linux boots
▪ Using “init-param” you can set a parameter that is accessible from the simulated system
![Page 60: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/60.jpg)
Sample rcS script and gem5 terminal output
sample rcS script
gem5 terminal output
![Page 61: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/61.jpg)
System overview
▪ gem5 will sketch the system overview for you if you install “pydot” on the host
▪ apt-get install python-pydot
[Figure: generated system overview showing Core0 to Core3, MemCtrl, IO Devices, Ethernet Card and Chipset.]
![Page 62: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/62.jpg)
Getting started with dist-gem5
![Page 63: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/63.jpg)
Getting started with dist-gem5
▪ Before you run a dist-gem5 simulation you need to:
▪ Set up passwordless ssh between the “launch host” and “simulation hosts”
▪ Set LSB_MCPU_HOSTS to map gem5 processes to simulation hosts
▪ The default runs all processes on localhost
▪ Assuming a simulated cluster size of 4, the following will run 2 full-system gem5 processes on 10.10.10.2 and 2 on 10.10.10.3
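The exact LSB_MCPU_HOSTS value from the slide is not reproduced in this transcript. The sketch below assumes the LSF-style convention of alternating host names and slot counts, and shows how such a value could map four full-system ranks to two hosts. The example value and the function names are assumptions, not taken from gem5-dist.sh.

```python
def parse_mcpu_hosts(value):
    """Split 'host count host count ...' into (host, count) pairs."""
    fields = value.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating host names and slot counts")
    return [(fields[i], int(fields[i + 1])) for i in range(0, len(fields), 2)]


def map_ranks_to_hosts(value, n_processes):
    """Assign ranks 0..n_processes-1 to hosts in the order the slots are listed."""
    slots = [host for host, count in parse_mcpu_hosts(value) for _ in range(count)]
    if n_processes > len(slots):
        raise ValueError("not enough slots for the requested processes")
    return {rank: slots[rank] for rank in range(n_processes)}


# Assumed example value: two slots on each of the two simulation hosts.
mapping = map_ranks_to_hosts("10.10.10.2 2 10.10.10.3 2", 4)
for rank, host in mapping.items():
    print(rank, host)
```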
![Page 64: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/64.jpg)
65
Running an example script
dist-gem5 launch script
![Page 65: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/65.jpg)
Running an example script
gem5-dist.sh options
simulated cluster size
gem5 executable
switch node config script
full-system nodes config script
root dir that stores logs and stats
$RUNDIR
log.switch, log.0, log.1, …, log.(N-1): gem5 stderr/stdout
m5out.switch, m5out.0, …, m5out.(N-1): gem5 outdir
![Page 66: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/66.jpg)
Running an example script
simulated cluster size
gem5 executable
switch node config script
full-system nodes config script
root dir that stores logs and stats
full-system nodes arguments
gem5 binary arguments
![Page 67: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/67.jpg)
68
Sample rcS script for a dist-gem5 run
![Page 68: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/68.jpg)
69
Sample rcS script for a dist-gem5 run
assign MAC/IP addr and bring up the NIC
ping other nodes from node 0
![Page 69: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/69.jpg)
70
dist-gem5 output terminal for the example rcS script
node w/ rank 0
rank 1
rank 2
rank 3
![Page 70: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/70.jpg)
dist-gem5 launch script walk-through
![Page 71: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/71.jpg)
gem5-dist.sh script big picture
1. Launch a gem5 process simulating the network switch
2. Wait for the switch process to start
3. Read the dist-iface port# from log.switch
4. Start the full-system gem5 processes
[Figure: the launch script starts the switch process over ssh, reads the switch port # from log.switch, then starts the four Node processes over ssh, passing them the switch port #. Each node connects to the switch through a dist-etherlink.]
Each FS gem5 process will connect to a corresponding switch iface at process startup
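The four steps can be sketched as follows. This is not the gem5-dist.sh implementation: the log line format is hypothetical, and the `--dist-*` option names are written as I understand gem5's dist options, so treat both as assumptions to check against your source tree.

```python
import re

# Hypothetical wording; the real line in log.switch may differ.
PORT_RE = re.compile(r"listening on port (\d+)")


def read_switch_port(log_text):
    """Step 3: return the port number found in the switch log, or None if not there yet."""
    match = PORT_RE.search(log_text)
    return int(match.group(1)) if match else None


def node_commands(gem5_bin, fs_config, n_nodes, sw_host, sw_port):
    """Step 4: build one command line per full-system node (option names assumed)."""
    return [
        [gem5_bin, fs_config, "--dist",
         f"--dist-rank={rank}", f"--dist-size={n_nodes}",
         f"--dist-server-name={sw_host}", f"--dist-server-port={sw_port}"]
        for rank in range(n_nodes)
    ]


# Steps 1 and 2 (launch the switch, wait for it) are represented by two log snapshots.
assert read_switch_port("gem5 Simulator System.\n") is None   # switch still starting
port = read_switch_port("info: tcp_iface listening on port 2200\n")
commands = node_commands("build/ARM/gem5.opt", "fs_config.py", 4, "10.10.10.2", port)
print(port, len(commands))
```

In a real launcher, each command would be started over ssh on the host chosen for that rank.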
![Page 72: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/72.jpg)
73
gem5-dist.sh script walk through
Step 1
Step 2
Step 3
![Page 73: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/73.jpg)
74
gem5-dist.sh script walk through
Step 4
![Page 74: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/74.jpg)
Heterogeneous cluster modeling
▪ Default gem5-dist.sh runs identical full-system gem5 nodes
shared for all full-system nodes
![Page 75: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/75.jpg)
Heterogeneous cluster modeling
▪ Default gem5-dist.sh runs identical full-system gem5 nodes
▪ Not always desirable
▪ We do not always need to simulate the entire cluster with high fidelity
▪ Server node with OOO and clients with atomic CPU
▪ Simulating a heterogeneous cluster
▪ Nodes with different number/type of CPUs
▪ Nodes with different memory size/type
▪ Nodes with different ISAs!
▪ The gem5-dist.sh script can easily be modified to achieve that
▪ Let’s see how to have different arguments for node 0
![Page 76: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/76.jpg)
77
▪ Declare a new variable for node0 arguments (“N0_ARGS”)
gem5-dist.sh changes to support heterogeneous dist runs (1)
![Page 77: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/77.jpg)
78
▪ Define a new option flag “--node0-args” and set “N0_ARGS” from the command line arguments
gem5-dist.sh changes to support heterogeneous dist runs (2)
![Page 78: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/78.jpg)
79
▪ Add “N0_ARGS” to node 0’s arguments
gem5-dist.sh changes to support heterogeneous dist runs (3)
![Page 79: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/79.jpg)
80
Example script with --node0-args
node0 is quad core and the rest are single core
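The idea behind the three script changes can be sketched as below: every node gets the shared arguments, and node 0 additionally gets the contents of N0_ARGS. This is a Python illustration rather than the shell code in gem5-dist.sh; the function name and the example argument values are assumptions.

```python
def args_for_node(rank, shared_args, n0_args):
    """Return the config-script arguments for one full-system node."""
    extra = n0_args if rank == 0 else []
    return list(shared_args) + list(extra)


shared_args = ["--cpu-type=AtomicSimpleCPU"]  # illustrative shared arguments
n0_args = ["--num-cpus=4"]                    # node 0 is quad core

per_node = [args_for_node(rank, shared_args, n0_args) for rank in range(4)]
for rank, args in enumerate(per_node):
    print(rank, " ".join(args))
```

The same pattern extends to a per-rank table of extra arguments if more than one node needs a distinct configuration.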
![Page 80: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/80.jpg)
Other dist-gem5 simulation approaches
▪ The key is the gem5-dist.sh script
▪ You can easily extend it to get the dist-gem5 launches you want
▪ Heterogeneous cluster simulation
▪ Arbitrary gem5 process to physical host/core mapping
▪ …
▪ Support for simulation pool management software
▪ Instead of explicitly mapping processes to nodes and using ssh to run gem5 processes, the cluster management software maps and runs the processes
▪ E.g. an HTCondor version is available at https://publish.illinois.edu/icsl-pdgem5/download/
![Page 81: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/81.jpg)
dist-gem5 checkpointing/restoring
![Page 82: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/82.jpg)
Checkpointing/restoring
▪ Specify the root checkpoint directory using the “-c” option in gem5-dist.sh
▪ All the checkpoints have the same dump tick
▪ You can restore by passing the “--checkpoint-restore” option to all gem5 processes (full-system processes + switch process)
$CKPTDIR
m5out.switch: stores checkpoint files for the gem5 process simulating the network switch
m5out.0: stores checkpoint files for the gem5 process simulating full-system node #0
…
m5out.(N-1)
![Page 83: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/83.jpg)
84
Restoring from a checkpoint
--cf-args will add its options to all the gem5
processes in the simulated cluster
(full-system processes + switch process)
![Page 84: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/84.jpg)
![Page 85: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/85.jpg)
Network topologies
[Figure, star: one gem5 process on a physical host simulates a single top-of-rack switch with ports p0 to p63.]
[Figure, tree: one gem5 process on a physical host simulates top-of-rack switches #0 to #7, each with ports p0 to p7 and a port p8 linked to the aggregate switch (ports p0 to p7).]
Legend: gem5, simulated etherLink, simulated port, distEtherLink, simulated etherSwitch
![Page 86: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/86.jpg)
87
Star network topology config script
Instantiate 64 DistEtherLink SimObjects (dist_size == 64)
![Page 87: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/87.jpg)
Star network topology config script
[Figure: star topology; one gem5 process on a physical host simulates a single top-of-rack switch with ports p0 to p63.]
![Page 88: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/88.jpg)
89
Tree topology config script
Instantiate 8 top of rack and 1 aggregate switch
![Page 89: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/89.jpg)
90
Tree topology config script
Again we need 64 DistEtherLinks
![Page 90: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/90.jpg)
91
Tree topology config script (cont.)
Instantiate 8 aggregate EtherLinks
![Page 91: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/91.jpg)
92
Tree topology config script (cont.)
Connect DistEtherLinks to top of rack switches
![Page 92: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/92.jpg)
93
Tree topology config script (cont.)
Use EtherLinks to connect aggregate and top of rack switches
![Page 93: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/93.jpg)
Tree topology config script (cont.)
[Figure: tree topology; one gem5 process on a physical host simulates top-of-rack switches #0 to #7, each with ports p0 to p7 and a port p8 linked to the aggregate switch (ports p0 to p7).]
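The wiring of the two topologies can be summarized without gem5 itself. The sketch below builds plain link lists rather than SimObjects, assuming that rank r attaches to top-of-rack switch r // 8 on port r % 8 and that each top-of-rack port p8 connects to the aggregate port with the same index as the rack. The function and switch names are illustrative.

```python
def star_links(dist_size=64):
    """One DistEtherLink per node, all on a single switch: (switch, port, rank)."""
    return [("switch", rank, rank) for rank in range(dist_size)]


def tree_links(n_tor=8, ports_per_tor=8):
    """DistEtherLinks to top-of-rack switches plus EtherLinks up to the aggregate."""
    dist_links = [
        (f"tor{rank // ports_per_tor}", rank % ports_per_tor, rank)
        for rank in range(n_tor * ports_per_tor)
    ]
    # Port index ports_per_tor (p8 for 8-port racks) is the uplink.
    uplinks = [(f"tor{k}", ports_per_tor, "aggregate", k) for k in range(n_tor)]
    return dist_links, uplinks


star = star_links()
dist_links, uplinks = tree_links()
print(len(star), len(dist_links), len(uplinks))
```

Both topologies need 64 DistEtherLinks; the tree adds 8 EtherLinks and uses 9 switches instead of 1.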
![Page 94: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/94.jpg)
![Page 95: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/95.jpg)
dist-gem5 debugging checklist
▪ Look into log.switch, log.0, …, log.(N-1)
▪ Unexpected termination of a gem5 process
▪ Segmentation fault, panic, failed connection, …
▪ Normal exit from rcS script
▪ E.g. “info: m5 exit called with non-zero delay” message in a log file
▪ Check m5out.X/system.terminal to find out why “/sbin/m5 exit” was called
▪ Check whether any gem5 process is running on the simulation hosts
▪ Trace-based debugging
▪ Enable some debug flags and look into log files to get more info
▪ gem5-dist.sh:
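A first pass over the log files can be automated. The sketch below only looks for the strings mentioned in the checklist; the exact wording of real gem5 messages may differ, and the function names and sample log contents are hypothetical.

```python
CRASH_MARKERS = ("Segmentation fault", "panic")  # strings taken from the checklist
EXIT_MARKER = "m5 exit called"                   # normal exit from the rcS script


def classify_log(text):
    """Return a coarse verdict for the contents of one log file."""
    if any(marker in text for marker in CRASH_MARKERS):
        return "crash"
    if EXIT_MARKER in text:
        return "normal exit"
    return "no marker found"


def triage(logs):
    """Map each log name (log.switch, log.0, ...) to its verdict."""
    return {name: classify_log(text) for name, text in logs.items()}


# Made-up log contents for illustration.
verdicts = triage({
    "log.switch": "info: Entering event queue\n",
    "log.0": "info: m5 exit called with non-zero delay\n",
    "log.1": "panic: something went wrong\n",
})
print(verdicts)
```

For a "normal exit" verdict, the next step from the checklist is to read m5out.X/system.terminal for that node.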
![Page 96: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/96.jpg)
dist-gem5 debugging checklist (cont.)
▪ GDB debugging
▪ Use the “-debug” option of gem5-dist.sh
▪ Debug each gem5 binary using the gdb debugger
▪ Each gem5 process will open a gdb terminal and run from there
![Page 97: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/97.jpg)
101
4 node simulated cluster debugging using gdb
![Page 98: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/98.jpg)
102
4 node simulated cluster debugging using gdb
![Page 99: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/99.jpg)
![Page 100: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/100.jpg)
104
1. Install/build the benchmark on disk-image
▪ Mount disk-image, copy source, chroot to disk-image, build source OR
▪ Mount disk-image, chroot to disk-image, install application using “apt-get install” OR
▪ Cross compile application, mount disk-image, copy application binary to disk-image
Steps to prepare a benchmark (e.g. apache-bench) on dist-gem5
![Page 101: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/101.jpg)
105
2. Take a checkpoint
3. Restore from checkpoint and run the benchmark
▪ Example rcS for running apache-bench; one master node and multiple slaves:
Steps to prepare a benchmark (e.g. apache-bench) on dist-gem5
![Page 102: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/102.jpg)
Annotating benchmarks for accurate stat collection
▪ Response time of a request is ts1 – ts0
▪ Measuring response time in software disturbs the client application
▪ Client side queueing
▪ Imprecise statistics
▪ Solution:
▪ Use m5 pseudo instructions to measure response time
[Figure: clients and servers exchange requests and responses over the network. On the client/server timeline, the request leaves the client at ts0 and the response arrives back at ts1.]
![Page 103: dist-gem5: Distributed Simulation of Computer Clusters · Advantage Fast evaluations for large-scale distributed computer systems Disadvantage ... RTL simulation High-level perf./power](https://reader031.vdocuments.net/reader031/viewer/2022021901/5b8e18ce09d3f2187e8d0b5b/html5/thumbnails/103.jpg)
107
Annotating benchmarks for accurate stat collection
▪ Call m5_work_begin(req.id) before sending a request
▪ Call m5_work_end(req.id) when receiving a response
[Figure: the client calls m5_work_begin(req.id) as it sends the request and m5_work_end(req.id) when the server's response arrives]
▪ Example output:
work_item_end reports the round-trip latency
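The begin/end pairing described on this slide can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the code shown in the tutorial. The two m5 functions here are local stand-ins that only record each call, so the example builds outside gem5; a real build would include m5op.h and link libm5.a instead. The stand-ins take a work id and a thread id, which appears to match the declarations in gem5's m5op.h (the slide abbreviates this to a single argument), so check the header in your tree. `request_t`, `send_request`, `on_response`, and `trace` are hypothetical names.

```c
#include <stdint.h>

/* Stand-ins for the gem5 m5 ops (assumption: same shape as in m5op.h).
 * They only log calls; in a simulated run gem5 would record the events. */
#define MAX_EVENTS 16
typedef struct {
    char kind;        /* 'B' = work begin, 'E' = work end */
    uint64_t workid;  /* request id passed by the caller */
} event_t;

static event_t trace[MAX_EVENTS];
static int trace_len = 0;

static void m5_work_begin(uint64_t workid, uint64_t threadid) {
    (void)threadid;
    if (trace_len < MAX_EVENTS)
        trace[trace_len++] = (event_t){'B', workid};
}

static void m5_work_end(uint64_t workid, uint64_t threadid) {
    (void)threadid;
    if (trace_len < MAX_EVENTS)
        trace[trace_len++] = (event_t){'E', workid};
}

/* Hypothetical per-request record; each request gets a unique id so that
 * overlapping requests can be matched begin-to-end. */
typedef struct {
    uint64_t req_id;
} request_t;

static uint64_t next_req_id = 0;

/* Mark the start of the work item just before the request is sent. */
static request_t send_request(void) {
    request_t req = { .req_id = next_req_id++ };
    m5_work_begin(req.req_id, 0);
    /* the actual socket write would go here */
    return req;
}

/* Mark the end of the work item as soon as the response is received. */
static void on_response(const request_t *req) {
    /* the actual socket read has just completed */
    m5_work_end(req->req_id, 0);
}
```

Keying both calls on the same unique request id is what allows a response to be matched to its request when several requests are in flight and responses arrive out of order.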
Steps to annotate an application using m5 ops
1. Generate libm5.a for your desired ISA
▪ E.g., for aarch64:
2. Copy “libm5.a” and “m5op.h” to the application’s root directory
3. Include “m5op.h” in the source code of the application
4. Add the desired m5 ops to the application source code
5. Add the “-L. -lm5” flags to the gcc command line and build the application
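Steps 3 to 5 can be sketched as below. This is one possible arrangement, not the tutorial's own code: the `USE_M5OPS` macro, the `WORK_BEGIN`/`WORK_END` wrappers, `handle_one_request`, and the file name `app.c` are hypothetical. The guard is a design choice added here so that the same source still builds natively when m5op.h and libm5.a are absent. The two-argument form of the m5 calls is assumed from gem5's m5op.h.

```c
#include <stdint.h>

/* Step 3: include the gem5 header only in simulator builds.
 * USE_M5OPS is a hypothetical macro, enabled with -DUSE_M5OPS. */
#ifdef USE_M5OPS
#include "m5op.h"
#define WORK_BEGIN(id) m5_work_begin((id), 0)
#define WORK_END(id)   m5_work_end((id), 0)
#else
/* Native build: the annotations compile away to nothing. */
#define WORK_BEGIN(id) ((void)(id))
#define WORK_END(id)   ((void)(id))
#endif

/* Step 4: wrap the region of interest with the desired m5 ops. */
static int handle_one_request(uint64_t req_id) {
    WORK_BEGIN(req_id);
    /* send the request and wait for the response here */
    WORK_END(req_id);
    return 0;
}

/* Step 5: build commands, with gcc standing for whichever (cross)
 * compiler targets the simulated ISA:
 *   native build:  gcc -o app app.c
 *   gem5 build:    gcc -DUSE_M5OPS -o app app.c -L. -lm5
 */
```

Placing `-L. -lm5` after the source file matters with some linkers, because libraries are searched in command-line order.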
Sample ab.c annotation with m5 ops
C->reqId is a unique ID for each request
Thank you
▪ If you have questions, post them to the gem5 mailing lists
▪ Or contact us directly
▪ Mohammad Alian
▪ Gabor Dozsa
▪ Please check the dist-gem5 web page for updates and further resources
▪ https://publish.illinois.edu/icsl-pdgem5/
dist-gem5: Distributed Simulation of
Computer Clusters
Illinois: Mohammad Alian, Prof. Nam Sung Kim
ARM: Gabor Dozsa, Stephan Diestelhorst, Nikos Nikoleris, Radhika Jagtap
Tutorial at IEEE International Symposium on Workload Characterization (IISWC), Seattle, USA
1 Oct 2017