dist-gem5: Distributed Simulation of Compute Clusters
Mohammad Alian, Umur Darbaz, Gabor Dozsa, Stephan Diestelhorst, Daehoon Kim, Nam Sung Kim
University of Illinois Urbana-Champaign
ARM Ltd., Cambridge, UK
Outline
• motivation
  accelerating large-scale simulation
• dist-gem5 architecture
  packet forwarding
  synchronization
  checkpointing
  network model
• evaluation
  validation, speedup, synchronization overhead
• conclusion
dist-gem5 architecture evaluation conclusionwhat is gem5
What is gem5 – overview
• full-system, cycle-level, event-driven simulator
• used/maintained at universities and industry
[Figure: block diagram of gem5 components: ARM ISA support (ARMv7a, ARMv8), CPU models (Atomic, Timing, In-Order, Out-of-Order), caches (L1-L3), I/O components (UART, 10Gb NIC, NVMe, DMA, UFS, timers, RTC), GPU models (NoMali), memory (flash, DRAM, HMC), interconnect (crossbar, snoop filter, bridges), and simulation support (KVM, traffic generator, traffic monitor, SimPoints, power model)]
Why dist-gem5?
• performance and power dissipation of a distributed system
  complex interplay among system components at scale
• need a full-system, cycle-level simulator that is fast enough to simulate a large-scale computer system
• distributed simulation:
  simulate a distributed system w/ many simulation hosts
[Figure: performance and power at scale are shaped by the OS, ISAs, cores, caches, memory, network, and devices]
dist-gem5 architecture – high level view
• gem5 processes modeling full systems run in parallel on a cluster of physical machines
• simulated network switch
  forwards packets among the simulated systems
  synchronizes the distributed simulation
  simulates the network topology
[Figure: four physical hosts (#1 to #4) running gem5 processes: simulated systems #1 to #3 and a simulated network switch]
dist-gem5 architecture – core components
• packet forwarding
• simulated network
• distributed check-pointing
• synchronization
dist-gem5 architecture – packet forwarding
[Figure: physical hosts #1 to #3 connect through physical NICs #1 to #3 to ports 1 to 3 of a physical switch. On top of them, gem5 #1 and gem5 #2 each model a simulated system with a simulated NIC, and gem5 #3 models the simulated switch with simulated ports 0 and 1.]
simulated packets are embedded into host TCP/IP packets
[Figure: a simulated packet (sim pkt) carried inside a host TCP packet]
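The encapsulation idea can be illustrated with a minimal Python sketch. The framing used here (a 4-byte length plus an 8-byte simulated send tick in front of each payload) is an assumption made for illustration and is not claimed to be the message layout dist-gem5 uses; `pack_sim_packet` and `unpack_sim_packets` are hypothetical names.

```python
import struct

# Hypothetical header: payload length (uint32) + simulated send tick (uint64),
# both in network byte order. The real dist-gem5 message layout may differ.
HEADER = struct.Struct("!IQ")

def pack_sim_packet(payload: bytes, send_tick: int) -> bytes:
    """Wrap one simulated Ethernet frame so it can travel over a TCP stream."""
    return HEADER.pack(len(payload), send_tick) + payload

def unpack_sim_packets(stream: bytes):
    """Split a received byte stream into (send_tick, payload) pairs.

    Returns the decoded packets and any trailing bytes that belong to a
    message that has not been fully received yet.
    """
    packets = []
    offset = 0
    while len(stream) - offset >= HEADER.size:
        length, tick = HEADER.unpack_from(stream, offset)
        end = offset + HEADER.size + length
        if end > len(stream):
            break  # incomplete message: wait for more bytes
        packets.append((tick, stream[offset + HEADER.size:end]))
        offset = end
    return packets, stream[offset:]
```

A length prefix is needed because TCP delivers a byte stream, not discrete messages, so the receiver has to recover the packet boundaries itself.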
Asynchronous processing of incoming messages
• simulation thread (main thread)
  processes/inserts events in the event queue
  on a send pkt event, encapsulates the simulated Ethernet packet in a message and sends it out
• receiver thread
  created for each gem5 process
  waits for incoming packets
  creates a recv pkt event and inserts it into the event queue
[Figure: inside a gem5 process on a physical host, the simulation thread handles send pkt and recv pkt events from the event queue, while the receiver thread reads incoming messages from the physical NIC]
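The split between the simulation thread and the receiver thread can be sketched in Python as follows. This is a toy model, not gem5 code: a `queue.Queue` stands in for the TCP socket, a heap stands in for the event queue, and `receiver_thread` and `run_demo` are hypothetical names.

```python
import heapq
import queue
import threading

def receiver_thread(incoming: queue.Queue, event_q: list, lock: threading.Lock):
    """Wait for incoming packets and insert a recv pkt event for each one."""
    while True:
        msg = incoming.get()            # blocks until a message arrives
        if msg is None:                 # sentinel: channel closed
            break
        deliver_tick, payload = msg
        with lock:                      # the event queue is shared state
            heapq.heappush(event_q, (deliver_tick, "recv pkt", payload))

def run_demo():
    incoming = queue.Queue()            # stands in for the host TCP socket
    event_q, lock = [], threading.Lock()
    rx = threading.Thread(target=receiver_thread, args=(incoming, event_q, lock))
    rx.start()
    incoming.put((300, b"second"))      # messages may arrive in any order
    incoming.put((100, b"first"))
    incoming.put(None)
    rx.join()
    # Simulation thread: process events in simulated-time order.
    processed = []
    while event_q:
        with lock:
            processed.append(heapq.heappop(event_q))
    return processed
```

Because the receiver only inserts events and never executes them, all simulated state is still modified by the simulation thread alone.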
Need for synchronization
• receiver gem5 can run ahead of sender gem5
  physical host mismatch
  different events to be processed
• the receiver gem5 must be slowed down to ensure simulation accuracy
• quantum-based synchronization
[Figure: timelines over wall clock time for gem5#0 and gem5#1 showing send time, simulated network delay, expected delivery time, recv time, and a late packet arrival]
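Quantum-based synchronization can be sketched with a barrier: every process simulates one quantum, then waits until all others have done the same. The Python below is a toy model under that assumption, with threads standing in for gem5 processes and `threading.Barrier` standing in for the global sync; `run_lockstep` is a hypothetical name.

```python
import threading

def run_lockstep(num_procs: int = 2, quantum: int = 1000, num_quanta: int = 4):
    """Advance several simulated processes one quantum at a time.

    Returns the final simulated tick of each process and the largest
    difference in simulated time that any process observed.
    """
    barrier = threading.Barrier(num_procs)    # stands in for the global sync
    ticks = [0] * num_procs
    max_skew = [0] * num_procs

    def proc(rank: int) -> None:
        for _ in range(num_quanta):
            ticks[rank] += quantum            # simulate one quantum of events
            snapshot = list(ticks)
            max_skew[rank] = max(max_skew[rank], max(snapshot) - min(snapshot))
            barrier.wait()                    # nobody starts the next quantum early

    threads = [threading.Thread(target=proc, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ticks, max(max_skew)
```

However fast or slow each thread runs in wall clock time, no process is ever more than one quantum ahead of another in simulated time.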
Accurate packet forwarding
• quantum: interval, in simulated time, between periodic synchronizations
• sync-event flushes the inter-gem5 communication channels
• if quantum ≤ simulated link delay:
  expected delivery tick falls inside the next quantum or later, so the packet is never late
• optimal quantum size for accurate forwarding == simulated link delay
[Figure: timelines over wall clock time for gem5#0 and gem5#1 with quantum boundaries and global syncs, showing send time, simulated network delay, packet arrival wallclock time, and expected delivery time]
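The condition quantum ≤ simulated link delay can be checked with a short calculation. The sketch below assumes that syncs happen at multiples of the quantum and that every packet sent during a quantum has reached the receiver by the sync that ends it; the function names are hypothetical.

```python
def next_sync_tick(send_tick: int, quantum: int) -> int:
    """Simulated tick of the first global sync after the packet is sent."""
    return (send_tick // quantum + 1) * quantum

def arrives_in_time(send_tick: int, link_delay: int, quantum: int) -> bool:
    """True if the packet is guaranteed to be at the receiver before it is due.

    The receiver cannot simulate past the next sync until the packet has
    been flushed to it, so delivery is safe when the expected delivery tick
    is not earlier than that sync.
    """
    delivery_tick = send_tick + link_delay
    return delivery_tick >= next_sync_tick(send_tick, quantum)
```

With `quantum == link_delay` the check holds for every send tick, whereas `arrives_in_time(0, 1000, 2000)` is false: the receiver may already have simulated up to tick 2000 when a packet due at tick 1000 shows up.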
dist-gem5 architecture – network modeling
[Figure: 64 servers (#0 to #63) in groups of 8, each group attached to a top-of-rack switch (#0 to #7), with the top-of-rack switches attached to an aggregate switch. Annotation: "simulate in one gem5 process".]
Configurable network model
• configurable baseline Ethernet switch model
  port number, delay, bandwidth, buffer size
[Figure: top-of-rack switches #0 to #7 (ports p0 to p8) and an aggregate switch (ports p0 to p7). Legend: gem5, simulated etherLink, simulated port, distEtherLink, simulated etherSwitch, physical host. Inset of the switch internals: MAC table, input ports IPORT#0 to IPORT#n, in-order queues In-orderQ#0 to In-orderQ#n, and output ports OPORT#0 to OPORT#n.]
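The role of the MAC table and the per-port queues can be illustrated with a toy learning switch in Python. This is a sketch under simplifying assumptions, not the gem5 switch model: `ToySwitch` is a hypothetical class, and of the configurable parameters only port number and buffer size are modeled (delay and bandwidth are left out).

```python
from collections import deque

class ToySwitch:
    """Learning Ethernet switch with one bounded queue per output port."""

    def __init__(self, num_ports: int, buffer_size: int):
        self.num_ports = num_ports
        self.buffer_size = buffer_size          # max packets queued per port
        self.mac_table = {}                     # MAC address -> port
        self.out_queues = [deque() for _ in range(num_ports)]

    def receive(self, in_port: int, src: str, dst: str, payload: bytes) -> int:
        """Learn the source port, then forward or flood the packet.

        Returns the number of output queues the packet was added to.
        """
        self.mac_table[src] = in_port
        if dst in self.mac_table:
            targets = [self.mac_table[dst]]
        else:
            # Unknown destination: flood to every port except the input port.
            targets = [p for p in range(self.num_ports) if p != in_port]
        enqueued = 0
        for port in targets:
            q = self.out_queues[port]
            if len(q) < self.buffer_size:       # drop when the buffer is full
                q.append((src, dst, payload))
                enqueued += 1
        return enqueued
```

For example, on a 3-port switch the first packet from an unknown sender to an unknown destination is flooded to the two other ports, and the reply is sent only to the port learned for the original sender.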
Methodology – simulation techniques
• For example, simulating a cluster w/ 7 nodes and 1 network switch:
[Figure: three setups on quad-core physical hosts. single-threaded-gem5: one gem5 process (gem5#0) simulates systems #0 to #6 and the switch. parallel-gem5: eight gem5 processes (gem5#0 to gem5#7, one per system plus one for the switch) share one physical host. dist-gem5: the eight gem5 processes are spread over two physical hosts, four per host.]
Methodology – experimental setup
• focus on off-chip network performance using network-intensive applications
  iperf, memcached, httperf, tcptest, netperf, NAS parallel benchmark
• verification/validation against:
  single-threaded-gem5
  physical cluster
    o 4-node cluster w/ AMD A10-5800K
• speedup comparison against:
  single-threaded-gem5
  parallel-gem5

category   gem5 configuration
O3 core    4 cores; 4-way superscalar
memory     8GB DDR3 1600 MHz
network    Intel GbE NIC; 1 μs link latency
OS         Linux Ubuntu 14.04 (kernel 4.3)
Verification
• same node/network config
  dist-gem5 generates simulation statistics identical to single-threaded-gem5
  different cluster sizes
[Figure: single-threaded-gem5 (one gem5 process simulating all systems and the switch on one quad-core physical host) = dist-gem5 (one gem5 process per system and switch, spread over two quad-core physical hosts)]
Validation – network latency and bandwidth
• iperf (left) and memcached (right)
• dist-gem5 follows the behavior of the physical setup
• 17.5% lower response time for memcached
[Figure, left: iperf latency (ms) vs. bandwidth (Gbps), dist-gem5 vs. phys. Right: memcached latency (ms) by distribution percentile (1 to 95), dist-gem5 vs. phys.]
Speedup – simulation time reduction
• httperf runs on each simulated node, sending a fixed number of requests to a unique simulated node (Apache server)
• compared with single-threaded-gem5
• dist-gem5 simulating 63 nodes on 16 physical hosts is
  83.1x faster than single-threaded-gem5
  12.8x faster than parallel-gem5

[Figure: speedup (normalized to single-threaded-gem5) vs. number of simulated nodes]
simulated nodes   dist-gem5   parallel-gem5
3                 2.7         2.7
7                 6.3         3.7
15                21.8        6.6
31                36.0        6.0
63                83.1        6.5
speedup of parallel-gem5 saturates!
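The 12.8x figure follows from the two 63-node data labels in the chart (83.1 for dist-gem5 and 6.5 for parallel-gem5, both normalized to single-threaded-gem5). A small Python check of that arithmetic, with a hypothetical helper name:

```python
def relative_speedup(speedup_a: float, speedup_b: float) -> float:
    """Speedup of A over B when both are normalized to the same baseline."""
    return speedup_a / speedup_b

# 63 simulated nodes: dist-gem5 vs. parallel-gem5
dist_over_parallel = round(relative_speedup(83.1, 6.5), 1)
print(dist_over_parallel)  # 12.8
```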
Scalability – simulation time vs. simulated cluster size
• simulation time increase for simulating 64 vs. 3 nodes:
  57.3x for single-threaded-gem5
  23.9x for parallel-gem5
  1.9x for dist-gem5
[Figure: normalized simulation time (log scale) vs. number of simulated nodes for dist-gem5, parallel-gem5, and single-threaded-gem5]
dist-gem5 scales well!
Synchronization overhead
• sweep synchronization quantum size
• # of http requests remains nearly constant
  maximum 2.6% variance
  almost the same amount of work done at each quantum size
• simulation time improvement
  4.9% from 0.5 μs to 1 μs
  15.7% from 0.5 μs to 128 μs
[Figure: normalized simulation time and number of requests (K req) vs. synchronization quantum size (0.5 to 128 μs)]
dist-gem5 synchronization is efficient!
Conclusion
• dist-gem5 is a distributed version of gem5 for modeling computer clusters
  validated against a physical cluster
  accurate/deterministic
  rich off-chip network modeling
  83.1x speedup over single-threaded-gem5 when simulating a 63-node cluster
• integrated into mainstream gem5
  available at gem5.org
  enabled via the “--dist” command line option
• developed/maintained by university and industry
Thank You