dist-gem5: distributed simulation of compute...

29
dist-gem5: Distributed Simulation of Compute Clusters Mohammad Alian , Umur Darbaz, Gabor Dozsa, Stephan Diestelhorst, Daehoon Kim, Nam Sung Kim University of Illinois Urbana-Champaign ARM Ltd., Cambridge, UK 1

Upload: others

Post on 04-Aug-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

dist-gem5: Distributed Simulation of Compute Clusters

Mohammad Alian, Umur Darbaz, Gabor Dozsa, Stephan Diestelhorst, Daehoon Kim, Nam Sung Kim

University of Illinois Urbana-Champaign

ARM Ltd., Cambridge, UK

1

Page 2: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Outline

• motivationaccelerating large-scale simulation

• dist-gem5 architecturepacket forwarding

synchronization

checkpointing

network model

• evaluation

validation, speedup, synchronization overhead

• conclusion

2

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 3: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Outline

• motivationaccelerating large-scale simulation

• dist-gem5 architecturepacket forwarding

synchronization

checkpointing

network model

• evaluation

validation, speedup, synchronization overhead

• conclusion

3

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 4: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

What is gem5 – overview

• full-system, cycle-level, event-driven simulator

• used/maintained at universities and industry

4

Core Integrated IP

ARM ISA Support

ARMv7a ARMv8

GICv2

CPU Models

L1-L3 $ SCU

ArchTimer PMU

IO components

Simulation support

UARTUHDLCD

10Gb

NIC

NVMe

DMA

KVMv7

Traffic

Gen

Traffic

Monitor

Memory

Flash DRAM

Interconnect

CrossbarSnoop

filterBridges

Stream

Line

Sim

Points

Power

Model Int.KVMv8

FracFact

PCA

UFS

Timers

RTC

GPU models

NoMali

HMC

dist-gem5 architecture evaluation conclusionwhat is gem5

Atomic Timing

Out of

OrderIn Order

Page 5: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Why dist-gem5?

• performance and power dissipation of a distributed systemcomplex interplay among system components at scale

• need a full-system, cycle-level simulator which is fast enough to simulate a large-scale computer system

• distributed simulation:simulate a distributed system

w/ many simulation hosts

5

scaleOS

ISAscaches

memory

network

devices

performancePower

cores

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 6: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

dist-gem5 architecture – high level view

• gem5 processes modeling full systems run inparallel on a cluster of physical machines

• simulated network switchforward packets among the simulated systems

synchronize the distributed simulation

simulate network topology

6

host #1

dist-gem5 architecture evaluation conclusionwhat is gem5

simulated system #1 gem5 process

physical machine

simulated network switch

host #4

simulated system #3

host #3

simulated system #2

host #2

Page 7: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Outline

• motivationaccelerating large-scale simulation

• dist-gem5 architecturepacket forwarding

synchronization

checkpointing

network model

• evaluation

validation, speedup, synchronization overhead

• conclusion

7

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 8: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

dist-gem5 architecture – core components

8

packet forwarding

simulated network

distributed check-pointing

synchronization

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 9: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

dist-gem5 architecture – core components

9

packet forwarding

simulated network

distributed check-pointing

synchronization

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 10: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

physical host #1

physical host #3

physical host #2

physical switch

physNIC#1

physNIC#2

physport1

physport2

physport3

physNIC#3

dist-gem5 architecture – packet forwarding

10

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 11: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

physical host #1

physical host #3

physical host #2

physical switch

physNIC#1

physNIC#2

physport1

physport2

physport3

physNIC#3

dist-gem5 architecture – packet forwarding

11

gem5 #1

simulated system #1

simNIC

gem5 #3

simulated switch

dist-gem5 architecture evaluation conclusionwhat is gem5

gem5 #2

simulated system #2

simNIC

simport0

simport1

Page 12: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

physical host #1

physical host #3

physical host #2

physical switch

physNIC#1

physNIC#2

physport1

physport2

physport3

physNIC#3

gem5 #1

simulated system #1

simNIC

gem5 #3

simulated switch

gem5 #2

simulated system #2

simNIC

simport0

simport1

dist-gem5 architecture – packet forwarding

12

simulated packets are embedded into host TCP/IP packets

sim pkt

TCP sim pkt

sim pktTCP sim pkt

sim pkt

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 13: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Asynchronous processing of incoming messages

• simulation thread (main thread)process/insert events in the event queue

in case of send pkt event, encapsulate the simulated Ethernet packet in a message and send it out

• receiver threadcreate for each gem5 process

waits for incoming packets

creates a recv pkt event and insert it to the event queue

13

eventQsimulation

threadsend pkt

recv pkt

physical hostgem5 process

receiverthread

physNIC

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 14: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

dist-gem5 architecture – core components

14

packet forwarding

simulated network

distributed check-pointing

synchronization

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 15: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Need for synchronization

15

wall clock time

• receiver gem5 can run ahead of sender gem5

physical host mismatch

different events to be processed

• slowed down receiver gem5 to ensure simulation accuracy

• quantum-based synchronization

gem5#0

gem5#1

send time

expected delivery time

simulated network delay

dist-gem5 architecture evaluation conclusionwhat is gem5

recv time

late packet arrival

Page 16: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Accurate packet forwarding

16

wall clock time

• quantum: interval for periodic synchronization in simulated time

• sync-event flushes inter gem5 communication channels

• if quantum ≤ simulated link delay:

expected delivery tick falls inside the next quantum

• optimal quantum size for accurate forwarding == simulated link delay

gem5#0

gem5#1

gem5#0

gem5#1

send time

packet arrival wallclock time

expected delivery time

simulated network delay

quantum

global sync

dist-gem5 architecture evaluation conclusionwhat is gem5

quantum

Page 17: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

dist-gem5 architecture – core components

17

packet forwarding

simulated network

distributed check-pointing

synchronization

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 18: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

server #2

dist-gem5 architecture – network modeling

18

Server #1

server #3

server #4

server #5

server #6

server #7

Server #0

top of rackswitch #0

server #10

server #9

server #11

server #12

server #13

server #14

server #15

server #8

top of rackswitch #1

server #58

server #57

server #59

server #60

server #61

server #62

server #63

server #56

top of rackswitch #7

aggregateswitch

. . .

simulate in one gem5 process

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 19: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

• configurable baseline Ethernet switch modelport number, delay, bandwidth, buffer size

physical host

Configurable network model

19

top of rackswitch #0

top of rackswitch #1

top of rackswitch #7

aggregateswitch

p8

p0 p7 p0 p7

p8

. . .. . . . . .

p0 p7p1

p8

gem5simulated etherLinksimulated port

distEtherLink

simulated etherSwitch

dist-gem5 architecture evaluation conclusionwhat is gem5

p0 p7

MAC

Table In-orderQ#0

In-orderQ#n

IPORT#0

IPORT#n

OPORT#0

OPORT#n

. . .

. . .

Page 20: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Outline

• motivationaccelerating large-scale simulation

• dist-gem5 architecturepacket forwarding

synchronization

checkpointing

network model

• evaluation

validation, speedup, synchronization overhead

• conclusion

20

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 21: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Methodology – simulation techniques

21

quad core physical host

gem5#6

system#6

gem5#7

switch

gem5#4

system#4

gem5#2

system#2

gem5#0

system#0

gem5#5

system#5

gem5#3

system#3

gem5#1

system#1

quad core physical host

gem5#6

system#6

gem5#7

switch

gem5#4

system#4

gem5#5

system#5

quad core physical host

gem5#0

system#6 switch

system#4

system#2

system#0

system#5

system#3

system#1

quad core physical host

gem5#6

system#2

gem5#7

system#3

gem5#4

system#0

gem5#5

system#1

dist-gem5 architecture evaluation conclusionwhat is gem5

single-threaded-gem5 dist-gem5parallel-gem5

• For example, simulating a cluster w/ 7 nodes and 1 network switch:

Page 22: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Methodology – experimental setup

• focus on off-chip network performance using network intensive applicationsiperf, memcached, httperf, tcptest, netperf, NAS parallel benchmark

• verification/validation against:single-threaded-gem5

physical clustero 4 node cluster w/ AMD A10-5800K

• speedup comparison against:single-threaded-gem5

parallel-gem5

22

category gem5 configuration

O3 core 4 cores; 4 way superscalar

memory 8GB DDR3 1600 MHz

network Intel GbE NIC; 1 μs Link latency

OS Linux Ubuntu 14.04 (Kernel 4.3)

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 23: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Verification

• same node/network configdist-gem5 generates identical

simulation statistics compared to single-threaded-gem5

different cluster sizes

23

quad core physical host

gem5#6

system#6

gem5#7

switch

gem5#4

system#4

gem5#5

system#5

quad core physical host

gem5#0

system#6 switch

system#4

system#2

system#0

system#5

system#3

system#1

quad core physical host

gem5#6

system#2

gem5#7

system#3

gem5#4

system#0

gem5#5

system#1

single-threaded-gem5 dist-gem5

=

Page 24: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Validation – network latency and bandwidth

• iperf (left) and memcahed (right)

• follows the behavior of physical setup

• 17.5% lower response time for memcached

24

0.0

0.3

0.6

0.9

1.2

1.5

0.4 0.5 0.6 0.7 0.8 0.9 1 1.1

Lata

ncy

(ms

)

Bandwidth (Gbps)

dist-gem5

phys

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

1 5 10 20 30 40 50 60 70 80 90 95

Late

ncy (

ms)

memcached Distribution Percentile

dist-gem5

phys

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 25: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Speedup – simulation time reduction

• running httperf on each simulated node sending fixed number of requests to a unique simulated node (apache server)

• compared with single-threaded-gem5

• dist-gem5 simulating 63 nodes on 16 physical hosts is83.1 faster than single-threaded-gem5

12.8 faster than parallel-gem5

25

2.76.3

21.8

36.0

83.1

2.7 3.76.6 6.0 6.5

0

10

20

30

40

50

60

70

80

90

3 7 15 31 63

Sp

ee

du

p (

No

rm.

sin

gle

-th

read

ed

-gem

5)

Number of Simulated Nodes

dist-gem5 parallel-gem5

dist-gem5 architecture evaluation conclusionwhat is gem5

speedup of parallel-gem5 saturates!

Page 26: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Scalability – simulation time vs. simulated cluster size

• simulation time increase for simulating 64 vs. 3 nodes:57.3 for Single-threaded-gem5

23.9 for parallel-gem5

1.9 for dist-gem5

26

1.41.9

1.9

3.9

11.2

23.9

1.0

2.6

9.4

25.0

57.3

1.0

10.0

100.0

0 10 20 30 40 50 60 70

No

rmalized

Sim

ula

tio

n T

ime

Number of Simulated Nodes

dist-gem5 parallel-gem5 single-threaded-gem5

dist-gem5 scales well!

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 27: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Synchronization overhead

• sweep synchronization quantum size

• # of http req remains near constantsmaximum 2.6% variance

almost the same amount of work done at each quantum size

• simulation time improvement4.9% from 0.5 μs to 1 μs

15.7% from 0.5 μs to 128 μs

27

0

4

8

12

16

20

0.0

0.2

0.4

0.6

0.8

1.0

1.2

0.5 1 2 4 8 16 32 64 128

Nu

mb

er

of

Req

uests

(K

Req

)

No

rma

lized

Sim

ula

tio

n T

ime

Synchronization Quantum Size (μs)

Simulation Time Req#

dist-gem5 synchronization is efficient!

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 28: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Conclusion

• dist-gem5 is a distributed version of gem5 for modeling computer clustersvalidated against a physical cluster

accurate/deterministic

rich off-chip network modeling

83.1x speedup over single-threaded-gem5 simulating a 63 node cluster

• integrated to mainstream gem5available at gem5.org

enabled via “--dist” command line option

• developed/maintained by university and industry

28

dist-gem5 architecture evaluation conclusionwhat is gem5

Page 29: dist-gem5: Distributed Simulation of Compute Clustersispass.org/ispass2017/slides/alian_dist-gem5.pdf · process/insert events in the event queue in case of send pkt event, encapsulate

Thank You

29