TRANSCRIPT
THE SNIPER MULTI-CORE SIMULATOR
HTTP://WWW.SNIPERSIM.ORG SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

09:00 INTRODUCTION, SNIPER OVERVIEW & RATIONALE
09:45 SNIPER INTERNALS
10:00 – COFFEE BREAK –
10:30 VALIDATION RESULTS
10:45 RUNNING SIMULATIONS AND PROCESSING RESULTS
11:30 HANDS-ON DEMO
12:00 – END –
INTEL EXASCIENCE LAB
• Software and hardware for ExaFLOPS scale machines
• Collaboration between Intel, imec and 5 Flemish universities
• Study Space Weather as an HPC workload
[Diagram: the Simulation Toolkit spans Architectural Simulation, Hardware Design, Space Weather Modeling and Visualization]
THE SNIPER MULTI-CORE SIMULATOR: INTRODUCTION
WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
TRENDS IN PROCESSOR DESIGN: CACHE
Cache sizes are increasing
[Chart: LLC cache sizes (Mbytes), Jan 1993 through Dec 2014, for IPF and x86 processors, growing from near zero to tens of megabytes]
TRENDS IN PROCESSOR DESIGN: CORES
Number of cores per node is increasing
– 2001: Dual-core POWER4
– 2005: Dual-core AMD Opteron
– 2011: 10-core Intel Xeon Westmere-EX
– 2012: Intel MIC Knights Corner (60+ cores)
SIMULATION
• Design tomorrow's processor using today's hardware
• Simulation
– Obtain performance characteristics for new architectures
– Architectural exploration
– Early software optimization
DEMANDS ON SIMULATION ARE INCREASING
Increasing core counts
– Linear increase in simulator workload
– Single-threaded simulator sees a rising gap
• workload: increasing target cores
• available processing power: near-constant single-thread performance of host machine
– Need to use all cores of the host machine
→ Parallel simulation
DEMANDS ON SIMULATION ARE INCREASING
Increasing cache size
– Need a large working set to exercise large caches
• Scaled-down applications won't exhibit the same behavior
– Application designers want to see full-program behavior
→ Long-running simulations are required
UPCOMING CHALLENGES
• Future systems will be diverse
– Varying processor speeds
– Varying failure rates for different components
– Homogeneous applications become heterogeneous
• Software and hardware solutions are needed to solve these challenges
– Handle heterogeneity (reactive load balancing)
– Be fault tolerant
– Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models
FAST AND ACCURATE SIMULATION IS NEEDED
• Simulation use cases
– Architecture exploration
– Pre-silicon software optimization
– [Validation]
• Cycle-accurate simulation is too slow for exploring the multi/many-core design space and software
• Key questions
– Can we raise the level of abstraction?
– What is the right level of abstraction?
– When to use these abstraction models?
EXPERIMENT DESIGN IN ARCHITECTURE EXPLORATION/EVALUATION
• Optimizing the probability of success (i.e., finding the best architecture/parameters):
– Coverage: how many architecture configurations can I run
– Confidence: # benchmarks, re-runs for variable applications
– Accuracy: simulation model detail vs. runtime
• How many scenarios can I run?
– N = total number of simulation scenarios
– d = days until paper deadline
– t = average time per simulation
– B = number of benchmarks
– A = number of architectures

N = d / (t × B × A): minimize t to maximize N (a worked example follows below)
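As a quick sanity check of the budget formula, a minimal worked example (all numbers made up):

# Worked example of N = d / (t * B * A), with made-up numbers:
d = 30.0        # days until the paper deadline
t = 3.0 / 24.0  # average time per simulation: 3 hours, in days
B = 20          # number of benchmarks
A = 4           # number of architectures
N = d / (t * B * A)
print(N)        # -> 3.0: only three full sweeps fit before the deadline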
FAST OR ACCURATE SIMULATION?
[Figure: predicted performance across architectures A through E. The cycle-accurate simulator covers only the first design points before time runs out (the rest remain '?'); the higher-abstraction-level simulator covers all five design points]
THE ARCHITECTURE DESIGN WATERFALL
[Figure: the architecture design waterfall. Over the design process, analytical models, high-level simulation and cycle-accurate simulation narrow the number of architectures considered from roughly 10^10 down to 1, while the benchmarks/applications evolve from program characteristics to representative applications to traces/microbenchmarks. High-level simulation also feeds pre-silicon software optimization and co-design]
NEEDED DETAIL DEPENDS ON FOCUS
[Figure: needed detail depends on focus. Components span RTL, OOO execution, core memory ops, L1 cache access, LLC access and off-socket communication; single-event time scales range from a single clock cycle up to microseconds, and the required simulated time from millions of cycles up to seconds. Full-detail (RTL) simulation is too slow for the larger components; overly abstract models are not accurate enough for the smaller ones]
ONE-IPC MODELING – TOO SIMPLE?
• Simple high-abstraction model, often used in uncore studies
• Alternative for memory access traces
– Aims to provide more-realistic access patterns
– Allows for timing feedback
• But: one-IPC core models do not exhibit ILP/MLP (see the sketch below)
– Memory request rates are not as accurate as more detailed simulators
– # outstanding requests incorrect: underestimate required queue sizes
– No latency is hidden: overestimate runtime improvements
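To make the critique concrete, a toy sketch of what one-IPC costing implies (illustrative only, not any simulator's actual model):

# One-IPC sketch: each instruction costs one cycle, and every miss adds its
# full latency serially. Nothing overlaps, so there is no ILP and no MLP:
# queue occupancy never exceeds one request, and no miss latency is hidden.
def one_ipc_cycles(n_instructions, n_misses, miss_latency):
    return n_instructions + n_misses * miss_latency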
DETAILED MODEL VS. INTERVAL SIM
[Diagram: a detailed model simulates the full pipeline (fetch, BP, I$, decode, issue queue, ROB, LSQ, execution units, commit) on top of a functional simulator, cache hierarchy and DRAM. Interval simulation keeps only the cache hierarchy and DRAM in detail and models the core analytically]
Balanced architecture; main factors in core performance:
• Parallelism exposed by the ROB
• Miss events
Out-of-order core performance model with in-order simulation speed
INTERVAL SIMULATION
[Figure: effective dispatch rate over time, split into intervals by miss events: an I-cache miss ends interval 1, a branch misprediction ends interval 2, and a long-latency load miss ends interval 3]
D. Genbrugge et al., HPCA'10; S. Eyerman et al., ACM TOCS, May 2009; T. Karkhanis and J. E. Smith, ISCA'04 and ISCA'07
INTERVAL SIMULATION FROM 30,000 FEET
Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
– Instruction cache / TLB miss
– Branch misprediction (not dispatching useful instructions)
– Front-end refill after misprediction
– ROB full: long-latency miss at head of ROB
INTERVAL SIMULATION FROM 30,000 FEET
Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
• dispatch possible: at a rate governed by the ROB
– Little's law: progress rate = #elements / time spent in queue (see the sketch below)
– Computed using the ROB fill and the critical path through the ROB
• Computed using dynamic instruction dependencies and latencies
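A minimal sketch of this dispatch rule, assuming a hypothetical estimate_critical_path() helper that returns the critical-path length (in cycles) through the instructions currently in the ROB:

# Illustrative interval-dispatch sketch (not Sniper's actual code): absent
# miss events, the ROB drains at rate = occupancy / time-in-queue
# (Little's law), capped by the front-end dispatch width.
def interval_cycles(instructions, rob_size, dispatch_width):
    critical_path = estimate_critical_path(instructions)  # hypothetical helper
    rate = min(dispatch_width, float(rob_size) / critical_path)  # instrs/cycle
    return len(instructions) / rate  # cycles to dispatch this interval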
KEY BENEFITS OF THE INTERVAL MODEL
• Models superscalar OOO execution
• Models the impact of ILP
• Models second-order effects: MLP
• Allows for constructing CPI stacks
CYCLE STACKS
• Where did my cycles go?
• CPI stack
– Cycles per instruction
– Broken up in components
• Normalize by either
– Number of instructions (CPI stack)
– Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance
[Figure: example CPI stack with components Base, Branch, I-cache and L2 cache]
CONSTRUCTING CPI STACKS
• Interval simulation: track why time is advanced (see the sketch below)
– No miss events
• Dispatch instructions at the base CPI
• Increment the base component
– Miss event
• Fast-forward time by X cycles
• Increment that component by X
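A minimal sketch of this bookkeeping, with made-up component names:

# CPI-stack accounting during interval simulation (illustrative only):
from collections import defaultdict
cycles_by_component = defaultdict(int)

def dispatch(n_instructions, base_cpi):
    # No miss event: time advances at the base dispatch rate
    cycles_by_component['base'] += n_instructions * base_cpi

def miss_event(component, penalty):
    # Miss event: fast-forward time by the penalty, charged to its component
    cycles_by_component[component] += penalty  # e.g. 'branch', 'mem-dram'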
CYCLE STACKS FOR PARALLEL APPLICATIONS
By thread: heterogeneous behavior in a homogeneous application?
[Diagram: two quad-core sockets, each with private L1s, shared L2s and a shared L3, connected to DRAM; per-thread cycle stacks reveal where each thread's time goes]
CYCLE STACKS AND SCALING BEHAVIOR
• Scaling to more cores, larger input set size
• How does execution time scale, and why?
CYCLE STACKS AND SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
CYCLE STACKS AND SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
• Scale core count: sync losses increase to 20%
SNIPER: A FAST AND ACCURATE SIMULATOR
• Hybrid simulation approach
– Analytical interval core model
– Micro-architecture structure simulation
• branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org
SIMULATION IN SNIPER
[Diagram: a functional simulator (Pin) drives the processor core models together with the branch predictor and memory hierarchy simulators. A single-process, multithreaded workload (v1.0) uses execution-driven simulation; multiple single-threaded workloads (v2.0) use trace-driven simulation]
SIMULATION IN SNIPER WITH SIFT
[Diagram: multiple multi-threaded workloads (v3.03), each running under its own Pin instance, connect to the processor core and memory hierarchy simulators over bi-directional single-thread SIFT connections: functional-directed simulation with timing feedback. PinPlay support was added in v4.2]
MANY ARCHITECTURE OPTIONS
[Diagram: example target architectures, from multi-socket designs with private L1I/L1D caches, shared L2s and per-socket L3s in front of DRAM, to a many-core tiled design with a private L1I/L1D and L2 per core, connected through a NoC]
THREAD SCHEDULING AND MIGRATION
[Diagram: a scheduler maps a pool of threads onto cores, each with private L1s and L2s and a shared L3. Thread migration support was added in v4.0]
ADVANCED VISUALIZATION
• Cycle stacks through time
TOP SNIPER FEATURES
• Interval Model
• CPI Stacks and Interactive Visualization
• Parallel Multithreaded Simulator
• x86-64 and SSE2 support
• Validated against Core2, Nehalem
• Thread scheduling and migration
• Full DVFS support
• Shared and private caches
• Modern branch predictor
• Supports pthreads and OpenMP, TBB, OpenCL, MPI, …
• SimAPI and Python interfaces to the simulator
• Many flavors of Linux supported (Redhat, Ubuntu, etc.)
SIMULATOR COMPARISON
                          Sniper   Graphite   gem5              COTSon   MARSSx86
Integrated                                    X
Func-directed             X        X                            X        X
User-level                X        X          X
Full-system                                   X                 X        X
Archs supported           x64      x64        x64/Alpha/SPARC   x64      x64
Parallel (in-node)        X        X
Flexible cache hierarchy  X        X          X                 X
SNIPER LIMITATIONS
• User-level
– Perfect for HPC (hybrid MPI/OpenMP applications)
– Not the best match for workloads with significant OS involvement
• Functional-directed
– No simulation / cache accesses along false paths
• High-abstraction core model
– Not suited to model all effects of core-level changes
– Perfect for memory subsystem or NoC work
• x86 only
SNIPER HISTORY
• November 2011: SC'11 paper, first public release
• March 2012, version 2.0: Multi-program workloads
• May 2012, version 3.0: Heterogeneous architectures
• November 2012, version 4.0: Thread scheduling and migration
• April 2013, version 5.0: Multi-threaded application sampling
• June 2013, version 5.1: Suggestions for optimization visualization
• September 2013, version 5.2: MESI/F, 2-level TLBs, Python scheduling
• Today: 700+ downloads from 60 countries
NEW IN RELEASE 5.2: MESI/F COHERENCY
• Added MESI and MESIF coherency protocols (see the toy transition table below)
– EXCLUSIVE state: allows for silent upgrades
– FORWARDING state: special case of SHARED, allows for cache-to-cache transfers of shared data
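A toy write-hit transition table illustrating why the extra states help (not Sniper's implementation; state and action names here are made up):

# On a write hit, E upgrades to M silently (no coherence traffic), while
# S and F must first invalidate the other sharers. F additionally marks
# which SHARED copy answers read requests with a cache-to-cache transfer.
WRITE_HIT = {
    'M': ('M', None),                  # already writable
    'E': ('M', None),                  # silent upgrade: the key E benefit
    'S': ('M', 'invalidate sharers'),
    'F': ('M', 'invalidate sharers'),
}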
[Charts: normalized run-time of npb/A under MSI, MESI and MESIF, on a Xeon Phi-like configuration (roughly 0.96 to 1.0) and a Xeon/i7-like configuration (roughly 0.985 to 1.0)]
USE CASE #1: ARCHITECTURE EXPLORATION
• Many architectural options for a next-generation architecture (45nm to 22nm)
– Compare performance, energy efficiency
[Diagram: design points against a 2x quad-core baseline: 8 cores; 16 cores with no L3 and stacked DRAM; 16 slow (low-frequency) cores; 16 thin (dual-issue) cores; each point trades core area against cache]
Heirman et al., PACT’12
USE CASE #1: ARCHITECTURE EXPLORATION
Performance:
– 3D (no LLC) is good when streaming, bad when sharing
– Dual-issue is good when compute bound, bad when memory bound (a smaller ROB exposes less MLP!)
USE CASE #1: ARCHITECTURE EXPLORATION
Power:
– 3D reduces DRAM energy, but the lack of an LLC may increase total energy
– Low-frequency can be significantly more energy-efficient
USE CASE #2: HW/SW CO-OPTIMIZATION
• Heat transfer: stencil on a regular grid
– Important kernel, part of the Berkeley Dwarfs (structured grid)
• Improve memory locality: tiling over multiple time steps
– Trade off locality with redundant computation
– Optimum depends on the relative cost (performance & energy) of computation and data transfer → requires an integrated simulator
USE CASE #2: HW/SW CO-OPTIMIZATION
• Traditionally:
– Architects design hardware based on fixed benchmarks
– Programmers optimize software for today's architecture
• Software parameters depend on the architecture!
• Co-optimization yields 66% more performance, or 25% more energy efficiency, than isolated optimization
[Figure 9: hardware/software co-design for the heat benchmark when optimizing for (a) performance (simulated time steps per second) and (b) energy efficiency (simulated time steps per Joule). Four architecture design points (8-core, 3D, low-frequency, dual-issue) are considered while varying software parameters such as tile size (32 to 512, see legend) and arithmetic intensity in FLOP/byte (horizontal axes)]
for performance and power to explore these trade-offs and hardware/software interactions.

7.3 Co-design analysis
Figure 9 plots the simulation results for hardware/software co-design for the heat benchmark application. The grid domain is 4096×4096 elements. This domain is split up into square tiles measuring between 32 and 512 data points on each side. The complete domain is 128 MB in total and does not fit in the last-level cache for any of the architectures considered, so tiles always have to be loaded from main memory. The number of time steps performed on each tile before moving on to the next one varies between 1 and 65 steps. The graphs plot the achieved number of time steps per second, or per Joule of consumed energy, as a function of arithmetic intensity. The performance graphs in Figure 9(a) follow the basic roofline model from Figure 8: initially, increasing the number of time steps improves performance, which later falls back once the amount of redundant computation becomes too high. Additionally, each architecture has an optimal tile size which maximizes data reuse while still fitting in the cache. For example, the 8-core design point reaches a performance level of around 150 simulated time steps per second, using a tile size of 128×128. The working set of this application is two tiles' worth of data, corresponding to the previous and current time steps. At one double-precision floating-point number or 8 bytes per element, the working set of 256 KB for the 128²-sized tile fits in a core's private 512 KB L2 cache, whereas for the larger 256² tiles it does not, which makes performance significantly lower than that predicted by the roofline model. The three other architectures, which have an L2 cache of only 256 KB, reach their optimal performance using the smaller 64² tiles.
If we consider the effect of arithmetic intensity on energy rather than performance, we see that generally the optimum has shifted towards the left, meaning that less redundant work, and a slightly increased access rate to main memory, is preferred. This is because the ratio between the access times of caches and DRAM is very high, making the performance aspect of more redundant work cheap when compared to extra DRAM accesses. On the other hand, when comparing the energy cost of DRAM accesses versus that of extra computation, the ratio is lower, which makes the relative cost of the redundant work higher. Note that this extra cost consists of not just extra floating-point operations, which are in themselves very cheap (in the order of 0.5 nJ per double-precision operation, vs. 100 nJ per DRAM access¹), but also the cost of extra instructions flowing through all stages of the out-of-order pipeline, extra cache accesses, etc. (In Figure 5, floating-point ALU energy represents only a small fraction of total core energy.)
When considering performance alone, the 3D architecture clearly wins for this benchmark. The additional bandwidth of the 3D stacked memory allows for a steeper performance slope in the leftmost part of the performance graph, while the availability of 16 full-sized cores results in the highest peak performance across all architecture design points considered. Yet, tiles have to be kept small enough such that they fit in the L2 cache; this architecture does not have an L3 cache, so the cost of L2 misses, which become significant with tile sizes larger than 64², is much higher here than in the other architectures.
On the other hand, the low-frequency architecture, while being less high-performance than both the 8-core and 3D design points, reaches the highest energy efficiency. Because this application scales fairly well with core count, and the power envelope of the low-frequency chip is still modest, one could even consider another doubling of the number of cores, which should almost double performance at very little additional cost in energy.
¹According to the relevant component in the dynamic energy stacks computed by Sniper/McPAT, scaled by the number of FP operations or DRAM accesses throughout the benchmark, respectively.
USE CASE #3: SCHEDULING HETERO. CMPS
• Heterogeneous CMPs provide a flexible power/performance trade-off (e.g. ARM big.LITTLE)
• Throughput-optimized schedulers place the thread with the highest benefit on big cores
• This does not help parallel applications: fast threads need to wait for slow threads
[Diagram: big (B) high-performance cores alongside many small (S) power-efficient cores; chart of normalized runtime]
USE CASE #3: SCHEDULING HETERO. CMPS
• The equal-progress scheduler estimates the speedup/slowdown of a thread if it were to run on the other core type
(Van Craeynest et al., PACT'13)
[Diagram: CPI on the big core is estimated from CPI_small via the MLP change, and CPI on the small core from CPI_big via the ILP change; each CPI is decomposed into MLP and ILP components]
USE CASE #3: SCHEDULING HETERO. CMPS
• The equal-progress scheduler estimates the speedup/slowdown of a thread if it were to run on the other core type
• Matches progress (= the hypothetical speed if the thread had always run on a big core)
• Results in a performance improvement, even for heterogeneous applications
(Van Craeynest et al., PACT'13)
THE SNIPER MULTI-CORE SIMULATOR: SIMULATOR INTERNALS
WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
RELAXED SYNCHRONIZATION
• Graphite introduced relaxed synchronization with a number of different synchronization schemes
– none: only synchronizes when the application does; for pthread calls, etc.
– random-pairs: synchronizes random pairs of threads
– barrier: synchronizes all threads at a given simulated time interval
• Sniper defaults to barrier synchronization with 100ns intervals
– Multi-machine mode is not supported, so tight synchronization is easier
BARRIER SYNCHRONIZATION IN ACTION
[Figure: real time vs. simulated time for several threads; all threads stop at each periodic barrier, while pthread_cond_wait / pthread_cond_signal interactions between threads happen in between barriers]
PARALLELISM INSIDE SNIPER
• Each simulated thread corresponds to a thread on the host
– Functional simulation (Pin mode) / reading a trace (SIFT mode)
– Executes the timing model for the core (and associated caches) on which the thread currently runs (when scheduled)
– Each core model maintains its own local time
• Extra threads for the network and DRAM models
– Can process invalidation requests without interrupting the core model
• Each thread can independently make progress (see the sketch below)
– Causality errors can occur, no rollback
– Skew is limited to 100ns
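A toy sketch of the skew bound behind the barrier scheme (illustrative only; Sniper's real implementation differs):

# Quantum-based barrier synchronization: each simulated thread advances its
# own local clock, but may only run ahead of the slowest thread by one
# barrier quantum, which bounds the maximum skew (and thus causality error).
QUANTUM_NS = 100

def may_proceed(local_times_ns, tid):
    # Thread tid waits at the barrier once it is a full quantum ahead
    return local_times_ns[tid] < min(local_times_ns) + QUANTUM_NS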
THREADS IN SNIPER
[Diagram: application threads each drive a core with its L1 I/D and L2 models; separate network threads handle the shared L3, network interface (NI) and DRAM]
TIME IN SNIPER
• Each memory access instantly returns its latency
• Application threads maintain time
• Network threads reset time for each request
[Diagram: application threads carry local time forward through the core and cache models (t=0, 1, 4, 5, 10, 15, …), while a network thread handling L3/NI/DRAM resets its time for each incoming request (t=20, 22, 24, 26, 27, 28, 29, 30)]
MODELING CONTENTION
• Events may happen out of order
• How to model bandwidth / contention? – History list (see the sketch below)
– Resource in use at times 0…10, 12…17, 25…30
– Access at 15: delay = 2
– Access at 8, length 5: ?
• Causality errors are possible
– Effect is limited, as long as average bandwidth is OK
– Allows for faster simulation, easier implementation
– Speed versus accuracy trade-off
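A minimal sketch of a history-list contention model under the slide's example (not Sniper's code; the interface is made up):

# A resource records the intervals during which it was busy; an access that
# lands inside a busy interval is delayed until that interval ends.
class HistoryList:
    def __init__(self, busy):
        self.busy = list(busy)  # (start, end) intervals the resource is in use

    def delay(self, t):
        # Remaining busy time of the interval containing t, if any
        overlapping = [end - t for start, end in self.busy if start <= t < end]
        return overlapping[0] if overlapping else 0

h = HistoryList([(0, 10), (12, 17), (25, 30)])
print(h.delay(15))  # -> 2: busy until t=17
# An out-of-order access at t=8 of length 5 overlaps the *next* interval;
# accepting some error there is the speed-versus-accuracy trade-off.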
SIMULATOR ARCHITECTURE: PIN FRONT-END
[Diagram: functional simulation/instrumentation lives in pin/; the timing model in common/, with core performance models in common/performance_model, cache performance models in common/core/memory_subsystem, NoC performance models in common/network, and thread scheduling in common/scheduler. All application threads run in a single process, launched via run-sniper -- ... or benchmarks/run-sniper -p]
SIMULATOR ARCHITECTURE: SIFT MODE
[Diagram: each application runs under its own SIFT recorder (sift/recorder) and connects to the timing back-end through a SIFT pipe (sift/) and a TraceThread (common/system/trace_thread.cc); the timing models are the same as in Pin mode. Launched via run-sniper --traces, run-sniper --mpi, or benchmarks/run-sniper --benchmarks]
PIN FRONT-END: EXAMPLE
• Instrumented application code:
001: mov $123, eax
002: mov $321, ebx
003: mov $2, ecx
004: mov (eax), edx
005: add edx, 4
006: mov edx, (ebx)
007: add $4, eax
008: add $4, ebx
009: add $-1, ecx
00a: jnz $004
• Basic-block callbacks (handleBasicBlock(0x001), handleBasicBlock(0x004)) call pm->queueBasicBlock(), feeding the performance model's BasicBlockQueue with the decoded instructions:
001: out(eax)
002: out(ebx)
003: out(ecx)
–
004: in(eax), out(edx), mem_rd(.)
005: in(edx), out(edx)
006: in(edx,ebx), mem_wr(.)
007: in(eax), out(eax)
008: in(ebx), out(ebx)
009: in(ecx), out(ecx)
00a: jump(.)
–
004: …
• Per-instruction callbacks (handleMemoryRead(value of eax), handleMemoryWrite(value of ebx), handleBranch(taken/notTaken)) call pm->pushDynamicInstructionInfo(), feeding the DynamicInstructionInfoQueue:
Addr(123) – Addr(321) – Branch(taken) – Addr(127) – Addr(325) – Branch(not taken)
THE SNIPER MULTI-CORE SIMULATOR: SIMULATOR ACCURACY AND HARDWARE VALIDATION
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
HARDWARE VALIDATION
• Why validation?
– Debugging
– Verifying modeling assumptions
– Balance between accuracy and generality
• e.g.: loop buffer in Nehalem/Westmere; uop-cache in Sandy Bridge
• Current status:
– Validated against Core2 (internal, results @ SC'11)
– Nehalem ongoing (public version)
EXPERIMENTAL SETUP
• Benchmarks
– Complete SPLASH-2 suite
• 1 to 16 threads
• Linux pthreads API
– Extensive use of microbenchmarks to tune parameters and track down problems
• Hardware
– Four-socket Intel Xeon X7460 machine
– Core2 (45nm, Penryn) with 6 cores/socket
EXPERIMENTAL SETUP: ARCHITECTURE
[Diagram: the four-socket target; each socket has six cores with private L1I/L1D caches, an L2 shared per core pair, and a shared L3, all in front of DRAM]
HINTS FOR COMPARING TO HARDWARE
• Threads are pinned to their own core: pthread_setaffinity_np()
• SpeedStep is disabled:
echo performance > /sys/devices/system/cpu/*/cpufreq/scaling_governor
• Turbo mode and Hyper-Threading disabled (BIOS setting)
• Use hardware performance counters
– But they can be difficult to interpret
– Overlapping cache misses (HW) vs. hits (Sniper)
INTERVAL PROVIDES NEEDED ACCURACY
The interval core model provides consistent accuracy (25% average absolute error) with a minimal slowdown
INTERVAL: GOOD OVERALL ACCURACY
Good accuracy for the entire benchmark suite
INTERVAL: BETTER RELATIVE ACCURACY
• Application scalability is affected by memory bandwidth
• The interval model provides more realistic memory request streams, which results in a more accurate scaling prediction
APPLICATION OPTIMIZATION
• Splash2-Raytrace shows very bad scaling behavior
• The CPI stack shows why: heavy lock contention
• Conversion to use a locked increment instruction helps
SIMULATOR PERFORMANCE
Sniper currently scales to 2 MIPS
Typical simulators run at 10s-100s of KIPS, without scaling
SYNCHRONIZATION VARIABILITY
Variability due to relaxed synchronization is application specific
SYNCHRONIZATION: SPEED VS. ACCURACY
Speed vs. simulation accuracy for barrier quanta of 10 ns, 100 ns, 1 μs, 10 μs, 100 μs and 1 ms
FLEXIBILITY TO CHOOSE NEEDED FIDELITY
MANY-CORE SIMULATIONS
High simulation speed up to 1024 simulated cores
– Efficient simulation: L1-based benchmarks execute faster
– Host system: dual-socket Xeon X5660 (6-core Westmere), 96 GB RAM
VALIDATING FOR NEHALEM
[Charts: HW measurement / Sniper result ratios, on a 0 to 2 scale, across benchmarks]
NEHALEM VALIDATION
• Currently working to validate additional modern systems
– The end goal is to improve both the interval model accuracy as well as the overall Sniper accuracy, for both the core and the uncore
– Ongoing work which unfortunately takes quite a bit of time
• Recent example: radix
– CVT* instructions were not properly modeled
• Defaulted to 1 cycle, but they actually take 3 to 6 cycles
– Adding the CVT* instructions improved the error from 33% to 4%
POWER/ENERGY VALIDATION
• Goal:
– Validate correctness of McPAT's input
– Gain confidence in relative accuracy
• Most important for HW/SW design space exploration
– Not necessarily: obtain high absolute accuracy
• Illusory at this level of abstraction!
• Validation setup:
– Dual-socket Intel Xeon (X5570 Nehalem) server
– Measure wall power using a Racktivity RC0816 PDU
POWER/ENERGY VALIDATION: METHODOLOGY
• One HW power measurement per second (see the post-processing sketch below)
– Select benchmarks with a long parallel region
– Subtract server idle power
– Apply a temperature correction (static power)
– Compute average dynamic power once thermal equilibrium is reached
• McPAT + DRAM power model
– Compute the CPU + DRAM dynamic power of the workload's parallel region
– Validate against HW measurements
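A hedged sketch of that post-processing (illustrative; modeling the temperature correction as a constant is an assumption):

# Average dynamic power from 1 Hz wall-power samples: subtract idle power
# and a static-power temperature correction, then average the samples taken
# after thermal equilibrium is reached.
def dynamic_power_w(samples_w, idle_w, temp_corr_w, equilibrium_idx):
    settled = samples_w[equilibrium_idx:]
    return sum(p - idle_w - temp_corr_w for p in settled) / len(settled)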
POWER/ENERGY VALIDATION: RESULTS
Good absolute and relative accuracy
– 8.3% average absolute error
– 16% max over SPEComp
[Chart: measured vs. predicted dynamic power (W), roughly 60 to 180 W, for SPEComp benchmarks (ammp, applu, equake, fma3d, mgrid, swim) and several heat configurations]
THE SNIPER MULTI-CORE SIMULATOR: RUNNING SIMULATIONS AND PROCESSING RESULTS
WIM HEIRMAN, TREVOR E. CARLSON, KENZO VAN CRAEYNEST, IBRAHIM HUR AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
OVERVIEW
• Obtain and compile Sniper
• Running
• Configuration
• Simulation results
• Interacting with the simulation
– SimAPI: application
– Python scripting
RUNNING SNIPER
• Download Sniper
– http://snipersim.org/w/Download
• Download the tar.gz, or git clone
• Building:
~/sniper$ export SNIPER_ROOT=$(pwd) # optional
~/sniper$ make
• Running an application:
~/sniper$ ./run-sniper -- /bin/true
~/sniper/test/fft$ make run
RUNNING SNIPER
• Integrated benchmarks distribution
– http://snipersim.org/w/Download_Benchmarks
~/benchmarks$ export BENCHMARKS_ROOT=$(pwd)
~/benchmarks$ make
~/benchmarks$ ./run-sniper -p splash2-fft -i small -n 4
• Standardizes input sets and command lines
• Includes SPLASH-2, PARSEC, the NAS Parallel Benchmarks, and integration for SPEC CPU2006
INTEGRATION WITH BENCHMARKS
• To add a new benchmark
– Add source code
– Add an __init__.py file (see the sketch below)
• Provides application invocation details
• Define input sets (e.g.: test, small, large)
– Mark the ROI region
– Simple example: see local/pi
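A hypothetical __init__.py, loosely modeled on local/pi; the exact interface is defined by the benchmarks distribution, and every name below (run_bm, the class shape, the input-set sizes) is an illustrative assumption:

# Hypothetical benchmark descriptor: maps input sets to command lines.
def allinputs():
    return ('test', 'small', 'large')

class Program:
    def __init__(self, program, nthreads, inputsize):
        self.nthreads = int(nthreads)
        self.iterations = {'test': 10**3, 'small': 10**5,
                           'large': 10**7}[inputsize]

    def run(self, graphitecmd, postcmd=''):
        # run_bm is assumed to wrap the app invocation under run-sniper
        cmd = 'mybench %d %d' % (self.nthreads, self.iterations)
        return run_bm('mybench', cmd, graphitecmd, env='', postcmd=postcmd)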
MULTI-PROGRAMMED WORKLOADS
• Recording traces (SIFT format):
$ ./record-trace -o fft -- test/fft/fft -p1
• Limit the trace, by instruction count: fast-forward (-f), detailed length (-d), block size (-b):
$ ./record-trace -o fft -f 1e9 -d 1e9 -b 1e8 -- test/fft/fft -p1 -m20
• Running traces:
$ ./run-sniper -c gainestown -n 4 --traces=gcc.sift,swim.sift,swim.sift,equake.sift
PINPLAY TRACES
• PinPlay is a record/replay framework
– A PinBall consists of a checkpoint + external input
– Replay starts from the checkpoint, then functionally executes the application
– Smaller than a SIFT trace (checkpoint + I/O vs. the full dynamic instruction stream)
• Running PinBalls:
$ ./run-sniper -n 2 --pinballs=gcc.sift,swim.sift
PINPOINTS
• The SimPoint methodology selects representative samples from a long application
– See Perelman, SIGMETRICS '03
– PinPoints = PinPlay + SimPoint
• Released the full-program and PinPoints versions of CPU2006 on the Sniper website
– http://www.snipersim.org/w/Pinballs
MPI WORKLOADS
• Run single-node (shared-memory) MPI apps
– Supports MPICH2 and derivatives (Intel MPI, etc.)
• Running an MPI app with one rank per core:
$ ./run-sniper --mpi -n 4 -- ./pi
• See test/mpi and test/mpi-omp for pure MPI and hybrid MPI+OpenMP examples
REGION OF INTEREST
• Skip benchmark initialization and cleanup
• Mark code with ROI begin / end markers
– SimRoiStart() / SimRoiEnd() in your own application
– $ ./run-sniper --roi -- test/fft/fft
• Already done in the benchmarks distribution
– benchmarks/run-sniper implies --roi
– Use --no-roi to override
• Cache warming during the pre-ROI period
– Use --no-cache-warming to override
CONFIGURATION
• Stackable configuration files (run-sniper -c) and explicit command-line options (-g)
– Template configurations in sniper/config/*.cfg (-c name)
– Your own local configuration files (-c filename.cfg)
– Explicit option: -g --section/key=value
• Multiple configuration files, and -g options, can be combined
– Config files specified later on the command line take precedence
– config/base.cfg is always included
– If no -c option is provided, config/gainestown.cfg is the default (quad-core Nehalem-based Xeon)
• The complete configuration is stored in sim.cfg after each run
CONFIGURATION
• Example configuration: largecache.cfg

[perf_model/l3_cache]
cache_size = 16384 # KB

$ run-sniper -c gainestown -c largecache.cfg

• Equivalent to:

$ run-sniper -c gainestown -g --perf_model/l3_cache/cache_size=16384
SIMULATION RESULTS
• Files created after each simulation:
– sim.cfg: all configuration options used for this run (includes defaults, all -c and -g options)
– sim.out: basic statistics (number of cycles, instructions per core, cache access and miss rates, …)
– sim.stats[.sqlite3]: complete set of all recorded statistics at key points in the simulation (start, roi-begin, roi-end, stop)
• Processing and visualization scripts in tools/
• Use the sniper_lib Python package for parsing
SIMULATION RESULTS
• sim.out: quick overview of basic performance results

                        | Core 0 | Core 1
Instructions            | 506505 | 505562
Cycles                  | 469101 | 468620
Time (ns)               | 176354 | 176173
Branch predictor stats  |        |
  num correct           |  15338 |  15205
  num incorrect         |   1280 |   1218
  misprediction rate    |  7.70% |  7.42%
  mpki                  |   2.53 |   2.41
Cache Summary           |        |
  Cache L1-I            |        |
    num cache accesses  |  46642 |  46555
    num cache misses    |    217 |    178
    miss rate           |  0.47% |  0.38%
    mpki                |   0.43 |   0.35
  Cache L1-D            |        |
    num cache accesses  | 332771 | 332412
    num cache misses    |    517 |    720
    miss rate           |  0.16% |  0.22%
    mpki                |   1.02 |   1.42
  Cache L2              |        |
    num cache accesses  |    984 |   1090
    num cache misses    |    459 |    853
    miss rate           | 46.65% | 78.26%
    mpki                |   0.91 |   1.69
SIMULATION RESULTS – TOOLS
• CPI stacks:
$ ./tools/cpistack.py [--time|--cpi|--abstime] [--aggregate]

                      CPI    CPI %    Time %
Core 0   depend-int   0.20   23.42%   23.42%
         depend-fp    0.16   18.94%   18.94%
         branch       0.12   14.04%   14.04%
         ifetch       0.04    4.16%    4.16%
         mem-l1d      0.21   24.41%   24.41%
         mem-l3       0.02    2.72%    2.72%
         mem-dram     0.05    5.73%    5.73%
         sync-mutex   0.02    2.59%    2.59%
         sync-cond    0.03    3.01%    3.01%
         other        0.01    0.97%    0.97%
         total        0.84  100.00%    0.00s
Core 1   depend-int   0.20   23.92%   23.92%
         depend-fp    0.16   18.79%   18.79%
         branch       0.12   13.72%   13.72%
         mem-l1d      0.20   24.06%   24.06%
         mem-l3       0.06    6.79%    6.79%
         sync-mutex   0.04    5.22%    5.22%
         sync-cond    0.05    5.60%    5.60%
         other        0.02    1.89%    1.89%
         total        0.85  100.00%    0.00s
SIMULATION RESULTS – TOOLS
• McPAT: power, energy and area:
$ ./tools/mcpat.py [-t static|dynamic|total|area]

              Power     Energy   Energy %
core-core     13.77 W   0.00 J    30.14%
core-ifetch    4.44 W   0.00 J     9.71%
core-int       1.22 W   0.00 J     2.67%
core-fp        1.47 W   0.00 J     3.21%
core-mem       6.30 W   0.00 J    13.78%
icache         5.55 W   0.00 J    12.16%
dcache         4.74 W   0.00 J    10.37%
l3             3.38 W   0.00 J     7.39%
dram           4.51 W   0.00 J     9.87%
other          0.32 W   0.00 J     0.70%
core          27.19 W   0.00 J    59.52%
total         45.68 W   0.01 J   100.00%
SIMULATION RESULTS – TOOLS
• Topology:
$ ./tools/gen_topology.py
SIMULATION RESULTS – TOOLS
• LLC miss latency breakdown:
$ ./tools/llcstack.py

Requests: 97485627
Total time: 85.16
  dram-bus:           0.48
  dram-device:       19.45
  dram-queue:         0.07
  inv-imbalance:      3.95
  noc-base:          31.94
  noc-queue:          2.11
  remote-cache-inv:   0.80
  remote-cache-wb:   10.15
  td-access:         16.22
SIMULATION RESULTS – TOOLS
• Per-function statistics (gprof2dot, KCacheGrind output):
$ run-sniper --profile

calls  time    t.self  icount  ipc   l2.mpki  name
1      50.06%   0.00%  50.05%  1.06  0.86     _start
1      50.06%   0.00%  50.05%  1.06  0.86     _start
1      50.06%   0.00%  50.05%  1.06  0.86     __libc_start_main
1      50.06%   0.00%  50.05%  1.06  0.86     main
1      49.11%   0.20%  49.95%  1.08  0.69     SlaveStart
1      44.16%   0.18%  46.65%  1.12  0.40     FFT1D
32     32.87%  17.91%  36.83%  1.19  0.05     FFT1DOnce
32     14.96%   1.85%  10.62%  0.75  0.08     Reverse
1024   13.12%  13.12%   8.57%  0.69  0.02     BitReverse
3       5.92%   5.92%   6.24%  1.12  1.31     Transpose
16      2.17%   2.17%   3.25%  1.58  0.18     TwiddleOneCol
1      49.94%   0.00%  49.95%  1.06  1.63     clone
1      49.94%   0.02%  49.95%  1.06  1.63     start_thread
1      49.91%   0.17%  49.94%  1.06  1.58     SlaveStart
1      43.89%   0.15%  46.60%  1.12  0.66     FFT1D
32     32.60%  17.79%  36.83%  1.20  0.06     FFT1DOnce
32     14.81%   1.79%  10.62%  0.76  0.09     Reverse
1024   13.03%  13.03%   8.57%  0.70  0.03     BitReverse
3       7.02%   7.02%   6.24%  0.94  3.34     Transpose
16      2.04%   2.04%   3.25%  1.68  0.18     TwiddleOneCol
VIZ: CYCLE STACKS IN TIME
$ run-sniper --viz
[Screenshot: interactive cycle-stack visualization: cycles (%) over time, with legend, options, markers and IPC visualization panes]
VIZ: MCPAT OUTPUT OVER TIME
3D VISUALIZATION: IPC VS. TIME VS. CORE
AUTOMATIC RESULTS PROCESSING
• Avoid manually copy/pasting simulation results, for reliability and scalability (how else will you auto-generate graphs based on 100s of runs?)
• sniper_lib.get_results() parses sim.cfg and sim.stats, and returns the configuration and statistics (roi-end minus roi-begin) for all cores:

~/sniper/tools$ python
> import sniper_lib
> results = sniper_lib.get_results(resultsdir = '..')
> print results
{'config': {'general/total_cores': '64', 'perf_model/core/frequency': '2.66', …},
 'results': {'performance_model.instruction_count': [123], 'performance_model.elapsed_time': [23000000], …}}
SIMULATION RESULTS
• Let's compute the IPC for core 0
• Core frequency is variable (DVFS), so the cycle count has to be computed
– Time is in femtoseconds, frequency in GHz

> instrs = results['results']['performance_model.instruction_count'][0]
> cycles = results['results']['performance_model.elapsed_time'][0] \
      * float(results['config']['perf_model/core/frequency']) \
      * 1e-6  # femtoseconds x GHz -> cycles
> print instrs / cycles
2.0
INTERACTING WITH SNIPER
[Diagram: the Sniper simulator takes an application (binary, input/command line) plus configuration; the application communicates through SimAPI, Python scripts hook into the simulator, and the output is statistics and visualization]
SIMAPI IMPLEMENTATION
• Magic instructions allow the application to talk to the simulator directly:

__asm__ __volatile__ (
    "xchg %%bx, %%bx\n"
    : "=a" (_res)                           /* output */
    : "a" (_cmd), "b" (_arg0), "c" (_arg1)  /* input */
    :                                       /* clobbered */
);

• Pin intercepts this instruction and passes control to the simulator
• Command and arguments are passed through the rax/rbx/rcx registers, the result in rax
APPLICATION SIMAPI
• Calling simulator API functions from your C program: #include <sim_api.h>
– SimInSimulator()
• Returns 1 when running inside Sniper, 0 when running natively
– SimGetProcId()
• Returns the processor number of the caller
– SimRoiStart() / SimRoiEnd()
• Start/end detailed mode (when using ./run-sniper --roi)
– SimSetFreqMHz(proc, mhz) / SimGetFreqMHz(proc)
• Set / get processor frequency (integer, in MHz)
– SimUser(cmd, arg)
• User-defined function
PYTHON SCRIPTING
• Scripts are run on simulator startup
– Register hooks: callbacks when certain events happen during the simulation
– See common/system/hooks_manager.h for all available hooks
• Use an existing script from sniper/scripts/*.py: ./run-sniper -s scriptname
• Or your own script: ./run-sniper -s myscriptname.py
• Use the sim package for convenience wrappers
PYTHON SCRIPTING
• Low-level script
• Execute "foo" at each barrier synchronization:

import sim_hooks
def foo(t):
    print 'The time is now', t
sim_hooks.register(sim_hooks.HOOK_PERIODIC, foo)
PYTHON SCRIPTING
• Higher-level script
• Execute a hook at each barrier synchronization:

import sim
class Class:
    def hook_periodic(self, t):
        print 'The time is now', t
sim.util.register(Class())
PYTHON SCRIPTING
• High-level script: execute a periodic callback every X ms
• Pass in the parameter using ./run-sniper -s myscript.py:X

import sim
class Class:
    def setup(self, args):
        sim.util.Every(long(args)*sim.util.Time.MS, self.periodic)
    def periodic(self, t, t_delta):
        print 'The time is now', t
        print 'Elapsed time since last call', t_delta
sim.util.register(Class())
PYTHON SCRIPTING
• Access configuration, statistics, DVFS
• Live periodic IPC trace:
– See scripts/ipctrace.py for a more complete example

class IPCTracer:
    def setup(self, args):
        sim.util.Every(1*sim.util.Time.US, self.periodic)
        self.instrs_prev = 0
    def periodic(self, t, t_delta):
        freq = sim.dvfs.get_frequency(0)
        cycles = t_delta * freq * 1e-9  # fs * MHz -> cycles
        instrs = long(sim.stats.get('performance_model', 0, 'instruction_count'))
        print 'IPC =', (instrs - self.instrs_prev) / cycles
        self.instrs_prev = instrs
PYTHON & MAGIC INSTRUCTIONS
• Communicate information between the application and a Python script
– E.g.: simulated hardware performance counters
• Application:

uint64_t ninstrs = SimUser(0xdeadbeef, SimGetProcId());

• Python script:

class PerfCtr:
    def setup(self):
        sim.util.register_command(0xdeadbeef, self.compute)
    def compute(self, arg):
        return sim.stats.get('performance_model', arg, 'instruction_count')
EXAMPLE: REGION-BASED STATISTICS
We want to know the L2 miss rate for a snippet of code
• Application: mark the code snippet with Sim[Named]Marker:

void myfunc(int a) {
    SimNamedMarker(1, "myfunc");
    // function body
    SimNamedMarker(2, "myfunc");
}

• Script: intercept the markers and write uniquely named statistics snapshots:

import sim
class Marker:
    def __init__(self):
        self.i = 0
    def hook_magic_marker(self, thread, core, a, b, s):
        sim.stats.write('marker-%s-%d-%d' % (s, a, self.i))
        if a == 2:
            self.i += 1
sim.util.register(Marker())
EXAMPLE: REGION-BASED STATISTICS
• Run the simulation:

$ ./run-sniper -cgainestown -smyscript.py -- ./myapp
$ ./tools/dumpstats.py --list

• Process the results:

import sniper_stats, sniper_lib
stats = sniper_stats.SniperStats(resultsdir = '.')
niters = max([ int(name.split('-')[-1])
               for name in stats.get_snapshots()
               if name.startswith('marker-myfunc-1-') ])
for i in range(0, niters+1):
    snapshot = sniper_lib.get_results(resultsdir = '.', partial =
        ('marker-myfunc-1-%d' % i, 'marker-myfunc-2-%d' % i))['results']
    accesses = sum(snapshot['L2.loads']) + sum(snapshot['L2.stores'])
    misses = sum(snapshot['L2.load-misses']) \
           + sum(snapshot['L2.store-misses'])
    print 'Iteration #%d: %.2f%%' % (i, 100. * misses / float(accesses))
THE SNIPER MULTI-CORE SIMULATOR: THREAD SCHEDULING
KENZO VAN CRAEYNEST, TREVOR E. CARLSON, WIM HEIRMAN AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
SCHEDULING THREADS ON HETEROGENEOUS MULTI-CORE PROCESSORS
• Multiple core types
– Different power/performance trade-offs
[Diagram: small power-efficient cores (S), intermediate cores (M), and big high-performance cores (B)]
CONFIGURING HETEROGENEOUS CONFIGURATIONS
• Traditional heterogeneous configuration options
– Comma-separated configuration parameters supported for most settings
– (Re)defining core parameters:
--perf_model/core/interval_timer/window_size=16,128,16,128
--perf_model/core/interval_timer/dispatch_width=2,4,2,4
…
• Gets complicated fast
• Error prone
A BETTER WAY: DEFINING HETEROGENEOUS CONFIGURATIONS
[Diagram: per-core-type configuration files big.cfg, medium.cfg and small.cfg map onto the big (B), intermediate (M) and small (S) cores]

./run-sniper -c big,small,big,small,medium -- /bin/ls
USING HETEROGENEITY INSIDE SNIPER
• Example: hardware scheduling
– Need to know what core types there are, and which are which
– Option 1: hardcoded
• core_id 1 is a big core, core_id 2 is a small core, …
• Hard to maintain, difficult to port for different experiments
– Option 2: identify the core type by examining core parameters
• Sim()->getCfg()->getBoolArray("perf_model/core/interval_timer/window_size", coreId)
• Not flexible
TAGS: A CONVENIENT WAY TO USE HETEROGENEITY INSIDE SNIPER
• Defining tags
– Option 1: manually assign tags to cores (or any other object)
• --tags/core/big=0,1,0,1
• --tags/core/small=1,0,1,0
– Option 2: use heterogeneous .cfg files
• Add "-c big.cfg,small,big,small" to run-sniper
• Automatically creates tags and a heterogeneous configuration
• Using tags
– Sim()->getTagsManager()->hasTag("core", coreId, "small");
– Sim()->getTagsManager()->hasTag("core", coreId, "big");
THREAD SCHEDULING
• Thread scheduling support added in Sniper 4.0
– Fully implemented inside the simulator; no application modification or re-linking needed
• Base scheduling infrastructure supports:
– An affinity mask for each thread (allowed cores)
• Initialized at thread start based on the scheduling policy
• Updated by sched_affinity system calls
• Updated by the scheduling policy, user code
– Pre-emptive round-robin scheduling
• Cores pick a new thread when the time quantum expires, or when the running thread goes to sleep
AVAILABLE SCHEDULERS
• Pinned (scheduler/type=pinned) [default]
– Threads are assigned to a single core, round-robin
– No migration
– Optional core mask for disabled cores: --scheduler/pinned/core_mask=1,0,1,1,0,0,0,1
• Roaming (scheduler/type=roaming)
– Initial affinity: all cores (with optional mask)
– Threads freely migrate to idle cores
• Static [deprecated]
– Fixed assignment, no migration
– No oversubscription (#threads <= #cores)
CUSTOM THREAD SCHEDULING
• When: periodically, or in response to hooks (thread create, stall, resume, exit)
• How, the hard way: moveThread()
– Immediately moves a thread (but watch out for safe pre-emption points, deadlocks, manual oversubscription, …)
• How, the easy way: manipulating affinity masks
– Threads will be moved as soon as it is safe; automatic oversubscription (round-robin, quantum-based)
BIGSMALL SCHEDULER
• Provided as a demo/starting point
– Uses the TagsManager to identify core types
• Recommended because of its flexibility
• Randomly reschedules threads between (multiple) big and (multiple) small cores
– By sorting cores based on random values
– Sets the affinity mask of threads to either "all small cores" or "all big cores"
– Round-robin scheduling within each core pool
MEMORY INTENSITY BASED SCHEDULER
• Threads that spend most of their time waiting for memory get scheduled on the small core type [Kumar et al., 2010]
– Common practice to achieve energy efficiency
• Starting from the BigSmall scheduler (see the sketch below):
– Use memory cycles from the core timing models
– Register them as a per-thread statistic
– Use this instead of the random value
• Other straightforward options:
– IPC [Becchi and Crowley, 2008], predicted slowdown [Van Craeynest et al., 2012]
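A sketch of the resulting policy, written here in Python for brevity (Sniper's schedulers are implemented inside the simulator in C++; the names below are illustrative):

# Memory-intensity-based assignment: the most memory-bound threads go to
# small cores, the rest fill the big cores.
def assign_threads(threads, mem_cycle_fraction, n_big):
    # Sort by fraction of time spent waiting for memory, least-bound first
    order = sorted(threads, key=lambda t: mem_cycle_fraction[t])
    return {'big': order[:n_big], 'small': order[n_big:]}

print(assign_threads([0, 1, 2, 3], {0: 0.1, 1: 0.7, 2: 0.3, 3: 0.05}, 2))
# -> {'big': [3, 0], 'small': [2, 1]}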
SCHEDULING: PER-THREAD STATISTICS
• Most statistics are collected per core (instruction count, cache misses, etc.)
• Support for per-thread statistics
– Linked to a per-core statistic
– Updated automatically when a thread moves
– Access a per-thread statistic:
Sim()->getThreadStatsManager()->getThreadStatistic(thread_id, ThreadStatsManager::INSTRUCTIONS)
– Or register your own statistic: SchedulerDynamic::ThreadStats::ThreadStats
THE SNIPER MULTI-CORE SIMULATOR: HANDS-ON DEMO
TREVOR E. CARLSON, WIM HEIRMAN, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
SNIPER DEMO
• Downloading
• Compiling
• Running a demo application
• Evaluating performance
– CPI stacks
• Configuration and run-time modifications
– Configuration files
– Python scripting
– ROI markers and magic instructions
REFERENCES
• Sniper website
– http://snipersim.org/
• Download
– http://snipersim.org/w/Download
– http://snipersim.org/w/Download_Benchmarks
• Getting started
– http://snipersim.org/w/Getting_Started
• Questions?
– http://groups.google.com/group/snipersim
– http://snipersim.org/w/Frequently_Asked_Questions
THE SNIPER MULTI-CORE SIMULATOR
WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND