TRANSCRIPT
THE SNIPER MULTI-CORE SIMULATOR
HTTP://WWW.SNIPERSIM.ORG SUNDAY, SEPTEMBER 22ND, 2013
IISWC 2013, PORTLAND

09:00 INTRODUCTION, SNIPER OVERVIEW & RATIONALE
09:45 SNIPER INTERNALS
10:00 – COFFEE BREAK –
10:30 VALIDATION RESULTS
10:45 RUNNING SIMULATIONS AND PROCESSING RESULTS
11:30 HANDS-ON DEMO
12:00 – END –
INTEL EXASCIENCE LAB
• Software and hardware for ExaFLOPS scale machines
• Collaboration between Intel, imec and 5 Flemish universities
• Study Space Weather as an HPC workload
[Diagram: the Simulation Toolkit spans Architectural Simulation, Hardware Design, Space Weather Modeling and Visualization]
THE SNIPER MULTI-CORE SIMULATOR: INTRODUCTION
WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
TRENDS IN PROCESSOR DESIGN: CACHE
Cache sizes are increasing
[Chart: LLC cache sizes (Mbytes), Jan 1993 through Dec 2014, for IPF and x86 processors, growing from near zero to tens of megabytes]
TRENDS IN PROCESSOR DESIGN: CORES
Number of cores per node is increasing
– 2001: Dual-core POWER4
– 2005: Dual-core AMD Opteron
– 2011: 10-core Intel Xeon Westmere-EX
– 2012: Intel MIC Knights Corner (60+ cores)
SIMULATION
• Design tomorrow's processor using today's hardware
• Simulation
– Obtain performance characteristics for new architectures
– Architectural exploration
– Early software optimization
DEMANDS ON SIMULATION ARE INCREASING
Increasing core counts
– Linear increase in simulator workload
– Single-threaded simulator sees a rising gap
• workload: increasing target cores
• available processing power: near-constant single-thread performance of host machine
– Need to use all cores of the host machine
→ Parallel simulation
DEMANDS ON SIMULATION ARE INCREASING
Increasing cache size
– Need a large working set to exercise large caches
• Scaled-down applications won't exhibit the same behavior
– Application designers want to see full-program behavior
→ Long-running simulations are required
UPCOMING CHALLENGES
• Future systems will be diverse
– Varying processor speeds
– Varying failure rates for different components
– Homogeneous applications become heterogeneous
• Software and hardware solutions are needed to solve these challenges
– Handle heterogeneity (reactive load balancing)
– Be fault tolerant
– Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models
FAST AND ACCURATE SIMULATION IS NEEDED
• Simulation use cases
– Architecture exploration
– Pre-silicon software optimization
– [Validation]
• Cycle-accurate simulation is too slow for exploring the multi/many-core design space and software
• Key questions
– Can we raise the level of abstraction?
– What is the right level of abstraction?
– When to use these abstraction models?
EXPERIMENT DESIGN IN ARCHITECTURE EXPLORATION/EVALUATION
• Optimizing the probability of success (i.e., finding the best architecture/parameters):
– Coverage: how many architecture configurations can I run
– Confidence: # benchmarks, re-runs for variable applications
– Accuracy: simulation model detail vs. runtime
• How many scenarios can I run?
– N = total number of simulation scenarios
– d = days until paper deadline
– t = average time per simulation
– B = number of benchmarks
– A = number of architectures

N = d / (t × B × A): minimize t to maximize N (a worked example follows below)
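As a quick sanity check of the budget formula, a minimal worked example (all numbers made up):

# Worked example of N = d / (t * B * A), with made-up numbers:
d = 30.0        # days until the paper deadline
t = 3.0 / 24.0  # average time per simulation: 3 hours, in days
B = 20          # number of benchmarks
A = 4           # number of architectures
N = d / (t * B * A)
print(N)        # -> 3.0: only three full sweeps fit before the deadline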
FAST OR ACCURATE SIMULATION?
[Figure: predicted performance across architectures A through E. The cycle-accurate simulator covers only the first design points before time runs out (the rest remain '?'); the higher-abstraction-level simulator covers all five design points]
THE ARCHITECTURE DESIGN WATERFALL
[Figure: the architecture design waterfall. Over the design process, analytical models, high-level simulation and cycle-accurate simulation narrow the number of architectures considered from roughly 10^10 down to 1, while the benchmarks/applications evolve from program characteristics to representative applications to traces/microbenchmarks. High-level simulation also feeds pre-silicon software optimization and co-design]
NEEDED DETAIL DEPENDS ON FOCUS
[Figure: needed detail depends on focus. Components span RTL, OOO execution, core memory ops, L1 cache access, LLC access and off-socket communication; single-event time scales range from a single clock cycle up to microseconds, and the required simulated time from millions of cycles up to seconds. Full-detail (RTL) simulation is too slow for the larger components; overly abstract models are not accurate enough for the smaller ones]
ONE-IPC MODELING – TOO SIMPLE?
• Simple high-abstraction model, often used in uncore studies
• Alternative for memory access traces
– Aims to provide more-realistic access patterns
– Allows for timing feedback
• But: one-IPC core models do not exhibit ILP/MLP (see the sketch below)
– Memory request rates are not as accurate as more detailed simulators
– # outstanding requests incorrect: underestimate required queue sizes
– No latency is hidden: overestimate runtime improvements
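To make the critique concrete, a toy sketch of what one-IPC costing implies (illustrative only, not any simulator's actual model):

# One-IPC sketch: each instruction costs one cycle, and every miss adds its
# full latency serially. Nothing overlaps, so there is no ILP and no MLP:
# queue occupancy never exceeds one request, and no miss latency is hidden.
def one_ipc_cycles(n_instructions, n_misses, miss_latency):
    return n_instructions + n_misses * miss_latency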
DETAILED MODEL VS. INTERVAL SIM
[Diagram: a detailed model simulates the full pipeline (fetch, BP, I$, decode, issue queue, ROB, LSQ, execution units, commit) on top of a functional simulator, cache hierarchy and DRAM. Interval simulation keeps only the cache hierarchy and DRAM in detail and models the core analytically]
Balanced architecture; main factors in core performance:
• Parallelism exposed by the ROB
• Miss events
Out-of-order core performance model with in-order simulation speed
INTERVAL SIMULATION
[Figure: effective dispatch rate over time, split into intervals by miss events: an I-cache miss ends interval 1, a branch misprediction ends interval 2, and a long-latency load miss ends interval 3]
D. Genbrugge et al., HPCA'10; S. Eyerman et al., ACM TOCS, May 2009; T. Karkhanis and J. E. Smith, ISCA'04 and ISCA'07
INTERVAL SIMULATION FROM 30,000 FEET
Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
– Instruction cache / TLB miss
– Branch misprediction (not dispatching useful instructions)
– Front-end refill after misprediction
– ROB full: long-latency miss at head of ROB
INTERVAL SIMULATION FROM 30,000 FEET
Interval simulation considers instructions (in-order) at dispatch
• dispatch not possible
• dispatch possible: at a rate governed by the ROB
– Little's law: progress rate = #elements / time spent in queue (see the sketch below)
– Computed using the ROB fill and the critical path through the ROB
• Computed using dynamic instruction dependencies and latencies
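A minimal sketch of this dispatch rule, assuming a hypothetical estimate_critical_path() helper that returns the critical-path length (in cycles) through the instructions currently in the ROB:

# Illustrative interval-dispatch sketch (not Sniper's actual code): absent
# miss events, the ROB drains at rate = occupancy / time-in-queue
# (Little's law), capped by the front-end dispatch width.
def interval_cycles(instructions, rob_size, dispatch_width):
    critical_path = estimate_critical_path(instructions)  # hypothetical helper
    rate = min(dispatch_width, float(rob_size) / critical_path)  # instrs/cycle
    return len(instructions) / rate  # cycles to dispatch this interval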
KEY BENEFITS OF THE INTERVAL MODEL
• Models superscalar OOO execution
• Models the impact of ILP
• Models second-order effects: MLP
• Allows for constructing CPI stacks
CYCLE STACKS
• Where did my cycles go?
• CPI stack
– Cycles per instruction
– Broken up in components
• Normalize by either
– Number of instructions (CPI stack)
– Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance
[Figure: example CPI stack with components Base, Branch, I-cache and L2 cache]
CONSTRUCTING CPI STACKS
• Interval simulation: track why time is advanced (see the sketch below)
– No miss events
• Dispatch instructions at the base CPI
• Increment the base component
– Miss event
• Fast-forward time by X cycles
• Increment that component by X
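A minimal sketch of this bookkeeping, with made-up component names:

# CPI-stack accounting during interval simulation (illustrative only):
from collections import defaultdict
cycles_by_component = defaultdict(int)

def dispatch(n_instructions, base_cpi):
    # No miss event: time advances at the base dispatch rate
    cycles_by_component['base'] += n_instructions * base_cpi

def miss_event(component, penalty):
    # Miss event: fast-forward time by the penalty, charged to its component
    cycles_by_component[component] += penalty  # e.g. 'branch', 'mem-dram'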
CYCLE STACKS FOR PARALLEL APPLICATIONS
By thread: heterogeneous behavior in a homogeneous application?
[Diagram: two quad-core sockets, each with private L1s, shared L2s and a shared L3, connected to DRAM; per-thread cycle stacks reveal where each thread's time goes]
CYCLE STACKS AND SCALING BEHAVIOR
• Scaling to more cores, larger input set size
• How does execution time scale, and why?
CYCLE STACKS AND SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
CYCLE STACKS AND SCALING BEHAVIOR
• Scale input: application becomes DRAM bound
• Scale core count: sync losses increase to 20%
SNIPER: A FAST AND ACCURATE SIMULATOR
• Hybrid simulation approach
– Analytical interval core model
– Micro-architecture structure simulation
• branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org
SIMULATION IN SNIPER
[Diagram: a functional simulator (Pin) drives the processor core models together with the branch predictor and memory hierarchy simulators. A single-process, multithreaded workload (v1.0) uses execution-driven simulation; multiple single-threaded workloads (v2.0) use trace-driven simulation]
SIMULATION IN SNIPER WITH SIFT
[Diagram: multiple multi-threaded workloads (v3.03), each running under its own Pin instance, connect to the processor core and memory hierarchy simulators over bi-directional single-thread SIFT connections: functional-directed simulation with timing feedback. PinPlay support was added in v4.2]
MANY ARCHITECTURE OPTIONS
[Diagram: example target architectures, from multi-socket designs with private L1I/L1D caches, shared L2s and per-socket L3s in front of DRAM, to a many-core tiled design with a private L1I/L1D and L2 per core, connected through a NoC]
THREAD SCHEDULING AND MIGRATION
[Diagram: a scheduler maps a pool of threads onto cores, each with private L1s and L2s and a shared L3. Thread migration support was added in v4.0]
ADVANCED VISUALIZATION
• Cycle stacks through time
TOP SNIPER FEATURES
• Interval Model
• CPI Stacks and Interactive Visualization
• Parallel Multithreaded Simulator
• x86-64 and SSE2 support
• Validated against Core2, Nehalem
• Thread scheduling and migration
• Full DVFS support
• Shared and private caches
• Modern branch predictor
• Supports pthreads and OpenMP, TBB, OpenCL, MPI, …
• SimAPI and Python interfaces to the simulator
• Many flavors of Linux supported (Redhat, Ubuntu, etc.)
SIMULATOR COMPARISON
                          Sniper   Graphite   gem5              COTSon   MARSSx86
Integrated                                    X
Func-directed             X        X                            X        X
User-level                X        X          X
Full-system                                   X                 X        X
Archs supported           x64      x64        x64/Alpha/SPARC   x64      x64
Parallel (in-node)        X        X
Flexible cache hierarchy  X        X          X                 X
SNIPER LIMITATIONS
• User-level
– Perfect for HPC (hybrid MPI/OpenMP applications)
– Not the best match for workloads with significant OS involvement
• Functional-directed
– No simulation / cache accesses along false paths
• High-abstraction core model
– Not suited to model all effects of core-level changes
– Perfect for memory subsystem or NoC work
• x86 only
SNIPER HISTORY
• November 2011: SC'11 paper, first public release
• March 2012, version 2.0: Multi-program workloads
• May 2012, version 3.0: Heterogeneous architectures
• November 2012, version 4.0: Thread scheduling and migration
• April 2013, version 5.0: Multi-threaded application sampling
• June 2013, version 5.1: Suggestions for optimization visualization
• September 2013, version 5.2: MESI/F, 2-level TLBs, Python scheduling
• Today: 700+ downloads from 60 countries
NEW IN RELEASE 5.2: MESI/F COHERENCY
• Added MESI and MESIF coherency protocols (see the toy transition table below)
– EXCLUSIVE state: allows for silent upgrades
– FORWARDING state: special case of SHARED, allows for cache-to-cache transfers of shared data
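A toy write-hit transition table illustrating why the extra states help (not Sniper's implementation; state and action names here are made up):

# On a write hit, E upgrades to M silently (no coherence traffic), while
# S and F must first invalidate the other sharers. F additionally marks
# which SHARED copy answers read requests with a cache-to-cache transfer.
WRITE_HIT = {
    'M': ('M', None),                  # already writable
    'E': ('M', None),                  # silent upgrade: the key E benefit
    'S': ('M', 'invalidate sharers'),
    'F': ('M', 'invalidate sharers'),
}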
[Charts: normalized run-time of npb/A under MSI, MESI and MESIF, on a Xeon Phi-like configuration (roughly 0.96 to 1.0) and a Xeon/i7-like configuration (roughly 0.985 to 1.0)]
USE CASE #1: ARCHITECTURE EXPLORATION
• Many architectural options for a next-generation architecture (45nm to 22nm)
– Compare performance, energy efficiency
[Diagram: design points against a 2x quad-core baseline: 8 cores; 16 cores with no L3 and stacked DRAM; 16 slow (low-frequency) cores; 16 thin (dual-issue) cores; each point trades core area against cache]
Heirman et al., PACT’12
USE CASE #1: ARCHITECTURE EXPLORATION
Performance:
– 3D (no LLC) is good when streaming, bad when sharing
– Dual-issue is good when compute bound, bad when memory bound (a smaller ROB exposes less MLP!)
USE CASE #1: ARCHITECTURE EXPLORATION
Power:
– 3D reduces DRAM energy, but the lack of an LLC may increase total energy
– Low-frequency can be significantly more energy-efficient
USE CASE #2: HW/SW CO-OPTIMIZATION
• Heat transfer: stencil on a regular grid
– Important kernel, part of the Berkeley Dwarfs (structured grid)
• Improve memory locality: tiling over multiple time steps
– Trade off locality with redundant computation
– Optimum depends on the relative cost (performance & energy) of computation and data transfer → requires an integrated simulator
USE CASE #2: HW/SW CO-OPTIMIZATION
• Traditionally:
– Architects design hardware based on fixed benchmarks
– Programmers optimize software for today's architecture
• Software parameters depend on the architecture!
• Co-optimization yields 66% more performance, or 25% more energy efficiency, than isolated optimization
[Figure 9: hardware/software co-design for the heat benchmark when optimizing for (a) performance (simulated time steps per second) and (b) energy efficiency (simulated time steps per Joule). Four architecture design points (8-core, 3D, low-frequency, dual-issue) are considered while varying software parameters such as tile size (32 to 512, see legend) and arithmetic intensity in FLOP/byte (horizontal axes)]
for performance and power to explore these trade-offs and hardware/software interactions.

7.3 Co-design analysis
Figure 9 plots the simulation results for hardware/software co-design for the heat benchmark application. The grid domain is 4096×4096 elements. This domain is split up into square tiles measuring between 32 and 512 data points on each side. The complete domain is 128 MB in total and does not fit in the last-level cache for any of the architectures considered, so tiles always have to be loaded from main memory. The number of time steps performed on each tile before moving on to the next one varies between 1 and 65 steps. The graphs plot the achieved number of time steps per second, or per Joule of consumed energy, as a function of arithmetic intensity. The performance graphs in Figure 9(a) follow the basic roofline model from Figure 8: initially, increasing the number of time steps improves performance, which later falls back once the amount of redundant computation becomes too high. Additionally, each architecture has an optimal tile size which maximizes data reuse while still fitting in the cache. For example, the 8-core design point reaches a performance level of around 150 simulated time steps per second, using a tile size of 128×128. The working set of this application is two tiles' worth of data, corresponding to the previous and current time steps. At one double-precision floating-point number or 8 bytes per element, the working set of 256 KB for the 128²-sized tile fits in a core's private 512 KB L2 cache, whereas for the larger 256² tiles it does not, which makes performance significantly lower than that predicted by the roofline model. The three other architectures, which have an L2 cache of only 256 KB, reach their optimal performance using the smaller 64² tiles.
If we consider the effect of arithmetic intensity on energy rather than performance, we see that generally the optimum has shifted towards the left, meaning that less redundant work, and a slightly increased access rate to main memory, is preferred. This is because the ratio between the access times of caches and DRAM is very high, making the performance aspect of more redundant work cheap when compared to extra DRAM accesses. On the other hand, when comparing the energy cost of DRAM accesses versus that of extra computation, the ratio is lower, which makes the relative cost of the redundant work higher. Note that this extra cost consists of not just extra floating-point operations, which are in themselves very cheap (in the order of 0.5 nJ per double-precision operation, vs. 100 nJ per DRAM access¹), but also the cost of extra instructions flowing through all stages of the out-of-order pipeline, extra cache accesses, etc. (In Figure 5, floating-point ALU energy represents only a small fraction of total core energy.)
When considering performance alone, the 3D architecture clearly wins for this benchmark. The additional bandwidth of the 3D stacked memory allows for a steeper performance slope in the leftmost part of the performance graph, while the availability of 16 full-sized cores results in the highest peak performance across all architecture design points considered. Yet, tiles have to be kept small enough such that they fit in the L2 cache; this architecture does not have an L3 cache, so the cost of L2 misses, which become significant with tile sizes larger than 64², is much higher here than in the other architectures.
On the other hand, the low-frequency architecture, while being less high-performance than both the 8-core and 3D design points, reaches the highest energy efficiency. Because this application scales fairly well with core count, and the power envelope of the low-frequency chip is still modest, one could even consider another doubling of the number of cores, which should almost double performance at very little additional cost in energy.
¹According to the relevant component in the dynamic energy stacks computed by Sniper/McPAT, scaled by the number of FP operations or DRAM accesses throughout the benchmark, respectively.
USE CASE #3: SCHEDULING HETERO. CMPS
• Heterogeneous CMPs provide a flexible power/performance trade-off (e.g. ARM big.LITTLE)
• Throughput-optimized schedulers place the thread with the highest benefit on big cores
• This does not help parallel applications: fast threads need to wait for slow threads
[Diagram: big (B) high-performance cores alongside many small (S) power-efficient cores; chart of normalized runtime]
USE CASE #3: SCHEDULING HETERO. CMPS
• The equal-progress scheduler estimates the speedup/slowdown of a thread if it were to run on the other core type
(Van Craeynest et al., PACT'13)
[Diagram: CPI on the big core is estimated from CPI_small via the MLP change, and CPI on the small core from CPI_big via the ILP change; each CPI is decomposed into MLP and ILP components]
USE CASE #3: SCHEDULING HETERO. CMPS
• The equal-progress scheduler estimates the speedup/slowdown of a thread if it were to run on the other core type
• Matches progress (= the hypothetical speed if the thread had always run on a big core)
• Results in a performance improvement, even for heterogeneous applications
(Van Craeynest et al., PACT'13)
THE SNIPER MULTI-CORE SIMULATOR: SIMULATOR INTERNALS
WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
RELAXED SYNCHRONIZATION
• Graphite introduced relaxed synchronization with a number of different synchronization schemes
– none: only synchronizes when the application does; for pthread calls, etc.
– random-pairs: synchronizes random pairs of threads
– barrier: synchronizes all threads at a given simulated time interval
• Sniper defaults to barrier synchronization with 100ns intervals
– Multi-machine mode is not supported, so tight synchronization is easier
BARRIER SYNCHRONIZATION IN ACTION
[Figure: real time vs. simulated time for several threads; all threads stop at each periodic barrier, while pthread_cond_wait / pthread_cond_signal interactions between threads happen in between barriers]
PARALLELISM INSIDE SNIPER
• Each simulated thread corresponds to a thread on the host
– Functional simulation (Pin mode) / reading a trace (SIFT mode)
– Executes the timing model for the core (and associated caches) on which the thread currently runs (when scheduled)
– Each core model maintains its own local time
• Extra threads for the network and DRAM models
– Can process invalidation requests without interrupting the core model
• Each thread can independently make progress (see the sketch below)
– Causality errors can occur, no rollback
– Skew is limited to 100ns
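A toy sketch of the skew bound behind the barrier scheme (illustrative only; Sniper's real implementation differs):

# Quantum-based barrier synchronization: each simulated thread advances its
# own local clock, but may only run ahead of the slowest thread by one
# barrier quantum, which bounds the maximum skew (and thus causality error).
QUANTUM_NS = 100

def may_proceed(local_times_ns, tid):
    # Thread tid waits at the barrier once it is a full quantum ahead
    return local_times_ns[tid] < min(local_times_ns) + QUANTUM_NS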
THREADS IN SNIPER
[Diagram: application threads each drive a core with its L1 I/D and L2 models; separate network threads handle the shared L3, network interface (NI) and DRAM]
TIME IN SNIPER
• Each memory access instantly returns its latency
• Application threads maintain time
• Network threads reset time for each request
[Diagram: application threads carry local time forward through the core and cache models (t=0, 1, 4, 5, 10, 15, …), while a network thread handling L3/NI/DRAM resets its time for each incoming request (t=20, 22, 24, 26, 27, 28, 29, 30)]
MODELING CONTENTION
• Events may happen out of order
• How to model bandwidth / contention? – History list (see the sketch below)
– Resource in use at times 0…10, 12…17, 25…30
– Access at 15: delay = 2
– Access at 8, length 5: ?
• Causality errors are possible
– Effect is limited, as long as average bandwidth is OK
– Allows for faster simulation, easier implementation
– Speed versus accuracy trade-off
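A minimal sketch of a history-list contention model under the slide's example (not Sniper's code; the interface is made up):

# A resource records the intervals during which it was busy; an access that
# lands inside a busy interval is delayed until that interval ends.
class HistoryList:
    def __init__(self, busy):
        self.busy = list(busy)  # (start, end) intervals the resource is in use

    def delay(self, t):
        # Remaining busy time of the interval containing t, if any
        overlapping = [end - t for start, end in self.busy if start <= t < end]
        return overlapping[0] if overlapping else 0

h = HistoryList([(0, 10), (12, 17), (25, 30)])
print(h.delay(15))  # -> 2: busy until t=17
# An out-of-order access at t=8 of length 5 overlaps the *next* interval;
# accepting some error there is the speed-versus-accuracy trade-off.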
SIMULATOR ARCHITECTURE: PIN FRONT-END
[Diagram: functional simulation/instrumentation lives in pin/; the timing model in common/, with core performance models in common/performance_model, cache performance models in common/core/memory_subsystem, NoC performance models in common/network, and thread scheduling in common/scheduler. All application threads run in a single process, launched via run-sniper -- ... or benchmarks/run-sniper -p]
SIMULATOR ARCHITECTURE: SIFT MODE
[Diagram: each application runs under its own SIFT recorder (sift/recorder) and connects to the timing back-end through a SIFT pipe (sift/) and a TraceThread (common/system/trace_thread.cc); the timing models are the same as in Pin mode. Launched via run-sniper --traces, run-sniper --mpi, or benchmarks/run-sniper --benchmarks]
PIN FRONT-END: EXAMPLE
• Instrumented application code:
001: mov $123, eax
002: mov $321, ebx
003: mov $2, ecx
004: mov (eax), edx
005: add edx, 4
006: mov edx, (ebx)
007: add $4, eax
008: add $4, ebx
009: add $-1, ecx
00a: jnz $004
• Basic-block callbacks (handleBasicBlock(0x001), handleBasicBlock(0x004)) call pm->queueBasicBlock(), feeding the performance model's BasicBlockQueue with the decoded instructions:
001: out(eax)
002: out(ebx)
003: out(ecx)
–
004: in(eax), out(edx), mem_rd(.)
005: in(edx), out(edx)
006: in(edx,ebx), mem_wr(.)
007: in(eax), out(eax)
008: in(ebx), out(ebx)
009: in(ecx), out(ecx)
00a: jump(.)
–
004: …
• Per-instruction callbacks (handleMemoryRead(value of eax), handleMemoryWrite(value of ebx), handleBranch(taken/notTaken)) call pm->pushDynamicInstructionInfo(), feeding the DynamicInstructionInfoQueue:
Addr(123) – Addr(321) – Branch(taken) – Addr(127) – Addr(325) – Branch(not taken)
THE SNIPER MULTI-CORE SIMULATOR: SIMULATOR ACCURACY AND HARDWARE VALIDATION
TREVOR E. CARLSON, WIM HEIRMAN, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
HARDWARE VALIDATION
• Why validation?
– Debugging
– Verifying modeling assumptions
– Balance between accuracy and generality
• e.g.: loop buffer in Nehalem/Westmere; uop-cache in Sandy Bridge
• Current status:
– Validated against Core2 (internal, results @ SC'11)
– Nehalem ongoing (public version)
EXPERIMENTAL SETUP
• Benchmarks
– Complete SPLASH-2 suite
• 1 to 16 threads
• Linux pthreads API
– Extensive use of microbenchmarks to tune parameters and track down problems
• Hardware
– Four-socket Intel Xeon X7460 machine
– Core2 (45nm, Penryn) with 6 cores/socket
EXPERIMENTAL SETUP: ARCHITECTURE
[Diagram: the four-socket target; each socket has six cores with private L1I/L1D caches, an L2 shared per core pair, and a shared L3, all in front of DRAM]
HINTS FOR COMPARING TO HARDWARE
• Threads are pinned to their own core: pthread_setaffinity_np()
• SpeedStep is disabled:
echo performance > /sys/devices/system/cpu/*/cpufreq/scaling_governor
• Turbo mode and Hyper-Threading disabled (BIOS setting)
• Use hardware performance counters
– But they can be difficult to interpret
– Overlapping cache misses (HW) vs. hits (Sniper)
INTERVAL PROVIDES NEEDED ACCURACY
The interval core model provides consistent accuracy (25% average absolute error) with a minimal slowdown
INTERVAL: GOOD OVERALL ACCURACY
Good accuracy for the entire benchmark suite
INTERVAL: BETTER RELATIVE ACCURACY
• Application scalability is affected by memory bandwidth
• The interval model provides more realistic memory request streams, which results in a more accurate scaling prediction
APPLICATION OPTIMIZATION
• Splash2-Raytrace shows very bad scaling behavior
• The CPI stack shows why: heavy lock contention
• Conversion to use a locked increment instruction helps
SIMULATOR PERFORMANCE
Sniper currently scales to 2 MIPS
Typical simulators run at 10s-100s of KIPS, without scaling
SYNCHRONIZATION VARIABILITY
Variability due to relaxed synchronization is application specific
SYNCHRONIZATION: SPEED VS. ACCURACY
Speed vs. simulation accuracy for barrier quanta of 10 ns, 100 ns, 1 μs, 10 μs, 100 μs and 1 ms
FLEXIBILITY TO CHOOSE NEEDED FIDELITY
MANY-CORE SIMULATIONS
High simulation speed up to 1024 simulated cores
– Efficient simulation: L1-based benchmarks execute faster
– Host system: dual-socket Xeon X5660 (6-core Westmere), 96 GB RAM
VALIDATING FOR NEHALEM
[Charts: HW measurement / Sniper result ratios, on a 0 to 2 scale, across benchmarks]
NEHALEM VALIDATION
• Currently working to validate additional modern systems
– The end goal is to improve both the interval model accuracy as well as the overall Sniper accuracy, for both the core and the uncore
– Ongoing work which unfortunately takes quite a bit of time
• Recent example: radix
– CVT* instructions were not properly modeled
• Defaulted to 1 cycle, but they actually take 3 to 6 cycles
– Adding the CVT* instructions improved the error from 33% to 4%
POWER/ENERGY VALIDATION
• Goal:
– Validate correctness of McPAT's input
– Gain confidence in relative accuracy
• Most important for HW/SW design space exploration
– Not necessarily: obtain high absolute accuracy
• Illusory at this level of abstraction!
• Validation setup:
– Dual-socket Intel Xeon (X5570 Nehalem) server
– Measure wall power using a Racktivity RC0816 PDU
POWER/ENERGY VALIDATION: METHODOLOGY
• One HW power measurement per second (see the post-processing sketch below)
– Select benchmarks with a long parallel region
– Subtract server idle power
– Apply a temperature correction (static power)
– Compute average dynamic power once thermal equilibrium is reached
• McPAT + DRAM power model
– Compute the CPU + DRAM dynamic power of the workload's parallel region
– Validate against HW measurements
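A hedged sketch of that post-processing (illustrative; modeling the temperature correction as a constant is an assumption):

# Average dynamic power from 1 Hz wall-power samples: subtract idle power
# and a static-power temperature correction, then average the samples taken
# after thermal equilibrium is reached.
def dynamic_power_w(samples_w, idle_w, temp_corr_w, equilibrium_idx):
    settled = samples_w[equilibrium_idx:]
    return sum(p - idle_w - temp_corr_w for p in settled) / len(settled)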
POWER/ENERGY VALIDATION: RESULTS
Good absolute and relative accuracy
– 8.3% average absolute error
– 16% max over SPEComp
[Chart: measured vs. predicted dynamic power (W), roughly 60 to 180 W, for SPEComp benchmarks (ammp, applu, equake, fma3d, mgrid, swim) and several heat configurations]
THE SNIPER MULTI-CORE SIMULATOR: RUNNING SIMULATIONS AND PROCESSING RESULTS
WIM HEIRMAN, TREVOR E. CARLSON, KENZO VAN CRAEYNEST, IBRAHIM HUR AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
OVERVIEW
• Obtain and compile Sniper
• Running
• Configuration
• Simulation results
• Interacting with the simulation
– SimAPI: application
– Python scripting
RUNNING SNIPER
• Download Sniper
– http://snipersim.org/w/Download
• Download the tar.gz, or git clone
• Building:
~/sniper$ export SNIPER_ROOT=$(pwd) # optional
~/sniper$ make
• Running an application:
~/sniper$ ./run-sniper -- /bin/true
~/sniper/test/fft$ make run
RUNNING SNIPER
• Integrated benchmarks distribution
– http://snipersim.org/w/Download_Benchmarks
~/benchmarks$ export BENCHMARKS_ROOT=$(pwd)
~/benchmarks$ make
~/benchmarks$ ./run-sniper -p splash2-fft -i small -n 4
• Standardizes input sets and command lines
• Includes SPLASH-2, PARSEC, the NAS Parallel Benchmarks, and integration for SPEC CPU2006
INTEGRATION WITH BENCHMARKS
• To add a new benchmark
– Add source code
– Add an __init__.py file (see the sketch below)
• Provides application invocation details
• Define input sets (e.g.: test, small, large)
– Mark the ROI region
– Simple example: see local/pi
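A hypothetical __init__.py, loosely modeled on local/pi; the exact interface is defined by the benchmarks distribution, and every name below (run_bm, the class shape, the input-set sizes) is an illustrative assumption:

# Hypothetical benchmark descriptor: maps input sets to command lines.
def allinputs():
    return ('test', 'small', 'large')

class Program:
    def __init__(self, program, nthreads, inputsize):
        self.nthreads = int(nthreads)
        self.iterations = {'test': 10**3, 'small': 10**5,
                           'large': 10**7}[inputsize]

    def run(self, graphitecmd, postcmd=''):
        # run_bm is assumed to wrap the app invocation under run-sniper
        cmd = 'mybench %d %d' % (self.nthreads, self.iterations)
        return run_bm('mybench', cmd, graphitecmd, env='', postcmd=postcmd)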
MULTI-PROGRAMMED WORKLOADS
• Recording traces (SIFT format):
$ ./record-trace -o fft -- test/fft/fft -p1
• Limit the trace, by instruction count: fast-forward (-f), detailed length (-d), block size (-b):
$ ./record-trace -o fft -f 1e9 -d 1e9 -b 1e8 -- test/fft/fft -p1 -m20
• Running traces:
$ ./run-sniper -c gainestown -n 4 --traces=gcc.sift,swim.sift,swim.sift,equake.sift
PINPLAY TRACES
• PinPlay is a record/replay framework
– A PinBall consists of a checkpoint + external input
– Replay starts from the checkpoint, then functionally executes the application
– Smaller than a SIFT trace (checkpoint + I/O vs. the full dynamic instruction stream)
• Running PinBalls:
$ ./run-sniper -n 2 --pinballs=gcc.sift,swim.sift
PINPOINTS
• The SimPoint methodology selects representative samples from a long application
– See Perelman, SIGMETRICS '03
– PinPoints = PinPlay + SimPoint
• Released the full-program and PinPoints versions of CPU2006 on the Sniper website
– http://www.snipersim.org/w/Pinballs
MPI WORKLOADS
• Run single-node (shared-memory) MPI apps
– Supports MPICH2 and derivatives (Intel MPI, etc.)
• Running an MPI app with one rank per core:
$ ./run-sniper --mpi -n 4 -- ./pi
• See test/mpi and test/mpi-omp for pure MPI and hybrid MPI+OpenMP examples
REGION OF INTEREST
• Skip benchmark initialization and cleanup
• Mark code with ROI begin / end markers
– SimRoiStart() / SimRoiEnd() in your own application
– $ ./run-sniper --roi -- test/fft/fft
• Already done in the benchmarks distribution
– benchmarks/run-sniper implies --roi
– Use --no-roi to override
• Cache warming during the pre-ROI period
– Use --no-cache-warming to override
CONFIGURATION
• Stackable configuration files (run-sniper -c) and explicit command-line options (-g)
– Template configurations in sniper/config/*.cfg (-c name)
– Your own local configuration files (-c filename.cfg)
– Explicit option: -g --section/key=value
• Multiple configuration files, and -g options, can be combined
– Config files specified later on the command line take precedence
– config/base.cfg is always included
– If no -c option is provided, config/gainestown.cfg is the default (quad-core Nehalem-based Xeon)
• The complete configuration is stored in sim.cfg after each run
CONFIGURATION
• Example configuration: largecache.cfg

[perf_model/l3_cache]
cache_size = 16384 # KB

$ run-sniper -c gainestown -c largecache.cfg

• Equivalent to:

$ run-sniper -c gainestown -g --perf_model/l3_cache/cache_size=16384
SIMULATION RESULTS
• Files created after each simulation:
– sim.cfg: all configuration options used for this run (includes defaults, all -c and -g options)
– sim.out: basic statistics (number of cycles, instructions per core, cache access and miss rates, …)
– sim.stats[.sqlite3]: complete set of all recorded statistics at key points in the simulation (start, roi-begin, roi-end, stop)
• Processing and visualization scripts in tools/
• Use the sniper_lib Python package for parsing
SIMULATION RESULTS
• sim.out: quick overview of basic performance results

                        | Core 0 | Core 1
Instructions            | 506505 | 505562
Cycles                  | 469101 | 468620
Time (ns)               | 176354 | 176173
Branch predictor stats  |        |
  num correct           |  15338 |  15205
  num incorrect         |   1280 |   1218
  misprediction rate    |  7.70% |  7.42%
  mpki                  |   2.53 |   2.41
Cache Summary           |        |
  Cache L1-I            |        |
    num cache accesses  |  46642 |  46555
    num cache misses    |    217 |    178
    miss rate           |  0.47% |  0.38%
    mpki                |   0.43 |   0.35
  Cache L1-D            |        |
    num cache accesses  | 332771 | 332412
    num cache misses    |    517 |    720
    miss rate           |  0.16% |  0.22%
    mpki                |   1.02 |   1.42
  Cache L2              |        |
    num cache accesses  |    984 |   1090
    num cache misses    |    459 |    853
    miss rate           | 46.65% | 78.26%
    mpki                |   0.91 |   1.69
SIMULATION RESULTS – TOOLS
• CPI stacks:
$ ./tools/cpistack.py [--time|--cpi|--abstime] [--aggregate]

                      CPI    CPI %    Time %
Core 0   depend-int   0.20   23.42%   23.42%
         depend-fp    0.16   18.94%   18.94%
         branch       0.12   14.04%   14.04%
         ifetch       0.04    4.16%    4.16%
         mem-l1d      0.21   24.41%   24.41%
         mem-l3       0.02    2.72%    2.72%
         mem-dram     0.05    5.73%    5.73%
         sync-mutex   0.02    2.59%    2.59%
         sync-cond    0.03    3.01%    3.01%
         other        0.01    0.97%    0.97%
         total        0.84  100.00%    0.00s
Core 1   depend-int   0.20   23.92%   23.92%
         depend-fp    0.16   18.79%   18.79%
         branch       0.12   13.72%   13.72%
         mem-l1d      0.20   24.06%   24.06%
         mem-l3       0.06    6.79%    6.79%
         sync-mutex   0.04    5.22%    5.22%
         sync-cond    0.05    5.60%    5.60%
         other        0.02    1.89%    1.89%
         total        0.85  100.00%    0.00s
SIMULATION RESULTS – TOOLS
• McPAT: power, energy and area:
$ ./tools/mcpat.py [-t static|dynamic|total|area]

              Power     Energy   Energy %
core-core     13.77 W   0.00 J    30.14%
core-ifetch    4.44 W   0.00 J     9.71%
core-int       1.22 W   0.00 J     2.67%
core-fp        1.47 W   0.00 J     3.21%
core-mem       6.30 W   0.00 J    13.78%
icache         5.55 W   0.00 J    12.16%
dcache         4.74 W   0.00 J    10.37%
l3             3.38 W   0.00 J     7.39%
dram           4.51 W   0.00 J     9.87%
other          0.32 W   0.00 J     0.70%
core          27.19 W   0.00 J    59.52%
total         45.68 W   0.01 J   100.00%
SIMULATION RESULTS – TOOLS
• Topology:
$ ./tools/gen_topology.py
SIMULATION RESULTS – TOOLS
• LLC miss latency breakdown:
$ ./tools/llcstack.py

Requests: 97485627
Total time: 85.16
  dram-bus:           0.48
  dram-device:       19.45
  dram-queue:         0.07
  inv-imbalance:      3.95
  noc-base:          31.94
  noc-queue:          2.11
  remote-cache-inv:   0.80
  remote-cache-wb:   10.15
  td-access:         16.22
SIMULATION RESULTS – TOOLS
• Per-function statistics (gprof2dot, KCacheGrind output):
$ run-sniper --profile

calls  time    t.self  icount  ipc   l2.mpki  name
1      50.06%   0.00%  50.05%  1.06  0.86     _start
1      50.06%   0.00%  50.05%  1.06  0.86     _start
1      50.06%   0.00%  50.05%  1.06  0.86     __libc_start_main
1      50.06%   0.00%  50.05%  1.06  0.86     main
1      49.11%   0.20%  49.95%  1.08  0.69     SlaveStart
1      44.16%   0.18%  46.65%  1.12  0.40     FFT1D
32     32.87%  17.91%  36.83%  1.19  0.05     FFT1DOnce
32     14.96%   1.85%  10.62%  0.75  0.08     Reverse
1024   13.12%  13.12%   8.57%  0.69  0.02     BitReverse
3       5.92%   5.92%   6.24%  1.12  1.31     Transpose
16      2.17%   2.17%   3.25%  1.58  0.18     TwiddleOneCol
1      49.94%   0.00%  49.95%  1.06  1.63     clone
1      49.94%   0.02%  49.95%  1.06  1.63     start_thread
1      49.91%   0.17%  49.94%  1.06  1.58     SlaveStart
1      43.89%   0.15%  46.60%  1.12  0.66     FFT1D
32     32.60%  17.79%  36.83%  1.20  0.06     FFT1DOnce
32     14.81%   1.79%  10.62%  0.76  0.09     Reverse
1024   13.03%  13.03%   8.57%  0.70  0.03     BitReverse
3       7.02%   7.02%   6.24%  0.94  3.34     Transpose
16      2.04%   2.04%   3.25%  1.68  0.18     TwiddleOneCol
VIZ: CYCLE STACKS IN TIME
$ run-sniper --viz
[Screenshot: interactive cycle-stack visualization: cycles (%) over time, with legend, options, markers and IPC visualization panes]
VIZ: MCPAT OUTPUT OVER TIME
3D VISUALIZATION: IPC VS. TIME VS. CORE
AUTOMATIC RESULTS PROCESSING
• Avoid manually copy/pasting simulation results, for reliability and scalability (how else will you auto-generate graphs based on 100s of runs?)
• sniper_lib.get_results() parses sim.cfg and sim.stats, and returns the configuration and statistics (roi-end minus roi-begin) for all cores:

~/sniper/tools$ python
> import sniper_lib
> results = sniper_lib.get_results(resultsdir = '..')
> print results
{'config': {'general/total_cores': '64', 'perf_model/core/frequency': '2.66', …},
 'results': {'performance_model.instruction_count': [123], 'performance_model.elapsed_time': [23000000], …}}
SIMULATION RESULTS
• Let's compute the IPC for core 0
• Core frequency is variable (DVFS), so the cycle count has to be computed
– Time is in femtoseconds, frequency in GHz

> instrs = results['results']['performance_model.instruction_count'][0]
> cycles = results['results']['performance_model.elapsed_time'][0] \
      * float(results['config']['perf_model/core/frequency']) \
      * 1e-6  # femtoseconds x GHz -> cycles
> print instrs / cycles
2.0
INTERACTING WITH SNIPER
[Diagram: the Sniper simulator takes an application (binary, input/command line) plus configuration; the application communicates through SimAPI, Python scripts hook into the simulator, and the output is statistics and visualization]
SIMAPI IMPLEMENTATION
• Magic instructions allow the application to talk to the simulator directly:

__asm__ __volatile__ (
    "xchg %%bx, %%bx\n"
    : "=a" (_res)                           /* output */
    : "a" (_cmd), "b" (_arg0), "c" (_arg1)  /* input */
    :                                       /* clobbered */
);

• Pin intercepts this instruction and passes control to the simulator
• Command and arguments are passed through the rax/rbx/rcx registers, the result in rax
APPLICATION SIMAPI
• Calling simulator API functions from your C program: #include <sim_api.h>
– SimInSimulator()
• Returns 1 when running inside Sniper, 0 when running natively
– SimGetProcId()
• Returns the processor number of the caller
– SimRoiStart() / SimRoiEnd()
• Start/end detailed mode (when using ./run-sniper --roi)
– SimSetFreqMHz(proc, mhz) / SimGetFreqMHz(proc)
• Set / get processor frequency (integer, in MHz)
– SimUser(cmd, arg)
• User-defined function
PYTHON SCRIPTING
• Scripts are run on simulator startup
– Register hooks: callbacks when certain events happen during the simulation
– See common/system/hooks_manager.h for all available hooks
• Use an existing script from sniper/scripts/*.py: ./run-sniper -s scriptname
• Or your own script: ./run-sniper -s myscriptname.py
• Use the sim package for convenience wrappers
PYTHON SCRIPTING
• Low-level script
• Execute "foo" at each barrier synchronization:

import sim_hooks
def foo(t):
    print 'The time is now', t
sim_hooks.register(sim_hooks.HOOK_PERIODIC, foo)
PYTHON SCRIPTING
• Higher-level script
• Execute a hook at each barrier synchronization:

import sim
class Class:
    def hook_periodic(self, t):
        print 'The time is now', t
sim.util.register(Class())
PYTHON SCRIPTING
• High-level script: execute a periodic callback every X ms
• Pass in the parameter using ./run-sniper -s myscript.py:X

import sim
class Class:
    def setup(self, args):
        sim.util.Every(long(args)*sim.util.Time.MS, self.periodic)
    def periodic(self, t, t_delta):
        print 'The time is now', t
        print 'Elapsed time since last call', t_delta
sim.util.register(Class())
PYTHON SCRIPTING
• Access configuration, statistics, DVFS
• Live periodic IPC trace:
– See scripts/ipctrace.py for a more complete example

class IPCTracer:
    def setup(self, args):
        sim.util.Every(1*sim.util.Time.US, self.periodic)
        self.instrs_prev = 0
    def periodic(self, t, t_delta):
        freq = sim.dvfs.get_frequency(0)
        cycles = t_delta * freq * 1e-9  # fs * MHz -> cycles
        instrs = long(sim.stats.get('performance_model', 0, 'instruction_count'))
        print 'IPC =', (instrs - self.instrs_prev) / cycles
        self.instrs_prev = instrs
PYTHON & MAGIC INSTRUCTIONS
• Communicate information between the application and a Python script
– E.g.: simulated hardware performance counters
• Application:

uint64_t ninstrs = SimUser(0xdeadbeef, SimGetProcId());

• Python script:

class PerfCtr:
    def setup(self):
        sim.util.register_command(0xdeadbeef, self.compute)
    def compute(self, arg):
        return sim.stats.get('performance_model', arg, 'instruction_count')
EXAMPLE: REGION-BASED STATISTICS
We want to know the L2 miss rate for a snippet of code
• Application: mark the code snippet with Sim[Named]Marker:

void myfunc(int a) {
    SimNamedMarker(1, "myfunc");
    // function body
    SimNamedMarker(2, "myfunc");
}

• Script: intercept the markers and write uniquely named statistics snapshots:

import sim
class Marker:
    def __init__(self):
        self.i = 0
    def hook_magic_marker(self, thread, core, a, b, s):
        sim.stats.write('marker-%s-%d-%d' % (s, a, self.i))
        if a == 2:
            self.i += 1
sim.util.register(Marker())
EXAMPLE: REGION-BASED STATISTICS
• Run the simulation:

$ ./run-sniper -cgainestown -smyscript.py -- ./myapp
$ ./tools/dumpstats.py --list

• Process the results:

import sniper_stats, sniper_lib
stats = sniper_stats.SniperStats(resultsdir = '.')
niters = max([ int(name.split('-')[-1])
               for name in stats.get_snapshots()
               if name.startswith('marker-myfunc-1-') ])
for i in range(0, niters+1):
    snapshot = sniper_lib.get_results(resultsdir = '.', partial =
        ('marker-myfunc-1-%d' % i, 'marker-myfunc-2-%d' % i))['results']
    accesses = sum(snapshot['L2.loads']) + sum(snapshot['L2.stores'])
    misses = sum(snapshot['L2.load-misses']) \
           + sum(snapshot['L2.store-misses'])
    print 'Iteration #%d: %.2f%%' % (i, 100. * misses / float(accesses))
THE SNIPER MULTI-CORE SIMULATOR: THREAD SCHEDULING
KENZO VAN CRAEYNEST, TREVOR E. CARLSON, WIM HEIRMAN AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
SCHEDULING THREADS ON HETEROGENEOUS MULTI-CORE PROCESSORS
• Multiple core types
– Different power/performance trade-offs
[Diagram: small power-efficient cores (S), intermediate cores (M), and big high-performance cores (B)]
CONFIGURING HETEROGENEOUS CONFIGURATIONS
• Traditional heterogeneous configuration options
– Comma-separated configuration parameters supported for most settings
– (Re)defining core parameters:
--perf_model/core/interval_timer/window_size=16,128,16,128
--perf_model/core/interval_timer/dispatch_width=2,4,2,4
…
• Gets complicated fast
• Error prone
A BETTER WAY: DEFINING HETEROGENEOUS CONFIGURATIONS
[Diagram: per-core-type configuration files big.cfg, medium.cfg and small.cfg map onto the big (B), intermediate (M) and small (S) cores]

./run-sniper -c big,small,big,small,medium -- /bin/ls
USING HETEROGENEITY INSIDE SNIPER
• Example: hardware scheduling
– Need to know what core types there are, and which are which
– Option 1: hardcoded
• core_id 1 is a big core, core_id 2 is a small core, …
• Hard to maintain, difficult to port for different experiments
– Option 2: identify the core type by examining core parameters
• Sim()->getCfg()->getBoolArray("perf_model/core/interval_timer/window_size", coreId)
• Not flexible
TAGS: A CONVENIENT WAY TO USE HETEROGENEITY INSIDE SNIPER
• Defining tags
– Option 1: manually assign tags to cores (or any other object)
• --tags/core/big=0,1,0,1
• --tags/core/small=1,0,1,0
– Option 2: use heterogeneous .cfg files
• Add "-c big.cfg,small,big,small" to run-sniper
• Automatically creates tags and a heterogeneous configuration
• Using tags
– Sim()->getTagsManager()->hasTag("core", coreId, "small");
– Sim()->getTagsManager()->hasTag("core", coreId, "big");
THREAD SCHEDULING
• Thread scheduling support added in Sniper 4.0
– Fully implemented inside the simulator; no application modification or re-linking needed
• Base scheduling infrastructure supports:
– An affinity mask for each thread (allowed cores)
• Initialized at thread start based on the scheduling policy
• Updated by sched_affinity system calls
• Updated by the scheduling policy, user code
– Pre-emptive round-robin scheduling
• Cores pick a new thread when the time quantum expires, or when the running thread goes to sleep
AVAILABLE SCHEDULERS
• Pinned (scheduler/type=pinned) [default]
– Threads are assigned to a single core, round-robin
– No migration
– Optional core mask for disabled cores: --scheduler/pinned/core_mask=1,0,1,1,0,0,0,1
• Roaming (scheduler/type=roaming)
– Initial affinity: all cores (with optional mask)
– Threads freely migrate to idle cores
• Static [deprecated]
– Fixed assignment, no migration
– No oversubscription (#threads <= #cores)
CUSTOM THREAD SCHEDULING
• When: periodically, or in response to hooks (thread create, stall, resume, exit)
• How, the hard way: moveThread()
– Immediately moves a thread (but watch out for safe pre-emption points, deadlocks, manual oversubscription, …)
• How, the easy way: manipulating affinity masks
– Threads will be moved as soon as it is safe; automatic oversubscription (round-robin, quantum-based)
BIGSMALL SCHEDULER
• Provided as a demo/starting point
– Uses the TagsManager to identify core types
• Recommended because of its flexibility
• Randomly reschedules threads between (multiple) big and (multiple) small cores
– By sorting cores based on random values
– Sets the affinity mask of threads to either "all small cores" or "all big cores"
– Round-robin scheduling within each core pool
MEMORY INTENSITY BASED SCHEDULER
• Threads that spend most of their time waiting for memory get scheduled on the small core type [Kumar et al., 2010]
– Common practice to achieve energy efficiency
• Starting from the BigSmall scheduler (see the sketch below):
– Use memory cycles from the core timing models
– Register them as a per-thread statistic
– Use this instead of the random value
• Other straightforward options:
– IPC [Becchi and Crowley, 2008], predicted slowdown [Van Craeynest et al., 2012]
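A sketch of the resulting policy, written here in Python for brevity (Sniper's schedulers are implemented inside the simulator in C++; the names below are illustrative):

# Memory-intensity-based assignment: the most memory-bound threads go to
# small cores, the rest fill the big cores.
def assign_threads(threads, mem_cycle_fraction, n_big):
    # Sort by fraction of time spent waiting for memory, least-bound first
    order = sorted(threads, key=lambda t: mem_cycle_fraction[t])
    return {'big': order[:n_big], 'small': order[n_big:]}

print(assign_threads([0, 1, 2, 3], {0: 0.1, 1: 0.7, 2: 0.3, 3: 0.05}, 2))
# -> {'big': [3, 0], 'small': [2, 1]}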
SCHEDULING: PER-THREAD STATISTICS
• Most statistics are collected per core (instruction count, cache misses, etc.)
• Support for per-thread statistics
– Linked to a per-core statistic
– Updated automatically when a thread moves
– Access a per-thread statistic:
Sim()->getThreadStatsManager()->getThreadStatistic(thread_id, ThreadStatsManager::INSTRUCTIONS)
– Or register your own statistic: SchedulerDynamic::ThreadStats::ThreadStats
THE SNIPER MULTI-CORE SIMULATOR: HANDS-ON DEMO
TREVOR E. CARLSON, WIM HEIRMAN, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND
SNIPER DEMO
• Downloading
• Compiling
• Running a demo application
• Evaluating performance
– CPI stacks
• Configuration and run-time modifications
– Configuration files
– Python scripting
– ROI markers and magic instructions
REFERENCES
• Sniper website
– http://snipersim.org/
• Download
– http://snipersim.org/w/Download
– http://snipersim.org/w/Download_Benchmarks
• Getting started
– http://snipersim.org/w/Getting_Started
• Questions?
– http://groups.google.com/group/snipersim
– http://snipersim.org/w/Frequently_Asked_Questions
THE SNIPER MULTI-CORE SIMULATOR
WIM HEIRMAN, TREVOR E. CARLSON, IBRAHIM HUR, KENZO VAN CRAEYNEST AND LIEVEN EECKHOUT
HTTP://WWW.SNIPERSIM.ORG
SUNDAY, SEPTEMBER 22ND, 2013 IISWC 2013, PORTLAND