erlangen regional computing center performance engineering … · erlangen regional computing...
TRANSCRIPT
ERLANGEN REGIONAL
COMPUTING CENTER
Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein
Erlangen Regional Computing Center & Dept. of Computer Science
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
Performance Engineering for Stencil Updates
on Modern Processors
Partially funded by DFG Priority Programme1648
2
• There are so many
• potential stencil structures (2D/3D, long-/short-range,…),
• text book optimizations (register / spatial / temporal blocking,…),
• parameters for optimization (blocking / unrolling factor,…),
• parameters for execution (OpenMP schedule, clock speed, #cores).
• Basic questions addressed by ECM model
• What is the bottleneck of my stencil implementation?
Choose appropriate optimization technique
• What is the next bottleneck I will hit and “how far” is it from the first?
This yields the performance potential of the optimization
• Impact of processor frequency and #cores used on performance
Motivation
Performance Engineering for Stencil Kernels
3
Agenda
ECM model – basics
STREAM like kernel
Why does a single core not get full BW?
Case study 1: 5-pt 2D stencil
Case study 2: Long range 3D stencil
Case study 3: 3D stencil with DIVide
Summary
Performance Engineering for Stencil Kernels
4
Execution-Cache-Memory (ECM) model
Assumptions
Single core execution time is determined by
In-core execution time &
Data delay in memory hierarchy
Socket scaling in socket is linear until
relevant shared bottleneck is hit
Insights
Single-core performance & socket scaling
Relevant bottlenecks & performance impact
Input
Similar as for roofline (e.g. STREAM)
Data transfer times of all memory levels
Registers
L1
L2 / L3
Arithmetic
Memory
”d
ata
de
lay”
”c
ore
ti
me”
4.3cy/CL (40GB/s @2.7GHz)
2cy/CL
2cy/CL
Performance Engineering for Stencil Kernels
5
REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision
Analysis: Sandy Bridge core w/ AVX (unit of work: 1 Cache Line)
Example: Schönauer Vector Triad in L2 cache
1 LD/cy + 0.5 ST/cy
Registers
L1
L2
32 B/cy (2 cy/CL)
Machine characteristics:
Arithmetic:
1 ADD/cy+ 1 MULT/cy
Triad analysis (per CL):
Registers
L1
L2
6 cy/CL
10 cy/CL
Arithmetic:
AVX: 2 cy/CL
Roofline prediction: 16/10 F/cy
LD LD LD WA ST
Measurement: ≈ 16 /17 F/cy
Performance Engineering for Stencil Kernels
Timeline:
LD
LD
ST/2
LD
ST/2
LD
LD
ST/2
LD
ST/2
ADD
MULT
ADD
MULT
16 Flops/CL
(AVX)
6
Data
in Model
Measure
ment
L1 6 cy 6.04 cy
L2 16 cy 17.2 cy
L3 26 cy 26.3 cy
MEM 50 cy 52.3 cy
Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) with AVX
LOAD/STORE CL
transfer
Write-allocate
CL transfer
Performance Engineering for Stencil Kernels
Full socket
stream BW
LD
6 c
y L2
-L1
1
0cy
L3
-L2
1
0cy
M
-L3
2
4cy
ST 4cy
Hard
ware
specific
ation
Time [cycles]
for single Cache Line Update
7
Identify relevant bandwidth (shared) bottlenecks
L3 cache
Memory interface
Scale single-thread performance until first bottleneck is hit:
Multicore scaling in the ECM model
. . . Sandy Bridge: Scalable L3
Bottleneck (Proof): Memory
P(t)=min(t*P0,Proof), with Proof=min(Pmax,l *bS)
Performance Engineering for Stencil Kernels
8
Saturation point (# cores) well predicted
Measurement: scaling not perfect
Single core performance accurately
predicted
ECM model works well for this
architecture and this benchmark!
ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)
Performance Engineering for Stencil Kernels
Intel Sandy Bridge EP
Only „measurement“: full socket
STREAM BW
9
Case study 1
ECM modelling for stencil structures:
2D Jacobi (double precision) with SSE2 on SNB
Performance Engineering for Stencil Kernels
10
Case study 1: 2D Jacobi in DP with SSE2 on SNB
Instruction count
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
4-way unrolling
8 LUP / iteration
Performance Engineering for Stencil Kernels
11
Case study 1: 2D Jacobi in DP with SSE2 on SNB
Code characteristics
(SSE instructions per CL)
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
Machine characteristics
(SSE instructions per cycle)
- 2 LOAD || (1 LOAD + 1 STORE)
- 1 ADD
- 1 MUL
- Clock speed: 3.5 GHz
LD LD LD LD 2
LD
2
LD
2
LD
2
LD
L
ST ST ST ST
+ + + + + + + + + + + +
* * * *
Core execution
12 cycles
Bottleneck
ADD
Port utilization in cycles
CP
U P
ort
s
Performance Engineering for Stencil Kernels
12
Case study 1: 2D Jacobi in DP with SSE2 on SNB
Situation 1: Data set fits into L1 cache
ECM prediction:
(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s
Measurement: 2.2 GLUP/s
Situation 2: Data set fits into L2 cache
3 transfer streams from L2 to L1
(LD T0 + LD/ST T1)
ECM prediction:
(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s
Measurement: 1.9 GLUP/s What‘s
wrong??
12 cy
6 cy
Performance Engineering for Stencil Kernels
13
Case study 1: 2D Jacobi in DP with SSE2 on SNB
LD LD LD LD 2LD 2LD 2LD 2LD L
ST ST ST ST
+ + + + + + + + + + + +
* * * * core execution: 12 cycles
ECM prediction:
(8 LUP / (8.5+6) cy) * 3.5 GHz =1.9 GLUP/s
Measurement: 1.9 GLUP/s
Register L1: 8.5 cycles
8.5 cy
6 cy
“If the model fails, we learn something”
Performance Engineering for Stencil Kernels
L1 L2: 6 cycles
L1 / L2 transfer
overlaps with arithmetic
No concurrent data transfer between memory levels
Data transfer may overlap with other in-core instructions
14
Case study 2: Long range 3D stencil
ECM modelling for stencil structures: Long range 3D 25-pt stencil
Background:
Three-dimensional Seismic simulations in oil industry
Finite-Difference-Time-Domain (FDTD) discretization
Stencil operator is
Isotropic (symmetric),
With constant coefficients,
2nd order in time
8th order in space
Performance Engineering for Stencil Kernels
Innermost loop shown only
Collaboration with
D. Keyes & T. Malas
(KAUST)
15
Case study 2: Long range 3D stencil
Data delay below L1 assuming
spatial blocking easy to calculate: 65 cy/CL
Data transfer L1 – register for V depends
on compiler
How to calculate / estimate core
execution time and actual load / stores?
Use Intel Architecture Code Analyzer (IACA))
Performance Engineering for Stencil Kernels
Core execution time
LD cycles: L1 Reg.
Core bottleneck
V?
16
Case study 2: Long range 3D stencil
Core execution (Non-LD/ST cycles) Data delay
L1-R
EG (
Load
) 6
4cy
L2
-L1
2
4cy
L3
-L2
2
4cy
M
-L3
1
7cy
12
9c
y
FrontEnd stalls
overlap:
2*(34.1-32) =4.2cy
AD
D
52
cy
MU
LT
39
cy
Re
g-R
eg
tran
sfe
rs
48
cy
Stores 4cy
Single-core performance (ECM)
2.7GHz / (129cy / 16LUP) = 335 MLUP/s
Measurement: 320 MLUP/s
IACA throughput
34.1cy/8LUP
1 CL 16 LUP (sp)
Performance Engineering for Stencil Kernels
17
Case study 3: 3D stencil with DIVide
ECM modelling for stencil structures: Impact of complex arithmetic
Background:
Three-dimensional Seismic simulations
Unelastic Wave Propagation Code
(http://hpgeoc.sdsc.edu/personnel.html)
Finite-Difference-Time-Domain (FDTD) discretization
Performance Engineering for Stencil Kernels
Collaboration with
O. Schenk
(USI Lugano)
DIV operation
18
Case study 3: 3D stencil with DIV
Data delay
L1-R
EG
38
cy
L2-L
1
20
cy
L3-L
2
20
cy
MEM
-L3
2
6cy
Double Precision
Core execution
DIV
ide
r
84
cy
10
4cy
Single Precision
Core execution
FrontEnd
stalls
Single-core performance (SP)
ECM (104cy ): 415 MLUP/s
Measurement: 406 MLUP/s
Single-core performance (DP)
ECM (104cy ): 208 MLUP/s
Measurement: 179 MLUP/s
Compiler
avoids
DIV instruction
DIV becomes
bottleneck for
temporal
blocking!
Performance Engineering for Stencil Kernels
19
Conclusions
ECM allows good estimate/prediction of single core performance
and socket scalability for “streaming kernels”
ECM reveals computational/hardware bottlenecks and quantifies
their impact on performance
ECM allows to quantify impact of clock speed or vectorization
ECM could be integrated into autotuning frameworks or DSLs
Work is funded by
Performance Engineering for Stencil Kernels
DFG Priority Programme1648 Bavarian Network for HPC
Performance Engineering for Stencil Kernels
High Performance Computing
is Computing at a
Bottleneck!*
*Georg Hager
21
References – ECM model basics & applications
1. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power
properties of modern multicore chips via simple machine models. (accepted)
Concurrency and Computation: Practice and Experience,
DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908
2. M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-
node analysis of energy-optimized lattice-Boltzmann CFD simulations. Submitted.
Preprint: arXiv:1304.7664
3. J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics
on modern multicore processors: Best practices for performance engineering.
Proc. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par
2012. Euro-Par 2012: Parallel Processing Workshops, LNCS 7640, 451-460
(2013), Springer, ISBN 978-3-642-36948-3. DOI: 10.1007/978-3-642-36949-0_50.
Preprint: arXiv:1206.3738
Performance Engineering for Stencil Kernels