erlangen regional computing center performance engineering … · erlangen regional computing...

21
ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional Computing Center & Dept. of Computer Science Friedrich-Alexander-University Erlangen-Nuremberg, Germany Performance Engineering for Stencil Updates on Modern Processors Partially funded by DFG Priority Programme1648

Upload: others

Post on 30-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

ERLANGEN REGIONAL

COMPUTING CENTER

Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein

Erlangen Regional Computing Center & Dept. of Computer Science

Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Performance Engineering for Stencil Updates

on Modern Processors

Partially funded by DFG Priority Programme1648

Page 2: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

2

• There are so many

• potential stencil structures (2D/3D, long-/short-range,…),

• text book optimizations (register / spatial / temporal blocking,…),

• parameters for optimization (blocking / unrolling factor,…),

• parameters for execution (OpenMP schedule, clock speed, #cores).

• Basic questions addressed by ECM model

• What is the bottleneck of my stencil implementation?

Choose appropriate optimization technique

• What is the next bottleneck I will hit and “how far” is it from the first?

This yields the performance potential of the optimization

• Impact of processor frequency and #cores used on performance

Motivation

Performance Engineering for Stencil Kernels

Page 3: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

3

Agenda

ECM model – basics

STREAM like kernel

Why does a single core not get full BW?

Case study 1: 5-pt 2D stencil

Case study 2: Long range 3D stencil

Case study 3: 3D stencil with DIVide

Summary

Performance Engineering for Stencil Kernels

Page 4: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

4

Execution-Cache-Memory (ECM) model

Assumptions

Single core execution time is determined by

In-core execution time &

Data delay in memory hierarchy

Socket scaling in socket is linear until

relevant shared bottleneck is hit

Insights

Single-core performance & socket scaling

Relevant bottlenecks & performance impact

Input

Similar as for roofline (e.g. STREAM)

Data transfer times of all memory levels

Registers

L1

L2 / L3

Arithmetic

Memory

”d

ata

de

lay”

”c

ore

ti

me”

4.3cy/CL (40GB/s @2.7GHz)

2cy/CL

2cy/CL

Performance Engineering for Stencil Kernels

Page 5: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

5

REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision

Analysis: Sandy Bridge core w/ AVX (unit of work: 1 Cache Line)

Example: Schönauer Vector Triad in L2 cache

1 LD/cy + 0.5 ST/cy

Registers

L1

L2

32 B/cy (2 cy/CL)

Machine characteristics:

Arithmetic:

1 ADD/cy+ 1 MULT/cy

Triad analysis (per CL):

Registers

L1

L2

6 cy/CL

10 cy/CL

Arithmetic:

AVX: 2 cy/CL

Roofline prediction: 16/10 F/cy

LD LD LD WA ST

Measurement: ≈ 16 /17 F/cy

Performance Engineering for Stencil Kernels

Timeline:

LD

LD

ST/2

LD

ST/2

LD

LD

ST/2

LD

ST/2

ADD

MULT

ADD

MULT

16 Flops/CL

(AVX)

Page 6: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

6

Data

in Model

Measure

ment

L1 6 cy 6.04 cy

L2 16 cy 17.2 cy

L3 26 cy 26.3 cy

MEM 50 cy 52.3 cy

Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) with AVX

LOAD/STORE CL

transfer

Write-allocate

CL transfer

Performance Engineering for Stencil Kernels

Full socket

stream BW

LD

6 c

y L2

-L1

1

0cy

L3

-L2

1

0cy

M

-L3

2

4cy

ST 4cy

Hard

ware

specific

ation

Time [cycles]

for single Cache Line Update

Page 7: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

7

Identify relevant bandwidth (shared) bottlenecks

L3 cache

Memory interface

Scale single-thread performance until first bottleneck is hit:

Multicore scaling in the ECM model

. . . Sandy Bridge: Scalable L3

Bottleneck (Proof): Memory

P(t)=min(t*P0,Proof), with Proof=min(Pmax,l *bS)

Performance Engineering for Stencil Kernels

Page 8: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

8

Saturation point (# cores) well predicted

Measurement: scaling not perfect

Single core performance accurately

predicted

ECM model works well for this

architecture and this benchmark!

ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)

Performance Engineering for Stencil Kernels

Intel Sandy Bridge EP

Only „measurement“: full socket

STREAM BW

Page 9: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

9

Case study 1

ECM modelling for stencil structures:

2D Jacobi (double precision) with SSE2 on SNB

Performance Engineering for Stencil Kernels

Page 10: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

10

Case study 1: 2D Jacobi in DP with SSE2 on SNB

Instruction count

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

4-way unrolling

8 LUP / iteration

Performance Engineering for Stencil Kernels

Page 11: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

11

Case study 1: 2D Jacobi in DP with SSE2 on SNB

Code characteristics

(SSE instructions per CL)

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

Machine characteristics

(SSE instructions per cycle)

- 2 LOAD || (1 LOAD + 1 STORE)

- 1 ADD

- 1 MUL

- Clock speed: 3.5 GHz

LD LD LD LD 2

LD

2

LD

2

LD

2

LD

L

ST ST ST ST

+ + + + + + + + + + + +

* * * *

Core execution

12 cycles

Bottleneck

ADD

Port utilization in cycles

CP

U P

ort

s

Performance Engineering for Stencil Kernels

Page 12: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

12

Case study 1: 2D Jacobi in DP with SSE2 on SNB

Situation 1: Data set fits into L1 cache

ECM prediction:

(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s

Measurement: 2.2 GLUP/s

Situation 2: Data set fits into L2 cache

3 transfer streams from L2 to L1

(LD T0 + LD/ST T1)

ECM prediction:

(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s

Measurement: 1.9 GLUP/s What‘s

wrong??

12 cy

6 cy

Performance Engineering for Stencil Kernels

Page 13: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

13

Case study 1: 2D Jacobi in DP with SSE2 on SNB

LD LD LD LD 2LD 2LD 2LD 2LD L

ST ST ST ST

+ + + + + + + + + + + +

* * * * core execution: 12 cycles

ECM prediction:

(8 LUP / (8.5+6) cy) * 3.5 GHz =1.9 GLUP/s

Measurement: 1.9 GLUP/s

Register L1: 8.5 cycles

8.5 cy

6 cy

“If the model fails, we learn something”

Performance Engineering for Stencil Kernels

L1 L2: 6 cycles

L1 / L2 transfer

overlaps with arithmetic

No concurrent data transfer between memory levels

Data transfer may overlap with other in-core instructions

Page 14: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

14

Case study 2: Long range 3D stencil

ECM modelling for stencil structures: Long range 3D 25-pt stencil

Background:

Three-dimensional Seismic simulations in oil industry

Finite-Difference-Time-Domain (FDTD) discretization

Stencil operator is

Isotropic (symmetric),

With constant coefficients,

2nd order in time

8th order in space

Performance Engineering for Stencil Kernels

Innermost loop shown only

Collaboration with

D. Keyes & T. Malas

(KAUST)

Page 15: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

15

Case study 2: Long range 3D stencil

Data delay below L1 assuming

spatial blocking easy to calculate: 65 cy/CL

Data transfer L1 – register for V depends

on compiler

How to calculate / estimate core

execution time and actual load / stores?

Use Intel Architecture Code Analyzer (IACA))

Performance Engineering for Stencil Kernels

Core execution time

LD cycles: L1 Reg.

Core bottleneck

V?

Page 16: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

16

Case study 2: Long range 3D stencil

Core execution (Non-LD/ST cycles) Data delay

L1-R

EG (

Load

) 6

4cy

L2

-L1

2

4cy

L3

-L2

2

4cy

M

-L3

1

7cy

12

9c

y

FrontEnd stalls

overlap:

2*(34.1-32) =4.2cy

AD

D

52

cy

MU

LT

39

cy

Re

g-R

eg

tran

sfe

rs

48

cy

Stores 4cy

Single-core performance (ECM)

2.7GHz / (129cy / 16LUP) = 335 MLUP/s

Measurement: 320 MLUP/s

IACA throughput

34.1cy/8LUP

1 CL 16 LUP (sp)

Performance Engineering for Stencil Kernels

Page 17: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

17

Case study 3: 3D stencil with DIVide

ECM modelling for stencil structures: Impact of complex arithmetic

Background:

Three-dimensional Seismic simulations

Unelastic Wave Propagation Code

(http://hpgeoc.sdsc.edu/personnel.html)

Finite-Difference-Time-Domain (FDTD) discretization

Performance Engineering for Stencil Kernels

Collaboration with

O. Schenk

(USI Lugano)

DIV operation

Page 18: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

18

Case study 3: 3D stencil with DIV

Data delay

L1-R

EG

38

cy

L2-L

1

20

cy

L3-L

2

20

cy

MEM

-L3

2

6cy

Double Precision

Core execution

DIV

ide

r

84

cy

10

4cy

Single Precision

Core execution

FrontEnd

stalls

Single-core performance (SP)

ECM (104cy ): 415 MLUP/s

Measurement: 406 MLUP/s

Single-core performance (DP)

ECM (104cy ): 208 MLUP/s

Measurement: 179 MLUP/s

Compiler

avoids

DIV instruction

DIV becomes

bottleneck for

temporal

blocking!

Performance Engineering for Stencil Kernels

Page 19: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

19

Conclusions

ECM allows good estimate/prediction of single core performance

and socket scalability for “streaming kernels”

ECM reveals computational/hardware bottlenecks and quantifies

their impact on performance

ECM allows to quantify impact of clock speed or vectorization

ECM could be integrated into autotuning frameworks or DSLs

Work is funded by

Performance Engineering for Stencil Kernels

DFG Priority Programme1648 Bavarian Network for HPC

Page 20: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

Performance Engineering for Stencil Kernels

High Performance Computing

is Computing at a

Bottleneck!*

*Georg Hager

Page 21: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional

21

References – ECM model basics & applications

1. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power

properties of modern multicore chips via simple machine models. (accepted)

Concurrency and Computation: Practice and Experience,

DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908

2. M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-

node analysis of energy-optimized lattice-Boltzmann CFD simulations. Submitted.

Preprint: arXiv:1304.7664

3. J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics

on modern multicore processors: Best practices for performance engineering.

Proc. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par

2012. Euro-Par 2012: Parallel Processing Workshops, LNCS 7640, 451-460

(2013), Springer, ISBN 978-3-642-36948-3. DOI: 10.1007/978-3-642-36949-0_50.

Preprint: arXiv:1206.3738

Performance Engineering for Stencil Kernels