Multicore Scaling: The ECM Model

Single-core performance prediction

The saturation point

Three stencil code examples

Assumptions and shortcomings of the roofline model

Assumes one of two bottlenecks

1. In-core execution

2. Bandwidth of a single hierarchy level

Latency effects are not modeled; pure data streaming is assumed

In-core execution is sometimes hard to model

Saturation effects in multicore chips are not explained

ECM model gives more insight

A(:)=B(:)+C(:)*D(:)

Roofline predicts full socket BW

(c) RRZE 2014 ECM Model 2

The Execution-Cache-Memory (ECM) model

G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience, DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908

http://dx.doi.org/10.1002/cpe.3180



http://arxiv.org/abs/1208.2908

ECM Model

ECM = “Execution-Cache-Memory”

Assumptions:

Single-core execution time is composed of

1. In-core execution

2. Data transfers in the memory hierarchy

Data transfers may or may not overlap with each other or with in-core execution

Scaling is linear until the relevant bottleneck is reached

Input:

Same as for Roofline

+ data transfer times in hierarchy


Example: Schönauer Vector Triad in L2 cache

REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision

Analysis for a Sandy Bridge core with AVX (unit of work: 1 cache line, CL)

Machine characteristics:

Arithmetic: 1 ADD/cy + 1 MULT/cy

Registers ↔ L1: 1 LD/cy + 0.5 ST/cy

L1 ↔ L2: 32 B/cy (2 cy/CL)

Triad analysis (per CL):

Arithmetic (AVX): 2 cy/CL

Registers ↔ L1: 6 cy/CL

L1 ↔ L2: 10 cy/CL (LD, LD, LD, WA, ST)

Work: 16 F/CL (AVX)

[Figure: timeline of the ADD/MULT, LD and ST/2 slots in the core and of the cache line transfers between L1 and L2]

Roofline prediction: 16/10 F/cy

Measurement: 16 F / ≈17 cy

Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) on a Sandy Bridge Core with AVX


[Figure legend: CL transfer; write-allocate CL transfer]

Full vs. partial vs. no overlap


Results suggest no overlap!
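The comparison between the overlap assumptions can be sketched in a few lines of Python. The inputs (2 cy arithmetic, 6 cy LD/ST, 10 cy L1-L2 transfers, 16 flops per cache line) are the triad numbers from the analysis above. The function name `ecm_cycles`, its `overlap` switch, and the rule used to combine the contributions are assumptions of this sketch, not something the slides spell out.

```python
def ecm_cycles(t_arith, t_ldst, t_transfers, overlap):
    """Predicted cycles per unit of work (one cache line here).

    t_arith     -- in-core arithmetic cycles
    t_ldst      -- in-core LD/ST cycles between registers and L1
    t_transfers -- transfer cycles between adjacent cache/memory levels
    overlap     -- "none" or "full" (hypothetical switch for this sketch)
    """
    if overlap == "none":
        # Assumption: transfers add up on top of the LD/ST time,
        # while arithmetic can hide behind that sum.
        return max(t_arith, t_ldst + sum(t_transfers))
    if overlap == "full":
        # Everything runs concurrently; the slowest contribution wins.
        return max([t_arith, t_ldst] + list(t_transfers))
    raise ValueError("overlap must be 'none' or 'full'")


# Vector triad in L2 on a Sandy Bridge core with AVX, per cache line
ARITH_CY, LDST_CY, L1_L2_CY = 2, 6, 10
FLOPS_PER_CL = 16

no_overlap = ecm_cycles(ARITH_CY, LDST_CY, [L1_L2_CY], "none")
full_overlap = ecm_cycles(ARITH_CY, LDST_CY, [L1_L2_CY], "full")

print(f"no overlap:   {FLOPS_PER_CL} F / {no_overlap} cy")
print(f"full overlap: {FLOPS_PER_CL} F / {full_overlap} cy")
```

Under these assumptions the no-overlap variant gives 16 cy per cache line, close to the measured ≈17 cy, while the full-overlap variant reproduces the Roofline figure of 10 cy.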

Multicore scaling in the ECM model

Identify relevant bandwidth bottlenecks

L3 cache

Memory interface

Scale single-thread performance until first bottleneck is hit:


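$P(t) = \min(t P_0, P_\mathrm{roof})$, with $P_\mathrm{roof} = \min(P_\mathrm{max}, I \cdot b_S)$

Here $t$ is the number of cores, $P_0$ the single-core performance, $P_\mathrm{max}$ the peak in-core performance, $I$ the computational intensity, and $b_S$ the saturated bandwidth of the bottleneck.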
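A minimal sketch of this scaling law in Python follows. The function names and the input values are hypothetical and chosen only for illustration; the saturation point is taken here as the smallest core count at which the linear term reaches the roof.

```python
import math


def socket_performance(cores, p_single, p_roof):
    """P(t) = min(t * P0, P_roof): linear scaling until the bottleneck is hit."""
    return min(cores * p_single, p_roof)


def saturation_point(p_single, p_roof):
    """Smallest number of cores t for which t * P0 >= P_roof."""
    return math.ceil(p_roof / p_single)


# Hypothetical inputs in arbitrary performance units
P0, P_ROOF = 1.0, 3.5

for t in range(1, 7):
    print(t, socket_performance(t, P0, P_ROOF))

print("saturation at", saturation_point(P0, P_ROOF), "cores")
```

With these inputs the performance grows linearly for three cores and is capped at the roof from the fourth core on.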

Example: Scalable L3 on Sandy Bridge

ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:) on a Sandy Bridge socket (no-overlap assumption)

Model: Scales until saturation sets in

Saturation point (# cores) well predicted

Measurement: scaling not perfect

Caveat: This is specific for this architecture and this benchmark!

Check: Use “overlappable” kernel code


ECM prediction vs. measurements for A(:)=B(:)+C(:)/D(:) on a Sandy Bridge socket (full overlap assumption)


In-core execution is dominated by divide operation (44 cycles with AVX, 22 scalar)

Almost perfect agreement with ECM model

General observation:

If the L1 cache is 100% occupied by LD/ST, there is no overlap throughout the hierarchy

If there is “slack” at the L1, there is overlap in the hierarchy
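The general observation can be turned into a small rule-of-thumb check. This is only a sketch: the function names are hypothetical, and the per-cache-line numbers for the divide kernel (2 AVX divides at 44 cy each, and the same 6 cy of LD/ST as the triad) are assumptions derived from the figures quoted above rather than values stated on the slides.

```python
def l1_occupancy(ldst_cycles, core_cycles):
    """Fraction of the in-core time during which the L1 is busy with LD/ST."""
    return ldst_cycles / core_cycles


def expected_overlap(ldst_cycles, core_cycles):
    """Rule of thumb: a fully occupied L1 means no overlap, slack means overlap."""
    if l1_occupancy(ldst_cycles, core_cycles) >= 1.0:
        return "no overlap"
    return "overlap"


# Triad: the core time per cache line equals the 6 cy of LD/ST.
triad = expected_overlap(ldst_cycles=6, core_cycles=6)

# Divide kernel (assumed): 2 AVX divides per cache line at 44 cy each.
divide = expected_overlap(ldst_cycles=6, core_cycles=2 * 44)

print("triad: ", triad)
print("divide:", divide)
```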

Performance Modeling of Stencil Codes

Applying the ECM model to stencil updates:

- 2D Jacobi smoother

- Long-range stencil

(H. Stengel, RRZE)

Example 1: 2D Jacobi in DP with SSE2 on Sandy Bridge




4-way unrolling: 8 LUP per iteration

Code characteristics (SSE instructions per iteration):

- 13 LOAD

- 4 STORE

- 12 ADD

- 4 MUL

Processor characteristics (SSE instructions per cycle):

- 2 LOAD || (1 LOAD + 1 STORE)

- 1 ADD

- 1 MUL

Port schedule for one iteration (one entry per cycle):

LOAD: LD LD LD LD 2LD 2LD 2LD 2LD LD

STORE: ST ST ST ST

ADD: + + + + + + + + + + + +

MUL: * * * *

Core execution: 12 cy
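The 12 cy core time follows from the instruction counts and the per-cycle issue limits listed above. A minimal Python sketch of that bookkeeping is shown below; the function names are hypothetical, and the model simply assumes that each resource (LD/ST, ADD, MUL) works independently and that the busiest one sets the core time.

```python
def ldst_cycles(loads, stores):
    """Cycles on the LD/ST ports: 2 LOAD or (1 LOAD + 1 STORE) per cycle."""
    paired = min(loads, stores)      # cycles that issue 1 LOAD + 1 STORE
    rest_loads = loads - paired      # remaining loads go two per cycle
    rest_stores = stores - paired    # remaining stores go one per cycle
    return paired + rest_loads / 2 + rest_stores


def core_cycles(loads, stores, adds, muls):
    """In-core time: busiest of LD/ST, ADD (1 per cycle) and MUL (1 per cycle)."""
    return max(ldst_cycles(loads, stores), adds, muls)


# 2D Jacobi, SSE instructions per iteration
jacobi_ldst = ldst_cycles(13, 4)
jacobi_core = core_cycles(13, 4, 12, 4)

print("LD/ST cycles:", jacobi_ldst)
print("core cycles: ", jacobi_core)
```

For the Jacobi counts the LD/ST ports are busy for 8.5 cy, so the 12 ADD instructions determine the core time.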



Situation 1: Data set fits into L1 cache

ECM prediction:

(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s

Measurement: 2.2 GLUP/s

Situation 2: Data set fits into L2 cache (not into L1)

3 additional transfer streams from L2 to L1 (data delay)

ECM prediction:

(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s


Overlap?

[Figure: timeline from t0 to t1 showing 12 cy core execution and 6 cy data delay, including the RFO transfer]

ECM prediction w/ overlap:

(8 LUP / (8.5+6) cy) * 3.5 GHz = 1.9 GLUP/s


L1 is “single ported”: no overlap during LD/ST

data delay: 6 cycles
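The three Jacobi predictions can be reproduced with a short calculation. The sketch below only restates the arithmetic from the slides (8 LUP per iteration, 3.5 GHz, 12 cy core time, 8.5 cy LD/ST time, 6 cy data delay); the helper name `glups` is hypothetical.

```python
CLOCK_GHZ = 3.5
LUPS_PER_ITERATION = 8

CORE_CY = 12.0        # core execution, set by the 12 ADD instructions
LDST_CY = 8.5         # cycles during which the L1 is busy with LD/ST
DATA_DELAY_CY = 6.0   # L2-to-L1 transfers


def glups(cycles_per_iteration):
    """Convert cycles per iteration into GLUP/s."""
    return LUPS_PER_ITERATION / cycles_per_iteration * CLOCK_GHZ


in_l1 = glups(CORE_CY)
in_l2_no_overlap = glups(CORE_CY + DATA_DELAY_CY)
in_l2_with_overlap = glups(LDST_CY + DATA_DELAY_CY)

print(f"L1:               {in_l1:.2f} GLUP/s")
print(f"L2, no overlap:   {in_l2_no_overlap:.2f} GLUP/s")
print(f"L2, with overlap: {in_l2_with_overlap:.2f} GLUP/s")
```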


“If the model fails, we learn something”

Example 2: 3D long-range stencil in single precision with AVX on Sandy Bridge



4 neighbors per direction

Operations per update

27 LOAD (25 V, 1 ROC, 1 U)

1 STORE (U)

26 ADD

15 MUL

Core time: determined with IACA


Interlude: Intel Architecture Code Analyzer (IACA)

Performs architecture-specific code analysis

Prerequisite: Mark start and end of dominant work loop

In high-level code (documented)

In assembly code (see iacaMarks.h)

Does not influence code optimization (e.g. vectorization)

Assembly loop might perform multiple updates per iteration (unrolling, SIMD)

Important reports (throughput mode):

Block throughput: runtime of one loop iteration (the core time)

Throughput bottleneck: limiting resource for code execution

Port pressure: dominant pipeline port


IACA example output


AVX vectorization, no unrolling: One iteration updates 8 SP (float) elements



Core time (IACA)

34.25 cy / 8 LUP (SP)

69 cy / CL

LOAD dominated
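The step from the IACA block throughput to the per-cache-line core time is a unit conversion. The sketch below assumes a 64-byte cache line and one 4-byte single-precision result per lattice site update; both values are assumptions of this example, although they are consistent with the 16 LUP per cache line used on the following slides.

```python
import math

CACHE_LINE_BYTES = 64   # assumption: Sandy Bridge cache line size
SP_BYTES = 4            # one single-precision result per LUP

IACA_CYCLES_PER_ITERATION = 34.25
LUPS_PER_ITERATION = 8  # AVX, no unrolling

lups_per_cl = CACHE_LINE_BYTES // SP_BYTES
core_cycles_per_cl = IACA_CYCLES_PER_ITERATION * lups_per_cl / LUPS_PER_ITERATION

print("LUP per cache line:", lups_per_cl)
print("core time per CL:  ", core_cycles_per_cl, "cy, rounded up to",
      math.ceil(core_cycles_per_cl), "cy")
```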



Data delay

Spatial blocking for minimal traffic between L3 cache and memory

8 additional streams for smaller caches

Single-core performance

2.7GHz / (134cy / 16LUP) = 322MLUP/s

Measurement: 320MLUP/s

LOAD dominated, so no overlap is expected

Socket scaling (8 cores)

8*322MLUP/s = 2576MLUP/s

Limit: 40GB/s / 16B/LUP = 2500MLUP/s

Saturation at 8 cores

Measurement: ~2200MLUP/s (88% max.)
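The single-core and socket numbers for the long-range stencil can be checked with the following sketch. All inputs are taken from the slides (2.7 GHz, 134 cy per 16 LUP, 40 GB/s, 16 B/LUP, 8 cores, measured ~2200 MLUP/s); the variable names are hypothetical.

```python
import math

CLOCK_HZ = 2.7e9
CYCLES_PER_CL = 134
LUPS_PER_CL = 16
MEMORY_BANDWIDTH = 40e9   # bytes per second
BYTES_PER_LUP = 16
CORES = 8
MEASURED_MLUPS = 2200.0

# Single-core ECM prediction in MLUP/s
single_core = CLOCK_HZ * LUPS_PER_CL / CYCLES_PER_CL / 1e6

# Memory bandwidth limit in MLUP/s
bandwidth_limit = MEMORY_BANDWIDTH / BYTES_PER_LUP / 1e6

# Linear scaling capped by the bandwidth limit
socket = [min(t * single_core, bandwidth_limit) for t in range(1, CORES + 1)]
saturation_cores = math.ceil(bandwidth_limit / single_core)

print(f"single core:      {single_core:.0f} MLUP/s")
print(f"bandwidth limit:  {bandwidth_limit:.0f} MLUP/s")
print(f"saturation at:    {saturation_cores} cores")
print(f"measured / limit: {MEASURED_MLUPS / bandwidth_limit:.0%}")
```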

Optimization possibilities

Data transfer well optimized (blocking)

Reduce core time (LD); limit: ADD (2*26cy)

Possible speedup: 69cy-52cy=17cy (~15%)
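The quoted speedup can be reproduced as follows. The sketch assumes that the 17 cy saved in the core time carry over one-to-one to the 134 cy total per cache line (no overlap), which is how the ~15% figure is read here.

```python
ADDS_PER_ITERATION = 26   # AVX ADD instructions per 8 LUP
ITERATIONS_PER_CL = 2     # 16 LUP per cache line
CORE_CY_NOW = 69
TOTAL_CY_NOW = 134

# Lower bound for the core time if only the ADD port limits execution
add_limit_cy = ADDS_PER_ITERATION * ITERATIONS_PER_CL
saved_cy = CORE_CY_NOW - add_limit_cy

# Assumption: saved core cycles shorten the total time by the same amount
speedup = TOTAL_CY_NOW / (TOTAL_CY_NOW - saved_cy)

print("ADD limit:", add_limit_cy, "cy")
print("saved:    ", saved_cy, "cy")
print(f"speedup:   {speedup:.3f}")
```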


[Figure: ECM diagram with per-cycle transfer widths and per-cache-line update times: 69 cy, 24 cy, 24 cy, 24 cy, 17 cy]

Socket scaling

[Figure: socket scaling plot with the memory bandwidth limit marked]

Example 3: 3D red-black tetrahedral smoother in DP with AVX on Sandy Bridge

“Terra-Neo” DFG project


Stencil inner loop code and pattern

Notes

Stride-two access pattern on the central line (red-black pattern)

15-point stencil

Compiler “vectorizes” inner loop, but loads and stores are scalar


ECM for Terra-Neo kernel

measured manually, optimization target!


8 LUP (1 CL):

layer condition:

Socket scaling of Terra-Neo kernel

[Figure: socket scaling plot with curves for no overlap, max. overlap, and “some” overlap]

What we have learned

Code does not saturate the relevant bottleneck

We have to improve single-core performance

Core analysis: Bottleneck is register-register moves

Reason: Vectorized arithmetic, but scalar loads

Optimization goal: Truly vectorized kernel with SIMD loads

Options:

- SIMD-friendly data layout (red-black separation) → # iterations?

- Different smoother algorithm (e.g. Jacobi) on central line → convergence?
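As a rough illustration of the first option, red-black separation stores the two colors of a line in separate contiguous arrays, so that points of one color can be read with unit stride instead of stride two. The sketch below uses plain Python lists and hypothetical helper names; it shows only the layout idea, not the Terra-Neo data structures.

```python
def split_red_black(line):
    """Separate an interleaved line into contiguous red (even) and black (odd) points."""
    return line[0::2], line[1::2]


def merge_red_black(red, black):
    """Inverse of split_red_black: interleave the two colors again."""
    merged = [None] * (len(red) + len(black))
    merged[0::2] = red
    merged[1::2] = black
    return merged


line = [float(i) for i in range(8)]
red, black = split_red_black(line)

print("red:  ", red)
print("black:", black)
print("round trip ok:", merge_red_black(red, black) == line)
```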


ECM model: Conclusions & outlook

Saturation effects are ubiquitous; understanding them gives us the opportunity to

- Find out about optimization opportunities

- Save energy by letting cores idle (see power model later on)

- Put idle cores to better use (see spMVM case study)

Simple models work best. Do not try to complicate things unless it is really necessary!

Possible extensions to the ECM model

Accommodate latency effects

Model simple “architectural hazards”

