carnegie mellon generation of simd dense linear algebra kernels with analytical models generation of...

Carnegie MellonCarnegie Mellon

Generation of SIMDDense Linear Algebra Kernels with Analytical Models

Richard M. Veras, Tze Meng Low & Franz FranchettiCarnegie Mellon University Tyler Smith & Robert van de GeijnUniversity of Texas at Austin


Introduction: What is the problem?4th loop around micro-kernel

3rd loop around micro-kernel

mR

mR

1

+=

+=

+=

+=

+=

kC

kC

mC mC

1

nR

kC

nR

Pack Ai → Ai

~

Pack Bp → Bp

~

nR

Ap

Bp

Cj

Ai

~ Bp

~

Bp

~Ci

Ci

kC

L3 cacheL2 cacheL1 cacheregisters

main memory

1st loop around micro-kernel

2nd loop around micro-kernel

micro-kernel

Ai


Introduction: What is our Solution

Kernel within a kernel

An approach for finding high performance candidate implementations.

A mechanical method for transforming that candidate into a high performance implementation.


Overview

Introduction

Background Kernel within a kernel Candidate Components that cover the kernel

Implementation: Reaching theoretical performance Keeping the non-obvious bottlenecks away Statically scheduling to deal with the remaining bottlenecks

Results and Analysis Summary


Background: A kernel for the BLIS kernel

LDALDB

MULADD

---PERM

MULADD

A0

A1

B0 B1

C00 C01

C10 C11

B0 B1

A0

A1


Background: Outer Product to ComponentPERM

B0

B1

B2

B3

B1

B0

B3

B2

B3

B2

B1

B0

B2

B3

B0

B1

B0

LDB

A0

A1

A2

A3

LDA

B0

B0

B0

B0

PERM

00

10

20

30FMA

A0

A1

A2

A3

LDAB0 B1 B2 B3

LDB

00

11

22

33

01

10

23

32

03

12

21

30

02

13

20

31

FMA


Implementing in context of the kernel

LDALDB

MULADD

-->PERM

MULADD

-->PERM

MULADD

-->PERM

MULADD

LDAvvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

A0

A1

A2

A3

A4

A5

A6

A7

B0 B1 B2 B3


Implementing a High Performance Kernel

• Remove any obstacles that will place the bottleneck anywhere but the multiply-add units..

• Statically schedule what we have to insure that multiply-add instructions retire at the predicted peak rate.


Implementation: Avoiding Bottlenecks

• Picking Instructions with shorter encodings:

vmovapd 66 41 0f 28 c0vmovaps 0f 28 47 c0

• Keeping Offsets between -128 and 127:

add 0x40, rdi 48 83 c7 40 add 0x100,rdi 48 83 c7 00 01 00 00

• Using Low Number Registers:

addpd xmm0,xmm8 66 44 0f 58 c0addpd xmm0,xmm1 66 44 59 c8


Statically Scheduling Instructions

s

LD ALU WR

5

s

LD ALU WR

6

s

LD ALU WR

7

// Prolog:READ(0)READ(1) ADD(0)---- ADD(1) READ(2)// Steady State:for(p = 3; p < K; p++ )WRITE(p-3) ---- ADD(p-1) READ(p)// Epilog:---- WRITE(K-2) ---- ADD(K-1)---- ---- ---- -------- ---- ---- WRITE(K-1)

RD

ADD

WR

LD ALU WR

1

2

3

4

Cloc

k Cy

cle

for(p = 0; p < K; p++ ) r1 = a[p] // 1 cycle r2 = r1+1 // 2 cycles a[p] = r2 // 1 cycle

[Lam, 1988]


Managing our Usage of Registers

LDALDB

MULADD

-->PERM

MULADD

-->PERM

MULADD

-->PERM

MULADD

LDAvvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

LDALDB PERM

MUL MULADD ADD

LDA

MUL MULADD ADD

MUL MULADD ADD

PERM PERM

MUL MULADD ADD


Scheduling the Partial Components

LDALDB PERM

MUL MULADD ADD

LDA

MULMULADDADD

MULMULADDADD

PERM PERM

MUL MULADD ADD

// CodeADDMULPERM

LDALDB PERM

MUL MULADD ADD

LDA

MULMULADDADD

MULMULADDADD

PERM PERM

MUL MULADD ADD


Overview

Introduction

Background Kernel within a kernel Candidate Components that cover the kernel

Implementation: Reaching theoretical performance Keeping the non-obvious bottlenecks away Statically scheduling to deal with the remaining bottlenecks

Results and Analysis Summary


Sequential Results: HSW and SNB


Sequential Results: NEH and PEN


Spilling Results


Partial Component Shape

Versus


Instruction Selection Model


Moving Forward

We wanted to take the mystery out of implementing high performance Blis kernel using the double precision outer-product kernel within the micro-kernel Look at other options and data types Capture all other architectures because of recurring motifs.


Summary

Distill to kernel in a kernel

Determined how we estimate the best theoretical options

Show how we can mechanically turn that selection into an implementation

Presented options for the future

carnegie mellon generation of simd dense linear algebra kernels with analytical models generation of...

Documents

candidate kernel

kernel implementation

solution kernel

blis kernel

kernel lda ldb mul

high performance kernel

carnegie mellon implementation

carnegie mellon background