carnegie mellon generation of simd dense linear algebra kernels with analytical models generation of...

20
Carnegie Mellon Carnegie Mello eneration of SIMD ense Linear Algebra Kernels ith Analytical Models chard M. Veras, Tze Meng Low & Franz Franchetti negie Mellon University er Smith & Robert van de Geijn versity of Texas at Austin

Upload: albert-knight

Post on 13-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Generation of SIMDDense Linear Algebra Kernels with Analytical Models

Richard M. Veras, Tze Meng Low & Franz FranchettiCarnegie Mellon University Tyler Smith & Robert van de GeijnUniversity of Texas at Austin

Page 2: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Introduction: What is the problem?4th loop around micro-kernel

3rd loop around micro-kernel

mR

mR

1

+=

+=

+=

+=

+=

kC

kC

mC mC

1

nR

kC

nR

Pack Ai → Ai

~

Pack Bp → Bp

~

nR

Ap

Bp

Cj

Ai

~ Bp

~

Bp

~Ci

Ci

kC

L3 cacheL2 cacheL1 cacheregisters

main memory

1st loop around micro-kernel

2nd loop around micro-kernel

micro-kernel

Ai

Page 3: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Introduction: What is our Solution

Kernel within a kernel

An approach for finding high performance candidate implementations.

A mechanical method for transforming that candidate into a high performance implementation.

Page 4: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Overview

Introduction

Background Kernel within a kernel Candidate Components that cover the kernel

Implementation: Reaching theoretical performance Keeping the non-obvious bottlenecks away Statically scheduling to deal with the remaining bottlenecks

Results and Analysis Summary

Page 5: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Background: A kernel for the BLIS kernel

LDALDB

MULADD

---PERM

MULADD

A0

A1

B0 B1

C00 C01

C10 C11

B0 B1

A0

A1

Page 6: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Background: Outer Product to ComponentPERM

B0

B1

B2

B3

B1

B0

B3

B2

B3

B2

B1

B0

B2

B3

B0

B1

B0

LDB

A0

A1

A2

A3

LDA

B0

B0

B0

B0

PERM

00

10

20

30FMA

A0

A1

A2

A3

LDAB0 B1 B2 B3

LDB

00

11

22

33

01

10

23

32

03

12

21

30

02

13

20

31

FMA

Page 7: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Implementing in context of the kernel

LDALDB

MULADD

-->PERM

MULADD

-->PERM

MULADD

-->PERM

MULADD

LDAvvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

A0

A1

A2

A3

A4

A5

A6

A7

B0 B1 B2 B3

Page 8: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Implementing a High Performance Kernel

• Remove any obstacles that will place the bottleneck anywhere but the multiply-add units..

• Statically schedule what we have to insure that multiply-add instructions retire at the predicted peak rate.

Page 9: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Implementation: Avoiding Bottlenecks

• Picking Instructions with shorter encodings:

vmovapd 66 41 0f 28 c0vmovaps 0f 28 47 c0

• Keeping Offsets between -128 and 127:

add 0x40, rdi 48 83 c7 40 add 0x100,rdi 48 83 c7 00 01 00 00

• Using Low Number Registers:

addpd xmm0,xmm8 66 44 0f 58 c0addpd xmm0,xmm1 66 44 59 c8

Page 10: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Statically Scheduling Instructions

s

LD ALU WR

5

s

LD ALU WR

6

s

LD ALU WR

7

// Prolog:READ(0)READ(1) ADD(0)---- ADD(1) READ(2)// Steady State:for(p = 3; p < K; p++ )WRITE(p-3) ---- ADD(p-1) READ(p)// Epilog:---- WRITE(K-2) ---- ADD(K-1)---- ---- ---- -------- ---- ---- WRITE(K-1)

RD

ADD

WR

LD ALU WR

1

2

3

4

Cloc

k Cy

cle

for(p = 0; p < K; p++ ) r1 = a[p] // 1 cycle r2 = r1+1 // 2 cycles a[p] = r2 // 1 cycle

[Lam, 1988]

Page 11: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Managing our Usage of Registers

LDALDB

MULADD

-->PERM

MULADD

-->PERM

MULADD

-->PERM

MULADD

LDAvvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

-->vvv

MULADD

LDALDB PERM

MUL MULADD ADD

LDA

MUL MULADD ADD

MUL MULADD ADD

PERM PERM

MUL MULADD ADD

Page 12: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Scheduling the Partial Components

LDALDB PERM

MUL MULADD ADD

LDA

MULMULADDADD

MULMULADDADD

PERM PERM

MUL MULADD ADD

// CodeADDMULPERM

LDALDB PERM

MUL MULADD ADD

LDA

MULMULADDADD

MULMULADDADD

PERM PERM

MUL MULADD ADD

Page 13: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Overview

Introduction

Background Kernel within a kernel Candidate Components that cover the kernel

Implementation: Reaching theoretical performance Keeping the non-obvious bottlenecks away Statically scheduling to deal with the remaining bottlenecks

Results and Analysis Summary

Page 14: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Sequential Results: HSW and SNB

Page 15: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Sequential Results: NEH and PEN

Page 16: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Spilling Results

Page 17: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Partial Component Shape

Versus

Page 18: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Instruction Selection Model

Page 19: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Moving Forward

We wanted to take the mystery out of implementing high performance Blis kernel using the double precision outer-product kernel within the micro-kernel Look at other options and data types Capture all other architectures because of recurring motifs.

Page 20: Carnegie Mellon Generation of SIMD Dense Linear Algebra Kernels with Analytical Models Generation of SIMD Dense Linear Algebra Kernels with Analytical

Carnegie MellonCarnegie Mellon

Summary

Distill to kernel in a kernel

Determined how we estimate the best theoretical options

Show how we can mechanically turn that selection into an implementation

Presented options for the future