carnegie mellon generation of simd dense linear algebra kernels with analytical models generation of...
TRANSCRIPT
Carnegie MellonCarnegie Mellon
Generation of SIMDDense Linear Algebra Kernels with Analytical Models
Richard M. Veras, Tze Meng Low & Franz FranchettiCarnegie Mellon University Tyler Smith & Robert van de GeijnUniversity of Texas at Austin
Carnegie MellonCarnegie Mellon
Introduction: What is the problem?4th loop around micro-kernel
3rd loop around micro-kernel
mR
mR
1
+=
+=
+=
+=
+=
kC
kC
mC mC
1
nR
kC
nR
Pack Ai → Ai
~
Pack Bp → Bp
~
nR
Ap
Bp
Cj
Ai
~ Bp
~
Bp
~Ci
Ci
kC
L3 cacheL2 cacheL1 cacheregisters
main memory
1st loop around micro-kernel
2nd loop around micro-kernel
micro-kernel
Ai
Carnegie MellonCarnegie Mellon
Introduction: What is our Solution
Kernel within a kernel
An approach for finding high performance candidate implementations.
A mechanical method for transforming that candidate into a high performance implementation.
Carnegie MellonCarnegie Mellon
Overview
Introduction
Background Kernel within a kernel Candidate Components that cover the kernel
Implementation: Reaching theoretical performance Keeping the non-obvious bottlenecks away Statically scheduling to deal with the remaining bottlenecks
Results and Analysis Summary
Carnegie MellonCarnegie Mellon
Background: A kernel for the BLIS kernel
LDALDB
MULADD
---PERM
MULADD
A0
A1
B0 B1
C00 C01
C10 C11
B0 B1
A0
A1
Carnegie MellonCarnegie Mellon
Background: Outer Product to ComponentPERM
B0
B1
B2
B3
B1
B0
B3
B2
B3
B2
B1
B0
B2
B3
B0
B1
B0
LDB
A0
A1
A2
A3
LDA
B0
B0
B0
B0
PERM
00
10
20
30FMA
A0
A1
A2
A3
LDAB0 B1 B2 B3
LDB
00
11
22
33
01
10
23
32
03
12
21
30
02
13
20
31
FMA
Carnegie MellonCarnegie Mellon
Implementing in context of the kernel
LDALDB
MULADD
-->PERM
MULADD
-->PERM
MULADD
-->PERM
MULADD
LDAvvv
MULADD
-->vvv
MULADD
-->vvv
MULADD
-->vvv
MULADD
A0
A1
A2
A3
A4
A5
A6
A7
B0 B1 B2 B3
Carnegie MellonCarnegie Mellon
Implementing a High Performance Kernel
• Remove any obstacles that will place the bottleneck anywhere but the multiply-add units..
• Statically schedule what we have to insure that multiply-add instructions retire at the predicted peak rate.
Carnegie MellonCarnegie Mellon
Implementation: Avoiding Bottlenecks
• Picking Instructions with shorter encodings:
vmovapd 66 41 0f 28 c0vmovaps 0f 28 47 c0
• Keeping Offsets between -128 and 127:
add 0x40, rdi 48 83 c7 40 add 0x100,rdi 48 83 c7 00 01 00 00
• Using Low Number Registers:
addpd xmm0,xmm8 66 44 0f 58 c0addpd xmm0,xmm1 66 44 59 c8
Carnegie MellonCarnegie Mellon
Statically Scheduling Instructions
s
LD ALU WR
5
s
LD ALU WR
6
s
LD ALU WR
7
// Prolog:READ(0)READ(1) ADD(0)---- ADD(1) READ(2)// Steady State:for(p = 3; p < K; p++ )WRITE(p-3) ---- ADD(p-1) READ(p)// Epilog:---- WRITE(K-2) ---- ADD(K-1)---- ---- ---- -------- ---- ---- WRITE(K-1)
RD
ADD
WR
LD ALU WR
1
2
3
4
Cloc
k Cy
cle
for(p = 0; p < K; p++ ) r1 = a[p] // 1 cycle r2 = r1+1 // 2 cycles a[p] = r2 // 1 cycle
[Lam, 1988]
Carnegie MellonCarnegie Mellon
Managing our Usage of Registers
LDALDB
MULADD
-->PERM
MULADD
-->PERM
MULADD
-->PERM
MULADD
LDAvvv
MULADD
-->vvv
MULADD
-->vvv
MULADD
-->vvv
MULADD
LDALDB PERM
MUL MULADD ADD
LDA
MUL MULADD ADD
MUL MULADD ADD
PERM PERM
MUL MULADD ADD
Carnegie MellonCarnegie Mellon
Scheduling the Partial Components
LDALDB PERM
MUL MULADD ADD
LDA
MULMULADDADD
MULMULADDADD
PERM PERM
MUL MULADD ADD
// CodeADDMULPERM
LDALDB PERM
MUL MULADD ADD
LDA
MULMULADDADD
MULMULADDADD
PERM PERM
MUL MULADD ADD
Carnegie MellonCarnegie Mellon
Overview
Introduction
Background Kernel within a kernel Candidate Components that cover the kernel
Implementation: Reaching theoretical performance Keeping the non-obvious bottlenecks away Statically scheduling to deal with the remaining bottlenecks
Results and Analysis Summary
Carnegie MellonCarnegie Mellon
Sequential Results: HSW and SNB
Carnegie MellonCarnegie Mellon
Sequential Results: NEH and PEN
Carnegie MellonCarnegie Mellon
Spilling Results
Carnegie MellonCarnegie Mellon
Partial Component Shape
Versus
Carnegie MellonCarnegie Mellon
Instruction Selection Model
Carnegie MellonCarnegie Mellon
Moving Forward
We wanted to take the mystery out of implementing high performance Blis kernel using the double precision outer-product kernel within the micro-kernel Look at other options and data types Capture all other architectures because of recurring motifs.
Carnegie MellonCarnegie Mellon
Summary
Distill to kernel in a kernel
Determined how we estimate the best theoretical options
Show how we can mechanically turn that selection into an implementation
Presented options for the future