
Page 1: MemFlow: Memory-Driven Data Scheduling with Datapath Co-design in Accelerators for Large-scale Applications

Jade Nie, Sharad Malik

This work is sponsored in part by SONIC and C-FAR, funded centers of STARnet, a Semiconductor Research Corporation (SRC) program sponsored by MARCO and DARPA.

Page 2: Demands of large-scale computation

Neural networks:
- AlexNet: 240 MB
- VGG16: 552 MB
- GoogLeNet: 28 MB
- SqueezeNet: 4.8 MB

Eigenface:
- Feature size = 500 x 500 = 250,000
- Eigendecomposition on a 250,000 x 250,000 matrix

Page 3: Accelerator design is memory-driven

Latency: DRAM access is ~10x SRAM, ~20x function units
Energy consumption: DRAM access is ~100x SRAM, ~1000x function units

On-chip SRAM is small:
- Intel Core i7: 4-8 MB cache
- Intel Core i9: 25 MB cache
- Nvidia GeForce 1080 GTX: 4928 KB SRAM
- TPU: 24 MB SRAM
- FPGA: ~10 MB

DRAM access is inefficient; on-chip SRAM is limited.

Page 4: Accelerator architecture

Software-controlled scratch-pad SRAM.

DRAM accesses:
- Compulsory accesses: load the initial input, store back the final results
- Accesses due to the limited SRAM size: load/store spilled intermediate results

DRAM accesses are determined by the computation scale, the SRAM size, and the data scheduling (a sketch of the compulsory part follows).
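A minimal sketch of the compulsory portion, assuming a matrix multiply C = A x B with dimensions M, N, K (the function name and the matmul setting are illustrative, not from the slides):

```python
def compulsory_dram_words(M, N, K):
    # Compulsory DRAM traffic for C = A @ B with A: M x K, B: K x N, C: M x N:
    # load each input once and store the result once. Spill traffic comes on
    # top of this and depends on the SRAM size and the data schedule.
    return M * K + K * N + M * N
```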

Page 5: Effect of scheduling – DRAM accesses

[Figure: small matrix multiply example (matrix dimensions 10 and 5) under a naive schedule]

Scratch-pad size: 35 scalar data
DRAM accesses due to spilling: 10 / 20 / 30

Page 6: Effect of scheduling – DRAM accesses

[Figure: the same matrix multiply example under a better schedule]

Scratch-pad size: 35 scalar data
DRAM accesses due to spilling: 0

Page 7: Global Optimization – NP-hard problem

[Figure: data dependency graph of ops and scalar data, with values passed through registers or through the scratch-pad]

Variables:
- Starting cycle of each op

Constraints:
- Bound on scratch-pad size
- Bound on read/write ports
- Number of function units

Objective:
- Number of data spills
- Number of cycles

Optimization tools:
- Modern constraint solvers: SAT, SMT, ILP
- Simulated annealing

Not feasible:
- Complexity
- Irregular scheduling is hard to control

→ Optimization with regularity

Page 8: MemFlow

• Tunable parameters from three levels
  • Level 1: block dimensions
  • Level 2: block scheduling
  • Level 3: parallelism of local computing
• Mathematical modeling
  • DRAM accesses, SRAM accesses, cycles
• Integrated optimization

Page 9: Blocking – matrix multiply

Macro node: block computation A_Block * B_Block + C_Block → C_Block

Page 10: Blocking – LU factorization

Block computations: LU, TRS, LUCPL, SUB (GEMM)

$A \to L_{lower} * U_{upper}$
$A * U_{upper}^{-1} \to L$
$L_{lower}^{-1} * A \to U$
$A - L * U \to A$

Overall: $A \to L * U$, where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix. (A hedged code sketch of how the block computations compose follows.)

Page 11: Blocking – QR factorization

Block computations: QR, QRUpdateTr, QRCPL, QRUpdate

$A \to H(V_{lower}) * R_{upper}$
$A * R_{upper}^{-1} \to H(V)$
$H(V_{lower}) * A \to R$
$H(V) * [R, A] \to [R, A]$

$V = [v_1, v_2, ..., v_n]$, $H(V) = P_n P_{n-1} \cdots P_1$ where $P_i = I - v_i v_i^T$ (a small sketch of this product follows).

Overall: $A \to Q * R$, where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix.
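A minimal sketch of the H(V) notation above (the factor-of-two/normalization of each reflector is assumed to be folded into v_i, as the slide's formula suggests; the function name is illustrative):

```python
import numpy as np

def householder_product(V):
    # H(V) = P_n * P_{n-1} * ... * P_1 with P_i = I - v_i v_i^T.
    m, n = V.shape
    H = np.eye(m)
    for i in range(n):
        v = V[:, i:i+1]                  # column vector v_i, shape (m, 1)
        H = (np.eye(m) - v @ v.T) @ H    # left-multiply by P_i, so the product ends as P_n ... P_1
    return H
```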

Page 12: Why blocking?

[Figure: data blocks move between DRAM and a scratch-pad holding an A block, a B block, and a C block, which feed a customized datapath]

- Increase data reuse for each array
- Balance data reuse between arrays
- Fully local computation: SRAM bandwidth is the key limitation

Page 13: Level 1: block dimensions

[Figure: a block parameterized by dim_i, dim_j, dim_l]

Larger blocks are not always better!

Page 14: Level 2: block scheduling

Six choices: i→j→l, j→i→l, l→i→j, i→l→j, l→j→i, j→l→i

List scheduling + regularity:

For i : 0 → num_blk_m
  For j : 0 → num_blk_n
    For l : 0 → num_blk_k
      execute macro node (i, j, l)
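A minimal Python sketch of this blocked loop nest with the loop ordering as a parameter; the function and argument names are illustrative, and the six orderings correspond to the permutations of "ijl":

```python
import numpy as np
from itertools import product

def blocked_matmul(A, B, blk_i, blk_j, blk_l, order="ijl"):
    # Level 1: block dimensions (blk_i, blk_j, blk_l).
    # Level 2: block scheduling -- `order` is one of the six permutations of "ijl".
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    ranges = {"i": range(0, M, blk_i),
              "j": range(0, N, blk_j),
              "l": range(0, K, blk_l)}
    for point in product(*(ranges[d] for d in order)):
        pos = dict(zip(order, point))
        i, j, l = pos["i"], pos["j"], pos["l"]
        # Macro node (i, j, l): A_block * B_block + C_block -> C_block
        C[i:i+blk_i, j:j+blk_j] += A[i:i+blk_i, l:l+blk_l] @ B[l:l+blk_l, j:j+blk_j]
    return C
```

For example, `blocked_matmul(A, B, 64, 64, 64, order="lij")` executes the l→i→j ordering.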

Page 15: Fully local computation

[Figure: scratch-pad holding an A block, a B block, and a C block, connected to a customized datapath]

- How to lay out the data blocks?
- How to design the datapath, and how to decide its parallelism?

Page 16: Level 3: parallelism of local computation

Local computation of a matrix-multiply macro node [figure]. (A hedged sketch follows.)
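To make the parallelism concrete, here is a sketch in which the two parallelism parameters (called PA and PB later in the deck) unroll the i and j dimensions of the block; this pairing is an assumption of mine, and the names are illustrative:

```python
import numpy as np

def macro_node_local(A_blk, B_blk, C_blk, PA, PB):
    # Local computation of one matrix-multiply macro node: per step, a PA x PB
    # tile of C_blk receives PA * PB multiply-accumulates, which the customized
    # datapath would execute in parallel (serialized here for clarity).
    bi, bl = A_blk.shape
    bj = B_blk.shape[1]
    for l in range(bl):                                    # stream one column of A / row of B
        for i0 in range(0, bi, PA):
            for j0 in range(0, bj, PB):
                for i in range(i0, min(i0 + PA, bi)):      # PA parallel lanes
                    for j in range(j0, min(j0 + PB, bj)):  # PB parallel lanes
                        C_blk[i, j] += A_blk[i, l] * B_blk[l, j]
    return C_blk
```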

Page 17: Datapath sharing

[Figure: datapath sharing for Matmul, LU, and QR, with panels for the LU and QR block computations]

Page 18: MemFlow

• Tunable parameters from three levels
  • Level 1: block dimensions – blk_dimi, blk_dimj, blk_diml
  • Level 2: block scheduling – six loop orderings
  • Level 3: parallelism of local computing – PA, PB
• Mathematical modeling
  • DRAM accesses, SRAM accesses, cycles
• Integrated optimization

Page 19: Modeling – DRAM accesses

[Figure: macro nodes laid out over time, showing which of the blocks A00..A22, B00..B22, C00..C22 each macro node accesses]

Block reuse pattern in matrix multiply for loop ordering i → j → l

Page 20: Modeling – DRAM accesses

A group of alternately used blocks.

When SRAM is full, evict a block based on the optimal replacement policy: replace the block whose next use is furthest in the future (a small simulation sketch follows).
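A minimal sketch of counting evictions under this furthest-next-use (Belady) policy over a trace of block accesses; the function name, the block-granular capacity, and the trace format are assumptions of mine:

```python
def count_evictions(block_trace, capacity):
    # Simulate the scratch-pad (capacity measured in blocks) under the optimal
    # replacement policy: on a miss with a full scratch-pad, evict the resident
    # block whose next use is furthest away (never used again counts as furthest).
    resident, evictions = set(), 0
    for t, blk in enumerate(block_trace):
        if blk in resident:
            continue                       # hit: the block is already on chip
        if len(resident) == capacity:      # miss with a full scratch-pad: evict
            def next_use(b):
                for s in range(t + 1, len(block_trace)):
                    if block_trace[s] == b:
                        return s
                return float("inf")
            victim = max(resident, key=next_use)
            resident.remove(victim)
            evictions += 1
        resident.add(blk)
    return evictions
```

Feeding it the block-access order produced by a given loop ordering gives a cross-check for the analytical eviction count on the next slide.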

Page 21: Modeling – DRAM accesses

#data evicted for each matrix = #reuse groups * #periods * #blocks evicted per period * block size

These factors are functions of the input size, the block dimensions, the block ordering, and the scratch-pad partitioning.
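A direct transcription of this count (the parameter names are illustrative):

```python
def evicted_words_per_matrix(num_reuse_groups, num_periods,
                             blocks_evicted_per_period, block_size):
    # #data evicted = #reuse groups * #periods * #blocks evicted per period * block size
    return num_reuse_groups * num_periods * blocks_evicted_per_period * block_size
```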

Page 22: Mathematical Modeling

• DRAM accesses: input size, block size, loop ordering, memory partitioning factor
• SRAM accesses: input size, block dimensions, SRAM partitioning
• Cycles: input size, SRAM partitioning

Page 23: MemFlow

• Tunable parameters from three levels
  • Level 1: block dimensions – blk_dimi, blk_dimj, blk_diml
  • Level 2: block scheduling – six loop orderings
  • Level 3: parallelism of local computing – PA, PB
• Mathematical modeling
  • DRAM accesses, SRAM accesses, cycles
• Integrated optimization

Page 24: Integrated Optimization

• Parameters: block dimensions, loop ordering, memory partitioning
• Objectives
  • Primary: weighted total of DRAM accesses and SRAM accesses
  • Secondary: cycles
• Optimization methods
  • Gradient descent to the optimal order of magnitude
  • Parameter sweeping for the global optimum (see the sketch below)
• For a specific kernel and input size:
  • Worst-case search space: $O(N^{1.5} P \log P)$, where N is the input size and P is the total number of SRAM ports
  • ~90 seconds for P = 16, N = 1000
• Optimization is sensitive to the input size
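A hedged sketch of the sweep phase; `cost_model` and all other names below are placeholders of mine, with `cost_model` assumed to return (DRAM accesses, SRAM accesses, cycles) for one configuration, and the lexicographic key encoding the primary/secondary objectives:

```python
from itertools import product

def sweep(block_dims, loop_orders, partitionings, cost_model, weight):
    # Exhaustive sweep over the tunable parameters.
    best_cfg, best_key = None, None
    for dims, order, part in product(block_dims, loop_orders, partitionings):
        dram, sram, cycles = cost_model(dims, order, part)
        # Primary objective: weighted total of DRAM and SRAM accesses; secondary: cycles.
        key = (weight * dram + sram, cycles)
        if best_key is None or key < best_key:
            best_cfg, best_key = (dims, order, part), key
    return best_cfg, best_key
```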

Page 25: Results Comparison – cache reuse

Metric for evaluating data reuse: number of SRAM allocations
- Counts each time a new word is stored in SRAM
- For a cache, equals (read misses + write misses) * cache line size

L1 cache: 32 KB, 8-way associative, 64 B cache line
MemFlow: 8 dual-port 4 KB banks (32 KB total)

Page 26: Results Comparison – cache reuse

SRAM allocations for MemFlow and the L1 cache:
- MM: 37%; due to spilling: 75%
- LU: 26%; due to spilling: 76%
- QR: 23%; due to spilling: 55%

Page 27: Results – GPU+cuBLAS

GPU: Nvidia GeForce 1080 GTX
• Total SRAM size: 4928 KB (L1 cache, shared memory, L2 cache)
• cuBLAS library:
  • Matrix multiply: cublasSgemm
  • LU decomposition: cublasSgetrfBatched
  • QR decomposition: cublasSgeqrfBatched

MemFlow: 1232 dual-port 4 KB SRAM banks (4928 KB total)

Page 28: Results – GPU+cuBLAS

Improvements in SRAM and DRAM accesses [figure].

Page 29: Results – GPU+cuBLAS

Matrix multiply data use distribution [figure panels: GPU vs. MemFlow]

Page 30: Results – Vivado HLS

Datapath comparison with Vivado HLS

Page 31: Future Work

• More efficient optimization methods
• Optimizing data scheduling for a flow of computation kernels
• Verilog validation