Tensorlib: A Spatial Accelerator Generation Framework for Tensor Algebra
TRANSCRIPT
Tensorlib: A Spatial Accelerator Generation Framework for Tensor Algebra
Liancheng Jia1, Zizhang Luo1, Liqiang Lu1, and Yun Liang1,2
1Center for Energy-Efficient Computing and Applications, School of EECS, Peking University
2Pengcheng Laboratory, China
{jlc, semiwaker, liqianglu, ericlyun}@pku.edu.cn
Overview
• HASCO: HW/SW Co-design for Tensor Computation
• TENET: Modeling Tensor Dataflow Based on Relation-centric Notation
• Tensorlib: Auto-Generation of Spatial Hardware Accelerators
Spatial Accelerator Architecture
[Figure: Typical Architecture for Spatial Accelerators. A 2D PE array (PE0,0 … PEM,N), an on-chip buffer divided into banks, a controller, main memory, and a host CPU.]
Spatial Accelerators for Tensor Algebra
[Figure: Three spatial accelerators. Google TPU: HBM (8 GB), two MXUs, vector units, vector memory, and a controller. ShiDianNao: multiple cores, SRAM, NoC, DRAM, and a controller. MIT Eyeriss: a PE array with input image, input/output/state buffers, and a controller. Shared traits: multiple memory banks, a 2D PE array, a unified controller, and multi-level storage.]
Various Hardware Dataflows
• Output stationary systolic
• Weight stationary systolic (applied by TPU)
• Row stationary (applied by Eyeriss)
• Multicast + systolic (applied by ShiDianNao)
• Reduction tree (applied by MAERI)
• Multicast

Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." ISCA 2017.
Chen, Yu-Hsin, et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." JSSC 2016.
Du, Zidong, et al. "ShiDianNao: Shifting vision processing closer to the sensor." ISCA 2015.
Kwon, H., et al. "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects." ASPLOS 2018.
Architecture Similarities and Differences

Similarities:
• Multi-level storage hierarchy; the on-chip buffer is divided into banks
• PEs form a spatial array
• Controlled by a unified controller

Differences:
• Different PE array dataflow
• Different PE size and data type
• Different kernel implementation
• Different NoC architecture
Current Solutions for Spatial Accelerator Development

| Approach | Expression | Performance | Productivity | Dataflow Support |
| --- | --- | --- | --- | --- |
| Manual HDL | Verilog | ★★★ | ★ | ★ |
| High-Level Synthesis | C/C++ | ★★ | ★★ | ★ |
| HLS + DSL | Halide / TVM DSL | ★ | ★★★ | ★★ |
| Tensorlib (ours) | Space-Time Transformation | ★★★ | ★★★ | ★★★ |

Dilemma between productivity and performance.
Overview of Tensorlib
Inputs: a dataflow definition (e.g., Tensor A: systolic ↑, Tensor B: stationary, Tensor C: systolic →) and a compute definition.

Flow: dataflow generation (space-time dataflow analysis) → PE internal modules → PE interconnection codegen → Chisel → Verilog → FPGA/ASIC synthesis.

Benefits: simple expression, various dataflows, good productivity, high performance.
Dataflow Analysis
Relations in Spatial Accelerators
Three relations connect the iteration domain, the hardware, and the data:
• Instance-hardware (dataflow relation): maps each instance [i, j, k] in the iteration domain to a PE and to its position in the execution sequence (cycle)
• Instance-data: maps each instance to the input and output tensor elements it accesses (e.g., an instance accesses A[0][2], B[2][0], and Y[0][0])
• Hardware-data (data assignment): maps each PE and cycle to the data it holds
L. Lu et al. "TENET: A framework for modeling tensor dataflow based on relation-centric notation." ISCA 2021.
How to represent relations?
Loop nest:

    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        for (k = 0; k < 3; k++)
          // instance [i, j, k]:
          Y[i][j] += A[i][k] * B[k][j];

Instance-data: each tensor access is a linear map from the instance vector to a tensor index. For A[i][k]:

    [1 0 0]   [i]   [i]
    [0 0 1] · [j] = [k]
              [k]

Instance-hardware: how do we map an instance [i, j, k] to a PE index [x, y] and a cycle [t]? This is the role of the Space-Time Transformation; the hardware-data relation then follows by composing the two.
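The instance-data relation above can be checked numerically. A minimal Python sketch (illustrative only; `access` and `A_access` are not part of Tensorlib's Scala API):

```python
def access(matrix, instance):
    """Apply a tensor access matrix to an iteration-space vector [i, j, k]."""
    return [sum(m * v for m, v in zip(row, instance)) for row in matrix]

# Access matrix for A[i][k]: the first row selects i, the second selects k.
A_access = [[1, 0, 0],
            [0, 0, 1]]

# Instance (i=1, j=2, k=2) reads A[1][2].
print(access(A_access, [1, 2, 2]))  # -> [1, 2]
```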
Space-Time Transformation
Space-Time Transformation (STT): transforms the loop iteration domain into the hardware space-time domain. It can be expressed as a matrix multiplication:

    STT · [i, j, k]^T = [x, y, t]^T

Example, for the GEMM loop nest (instance [i, j, k]: Y[i][j] += A[i][k] * B[k][j]):

    STT = [1 0 0]
          [0 1 0]
          [1 1 1]

    STT · [1, 2, 2]^T = [1, 2, 5]^T

so instance [1, 2, 2] executes on PE(1, 2) of the 3×3 PE array at cycle 5.
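The example can be replayed in a few lines of Python (a sketch under the notation above, not Tensorlib code):

```python
def apply_stt(stt, instance):
    """Map an iteration vector [i, j, k] to a space-time point [x, y, t]."""
    return [sum(m * v for m, v in zip(row, instance)) for row in stt]

# Output-stationary systolic STT: x = i, y = j, t = i + j + k.
STT = [[1, 0, 0],
       [0, 1, 0],
       [1, 1, 1]]

print(apply_stt(STT, [1, 2, 2]))  # -> [1, 2, 5]: PE(1, 2), cycle 5
```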
Formulating Hardware-Data Relation
Step 1: space-time domain → instance domain, via the inverse Space-Time Transformation:

    STT^-1 · [x, y, t]^T = [i, j, k]^T

Step 2: instance domain → tensor index domain, via the tensor access formula:

    T · [i, j, k]^T = [m, n]^T

Combining the two steps gives the hardware-data relation:

    T · STT^-1 · [x, y, t]^T = [m, n]^T

[Figure: snapshots of a 2×2 PE array at t = 1, 2, 3, showing which element of tensor A, and which instance [i, j, k], each PE holds in each cycle.]
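The two-step composition can be traced in Python for the output-stationary example (an illustrative sketch; the inverse matrix is written out by hand for this particular STT):

```python
def matvec(m, v):
    """Matrix-vector product over integer lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product over integer lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Inverse of the output-stationary STT: i = x, j = y, k = t - x - y.
STT_INV = [[ 1,  0, 0],
           [ 0,  1, 0],
           [-1, -1, 1]]

# Tensor access matrix T for A[i][k].
A_ACCESS = [[1, 0, 0],
            [0, 0, 1]]

# Hardware-data relation: T · STT^-1.
hw_to_data = matmul(A_ACCESS, STT_INV)

# PE(x=1, y=2) at cycle t=5 holds A[1][2].
print(matvec(hw_to_data, [1, 2, 5]))  # -> [1, 2]
```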
Reuse Space and PE-Memory Interconnection

The reuse subspace in the (x, y, t) domain determines the PE-memory interconnection:
• 0-D: no data reuse; each PE reads its data once from a memory bank
• 1-D: the reuse subspace forms a line in space-time
• 2-D: the reuse subspace forms a plane

[Figure: 0-D, 1-D, and 2-D reuse subspaces drawn in (x, y, t) axes.]
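One way to find reuse directions, sketched in Python: a vector r in (x, y, t) is a reuse direction of a tensor exactly when the hardware-data map sends it to a zero index change, i.e. (T · STT^-1) · r = 0. The matrix below is the hand-computed map for tensor A under the output-stationary STT; the code is illustrative, not Tensorlib's:

```python
def is_reuse_dir(hw_to_data, r):
    """True if moving along r in (x, y, t) leaves the tensor index unchanged."""
    return all(sum(a * b for a, b in zip(row, r)) == 0 for row in hw_to_data)

# Hardware-data map T · STT^-1 for tensor A[i][k] under the
# output-stationary STT (x = i, y = j, t = i + j + k).
A_hw_to_data = [[ 1,  0, 0],
                [-1, -1, 1]]

# A is reused along (x, y, t) = (0, 1, 1): it moves one PE in y per cycle.
print(is_reuse_dir(A_hw_to_data, [0, 1, 1]))  # True
print(is_reuse_dir(A_hw_to_data, [1, 0, 0]))  # False
```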
1-D Reuse Type Identification

A 1-D reuse type forms a vector in the space-time domain, with a space part p and a time part t: reuse direction (x, t) = (p, t).

| Reuse vector direction | PE behavior | Dataflow type |
| --- | --- | --- |
| p = 0, t ≠ 0 | Data stays in the same PE during execution | Stationary |
| p ≠ 0, t ≠ 0 | Data is delayed before being sent to the next PE | Systolic |
| p ≠ 0, t = 0 | Data arrives at all PEs at the same time | Multicast |
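The classification rule in the table can be written down directly (a sketch; the function name is illustrative):

```python
def classify(p, t):
    """Classify a 1-D reuse vector with space part p (a list) and time part t."""
    has_space = any(c != 0 for c in p)
    if not has_space and t != 0:
        return "stationary"  # data stays in the same PE across cycles
    if has_space and t != 0:
        return "systolic"    # data is delayed, then passed to the next PE
    if has_space and t == 0:
        return "multicast"   # the same data reaches several PEs at once
    raise ValueError("the zero vector is not a reuse direction")

print(classify([0, 0], 1))  # stationary
print(classify([0, 1], 1))  # systolic
print(classify([1, 0], 0))  # multicast
```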
Hardware Generation
Systolic Dataflow
Reuse vector (p, t) with p ≠ 0 and t ≠ 0:
• Data is transferred from PE x to PE x + p after a delay of t cycles
• A shift register inside each PE provides the delay

[Figure: PE dataflow for an input tensor (shift register feeding the compute cell) and for an output tensor (shift register after the compute cell).]
Multicast Dataflow
Reuse vector (p, t) with p ≠ 0 and t = 0:
• Data is transferred to all PEs along the same line simultaneously
• Inputs are sent directly to the compute cell
• Outputs are combined with a reduction tree

[Figure: PE dataflow for an input tensor (direct connection into the compute cell) and for an output tensor (reduction tree).]
Stationary Dataflow
Reuse vector (p, t) with p = 0 and t ≠ 0:
• Data stays stationary inside its PE during execution
• A systolic dataflow is still required to move data into and out of the PE array
• A double buffer interleaves the two dataflows

[Figure: PE dataflow for an input tensor and for an output tensor, each with a double-buffered register pair around the compute cell.]
Building PE Inner Architecture

Each tensor port of the PE basic structure (in_A/out_A, in_B/out_B, in_C/out_C around the compute cell) is substituted with one of six dataflow modules, chosen by the tensor's role (input or output) and its dataflow type (systolic, stationary, or multicast):
• Input systolic: registers on the input path
• Input stationary: register pair holding the input in place
• Input multicast: direct connection into the compute cell
• Output systolic: registers on the output path
• Output stationary: register pair holding the partial result
• Output multicast: reduction across PEs
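The substitution step can be pictured as a table lookup. In this Python sketch the module names (ShiftRegIn, ReductionTreeOut, etc.) are hypothetical labels, not Tensorlib identifiers:

```python
# (role, dataflow type) -> PE dataflow module; names are illustrative only.
MODULES = {
    ("input",  "systolic"):   "ShiftRegIn",
    ("input",  "stationary"): "DoubleBufferIn",
    ("input",  "multicast"):  "DirectIn",
    ("output", "systolic"):   "ShiftRegOut",
    ("output", "stationary"): "DoubleBufferOut",
    ("output", "multicast"):  "ReductionTreeOut",
}

def build_pe(dataflows):
    """dataflows: tensor name -> (role, dataflow type)."""
    return {tensor: MODULES[df] for tensor, df in dataflows.items()}

# Output-stationary systolic array: A and B flow, C stays put.
pe = build_pe({"A": ("input", "systolic"),
               "B": ("input", "systolic"),
               "C": ("output", "stationary")})
print(pe["C"])  # DoubleBufferOut
```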
Building PE Architecture - Example

Output Stationary Architecture:
• A: systolic (register on the A path)
• B: systolic (register on the B path)
• C: stationary (register pair holding the partial sum at the compute cell)

[Figure: PE template with the placeholder modules filled in for the output-stationary dataflow.]
Building PE Architecture - Example

Weight Stationary Architecture:
• A: systolic
• B: stationary (register pair holding the weight at the compute cell)
• C: systolic

[Figure: PE template with the placeholder modules filled in for the weight-stationary dataflow.]
Building PE Array Interconnection - Example

Tensors A and B are delivered to the 2D PE array by per-dataflow network templates:
• Systolic [0, 1] and Systolic [1, 0]: nearest-neighbor links along a row or column
• Multicast [0, 1] and Multicast [1, 0]: a shared wire per row or column
• Reduction Tree [0, 1]: an adder tree combining outputs along a row

[Figure: PE array diagrams for each interconnection type.]
Implementation
Hierarchical bottom-up generation:
• PE basic modules + PE dataflow templates + PE config → PE architecture
• PE architecture + PE network templates + network config → PE array architecture
• PE array architecture + memory templates + memory config → complete accelerator

    class PEArray(params):
        val PEs    = Array[size] (new PE(pe_config))
        val pe_net = new PENetwork(net_config)
        connect_PE(PEs, pe_net)
        val mem    = Array[banks] (new MemBank(mem_config))
        connect_mem(pe_net, mem)
Experiments and Evaluation
Dataflow Performance Evaluation
• Evaluated 6 tensor applications
• Covered systolic, multicast, stationary, unicast, and broadcast dataflows
• Performance varies across applications and dataflows because of dataflow mapping and bandwidth

[Figure: performance results grouped by application and by dataflow.]
FPGA Performance Results

| Metric | Susy [ICCAD'20] GEMM | Susy Conv | PolySA [ICCAD'18] GEMM | PolySA Conv | Tensorlib GEMM | Tensorlib Conv |
| --- | --- | --- | --- | --- | --- | --- |
| LUT (%) | 40 | 35 | 49 | 49 | 68 | 75 |
| DSP (%) | 93 | 84 | 89 | 89 | 75 | 75 |
| BRAM (%) | 32 | 30 | 89 | 71 | 51 | 73 |
| Freq (MHz) | 202 | 220 | 229 | 229 | 263 | 245 |
| Throughput (Gop/s) | 547 | 551 | 555 | 548 | 673 | 626 |

Susy targets an Arria-10 FPGA; PolySA and Tensorlib target a Xilinx VU9P.

21% frequency improvement and 15% throughput improvement over the baselines.
Summary
• Tensorlib: A Spatial Accelerator Generation Framework for Tensor Algebra
• Dataflow Representation
• Analysis based on space-time transformation
• Hardware Generation
• Hierarchical bottom-up generation
• Open source: https://github.com/pku-liang/tensorlib
Framework Tutorial : Tensorlib API
Tensorlib provides an API to define the computation:

| Tensorlib API | Description |
| --- | --- |
| genIterators(x) | Define x iterators and return the iterator list |
| genTensors(x) | Define x tensors and return the tensor list |
| setExpr(expr) | Define the compute expression with tensors and iterators |
| iterator.setRange(x) | Set the range of an iterator to x |
| Tensor.setWidth(x) | Set the tensor data bitwidth to x |
| setLatency(x) | Define the latency of the computation pipeline |
| useCustomKernel | Whether to use a customized IP for the compute cell |
Framework Tutorial: GEMM
Example: GEMM

Dataflow: output-stationary systolic array

    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        for (k = 0; k < 3; k++)
          C(i)(j) += A(i)(k) * B(k)(j)
Framework Tutorial: Example
Step 1: Workload Definition

    val opSpec_Gemm = new OperatorSpec {
      val i :: j :: k :: Nil = genIterators(3)
      val A :: B :: C :: Nil = genTensor(3)
      setExpr(C(i)(j) += A(i)(k) * B(k)(j))
      i.setRange(8)
      k.setRange(20)
      j.setRange(8)
      setLatency(1)
      A.setWidth(16)
      B.setWidth(16)
      C.setWidth(16)
    }

Step 2: Space-Time matrix definition

    val stt = DenseMatrix(
      (1, 0, 0),
      (0, 1, 0),
      (1, 1, 1),
    )

Step 3: Generate dataflow and Verilog code

    val config = gen_dataflow(opSpec_Gemm, stt)
    chisel3.Driver.execute(args, () => new PEArray(config))
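A quick sanity check on a chosen STT, sketched in Python: the matrix should be invertible over the integers (determinant ±1) so that each (x, y, t) point corresponds to exactly one loop instance. This check is an illustration, not part of Tensorlib's API:

```python
def det3(m):
    """Determinant of a 3x3 integer matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

STT = [[1, 0, 0],
       [0, 1, 0],
       [1, 1, 1]]

print(det3(STT))  # -> 1: unimodular, so the space-time mapping is a bijection
```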
Framework Tutorial: Using Pre-defined Configs
Run Tensorlib with a pre-defined dataflow config:

    sbt "runMain tensorlib.ParseJson examples/gemm.json output/output.v"

Config file example (can be generated by HASCO):

    {
      "benchmark": "GEMM",
      "bitwidth": 16,
      "length": [8, 8, 8],
      "STT": [[1, 0, 0], [0, 0, 1], [1, 1, 1]]
    }
Framework Tutorial: Customized Kernels
The default kernel is an integer multiply-add; it can be substituted with user-provided IPs.

Step 1: Choose the customized kernel in the computation definition:

    val opSpec = new OperatorSpec {
      ...
      useCustomKernel(true)
      ...
    }

Step 2: Provide a customized computation kernel and reduction kernel:

    module ComputeCell_Customized(
      input         clock,
      input         reset,
      input  [15:0] io_data_2_in,
      input  [15:0] io_data_1_in,
      input  [15:0] io_data_0_in,
      output [15:0] io_data_0_out
    );

    module Reduction_Customized(
      input  [15:0] io_in_a,
      input  [15:0] io_in_b,
      output [15:0] io_out
    );
Result Validation
Chisel Tester Validation

Runs a validation program for the Tensorlib-generated GEMM accelerator on the Chisel cycle-level simulator:

    sbt "runMain tensorlib.Test_Gemm"