© Copyright 2018 Xilinx
Xilinx DNN Processor
An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs
Rahul Nimaiyar, Brian Sun, Victor Wu, Thomas Branca, Yi Wang, Justin Oo, Elliott Delaye, Aaron Ng, Paolo D'Alberto, Sean Settle, Bill Teng, Manasa Bollavaram, Chaithanya Dudha, Hanh Hoang, Swati Gupta, Alina Huang, Ephrem Wu, Samuel Bayliss, Phil James-Roxby, Ralph Wittig, Ashish Sirasao, Ravi Sunkavalli
21st August 2018
© Copyright 2018 Xilinx
Xilinx Inference Engine – DNN Processor + Compiler
>> 2
[Diagram: xDNN processor block diagram - execution controller, instruction buffer, image queue, weights DMA controller, spill/restore DMA controller, a systolic array of multiply-accumulate cells with distributed weight buffers, bias/ReLU units, pooling/element-wise (EWA) units, and a cross bar.]
Flow: Trained Model -> Compiler + Runtime -> Xilinx DNN Processor
˃ 60-80% Efficiency
˃ Low Latency, High Throughput – Batch 1
˃ No FPGA Expertise Required
© Copyright 2018 Xilinx
Virtex® UltraScale+™ VU9P FPGA
>> 3
˃ 16nm TSMC FF+ FPGA
˃ 2.5M System Logic Cells
˃ 6840 DSP Blocks (18x27 MACs)
˃ 382 Mbit On-Die SRAM
˃ 4 DDR4-2400 x72 Channels
˃ VU9P Virtex UltraScale+ FPGA
˃ 21 TOPS (INT8)
˃ 382 Mbit on-chip SRAM
˃ 64 GByte on-board DRAM
˃ 75 W
Virtex® UltraScale+™ FPGA VCU1525 Developer Board and Reference Design
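As a cross-check that is not on the slide: the 21 TOPS INT8 figure is roughly reproduced by assuming two INT8 multiply-accumulates packed into each DSP slice and a clock near the device Fmax. The packing factor and clock rate below are assumptions for illustration, not numbers from this deck.

# Back-of-envelope INT8 peak for a VU9P-class device (assumed packing and clock).
dsp_slices = 6840        # DSP blocks on VU9P (from this slide)
int8_macs_per_dsp = 2    # assumption: two INT8 MACs packed per DSP per cycle
ops_per_mac = 2          # multiply + add
clock_ghz = 0.775        # assumption: DSP clock near the device Fmax

peak_tops = dsp_slices * int8_macs_per_dsp * ops_per_mac * clock_ghz / 1e3
print(f"~{peak_tops:.1f} TOPS INT8 peak")   # ~21 TOPS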
© Copyright 2018 Xilinx
Motivations for Deep Learning on FPGA
>> 4
Data Parallel
• 2D array of MACs
• Flexible on-chip memory access
• High bandwidth, multiple access ports

Data Reuse
• Near-memory compute
• Programmable routing for data & filter reuse

Compression & Sparsity
• Flexible data types: FP32/16, INT16/8/4/2, Binary/Ternary
• Sparsity-friendly compute
© Copyright 2018 Xilinx
Virtex® UltraScale+™ Full Spectrum of Memory
>> 5
5 Tiers of Memory -> Build a custom memory hierarchy
500 Mb of On-chip Memory and Tb/s of On-chip Memory Bandwidth
UltraRAM and High Bandwidth Memory fill the gap in the memory hierarchy between on-chip Block RAM and external DDR4
Memory tiers, with capacity and bandwidth on VU9P:
• Distributed RAM (10s of megabits): 36 Mb, 675 Tb/s
• Block RAM (10s of megabits): 77 Mb, 216 Tb/s
• UltraRAM (100s of megabits): 270 Mb, 80 Tb/s
• External DDR4-2666 (multi-gigabyte): 64 GB, 85 GB/s
• High Bandwidth Memory (multi-gigabyte): 8 GB, 460 GB/s on VU37P
© Copyright 2018 Xilinx
Xilinx DNN Processor (xDNN)
>> 6
˃ Configurable Overlay Processor
˃ DNN-Specific Instruction Set: Convolution, Max Pool, etc.
˃ Any Network, Any Image Size
˃ High Frequency & High Compute Efficiency
˃ Compile and run new networks
[Diagram: xDNN block diagram - execution controller, spill/restore DMA controller, weights DMA controller, systolic array, bias/ReLU units, pooling units, image queue, instruction buffer, cross bar, pooling/EWA.]
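The deck names a DNN-specific instruction set (convolution, max pool, etc.) but does not show its encoding. The sketch below is purely hypothetical - field names and values are invented for illustration - and only models the kind of per-op descriptor a compiler might emit for such an overlay.

# Hypothetical illustration only: not the actual xDNN instruction format.
from dataclasses import dataclass

@dataclass
class XdnnOp:
    opcode: str        # e.g. "conv", "maxpool", "avgpool", "eltwise_add", "upsample"
    src_addr: int      # tensor-memory address of the input feature map
    dst_addr: int      # tensor-memory address of the output feature map
    kernel: tuple      # (N, M) kernel size
    stride: int        # 1, 2, 4 or 8 (per the key-functions slide)
    fuse_relu: bool    # whether ReLU is fused into the op's output stage

program = [
    XdnnOp("conv",    src_addr=0x0000, dst_addr=0x4000, kernel=(3, 3), stride=1, fuse_relu=True),
    XdnnOp("maxpool", src_addr=0x4000, dst_addr=0x8000, kernel=(2, 2), stride=2, fuse_relu=False),
]
for op in program:     # an execution controller would step through these in order
    print(op)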
© Copyright 2018 Xilinx
xDNN – Channel Parallel Systolic Array
>> 7
2D Channel-Parallel Datapath & Distributed Weight Buffers
[Diagram: a 2D array of multiply-accumulate cells, each holding a local weight buffer; partial sums (∑) accumulate down each column.]
Micro-Architecture Optimized for the Underlying UltraScale+ FPGA Fabric
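A minimal functional model (Python, not RTL) of the channel-parallel idea: rows correspond to input channels, columns to output channels, and each cell keeps its own weight in a distributed buffer. Array size and weight values below are toy assumptions.

# Weight-stationary MAC array: rows = input channels, columns = output channels.
ROWS, COLS = 4, 4                      # toy size; the VU9P build uses 96x16 PEs

# Distributed weight buffers: weights[r][c] lives inside cell (r, c).
weights = [[(r + 1) * (c + 1) for c in range(COLS)] for r in range(ROWS)]

def array_step(in_pixel):
    """One cycle: each input-channel value is broadcast along its row, and
    partial sums accumulate down each column (one sum per output channel)."""
    out = []
    for c in range(COLS):              # one output channel per column
        acc = 0
        for r in range(ROWS):          # reduce over input channels
            acc += in_pixel[r] * weights[r][c]
        out.append(acc)
    return out

print(array_step([1, 2, 3, 4]))        # four output-channel partial sums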
© Copyright 2018 Xilinx
xDNN Processor – Tensor Memory
>> 8
[Diagram: tensor memory with channel-parallel, concurrent write and read ports - writes from the systolic array, the parallel processing units, and the DDR DMA; reads to the systolic array, the parallel processing units, and the DDR DMA.]
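A small sketch of one way such concurrent access can be modeled: treat the tensor memory as independent per-channel-group banks, so different agents (systolic array, pooling/element-wise units, DDR DMA) can touch different banks in the same cycle. The bank count and sizes are assumptions for illustration only.

# Assumed organization: independent banks allow concurrent accesses per cycle.
class TensorMemory:
    def __init__(self, banks=4, bank_words=1024):
        self.banks = [[0] * bank_words for _ in range(banks)]

    def write(self, bank, addr, value):   # e.g. driven by the systolic array
        self.banks[bank][addr] = value

    def read(self, bank, addr):           # e.g. read by pooling or the DDR DMA
        return self.banks[bank][addr]

tm = TensorMemory()
tm.write(bank=0, addr=0, value=42)        # systolic-array result, channel group 0
tm.write(bank=1, addr=0, value=7)         # DMA fill into channel group 1, same cycle
print(tm.read(bank=0, addr=0), tm.read(bank=1, addr=0))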
© Copyright 2018 Xilinx
Efficient Memory Utilization
>> 9
[Diagram: tensor memory contents at successive time steps Tn..Tn+5 while executing an Inception-style module (previous layer output, 1x1 conv, 3x3 conv reduce, 3x3 conv, 5x5 conv reduce, 5x5 conv, concatenated output). Buffers holding the previous layer output and intermediate branch results are reused as the branches complete, and the concatenated output becomes the input for the next layer.]
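A sketch of the reuse idea with assumed channel counts: if each branch of the module writes its output at a fixed offset inside one concatenation buffer, the concatenated output exists in place and no separate copy step is needed.

# Place each branch output at an offset inside a single concat buffer.
branch_channels = {"1x1 conv": 64, "3x3 conv": 128, "5x5 conv": 32, "pool proj": 32}

offset = 0
placement = {}
for name, ch in branch_channels.items():
    placement[name] = (offset, offset + ch)   # channel slice inside the concat buffer
    offset += ch

print(placement)                 # each branch writes straight into its own slice
print("concat channels:", offset)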
© Copyright 2018 Xilinx
xDNN Compiler + Runtime
>> 10
https://github.com/Xilinx/ml-suite
[Diagram: toolchain flow. Deep learning frameworks (e.g. MxNet) feed a frontend that converts the framework tensor graph to a Xilinx tensor graph and performs tensor graph optimization. The model weights and a calibration set drive the quantizer; the compiler produces FPGA layers and CPU layers, which the runtime executes on an input image.]
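A simplified, hypothetical sketch of the flow in the diagram (frontend, quantizer, compiler, runtime). The function names and stub behavior below are illustrative only and are not the ml-suite API; see the GitHub repository above for the real interfaces.

def frontend(framework_graph):
    # Translate a framework tensor graph (e.g. from MxNet) into a Xilinx tensor graph.
    return {"ops": framework_graph["ops"], "format": "xilinx_tensor_graph"}

def quantize(xgraph, weights, calibration_set):
    # Derive fixed-point scale factors per layer from a small calibration set.
    return {op: 1.0 for op in xgraph["ops"]}

def compile_graph(xgraph):
    # Optimize the graph (fusion, tiling) and split it into FPGA and CPU layers.
    fpga_layers = [op for op in xgraph["ops"] if op != "softmax"]
    cpu_layers = [op for op in xgraph["ops"] if op == "softmax"]
    return fpga_layers, cpu_layers

xgraph = frontend({"ops": ["conv1", "pool1", "conv2", "softmax"]})
scales = quantize(xgraph, weights=None, calibration_set=None)
fpga_layers, cpu_layers = compile_graph(xgraph)
print("FPGA layers:", fpga_layers, "| CPU layers:", cpu_layers)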
© Copyright 2018 Xilinx
Graph Partitioning
>> 11
˃ One time loading of sub-graph instructions
˃ Data Flow Execution
[Diagram: pipeline of pre-processing (FPGA or CPU), subgraph 1 (FPGA), parallel subgraphs (FPGA/CPU), and post-processing (CPU).]
1 Core -> Multi-Core -> Multi-Chip
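One simple way to illustrate the partitioning step (the supported-op set and op names below are assumptions): walk the graph once, group consecutive FPGA-supported ops into subgraphs whose instructions can be loaded a single time, and leave unsupported ops as CPU subgraphs.

# Group consecutive FPGA-supported ops into subgraphs; everything else runs on CPU.
FPGA_OPS = {"conv", "maxpool", "avgpool", "relu", "eltwise_add", "upsample"}

def partition(graph):
    subgraphs, current = [], []
    for op, kind in graph:
        if kind in FPGA_OPS:
            current.append(op)
        else:
            if current:
                subgraphs.append(("FPGA", current))
                current = []
            subgraphs.append(("CPU", [op]))
    if current:
        subgraphs.append(("FPGA", current))
    return subgraphs

net = [("conv1", "conv"), ("relu1", "relu"), ("pool1", "maxpool"),
       ("nms", "custom_postproc"), ("conv2", "conv")]
for target, ops in partition(net):
    print(target, ops)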
© Copyright 2018 Xilinx
Fusing of Operations
>> 12
˃ Fuse operations as pipe-lined stages
˃ Access activations only once
[Diagram: Conv, Bias, Batch Norm., ReLU and Max Pool fused into a single pipeline of operators (in-line, no RAM access between stages).]
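A toy numeric illustration of why fusing helps (only the element-wise part - bias, batch norm, ReLU - is modeled, and the values are arbitrary): the unfused version makes one full pass over the activations per operator, while the fused pipeline reads and writes each activation exactly once.

# Same math, different access pattern: three passes vs. one fused pass.
acts = [0.5, -1.0, 2.0, -0.25]
bias, bn_scale, bn_shift = 0.1, 2.0, -0.3

# Unfused: one full read/write pass over the activations per operator.
tmp = [a + bias for a in acts]                     # bias pass
tmp = [bn_scale * a + bn_shift for a in tmp]       # batch-norm pass
unfused = [max(a, 0.0) for a in tmp]               # ReLU pass -> 3 passes total

# Fused: bias, batch norm and ReLU applied back-to-back as pipelined stages,
# so each activation is touched once.
fused = [max(bn_scale * (a + bias) + bn_shift, 0.0) for a in acts]

assert fused == unfused
print(fused)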
© Copyright 2018 Xilinx
Instruction Level Parallelism
>> 13
[Diagram: an Inception-style module (1x1 conv, 3x3 conv reduce + 3x3 conv, 5x5 conv reduce + 5x5 conv, 3x3 max pool + 1x1 conv, concatenated output) scheduled on a time axis; independent branches execute in parallel rather than back to back.]
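A small list-scheduling sketch of the same idea (the module structure is a simplified GoogLeNet-style block and the op names are illustrative): at each step, every op whose inputs are ready can issue, so independent branches overlap instead of running sequentially.

# Dependency-driven issue: ops in the same step are independent and can overlap.
ops = {   # op -> set of ops it depends on
    "1x1_conv":      {"prev_out"},
    "3x3_reduce":    {"prev_out"},
    "5x5_reduce":    {"prev_out"},
    "3x3_maxpool":   {"prev_out"},
    "3x3_conv":      {"3x3_reduce"},
    "5x5_conv":      {"5x5_reduce"},
    "pool_proj_1x1": {"3x3_maxpool"},
    "concat":        {"1x1_conv", "3x3_conv", "5x5_conv", "pool_proj_1x1"},
}

done, step = {"prev_out"}, 0
while len(done) < len(ops) + 1:
    ready = [op for op, deps in ops.items() if op not in done and deps <= done]
    print(f"step {step}: {ready}")     # these ops can execute in parallel
    done |= set(ready)
    step += 1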
© Copyright 2018 Xilinx
Automatic Intra Layer Tiling
>> 14
˃ Tile when Feature Map size exceeds on-chip memory
˃ Work on full feature map depth
[Diagram: a large feature map split into four tiles (0, 1, 2, 3); any feature map size is supported.]
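A rough sketch of the tiling decision (the memory size, feature-map dimensions and halving strategy are assumptions, and halo/overlap handling is ignored): split the feature map spatially until one tile at full channel depth fits in the on-chip tensor memory.

# Tile spatially, keep the full channel depth per tile.
def plan_tiles(h, w, c, bytes_per_elem, tensor_mem_bytes):
    fmap_bytes = h * w * c * bytes_per_elem
    if fmap_bytes <= tensor_mem_bytes:
        return [(h, w)]                       # whole feature map fits on chip
    tiles_h = tiles_w = 1
    # Halve the spatial dims until one tile (at full depth) fits on chip.
    while (h // tiles_h) * (w // tiles_w) * c * bytes_per_elem > tensor_mem_bytes:
        if tiles_h <= tiles_w:
            tiles_h *= 2
        else:
            tiles_w *= 2
    return [(h // tiles_h, w // tiles_w)] * (tiles_h * tiles_w)

# Example: a 224x224x512 INT8 feature map against an assumed 9 MB tensor memory
# splits into four 112x112 tiles, as in the four-tile figure above.
print(plan_tiles(224, 224, 512, 1, 9 * 1024 * 1024))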
© Copyright 2018 Xilinx
xDNN Key Functions
>> 15
Features and details:
• Convolution: NxM (N, M = 1-15), stride 1, 2, 4, 8
• Max Pool: NxM (N, M = 1-15), stride 1, 2, 4, 8
• Avg Pool: NxM (N, M = 1-15), stride 1, 2, 4, 8
• Dilated Convolution: factor 1, 2, 4
• De-Convolution: NxM (N, M = 1-15), stride 1, 2, 4, 8
• Up-sampling: factor 2, 4, 8, 16
• Activation: ReLU, pReLU
• Elementwise Addition: any - square, rectangular
• Precision: Int8, Int16
• Networks: classification (e.g. ResNet), object detection (e.g. YOLO v2), segmentation (e.g. MaskRCNN)
© Copyright 2018 Xilinx
xDNN Implementation on VU9P
>> 16
˃ 3 Large 96x16 PEs – 1 in each SLR
˃ Systolic Array at 800 MHz; Rest of logic at 400 MHz
Resource usage (VU9P utilization):
• LUTs: 612k (52%)
• DSPs: 5493 (80%)
• BRAM: 228 (38%)
• URAM: 864 (92%)
Systolic array at 800 MHz - 90% of the device Fmax
© Copyright 2018 Xilinx
xDNN Compute Efficiency
>> 17
Full benchmark data at Xilinx Developer Forum - Oct 1, 2018
[Chart: 60-80% compute efficiency across networks; xDNN at 74% compared with another architecture.]
© Copyright 2018 Xilinx
Custom Deep Learning Pipeline
>> 18
[Diagram: a custom pipeline in which video decode + processing feeds multiple xDNN instances, followed by video processing + encode.]
Example applications: Video + ML, Genomics + ML, Risk Modelling + ML, Database + ML, Network IPS + ML, Storage + ML
Integrate custom applications with xDNN to lower end-to-end latency.
© Copyright 2018 Xilinx
Xilinx DNN Processor - Summary
>> 19
xDNN performance: 60-80% compute efficiency, 90% of device frequency
No FPGA expertise needed
Compile & run trained networks
Deploy on AWS F1 and other cloud platforms
Deploy in PCIe cards
© Copyright 2018 Xilinx
Additional Information
>> 20
Connect Learn Share
Silicon Valley: October 1-2
Beijing: October 6
Frankfurt: December 10
XDF connects software developers and system designers to the deep expertise of Xilinx engineers, partners, and industry leaders.
Learn More www.xilinx.com/xdf