AI Chip Trends and Forecast - recommnine.co.kr
TRANSCRIPT
![Page 1: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/1.jpg)
AI Chip Trends and Forecast
Joo-Young Kim
2019. 11. 06
ICT Industry Outlook Conference
![Page 2: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/2.jpg)
Outline
• Introduction
- Brief history & deep neural network models
- AI stack and new computing paradigm
• Trends in AI chips
- ??
• Looking forward
- ???
![Page 3: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/3.jpg)
Motivation
Artificial Intelligence is pervasive in our everyday life.
![Page 4: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/4.jpg)
Brief History of Neural Networks
• 1943, S. McCulloch – W. Pitts: adjustable but not learnable weights
• 1957, F. Rosenblatt; 1960, B. Widrow – M. Hoff: learnable weights and threshold
• 1969, M. Minsky – S. Papert: XOR problem
• 1986, D. Rumelhart – G. Hinton – R. Williams: nonlinear problem solved; high computation; local optima and overfitting
• 2006, G. Hinton – R. Salakhutdinov: hierarchical feature learning (Deep Learning!)
![Page 5: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/5.jpg)
Deep Learning ≠ AI
• AI: any technique that enables computers to mimic human behavior (searching, planning, knowledge representation, fuzzy logic, natural language processing, genetic algorithm)
• Machine learning: AI techniques that have computers learn without being explicitly programmed
• Deep learning: a subset of ML that makes the computation of multi-layer neural networks feasible
![Page 6: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/6.jpg)
Deep Learning Revolution
Human: ~5%
ImageNet (ILSVRC) Top-5 Error
* F. Veen, The Asimov Institute, 2016
Deep learning starts to surpass human-level recognition on specific tasks
*
![Page 7: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/7.jpg)
What Has Changed?
• Traditional pattern recognition: hand-crafted features (HoG, SIFT, Haar-like) → simple trainable classifiers (SVM, K-Means) → "Dog", "Ship", "Car"
• Deep learning (model + data): trainable features & classifiers (CNN, DNN) → "Dog", "Ship", "Car"
[Figure: performance vs. amount of data for traditional algorithms and deep learning. Andrew Ng, Stanford CS 229 class]
![Page 8: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/8.jpg)
Popular Types of DNNs
| | MLP (Multi-Layer Perceptron) | CNN (Convolutional) | RNN (Recurrent) |
|---|---|---|---|
| Characteristic | Fully connected | Convolutional layer | Sequential data, feedback path |
| Major application | Speech recognition | Image recognition | Speech / action recognition |
| Number of layers | 3~10 layers | Max ~100 layers | 3~5 layers |

[Figure: network diagrams. MLP: input → hidden → output. CNN: input → convolution → pooling → fully connected → output. RNN: input → output with a feedback path.]
![Page 9: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/9.jpg)
And Many More Models…
[Figure: timeline of models across 1970s, 1980s, 1990s, 2012~2014, 2015, 2016, 2017~]
Models shown, roughly from oldest to newest: MLP, Cognitron/CNN, LSTM, LeNet, AlexNet, VGGNet, GoogLeNet, R-CNN, GRU, SegNet, ResNet, Fast R-CNN, Faster R-CNN, YOLO, CNN+RNN, FCN, DenseNet, DeepLab, ENet, YOLO v2, PointNet, WaveNet, WGAN, CycleGAN, StarGAN, DiscoGAN, DeepLab v3+, VoxelNet, PointNet++, attention-only network, Tacotron, YOLO v3, BERT
![Page 10: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/10.jpg)
DNN Characteristics
• Requires big data & big computation
• Modern hardware enabled deep learning revolution (e.g. GPU)
| Face recognition approach | # Operations | # Mem. access |
|---|---|---|
| Local-feature-based | ~0.1 billion/face | ~10MB/face |
| Deep-learning-based | ~2 billion/face | ~1GB/face |
![Page 11: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/11.jpg)
AI Stack
Application
• Video/Image: face recognition, image generation, video analysis, …
• Sound and Speech: speech recognition, language synthesis, music generation, …
• NLP: text analysis, language translation, human-machine communication, …
• Robotics: autopilot, UAV, industrial automation, …
Algorithm
• Neural network topology: MLP, CNN, RNN, LSTM, SNN, …
• Deep neural networks: AlexNet, ResNet, GoogLeNet, …
• Neural network algorithms: reinforcement learning, adversarial learning, …
• Machine learning algorithms: SVM, K-NN, decision tree, Markov chain, …
Chip
• Neuromorphic chip: brain-inspired computing, biological brain simulation, …
• Programmable chip: GPU, ASIC, FPGA, DSP, …
• System-on-Chip: multi-core, many-core, SIMD, systolic array, …
• Development tool-chain: frameworks, compiler, simulator, optimizer, …
Device
• High bandwidth off-chip memory: HBM, DRAM, GDDR, STT-MRAM, …
• High speed interface: SerDes, optical communication
• CMOS 3D stacking
• Emerging computing device: analog computing, memristors, …
• Emerging memory device: ReRAM, PCRAM, …
![Page 12: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/12.jpg)
New Computational Paradigm
• Being able to handle big data
- Huge storage capacity, high bandwidth, low-latency memory access
- “Memory wall” problem
• Large amount of computation
- Mainly linear algebraic operations, while control is relatively simple
- Large number of parameters
• Training vs. inference
- Training: accuracy, data capacity (~10^18 bytes), weight synchronization
- Inference: speed, energy, hardware cost, efficient reading of weights
• Data precision / model compression / pruning
- High precision is not always required
• High configurability
- Tradeoff between energy efficiency and adaptability to new algorithms
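To make the data precision and pruning point concrete, here is a minimal sketch in plain Python. It is not taken from the talk: the function names, the symmetric 8-bit scheme, the magnitude-based pruning rule, and the sample weights are all assumptions chosen for illustration.

```python
def prune_by_magnitude(weights, keep_ratio):
    """Zero out all but the largest-magnitude weights (ties may keep a few more)."""
    keep = max(1, int(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]


# Hypothetical layer weights, for illustration only.
weights = [0.82, -0.03, 0.40, -0.65, 0.01, 0.07, -0.91, 0.22]

pruned = prune_by_magnitude(weights, keep_ratio=0.5)
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))

print("pruned:   ", pruned)
print("quantized:", quantized)
print("max reconstruction error:", max_error)
```

In this sketch half of the weights become zero and the rest fit in one byte each, with a reconstruction error bounded by half the quantization step. Whether that loss is acceptable depends on the model and task.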
![Page 13: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/13.jpg)
AI Chip Landscape
https://basicmi.github.io/AI-Chip/
![Page 14: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/14.jpg)
DNN Hardware
• Mobile based
- Specific AI
- Real-time
- Limited resources
- Low-power
• Cloud based
- General AI
- High computing
- Huge memory
- Fast & accurate learning
[Figure: cloud server and mobile / edge terminal placed on axes of real-time operation (low to high) and global data sharing (low to high); the two sides exchange "data & learned model" and "control & control model"]
![Page 15: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/15.jpg)
Cloud based AI Computing
[Figure: Cloud / server: training data (dataset) → learning → pre-trained network → inference on cloud / server. Device / edge: a voice assistant sends a question to the cloud and receives the answer.]
![Page 16: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/16.jpg)
DNN Chips for Cloud Server
• Nvidia (GPU)
• Google (TPU)
• Microsoft (BrainWave)
• Amazon (Inferentia)
• Alibaba, Baidu
[Figure: cloud server placed on axes of real-time operation (low to high) and global data sharing (low to high)]
Cloud Server
• Control based on overall conditions
• Learning with data collected from edge devices
Stand-Alone AI
NVIDIA Volta Google Cloud TPU
![Page 17: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/17.jpg)
Mobile/Edge based AI Inference
Self-driving vehicle, intelligent camera/speaker, IoT devices
[Figure: Cloud / server: training data (dataset) → learning → pretrained network → inference on cloud / server. Device / edge: the pretrained model is loaded onto the device, which runs inference on local data from sensors (camera, MIC, GPS, gyro, touch) and serves the user interface and apps platform.]
![Page 18: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/18.jpg)
Mobile/Edge DNN Applications
• Apple
• Huawei
• Qualcomm
• ARM
• CEVA
• Cambricon
• Horizon Robotics
• Mobileye
• Tesla
[Figure: applications plotted by power consumption (low to high) vs. inference speed (slow to fast): IoT, wearable, smartphone, drone, automotive, mobile robot]
![Page 19: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/19.jpg)
Cloud vs Edge Summary
| | Cloud / Datacenter | Edge / Mobile |
|---|---|---|
| Training | High performance; high precision; high flexibility; distributed; scalable | ? |
| Inference | High throughput; low latency; power efficiency; distributed; scalable | Diverse requirements (car, wearable, IoT); low-moderate throughput; low latency; power efficiency; low cost |
![Page 20: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/20.jpg)
Functional Integration
| | Early | 1st Stage | 2nd Stage |
|---|---|---|---|
| Examples | Intel CPU, NVIDIA GPU, Xilinx FPGA | MIT Eyeriss, KAIST LNPU, Google TPU, Microsoft BrainWave, … | Wave DPU, Tsinghua Thinker, … |
| Hardware | Classic | Domain specific | Reconfigurable |
| Domain | Cloud | Cloud/Edge | Cloud/Edge |
| Target workload | Training oriented | Inference | Inference & Training |

Next stage: ?
Courtesy of GTIC 2019
![Page 21: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/21.jpg)
Two Different Directions
• Be more flexible (2014 to 2018): dedicated (DianNao, 2014) → RS dataflow (MIT Eyeriss) → systolic array (Google TPU) → sparse-aware (Nvidia SCNN) → flexible bitwidth (KAIST UNPU) → …
• Be more compact (2016.2 to 2018.9): compression / pruning (EIE, 2016.2) → BWN → TWN → low-bit training (DoReFa-Net) → low-bit quantization (LQ-Nets) → …
Courtesy of GTIC 2019
![Page 22: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/22.jpg)
Von Neumann Bottleneck for AI
• The Von Neumann architecture fetches data serially from storage
• AI applications need to access a tremendous amount of data
[Figure: AI processor and memory connected by a bus; the bus is the bottleneck (memory wall)]
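The bottleneck can be put into numbers with a roofline-style estimate: attainable throughput is the smaller of the chip's peak compute and memory bandwidth times arithmetic intensity (operations per byte moved). The sketch below is an illustration, not part of the talk; the peak and bandwidth figures are hypothetical, and the matrix-vector workload is an assumed example.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: limited either by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def matvec_intensity(rows, cols, bytes_per_value=4):
    """Arithmetic intensity of y = W x when every value is moved exactly once."""
    flops = 2 * rows * cols  # one multiply and one add per weight
    bytes_moved = bytes_per_value * (rows * cols + cols + rows)
    return flops / bytes_moved


# Hypothetical accelerator, for illustration only.
PEAK_GFLOPS = 10_000.0
BANDWIDTH_GB_S = 500.0

intensity = matvec_intensity(4096, 4096)
bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
utilization = bound / PEAK_GFLOPS

print(f"intensity: {intensity:.3f} FLOP/byte")
print(f"attainable: {bound:.1f} of {PEAK_GFLOPS:.0f} GFLOPS ({utilization:.1%})")
```

With roughly half an operation per byte, this assumed workload is bound by memory bandwidth and uses only a few percent of the compute peak.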
![Page 23: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/23.jpg)
Increasing Memory Bandwidth
[Figure: memory hierarchy NVM → DRAM → SRAM (cache) → processor, showing the Von Neumann bottleneck between memory and processor]
How can we increase bandwidth between processor and memory?
![Page 24: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/24.jpg)
Near Memory Processing
[Figure: processor and DRAM chips placed side by side on a PCB vs. 3D-stacked memory (High Bandwidth Memory)]
![Page 25: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/25.jpg)
Advantage of HBM
| Item | GDDR5 | HBM (High Bandwidth Memory) |
|---|---|---|
| DRAM configuration | 8Gb GDDR5 × 12 | 4GB HBM × 4 |
| Size | 3120 mm² (60mm × 52mm) | 792 mm² (33mm × 24mm) |
| Density | 12GB | 16GB |
| Bandwidth | 384GB/s | 1024GB/s |
| Power | 18.3W (1.5W × 12 GDDR5) | 9.1W (2.3W × 4 HBM) |
| Pin (ball) speed | 8 Gbps | 2 Gbps |
| # I/O | 32 per chip (total 384) | 1024 per cube (total 4096) |

Comparison annotations on the slide: -75%, 1.3x, 3.6x, +18%

Projected 2016 GFX specification:
• HBM 4~6 cubes
• 4~8GB, 512GB/s~1TB/s
• 10 TFLOPS

[Figure: a processor surrounded by 12 GDDR5 chips (60mm × 52mm) vs. a processor with 4 HBM cubes (33mm × 24mm)]
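The bandwidth figures in the comparison follow from the I/O counts and pin speeds: total pins times per-pin rate, divided by 8 bits per byte. A short check, using only numbers from the slide (the function and variable names are illustrative):

```python
def aggregate_bandwidth_gb_s(total_io_pins, pin_speed_gbps):
    """Total bandwidth in GB/s from pin count and per-pin rate in Gbit/s."""
    return total_io_pins * pin_speed_gbps / 8


# GDDR5: 12 chips x 32 I/O at 8 Gbps per pin.
gddr5_bandwidth = aggregate_bandwidth_gb_s(12 * 32, 8)
# HBM: 4 cubes x 1024 I/O at 2 Gbps per pin.
hbm_bandwidth = aggregate_bandwidth_gb_s(4 * 1024, 2)

# Board area from the dimensions on the slide, in mm^2.
gddr5_area = 60 * 52
hbm_area = 33 * 24
area_change = hbm_area / gddr5_area - 1

print(gddr5_bandwidth, hbm_bandwidth)        # GB/s
print(gddr5_area, hbm_area, f"{area_change:.0%}")
```

HBM gets its bandwidth from a much wider interface running at a lower per-pin speed, which is also what makes the smaller footprint possible.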
![Page 26: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/26.jpg)
Emerging Non-Volatile Memory
White Paper on AI Chip Technologies (2018)
DRAM-like speed, Flash-like capacity and Non-Volatile
![Page 27: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/27.jpg)
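Moving Computation into Memory
[Figure: three organizations of the NVM / DRAM / SRAM (cache) / processor hierarchy]
• Traditional: NVM → DRAM → SRAM (cache) → processor, limited by the Von Neumann bottleneck
• Near-memory / emerging memory: memory placed closer to the processor
• In-memory / memory-centric: many small processors (P) distributed inside the memory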
![Page 28: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/28.jpg)
Processing-In-Memory (PIM)
[Figure: Von Neumann organization (AI processor and memory connected by a bus, which is the bottleneck) vs. PIM organization (an array of tiles, each pairing memory with logic)]
✓ Non-Von Neumann
✓ Converged logic + memory (high BW)
✓ Suitable for data-intensive workloads
✓ Little data movement (energy efficient)
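The data-movement argument can be illustrated by counting the values that cross the bus for one matrix-vector product. This is a simplified sketch, not a model of any real PIM chip: `MemoryBank` is a hypothetical class, and the counts assume the input vector is broadcast once and that weights never leave their bank.

```python
def matvec(weights, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class MemoryBank:
    """Hypothetical bank that stores some weight rows next to local MAC logic."""

    def __init__(self, rows):
        self.rows = rows

    def compute(self, x):
        # Weights are read locally; only x comes in and results go out.
        return matvec(self.rows, x)


def pim_matvec(banks, x):
    """Broadcast x to every bank and concatenate the partial results."""
    out = []
    for bank in banks:
        out.extend(bank.compute(x))
    return out


def values_moved_conventional(n_rows, n_cols):
    # All weights and the input travel to the processor; outputs travel back.
    return n_rows * n_cols + n_cols + n_rows


def values_moved_pim(n_rows, n_cols):
    # Only the input (broadcast once) and the outputs cross the bus.
    return n_cols + n_rows


weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
x = [1, 0, 2]
banks = [MemoryBank(weights[:2]), MemoryBank(weights[2:])]

reference = matvec(weights, x)
pim_result = pim_matvec(banks, x)
print(reference, pim_result)
print(values_moved_conventional(4, 3), "vs", values_moved_pim(4, 3))
```

The gap grows with matrix size, since the conventional count scales with rows × cols while the in-memory count scales with rows + cols.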
![Page 29: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/29.jpg)
PIM Chip
Renesas’s ternary SRAM PIM for AI inference
S. Okumura, et al., “A Ternary Based Bit Scalable, 8.80 TOPS/W CNN accelerator with Many-core Processing-in-memory Architecture with 896K synapses/mm2”, Symposium on VLSI Technology 2019
![Page 30: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/30.jpg)
AI Framework
Provides higher-level abstraction to developers/users
Convolution on volumes (1 line)
Max pooling (1 line)
Non-linear ReLU (1 line)
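To show what those one-line framework calls hide, here is a plain-Python sketch of the three operations. It is for illustration only and does not reproduce any particular framework's API; the function names, the stride-1 unpadded convolution, and the sample data are assumptions.

```python
def conv2d_volume(volume, kernel):
    """Valid, stride-1 convolution of a [C][H][W] volume with one [C][k][k] filter."""
    channels = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    k = len(kernel[0])
    out = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            acc = 0.0
            for c in range(channels):
                for di in range(k):
                    for dj in range(k):
                        acc += volume[c][i + di][j + dj] * kernel[c][di][dj]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping size x size max pooling."""
    return [
        [
            max(feature_map[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Two-channel 6x6 input whose value at (i, j) is i - j, and an all-ones 3x3 filter.
volume = [[[float(i - j) for j in range(6)] for i in range(6)] for _ in range(2)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]

conv_out = conv2d_volume(volume, kernel)   # 4x4 feature map
activated = relu(conv_out)
pooled = max_pool(activated)               # 2x2 feature map
print(pooled)
```

Seven nested loops collapse into three calls in a framework, which is the abstraction the slide refers to.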
![Page 31: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/31.jpg)
Hyper-Scale AI Accelerators
• TPU v3 (2018)
• Cerebras Wafer Scale Engine (2019): 1.2T transistors, 46,225 mm², 400,000 cores, 18GB SRAM, 100 Pb/s interconnect
Usually hundreds of processing units in an array structure.
How do we program this?
![Page 32: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/32.jpg)
Who Fills this Gap?
Cerebras WSE
![Page 33: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/33.jpg)
AI Software Tool-Chain
• Xilinx AI Edge Platform
[Figure: tool-chain stack spanning from SW developers and users down to a few hardware vendors]
![Page 34: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/34.jpg)
Problem: No De Facto SW Tool & Hardware!
| Software | Toolchain | Hardware |
|---|---|---|
| C / Java | Compiler toolchain | CPU |
| OpenGL / CUDA | Compiler toolchain | GPU |
| Verilog / VHDL | Synthesis toolchain | FPGA |
![Page 35: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/35.jpg)
Neuromorphic Chip
1st Generation
• Perceptron based
• No non-linear functions
• Binary output
2nd Generation
• Non-linear activation functions
• Continuous output
• Functional modeling of our brain
• Working real-life applications
• We are here (FF, CNN, RNN, …)
3rd Generation
• “Spiking neuron”
• Closely models a biological neuron’s activity
• Incorporates the concept of time: integrate and fire
• Computationally expensive
• Difficult to train → not practical at the moment
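The integrate-and-fire behavior of a spiking neuron can be sketched in a few lines. This is a minimal leaky integrate-and-fire model under assumed parameters (threshold, leak factor, reset value), not the neuron model of any specific chip.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire: accumulate input over time, spike at threshold."""
    potential = reset
    spikes = []
    for current in input_current:
        # Integrate: the old potential decays, the new input is added.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)       # fire
            potential = reset      # and start over
        else:
            spikes.append(0)
    return spikes


# Constant input of 0.3 per time step for 10 steps.
spikes = simulate_lif([0.3] * 10)
print(spikes)

# A weak input never reaches the threshold because the leak balances it.
weak = simulate_lif([0.05] * 20)
print(sum(weak))
```

Unlike a second-generation neuron, the output depends on when inputs arrive, not only on their sum, which is the "concept of time" in the list above.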
![Page 36: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/36.jpg)
IBM TrueNorth
• 5.4 billion transistors in 28nm CMOS process
• 64 × 64 neurosynaptic cores, 256 neurons each
Paul A. Merolla, et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface", Science, 2014
![Page 37: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/37.jpg)
IBM TrueNorth
• Mimicking synapse with SRAM
• However, SRAM is not made for this (large area, cost).
A synapse is a structure that permits a neuron to pass an electrical signal to another neuron.
[Figure: SRAM synapse array using 8T SRAM cells as synapses. Input spikes from pre-neurons (Tx) are gated by the stored bits and summed (Σ) per column into output spike voltages for post-neurons (Rx). Signal labels: WL, WLT, BL, BLT.]
![Page 38: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/38.jpg)
Neuromorphic Chip with Emerging Device
• New model requires device with new physics
• FeFET: better at storing and transferring analog signals
M. Jerry, et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training", IEEE IEDM 2017
![Page 39: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/39.jpg)
Neuromorphic Chip with Emerging NV RAM
• ReRAM (memristor)
Z. Wang, et al., "Fully memristive neural networks for pattern classification with unsupervised learning", Nature Electronics 2018
![Page 40: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/40.jpg)
1. Cloud and Edge Will be Closer
• Edge inference & learning will become more important due to privacy concerns, real-time operation, and power constraints
• Federated learning: leverages the cloud’s big-data advantage on edge devices
[Figure: federated learning. Cloud servers broadcast a shared model to mobile devices; each device performs local learning to produce custom weights; encrypted and compressed data is aggregated in the cloud into an updated model.]
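The broadcast, local learning, and aggregation loop can be sketched as follows. The slide does not name an aggregation rule; this sketch assumes federated averaging (weighting each device by its sample count) on a one-parameter model y = w·x, and it leaves out the encryption and compression steps. All names and data are hypothetical.

```python
def local_update(shared_weights, local_data, learning_rate=0.1):
    """One pass of gradient descent on a device's private (x, y) samples."""
    w = shared_weights[0]
    for x, y in local_data:
        grad = 2 * (w * x - y) * x      # derivative of the squared error
        w -= learning_rate * grad
    return [w]


def federated_average(client_weights, client_sizes):
    """Average client models, weighted by how many samples each one used."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def run_round(shared_weights, clients):
    """Broadcast the shared model, train locally, then aggregate."""
    updates = [local_update(shared_weights, data) for data in clients]
    sizes = [len(data) for data in clients]
    return federated_average(updates, sizes)


# Three devices with private samples of the same relation y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.0)],
    [(3.0, 6.0), (0.5, 1.0), (1.5, 3.0)],
]

shared = [0.0]
for _ in range(20):
    shared = run_round(shared, clients)
print(shared)
```

Only model weights leave each device in this scheme; the raw samples stay local, which is where the privacy benefit comes from.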
![Page 41: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/41.jpg)
2. AI Chips will Support More Algorithms
• State-of-the-art algorithms are moving from traditional MLP, CNN, RNN to GAN, reinforcement learning, and unsupervised learning
[Figure: progression of AI chip capabilities]
• Inference only (MLP/RNN or CNN)
• Inference only (MLP/CNN/RNN)
• Inference + training (MLP/CNN/RNN)
• Inference + training (GAN/RL/unsupervised/MLP/CNN/RNN)
![Page 42: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/42.jpg)
3. AI Security Will be Essential
• It is easy to break DNN-based recognition
• New cyberattack: imperceptible noise injection
[Figure: breaking state-of-the-art face recognition; physical attack on autonomous vehicles]
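Why small noise is enough can be seen on a linear classifier: shifting every input by a tiny amount in the worst-case direction changes the score by that amount times the sum of the absolute weights, which grows with input dimension. The sketch below uses a sign-of-gradient perturbation in the style of the fast gradient sign method; the talk does not name a specific attack, and the classifier, data, and epsilon here are made up for illustration.

```python
def score(weights, bias, x):
    """Linear decision score."""
    return sum(w * v for w, v in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Class 1 if the score is positive, otherwise class 0."""
    return 1 if score(weights, bias, x) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def adversarial_example(weights, x, label, epsilon):
    """Move each input by at most epsilon in the direction that hurts the true label."""
    direction = -1 if label == 1 else 1
    return [v + direction * epsilon * sign(w) for w, v in zip(weights, x)]


# Hypothetical 100-dimensional classifier and input.
weights = [0.5, -0.5] * 50
bias = 0.0
x = [0.6, 0.5] * 50

original_label = predict(weights, bias, x)
x_adv = adversarial_example(weights, x, original_label, epsilon=0.06)
adversarial_label = predict(weights, bias, x_adv)
largest_change = max(abs(a - b) for a, b in zip(x, x_adv))

print(original_label, adversarial_label, largest_change)
```

No input moves by more than 0.06, yet the prediction flips. Deep networks are not linear, but the same accumulation effect across many input dimensions is one common explanation for their sensitivity.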
![Page 43: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/43.jpg)
4. For Success of AI Chip, SW is the Key
• How did ARM dominate the mobile processor market?
- Low power consumption with reasonable performance
- ARM’s great compiler toolchain & licensing
• Why was the GPU so successful in the early DNN revolution?
- Because of CUDA, a generic programming language for data-intensive workloads like matrix-vector multiplication
- CUDA took several years of refinement before developers actually adopted it
![Page 44: AI Chip Trends and Forecase - recommnine.co.kr](https://reader035.vdocuments.net/reader035/viewer/2022081503/62a1cbf29a61130e7b614183/html5/thumbnails/44.jpg)
AI Chip Research at KAIST
[Figure: timeline of KAIST AI chips, 2008 to 2019, with die photos and block diagrams. Labeled designs include: multi-core OR processor; dual-layered 3-stage pipeline; simultaneous multi-threading; multi-classifier system; multi-core MIMD; heterogeneous many-SIMD; visual attention; multi-modal UI/UX deep learning core; stereo matching processor; face recognition & CNN–RNN; variable-bit DNN & 3D HGR; supervised & reinforcement learning.]

| Item | Value |
|---|---|
| Process | 65nm 1P8M Logic CMOS |
| Area | 4mm × 4mm |
| SRAM | 448 KB |
| Supply | 0.67V – 1.1V |
| Power | 196 mW @ 200MHz, 1.1V; 2.4 mW @ 10MHz, 0.67V |
| Precision | Feature: bfloat16; weight: 16/8/4-bit FXP |
| Peak performance | 204 GFLOPS @ 16b weight |

Hand tracking accuracy (VGA cameras separated by 5cm): 2.6mm@20cm, 4.6mm@30cm, 3.4mm@40cm
[Figure: input image, hand depth, and tracking results plotted on X/Y axes]