ai chip trends and forecase - recommnine.co.kr

AI Chip Trends and Forecast

Joo-Young Kim

2019. 11. 06

ICT Industry Outlook Conference

Outline

• Introduction
- Brief history & deep neural network models
- AI stack and new computing paradigm

• Trends in AI chips- ??

• Looking forward- ???

Motivation

Artificial Intelligence is pervasive in our everyday life.

Brief History of Neural Networks

1943: S. McCulloch – W. Pitts
• Adjustable but not learnable weights

1957: F. Rosenblatt
• Learnable weights and threshold

1960: B. Widrow – M. Hoff

1969: M. Minsky – S. Papert
• XOR problem

1986: D. Rumelhart – G. Hinton – R. Williams
• Nonlinear problem solved
• High computation
• Local optima and overfitting

2006 (Deep Learning!): G. Hinton – S. Ruslan
• Hierarchical feature learning

Deep Learning ≠ AI

• AI: any technique that enables computers to mimic human behavior (searching, planning, knowledge representation, fuzzy logic, natural language processing, genetic algorithm)
• Machine learning: AI techniques that have computers learn without being explicitly programmed
• Deep learning: a subset of ML that makes the computation of multi-layer neural networks feasible

Deep Learning Revolution

Figure: ImageNet (ILSVRC) top-5 error, with human-level error marked at ~5%.

Deep learning starts to surpass human-level recognition on specific tasks

* F. Veen, The Asimov Institute, 2016

What Has Changed?

• Traditional pattern recognition: hand-crafted features (HoG, SIFT, Haar-like) → simple trainable classifiers (SVM, K-Means) → "Dog", "Ship", "Car"

• Deep learning (model + data): trainable features & classifiers (CNN, DNN) → "Dog", "Ship", "Car"

Figure: performance vs. amount of data, traditional algorithms compared with deep learning (Andrew Ng, Stanford CS 229 class).

Popular Types of DNNs

                    MLP (Multi-Layer Perceptron)   CNN (Convolutional)    RNN (Recurrent)
Characteristic      Fully connected                Convolutional layer    Sequential data, feedback path
Major application   Speech recognition             Image recognition      Speech / action recognition
Number of layers    3~10 layers                    Max ~100 layers        3~5 layers

Figure: network diagrams. MLP: input → hidden → output. CNN: input → convolution → pooling → fully connected → output. RNN: input → output with a feedback path.

And Many More Models…

Figure: timeline of DNN models with the periods 1970s, 1980s, 1990s, 2012~2014, 2015, 2016, and 2017~.

Models shown: MLP, Cognitron/CNN, LSTM, LeNet, AlexNet, VGGNet, GoogLeNet, R-CNN, GRU, SegNet, ResNet, Fast R-CNN, Faster R-CNN, YOLO, CNN+RNN, FCN, DeepLab, ENet, YOLO v2, PointNet, WaveNet, DenseNet, WGAN, CycleGAN, StarGAN, DiscoGAN, DeepLab v3+, VoxelNet, PointNet++, Tacotron, YOLO v3, BERT, Attention-only Network

DNN Characteristics

• Requires big data & big computation

• Modern hardware (e.g., GPUs) enabled the deep learning revolution

                  Local-feature-based    Deep-learning-based
# Operations      ~0.1 billion/face      ~2 billion/face
# Mem. access     ~10MB/face             ~1GB/face

AI Stack

Application
• Video/Image: face recognition, image generation, video analysis, …
• Sound and Speech: speech recognition, language synthesis, music generation, …
• NLP: text analysis, language translation, human-machine communication, …
• Robotics: autopilot, UAV, industrial automation, …

Algorithm
• Neural network topology: MLP, CNN, RNN, LSTM, SNN, …
• Deep neural networks: AlexNet, ResNet, GoogLeNet, …
• Neural network algorithms: reinforcement learning, adversarial learning, …
• Machine learning algorithms: SVM, K-NN, decision tree, Markov chain, …

Chip
• Neuromorphic chip: brain-inspired computing, biological brain simulation, …
• Programmable chip: GPU, ASIC, FPGA, DSP, …
• System-on-Chip: multi-core, many-core, SIMD, systolic array, …
• Development tool-chain: frameworks, compiler, simulator, optimizer, …

Device
• High bandwidth off-chip memory: HBM, DRAM, GDDR, STT-MRAM, …
• High speed interface: SerDes, optical communication
• CMOS 3D stacking
• Emerging computing device: analog computing, memristors, …
• Emerging memory device: ReRAM, PCRAM, …

New Computational Paradigm

• Being able to handle big data
- Huge storage capacity, high bandwidth, low latency memory access
- “Memory wall” problem

• Large amount of computation
- Mainly linear algebraic operations, while control is relatively simple
- Parameters are large

• Training vs. inference
- Training: accuracy, data capacity (~10^18 bytes), weight synchronization
- Inference: speed, energy, hardware cost, efficient reading of weights

• Data precision / model compression / pruning
- High precision is not always required

• High configurability
- Tradeoff between energy efficiency and adaptability to new algorithms
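To make the "data precision / model compression / pruning" point concrete, here is a minimal sketch in plain Python. It assumes symmetric 8-bit linear quantization and simple magnitude pruning; both are common textbook choices and are not taken from the slides. The weight values and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


def prune_by_magnitude(weights, keep_ratio):
    """Keep the largest-magnitude weights and zero out the rest."""
    k = max(1, int(len(weights) * keep_ratio))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]


# Hypothetical layer weights, for illustration only.
weights = [0.82, -0.03, 0.41, -1.27, 0.004, 0.66, -0.09, 0.15]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

pruned = prune_by_magnitude(weights, keep_ratio=0.5)
```

Each quantized weight fits in one byte instead of the four bytes of a 32-bit float, and the rounding error per weight is bounded by half the quantization step. Pruning half of the weights leaves zeros that sparse-aware hardware can skip.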

AI Chip Landscape

https://basicmi.github.io/AI-Chip/

DNN Hardware

• Mobile based
- Specific AI
- Real-time
- Limited resources
- Low-power

• Cloud based
- General AI
- High computing
- Huge memory
- Fast & accurate learning

Figure: cloud server, mobile, and edge terminal positioned by real-time operation (low to high) and global data sharing (low to high). The links between them are labeled "Data & Learned Model" and "Control & Control Model".

Cloud based AI Computing

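Figure: on the cloud/server, training data (dataset) feeds learning, which produces a pre-trained network used for inference on the cloud/server. A voice assistant on the device/edge sends a question to the cloud/server and receives an answer.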

DNN Chips for Cloud Server

• Nvidia (GPU)

• Google (TPU)

• Microsoft (BrainWave)

• Amazon (Inferentia)

• Facebook

• Alibaba, Baidu

Figure: cloud server positioned by real-time operation (low to high) and global data sharing (low to high).

• Control based on overall conditions

• Learning with data collected from edge devices

Stand-Alone AI

Examples pictured: NVIDIA Volta, Google Cloud TPU

Mobile/Edge based AI Inference

Self-driving vehicle, intelligent camera/speaker, IoT devices

Figure: on the cloud/server, training data (dataset) feeds learning, which produces a pretrained network for inference on the cloud/server. On the device/edge, the pretrained model is loaded and used for inference on local data from sensors (camera, MIC, GPS, gyro, touch), serving the user interface & apps platform.

Mobile/Edge DNN Applications

• Apple

• Huawei

• Qualcomm

• ARM

• CEVA

• Cambricon

• Horizon Robotics

• Mobileye

• Tesla

Figure: applications plotted by power consumption (low to high) against inference speed (slow to fast): IoT, wearable, smartphone, drone, automotive, mobile robot.

Cloud vs Edge Summary

Training
• Cloud / Datacenter: High Performance, High Precision, High Flexibility, Distributed, Scalable
• Edge / Mobile: ?

Inference
• Cloud / Datacenter: High Throughput, Low Latency, Power Efficiency, Distributed, Scalable
• Edge / Mobile: Diverse Requirements (Car, Wearable, IoT), Low-Moderate Throughput, Low Latency, Power Efficiency, Low Cost

Functional Integration

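Classic hardware (early stage)
• Examples: Intel CPU, nVidia GPU, Xilinx FPGA
• Domain: Cloud
• Target workload: Training oriented

Domain specific hardware (1st stage)
• Examples: MIT Eyeriss, KAIST LNPU, Google TPU, Microsoft BrainWave, …
• Domain: Cloud/Edge
• Target workload: Inference

Reconfigurable hardware (2nd stage)
• Examples: Wave DPU, Tsinghua Thinker, …
• Domain: Cloud/Edge
• Target workload: Inference & Training

Next stage: ?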

Courtesy of GTIC 2019

Two Different Directions

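• Be more flexible
- Milestones, in the order shown: Dedicated (DianNao) → RS Dataflow (MIT Eyeriss) → Systolic Array (Google TPU) → Sparse-aware (Nvidia SCNN) → Flexible Bitwidth (KAIST UNPU) → …
- Dates on the timeline: 2014, 2016, 2017.1, 2017.6, 2018

• Be more compact
- Milestones, in the order shown: Compression, Pruning (EIE) → BWN → TWN → Low-bit Training (DoReFa-Net) → Low-bit Quantization (LQ-Nets) → …
- Dates on the timeline: 2016.2, 2016.8, 2016.11, 2018.2, 2018.9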

Courtesy of GTIC 2019

Von Neumann Bottleneck for AI

• The Von Neumann architecture fetches data serially from storage

• AI applications need to access a tremendous amount of data

Figure: an AI processor and memory connected by a bus, which is the bottleneck. The memory hierarchy (NVM, DRAM, SRAM cache, processor) is drawn twice, labeled "Memory Wall" and "Von Neumann Bottleneck".

Increasing Memory Bandwidth

How can we increase bandwidth between processor and memory?

Near Memory Processing

3D-Stacked Memory

High Bandwidth Memory

Figure: a processor and DRAM chips mounted on a PCB.

Advantage of HBM

ITEM                      GDDR5                         HBM (High B/W Memory)
System configuration
  DRAM                    8Gb GDDR5 × 12                4GB HBM × 4
  Size                    3120 mm²                      792 mm²
  Density                 12GB                          16GB
  Bandwidth               384GB/s                       1024GB/s
  Power                   18.3W (1.5W × 12 GDDR5)       9.1W (2.3W × 4 HBM)
Pin (ball)
  Speed                   8 Gbps                        2 Gbps
  # I/O                   32 per chip (total 384)       1024 per cube (total 4096)

2016 GFX forecast specification
• HBM 4~6 cubes
• 4~8GB, 512GB/s~1TB/s
• 10 TFLOPS

Figure: board footprint comparison. Processor with 12 GDDR5 (G5) chips: 60mm × 52mm. Processor with 4 HBM cubes: 33mm × 24mm. Callouts on the figure: -75%, 1.3x, 3.6x, +18%.
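The bandwidth figures in the GDDR5 vs. HBM comparison follow directly from pin count and per-pin speed. The short sketch below reproduces that arithmetic from the numbers on the slide. The derived efficiency metric (GB/s per watt) and all variable names are additions for illustration, not figures from the slide.

```python
# Figures copied from the GDDR5 vs. HBM comparison on the slide.
configs = {
    "GDDR5": {"io_pins": 384, "gbps_per_pin": 8, "power_w": 18.3},
    "HBM": {"io_pins": 4096, "gbps_per_pin": 2, "power_w": 9.1},
}


def bandwidth_gb_per_s(io_pins, gbps_per_pin):
    """Total bandwidth: pins x bits per second per pin, divided by 8 bits per byte."""
    return io_pins * gbps_per_pin / 8


summary = {}
for name, cfg in configs.items():
    bw = bandwidth_gb_per_s(cfg["io_pins"], cfg["gbps_per_pin"])
    summary[name] = {
        "bandwidth_gb_s": bw,
        # Derived metric (not on the slide): bandwidth per watt of memory power.
        "gb_s_per_watt": bw / cfg["power_w"],
    }

for name, row in summary.items():
    print(name, row["bandwidth_gb_s"], round(row["gb_s_per_watt"], 1))
```

HBM runs each pin four times slower than GDDR5 but has more than ten times as many pins, which is how it reaches higher total bandwidth at roughly half the power.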

Emerging Non-Volatile Memory

White Paper on AI Chip Technologies (2018)

DRAM-like speed, Flash-like capacity and Non-Volatile

Towards In-Memory Processing

Figure: three memory organizations built from NVM, DRAM, SRAM (cache), and processors (P).
• Traditional: a single processor behind the Von Neumann bottleneck
• Near-Memory/Emerging Mem
• In-Memory/Memory-centric: many processors (P) placed within the memory

Processing-In-Memory (PIM)

Figure: Von Neumann organization (AI processor and memory connected by a bus, the bottleneck) compared with PIM (an array of units that each pair memory with logic).

✓ Non-Von Neumann

✓ Converged logic + memory (high BW)

✓ Suitable for data-intensive workloads

✓ Little data movement (energy efficient)

PIM Chip

Renesas’s ternary SRAM PIM for AI inference

S. Okumura, et al., “A Ternary Based Bit Scalable, 8.80 TOPS/W CNN accelerator with Many-core Processing-in-memory Architecture with 896K synapses/mm2”, Symposium on VLSI Technology 2019

AI Framework

Provides higher-level abstraction to developers/users

Convolution on volumes (1 line)

Max pooling (1 line)

Non-linear ReLU (1 line)
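The slide does not name a framework, so the sketch below is plain Python rather than any framework's API. It spells out what each of those one-line calls computes, which is the work the abstraction hides. Function names and the tiny input are hypothetical, and the convolution is the unflipped (cross-correlation) form commonly used in DNNs.

```python
def conv2d_volume(x, w):
    """Valid convolution of a C x H x W volume with one C x K x K filter."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(w[0])
    return [
        [
            sum(
                x[c][i + di][j + dj] * w[c][di][dj]
                for c in range(channels)
                for di in range(k)
                for dj in range(k)
            )
            for j in range(width - k + 1)
        ]
        for i in range(height - k + 1)
    ]


def max_pool2d(fm, size=2):
    """Non-overlapping max pooling over size x size windows."""
    return [
        [
            max(fm[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(fm[0]) - size + 1, size)
        ]
        for i in range(0, len(fm) - size + 1, size)
    ]


def relu(fm):
    """Element-wise max(0, v)."""
    return [[max(0, v) for v in row] for row in fm]


# Hypothetical 2-channel 5x5 input volume and one 2-channel 2x2 filter.
x = [
    [[5 * i + j - 12 for j in range(5)] for i in range(5)],
    [[1] * 5 for _ in range(5)],
]
w = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]

# Same order as the slide: convolution, max pooling, ReLU.
features = relu(max_pool2d(conv2d_volume(x, w)))
print(features)  # [[0, 2], [18, 22]]
```

A framework replaces each function with a single call and also chooses an efficient kernel for the target hardware.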

Hyper-Scale AI Accelerators

TPU v3 (2018)

Cerebras Wafer Scale Engine (2019)
• 1.2T transistors
• 46,225 mm²
• 400,000 cores
• 18GB SRAM
• 100 Pb/s interconnect

Usually hundreds of processing units in an array structure.

How do we program this?
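A little arithmetic on the Cerebras figures shows why programming such a chip is hard. The sketch below derives per-core and per-area numbers from the values on the slide. It assumes decimal gigabytes and an even split of SRAM across cores; neither assumption is stated on the slide.

```python
import math

# Figures copied from the slide.
transistors = 1.2e12
die_area_mm2 = 46_225
cores = 400_000
sram_bytes = 18e9  # assumption: 18GB read as decimal gigabytes

# 46,225 is a perfect square, so the die is 215 mm on a side.
die_side_mm = math.isqrt(die_area_mm2)

# Assumption: SRAM is divided evenly among the cores.
sram_per_core_kb = sram_bytes / cores / 1e3

transistors_per_mm2 = transistors / die_area_mm2

print(die_side_mm, sram_per_core_kb, round(transistors_per_mm2 / 1e6, 1))
```

Under these assumptions each core sees only about 45 KB of local memory, so a model and its data have to be partitioned across hundreds of thousands of cores by the software tool-chain.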

Who Fills this Gap?

Figure: large arrays of processing units (Cerebras WSE).

AI Software Tool-Chain

• Xilinx AI Edge Platform

Figure: SW developers and users on one side, a few hardware vendors on the other.

Software          Toolchain             Hardware
C / Java          Compiler toolchain    CPU
OpenGL / CUDA     Compiler toolchain    GPU
Verilog / VHDL    Synthesis toolchain   FPGA

Problem: No De Facto SW Tool & Hardware!

Neuromorphic Chip

1st Generation
• Perceptron based
• No non-linear functions
• Binary output

2nd Generation
• Non-linear activation functions
• Continuous output
• Functional modeling of our brain
• Working real-life applications
• We are here (FF, CNN, RNN, …)

3rd Generation
• “Spiking neuron”
• Closely models a biological neuron’s activity
• Incorporates the concept of time: integrate and fire
• Computationally expensive
• Difficult to train → not practical at the moment
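The "integrate and fire" behavior of a spiking neuron can be sketched in a few lines. This is a minimal leaky integrate-and-fire model in discrete time; the threshold, leak factor, and input values are hypothetical parameters chosen for illustration, not taken from the slides or from any particular chip.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire neuron.

    At each time step the membrane potential decays by `leak`, integrates the
    input, and emits a spike (1) when it reaches `threshold`, then resets.
    """
    potential = reset
    spikes = []
    for current in input_current:
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = reset
        else:
            spikes.append(0)
    return spikes


# A constant input of 0.3 per step: the potential builds up over several
# steps before each spike, so information is carried by spike timing.
spike_train = simulate_lif([0.3] * 10)
print(spike_train)  # [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
```

Unlike a 2nd-generation neuron, the output depends on the history of inputs over time, which is part of why such models are expensive to simulate and hard to train with standard gradient methods.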

IBM TrueNorth

• 5.4 billion transistors in 28nm CMOS process

• 64 × 64 neurosynaptic cores, 256 neurons each

Paul A. Merolla, et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface", Science, 2014
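The core and neuron counts above account for the "million spiking-neuron" in the paper title, as this short calculation shows. The transistors-per-neuron figure is a derived, rough number that lumps in synapses, routing, and everything else on the die.

```python
# Figures copied from the slide.
cores = 64 * 64
neurons_per_core = 256
transistors = 5.4e9

total_neurons = cores * neurons_per_core

# Rough derived figure: includes synapse memory and routing, not just neurons.
transistors_per_neuron = transistors / total_neurons

print(cores, total_neurons, round(transistors_per_neuron))
```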

IBM TrueNorth

• Mimicking synapse with SRAM

• However, SRAM is not made for this (large area, cost).

A synapse is a structure that permits a neuron to pass an electrical signal to another.

Figure: SRAM synapse array between pre-neurons (Tx) and post-neurons (Rx). Input spikes drive 8T SRAM cells that act as synapses; the resulting voltages are summed (Σ) to produce the output spikes.

Neuromorphic Chip with Emerging Device

• New models require devices with new physics
• FeFET: better at storing/transferring analog signals

M. Jerry, et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training", IEEE IEDM 2017

Neuromorphic Chip with Emerging NV RAM

• ReRAM (memristor)

Z. Wang, et al., "Fully memristive neural networks for pattern classification with unsupervised learning", Nature Electronics 2018

1. Cloud and Edge Will be Closer

• Edge inference & learning will be more important due to privacy concerns, real-time operation, and power constraints

• Federated learning: leverage cloud’s big data advantage on edge devices

Figure: federated learning loop. Cloud servers broadcast a shared model to mobile devices; each device performs local learning to obtain custom weights; devices send back encrypted & compressed data; the cloud aggregates the encrypted data into an updated model.
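The federated learning loop can be sketched with a toy one-parameter model. The aggregation rule used here is size-weighted averaging of the locally trained weights (often called federated averaging), which is an assumption; the slides do not specify the rule. Encryption and compression are omitted, and all data and names are hypothetical.

```python
def local_update(shared_w, data, lr=0.1, epochs=20):
    """One device: fit y ~ w * x by gradient descent, starting from the shared weight."""
    w = shared_w
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(shared_w, devices):
    """Broadcast the shared weight, train locally, then average weighted by data size."""
    local_weights = [local_update(shared_w, data) for data in devices]
    total = sum(len(data) for data in devices)
    return sum(w * len(data) for w, data in zip(local_weights, devices)) / total


# Hypothetical private datasets; raw samples never leave the device.
devices = [
    [(1.0, 2.0), (2.0, 4.0)],  # this user's data follows y = 2x
    [(1.0, 3.0), (3.0, 9.0)],  # this user's data follows y = 3x
]

shared_w = 0.0
for _ in range(5):
    shared_w = federated_round(shared_w, devices)

print(round(shared_w, 3))
```

Only model weights travel between device and cloud, and the shared model settles between the two users' individual solutions.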

2. AI Chips will Support More Algorithms

• State-of-the-art algorithms are moving from traditional MLP, CNN, RNN to GAN, reinforcement learning, and unsupervised learning

Figure: AI chip capabilities, as labeled:
• Inference only (MLP/RNN or CNN)
• Inference only (MLP/CNN/RNN)
• Inference + Training (MLP/CNN/RNN)
• Inference + Training (GAN/RL/Unsupervised/MLP/CNN/RNN)

3. AI Security Will be Essential

• It is easy to break DNN based recognition
- New cyberattack: imperceptible noise injection

Figure: breaking state-of-the-art face recognition; physical attack on autonomous vehicles.
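How a small noise injection flips a prediction can be shown on a toy linear classifier. The sketch uses a sign-of-gradient step in the style of the fast gradient sign method, which is a swapped-in illustration rather than the attack shown on the slide. The model is a toy, not a DNN, and the weights and input are hypothetical.

```python
def score(w, b, x):
    """Linear classifier score; class 1 if positive, class 0 otherwise."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else 0


def sign(v):
    return (v > 0) - (v < 0)


def perturb(w, x, label, eps):
    """Move every input by eps in the direction that hurts the true class.

    For a linear model the gradient of the score with respect to x is w,
    so only the sign of each weight is needed.
    """
    direction = -1 if label == 1 else 1
    return [xi + direction * eps * sign(wi) for wi, xi in zip(w, x)]


# Hypothetical 16-dimensional model and input (values on a 0..1 scale).
w = [0.5, -0.5, 0.75, -0.75] * 4
b = 0.0
x = [0.52, 0.5, 0.52, 0.5] * 4

eps = 0.02  # each input value changes by at most 2% of the full range
x_adv = perturb(w, x, label=predict(w, b, x), eps=eps)

print(predict(w, b, x), predict(w, b, x_adv))  # 1 0
```

Every input moves by only 0.02, yet the many small changes add up across dimensions and push the score across the decision boundary. The same accumulation effect is what makes high-dimensional image classifiers vulnerable.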

4. For Success of AI Chip, SW is the Key

• How did ARM dominate the mobile processor market?
- Low power consumption with reasonable performance
- ARM’s great compiler toolchain & licensing

• Why was the GPU such a big success in the early DNN revolution?
- Because of CUDA, a generic programming language for data-intensive workloads like matrix-vector multiplication
- CUDA matured for several years before developers actually used it

AI Chip Research at KAIST

Figure: timeline of KAIST AI chips from 2008 to 2019, with block diagrams and application photos. Chips and features labeled on the timeline: Multi-core OR Processor; Dual Layered 3-stage Pipeline; Simultaneous Multi-threading; Multi-classifier System; Multi-core MIMD; Visual Attention; Heterogeneous Many-SIMD; Multi-Modal UI/UX; Deep Learning Core; Stereo Matching Processor; Face Recognition & CNN–RNN; Variable Bit DNN & 3D HGR.

Block diagram labels include convolution clusters, an FC LSTM processor, an aggregation core, pipelined CNN PEs, an ICP-PSO engine, and NN PIM 0 to 15.

Process             65nm 1P8M Logic CMOS
Area                4mm × 4mm
SRAM                448 KB
Supply              0.67V – 1.1V
Power               196 mW @ 200MHz, 1.1V
                    2.4 mW @ 10MHz, 0.67V
Precision           Feature – bfloat16
                    Weight – 16/8/4-bit FXP
Peak performance    204 GFLOPS @ 16b weight

Figure: die photo with labels Ext. IF 0, Ext. IF 1, Core 1, Core 2, Core 3, Top Ctrlr., UMEM, BMEM, PE Arrays, Exp. Compressor, and 1-D SIMD. Application: supervised & reinforcement learning.

Figure: hand depth tracking results from an input image, captured with VGA cameras separated by 5cm.

Hand tracking accuracy
• 2.6mm @ 20cm
• 4.6mm @ 30cm
• 3.4mm @ 40cm
