Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Join the Conversation #OpenPOWERSummit



Revolutionizing the Datacenter


Power-Efficient Machine Learning using FPGAs on POWER Systems

Ralph Wittig, Distinguished Engineer

Office of the CTO, Xilinx

Join the Conversation #OpenPOWERSummit


Super Human

Humans: ~95%***

Top-5 Accuracy, Image Classification: ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

* http://image-net.org/challenges/LSVRC/
** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10
*** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf

Page 2


CNNs far outperform non-AI methods

CNNs deliver super-human accuracy


Page 3


CNNs Explained

Page 4


The Computation

Page 5


The Computation

Page 6


Calculating a single pixel on a single output feature plane requires a 3x3x384 input sub-volume and a 3x3x384 set of kernel weights
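The per-pixel computation described here can be sketched in NumPy (hypothetical names; shapes taken from the slide, with the input padded by one so the output stays 13×13):

```python
import numpy as np

# Shapes from the slide: a 3x3x384 window per output pixel, 256 output planes.
inp = np.random.randn(15, 15, 384)          # 13x13x384 input, padded by 1
weights = np.random.randn(256, 3, 3, 384)   # one 3x3x384 weight set per plane

def output_pixel(inp, weights, y, x, k):
    """One pixel on output plane k: dot product of the 3x3x384 input
    sub-volume at (y, x) with the 3x3x384 weight set for plane k."""
    window = inp[y:y+3, x:x+3, :]        # 3x3x384 input sub-volume
    return np.sum(window * weights[k])   # 3*3*384 = 3,456 multiply-adds

pixel = output_pixel(inp, weights, 0, 0, 0)
```

The next pixel on the same plane reuses `weights[k]` with the window shifted by one position, which is exactly the overlap the following slides exploit.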

[Figure: convolution of a 13×13×384 input volume with a 3×3×384 set of kernel weights, producing a 13×13×256 output volume]

Page 7


Calculating the next pixel on the same output feature plane requires an overlapping 3x3x384 input sub-volume and the same 3x3x384 set of weights

Page 8



Continue along the row ...

Page 9



Before moving down to the next row

Page 10



The first output feature map is complete

Page 11



Move on to the next output feature map by switching weights, and repeat

Page 12



Pattern repeats as before: same input volumes, different weights

Page 13



Complete the second output feature map plane

Page 14



Finally, after all 256 weight sets have been used, the full output volume is complete
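The whole walkthrough (slide the window along each row, move down a row, switch weight sets, repeat for all 256 output planes) amounts to the textbook loop nest below; a NumPy sketch with hypothetical names, not the FPGA implementation:

```python
import numpy as np

def conv_layer(inp, weights):
    """Direct convolution: for each weight set, slide a 3x3 window over
    every (y, x) position of the padded input and take the dot product."""
    H, W, C = inp.shape                          # e.g. 15 x 15 x 384 (padded)
    K, kh, kw, _ = weights.shape                 # e.g. 256 x 3 x 3 x 384
    out = np.empty((H - kh + 1, W - kw + 1, K))  # e.g. 13 x 13 x 256
    for k in range(K):                           # switch to the next weight set
        for y in range(out.shape[0]):            # move down to the next row
            for x in range(out.shape[1]):        # continue along the row
                out[y, x, k] = np.sum(inp[y:y+kh, x:x+kw, :] * weights[k])
    return out

out = conv_layer(np.random.randn(15, 15, 384), np.random.randn(256, 3, 3, 384))
```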

Page 15



Fully Connected Layers

Page 16


Fully Connected Layers

[Figure: fully-connected layers fc6 → fc7 and fc7 → fc8. Activations a0,0 … a0,4095 (fc6) feed a1,0 … a1,4095 (fc7) through weights w0,j,i; fc7 feeds a2,0 … a2,999 (fc8) through weights w1,i,j. Each output is a weighted sum of all inputs, e.g. a1,0 = f(Σi w0,0,i · a0,i) and a2,999 = f(Σi w1,i,999 · a1,i)]
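Unlike convolution, a fully-connected layer has one weight per input/output pair and no reuse of a weight across positions. A minimal NumPy sketch of the fc6 → fc7 → fc8 chain above (hypothetical names; ReLU assumed as the non-linearity f):

```python
import numpy as np

def fc_layer(a, W, f=lambda z: np.maximum(z, 0.0)):
    """One fully-connected layer: each output activation is f applied to a
    weighted sum over ALL input activations (W has shape n_out x n_in)."""
    return f(W @ a)

a6 = np.random.randn(4096)                # fc6 activations
W7 = 0.01 * np.random.randn(4096, 4096)   # fc6 -> fc7: ~16.8M weights
W8 = 0.01 * np.random.randn(1000, 4096)   # fc7 -> fc8: ~4.1M weights
scores = fc_layer(fc_layer(a6, W7), W8)   # 1000 class scores
```

Every weight is read exactly once per input image, which is why the FC layers dominate memory bandwidth rather than compute, as the next slide shows.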

Page 17


CNN Properties

Compute: dominated by convolution (CONV) layers

Memory BW: dominated by fully-connected (FC) layers

[Figure: bar charts of compute (GOPs per layer) and memory access (G reads per layer) for CaffeNet, ZF, VGG11, VGG16, and VGG19 across layers CONV1–CONV5 and FC6–FC8: the CONV layers dominate compute, while the FC layers, above all FC6, dominate memory reads]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 18


Humans vs Machines

Humans are six orders of magnitude more efficient

*IBM Watson, ca 2012

Source: Yu Wang, Tsinghua University, Feb 2016


Page 19

Page 20: Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx · Your logo here Pruning Elements R etra in to R eco ve r Accu ra cy T ra in C o n n ec tiv ity P ru n e C o n n ec

Your logohere

Cost of Computation

Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016.

Page 20


Cost of Computation

Stay in on-chip memory (1/100 x power)

Use Smaller Multipliers (8bits vs 32bits: 1/16 x power)

Fixed-Point vs Float (don’t waste bits on dynamic range)
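The ratios above line up with rough per-operation energy estimates often quoted in this context (45 nm figures in picojoules; the exact values below are assumptions for illustration, the ratios are what matter):

```python
# Approximate energy per operation in picojoules (45 nm CMOS estimates;
# illustrative numbers, not measurements from this talk).
ENERGY_PJ = {
    "mult_int8": 0.2,
    "mult_int32": 3.1,        # smaller multipliers: ~1/16 the power
    "read_sram_32b": 5.0,     # on-chip memory
    "read_dram_32b": 640.0,   # off-chip DRAM: ~100x+ the on-chip cost
}

mult_ratio = ENERGY_PJ["mult_int32"] / ENERGY_PJ["mult_int8"]        # ~16x
mem_ratio = ENERGY_PJ["read_dram_32b"] / ENERGY_PJ["read_sram_32b"]  # ~128x
```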

Source: William Dally, "High Performance Hardware for Machine Learning", Cadence ENN Summit, 2/9/2016.

Page 21


Improving Machine Efficiency

Model Pruning

Right-Sizing Precision

Custom CNN Processor Architecture

Page 22


Pruning Elements

Train Connectivity → Prune Connections → Train Weights (retrain to recover accuracy)

Remove low-contribution weights (synapses), then retrain the remaining weights.

[Figure: accuracy loss (+0.5% to -4.5%) versus parameters pruned away (40%–100%), comparing L1 and L2 regularization with and without retraining, and L2 regularization with iterative prune and retrain]

Source: Han, et al, "Learning both Weights and Connections for Efficient Neural Networks", NIPS 2015, http://arxiv.org/pdf/1506.02626v3.pdf
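The prune step of the train → prune → retrain loop can be sketched as simple magnitude pruning (a sketch of the idea in Han et al., with hypothetical names; the retraining that recovers accuracy is not shown):

```python
import numpy as np

def prune_by_magnitude(W, fraction):
    """Zero out the lowest-magnitude `fraction` of weights and return the
    pruned weights plus the binary mask of surviving connections.
    Retraining then updates only the survivors (gradients masked by `mask`)."""
    threshold = np.quantile(np.abs(W), fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(1024, 1024)
W_pruned, mask = prune_by_magnitude(W, 0.89)   # keep ~1/9 of the weights
```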

Page 23


Pruning Results: AlexNet

Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

9x Reduction In #Weights

Most Reduction In FC Layers

Page 24


Pruning Results: AlexNet

< 0.1% Accuracy Loss

Source: Han, et al, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", http://arxiv.org/pdf/1510.00149.pdf

Page 25


Inference with Integer Quantization

Page 26


Right-Sizing Precision

Dynamic: Variable Format Fixed-Point (Per Layer)

< 1% Accuracy Loss

Network: VGG16

Data bits          Single-float   16      16      8            8
Weight bits        Single-float   16      8       8            8 or 4
Data precision     N/A            2^-2    2^-2    2^-5/2^-1    Dynamic
Weight precision   N/A            2^-15   2^-7    2^-7         Dynamic
Top-1 accuracy     68.1%          68.0%   53.0%   28.2%        67.0%
Top-5 accuracy     88.0%          87.9%   76.6%   49.7%        87.6%
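The "Dynamic" columns correspond to picking a fixed-point format per layer rather than one global format. A minimal sketch (hypothetical names): choose the fractional length so the layer's largest value fits the integer range, then round everything to that grid:

```python
import numpy as np

def quantize_dynamic_fixed(x, bits=8):
    """Per-layer dynamic fixed point: pick fractional length n so that the
    largest magnitude in x fits a signed `bits`-bit integer, then round
    every value to a multiple of 2**-n."""
    max_int = 2 ** (bits - 1) - 1                             # 127 for INT8
    n = int(np.floor(np.log2(max_int / np.max(np.abs(x)))))   # fractional bits
    scale = 2.0 ** n
    q = np.clip(np.round(x * scale), -max_int - 1, max_int)
    return q / scale, n

layer_weights = 0.05 * np.random.randn(64)
wq, frac_bits = quantize_dynamic_fixed(layer_weights, bits=8)
```

Because each layer gets its own fractional length, no bits are wasted covering a dynamic range the layer never uses.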

Source: Yu Wang, Tsinghua University, Feb 2016

Page 27


Right-Sizing Precision

Fixed-Point Sufficient For Deployment (INT16, INT8)

No Significant Loss in Accuracy (< 1%)

>10x Energy Efficiency OPs/J (INT8 vs FP32)

4x Memory Energy Efficiency Tx/J (INT8 vs FP32)

Page 28


Improving Machine Efficiency

CNN Model
→ model pruning → Pruned Floating-Point Model
→ data/weight quantization → Pruned Fixed-Point Model
→ compilation → Instructions
→ run on FPGA-Based Neural Network Processor

Modified from: Yu Wang, Tsinghua University, Feb 2016

Page 29


Xilinx Kintex® UltraScale™ KU115 (20nm)

5,520 DSP slices, up to 500 MHz

5.5 TOPS INT16 (peak)

4 GB DDR4-2400 & 38 GB/s

55 W TDP & 100 GOPS/W

Single Slot, Low-Profile Form Factor

OpenPOWER CAPI AlphaData ADM-PCIE-8K5
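The headline figures are consistent with a back-of-the-envelope check, counting one multiply-accumulate per DSP slice per cycle as two operations (the usual convention):

```python
# Peak throughput check for the KU115 figures quoted above.
dsp_slices = 5520
clock_hz = 500e6
peak_ops = dsp_slices * clock_hz * 2     # multiply + add per slice per cycle
tdp_watts = 55
ops_per_watt = peak_ops / tdp_watts

print(peak_ops / 1e12)      # ~5.5 TOPS INT16
print(ops_per_watt / 1e9)   # ~100 GOPS/W
```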

Page 30


FPGA Architecture

[Figure: 2D fabric of interleaved CLB, DSP, and block RAM tiles]

2D Array Architecture (scales with Moore’s Law)

Memory Proximate Computing (Minimize Data Moves)

Broadcast Capable Interconnect (Data Sharing/Reuse)

Page 31


FPGA Arithmetic & Memory Resources

Native 16-bit multiplier (or reduced-power 8-bit)

On-chip RAMs store INT4, INT8, INT16, …

Custom quantization formatting (Qm.n)

[Figure: weights Wij and data Dj feed a 16-bit multiplier and a 48-bit accumulator producing Oi; custom-width on-chip memories hold INT4/INT8/INT16/INT32/FP16/FP32 values in custom quantization formats (Qm.n, e.g. Q8.8, Q2.14)]
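Qm.n here means m integer bits and n fractional bits (plus sign); the stored value is just an integer scaled by 2^-n, which is what the custom-width RAMs hold. A small sketch with hypothetical helper names:

```python
def to_qmn(x, m, n):
    """Encode x in signed Qm.n fixed point (m integer bits, n fractional
    bits): scale by 2**n, round, and saturate to the representable range."""
    raw = round(x * (1 << n))
    lo, hi = -(1 << (m + n)), (1 << (m + n)) - 1
    return max(lo, min(hi, raw))

def from_qmn(raw, n):
    """Decode a raw Qm.n integer back to a real number."""
    return raw / (1 << n)

# Q2.14 in 16 bits covers [-4, 4) with resolution 2**-14
r = from_qmn(to_qmn(1.2345, 2, 14), 14)
```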

Page 32


Convolver Unit

[Figure: convolver unit: a data buffer and a weight buffer feed, through MUXes and delay lines (n and m delays), a 3×3 array of multipliers (9 data inputs, 9 weight inputs) whose products are summed by an adder tree into the output data]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 33


Convolver Unit

[Figure: same convolver unit, annotated]

Memory-proximate compute: 2D parallel memory feeding a 2D operator array (INT16)

Serial-to-parallel buffers, ping/pong

Serial-to-parallel data reuse: 8/9

Source: Yu Wang, Tsinghua University, Feb 2016

Page 34


Processing Engine (PE)

[Figure: processing engine (PE): an input buffer supplies data, bias, and weights to a convolver complex; adder-tree, bias-shift, and data-shift stages feed a non-linearity (NL) and pooling unit, with intermediate data looped back and results written to an output buffer, all under a controller]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 35


Processing Engine (PE)

[Figure: same PE, annotated]

Memory sharing, broadcast weights

Custom quantization

Source: Yu Wang, Tsinghua University, Feb 2016

Page 36


Top Level

[Figure: top level: a POWER CPU and external memory (the processing system) connect through DMA with compression and a data & instruction bus to the programmable logic, which holds a computing complex of PEs with input and output buffers, a FIFO, a controller, and a config bus]

Source: Yu Wang, Tsinghua University, Feb 2016

Page 37


Top Level

[Figure: same top level, annotated]

SW-scheduled dataflow

Decompress weights on the fly

Multiple PEs: block-level parallelism

Ping-pong buffers: transfers overlap with compute

Source: Yu Wang, Tsinghua University, Feb 2016

Page 38


FPGA Neural Net Processor

Tiled Architecture (Parallelism & Scaling)

Semi Static Dataflow (Pre-scheduled Data Transfers)

Memory Reuse (Data Sharing across Convolvers)

Page 39


OpenPOWER CAPI

Shared Virtual Memory

System-Wide Memory Coherency

Low Latency Control Messages

[Figure: POWER8 CAP unit linked to the CAPI PSL on the FPGA]

Peer Programming Model and Interaction Efficiency

Page 40


OpenPOWER CAPI


Power:
• Caffe, TensorFlow, etc.
• Load CNN Model
• Call AuvizDNN Library

Xilinx FPGA:
• AuvizDNN Kernel
• Scalable & Fully Parameterized
• Plug-and-Play Library

Page 41


OpenPOWER CAPI


14 Images/s/W (AlexNet)

Batch Size 1

Low Profile TDP

Page 42


Takeaways

FPGA: Ideal Dataflow CNN Processor

POWER/CAPI: Elevates Accelerators As Peers to CPUs

FPGA CNN Libraries

Page 43


4/11/2016

Page 44

Thank You!