FPL15 talk: Deep Convolutional Neural Network on FPGA

TRANSCRIPT

Page 1: FPL15 talk: Deep Convolutional Neural Network on FPGA

A Deep Convolutional Neural Network Based on

Nested Residue Number System

Hiroki Nakahara^1, Tsutomu Sasao^2

^1 Ehime University, Japan, ^2 Meiji University, Japan

Page 2: FPL15 talk: Deep Convolutional Neural Network on FPGA

Outline

• Background

• Deep convolutional neural network (DCNN)

• Residue number system (RNS)

• DCNN using nested RNS (NRNS)

• Experimental results

• Conclusion


Page 3: FPL15 talk: Deep Convolutional Neural Network on FPGA

Background

• Deep Neural Network

– Multi-layer neuron model

– Used for embedded vision system

• FPGA realization is suitable for real-time systems

– Faster than a CPU

– Lower power consumption than a GPU

– Fixed-point representation is sufficient

• High performance per area is desired


Page 4: FPL15 talk: Deep Convolutional Neural Network on FPGA

Deep Convolutional Neural Network (DCNN)


Page 5: FPL15 talk: Deep Convolutional Neural Network on FPGA

Artificial Neuron

[Figure: artificial neuron with inputs x_0 = 1 (bias), x_1, x_2, ..., x_N, weights w_0 (bias), w_1, ..., w_N, internal state u, activation f(u), and output y]

x_i: Input signal
w_i: Weight
u: Internal state
f(u): Activation function (Sigmoid, ReLU, etc.)
y: Output signal


y = f(u)

u = Σ_{i=0}^{N} w_i x_i
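As a rough illustration (not part of the talk), here is a minimal Python sketch of the neuron above, using ReLU as the activation f(u); the function name and the example values are made up:

```python
# Sketch of an artificial neuron: u = sum_{i=0}^{N} w_i * x_i, y = f(u).
# The bias is folded in as w_0 with a constant input x_0 = 1.

def neuron(x, w, f=lambda u: max(0.0, u)):    # ReLU as the activation f(u)
    u = sum(wi * xi for wi, xi in zip(w, x))  # internal state u
    return f(u)                               # output signal y

# Example: constant bias input x_0 = 1 plus two real inputs.
print(neuron([1.0, 0.5, -2.0], [0.1, 0.8, 0.3]))  # ReLU(-0.1) = 0.0
```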

Page 6: FPL15 talk: Deep Convolutional Neural Network on FPGA

Deep Convolutional Neural Network (DCNN) for ImageNet

• 2D convolutional layer, pooling layer, and fully connected layer


Page 7: FPL15 talk: Deep Convolutional Neural Network on FPGA

2D Convolutional Layer

• Consumes more than 90% of the computation time

– Multiply-accumulate (MAC) operations are performed


z_ij = y_ij + Σ_{m=0}^{K-1} Σ_{n=0}^{K-1} x_{i+m, j+n} w_mn

x_ij: Input signal
y_ij: Bias
w_mn: Weight
K: Kernel size
z_ij: Output signal
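As an illustration only (this is not the FPGA datapath), a minimal Python sketch of the convolution formula above, with the bias y_ij taken as a single scalar:

```python
# Naive 2D convolution: z[i][j] = bias + sum_{m,n} x[i+m][j+n] * w[m][n].
def conv2d(x, w, bias):
    K = len(w)                            # kernel size K x K
    H, W = len(x), len(x[0])
    out_h, out_w = H - K + 1, W - K + 1
    z = [[bias] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for m in range(K):            # multiply-accumulate (MAC) loop
                for n in range(K):
                    z[i][j] += x[i + m][j + n] * w[m][n]
    return z

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]                      # K = 2
print(conv2d(x, w, bias=0))               # [[6, 8], [12, 14]]
```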

Page 8: FPL15 talk: Deep Convolutional Neural Network on FPGA

Realization of 2D Convolutional Layer

• Requires more than a billion MACs!

• Our realization

– Time multiplexing

– Nested Residue Number System (NRNS)

[Figure: a conventional binary datapath on the FPGA (off-chip memory, BRAMs, multiplier and adder trees) versus the proposed datapath, which places Binary-to-RNS and RNS-to-Binary converters around the BRAMs and MAC units]

Page 9: FPL15 talk: Deep Convolutional Neural Network on FPGA

Residue Number System (RNS)


Page 10: FPL15 talk: Deep Convolutional Neural Network on FPGA

Residue Number System (RNS)

Defined by a set of L mutually prime integer constants 〈m1, m2, ..., mL〉

– No two moduli share a common factor

– Typically, prime numbers are used as the moduli

An arbitrary integer X can be uniquely represented by a tuple of L integers (X_1, X_2, ..., X_L), where

X_i = X mod m_i

Dynamic range: M = Π_{i=1}^{L} m_i
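A minimal Python sketch (not from the talk) of the definitions above, checking that the residue representation is unique within the dynamic range M:

```python
# An integer X in [0, M) is represented by its residues X_i = X mod m_i.
from math import prod

moduli = (3, 4, 5)                       # pairwise coprime moduli
M = prod(moduli)                         # dynamic range M = 60

def to_rns(x):
    return tuple(x % m for m in moduli)

# Every X in [0, M) gets a distinct residue tuple.
assert len({to_rns(x) for x in range(M)}) == M
print(to_rns(8))                         # (2, 0, 3)
```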

Page 11: FPL15 talk: Deep Convolutional Neural Network on FPGA

Parallel Multiplication

Multiplication on RNS

Moduli set 〈3, 4, 5〉, X = 8, Y = 2

Z = X × Y = 16 = (1, 0, 1)

X = (2, 0, 3), Y = (2, 2, 2)

Z = (4 mod 3, 0 mod 4, 6 mod 5)

= (1, 0, 1) = 16


[Figure: Binary2RNS conversion → digit-wise modulo multiplication → RNS2Binary conversion]
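A small Python sketch (illustrative only; the decode-by-search stands in for a real RNS-to-binary converter) reproducing the worked example: multiplication is performed digit-wise, modulo each m_i, with no carries between channels:

```python
moduli = (3, 4, 5)

def to_rns(x):
    return tuple(x % m for m in moduli)

def rns_mul(a, b):
    # One independent modulo multiplier per channel.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(r):
    # Brute-force decode over the dynamic range (CRT in practice).
    return next(x for x in range(3 * 4 * 5) if to_rns(x) == r)

X, Y = to_rns(8), to_rns(2)              # (2, 0, 3) and (2, 2, 2)
Z = rns_mul(X, Y)                        # (1, 0, 1)
print(Z, from_rns(Z))                    # (1, 0, 1) 16
```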

Page 12: FPL15 talk: Deep Convolutional Neural Network on FPGA

Binary2RNS Converter


X   X mod 2   X mod 3   X mod 4
0      0         0         0
1      1         1         1
2      0         2         2
3      1         0         3
4      0         1         0
5      1         2         1

Page 13: FPL15 talk: Deep Convolutional Neural Network on FPGA

Functional Decomposition

Decomposition chart with bound variables X1 = (x1, x2) (columns) and free variables X2 = (x3, x4) (rows):

X2 \ X1   00  01  10  11
  00       0   1   0   1
  01       1   1   1   1
  10       1   0   1   0
  11       1   0   1   0

Column multiplicity = 2, so a single bit h(X1) distinguishes the two distinct column patterns:

x1      0  0  1  1
x2      0  1  0  1
h(X1)   0  1  0  1

Remaining table g(h(X1), x3, x4):

x3 x4 \ h(X1)   0  1
    00          0  1
    01          1  1
    10          1  0
    11          1  0

Memory: 2^4 × 1 = 16 [bit] → 2^2 × 1 + 2^3 × 1 = 12 [bit]
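To make the idea concrete, here is a small Python sketch (the helper name is hypothetical, not from the talk) that counts the column multiplicity of a decomposition chart; a function is realizable as g(h(bound vars), free vars), where h needs only ceil(log2(multiplicity)) bits:

```python
def column_multiplicity(f, n_bound, n_free):
    # Each assignment of the bound variables selects one column of the chart.
    columns = set()
    for b in range(1 << n_bound):
        columns.add(tuple(f(b, fr) for fr in range(1 << n_free)))
    return len(columns)

# The chart above: bound X1 = (x1, x2), free X2 = (x3, x4).
chart = {0: (0, 1, 1, 1), 1: (1, 1, 0, 0), 2: (0, 1, 1, 1), 3: (1, 1, 0, 0)}
print(column_multiplicity(lambda b, fr: chart[b][fr], 2, 2))   # 2
```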

Page 14: FPL15 talk: Deep Convolutional Neural Network on FPGA

Decomposition Chart for X mod 3

Decomposition chart for the 5-bit input X, with bound variables X2 = (x3, x4, x5) (columns) and free variables X1 = (x1, x2) (rows):

X1 \ X2   000 001 010 011 100 101 110 111
  00        0   1   2   0   1   2   0   1
  01        1   2   0   1   2   0   1   2
  10        2   0   1   2   0   1   2   0
  11        0   1   2   0   1   2   0   1

(0 mod 3 = 0, 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, 4 mod 3 = 1, 5 mod 3 = 2, ..., 10 mod 3 = 1, ...)

Page 15: FPL15 talk: Deep Convolutional Neural Network on FPGA

Decomposition Chart for X mod 3

The chart has column multiplicity 3, so the identical columns are merged and indexed by h(X2):

X1 \ h(X2)   0  1  2
  00         0  1  2
  01         1  2  0
  10         2  0  1
  11         0  1  2

h(X2) is itself a small function of the bound variables:

x3      0  0  0  0  1  1  1  1
x4      0  0  1  1  0  0  1  1
x5      0  1  0  1  0  1  0  1
h(X2)   0  1  2  0  1  2  0  1

Page 16: FPL15 talk: Deep Convolutional Neural Network on FPGA

Binary2RNS Converter

[Figure: the Binary2RNS converter is realized by one LUT cascade per modulus (X mod m1, X mod m2, ...); each cell of a cascade is mapped to a BRAM]
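A minimal Python sketch (the function names are made up, and the chunk width is an arbitrary choice) of how such a cascade computes X mod m: each stage is a small lookup table taking the running remainder plus the next few input bits, exactly the kind of cell a BRAM can hold:

```python
def build_stage_table(m, chunk_bits):
    # table[(r, c)] = (r * 2**chunk_bits + c) % m  -- one cascade cell.
    return {(r, c): (r * (1 << chunk_bits) + c) % m
            for r in range(m) for c in range(1 << chunk_bits)}

def mod_by_lut_cascade(x, m, total_bits=8, chunk_bits=2):
    # total_bits must be a multiple of chunk_bits in this simple sketch.
    table = build_stage_table(m, chunk_bits)
    r = 0
    for shift in range(total_bits - chunk_bits, -1, -chunk_bits):
        chunk = (x >> shift) & ((1 << chunk_bits) - 1)   # next input bits
        r = table[(r, chunk)]                            # one cascade stage
    return r

assert all(mod_by_lut_cascade(x, 3) == x % 3 for x in range(256))
```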

Page 17: FPL15 talk: Deep Convolutional Neural Network on FPGA

RNS2Binary Converter (m=30)

For m = 30 = 2 × 3 × 5, each residue digit indexes a small table:

x1 : y1      x2 : y2      x3 : y3
 0 : 0        0 : 0        0 : 0
 1 : 15       1 : 10       1 : 6
              2 : 20       2 : 12
                           3 : 18
                           4 : 24

The table outputs are combined by modulo-m adders (with carry) to recover the binary value.
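A short Python sketch (illustrative; the variable names are made up) of this table-plus-adder structure, checked against X mod 30 for the moduli 2, 3, 5:

```python
# Each residue digit selects a precomputed value; a chain of mod-M adders
# combines them.  15, 10, 6 are the CRT weights for the moduli 2, 3, 5.
M = 30
y1 = {0: 0, 1: 15}                        # 15 = 1 (mod 2), 0 (mod 3), 0 (mod 5)
y2 = {0: 0, 1: 10, 2: 20}                 # 10 = 1 (mod 3), 0 (mod 2), 0 (mod 5)
y3 = {0: 0, 1: 6, 2: 12, 3: 18, 4: 24}    #  6 = 1 (mod 5), 0 (mod 2), 0 (mod 3)

def rns2bin(x1, x2, x3):
    return (y1[x1] + y2[x2] + y3[x3]) % M  # two modulo-M adders in hardware

assert all(rns2bin(x % 2, x % 3, x % 5) == x for x in range(M))
```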

Page 18: FPL15 talk: Deep Convolutional Neural Network on FPGA

Problem

• Moduli set of RNS consists of mutually prime numbers

– Sizes of the circuits are all different

• Example: <7, 11, 13>

[Figure: for <7, 11, 13>, the modulo-7 multiplier fits in a 6-input LUT (3-bit residues), while the modulo-11 and modulo-13 multipliers need 8-input LUTs (4-bit residues)]

Binary2RNS converter: by BRAMs

RNS2Binary converter: by DSP blocks and BRAMs

Page 19: FPL15 talk: Deep Convolutional Neural Network on FPGA

DCNN using Nested RNS


Page 20: FPL15 talk: Deep Convolutional Neural Network on FPGA

Nested RNS

• (Z1, Z2, …, Zi, …, ZL) → (Z1, Z2, …, (Zi1, Zi2, …, Zij), …, ZL)

• Ex: <7, 11, 13> × <7, 11, 13>

→ <7, <5,6,7>_11, <5,6,7>_13> × <7, <5,6,7>_11, <5,6,7>_13>


1. Reuse the same moduli set

2. Decompose a large modulus into smaller ones

(The subscript denotes the original modulus.)

Page 21: FPL15 talk: Deep Convolutional Neural Network on FPGA

Example of Nested RNS

• 19 × 22 (= 418) on <7, <5,6,7>_11, <5,6,7>_13>

19 × 22

= <5, 8, 6> × <1, 0, 9>

= <5, <3,2,1>_11, <1,0,6>_13> × <1, <0,0,0>_11, <4,3,2>_13>

= <5, <0,0,0>_11, <4,0,5>_13>

= <5, 0, 2>

= 418


(Steps: Binary2NRNS conversion → modulo multiplication on the nested RNS → RNS2Bin conversion)
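A Python sketch (the helpers are illustrative; the decode-by-search stands in for the hardware <5,6,7>2Bin converter) checking the worked example: the residues mod 11 and mod 13 are themselves held in the inner RNS <5, 6, 7>, multiplied there, and only reduced back mod 11 / mod 13 when decoding:

```python
inner = (5, 6, 7)                       # inner moduli set, range 5*6*7 = 210

def to_inner(x):
    return tuple(x % m for m in inner)

def inner_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, inner))

def from_inner(r):                      # brute-force CRT decode
    return next(x for x in range(210) if to_inner(x) == r)

def to_nrns(x):
    return (x % 7, to_inner(x % 11), to_inner(x % 13))

a, b = to_nrns(19), to_nrns(22)   # (5,(3,2,1),(1,0,6)) and (1,(0,0,0),(4,3,2))
z7 = (a[0] * b[0]) % 7
z11 = from_inner(inner_mul(a[1], b[1])) % 11
z13 = from_inner(inner_mul(a[2], b[2])) % 13
print((z7, z11, z13))                   # (5, 0, 2), i.e. 418 in <7, 11, 13>
```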

Page 22: FPL15 talk: Deep Convolutional Neural Network on FPGA

Realization of Nested RNS


[Figure: Binary2NRNS conversion (Bin2<7,11,13> followed by Bin2<5,6,7>, realized by BRAMs) → modulo operations on 6-input LUTs → NRNS2Binary conversion (<5,6,7>2Bin followed by <7,11,13>2Bin, realized by BRAMs and DSP blocks)]

Page 23: FPL15 talk: Deep Convolutional Neural Network on FPGA

Moduli Set for NRNS

• Conventional RNS (uses 23 moduli)

<3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83>

• Applied the NRNS to moduli that are greater than 15

<3, 4, 5, 7, 11, 13,

<3,4,5,7,11,13>_17,

<3,4,5,7,11,13>_19,

<3,4,5,7,11,13, <3,4,5,7,11,13>_17>_23,

<3,4,5,7,11,13, <3,4,5,7,11,13>_17>_29,

…, <3,4,5,7,11,13, <3,4,5,7,11,13>_17>_83>

All the 48-bit MAC operations are decomposed into 4-bit ones
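As a sanity check (an illustration, not from the slides), the dynamic range of the 23-modulus outer set is far larger than 2^48, so it can hold the 48-bit MAC values before they are decomposed into 4-bit digits:

```python
from math import prod

moduli = [3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83]            # 23 moduli
M = prod(moduli)
print(len(moduli), M.bit_length())       # 23 moduli, dynamic range > 100 bits
assert M > 1 << 48                       # covers 48-bit fixed-point values
```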

Page 24: FPL15 talk: Deep Convolutional Neural Network on FPGA

DCNN Architecture using the NRNS


[Figure: overall architecture: sequencer, DDR3 controller to the external DDR3 SODIMM, on-chip memory, parallel Bin2NRNS converters, 16 parallel modulo-mi 2D convolutional units with BRAMs, and tree-based NRNS2Bin (RNS2Bin) converters]

Page 25: FPL15 talk: Deep Convolutional Neural Network on FPGA

Experimental Results


Page 26: FPL15 talk: Deep Convolutional Neural Network on FPGA

Implementation Setup

• FPGA board: Xilinx VC707

– FPGA: Virtex7 VX485T

– 1 GB DDR3 SODIMM (bus @ 800 MHz, 64-bit width)

• Realized the pre-trained ImageNet network by Convnet2

– 48-bit fixed-point precision

• Synthesis tool: Xilinx Vivado 2014.1

– Timing constraint: 400 MHz


Page 27: FPL15 talk: Deep Convolutional Neural Network on FPGA

Comparison with Other Implementations

Design      Precision      Max. Freq. [MHz]   FPGA                Performance [GOPS]   Perf. per area [GOPS/Slice x 10^-4]
ASAP2009    16-bit fixed   115                Virtex5 LX330T        6.7                 1.3
PACT2010    fixed          125                Virtex5 SX240T        7.0                 1.9
FPL2009     48-bit fixed   125                Spartan3A DSP3400     5.3                 2.2
ISCA2010    48-bit fixed   200                Virtex5 SX240T       16.0                 4.3
ICCD2013    fixed          150                Virtex6 LX240T       17.0                 4.5
FPGA2015    32-bit float   100                Virtex7 VX485T       61.6                 8.1
Proposed    48-bit fixed   400                Virtex7 VX485T      132.2                25.2

Page 28: FPL15 talk: Deep Convolutional Neural Network on FPGA

Conclusion

• Realized the DCNN on the FPGA

– Time multiplexing

– Nested RNS

• MAC operations are realized by small LUTs

• Functional decomposition is used as follows:

– Bin2NRNS converter is realized by BRAMs

– NRNS2Bin converter is realized by DSP blocks and BRAMs

• Performance per area (GOPS/Slice)

– 5.86 times higher than ISCA10's
