FPL15 talk: Deep Convolutional Neural Network on FPGA

TRANSCRIPT

Page 1: FPL15 talk: Deep Convolutional Neural Network on FPGA

A Deep Convolutional Neural Network Based on

Nested Residue Number System

Hiroki Nakahara^1, Tsutomu Sasao^2

^1 Ehime University, Japan, ^2 Meiji University, Japan

Page 2: FPL15 talk: Deep Convolutional Neural Network on FPGA

Outline

• Background

• Deep convolutional neural network (DCNN)

• Residue number system (RNS)

• DCNN using nested RNS (NRNS)

• Experimental results

• Conclusion


Page 3: FPL15 talk: Deep Convolutional Neural Network on FPGA

Background

• Deep Neural Network

– Multi-layer neuron model

– Used for embedded vision system

• FPGA realization is suitable for real-time systems

– Faster than a CPU

– Lower power consumption than a GPU

– Fixed-point representation is sufficient

• High performance per area is desired


Page 4: FPL15 talk: Deep Convolutional Neural Network on FPGA

Deep Convolutional Neural Network (DCNN)


Page 5: FPL15 talk: Deep Convolutional Neural Network on FPGA

Artificial Neuron

[Figure: artificial neuron with inputs x_0 = 1 (bias), x_1, x_2, ..., x_N, weights w_0 (bias), w_1, ..., w_N, internal state u, activation f(u), and output y]

x_i: Input signal
w_i: Weight
u: Internal state
f(u): Activation function (Sigmoid, ReLU, etc.)
y: Output signal


y = f(u)

u = Σ_{i=0}^{N} w_i x_i
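As a rough illustration (not part of the talk), here is a minimal Python sketch of the neuron above, using ReLU as the activation f(u); the function name and the example values are made up:

```python
# Sketch of an artificial neuron: u = sum_{i=0}^{N} w_i * x_i, y = f(u).
# The bias is folded in as w_0 with a constant input x_0 = 1.

def neuron(x, w, f=lambda u: max(0.0, u)):    # ReLU as the activation f(u)
    u = sum(wi * xi for wi, xi in zip(w, x))  # internal state u
    return f(u)                               # output signal y

# Example: constant bias input x_0 = 1 plus two real inputs.
print(neuron([1.0, 0.5, -2.0], [0.1, 0.8, 0.3]))  # ReLU(-0.1) = 0.0
```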

Page 6: FPL15 talk: Deep Convolutional Neural Network on FPGA

Deep Convolutional Neural Network (DCNN) for ImageNet

• 2D convolutional layer, pooling layer, and fully connected layer


Page 7: FPL15 talk: Deep Convolutional Neural Network on FPGA

2D Convolutional Layer

• Consumes more than 90% of the computation time

– Multiply-accumulate (MAC) operations are performed


z_ij = y_ij + Σ_{m=0}^{K-1} Σ_{n=0}^{K-1} x_{i+m, j+n} w_mn

x_ij: Input signal
y_ij: Bias
w_mn: Weight
K: Kernel size
z_ij: Output signal
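As an illustration only (this is not the FPGA datapath), a minimal Python sketch of the convolution formula above, with the bias y_ij taken as a single scalar:

```python
# Naive 2D convolution: z[i][j] = bias + sum_{m,n} x[i+m][j+n] * w[m][n].
def conv2d(x, w, bias):
    K = len(w)                            # kernel size K x K
    H, W = len(x), len(x[0])
    out_h, out_w = H - K + 1, W - K + 1
    z = [[bias] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            for m in range(K):            # multiply-accumulate (MAC) loop
                for n in range(K):
                    z[i][j] += x[i + m][j + n] * w[m][n]
    return z

x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0], [0, 1]]                      # K = 2
print(conv2d(x, w, bias=0))               # [[6, 8], [12, 14]]
```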

Page 8: FPL15 talk: Deep Convolutional Neural Network on FPGA

Realization of 2D Convolutional Layer

• Requires more than a billion MACs!

• Our realization

– Time multiplexing

– Nested Residue Number System (NRNS)

[Figure: a conventional binary datapath on the FPGA (off-chip memory, BRAMs, multiplier and adder trees) versus the proposed datapath, which places Binary-to-RNS and RNS-to-Binary converters around the BRAMs and MAC units]

Page 9: FPL15 talk: Deep Convolutional Neural Network on FPGA

Residue Number System (RNS)


Page 10: FPL15 talk: Deep Convolutional Neural Network on FPGA

Residue Number System (RNS)

Defined by a set of L mutually prime integer constants 〈m1, m2, ..., mL〉

– No two moduli share a common factor

– Typically, prime numbers are used as the moduli

An arbitrary integer X can be uniquely represented by a tuple of L integers (X_1, X_2, ..., X_L), where

X_i = X mod m_i

Dynamic range: M = Π_{i=1}^{L} m_i
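A minimal Python sketch (not from the talk) of the definitions above, checking that the residue representation is unique within the dynamic range M:

```python
# An integer X in [0, M) is represented by its residues X_i = X mod m_i.
from math import prod

moduli = (3, 4, 5)                       # pairwise coprime moduli
M = prod(moduli)                         # dynamic range M = 60

def to_rns(x):
    return tuple(x % m for m in moduli)

# Every X in [0, M) gets a distinct residue tuple.
assert len({to_rns(x) for x in range(M)}) == M
print(to_rns(8))                         # (2, 0, 3)
```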

Page 11: FPL15 talk: Deep Convolutional Neural Network on FPGA

Parallel Multiplication

Multiplication on RNS

Moduli set 〈3, 4, 5〉, X = 8, Y = 2

Z = X × Y = 16 = (1, 0, 1)

X = (2, 0, 3), Y = (2, 2, 2)

Z = (4 mod 3, 0 mod 4, 6 mod 5)

= (1, 0, 1) = 16


[Figure: Binary2RNS conversion → digit-wise modulo multiplication → RNS2Binary conversion]
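A small Python sketch (illustrative only; the decode-by-search stands in for a real RNS-to-binary converter) reproducing the worked example: multiplication is performed digit-wise, modulo each m_i, with no carries between channels:

```python
moduli = (3, 4, 5)

def to_rns(x):
    return tuple(x % m for m in moduli)

def rns_mul(a, b):
    # One independent modulo multiplier per channel.
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

def from_rns(r):
    # Brute-force decode over the dynamic range (CRT in practice).
    return next(x for x in range(3 * 4 * 5) if to_rns(x) == r)

X, Y = to_rns(8), to_rns(2)              # (2, 0, 3) and (2, 2, 2)
Z = rns_mul(X, Y)                        # (1, 0, 1)
print(Z, from_rns(Z))                    # (1, 0, 1) 16
```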

Page 12: FPL15 talk: Deep Convolutional Neural Network on FPGA

Binary2RNS Converter


X   X mod 2   X mod 3   X mod 4
0      0         0         0
1      1         1         1
2      0         2         2
3      1         0         3
4      0         1         0
5      1         2         1

Page 13: FPL15 talk: Deep Convolutional Neural Network on FPGA

Functional Decomposition

Decomposition chart with bound variables X1 = (x1, x2) (columns) and free variables X2 = (x3, x4) (rows):

X2 \ X1   00  01  10  11
  00       0   1   0   1
  01       1   1   1   1
  10       1   0   1   0
  11       1   0   1   0

Column multiplicity = 2, so a single bit h(X1) distinguishes the two distinct column patterns:

x1      0  0  1  1
x2      0  1  0  1
h(X1)   0  1  0  1

Remaining table g(h(X1), x3, x4):

x3 x4 \ h(X1)   0  1
    00          0  1
    01          1  1
    10          1  0
    11          1  0

Memory: 2^4 × 1 = 16 [bit] → 2^2 × 1 + 2^3 × 1 = 12 [bit]
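To make the idea concrete, here is a small Python sketch (the helper name is hypothetical, not from the talk) that counts the column multiplicity of a decomposition chart; a function is realizable as g(h(bound vars), free vars), where h needs only ceil(log2(multiplicity)) bits:

```python
def column_multiplicity(f, n_bound, n_free):
    # Each assignment of the bound variables selects one column of the chart.
    columns = set()
    for b in range(1 << n_bound):
        columns.add(tuple(f(b, fr) for fr in range(1 << n_free)))
    return len(columns)

# The chart above: bound X1 = (x1, x2), free X2 = (x3, x4).
chart = {0: (0, 1, 1, 1), 1: (1, 1, 0, 0), 2: (0, 1, 1, 1), 3: (1, 1, 0, 0)}
print(column_multiplicity(lambda b, fr: chart[b][fr], 2, 2))   # 2
```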

Page 14: FPL15 talk: Deep Convolutional Neural Network on FPGA

Decomposition Chart for X mod 3

Decomposition chart for the 5-bit input X, with bound variables X2 = (x3, x4, x5) (columns) and free variables X1 = (x1, x2) (rows):

X1 \ X2   000 001 010 011 100 101 110 111
  00        0   1   2   0   1   2   0   1
  01        1   2   0   1   2   0   1   2
  10        2   0   1   2   0   1   2   0
  11        0   1   2   0   1   2   0   1

(0 mod 3 = 0, 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, 4 mod 3 = 1, 5 mod 3 = 2, ..., 10 mod 3 = 1, ...)

Page 15: FPL15 talk: Deep Convolutional Neural Network on FPGA

Decomposition Chart for X mod 3

The chart has column multiplicity 3, so the identical columns are merged and indexed by h(X2):

X1 \ h(X2)   0  1  2
  00         0  1  2
  01         1  2  0
  10         2  0  1
  11         0  1  2

h(X2) is itself a small function of the bound variables:

x3      0  0  0  0  1  1  1  1
x4      0  0  1  1  0  0  1  1
x5      0  1  0  1  0  1  0  1
h(X2)   0  1  2  0  1  2  0  1

Page 16: FPL15 talk: Deep Convolutional Neural Network on FPGA

Binary2RNS Converter

[Figure: the Binary2RNS converter is realized by one LUT cascade per modulus (X mod m1, X mod m2, ...); each cell of a cascade is mapped to a BRAM]
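A minimal Python sketch (the function names are made up, and the chunk width is an arbitrary choice) of how such a cascade computes X mod m: each stage is a small lookup table taking the running remainder plus the next few input bits, exactly the kind of cell a BRAM can hold:

```python
def build_stage_table(m, chunk_bits):
    # table[(r, c)] = (r * 2**chunk_bits + c) % m  -- one cascade cell.
    return {(r, c): (r * (1 << chunk_bits) + c) % m
            for r in range(m) for c in range(1 << chunk_bits)}

def mod_by_lut_cascade(x, m, total_bits=8, chunk_bits=2):
    # total_bits must be a multiple of chunk_bits in this simple sketch.
    table = build_stage_table(m, chunk_bits)
    r = 0
    for shift in range(total_bits - chunk_bits, -1, -chunk_bits):
        chunk = (x >> shift) & ((1 << chunk_bits) - 1)   # next input bits
        r = table[(r, chunk)]                            # one cascade stage
    return r

assert all(mod_by_lut_cascade(x, 3) == x % 3 for x in range(256))
```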

Page 17: FPL15 talk: Deep Convolutional Neural Network on FPGA

RNS2Binary Converter (m=30)

For m = 30 = 2 × 3 × 5, each residue digit indexes a small table:

x1 : y1      x2 : y2      x3 : y3
 0 : 0        0 : 0        0 : 0
 1 : 15       1 : 10       1 : 6
              2 : 20       2 : 12
                           3 : 18
                           4 : 24

The table outputs are combined by modulo-m adders (with carry) to recover the binary value.
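A short Python sketch (illustrative; the variable names are made up) of this table-plus-adder structure, checked against X mod 30 for the moduli 2, 3, 5:

```python
# Each residue digit selects a precomputed value; a chain of mod-M adders
# combines them.  15, 10, 6 are the CRT weights for the moduli 2, 3, 5.
M = 30
y1 = {0: 0, 1: 15}                        # 15 = 1 (mod 2), 0 (mod 3), 0 (mod 5)
y2 = {0: 0, 1: 10, 2: 20}                 # 10 = 1 (mod 3), 0 (mod 2), 0 (mod 5)
y3 = {0: 0, 1: 6, 2: 12, 3: 18, 4: 24}    #  6 = 1 (mod 5), 0 (mod 2), 0 (mod 3)

def rns2bin(x1, x2, x3):
    return (y1[x1] + y2[x2] + y3[x3]) % M  # two modulo-M adders in hardware

assert all(rns2bin(x % 2, x % 3, x % 5) == x for x in range(M))
```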

Page 18: FPL15 talk: Deep Convolutional Neural Network on FPGA

Problem

• Moduli set of RNS consists of mutually prime numbers

– Sizes of the circuits are all different

• Example: <7, 11, 13>

[Figure: for <7, 11, 13>, the modulo-7 multiplier fits in a 6-input LUT (3-bit residues), while the modulo-11 and modulo-13 multipliers need 8-input LUTs (4-bit residues)]

Binary2RNS converter: by BRAMs

RNS2Binary converter: by DSP blocks and BRAMs

Page 19: FPL15 talk: Deep Convolutional Neural Network on FPGA

DCNN using Nested RNS


Page 20: FPL15 talk: Deep Convolutional Neural Network on FPGA

Nested RNS

• (Z1, Z2, …, Zi, …, ZL) → (Z1, Z2, …, (Zi1, Zi2, …, Zij), …, ZL)

• Ex: <7, 11, 13> × <7, 11, 13>

→ <7, <5,6,7>_11, <5,6,7>_13> × <7, <5,6,7>_11, <5,6,7>_13>


1. Reuse the same moduli set

2. Decompose a large modulus into smaller ones

(The subscript denotes the original modulus.)

Page 21: FPL15 talk: Deep Convolutional Neural Network on FPGA

Example of Nested RNS

• 19 × 22 (= 418) on <7, <5,6,7>_11, <5,6,7>_13>

19 × 22

= <5, 8, 6> × <1, 0, 9>

= <5, <3,2,1>_11, <1,0,6>_13> × <1, <0,0,0>_11, <4,3,2>_13>

= <5, <0,0,0>_11, <4,0,5>_13>

= <5, 0, 2>

= 418


(Steps: Binary2NRNS conversion → modulo multiplication on the nested RNS → RNS2Bin conversion)
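A Python sketch (the helpers are illustrative; the decode-by-search stands in for the hardware <5,6,7>2Bin converter) checking the worked example: the residues mod 11 and mod 13 are themselves held in the inner RNS <5, 6, 7>, multiplied there, and only reduced back mod 11 / mod 13 when decoding:

```python
inner = (5, 6, 7)                       # inner moduli set, range 5*6*7 = 210

def to_inner(x):
    return tuple(x % m for m in inner)

def inner_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, inner))

def from_inner(r):                      # brute-force CRT decode
    return next(x for x in range(210) if to_inner(x) == r)

def to_nrns(x):
    return (x % 7, to_inner(x % 11), to_inner(x % 13))

a, b = to_nrns(19), to_nrns(22)   # (5,(3,2,1),(1,0,6)) and (1,(0,0,0),(4,3,2))
z7 = (a[0] * b[0]) % 7
z11 = from_inner(inner_mul(a[1], b[1])) % 11
z13 = from_inner(inner_mul(a[2], b[2])) % 13
print((z7, z11, z13))                   # (5, 0, 2), i.e. 418 in <7, 11, 13>
```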

Page 22: FPL15 talk: Deep Convolutional Neural Network on FPGA

Realization of Nested RNS


[Figure: Binary2NRNS conversion (Bin2<7,11,13> followed by Bin2<5,6,7>, realized by BRAMs) → modulo operations on 6-input LUTs → NRNS2Binary conversion (<5,6,7>2Bin followed by <7,11,13>2Bin, realized by BRAMs and DSP blocks)]

Page 23: FPL15 talk: Deep Convolutional Neural Network on FPGA

Moduli Set for NRNS

• Conventional RNS (uses 23 moduli)

<3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83>

• Applied the NRNS to moduli that are greater than 15

<3, 4, 5, 7, 11, 13,

<3,4,5,7,11,13>_17,

<3,4,5,7,11,13>_19,

<3,4,5,7,11,13, <3,4,5,7,11,13>_17>_23,

<3,4,5,7,11,13, <3,4,5,7,11,13>_17>_29,

…, <3,4,5,7,11,13, <3,4,5,7,11,13>_17>_83>

All the 48-bit MAC operations are decomposed into 4-bit ones
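As a sanity check (an illustration, not from the slides), the dynamic range of the 23-modulus outer set is far larger than 2^48, so it can hold the 48-bit MAC values before they are decomposed into 4-bit digits:

```python
from math import prod

moduli = [3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83]            # 23 moduli
M = prod(moduli)
print(len(moduli), M.bit_length())       # 23 moduli, dynamic range > 100 bits
assert M > 1 << 48                       # covers 48-bit fixed-point values
```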

Page 24: FPL15 talk: Deep Convolutional Neural Network on FPGA

DCNN Architecture using the NRNS


[Figure: overall architecture: sequencer, DDR3 controller to the external DDR3 SODIMM, on-chip memory, parallel Bin2NRNS converters, 16 parallel modulo-mi 2D convolutional units with BRAMs, and tree-based NRNS2Bin (RNS2Bin) converters]

Page 25: FPL15 talk: Deep Convolutional Neural Network on FPGA

Experimental Results


Page 26: FPL15 talk: Deep Convolutional Neural Network on FPGA

Implementation Setup

• FPGA board: Xilinx VC707

– FPGA: Virtex7 VX485T

– 1 GB DDR3 SODIMM (bus @ 800 MHz, 64-bit width)

• Realized the pre-trained ImageNet network by Convnet2

– 48-bit fixed-point precision

• Synthesis tool: Xilinx Vivado 2014.1

– Timing constraint: 400 MHz


Page 27: FPL15 talk: Deep Convolutional Neural Network on FPGA

Comparison with Other Implementations

Design      Precision      Max. Freq. [MHz]   FPGA                Performance [GOPS]   Perf. per area [GOPS/Slice x 10^-4]
ASAP2009    16-bit fixed   115                Virtex5 LX330T        6.7                 1.3
PACT2010    fixed          125                Virtex5 SX240T        7.0                 1.9
FPL2009     48-bit fixed   125                Spartan3A DSP3400     5.3                 2.2
ISCA2010    48-bit fixed   200                Virtex5 SX240T       16.0                 4.3
ICCD2013    fixed          150                Virtex6 LX240T       17.0                 4.5
FPGA2015    32-bit float   100                Virtex7 VX485T       61.6                 8.1
Proposed    48-bit fixed   400                Virtex7 VX485T      132.2                25.2

Page 28: FPL15 talk: Deep Convolutional Neural Network on FPGA

Conclusion

• Realized the DCNN on the FPGA

– Time multiplexing

– Nested RNS

• MAC operations are realized by small LUTs

• Functional decomposition is used as follows:

– Bin2NRNS converter is realized by BRAMs

– NRNS2Bin converter is realized by DSP blocks and BRAMs

• Performance per area (GOPS/Slice)

– 5.86 times higher than ISCA10's
