design of fpga-based accelerator for convolutional neural

Research ArticleDesign of FPGA-Based Accelerator forConvolutional Neural Network under HeterogeneousComputing Framework with OpenCL

Li Luo1 YakunWu1 Fei Qiao 2 Yi Yang2 Qi Wei 2 Xiaobo Zhou1

Yongkai Fan 3 Shuzheng Xu2 Xinjun Liu 4 and Huazhong Yang2

1Department of Electronic Science and Technology Beijing Jiaotong University Beijing China2Department of Electronic Engineering Tsinghua University Beijing China3China University of Petroleum Beijing China4Department of Mechanical Engineering Tsinghua University Beijing China

Correspondence should be addressed to Qi Wei weiqitsinghuaeducn

Received 26 January 2018 Revised 5 April 2018 Accepted 28 May 2018 Published 3 July 2018

Academic Editor Michael Hubner

Copyright copy 2018 Li Luo et al This is an open access article distributed under the Creative Commons Attribution License whichpermits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

CPU has insufficient resources to satisfy the efficient computation of the convolution neural network (CNN) especially forembedded applications Therefore heterogeneous computing platforms are widely used to accelerate CNN tasks such as GPUFPGA and ASIC Among these FPGA can accelerate the computation by mapping the algorithm to the parallel hardware insteadof CPU which cannot fully exploit the parallelism By fully using the parallelism of the neural networkrsquos structure FPGA canreduce the computing costs and increase the computing speed However the development of FPGA requires great design skillsAs a heterogeneous development platform OpenCL has some advantages such as high abstraction level short development cycleand strong portability which can make up for the lack of skilled designers This paper uses Xilinx SDAccel to realize the parallelacceleration of CNN task and it also proposes an optimizing strategy of single convolutional layer to accelerate CNN Simulationresults show that the calculation speed could be improved by adopting the proposed optimizing strategy Compared with thebaseline design the strategy of single convolutional layer could increase the computing speed 14 times Performance of the wholeCNN task could be improved 2 times more than before and the speed of image classification could attain more than 48 fps

1 Introduction

Deep learning is currently the most popular field in recentyears It processes complex input data based on themultilayerneural network Deep learning has a significant effect on theanalysis of machine vision [1] video surveillance [2] andother information [3] which solves the problem of patternrecognition in many complex scenes It is also important forthe development of artificial intelligence As a classical modelof the deep learning system convolutional neural network(CNN) has an enormous advantage in image recognition[4] Especially in the ImageNet Large Scale Visual Recogni-tion Challenge 2012 (ILSVRC2012) [5] CNN improved theaccuracy rate from 70 to 80 which is far better than

the traditional methods Actually CNN has broad space forfurther development which has been widely used in facerecognition [6] audio retrieval hand posture recognition [7]etc [8] It becomes increasingly difficult for CPU to realizecomplex CNN [9] while using the heterogeneous computingplatform is an emerging method to accelerate CNN and thehardware such as GPU [10 11] or FPGA [12 13]

Compared to CPUs the neural networkrsquos acceleratorsbased on FPGA are increasingly popular because of theirhigher efficiency [14] Over the baseline CPU FPGA accel-erators deliver one to two orders of magnitude speedups [1]Moreover the FPGA1024 design delivers almost 50x perfor-mance improvement over the baseline CPU (Intel Xeon E5-2699v3 server) [14] In addition due to the structure of CNN

HindawiInternational Journal of Reconfigurable ComputingVolume 2018 Article ID 1785892 10 pageshttpsdoiorg10115520181785892

2 International Journal of Reconfigurable Computing

each layer of calculation is independent of others and theinterlayered structure can be dealt with like a flow structureMoreover various modules on FPGA can be executed inparallel [15] It is very suitable to accelerate CNN by mappingthe flow structure on FPGA However in order to useFPGA the development process requires that the hardwaredescription language as well as some knowledge about theinternal structure of FPGA must be comprehended whichlimits the designersrsquo ability to design to some extent

OpenCL provides a new method for the acceleration ofCNN based on FPGA It is a unified programming envi-ronment [16] and a parallel programming standard for theheterogeneous systems Based on the traditional C languagethe codewritten inOpenCL is highly portable [17]Moreoveron account of the existence of the framework about OpenCLto FPGA designers need neither to synthesize hardwaredescription language nor to understand the hardware circuitstructure which greatly lowers the using threshold [18]

This paper proposed a parallel acceleration strategy ofCNN based on FPGA with OpenCL by the use of XilinxSDAccel As the convolution layer is the most complicatedpart of CNN we first optimize the single convolution layerand then on this basis we accelerate complete CNN andexplore different optimization strategies

Throughout this paper we deal with the following thegeneral concept of CNN and the traditional method of CNNon CPU in Section 2 the knowledge of heterogeneous com-puting platform about OpenCL in Section 3 optimization ofa single convolution layer in Section 4 the acceleration ofcomplete CNN in Section 5 the experimental environmentsand the analysis of experimental results in Section 6 and theconclusions in Section 7

2 CNN Information andTraditional Acceleration

21The Basic Structure of Convolution Neural Network CNNis an efficient identification method which has receivedconsiderable attention in recent years CNN is a hierarchicalmodel whose basic structure consists of input layers con-volution layers pooling layers fully connected layers andoutput layers The main characteristic is that the convolutionlayer and the pooling layer alternate and the local feature ofthe previous layer gets the characteristics of the next layerof convolution by the weight of the convolution Throughthe connections of multiple convolution with pooling layersimages gradually change from the surface features to the deepones then the features are classified by the fully connectedlayer and the output layer and finally the images are dividedinto categories Therefore according to the function of everylayer CNNcanbe divided into twoparts the feature extractoris composed of the input layer the convolution layer and thepooling layer while the fully connected layer and the outputlayer constitute the classifier

CNN was first put forward by a professor from Uni-versity of Toronto called LeCun [3] Based on the researchof Fukushima he used BP algorithm to design the firstCNN named LeNet-5 Because LeNet-5 is one of the mostclassic structure of CNN the following introductions and

Figure 1 LeNet-5 [3]

experiments are taken with the example of LeNet-5 LeNet-5 is CNN for handwritten numeral recognition Figure 1 isthe structure of LeNet-5 which contains three convolutionlayers two pooling layers a fully connected layer and aGaussian layer

211 The Convolution Layer In the convolution layer theinput data is convolved with a learning convolution kerneland the result of convolution forms the characteristic patternof this layer By the activating function each characteristicpattern can be combinedwith the numerical value of previouscharacteristic patterns Its expression is shown in

119909119897119895 = 119891(sum119894119890119872119895

119909119894minus1119894 lowast 119896119897119894119895 + 119887119897119895) (1)

where lsquoIrsquo represents the number of layers lsquokrsquo representsthe weight of the convolution kernel of l layer and lsquobrsquorepresents the bias of the characteristic pattern after theconvolution f(x) is the activation function like sigmoid andtanh

In LeNet-5 model for example the input size is 32lowast32 inC1 Layer the convolution kernelrsquos size is 5lowast5 and the size ofC1 Layer is 28lowast28 (28=32-5+1) It can be seen from Figure 1that C1 Layer has six two-dimensional planes totally and eachplane has 26 (26=5lowast5+1) parameters Therefore C1 Layerhas 156 parameters and 122304 (122304=(5lowast5+1)lowast6lowast28lowast28)connections The same is true for C3 and C5 Layer

212 The Pooling Layer The pooling layer is also called thesubsampling layer It usually connects with the convolutionlayer Through partial correlation principle on the one handit can improve the robustness of the system and on theother hand it can reduce the calculation of the characteristicpattern Its expression is shown in

119909119897119895 = 119891 (120573119897119895119889119886119908119899 (119909119897minus1119895 ) + 119887119894119895) (2)

lsquodawn(119909119897minus1119895 )rsquo represents the pooling function Beta repre-sents the weight coefficient of the pooling layer lsquobrsquorepresentsthe bias of the characteristic pattern after the pooling Themain principle is to divide the image into nlowastn image blocksthen sum the pixels in each image block and take the meanor the maximum valueTherefore the output image is the sizeof the input image of 1n Among them every characteristicpattern has its own beta and b

In LeNet-5 model S2 consists of six characteristic pat-terns of the size 14 x 14 Each neuron in the characteristic

International Journal of Reconfigurable Computing 3

(a) Full connection layer model (b) Convolutional layer model

Figure 2

Figure 3 Workflow

pattern connects with the 2 x 2 neighborhood of the cor-responding characteristic pattern in C1 Layer Because eachneuronrsquos sensory field does not overlap every characteristicpattern in S2 is one quarter of the size of the characteristicpattern in C1 S2 Layer has 12 (12=(1+1)lowast6) parametersand 5880 (5880=14lowast14lowast(4+1)lowast6) connections S4 samplingLayerrsquos analysis is the same

213 The Fully Connected Layer When neurons of this layerare connected with each neuron in the upper layer it formsa fully connected layer After multiple feature extraction ofmultiple convolution layers and pooling layers CNN usesthe fully connected layer to classify the extracted features Itsexpression is shown in

119909119897 = 119891 (119906119897) (3)

119906119897 = 119908119897119909119897minus1 + 119887119897 (4)

In LeNet-5 model F6 is fully connected with C5 whichincludes 84 neurons 10164 parameters and 10164 connec-tions

Comparing Figure 2(a) with Figure 2(b) we can see thedifference between the convolution layer and the fully con-nected layer In the convolution layer the connections andparameters are reduced by means of local connection andweight sharing In the fully connected layer neurons of the

layer are connected with all the neurons in the upper layerand its parameters and connections are numerous so thefully connected layer generally appears near the output partof CNN

22 The Implementation of CNN by CPU

221 The Overall Design The system is implemented byC++ language The abstract base class is defined then theconvolution layer the fully connected layer and the poolinglayer are all derived from the abstract base class CNNis responsible for IO and the configuration of all CNNstructures and calls OpenCLrsquos API for the acceleration

Figure 3 shows the workflow of the entire system Firstread the configuration file of CNN and then choose the wayof calculation according to the need Finally process all inputdata in turn and get the elapsed time

222 Test Results The system is tested on the server plat-form and the convolution neural network is LeNet-5 Themain parameters of the platform are shown in Table 3 Thebenchmarks of CNN which are small convolutional coresand LeNet-5 network are simple enough Therefore for cur-rent evaluations the experimental results have no cumulativeeffectThe results ofmultiple experiments are almost constantthroughmultiple running evaluations Table 1 lists the single


Table 1 The single thread calculation time of the LeNet-5 layers on CPU

Layer C1 S2 C3 S4 C5 F6 O7 TotalTime (ms) 414 014 786 004 157 016 002 1393

Table 2 Hardware on Xilinx FPGA[25]

OpenCL storage structure Hardware implementation about Xilinx FPGAGlobal memory The DDR storageLocal memory The on-chip Block RAMPrivate memory The on-chip registers

thread calculation timeof the LeNet-5 layers on serverwithCPU

3 The Heterogeneous ComputingPlatform about OpenCL

OpenCL is an industry-standard framework about heteroge-neous platforms made up of CPU GPU [19] FPGA [20] andother processors By defining a unified framework OpenCLsignificantly increases the portability of programs At thesame time OpenCL allows developers to make specific opti-mizations for different hardware if they needOpenCL frame-work mainly includes platform models memory modelsprogramming models and execution models [21] Similar toCUDA OpenCL provides a standard framework for parallelprogramming and can support programming a variety ofhardware platforms including FPGAs

Since 2013 Altera and Xilinx started to adopt OpenCLSDKs for their FPGAs [22 23] It makes more developers takeadvantage of the high-performance and low power benefits todesign for FPGAs

OpenCL mainly faces parallel applications and supportstwo parallel programming models [24] data parallelismand task parallelism Data parallelism is implemented withmultiple work items in parallel while task parallelism is acommand queue that relies on out-of-order execution In thispaper the data parallel model is adopted to accelerate thesingle convolution layer while CNN adopts the task parallelmodel to accelerate

4 The Optimization of the Convolution Layer

In image recognition the convolution plays a key role Itevolves from two aspects one is the delay network of soundprocessing and the other is the algorithm to extract thefeature points on the image processing For the latter theconvolution is to filter the image and make some eigenvalueextraction As it can be seen from Table 1 90 of CNNis consumed in the convolution layer [26] Therefore it isnecessary to optimize the convolution part We can thinkabout algorithms hardware features and so on

41 LoopOptimization Byusingmore nested loops the datarsquosreused effect can be improved to accelerate the computationof the convolution layer In order to further optimize theconvolution layer when there is no dependence in the loop

Figure 4

Figure 5

we can use loop unrolling to reduce the delay by consumingmore hardware resources and parallel execution of loops Asshown in Figure 4 the loop will be expanded with a base of2 and the theoretical runtime will be halved

However when the dependence exists in the loop Wecan solve this problem by cyclic pipelining Cyclic pipeliningis that the three stages of reading calculation and memoryare in the working state to improve throughput throughthe pipeline operation Figure 5 shows how to use cyclicpipelining

42 On-Chip Memory In OpenCL storage structure thememory is divided into the globalmemory the localmemoryand the private memory Table 2 lists this hardware onXilinx FPGA [25]

Since the global memory is mainly used for the datatransmission between Host and FPGA by adopting the DDR


Table 3 TheMain Parameters of The Host

Feature SpecificationCPU Intel Xeon E3-1230 V2330GHzMemory 32GBOS CentOS 65 FinalOpenCL SDK Xilinx SDAccel 20153(with OpenCL 10)

storage the access delay is longer than before Meanwhile aprivate memory adopts the on-chip registers with minimumdelay except for large amount of data of CNN it will consumelarge amounts of the resources and the local memory is acompromise Its access latency is small while it does not takeup a large number of resources Therefore before the actualcalculation starts the required data can be read from theglobal memory to the local memory and then read back afterthe calculation It can further shorten the computation timeof the convolution layer and reduce the delay In additionif the same image convolved we can convolve all of theconvolution kernels together and store these separately

43 Multiple Compute Units On the FPGA platform hard-ware resources are programmable and there is no precon-figured core which gives users greater freedom The coreis the compute units which can calculate the work itemFor increasing the time parallelism the Xilinx OpenCL SDKautomatically assigns work items to different compute unitsat runtime The number of compute units are limited bytwo factors one is the total resources on FPGA and theother is the maximum number of compute units supportedby SDAccel In practice SDAccel currently supports up to 10compute units but they do not make full use of the resourceon FPGA It can be assumed that if SDAccel adds the numberof compute units the time parallelism can be higher

5 The Optimization of CNN

This part mainly considers the acceleration of CNN from twoaspects the storage and the pipeline

51 The Global Memory The global memory is implementedby DDR storage to exchange the data between Host andFPGA However there are a number of layers in CNN andthe cache between layer and layer is only the intermediateresult of calculation without the need to be visible to theHost OpenCL specifies that the local memory can only beused within the kernel Therefore these caches cannot beimplemented in the local memory

In this case SDAccel provides an optimization of theglobal memory By defining the global memory outside thekernel SDAccel automatically uses Block RAM on the chipTherefore by defining the caches between layers and layersby the global memory the computation delay will be furtherreduced Figure 6 shows how to use the global memory onthe chip

52The Pipelining of CNN InOpenCL all compute tasks canbe represented as an event The Host submits the compute

Figure 6

tasks to the task queue in OpenCL which is assigned to thecompute device to complete the calculation The order ofexecution of events in a task queue can be sequential anddisorderly In an out-of-order execution queue a numberof dependent events can be set for every event to ensuredata dependencies OpenCL commands that the queue doesnot run the event until all dependent events are completedThis kind of event mechanism is well suited to realize thepipelining of CNN

521 The Dependence Relationship of CNN The key of thepipelining of CNN is the relationship between events For Nlayers of CNN an input sample can generate (N+2) eventsso an event can be represented as a binary group (i j)representing the jth event of the ith input sample The arrowsymbol (i1 j1)997888rarr(i2 j2) indicates that the event (i2 j2)depends on the event (i1 j1) so the event (i2 j2) must waitfor the event (i1 j1) to be completed

After the analysis two dependencies exist in CNN whichare expressed as Theorems 1 and 2

Theorem 1 ((i j-1) 997888rarr (i j)) Event (i j) depends on event (ij-1) Obviously for the same input sample the calculation of theLayer j in CNN depends on the calculation result of the Layer(j-1)

Theorem 2 ((i-1 j+1) 997888rarr (i j)) Event (i j) is dependent onevent (i - 1 j + 1) There is only one cache area between Layer jand Layer (j+1) so the calculation of Layer j must wait for thecalculation of Layer (j+1) before it outputs the calculation resultto the cache between the two

Figure 7 shows the dependency among OpenCL eventsgenerated by the four input samples in three-layer CNNEach square represents an event the horizontal axis repre-sents Event j and the vertical axis represents Input Sample i


i

m

n

Figure 7

RD L0 L1 L2 WT RD L0 L1 L2 WT RD L0 L1 L2 WT RD L0 L1 L2 WTt

Figure 8

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WTt

Figure 9

Figure 8 shows a sequence diagram in which CNN isnot streamlined You can see that all events are executedsequentially

Figure 9 shows a sequence diagram in which CNN isstreamlined Due to the pipelining of each of the two levelsthe time required to process an input sample has not changedbut overall throughput has been increased by about N2

Theorem 3 represents the time t required to deal with aninput sample on average

Theorem 3 (T = max(t0 + t1 t1 + t2 tj-2 + tjminus1))522 The Realization of OpenCL OpenCL API providesthat all dependent events must be continuously distributedin memory when setting dependent events for an eventIt means that all events connected by the dotted line

in Figure 7 should be continuously distributed in thememory To avoid unnecessary losses all events need tobe remapped to a new two-dimensional coordinate system(m n) In this frame all events of the same n coordinatesform a one-dimensional array and can be called directly byOpenCL API To define j as the total events of an inputsample Theorem 4 represents the mapping relation fromcoordinates (i j) to coordinates (m n) and Figure 10represents the arrangement of events in (m n) coordinatesafter the mapping

Theorem 4 The mapping relation from coordinates (i j) tocoordinates (m n) is as follows

m = min (i [ 119869 minus 1 minus 1198952 ]) (5)

n = j + 2i (6)

6 Experiment Environment andResult Analysis

61 Experimental Environment The laboratory server is usedto set up the experimental environment in which the Hostadopts the CPU and the computing equipment uses FPGA tocomplete the experiment and measure the data

611 Host Configuration Table 3 lists the main parametersof the Host The CPU is Intel Xeon e3-1230 V2330GHz


Table 4 Specific parameters of alpha data ADM-PCIE-7V3 FPGA

Feature SpecificationMemory Two 8GB DDRS memory speeds up to 1333MTs of Flip Flops 866400 of LUTs 433200Device of DSP Slices 3600 of Block RAMs 1470

Table 5 The optimization results

Optimization Methods FFs LUTs DSPs Block RAMs Time(ms)Baseline 5360 8793 16 18 504Loop Pipeline 5374 8814 16 18 500Loop Unroll 5533 10214 16 18 279

Table 6 Calculating time with different number of compute units

Compute unit 1 2 3 4 5 6 7 8 9 10Computing time 279 145 108 74 73 59 58 42 39 35

n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10

the operating system is CentOS 65 Final and the XilinxOpenCL SDK is SDAccel 20153 which supports OpenCL10

612 FPGAConfiguration Thecomputing device uses AlphaData ADM-PCIE-7V3 FPGA and its FPGArsquos core uses XilinxVirtex-7XC7VX690T-2 which includes dual channel DDR3memory (1333MTs) Table 4 lists its specific parameters

613 CNN Benchmark CNN for testing uses LeNet-5 intro-duced in Section 2As a classical CNN LeNet-5 can better testthe performance of the acceleration on FPGA At the sametime the whole system stores data by XML format whichcan automatically generate OpenCL code according to theparameters of CNN

62 Results Analysis

621 Analysis of Optimization Results about the ConvolutionLayer The optimization of the convolution layer is mainlythe C3 Layer in LeNet-5 After using a single computeunit of loop tiling and local memory loop unrolling andcyclic pipelining are used to optimize it Table 5 lists theoptimization results and the loop unrolling increases thecalculation speed by 45 and consumes more computa-tional resources

The optimization is for a single compute unit If we com-bine multiple compute units on the FPGA the computationtime is lessTable 6 shows the value of calculating timewithdifferent numbers of computing units In current evalua-tions since the clock frequency of the FPGA developmentboard is fixed and the clock cycles of selected benchmarkof CNN are also constant The running time of multipleevaluations is constant as well In addition we have no OSinteractions and complex memory management so the run-ning time is not a randomvalue throughmultiple evaluationsFigure 11 shows the change trend of the calculation timewith the number of compute units We can see that thecomputation time is basically decreasing with the number of


Table 7 The assignments of compute units

Layer C1 S2 C3 S4 C5 F6 O7Number of Compute Units 2 1 3 1 1 1 1

Table 8 The resources of every compute unit on FPGA

Layer FFs LUTs DSPs Block RAMsC1lowast2 8331 11762 19 20S2 6967 10672 15 31C3lowast3 8160 15935 17 18S4 6777 10071 15 9C5 7141 11593 16 136F6 6637 10341 15 10O7 5087 4488 12 4SUM() 74751(863) 118494(2732) 145(403) 284(1932)

Figure 11 The calculation time with the number of compute units

compute units but the scheduling unit also produces somelosses so it is not completely consistent with the curve inFigure 11

622 Analysis of Optimization Results of CNN In sequen-tial execution all OpenCL kernels calls will be executedsequentially and SDAccel supports up to 10 compute unitsThe convolution layer takes the most time so two or threecompute units are configured for the convolution layerTable 7 lists the assignments of compute units Howeverthough themost compute units have been used the resourcesof FPGA are still small enough to give full play to theperformance of FPGA

Table 8 lists the resources of FPGA that are consumedby every compute unit We can see that if more computeunits can be supported in parallel the performance willimprove further

Figure 12 shows the average running time of everyevent in the program and we can see that even afteroptimization the two convolutions layers (event 2 andevent 4) consume the most time

Figure 13 shows the time distribution of the first twoeventswithout streamlineEach event is represented as a linesegment where left endpointrsquos coordinate is lsquostart timersquo and

Figure 12 The average running time of every event

Figure 13 The time distribution without streamline

right endpointrsquos coordinate is lsquoend timersquo It can be found in allevents being executed sequentially

Figure 14 shows the time distribution of the eventswith streamline It shows that both the second convolutionlayer of the first input and the first convolution layer ofthe second input can be calculated on FPGA The overallthroughput on FPGA is also improved which is representedby the properly working pipeline Before the pipeline thethroughput of the system is 244 FPS while it reaches 485FPS Now the throughput has doubled


Figure 14 The time distribution with streamline

7 Conclusion

This paper is devoted to the parallel acceleration of CNNbased on FPGA with OpenCL by using Xilinx SDAcceltools Taking LeNet-5 as an example the acceleration ofconvolution layer and CNN is studied On the convolutionlayer by adopting the loop optimization and memory access-ing methods the computation speed could be increased 14times of processing convolution layer while the whole systemprocessing speed would be improved 2 times where theoverall speed reaches 485 fps

Due to their own characteristics CPU could not fullyexploit the parallelism of CNN algorithms and GPU wouldconsume much more power Alternatively by fully usingthe parallelism of the neural networkrsquos structure FPGA cannot only reduce the computing costs but also increase thecomputing speed Currently Xilinx SDAccelrsquos tool chaincould not support up to 10 compute units in parallel untilnow but FPGA has its redundant resources to support morecomputing unitsWith the increasing development of the toolchain in the future the parallel acceleration of CNN based onFPGA with OpenCL has its prospects

Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

The authors would like to acknowledge the support fromNational Natural Science Foundation of China under Grantno 91648116 and National Key RampDProgram of China underGrant no 2017YFC1500601 The authors also acknowledgethe support from Xilinx University Program

References

[1] C Farabet C Poulet J Y Han and Y LeCun ldquoCNP an FPGA-based processor for Convolutional Networksrdquo in Proceedings of

the International Conference on Field Programmable Logic andApplications pp 32ndash37 Prague Czech September 2009

[2] S Ji W Xu M Yang and K Yu ldquo3D Convolutional neuralnetworks for human action recognitionrdquo IEEE Transactions onPattern Analysis andMachine Intelligence vol 35 no 1 pp 221ndash231 2013

[3] Y LeCun L Bottou Y Bengio and P Haffner ldquoGradient-basedlearning applied to document recognitionrdquo Proceedings of theIEEE vol 86 no 11 pp 2278ndash2324 1998

[4] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classi-fication with deep convolutional neural networksrdquo in Advancesin Neural Information Processing Systems F Pereira C BurgesL Bottou and K Weinberger Eds vol 60 pp 1097ndash1105 2017

[5] A Krizhevsky I Sutskever and G E Hinton ldquoImageNetclassification with deep convolutional neural networksrdquo inAdvances in Neural Information Processing Systems F PereiraC J C Burges L Bottou and etal Eds vol 60 pp 1097ndash11052012

[6] H Li Z Lin X Shen J Brandt and G Hua ldquoA convolutionalneural network cascade for face detectionrdquo in Proceedings ofthe 2015 IEEE Conference on Computer Vision and PatternRecognition (CVPR) pp 5325ndash5334 Boston MA USA June2015

[7] P Barros S Magg C Weber and S Wermter ldquoA MultichannelConvolutional Neural Network for Hand Posture Recognitionrdquoin Artificial Neural Networks and Machine Learning ndash ICANN2014 vol 8681 of Lecture Notes in Computer Science pp 403ndash410 Springer International Publishing Cham 2014

[8] T Yang Cellular Neural Networks and Image Processing NovaScience Publishers 2002

[9] C Farabet B Martini P Akselrod S Talay Y LeCun andE Culurciello ldquoHardware accelerated convolutional neuralnetworks for synthetic vision systemsrdquo in Proceedings of theIEEE International Symposium on Circuits and Systems (ISCASrsquo10) pp 257ndash260 IEEE Paris France May-June 2010

[10] I Buck ldquoGPU Computing Programming a Massively ParallelProcessorrdquo in Proceedings of the International Symposium onCodeGeneration andOptimization (CGOrsquo07) pp 17-17 San JoseCA March 2007

[11] D Strigl K Kofler and S Podlipnig ldquoPerformance and scalabil-ity of GPU-based convolutional neural networksrdquo in Proceed-ings of the 18th Euromicro Conference on Parallel DistributedandNetwork-Based Processing (PDP rsquo10) pp 317ndash324 Pisa ItalyFebruary 2010

[12] G Borgese S Vena P Pantano C Pace and E Bilotta ldquoSimula-tion Modeling and Analysis of Soliton Waves Interaction andPropagation in CNN Transmission Lines for Innovative DataCommunication and Processingrdquo Discrete Dynamics in Natureand Society vol 2015 pp 1ndash13 2015

[13] C Zhang P Li G Sun Y Guan B Xiao and J CongldquoOptimizing FPGA-based accelerator design for deep convo-lutional neural networksrdquo in Proceedings of the ACMSIGDAInternational Symposium on Field-Programmable Gate Arrays(FPGA rsquo15) pp 161ndash170 USA February 2015

[14] E Nurvitadhi D Sheffield J Sim AMishra G Venkatesh andD Marr ldquoAccelerating binarized neural networks Comparisonof FPGA CPU GPU and ASICrdquo in Proceedings of the 15thInternational Conference on Field-Programmable Technology(FPT rsquo16) pp 77ndash84 December 2016

[15] A Fernandez R San Martin E Farguell and G E PazienzaldquoCellular Neural Networks simulation on a parallel graphics


processing unitrdquo in Proceedings of the 2008 11th InternationalWorkshop on Cellular Neural Networks and Their Applications- CNNA 2008 pp 208ndash212 Santiago de Composteia Spain July2008

[16] J E Stone D Gohara and G Shi ldquoOpenCL a parallelprogramming standard for heterogeneous computing systemsrdquoComputing in Science amp Engineering vol 12 no 3 Article ID5457293 pp 66ndash72 2010

[17] OpenCL ldquoThe open standard for parallel programming ofheterogeneous systemsrdquo The Khronos OpenCL Working Group2011

[18] H M Waidyasooriya T Endo M Hariyama and Y OhteraldquoOpenCL-Based FPGAAccelerator for 3D FDTDwith Periodicand Absorbing Boundary Conditionsrdquo International Journal ofReconfigurable Computing vol 2017 2017

[19] NVIDIA ldquoIntroduction to GPU Computing with OpenCLrdquohttpdeveloperdownloadnvidiacomCUDAtraining

[20] D Singh ldquoImplementing FPGA design with the OpenCLstandardrdquo in Altera Whitepaper 2011 Implementing FPGAdesign with the OpenCL standard in Altera Whitepaper

[21] Intel ldquoIntel A Development Environment for OpenCL Appli-cationsrdquo 2012 httpsoftwareintelcomen-usvcsourcetoolsopencl-sdk

[22] T S Czajkowski U Aydonat D Denisenko et al ldquoFromOpenCL to high-performance hardware on FPGAsrdquo inProceed-ings of the 22nd International Conference on Field ProgrammableLogic andApplications (FPL rsquo12) pp 531ndash534 IEEEAugust 2012

[23] Environment Installation and Licensing Guide Xilinx sdaccelSDAccel Development Xilinx 2015

[24] B Chapman G Jost and R v d Pas Using OpenMP PortableShared Memory Parallel Programming (Scientific and Engineer-ing Computation) The MIT Press 2007

[25] ldquoXilinx Limited Melita House 124 Bridge Rd Chertsey KT168LA United Kingdom SDAccel Development EnvironmentUser Guide 20153 ed 10 2015rdquo

[26] J Cong and B Xiao ldquoArtificial Neural Networks and MachineLearningrdquo inProceedings of the 24th International Conference onArtificial Neural Networks (ICANN rsquo14) vol 8681 pp 281ndash290Springer International Publishing Hamburg Germany

International Journal of

AerospaceEngineeringHindawiwwwhindawicom Volume 2018

RoboticsJournal of

Hindawiwwwhindawicom Volume 2018


Active and Passive Electronic Components

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawiwwwhindawicom

Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Control Scienceand Engineering

Journal of



Journal ofEngineeringVolume 2018

SensorsJournal of



RotatingMachinery


Modelling ampSimulationin EngineeringHindawiwwwhindawicom Volume 2018


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation


Hindawi

wwwhindawicom Volume 2018

Advances in

Multimedia

Submit your manuscripts atwwwhindawicom


each layer of calculation is independent of others and theinterlayered structure can be dealt with like a flow structureMoreover various modules on FPGA can be executed inparallel [15] It is very suitable to accelerate CNN by mappingthe flow structure on FPGA However in order to useFPGA the development process requires that the hardwaredescription language as well as some knowledge about theinternal structure of FPGA must be comprehended whichlimits the designersrsquo ability to design to some extent

OpenCL provides a new method for the acceleration ofCNN based on FPGA It is a unified programming envi-ronment [16] and a parallel programming standard for theheterogeneous systems Based on the traditional C languagethe codewritten inOpenCL is highly portable [17]Moreoveron account of the existence of the framework about OpenCLto FPGA designers need neither to synthesize hardwaredescription language nor to understand the hardware circuitstructure which greatly lowers the using threshold [18]

This paper proposed a parallel acceleration strategy ofCNN based on FPGA with OpenCL by the use of XilinxSDAccel As the convolution layer is the most complicatedpart of CNN we first optimize the single convolution layerand then on this basis we accelerate complete CNN andexplore different optimization strategies

Throughout this paper we deal with the following thegeneral concept of CNN and the traditional method of CNNon CPU in Section 2 the knowledge of heterogeneous com-puting platform about OpenCL in Section 3 optimization ofa single convolution layer in Section 4 the acceleration ofcomplete CNN in Section 5 the experimental environmentsand the analysis of experimental results in Section 6 and theconclusions in Section 7

2 CNN Information andTraditional Acceleration

21The Basic Structure of Convolution Neural Network CNNis an efficient identification method which has receivedconsiderable attention in recent years CNN is a hierarchicalmodel whose basic structure consists of input layers con-volution layers pooling layers fully connected layers andoutput layers The main characteristic is that the convolutionlayer and the pooling layer alternate and the local feature ofthe previous layer gets the characteristics of the next layerof convolution by the weight of the convolution Throughthe connections of multiple convolution with pooling layersimages gradually change from the surface features to the deepones then the features are classified by the fully connectedlayer and the output layer and finally the images are dividedinto categories Therefore according to the function of everylayer CNNcanbe divided into twoparts the feature extractoris composed of the input layer the convolution layer and thepooling layer while the fully connected layer and the outputlayer constitute the classifier

CNN was first put forward by a professor from Uni-versity of Toronto called LeCun [3] Based on the researchof Fukushima he used BP algorithm to design the firstCNN named LeNet-5 Because LeNet-5 is one of the mostclassic structure of CNN the following introductions and

Figure 1 LeNet-5 [3]

experiments are taken with the example of LeNet-5 LeNet-5 is CNN for handwritten numeral recognition Figure 1 isthe structure of LeNet-5 which contains three convolutionlayers two pooling layers a fully connected layer and aGaussian layer

211 The Convolution Layer In the convolution layer theinput data is convolved with a learning convolution kerneland the result of convolution forms the characteristic patternof this layer By the activating function each characteristicpattern can be combinedwith the numerical value of previouscharacteristic patterns Its expression is shown in

119909119897119895 = 119891(sum119894119890119872119895

119909119894minus1119894 lowast 119896119897119894119895 + 119887119897119895) (1)

where lsquoIrsquo represents the number of layers lsquokrsquo representsthe weight of the convolution kernel of l layer and lsquobrsquorepresents the bias of the characteristic pattern after theconvolution f(x) is the activation function like sigmoid andtanh

In LeNet-5 model for example the input size is 32lowast32 inC1 Layer the convolution kernelrsquos size is 5lowast5 and the size ofC1 Layer is 28lowast28 (28=32-5+1) It can be seen from Figure 1that C1 Layer has six two-dimensional planes totally and eachplane has 26 (26=5lowast5+1) parameters Therefore C1 Layerhas 156 parameters and 122304 (122304=(5lowast5+1)lowast6lowast28lowast28)connections The same is true for C3 and C5 Layer

212 The Pooling Layer The pooling layer is also called thesubsampling layer It usually connects with the convolutionlayer Through partial correlation principle on the one handit can improve the robustness of the system and on theother hand it can reduce the calculation of the characteristicpattern Its expression is shown in

119909119897119895 = 119891 (120573119897119895119889119886119908119899 (119909119897minus1119895 ) + 119887119894119895) (2)

lsquodawn(119909119897minus1119895 )rsquo represents the pooling function Beta repre-sents the weight coefficient of the pooling layer lsquobrsquorepresentsthe bias of the characteristic pattern after the pooling Themain principle is to divide the image into nlowastn image blocksthen sum the pixels in each image block and take the meanor the maximum valueTherefore the output image is the sizeof the input image of 1n Among them every characteristicpattern has its own beta and b

In LeNet-5 model S2 consists of six characteristic pat-terns of the size 14 x 14 Each neuron in the characteristic



Figure 2

Figure 3 Workflow



119909119897 = 119891 (119906119897) (3)

119906119897 = 119908119897119909119897minus1 + 119887119897 (4)





















Figure 4

Figure 5















Figure 6








i

m

n

Figure 7


Figure 8

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WTt

Figure 9








n = j + 2i (6)











n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10




62 Results Analysis




















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia




Figure 2

Figure 3 Workflow



119909119897 = 119891 (119906119897) (3)

119906119897 = 119908119897119909119897minus1 + 119887119897 (4)





















Figure 4

Figure 5















Figure 6








i

m

n

Figure 7


Figure 8

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WTt

Figure 9








n = j + 2i (6)











n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10




62 Results Analysis




















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia















Figure 4

Figure 5















Figure 6








i

m

n

Figure 7


Figure 8

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WTt

Figure 9








n = j + 2i (6)











n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10




62 Results Analysis




















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia












Figure 6








i

m

n

Figure 7


Figure 8

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WTt

Figure 9








n = j + 2i (6)











n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10




62 Results Analysis




















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



i

m

n

Figure 7


Figure 8

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WT

RD L0 L1 L2 WTt

Figure 9








n = j + 2i (6)











n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10




62 Results Analysis




















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia









n

m

0

1

2

3

4

5

6

7

8

9

10

0 1 2

Figure 10




62 Results Analysis




















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia



















7 Conclusion



Data Availability




Acknowledgments


References
































RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia

















RoboticsJournal of




VLSI Design



Shock and Vibration







Journal of



Volume 2018



Volume 2018


Journal of




SensorsJournal of



RotatingMachinery





Propagation






Hindawi


Advances in

Multimedia


design of fpga-based accelerator for convolutional neural

Documents