
Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2021

Convolutional Neural Network FPGA-accelerator on Intel DE10-Standard FPGA

Tianxu Yue


Master of Science Thesis in Electrical Engineering

Convolutional Neural Network FPGA-accelerator on Intel DE10-Standard FPGA

Tianxu Yue

LiTH-ISY-EX--21/5400--SE

Supervisor:

Mikael Olofsson, ISY, Linköping University

Examiner:

Mark Vesterbacka, ISY, Linköping University

Division of Integrated Circuits and Systems
Department of Electrical Engineering

Linköping University
SE-581 83 Linköping, Sweden

Copyright 2021 Tianxu Yue


Abstract

Convolutional neural networks (CNNs) have been used extensively in many areas, such as face and speech recognition, image searching and classification, and autonomous driving. Hence, CNN accelerators have become a trending research topic. Graphics processing units (GPUs) are widely applied in CNN accelerators. However, field-programmable gate arrays (FPGAs) offer higher energy and resource efficiency than GPUs, and high-level synthesis tools based on the Open Computing Language (OpenCL) can shorten the verification and implementation period for FPGAs. In this project, PipeCNN [1] is implemented on an Intel DE10-Standard FPGA. This OpenCL design accelerates Alexnet through the interaction between an Advanced RISC Machine (ARM) processor and the FPGA. Then, PipeCNN optimizations based on memory read and convolution are analysed and discussed.


Contents

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Problem Statement
  1.4 Limitations

2 Background
  2.1 CNN Model
  2.2 Software
    2.2.1 OpenCL
    2.2.2 Caffe
  2.3 Hardware
  2.4 CNN FPGA-Accelerator
  2.5 Winograd Algorithm

3 Method
  3.1 OpenCL Environment Setup
  3.2 Memory Read Optimization
  3.3 Convolutional Kernel Optimization

4 Result
  4.1 Simulation Results with FPGA-Accelerator
  4.2 Simulation Results with FPGA-Accelerator Resource Usage

5 Discussion
  5.1 Result
  5.2 Method
  5.3 Work in a Wider Perspective

6 Conclusion

Appendix A  Quantization Parameters

Bibliography

List of Figures

Figure 2.1.1  Data flow of the first convolutional layer
Figure 2.2.1  The overview of the platform model (modified from [14])
Figure 2.2.2  The structure of the index space (modified from [9])
Figure 2.2.3  The structure of the OpenCL memory model (modified from [16])
Figure 2.2.4  The overview of the program process
Figure 2.3.1  The architecture of Intel SoC OpenCL (modified from [17])
Figure 2.4.1  The architecture of PipeCNN (modified from [1])
Figure 2.5.1  2D winograd algorithm
Figure 2.5.2  2D winograd transformation
Figure 3.1.1  OpenCL development process
Figure 3.2.1  The principle of fetching data in the MemRD kernel
Figure 3.2.2  Memory read code from PipeCNN
Figure 3.2.3  Memory read optimized code
Figure 3.3.1  The progress of the winograd implementation


Index of Tables

Table 4.1  The measured runtime on different FPGA-accelerators
Table 4.2  The summary of the estimated resource usage
Table 4.3  Detailed resource usage in the memory read kernel
Table 6.1  Quantization parameters for Alexnet (from [18])


Notation

ALUT (Adaptive Lookup Table): A logical construct that represents what can be implemented by the combinational logic hardware.

ARM (Advanced RISC Machine): A 32-bit processor.

API (Application Programming Interface): The code provided by the computer operating system or a program library for invoking an application.

A10_ref (Intel® Arria® 10 GX FPGA Development Kit Reference Platform): An emulation platform for OpenCL.

BSP (Board Support Package): Supports OpenCL projects on an FPGA.

CAFFE (Convolutional Architecture for Fast Feature Embedding): An efficient open-source deep learning framework. Detailed description in Section 2.2.2.

CNN (Convolutional Neural Network): A deep learning algorithm which can extract image features by assigning learnable weights and biases to image data.

CPU (Central Processing Unit): A main part of a computer.

CU (Computing Unit): A part of the OpenCL platform model. Detailed description in Section 2.2.1.

DSP (Digital Signal Processor): A microprocessor specialized for digital signal processing.

FC (Fully Connected Layer): A part of the convolutional neural network architecture.

FF (Flip-Flop): A basic sequential logic element that stores one bit of state.

FPGA (Field-Programmable Gate Array): An integrated circuit which can be configured by designers.

GPU (Graphics Processing Unit): A specialized electronic circuit to accelerate the creation of images.

HPC (High Performance Computing): The ability to process data and perform complex calculations at high speed.

ILSVRC-2012 (ImageNet Large Scale Visual Recognition Challenge 2012): One of the most popular and authoritative academic competitions in computer vision.

Intel FPGA SDK for OpenCL (Intel FPGA Software Development Kit for OpenCL): Software which provides a compiler and tools for OpenCL design.

LRN (Local Response Normalization): A part of a convolutional neural network which implements lateral inhibition.

ms (Millisecond): Unit of time. Used in execution time measurements.

NDRange (N-Dimensional Range): A specific index space in the OpenCL execution model.

OpenCL (Open Computing Language): A heterogeneous computing execution framework. Detailed description in Section 2.2.1.

PCI-e (Peripheral Component Interconnect Express): A high-speed serial computer expansion bus standard.

PE (Processing Element): A part of the OpenCL platform model. Detailed description in Section 2.2.1.

RAM (Random Access Memory): A form of computer memory that can be read and changed in any order.

ReLU (Rectified Linear Unit): A common activation function in neural networks. Detailed description in Section 2.1.

1-D: 1-Dimensional

2-D: 2-Dimensional

3-D: 3-Dimensional


1 Introduction

1.1 Motivation

In recent years, convolutional neural networks (CNNs) have made a great contribution to computer vision research [2]. Before CNNs were applied, digital image processing was a big problem in artificial intelligence, because the amount of image data to be processed was too large, resulting in high cost and low efficiency. Also, retaining the original features of images was difficult during the digitization process, resulting in low accuracy of image processing. However, CNNs can solve the two issues stated above due to their structure [3]. At present, CNNs are used widely in face and speech recognition, image searching and classification, and human pose estimation, among other things [4].

Currently, graphics processing units (GPUs) are commonly used for deep neural network training. However, they have some shortcomings, such as high power consumption and slow inference [1]. Field-programmable gate arrays (FPGAs) are also applied to deep neural networks; they can provide a parallel convolutional neural network inference system [5]. By reusing computing resources, processing data in parallel and pipelining the design, the resource usage is reduced and the computing speed is greatly improved. Therefore, CNN accelerators implemented on FPGAs will be of great value in the future.


1.2 Purpose

Central processing units (CPUs) are good at management and scheduling. They can read data and manage files, but they are weak at computing. FPGAs can handle both management and computing. However, their long development period and complex algorithm development make them unsuitable on their own for large-scale projects such as this one. Instead, heterogeneous computing will be applied.

Heterogeneous computing is a system in which different types of processors work together. Compared to traditional CPU parallel computing, heterogeneous computing provides higher efficiency and lower latency. With a rising demand for computing performance, heterogeneous computing is becoming more and more important.

In this project, a “CPU + FPGA” framework will be used to accelerate Alexnet, which is a CNN model. The performance will be analysed by comparing the results with a CPU-only framework. Besides, some advanced optimizations of the FPGA-accelerator will be made, followed by tests of their performance.

1.3 Problem Statement

During acceleration using PipeCNN [1], the convolutional kernel causes the major time delay. Besides, performance is affected by data communication from the memory kernel. There is hence a need to optimize these parts, which is the main goal of this project. Another goal is to accelerate Alexnet using PipeCNN on a DE10-Standard FPGA.

Stated below are the research questions to be answered in this project:

• Can PipeCNN be used to accelerate Alexnet on an Intel DE10-Standard FPGA?

• Can an FPGA-accelerator be improved by optimization of the PipeCNN memory read kernel?

• Is optimization of PipeCNN possible by replacing the convolutional layer with the winograd transformation algorithm?


1.4 Limitations

This project has several limitations which affect the optimization and the study results. Though it is possible to use multiple CNN models, only Alexnet was utilized in this project due to time limitations. Alexnet is a typical CNN model and plays an important role in CNN development; therefore, improving Alexnet can also boost machine learning development. Compared with some other CNN models, like VGG, GoogleNet and ResNet, Alexnet has fewer weights and layers, which makes it more suitable for the Intel DE10-Standard FPGA.

In addition, the lack of necessary laboratory equipment makes it difficult to determine multiple parameters. Therefore, only execution time and resource consumption are measured. These quantities are easy to measure, which simplifies comparison.


2 Background

In this chapter, the CNN model is introduced first in Section 2.1. Next, the software, namely OpenCL and the Convolutional Architecture for Fast Feature Embedding (Caffe), is presented in Section 2.2. Section 2.3 covers the hardware platform. Section 2.4 describes the CNN FPGA-accelerator, and Section 2.5 provides relevant theory regarding the winograd transformation algorithm.

2.1 CNN Model

In 1988, Fukushima proposed the theory of the CNN structure [6]. However, at that time the development of computation hardware limited CNN research and application. Not until 1990 did LeCun et al. obtain successful results on the handwritten digit classification problem through the application of a gradient-based learning algorithm to CNNs [7]. Then a historical breakthrough happened in 2012 with the appearance of the Alexnet model. Alexnet is composed of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected (FC) layers with a final 1000-way softmax output [8]. The first convolutional layer includes convolution computation with a rectified linear unit (ReLU) function and max pooling with Local Response Normalization (LRN). LRN mimics the lateral inhibition found in biological neural networks and is used to improve the generalization ability of the CNN model [8]. The process of convolution is to multiply weights with the RGB values of the image in order to extract the image data information.


The progress of the first convolution is shown in Figure 2.1.1. 96 filters of size 11×11×3 are divided into two parts and then convolved with the feature maps from the image data. Max pooling is the next operation after the convolution generates a set of results of size 55×55×96: 3×3 filters with a stride of two select a maximum value from the related feature maps, generating the final result of size 27×27×96, since LRN does not influence the output size. The following four convolutional layers execute the same operation with 5×5, 3×3, 3×3 and 3×3 filters, respectively. Compared with the first layer, the third and fourth convolutional layers lack the max pooling and LRN operations, while the fifth convolutional layer only lacks the LRN operation. All convolutional layers except the first set padding values (all-zero padding) in order to guarantee that all feature maps can fetch data properly. The sixth, seventh and eighth layers are FC layers which add a dropout step after the computation. Dropout means that input elements are set to zero randomly with a specific probability. During deep learning network training, reduced overfitting [8] and reduced network complexity can be obtained by dropout.


Figure 2.1.1 Data flow of the first convolutional layer.
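The dimension bookkeeping above can be checked with the standard output-size formula out = (in - k + 2*pad)/stride + 1. The small C program below recomputes the stated sizes; the 227×227×3 input size and the stride of 4 for the first convolution are assumed values that are standard for Alexnet but not stated in the text above.

```c
#include <stdio.h>

/* Standard "valid" output-size formula: out = (in - kernel + 2*pad) / stride + 1.
   Input size (227) and conv stride (4) are assumed; the text only states the
   filter size, the pooling parameters and the resulting output sizes. */
static int out_size(int in, int kernel, int pad, int stride) {
    return (in - kernel + 2 * pad) / stride + 1;
}

int main(void) {
    int conv1 = out_size(227, 11, 0, 4);   /* expected 55 */
    int pool1 = out_size(conv1, 3, 0, 2);  /* expected 27 */
    printf("conv1 output: %dx%dx96\n", conv1, conv1);   /* 55x55x96 */
    printf("pool1 output: %dx%dx96\n", pool1, pool1);   /* 27x27x96; LRN keeps the size */
    return 0;
}
```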


Alexnet has two obvious innovations. Firstly, two new concepts are introduced in the Alexnet model: dropout and LRN. Secondly, the utilization of ReLUs accelerates the training [8]. Based on training with millions of ImageNet images and the development of GPUs, Alexnet won the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC-2012), which motivated further research on Alexnet. The Alexnet model thus became a milestone in the history of CNN development.

2.2 Software

In this section, OpenCL, the programming language used, is introduced first. Then Caffe, the software platform used to train Alexnet, is described.

2.2.1 OpenCL

With the exploitation of a range of multi-core microprocessors, CPUs, digital signal processors (DSPs), reconfigurable hardware and GPUs, various computing environments exist nowadays [9]. Developing efficient computing frameworks based on this heterogeneity has become a hot trend. Therefore, version 1.0 of the Open Computing Language (OpenCL) was published by the Khronos Group in 2008 [10].

OpenCL is an open and royalty-free standard for parallel heterogeneous computing platforms targeting CPUs, GPUs, DSPs, FPGAs, etc. [11]. OpenCL provides easy-to-use abstractions [12] and an application programming interface (API) that is compatible with significantly different architectures. Furthermore, each hardware platform can obtain high performance under the OpenCL framework. At present, not only embedded vendors, like ARM and Imagination Technologies, but also high performance computing (HPC) vendors, like Intel, AMD, NVIDIA and IBM, support OpenCL for their hardware platforms [9]. The cross-platform nature and the support from related vendors guarantee that OpenCL has broad prospects in heterogeneous computation.


In order to describe the core of the OpenCL standard, the Khronos Group divides the OpenCL architecture into four models, namely the platform, execution, memory and programming models. This section introduces these models and gives an overview of an OpenCL program.

Platform Model

The platform model consists of one host and one or more connected OpenCL devices. Figure 2.2.1 illustrates the overview of the OpenCL platform model architecture. The host, for example an ARM or x86 computing platform, controls multiple OpenCL computing devices (CPUs, GPUs, DSPs, FPGAs, etc.) [13]. Each OpenCL device has one or more computing units (CUs), and each computing unit is composed of one or more processing elements (PEs) [14]. These processing elements are the smallest units used to perform calculations on the devices. The platform model is an abstract hardware architecture: programs written in the OpenCL C language, called kernels, can be executed on the computing devices via the platform model [9], and vendors then map this abstract architecture to their physical hardware. Furthermore, the host and the compute device are typically connected through Peripheral Component Interconnect Express (PCI-e).

Execution Model

An OpenCL program consists of two parts: the host program and the kernel program. The host program is executed on the host processor, and the OpenCL devices then perform kernels on the processing units through commands built by the host program. Generally, these kernel functions have low computational complexity. Based on the host-device interaction, the execution model describes the OpenCL environment configuration on the host and the execution of kernels on the devices. Contexts, command queues, and kernels are the three essential elements of the execution model.


Figure 2.2.1 The overview of platform model (modified from [14]).

A context coordinates the mechanisms for host-device interaction; it also manages the available memory on the device and keeps track of the programs and kernels of the OpenCL devices [9]. A host builds and manages a context via an OpenCL API function named clCreateContext(). The host creates a command queue for a device by invoking an OpenCL API function called clCreateCommandQueue(). Commands are then submitted to this queue by the host and wait until they are required to be executed on the device.

Generally, an OpenCL kernel program is composed of multiple kernels. The host instructs the device to execute a specific kernel function by submitting commands and sending parameters to the kernel. When the device receives the command from the host, the OpenCL system creates an index space to locate the data on the processing elements, as shown in Figure 2.2.2. An index space is composed of blocks, and each block is called a work-item. A collection of work-items is a work-group. The number of work-items is supposed to be the same for each work-group of the kernel. The advantage of this model is that work-items and work-groups map well onto hardware. Each device has multiple CUs and each CU has multiple PEs; therefore, a work-item can be mapped to a PE, and a work-group can be mapped to a CU, based on the parallelism shared between OpenCL and the hardware.

Figure 2.2.2 The structure of index space (modified from [9]).

On the other hand, the index space is specified based on the number of work-items. This specific index space is called an N-dimensional range (NDRange), where N is one, two or three [9]. An NDRange is defined by an integer array of length N, and the elements in this array specify the size of the index space in each dimension [15]. Moreover, the length N is consistent with the dimensionality of the work-items. In the index space, the global ID is used to identify the coordinates of a work-item.

There is a requirement that the index space size should be a multiple of the work-group size in each dimension. If the chosen sizes do not satisfy this requirement, it is enough to expand the index space size in each dimension by rounding it up [9].
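A one-line helper for this rounding rule, as a sketch (the variable names are illustrative):

```c
#include <stddef.h>
#include <stdio.h>

/* Round the global NDRange size up to the nearest multiple of the chosen
   work-group size, as suggested above. */
static size_t round_up(size_t global_size, size_t work_group_size) {
    return ((global_size + work_group_size - 1) / work_group_size) * work_group_size;
}

int main(void) {
    printf("%zu\n", round_up(1000, 64)); /* prints 1024 */
    return 0;
}
```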

Memory Model


The memory model is an abstract memory hierarchy prepared for kernels. It describes the memory structure, its management, and the communication between the host and the device. Programs are written against this model, and vendors then map the model to their actual memory hardware [9].

Figure 2.2.3 shows the structure of the OpenCL memory model. First, the host memory is a part of system memory that is only visible and accessible to the host. The host processor can execute read and write operations without any limitation, since the memory in this region is fully administered by the host. However, kernels have no authority to access data located in this space [16].

Figure 2.2.3 The structure of OpenCL memory model (modified from [16]).


The second region is the global memory, which allows all work-items to read and write data. Work-items can interact with any object in this space without limitations, and the host is responsible for the allocation and de-allocation of buffers here. Another region is the constant memory, which is read-only for the kernels. In general, it is used for transferring constant data from the host to the device [16].

The global memory is actually a bridge connecting the host and the device, and data stored in this memory can be accessed freely. First, the host transfers data from the host memory to the global memory, and then it releases control of the buffer when the kernel starts to process the data. Finally, the kernel reads from and writes to the global memory until the kernel execution is completed [16].

The local memory is allocated on a computing unit and is only accessible to the device; the host cannot access it. Its function is to transmit and store data that is shared by the work-items in a work-group. The last space in the model is the private memory, which is only accessible by a single processing element within the device [16]. This space is private to a work-item and cannot be accessed by the other work-items in a work-group.

Programming Model

The programming model defines how to map a parallel algorithm onto physical hardware. It includes two main programming models: data parallelism and task parallelism.

Data parallelism, where different data are processed at the same time under the same instruction, is the basis of OpenCL. All work-items follow this principle. Multi-threaded operation reduces the number of commands handled by each thread, which is much better than a single thread, so the efficiency can be improved significantly.

Generally, data parallelism is treated as the primary target, but task parallelism is also supported in OpenCL. In the first pattern of task parallelism, a kernel task is performed as a single work-item. The second pattern runs several such tasks concurrently using an out-of-order queue, and the third adopts OpenCL's event model when the tasks are connected in a task graph [15].

Program Overview

An OpenCL program consists of a host part and a device part. The former runs on the host processor and handles OpenCL initialization, context and command queue generation, command submission and data interaction. In essence, the host program invokes OpenCL API functions; its logic is illustrated in Figure 2.2.4 and follows these procedures (a minimal host-side sketch is given after the list):

1. Initialize the platform, and then discover and initialize the device.
2. Create the context and the command queue.
3. Create the device buffers and write the host data to the buffer.
4. Build and compile the program.
5. Create the kernel, set the kernel arguments and execute the kernel according to the sequence order.
6. Collect the results from the output buffer to the host.
7. Release all OpenCL resources.

Figure 2.2.4 The overview of program process.
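The seven steps above map directly onto OpenCL host API calls. The following fragment is a minimal sketch of that flow in plain C, without error handling; the kernel source string, buffer sizes and names are placeholders for illustration and are not taken from PipeCNN (which, on the FPGA, loads a precompiled .aocx binary instead of building from source).

```c
#include <CL/cl.h>
#include <stdio.h>

int main(void) {
    /* 1. Discover a platform and a device. */
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    /* 2. Create the context and the command queue. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* 3. Create device buffers and copy host data into the input buffer. */
    float host_in[256] = {0}, host_out[256];
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(host_in),  NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(host_out), NULL, NULL);
    clEnqueueWriteBuffer(queue, in, CL_TRUE, 0, sizeof(host_in), host_in, 0, NULL, NULL);

    /* 4. Build the program (for an FPGA this step would load a precompiled .aocx
          image with clCreateProgramWithBinary; a source string is used here). */
    const char *src = "__kernel void copy(__global const float *a, __global float *b)"
                      "{ int i = get_global_id(0); b[i] = a[i]; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* 5. Create the kernel, set its arguments and enqueue it over a 1-D NDRange. */
    cl_kernel k = clCreateKernel(prog, "copy", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);
    size_t global = 256;
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 6. Read the results back to the host. */
    clEnqueueReadBuffer(queue, out, CL_TRUE, 0, sizeof(host_out), host_out, 0, NULL, NULL);

    /* 7. Release all OpenCL resources. */
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseMemObject(in); clReleaseMemObject(out);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    printf("done\n");
    return 0;
}
```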

After the host program, the OpenCL C kernel program on the device is explained (cf. Figure 2.3.1 [17]). The kernel functions are similar to C functions [9]. The keyword __kernel marks a kernel program, and the return type is void, meaning that no return value is needed. All pointers in the argument list should specify their address space: data in the global memory is declared with the keyword __global, and data in the constant memory (__constant) and local memory (__local) is declared likewise.
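As a small illustration of these keywords (not a kernel from PipeCNN), a vector-scaling kernel using the global, constant and local address spaces might look as follows.

```c
/* Minimal OpenCL C kernel illustrating the address-space qualifiers described
   above. Purely illustrative; not code from PipeCNN. */
__kernel void scale(__global const float *in,
                    __global float *out,
                    __constant float *factor,
                    __local float *scratch)   /* local scratch, one element per work-item */
{
    int gid = get_global_id(0);   /* global ID identifies the work-item */
    int lid = get_local_id(0);

    scratch[lid] = in[gid] * factor[0];   /* compute into local memory */
    barrier(CLK_LOCAL_MEM_FENCE);         /* synchronize the work-group */

    out[gid] = scratch[lid];              /* write the result to global memory */
}
```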

2.2.2 Caffe

Caffe is an efficient open-source deep learning framework. It supports the command line, Python and Matlab interfaces. Besides, Caffe runs on both CPUs and GPUs.

This project adopts the Matlab interface to train Alexnet, as introduced by Wang in [18]. The procedure for training this model is as follows. The first step is to extract Alexnet through Matlab. The second step is to transform its weights and biases to a fixed-point format. The reason fixed-point arithmetic is used is that hardware consumption and the demand for memory bandwidth are diminished compared with floating-point arithmetic. Hence, quantization with a unified word length and a variable fractional bit-width for each CNN layer is necessary for PipeCNN [1]. Wang assumes that the PipeCNN fixed-point format can be expressed as N·2^(-m), where N is an n-bit fixed-point integer and m is the number of fractional bits of the quantized data. Appendix A lists the quantization parameters for Alexnet. The input, output and weights of each CNN layer are expressed as 8-bit fixed-point integers with corresponding fractional bits, according to the assumption made by Wang. Finally, the quantized weights and biases are written into a single binary file.
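A minimal C sketch of this quantization rule follows, assuming that the pair of numbers listed per layer in Appendix A is (word length, fractional bits m); the example weight value is made up for illustration.

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Quantize a floating-point value to the fixed-point format N * 2^(-m) described
   above, where N is an 8-bit signed integer and m is the fractional bit-width
   (m may be negative, as in Appendix A). Saturation keeps N in the 8-bit range. */
static int8_t quantize(float value, int m) {
    long n = lroundf(ldexpf(value, m));   /* N = round(value * 2^m) */
    if (n >  127) n =  127;
    if (n < -128) n = -128;
    return (int8_t)n;
}

static float dequantize(int8_t n, int m) {
    return ldexpf((float)n, -m);          /* value = N * 2^(-m) */
}

int main(void) {
    float w = 0.1832f;                    /* illustrative weight value */
    int m = 8;                            /* illustrative fractional bit-width */
    int8_t q = quantize(w, m);
    printf("N = %d, reconstructed = %f\n", q, dequantize(q, m));
    return 0;
}
```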


2.3 Hardware

The DE10-Standard Development Kit is a hardware platform built around an Intel Cyclone V System-on-Chip (SoC) FPGA. The SoC contains an ARM-based hard processor system (HPS), which consists of a dual-core Cortex-A9 embedded core, peripherals and memory interfaces [19]. It is necessary to install related resources and software to build an OpenCL development environment before developing OpenCL applications. The procedure for this installation is as follows.

First, the DE10-Standard OpenCL Board Support Package (BSP) is required, because the DE10-Standard board needs resources from the BSP to support OpenCL project development. Second, the Intel SoC EDS tool is required, since the host program cannot run on the ARM processor without cross-compiling. Third, the Intel FPGA Software Development Kit (SDK) for OpenCL is needed, which provides a compiler and tools to build OpenCL projects [20]. Figure 2.3.1 illustrates how an OpenCL project runs on the Intel SoC FPGA. The FPGA part of the SoC is responsible for executing the OpenCL kernels. Besides, the Intel SoC EDS cross-compiles the host program and creates a binary file that can be executed on the embedded ARM core.

On the other hand, the kernel program is developed with the Intel FPGA SDK and Quartus, and then compiled by the OpenCL offline compiler, which generates an .aocx file that runs on the FPGA.

In conclusion, three key parts of the project have been introduced: Alexnet, which is the object to be accelerated; the OpenCL framework, which provides the heterogeneous computing framework; and the Intel DE10-Standard FPGA, which is the OpenCL execution carrier. This background lays a solid foundation for the following chapters.


Figure 2.3.1 The architecture of Intel SoC OpenCL (modified from [17]).

2.4 CNN FPGA-Accelerator

The CNN FPGA-accelerator employed in this project is based on PipeCNN, proposed by Wang [1]; its architecture is illustrated in Figure 2.4.1. MemRD and MemWR are two data-mover kernels, which read and write feature maps and weights from/to the global memory in the form of a 3-dimensional (3-D) NDRange. In the convolutional kernel, each PE adopts a pipelined multiply-adder tree with a delay buffer [1] instead of nested loops in order to improve efficiency. Multiple PEs can then perform parallel convolutions. Both the convolutional layers and the FC layers are executed in the convolutional kernel. The next part is the pooling kernel, which executes sub-sampling operations on the data from the convolutional kernel. Next, the LRN kernel reads data from the global memory in order to perform normalization. Finally, the data is sent back to the global memory.

In this structure, the cascaded kernels, which are connected by Altera's OpenCL channel extension, do not need to store interlayer data back to the external memory. This reduces the memory bandwidth burden. In addition, the convolutional kernel handles two different layer types, which makes the hardware resource utilization more efficient [1].

Figure 2.4.1 The architecture of PipeCNN (modified from [1]).

2.5 Winograd Algorithm

Generally, the convolutional layers cause a large time delay in the CNN architecture. Therefore, optimization of the convolution algorithm should be put on the agenda. A multiplication is considerably more expensive than an addition in hardware, so there is an incentive to reduce the number of multiplications. For this purpose, the winograd fast convolution algorithm was proposed by Lavin and Gray [21]. The following section uses a classic example to explain how the winograd algorithm decreases the number of multiplications.

The conventional algorithm is given by

D = \begin{bmatrix} d_0 & d_1 & d_2 & d_3 \end{bmatrix}^T    (1)

where D is the input signal vector.

G = \begin{bmatrix} g_0 & g_1 & g_2 \end{bmatrix}^T    (2)

where G is the filter.

R = \begin{bmatrix} r_0 \\ r_1 \end{bmatrix}
  = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix}
    \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix}    (3)

where R denotes the final expression for the convolution operation.

In Eq. (3), six multiplications and four additions are required. The algorithm after optimization is given by

R = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix}
    \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix}
  = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix}    (4)

where

m_1 = (d_0 - d_2)\, g_0    (5)

m_2 = (d_1 + d_2)\, \frac{g_0 + g_1 + g_2}{2}    (6)

m_3 = (d_2 - d_1)\, \frac{g_0 - g_1 + g_2}{2}    (7)

m_4 = (d_1 - d_3)\, g_2    (8)

In total, four additions related to the input signals appear within the parentheses, and Eqs. (5) to (8) contain four multiplications. In Eqs. (6) and (7), three additions on the filter side are needed, because the sum of g_0 and g_2 can be regarded as a single calculation; besides, two further multiplications are required, because the filter-related terms are divided by two. In Eq. (4), four additions are used in the final expression.

The filter-related calculations can be completed in advance, since all elements of the filter g are determined during the training of the neural network. Hence, the final calculation only requires four multiplications and eight additions. Compared with the six multiplications and four additions of direct convolution, the number of multiplications is reduced while the number of additions is increased. The winograd algorithm is nevertheless an optimization, since a multiplication is generally slower than an addition.
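As a sanity check of Eqs. (4) to (8), the following self-contained C snippet computes one 1-D F(2,3) winograd output pair and compares it with the direct convolution of Eq. (3); the input and filter values are arbitrary examples.

```c
#include <stdio.h>

/* 1-D Winograd F(2,3): computes r = [r0 r1] from four inputs d and a 3-tap
   filter g using the four multiplications of Eqs. (5)-(8), then compares the
   result with the direct convolution of Eq. (3). */
int main(void) {
    float d[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float g[3] = {0.5f, -1.0f, 2.0f};

    /* Filter-side terms; precomputable because g is fixed after training. */
    float g_sum  = (g[0] + g[1] + g[2]) / 2.0f;
    float g_diff = (g[0] - g[1] + g[2]) / 2.0f;

    float m1 = (d[0] - d[2]) * g[0];     /* Eq. (5) */
    float m2 = (d[1] + d[2]) * g_sum;    /* Eq. (6) */
    float m3 = (d[2] - d[1]) * g_diff;   /* Eq. (7) */
    float m4 = (d[1] - d[3]) * g[2];     /* Eq. (8) */

    float r0 = m1 + m2 + m3;             /* Eq. (4) */
    float r1 = m2 - m3 - m4;

    /* Direct convolution, Eq. (3), for comparison. */
    float ref0 = d[0]*g[0] + d[1]*g[1] + d[2]*g[2];
    float ref1 = d[1]*g[0] + d[2]*g[1] + d[3]*g[2];

    printf("winograd: %f %f\n", r0, r1);   /* 4.5 6.0 */
    printf("direct:   %f %f\n", ref0, ref1);
    return 0;
}
```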

However, it is necessary to extend the 1-dimensional (1-D) algorithm to a 2-dimensional (2-D) algorithm if the winograd algorithm is to be applied in a convolutional layer. Figure 2.5.1 displays the principle of the 2-D winograd algorithm. First, all elements of the filter are arranged in parallel form. For the input image, a matrix of dimension 4×9 is generated, since the elements in each sliding window are transformed into a row. This matrix is divided into six blocks. The locations of the repeated elements in each block are the same as in the 1-D algorithm, and the positions of the repeated block matrices in the 4×9 matrix are the same as in the 1-D algorithm.


Then, transforming the input images and filters is necessary in the 2-D algorithm. The detailed procedure is shown in Figure 2.5.2. The block matrices (K_0, K_1, K_2, K_3) and the three columns (W_0, W_1, W_2) shown in Figure 2.5.1 repeat the 1-D winograd transformation, and then M_0, M_1, M_2, M_3 are created based on Eqs. (5) to (8). These intermediate parameters are obtained by nesting the 1-D winograd algorithm with itself. Finally, based on Eq. (4) and Figure 2.5.2, r_0, r_1, r_2 and r_3 can be expressed as

r_0 = m_{11} + m_{12} + m_{13} + m_{21} + m_{22} + m_{23} + m_{31} + m_{32} + m_{33}    (9)

r_1 = m_{12} - m_{13} - m_{14} + m_{22} - m_{23} - m_{24} + m_{32} - m_{33} - m_{34}    (10)

r_2 = m_{21} + m_{22} + m_{23} - m_{31} - m_{32} - m_{33} - m_{41} - m_{42} - m_{43}    (11)

r_3 = m_{22} - m_{23} - m_{24} - m_{32} + m_{33} + m_{34} - m_{42} + m_{43} + m_{44}    (12)

Figure 2.5.1 2D winograd algorithm.

Here, m_{ij} denotes the intermediate parameters of the 2-D algorithm and r_i (with i as index) denotes the final results.

Throughout the entire process, the 1-D algorithm is nested with itself to obtain the 2-D algorithm [21]. In the 2-D winograd algorithm, the total number of multiplications is sixteen. For the same computation, direct convolution costs 36 multiplications. The number of multiplications is thus clearly reduced; the multiplication complexity decreases by a factor of 2.25.
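For reference, the nesting above is often written in the compact matrix form of [21]. For F(2×2, 3×3), with a 4×4 input tile d and a 3×3 filter g, the 2×2 output tile Y can be written as below, where \odot denotes element-wise multiplication; this formulation is equivalent to the element-wise derivation above and is stated here only as a summary.

Y = A^T \left[ (G\, g\, G^T) \odot (B^T d\, B) \right] A

B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \quad
G = \begin{bmatrix} 1 & 0 & 0 \\ 1/2 & 1/2 & 1/2 \\ 1/2 & -1/2 & 1/2 \\ 0 & 0 & 1 \end{bmatrix}, \quad
A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix}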


Figure 2.5.2 2D winograd transformation.


3 Method

This chapter describes the method of PipeCNN optimization. Generally, the configuration of the OpenCL development environment is the first task of the whole design. It includes installing the related software, simulating on the emulation platform, and testing on the FPGA, which is described in Section 3.1. Section 3.2 explains the memory read kernel optimization of PipeCNN through a comparison and analysis of the original and optimized programs. Finally, Section 3.3 covers how the convolutional kernel architecture is reshaped to improve the fast convolution performance.

3.1 OpenCL Environment Setup

The development process of this project is shown in Figure 3.1.1. Firstly, the functionality of PipeCNN is verified on the emulation platform. The emulator is responsible for verifying the host runtime functionality [20]; it can be built in Microsoft Visual Studio under the Windows operating system. When executing PipeCNN on the emulation platform, a non-SoC board, such as the Intel Arria 10 GX FPGA Development Kit Reference Platform (a10_ref), is selected to simulate the FPGA hardware environment. After the emulation verifies the functionality successfully, PipeCNN can be targeted to the DE10-Standard FPGA. For the SoC hardware, the Intel FPGA SDK for OpenCL Offline Compiler is utilized to create the .aocx executable file. This file is an FPGA hardware configuration file that can be executed on the DE10-Standard [20]. The host application, which runs on the ARM processor, is cross-compiled by the EDS.

In this project, there are two main files in PipeCNN: the host file named main.cpp and the OpenCL kernel file named conv_pipe.cl. After compilation, conv_pipe.aocx and the host application file are generated. By executing these two files together in the embedded Linux system installed on the DE10-Standard board, the PipeCNN accelerated result can be displayed on the screen.

3.2 Memory Read Optimization

After PipeCNN has been executed successfully on the Intel DE10-Standard FPGA, its optimization should be considered. Reducing the computational complexity as much as possible is an important optimization method. This section introduces the memory read kernel optimization, which compresses the dimensionality of the index.


Figure 3.1.1 OpenCL development process.


The work of the memory read kernel is to fetch data from the global memory and write it to the next kernel. Figure 3.2.1 demonstrates how the memory read kernel sends data to the convolutional kernel in PipeCNN. There are three parameters (output_idx_dim1, output_idx_dim2 and output_idx_dim3) in the index system. In the CNN model, the input data of every layer is three-dimensional, so output_idx_dim1, output_idx_dim2 and output_idx_dim3 index the first, second and third dimension, respectively. First, the work-group window starts to slide in the first dimension, and output_idx_dim1 increases by one each time the window moves forward one step. Second, when output_idx_dim1 reaches the border of the first dimension, output_idx_dim2 increases by one; meanwhile, output_idx_dim1 is reset and the first step is repeated. Third, if both output_idx_dim1 and output_idx_dim2 have reached their borders, they are both reset and output_idx_dim3 increases by one. The window then repeats the first and second step.
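The three-counter update described above can be summarised by the following C-style sketch. The variable names follow the text, but the code is illustrative and is not the actual PipeCNN MemRD kernel.

```c
/* Illustrative sketch of the nested index update described above
   (not the actual PipeCNN MemRD kernel). dim1_size, dim2_size and
   dim3_size are the borders of the three dimensions. */
void memrd_index_sketch(unsigned dim1_size, unsigned dim2_size, unsigned dim3_size) {
    unsigned output_idx_dim1 = 0, output_idx_dim2 = 0, output_idx_dim3 = 0;

    for (unsigned step = 0; step < dim1_size * dim2_size * dim3_size; step++) {
        /* ... fetch the data addressed by (output_idx_dim1, output_idx_dim2,
               output_idx_dim3) and write it to the channel ... */

        if (output_idx_dim1 == dim1_size - 1) {        /* border of dimension 1 reached */
            output_idx_dim1 = 0;
            if (output_idx_dim2 == dim2_size - 1) {    /* border of dimension 2 reached */
                output_idx_dim2 = 0;
                output_idx_dim3++;                     /* move to the next plane */
            } else {
                output_idx_dim2++;
            }
        } else {
            output_idx_dim1++;                         /* slide the window one step */
        }
    }
}
```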

Figure 3.2.1 The principle of fetching data in the MemRD kernel.

Figure 3.2.2 [18] shows how this index system is implemented in PipeCNN. The API function write_channel_intel() passes the bias and data, and the weights from Alexnet are written from the buffer to the next kernel through the channel extension [22]. However, the conditional statements (the "if" expressions) are long, and the index computation has high complexity due to the many multiplications and additions. These factors lead to lower efficiency. Therefore, reducing the computational complexity of this part is the core of the memory read kernel optimization.


Figure 3.2.2 Memory read code from PipeCNN.


Figure 3.2.3 illustrates the optimized program based on PipeCNN. A new parameter called index_wr treats the 3-D index of the weights as a single entity. Therefore, loading weights into the weights buffer and writing weights from the buffer to the channel no longer need calculations involving output_idx_dim1, output_idx_dim2 and output_idx_dim3. However, for writing the input data, more divisions and remainders have to be used to obtain the correct location in win_buffer, because win_buffer and weight_buffer have different sizes. Therefore, the calculation of output_idx_y, output_idx_dim1, output_idx_dim2 and output_idx_dim3 requires more hardware resources compared with the original PipeCNN.


Figure 3.2.3 Memory read optimized code.
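A sketch of the flattened-index idea described above: a single counter index_wr addresses the weight buffer directly, and the 3-D coordinates are recovered with divisions and remainders only where they are still needed. Variable names follow the text; the sizes and the loop structure are illustrative, not the actual optimized kernel.

```c
/* Sketch of the flattened-index scheme: the 3-D weight index is treated as a
   single counter, and 3-D coordinates are recovered only where still required.
   Illustrative only; not the actual optimized PipeCNN kernel. */
void memrd_flat_index_sketch(unsigned dim1_size, unsigned dim2_size, unsigned dim3_size) {
    unsigned total = dim1_size * dim2_size * dim3_size;

    for (unsigned index_wr = 0; index_wr < total; index_wr++) {
        /* Weights: the flat index addresses the weight buffer directly, so no
           per-dimension arithmetic is needed here. */
        /* ... write weight_buffer[index_wr] to the channel ... */

        /* Input data: recover the 3-D coordinates with divisions and remainders,
           which is what costs the extra hardware resources mentioned above. */
        unsigned output_idx_dim1 =  index_wr % dim1_size;
        unsigned output_idx_dim2 = (index_wr / dim1_size) % dim2_size;
        unsigned output_idx_dim3 =  index_wr / (dim1_size * dim2_size);
        (void)output_idx_dim1; (void)output_idx_dim2; (void)output_idx_dim3;
    }
}
```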


3.3 Convolutional Kernel Optimization

The 2-D winograd algorithm, one of the fast convolution algorithms, was introduced in Section 2.5. In this section, the third convolutional layer of the Alexnet model is selected as an example to explain how the winograd optimization is implemented in the convolutional kernel.

In the third layer, the filters are 3×3 in their 2-D size, and so is each sliding window of the feature maps. Besides, the padding is one and the stride is also one. Figure 3.3.1 describes how convolution is replaced with the winograd algorithm in PipeCNN. As a first step, the input and the filter are transformed based on Figure 2.5.2; the 2-D size of the feature-map windows and filters is thereby expanded to 4×4. Then, in order to be compatible with the new feature map, a winograd pad is added, which increases the input size from 15×15 to 16×16. Apart from this, the stride value is changed from one to two, because one 2-D winograd operation generates four results simultaneously. Next, 256 feature-map windows (Win1 to Win256) perform the winograd operation with the transformed filter, since the size of the image input is 13×13×256. Finally, the four final results are generated through accumulation.

Due to the modifications above, “Win1” slides seven times horizontally and vertically, so that results of size 14×14 are generated. In order to preserve functional correctness, the right and bottom edges of the results must be cut off. Eventually, the remaining results are forwarded to the pooling kernel. The fourth and fifth layers of Alexnet repeat the same steps to implement the winograd optimization.
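The tile arithmetic above can be checked with a few lines of C, using only the sizes stated in the text: a 16×16 padded input, 4×4 winograd tiles, a stride of two and a 2×2 output per tile.

```c
#include <stdio.h>

int main(void) {
    int padded = 16;                                    /* 13x13 input + conv pad + winograd pad */
    int tile = 4, stride = 2;
    int tiles_per_dim = (padded - tile) / stride + 1;   /* 7 tile positions per dimension */
    int out_per_dim = tiles_per_dim * 2;                /* each tile yields 2x2 outputs: 14 */
    printf("tiles per dimension: %d\n", tiles_per_dim);
    printf("raw output: %dx%d, cropped to 13x13\n", out_per_dim, out_per_dim);
    return 0;
}
```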

This chapter has demonstrated the hardware implementation of PipeCNN on the Intel DE10-Standard FPGA. The emulation platform is in charge of functional verification; after passing the simulation test, the PipeCNN project is executed in the embedded Linux system, invoking the related hardware resources. Next, the code optimization of the memory read kernel in PipeCNN was analysed: treating the 3-D index of the weights buffer as a new 1-D index is the core of the memory optimization. Besides, the winograd algorithm optimization implemented in the convolutional kernel was also introduced; some Alexnet parameters have to be modified to fit the 2-D winograd algorithm.


Figure 3.3.1 The progress of winograd implementation.


4 Result

This chapter focuses on the acceleration achieved by the memory read optimization. The execution time and the hardware resource consumption of the FPGA-accelerators are presented as well.

4.1 Simulation Results with FPGA-Accelerator

In the test, both the original design and the optimized design described in Section 3.2 were executed successfully on the Intel DE10-Standard FPGA. Table 4.1 shows the detailed time measurements for the two FPGA-accelerators.

For the original PipeCNN, [1] reports that the execution time of PipeCNN is 37 times shorter than that of a CPU-only implementation. The total runtime of PipeCNN in this project is 836 ms. The first layer costs the most time, occupying almost one third of the total time. For the optimized design, the second layer consumes the longest time, since the first layer is accelerated considerably compared with the original design. Overall, the memory read optimization achieves a further acceleration. However, from the second layer to the eighth layer, the execution time of the optimized accelerator is longer than that of the original accelerator; only the first layer achieves a significant acceleration.


PipeCNN      Layer execution time (ms)                                                 Total runtime (ms)
             1        2        3        4        5       6       7       8
Original     261.09   207.97   138.82   104.16   69.47   35.09   15.63   3.95    836.17
Optimized    168.61   233.52   148.38   113.53   75.71   40.35   20.01   5.03    805.15

Table 4.1 The measured runtime on different FPGA-accelerators.

4.2 Simulation Results with FPGA-Accelerator Resource Usage

Table 4.2 presents the hardware resource usage of the two FPGA-accelerators, as estimated by the Intel OpenCL compiler. All measured parameters of the optimized PipeCNN increase by roughly 10 percentage points.

PipeCNN      Logic utilization   ALUTs   Dedicated logic registers   Memory blocks   DSP blocks
Original     75%                 48%     31%                         55%             38%
Optimized    91%                 61%     41%                         61%             48%

Table 4.2 The summary of the estimated resource usage.

Table 4.3 shows the hardware resource consumption of the two memory read kernels. The new index system in the optimized PipeCNN costs extra hardware resources: the numbers of ALUTs and FFs roughly double compared with the original memory read kernel, while the numbers of RAMs and DSPs increase by around 23% and 50%, respectively.


Memory read kernel   ALUTs   FFs     RAMs   DSPs
Original             14247   20805   138    23
Optimized            29054   39096   170    34.5

Table 4.3 Detailed resource usage in memory read kernel.

This chapter has presented the execution time and hardware resource usage of the original and optimized PipeCNN. The optimized design achieves a further acceleration at the cost of higher hardware resource usage.


5 Discussion

5.1 Result

This chapter discusses the PipeCNN performance on the Intel DE10-Standard FPGA, as well as the results from the memory-read-optimized accelerator and the winograd transformation. There are two parameters in the PipeCNN design for optimizing throughput, namely VEC_SIZE and CU_NUM [1]. The hardware resources of different platforms can be exploited by adjusting VEC_SIZE and CU_NUM. To use the resources of the Intel DE10-Standard FPGA to the maximum, VEC_SIZE = 16 and CU_NUM = 8 should be set; in that case, the execution time of PipeCNN can reach around 180 ms. However, due to the high resource consumption of the memory read optimization, this design has to use VEC_SIZE = 16 and CU_NUM = 4 to fit within the hardware resources. Therefore, the execution time of Alexnet in this thesis is much longer than the measurements in [1].

From Table 4.1, the optimized accelerator reduces the total execution time by around 31 ms, which shows that the memory read optimization approach is sound. Hence, this project succeeded in decreasing the computational complexity, although the performance is still not ideal. For the winograd optimization, this project failed to obtain measured values because of the project time limitations; its validity can therefore not be fully verified. This part of the work is one direction for further study.


5.2 Method

The memory read optimization still has many shortcomings. First, the acceleration effect is modest and comes with a large sacrifice of hardware resources. At present, this optimization is only suitable for large filters, like those of the first layer of Alexnet. A large filter needs a larger index space to indicate the locations of the elements in the weights buffer, so the computational complexity of the weights-buffer indexing decreases significantly. However, for the remaining layers of Alexnet, the negative impact on the image input buffer is stronger than the positive impact on the weights buffer, which results in an increase of the execution time. This was one of the unexpected results of this design. Hence, it is necessary to reduce the hardware resource cost and the negative effect caused by the input data writing operation in the memory read kernel.

The winograd transformation work stopped at the debugging stage. The main problem is that floating-point data is generated during the filter transformation, because some fixed-point weights have to be divided by four in this method. The transformed weights therefore do not have a unified data format, which leads to inaccuracy in the subsequent rounding operation. Moreover, the matrix transformation increases the burden on the hardware memory, so it would be better to select a larger development kit to continue the unfinished design.

Besides, the theoretical basis and the method related to the winograd algorithm are still imperfect in this work. Therefore, more research should be conducted in order to refine the method of implementing winograd in hardware. This part will be treated as a further task in exploring fast convolution optimization.

5.3 Work in a Wider Perspective

In order to promote the rapid development of deep learning, there are various software libraries and open-source frameworks, like Caffe, Torch and Tensorflow, with which CNN models can be built effectively using high-level abstract APIs. On the other hand, the possibility of invoking FPGA resources through software-level commands is also highly valued. An FPGA-accelerator implemented in OpenCL can narrow the gap between the hardware and software levels. In terms of performance, this kind of project can greatly shorten the FPGA development period for CNNs. Besides, it also provides computation performance that is comparable to a GPU, which has a profound impact on the future development of artificial intelligence.


6 Conclusion

This project has introduced an optimized convolutional neural network FPGA-accelerator based on OpenCL. An ARM processor interacts with the FPGA hardware device through the execution model provided by OpenCL. An efficient and adjustable FPGA-accelerator named PipeCNN was introduced, and optimizations of the memory read kernel and the convolutional kernel were discussed.

The measured results show that this design fulfils the purpose of CNN acceleration and optimization, but the usage of hardware resources is unsatisfactory. Therefore, future work should optimize the method further. The winograd transformation is still at the theoretical stage, so improvements of the program are required to solve the floating-point problem. In addition, applying partial reconfiguration to PipeCNN on a more advanced FPGA development kit is considered as another direction for future work.


Appendix A  Quantization Parameters

Layer Name   Input   Output   Weight

Conv1        8,0     8,-4     8,8
Relu1        8,-4    8,-4
Lrn1         8,-4    8,0
Pool1        8,0     8,0
Conv2        8,0     8,-2     8,8
Relu2        8,-2    8,-2
Lrn2         8,-2    8,0
Pool2        8,0     8,0
Conv3        8,0     8,-1     8,8
Relu3        8,-1    8,-1
Conv4        8,-1    8,-1     8,8
Relu4        8,-1    8,-1
Conv5        8,-1    8,-1     8,8
Relu5        8,-1    8,-1
Pool5        8,-1    8,-1
Fc6          8,-1    8,0      8,11
Relu6        8,0     8,0
Drop6        8,0     8,0
Fc7          8,0     8,2      8,10
Relu7        8,2     8,2
Drop7        8,2     8,2
Fc8          8,2     8,2      8,10

Table 6.1 Quantization parameters for Alexnet (from [18]).


Bibliography

[1] D. Wang, J. An, and K. Xu, “PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks,” 2016.

[2] W. Wang and Y. Yang, “Development of convolutional neural network and its application in image classification: a survey,” Opt. Eng., vol. 58, no. 04, p. 1, Apr. 2019, doi: 10.1117/1.OE.58.4.040901.

[3] https://easyai.tech/ai-definition/cnn/.

[4] A. Bhandare, M. Bhide, P. Gokhale, and R. Chandavarkar, “Applications of Convolutional Neural Networks,” Int. J. Comput. Sci. Inf. Technol., vol. 7, no. 5, pp. 2206–2215, 2016.

[5] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Aug. 2016, pp. 1–9, doi: 10.1109/FPL.2016.7577308.

[6] K. Fukushima, “Neocognitron: A hierarchical neural network capable of visual pattern recognition,” Neural Networks, vol. 1, no. 2, pp. 119–130, Jan. 1988, doi: 10.1016/0893-6080(88)90014-7.

[7] M. Z. Alom et al., “The history began from AlexNet: A comprehensive survey on deep learning approaches,” arXiv, 2018.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, Jun. 2017, doi: 10.1145/3065386.

[9] B. R. Gaster et al., Heterogeneous Computing with OpenCL, Revised OpenCL 1.2 ed. Elsevier/Morgan Kaufmann, 2013.

[10] https://en.wikipedia.org/wiki/OpenCL#History.


[11] https://www.khronos.org/opencl/.

[12] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems,” Comput. Sci. Eng., vol. 12, no. 3, pp. 66–72, May 2010, doi: 10.1109/MCSE.2010.69.

[13] J. Tompson and K. Schlachter, “An Introduction to the OpenCL Programming Model,” 2012.

[14] “The OpenCL™ Specification.” https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_platform_model.

[15] A. Munshi, OpenCL programming guide. Addison-Wesley Professional, 2011.

[16] “SDAccel Environment User Guide.” https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/caz1504034320212.html.

[17] “DE10-Standard OpenCL.” https://rocketboards.org/foswiki/pub/Documentation/DE10Standard/DE10-Standard_OpenCL.pdf.

[18] https://github.com/doonny/PipeCNN.

[19] https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=205&No=1081&PartNo=1.

[20] “Intel® FPGA SDK for OpenCL™ Pro Edition Getting Started Guide.” https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_getting_started.pdf.

[21] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” Sep. 2015.

[22] “Intel® FPGA SDK for OpenCL™ Pro Edition Programming Guide.” https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf.
