
A Framework for FPGA-Based Acceleration of Neural Network

Inference with Limited Numerical Precision via High-Level Synthesis

with Streaming Functionality

by

Ruo Long Lian

A thesis submitted in conformity with the requirements

for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

© Copyright 2016 by Ruo Long Lian

Abstract

A Framework for FPGA-Based Acceleration of Neural Network Inference with Limited

Numerical Precision via High-Level Synthesis with Streaming Functionality

Ruo Long Lian

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2016

Deep neural networks (DNNs) are achieving state-of-the-art performance in many artificial intel-

ligence tasks, such as computer vision and speech recognition. Due to the high computational

requirements of DNNs, there is an increasing need to design custom hardware for accelerating

the DNN computation with a low power budget.

This thesis proposes an FPGA-based acceleration solution for DNN inference, realized on a

SoC device where software controls the execution and off-loads compute-intensive operations to

the hardware accelerator. To minimize the hardware cost, limited precision data representations

are investigated for DNN computations, and incorporated in the accelerator design. Streaming

functionality is added to the LegUp high-level synthesis tool, allowing the hardware accelerator

to be designed entirely in C-language and synthesized to pipelined hardware. The accelerator

solution is not tied to a particular DNN architecture; rather, it is configurable in software,

permitting acceleration of a range of DNN architectures proposed in the recent literature.


Acknowledgements

I am very grateful to have worked with many wonderful people throughout my M.A.Sc study

and research. First and foremost, I would like to express my sincere gratitude to my advisor

Professor Jason Anderson. This thesis would not have been possible without his guidance and

support. I admire, appreciate and am inspired by Jason’s enthusiasm, dedication and work

ethic in his teaching and research. Ever since my very first undergraduate courses on computer

fundamentals and digital systems, to my undergraduate summer research project, and to this

thesis work, Jason has always been a great teacher and mentor. Jason has not only introduced

me to the wonderful field of computer engineering, but has also helped me to vastly improve

my research ability and communication skills.

I would like to thank our collaborators from Samsung, John Brothers and Joohoon Lee

for the invaluable discussion and feedback. I would also like to thank my thesis committee

members, Professor Vaughn Betz and Professor Andreas Moshovos for their edits and feedback

on this work.

I would like to thank the people involved with the LegUp project, I was lucky to work with

such an amazing team: Blair Fort, Nazanin Calagar, Bain Syrowik, Joy Chen, and Julie Hsiao.

In particular, I want to thank Andrew Canis and Jongsok (James) Choi, with whom I spent

lots of time together discussing ideas and improving LegUp. I would also like to thank my

roommate Max for the thesis edits and many stress-reducing talks.

I am deeply indebted to the unconditional support from my parents. I would not be able

to achieve any of this without you guys. Last but not least, a special thanks to Isabella for all

the love and understanding.


Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 5

2.1 Primer on Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 General Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.2 Neural Network Training and Inference . . . . . . . . . . . . . . . . . . . 6

2.2 Major Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.1 Fully-connected Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.2 Convolutional Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.3 Maxpooling Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.4 Local Response Normalization (LRN) Layers . . . . . . . . . . . . . . . . 10

2.3 Benchmark Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.1 A Toy Model for The MNIST Dataset . . . . . . . . . . . . . . . . . . . . 11

2.3.2 AlexNet for The ImageNet Dataset . . . . . . . . . . . . . . . . . . . . . . 12

2.4 Key Operation - Multiply and Accumulate (MAC) . . . . . . . . . . . . . . . . . 14

3 Using Low-Precision Fixed-point in Neural Networks 15

3.1 Floating-Point versus Fixed-Point . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 Impact of Bit-width on Hardware Operator . . . . . . . . . . . . . . . . . . . . . 16

3.3 Low-Precision Fixed-point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3.1 Heterogeneous Fixed-Point Representation . . . . . . . . . . . . . . . . . 18

3.3.2 Heterogeneous Fixed-Point Arithmetic . . . . . . . . . . . . . . . . . . . . 19


3.3.3 Conversion Between Fixed-Point Formats . . . . . . . . . . . . . . . . . . 20

3.4 Software Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4.1 Object-Oriented Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4.2 Model Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.5 Experimental Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.5.1 Uniform Fixed-Point in Neural Network Training . . . . . . . . . . . . . . 24

3.5.2 Value Range Profiling in Floating-Point Neural Network Training . . . . . 25

3.5.3 Heterogeneous Fixed-point in Neural Network Inference . . . . . . . . . . 26

3.5.4 Heterogeneous Fixed-point in AlexNet Inference . . . . . . . . . . . . . . 27

3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 System Architecture 31

4.1 Accelerator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.1.1 Computation and Data Access . . . . . . . . . . . . . . . . . . . . . . . . 32

4.1.2 Data Reuse and On-Chip Storage . . . . . . . . . . . . . . . . . . . . . . . 33

4.1.3 Accelerator Structure versus Data Width of On-Chip Buffer . . . . . . . . 34

4.1.4 Accelerator Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.3 Translation Layer Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.3.1 Translation Layer for Fully-Connected Layers . . . . . . . . . . . . . . . . 38

4.3.2 Translation Layer for Convolutional Layers . . . . . . . . . . . . . . . . . 40

4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5 Streaming Hardware Generation in LegUp High-Level Synthesis 45

5.1 Loop Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2 Pipeline Function Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.3 FIFO Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3.1 First-Word-Fall-Through FIFO . . . . . . . . . . . . . . . . . . . . . . . . 48

5.3.2 Software Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.4 Stall Logic and Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.5 Accelerator Design using LegUp’s Pipelined Function Feature . . . . . . . . . . . 54

5.5.1 Software Implementation of Accelerator Controller . . . . . . . . . . . . . 54

5.5.2 Software Implementation of the Compute Unit . . . . . . . . . . . . . . . 56


5.5.3 Software Implementation of Buffer Readers and Writer . . . . . . . . . . . 59

5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

6 System Implementation 61

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.2 SoC Device Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.3 System Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

6.3.1 The Instruction Registers Module . . . . . . . . . . . . . . . . . . . . . . 64

6.3.2 The Status Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6.3.3 The Accelerator Controller Interface . . . . . . . . . . . . . . . . . . . . . 64

6.3.4 The DMA Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.3.5 Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.4 Data Movement Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.4.1 Addressing Scheme for the On-Chip Buffers. . . . . . . . . . . . . . . . . 67

6.4.2 Custom DMA Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.4.3 Translation From Virtual Addresses to Physical Addresses . . . . . . . . . 74

6.5 Clock Frequency Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.6 Additional Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.6.1 Using the Direct SDRAM Interface . . . . . . . . . . . . . . . . . . . . . . 76

6.6.2 Control Loop Optimization in Accelerator Controller . . . . . . . . . . . . 78

6.6.3 Reuse of Data Between Input Feature Map Tiles . . . . . . . . . . . . . . 79

6.7 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.8.1 Performance Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6.8.2 Hardware Resource Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

6.8.3 Power Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

6.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

7 Conclusion 96

7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98


Bibliography 100


List of Tables

3.1 Value range profiling of the MNIST model . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Value range profiling of AlexNet. . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.3 Inference Accuracy of AlexNet With Heterogeneous Fixed-point Representations 29

6.1 Design components being tested at each development/verification flow. . . . . . . 81

6.2 Run-time breakdown per image inference of AlexNet . . . . . . . . . . . . . . . . 84

6.3 Time spent on data transfer between memory and on-chip buffers . . . . . . . . . 84

6.4 Resource usage of the proposed system on the Cyclone V SoC FPGA . . . . . . . 89

6.5 Resource usage of the proposed system on the Arria V SoC FPGA . . . . . . . . 90


List of Figures

2.1 A neuron in artificial neural networks . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 A simple deep neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.3 Example activation functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.4 Illustration of a convolutional layer. . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.5 Illustration of a maxpooling layer. . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.6 A neural network model for MNIST dataset . . . . . . . . . . . . . . . . . . . . . 11

2.7 ILSVRC image classification error rate from 2010 to 2015. . . . . . . . . . . . . . 13

2.8 An illustration of the AlexNet architecture . . . . . . . . . . . . . . . . . . . . . . 13

2.9 Computation time distribution of individual layers on GPU and CPU . . . . . . 14

3.1 Schematic of a 16-input dot product operator. . . . . . . . . . . . . . . . . . . . . 17

3.2 The impact of input bit-width on the resource usage, FMax and power of a dot-product operator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.3 Example of conversion between heterogeneous fixed-point representations . . . . 20

3.4 Key components of the object-oriented software architecture. . . . . . . . . . . . 21

3.5 A snippet of a model configuration file. . . . . . . . . . . . . . . . . . . . . . . . . 23

3.6 Training of MNIST model using 32-bit fixed-point representations. . . . . . . . . 24

3.7 Inference of MNIST model using uniform and heterogeneous fixed-point representations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1 Accelerator Design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

4.2 Three abstract layers in the overall system. . . . . . . . . . . . . . . . . . . . . . 37

4.3 Tiling and traversal of fully-connected layers. . . . . . . . . . . . . . . . . . . . . 39

4.4 Traversal order in the tiling software for convolutional layers . . . . . . . . . . . 41

4.5 Traversal order in the accelerator controller for convolutional layers . . . . . . . . 42


5.1 Module interface of a sequential function. . . . . . . . . . . . . . . . . . . . . . . . . 47

5.2 Handshaking between sequential functions. . . . . . . . . . . . . . . . . . . . . . . . . 47

5.3 Ready-valid-data (RVD) interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.4 Handshaking between source and sink using RVD interface . . . . . . . . . . . . 48

5.5 RVD-compatible interface of FWFT FIFO. . . . . . . . . . . . . . . . . . . . . . . . 49

5.6 Pipeline circuit datapath and stall logic. . . . . . . . . . . . . . . . . . . . . . . . 52

6.1 Overall system integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6.2 Data layout in off-chip memory for a tile in fully-connected layer. . . . . . . . . . 68

6.3 Data layout in on-chip buffers for a tile in fully-connected layer. . . . . . . . . . . 68

6.4 Data layout in off-chip memory for a tile in convolutional layer. . . . . . . . . . . 69

6.5 Data layout in on-chip buffers for a tile in convolutional layer. . . . . . . . . . . . 69

6.6 Implementation in the custom DMA unit that handles unaligned starting address 72

6.7 Example use of pipeline bridges in the interconnect . . . . . . . . . . . . . . . . . 76

6.8 Interconnect from DMA master to SDRAM. . . . . . . . . . . . . . . . . . . . . . 77

6.9 Previous traversal order during the tiling of a convolutional layer . . . . . . . . . 80

6.10 A data re-use-friendly traversal order. . . . . . . . . . . . . . . . . . . . . . . . . 80

6.11 Using input buffer in a circular organization. . . . . . . . . . . . . . . . . . . . . 80

6.12 Screen shot of chip planner on Cyclone V SoC device. . . . . . . . . . . . . . . . 85

6.13 Screen shot of chip planner on Arria V SoC device. . . . . . . . . . . . . . . . . . 85

6.14 Power monitoring of HPS and FPGA fabric on the Arria V SoC development board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6.15 Neural Network Accelerator Architecture of DianNao [13]. . . . . . . . . . . . . . 93

6.16 Neural Network Accelerator Architecture in [33]. . . . . . . . . . . . . . . . . . . 94


Chapter 1

Introduction

1.1 Motivation

Deep neural networks (DNNs) have gained prominence recently by producing state-of-the-art

results in pattern recognition, speech synthesis, customer preference elicitation, and other ma-

chine learning tasks [9]. Like genetic algorithms and simulated annealing, DNNs are based on

an analogy with real-world biological/physical processes. A DNN is a computational model

inspired by the structure and function of the brain, wherein abstract “neurons” perform com-

putations on inputs received from other adjacent neurons, and produce outputs that are then

received and processed by other neurons in the network. The degree of influence a neuron

has on another neuron is reflected by a numerical weight. In simple terms, training a DNN is

the process of selecting values for the weights so that the overall neural network produces the

desired output for a given input. On the other hand, inference of a DNN refers to the use of

an already-trained model to make a prediction for an input that was not seen during training.

For example, in the case of image recognition, when the DNN is presented with an image of a

tree, the DNN's outputs identify it as such. DNN research is very active in both the research

community and in industry, with companies such as Google, Facebook, and Baidu actively

publishing (e.g. [31]) and competing with one another. The future potential for DNNs, both

in big-data/data center applications and in low-power/mobile/IoT applications, appears to be

very strong. One can imagine, for example, DNNs in data centers being applied for analytics

on data gathered from social media, and DNNs applied in smartphones for image recognition

directly from the camera feed. In both such contexts, there is a pressing need for both low

power and high speed.


In most applications, DNNs are first trained off-line with a large set of training data on

machine clusters or GPUs. Trained neural network models are then deployed for inference

tasks in data centres or in an embedded environment, serving a large set of end-users and

applications. For example, a pre-trained neural network model that predicts web-advertisement

click-through rates can be deployed on thousands of ad servers, making inferences for billions

of ad placements on websites and apps everyday. Likewise, in the embedded context, an image-

recognition neural network could be deployed in self-driving cars. Generally speaking, neural

network inference is executed much more frequently than training – a trained network would be

used for inference many times. Therefore, this research focuses on optimizing

inference. The goal of the research is to investigate, design, develop and evaluate customized

hardware implementations for the inference of deep neural networks, with the aim of achieving

considerably better speed and energy efficiency than can be realized with standard processors

or GPUs.

Broadly speaking, there are two ways of implementing computations: in hardware or in

software. The latter approach is most frequently used, as it is more straightforward and software

development skills are widely available. The software is run on a standard processor, which

is a generic platform with fixed datapath widths, and high overhead associated with fetching

and decoding instructions, accessing memory, and so on. The hardware approach involves

the custom design of a circuit dedicated to the particular application needs. The functions

performed by the circuit, the degree of parallelism, and the datapath widths can all be tailored

precisely to the application requirements, reducing overhead considerably, and improving speed

and power versus using a processor. Indeed, such a customized computing approach can provide

order(s) of magnitude improvement in speed and energy over processors [16].

Although specialized hardware has the potential to provide huge acceleration at a fraction

of a processor's energy, the main drawback is related to its design. On one hand, describing

hardware components in a hardware description language (HDL) (e.g. VHDL or Verilog) allows

a designer to adopt existing tools for RTL and logic synthesis into the target technology. On the

other hand, this requires the designer to specify functionality at a low level of abstraction, where

cycle-by-cycle behaviour is completely specified. The use of such languages requires advanced

hardware expertise, besides being cumbersome to develop in. This leads to longer development

times that can critically impact the time-to-market.

An interesting solution to realize customized computing and, at the same time, address the


time-to-market problem is the combination of reconfigurable hardware architectures, such as

field-programmable gate arrays (FPGAs) and high-level synthesis (HLS) tools [15]. FPGAs are

integrated circuits that can be configured by the end user to implement digital circuits. Most

FPGAs are also reconfigurable, allowing a relatively quick refinement and optimization of a

hardware design with no additional manufacturing costs. The designer can modify the HDL

description of the components and then use an FPGA vendor toolchain for the synthesis of the

bitstream to configure the FPGA. HLS tools start from a high-level software programming

language (HLL) (e.g. C, C++, SystemC) to automatically produce a circuit specification

in HDL that performs the same function as the software. HLS offers benefits to software

engineers, enabling them to reap some of the speed and energy benefits of hardware, without

actually having to build up hardware expertise. HLS also offers benefits to hardware engineers,

by allowing them to design systems faster at a high level of abstraction and rapidly explore the

design space. This is crucial in the design of complex systems and especially suitable for FPGA

design where many alternative implementations can be easily generated, deployed onto the

target device, and compared.

1.2 Contributions

We design and implement a complete acceleration solution for the inference step of deep neural

networks, specifically in an embedded context. Our implementation is realized on a System-on-

Chip (SoC) FPGA device where a software framework accepts a generic (user-specified) neural

network model as input, then executes and accelerates the corresponding inference by off-

loading the computation to the hardware accelerator implemented on the FPGA. The principal

contributions of this thesis include:

• Using a heterogeneous fixed-point representation in the neural network computation,

which brings performance and area benefits to the custom hardware, while at the same

time, retaining the model accuracy.

• A high-throughput hardware design that accelerates the most compute-intensive part of

neural network inference.

• A function pipelining feature implemented as part of the LegUp high-level synthesis tool,

which allows a streaming design to be described in C-language.

• A complete system implementation that performs accelerated inference using an already-trained neural network model.

• A working software and hardware framework that will enable many further optimizations

to be realized rapidly.

The function pipelining support in LegUp HLS has been submitted for publication to

the 2016 IEEE Int’l Conference on Application-specific Systems, Architectures and Processors

(ASAP).

1.3 Thesis Overview

Chapter 2 first introduces the basics of neural networks and two example models that serve as

target benchmarks in this project. Chapter 3 explores the feasibility of using a reduced-precision

data representation in neural network computation. Our system architecture is described in

Chapter 4. Chapter 5 describes the work done for adding pipeline function support to the LegUp

HLS tool. In Chapter 6, we explain the system implementation and present the experimental

results. Chapter 7 summarizes the thesis contributions and proposes suggestions for future

work.

Chapter 2

Background

This chapter provides basic background on neural networks. We begin by introducing their

general structure, as well as how they can be trained and then used for inference, in Section 2.1.

Four types of neural network layers are described in Section 2.2. Section 2.3 describes two

benchmark models that are used in this project. We highlight the key operation of neural

network computation, multiply and accumulate, in Section 2.4.

2.1 Primer on Neural Networks

2.1.1 General Structure

The neuron is the basic element in an artificial neural network (Figure 2.1). A neuron receives

inputs that may come from the observable properties of a given problem (e.g., image pixels, or

audio samples) or may be the outputs of other neurons. Each connection from an input to a

neuron is associated with a synaptic weight. The neuron sums up the products of all pairs

of inputs and synaptic weights. This weighted sum is typically offset with a bias term in order

to make the model more general. The bias term can be considered as an additional synaptic

weight that is always connected to a constant input of +1. Nonlinearity is introduced into

neural networks by applying an activation function at the neuron output. The activation

function transforms the weighted sum into an output value that defines the state of the neuron.

Example activation functions may be a sigmoid function that maps weighted sum values from

(-inf, +inf) to (0, 1) (Figure 2.3a), or a Rectified Linear Unit (ReLU) which clamps all negative

values to 0 and retains all positive values (Figure 2.3b).


Figure 2.1: A neuron in artificial neural networks, o = f(z) = f\left( \sum_{k=1}^{n} (w_k \cdot i_k) + b \right).

!"""""!"""""!

!""!""!

Output Layer

Hidden Layer

Input Layer

Figure 2.2: A simple deep neural network.

Figure 2.3: Example activation functions. (a) Sigmoid function, y = 1/(1 + e^{−x}); (b) ReLU function, y = (x > 0) ? x : 0.
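To make the neuron model concrete, the following is a minimal C++ sketch of the computation in Figure 2.1: a weighted sum of the inputs and synaptic weights, offset by the bias term, followed by a ReLU activation. The function names and input values are illustrative only and are not taken from the software framework described later in this thesis:

#include <cassert>
#include <cstdio>
#include <vector>

// Weighted sum of inputs and synaptic weights, offset by a bias term.
double weighted_sum(const std::vector<double>& inputs,
                    const std::vector<double>& weights, double bias) {
    assert(inputs.size() == weights.size());
    double z = bias;
    for (std::size_t k = 0; k < inputs.size(); ++k)
        z += weights[k] * inputs[k];
    return z;
}

// ReLU activation: clamp negative values to 0, keep positive values.
double relu(double z) { return z > 0.0 ? z : 0.0; }

int main() {
    std::vector<double> i = {0.5, -1.0, 2.0};   // inputs i_k
    std::vector<double> w = {0.3,  0.8, -0.1};  // synaptic weights w_k
    double b = 0.1;                             // bias
    double o = relu(weighted_sum(i, w, b));     // o = f(z) = f(sum(w_k * i_k) + b)
    std::printf("neuron output: %f\n", o);      // z = -0.75, so ReLU gives 0
    return 0;
}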

A neural network with only one layer of synaptic weights can only approximate linear

functions of the inputs. Therefore, intermediate layers are introduced to form more powerful

neural networks in order to approximate more complex models. These intermediate layers are

known as hidden layers since their states are usually not directly observable. Neural networks

with one or more hidden layers are referred to as deep neural networks (Figure 2.2). As shown in

the figure, a neural network can have more than one output. In the case of deep neural networks,

values of the neuron outputs are computed in topological order, from inputs to outputs – the

outputs from neurons in one layer are used as inputs to neurons in deeper layers.

2.1.2 Neural Network Training and Inference

Training of a neural network is the process of finding a set of parameters (weights and bias)

that minimize the model’s approximation error on the training dataset. The approximation

error can be calculated by a loss function, which is typically determined based on the task.

For example, the Euclidean loss function is a popular choice for real-valued regression tasks. It


computes the sum of squares of differences between each model output o and desired output t

as follows:

E = \frac{1}{2N} \sum_{i=1}^{N} (o_i - t_i)^2 \qquad (2.1)

where N is the total number of samples in the training dataset.

The combination of gradient descent and back-propagation [29] is the most commonly used

technique to train a neural network. Gradient descent aims to minimize the error, E. In

gradient descent, the synaptic weight w is updated by a value that is proportional to the

derivative of total error with respect to w, i.e.,

w = w − α · ∂E/∂w (2.2)

where the term α is known as the learning rate.

The back-propagation algorithm can be divided into two phases, the forward pass and the

backward pass. Using the neuron model in Figure 2.1 as an example, the forward pass computes

the neuron output values layer by layer from input to output, with the form

o = f(z) = f\left( \sum_{k=1}^{n} (w_k \cdot i_k) + b \right) \qquad (2.3)

where f is the non-linear activation function.

The purpose of the backward pass is to find the error derivative with respect to each model

parameter (∂E/∂wk in Equation 2.2), so that the parameters can be updated using the gradient

descent method. Given the error between actual output and desired output, E, the error

gradient with respect to the weight can be computed as:

∂E/∂wk = ∂E/∂o · ∂o/∂z · ∂z/∂wk = ∂E/∂o · ∂o/∂z · ik (2.4)

where ∂E/∂o is the error derivative with respect to the output neuron, ∂o/∂z is the derivative

of the activation function f, and ∂z/∂wk reduces to the input neuron value ik whose

connection to o is associated with wk. When computing the error derivative with respect to

a weight in a hidden layer, the term ∂E/∂o refers to the error derivative of the hidden neuron.

The error derivative with respect to neurons in each hidden layer is computed by propagating

the error derivatives backward layer by layer, from output neurons to input neurons, with the


form:

∂E/∂ik = ∂E/∂o · ∂o/∂z · ∂z/∂ik = ∂E/∂o · ∂o/∂z · wk (2.5)

where ∂E/∂o and ∂E/∂ik are the error derivatives with respect to the output and input neuron

of the current layer; the term ∂z/∂ik is simplified to the weight wk that corresponds to the

connection between input ik and output o. When computing the error derivatives for the

preceding layer (in the sense of forward pass), an input neuron ik in current layer becomes

an output neuron of the preceding layer. Thus, the computed ∂E/∂ik becomes ∂E/∂o in

Equation 2.4 when computing the error derivative with respect to the weights in the preceding

layer, and substitutes ∂E/∂o in Equation 2.5 for computing the error derivative with respect

to the input neurons in the preceding layer.
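As a concrete illustration of Equations 2.2 to 2.5, the sketch below performs one gradient-descent update for a single ReLU neuron, given the error derivative ∂E/∂o coming from the layer above. The variable names and numeric values are placeholders chosen for illustration; this is a sketch of the equations, not the training code used in this work:

#include <cstdio>
#include <vector>

// One gradient-descent step for a single ReLU neuron (Equations 2.2 - 2.5).
// dE_do is the error derivative with respect to this neuron's output,
// assumed to have been propagated from the layer above.
int main() {
    std::vector<double> i = {0.5, -1.0, 2.0};   // inputs i_k
    std::vector<double> w = {0.3,  0.1,  0.2};  // weights w_k
    double b = 0.1, alpha = 0.01, dE_do = 0.2;

    // Forward pass (Equation 2.3): z = sum(w_k * i_k) + b, o = ReLU(z).
    double z = b;
    for (std::size_t k = 0; k < w.size(); ++k) z += w[k] * i[k];
    double do_dz = (z > 0.0) ? 1.0 : 0.0;       // derivative of the ReLU activation

    std::vector<double> dE_di(i.size());
    for (std::size_t k = 0; k < w.size(); ++k) {
        double dE_dw = dE_do * do_dz * i[k];    // Equation 2.4
        dE_di[k]     = dE_do * do_dz * w[k];    // Equation 2.5, passed to the preceding layer
        w[k] -= alpha * dE_dw;                  // Equation 2.2
    }
    b -= alpha * dE_do * do_dz;                 // bias acts as a weight on a constant +1 input

    std::printf("updated w0 = %f, dE/di0 = %f\n", w[0], dE_di[0]);
    return 0;
}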

Using a neural network for inference means using a pre-trained model to approximate the

output for unseen data samples. Neural network inference is equivalent to the forward pass of

the back-propagation algorithm.

2.2 Major Layers

The following paragraphs describe the four types of layers used by the target benchmarks in

this research.

2.2.1 Fully-connected Layers

In a fully-connected layer, every neuron is connected to all the neurons in its previous layer.

Each connection between neurons is associated with a unique synaptic weight and each neuron

is associated with a bias term. For a fully-connected layer with M outputs and N inputs, there

will be as many as M ×N trainable weights and M trainable biases.

2.2.2 Convolutional Layers

Unlike fully-connected layers, output neurons in a convolutional layer are only connected to a

local region of input neurons. Output neurons may share a common set of synaptic weights,

applying the same connectivity pattern to different regions of the input. The set of shared synaptic

weights can be thought of as a filter or a feature extractor. Consider the scenario where the input

to the neural network is an image with three colour channels. A 3D filter with dimensions of

K×K×3 is applied to a region of pixels in the input to extract the feature state at that particular


!"#$%&'()')*+,%&-.'/

/

012",'/%3,"&%'432-5",2",'/%3,"&%'432-

6

6

Figure 2.4: Illustration of a convolutional layer.

local region. Such a local region is sometimes referred to as the output neuron's receptive field in

its previous layer. As shown in Figure 2.4, the filter acts as a sliding window that moves across

a multi-channel image to extract a feature from all local regions. The extracted values at all

locations form a feature map representing the intensity of the feature at each location.

A convolutional layer is usually designed to capture more than one feature and hence,

multiple filters are used to construct a stack of output feature maps. For example, one filter

may be “looking” for horizontal lines in an input image; another filter may be “looking” for

diagonal lines. As with the filter weights, the bias term is shared by all output neurons in

the same feature map. The training process for a convolutional layer aims to adjust filter

weights and bias in order to extract features that can help to minimize approximation error of

the overall network. It is worth mentioning that filters may not be applied at every possible

starting position of the input feature maps; rather, a convolutional layer may have non-unit

stride. For example, with a stride of 2 in both dimensions, filters would be applied at positions

separated by 2 pixels.
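The following is a minimal C++ sketch of the sliding-window computation described above, applying a single K × K × C filter with a given stride and no padding. The function name, loop order and toy input are illustrative and do not correspond to the accelerator implementation discussed in later chapters:

#include <cstdio>
#include <vector>

// One convolutional filter applied with a stride.
// in[c][y][x]: C input feature maps; filt[c][fy][fx]: one K x K x C filter.
// Produces a single output feature map (no padding).
std::vector<std::vector<double>>
conv2d(const std::vector<std::vector<std::vector<double>>>& in,
       const std::vector<std::vector<std::vector<double>>>& filt,
       double bias, int stride) {
    const int C = in.size(), H = in[0].size(), W = in[0][0].size();
    const int K = filt[0].size();
    const int outH = (H - K) / stride + 1, outW = (W - K) / stride + 1;
    std::vector<std::vector<double>> out(outH, std::vector<double>(outW));
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox) {
            double acc = bias;                   // the bias is shared by the whole feature map
            for (int c = 0; c < C; ++c)          // across input channels
                for (int fy = 0; fy < K; ++fy)   // filter window (receptive field)
                    for (int fx = 0; fx < K; ++fx)
                        acc += filt[c][fy][fx] * in[c][oy * stride + fy][ox * stride + fx];
            out[oy][ox] = acc;
        }
    return out;
}

int main() {
    // One 4x4 single-channel input and one 2x2x1 filter, stride 2 -> 2x2 output map.
    std::vector<std::vector<std::vector<double>>> in = {{{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}}};
    std::vector<std::vector<std::vector<double>>> filt = {{{1,0},{0,1}}};
    auto out = conv2d(in, filt, 0.0, 2);
    std::printf("out[0][0]=%g out[1][1]=%g\n", out[0][0], out[1][1]);  // 1+6=7, 11+16=27
    return 0;
}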

2.2.3 Maxpooling Layers

A maxpooling layer is an approach to sub-sampling and it does not have trainable parameters,

i.e., synaptic weights. It reduces the input dimensionality by extracting the maximum value

from a set of neighbouring inputs. For example, a maxpooling layer can be applied on a set


!"#$%&'()%$*(&+)#,

-$%#$%&'()%$*(&+)#,

./ . 01

2 0/ 3

4 05 0/

Figure 2.5: Illustration of a maxpooling layer.

of input feature maps (Figure 2.5). For each individual feature map, it extracts the most

responsive neurons – the neurons with the highest output values – from the patches covered

by a sliding window. In this case, the number of output feature maps will be the same as the

number of input feature maps, but output feature maps are typically smaller on the X- and

Y-dimensions. During back-propagation, the error derivatives with respect to input neurons

unselected by the max operation in the forward pass are zero. For the selected input neurons,

their error derivatives are equal to the error derivatives with respect to their corresponding

output neurons.
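A minimal sketch of maxpooling on one feature map is shown below; each output neuron is the maximum over a P × P window moved by a fixed stride. The helper name and example values are illustrative:

#include <algorithm>
#include <cstdio>
#include <vector>

// Maxpooling on a single feature map: each output neuron is the maximum
// over a P x P window, with the window moved by `stride` pixels.
std::vector<std::vector<double>>
maxpool(const std::vector<std::vector<double>>& in, int P, int stride) {
    const int H = in.size(), W = in[0].size();
    const int outH = (H - P) / stride + 1, outW = (W - P) / stride + 1;
    std::vector<std::vector<double>> out(outH, std::vector<double>(outW));
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox) {
            double m = in[oy * stride][ox * stride];
            for (int py = 0; py < P; ++py)
                for (int px = 0; px < P; ++px)
                    m = std::max(m, in[oy * stride + py][ox * stride + px]);
            out[oy][ox] = m;  // the most responsive neuron in the patch
        }
    return out;
}

int main() {
    std::vector<std::vector<double>> fm = {{1, 3, 2, 0},
                                           {4, 2, 1, 5},
                                           {0, 1, 7, 2},
                                           {3, 6, 2, 4}};
    auto out = maxpool(fm, 2, 2);  // 2x2 non-overlapped window, as in the MNIST model
    std::printf("%g %g / %g %g\n", out[0][0], out[0][1], out[1][0], out[1][1]);  // 4 5 / 6 7
    return 0;
}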

2.2.4 Local Response Normalization (LRN) Layers

In image recognition tasks, local response normalization layers are designed to normalize the

response of neurons at the same spatial location, but from different feature maps [24]. This is

loosely akin to averaging a neuron’s output with those of other neurons at the same location in

different feature maps. The computation can be formulated as follows:

b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} (a^j_{x,y})^2 \right)^{\beta} \qquad (2.6)

The terms a^i_{x,y} and b^i_{x,y} are the input and output neurons at location <x,y> of feature map i,

respectively. N refers to the total number of feature maps and n refers to the number of adjacent

feature maps to be considered. The constants k, n, α, and β are non-trainable parameters, also

known as hyper-parameters. Their values are selected from validation where the best set of


values are chosen based on many trial runs. It is worth noting that the ordering of feature

maps is arbitrary and it is up to the network itself to adjust the feature extractors to adapt to

the normalizations. As pointed out in [24], this response normalization method bears some

similarity to lateral inhibition behaviour found in biological neurons.
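For illustration, the following sketch evaluates Equation 2.6 for a single spatial position <x,y> across N feature maps; the hyper-parameter values passed in main are placeholders:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Local response normalization (Equation 2.6) for one spatial position <x,y>.
// a[i] holds the input neuron of feature map i at that position;
// k, alpha, beta and n are the hyper-parameters described in the text.
std::vector<double> lrn_at_xy(const std::vector<double>& a,
                              double k, double alpha, double beta, int n) {
    const int N = a.size();
    std::vector<double> b(N);
    for (int i = 0; i < N; ++i) {
        double sum = 0.0;
        int lo = std::max(0, i - n / 2), hi = std::min(N - 1, i + n / 2);
        for (int j = lo; j <= hi; ++j)        // only the adjacent feature maps
            sum += a[j] * a[j];
        b[i] = a[i] / std::pow(k + alpha * sum, beta);
    }
    return b;
}

int main() {
    std::vector<double> a = {1.0, 2.0, 3.0, 4.0};   // one value per feature map
    auto b = lrn_at_xy(a, 2.0, 1e-4, 0.75, 5);      // placeholder hyper-parameters
    std::printf("b[0]=%f b[3]=%f\n", b[0], b[3]);
    return 0;
}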

2.3 Benchmark Models

In this dissertation, we experiment with two image recognition benchmarks: a small model

designed to recognize the MNIST dataset [26] of handwritten digits and a more sophisticated

image classification model for the ImageNet dataset [30]. The MNIST dataset is useful for

validating some of our design ideas that will be discussed in Chapter 3. The ImageNet dataset

is more complex and reflects datasets used in real applications. Hence, this model would be

more representative when evaluating our overall hardware implementation.

2.3.1 A Toy Model for The MNIST Dataset

The MNIST dataset consists of 28 × 28 greyscale images of handwritten digits. The dataset

has a training set of 60,000 images and a test set of 10,000 images. The task is to classify each

image into one of the 10 digit classes from 0 to 9. For this dataset, we use a neural network

model that is similar to LeNet-5 proposed by Yann LeCun in 1998 [25]. The model has two

!"

!

#

!"

#

$"

!

$"

%&'(&)*+,&'-. /+0,123.4 %&'(&)*+,&'-. /+0,123.45678&&)-. /+0,123.!

5678&&)-. /+0,123.!

9...9

...9

!"#$$%&'$(')'*+,%&-+(&

9.....9

.....9

4":$""

!$

4!;!;

$

<*))=>?&''2?+21

Figure 2.6: The neural network architecture of the toy model for MNIST dataset.

convolutional layers, each followed by a maxpooling layer, and one fully-connected layer used as

a classifier at the end of the network. The first convolutional layer applies 20 5×5×1 filters¹ on

the input image with a stride of 1, and creates 20 feature maps of size 24× 24. A maxpooling

layer downsamples the feature maps with a 2×2 non-overlapped (i.e., stride of 2) sliding window

¹ Since the inputs are single-channel greyscale images, the third dimension of the 3D filter will have a size of 1.


to produce 20 12× 12 feature maps. In the next convolutional layer, the downsampled feature

maps are convolved with 40 independent 5 × 5 × 20 filters with a stride of 1, resulting in 40

8× 8 feature maps. The identical maxpooling layer is used again to downsample each of the 40

feature maps into a smaller size of 4× 4. Lastly, the fully-connected layer discards the spatial

information and takes in the feature maps as an input vector of 640 (4×4×40) neurons, and

produces 10 outputs: one corresponding to each digit class.

We use ReLU as the activation functions after the two convolutional layers and softmax

activation after the fully-connected layer. Softmax differs from the ReLU and sigmoid functions,

which map each input independently to a new value. Softmax takes all K neurons (K = 10 in

this case) and produces a probability value for each neuron, where the sum of all K probability

values equals 1 (Equation 2.7).

b_i = \frac{\exp(a_i)}{\sum_{j=1}^{K} \exp(a_j)} \qquad (2.7)

where ai is the output of neuron i before the softmax activation, and bi is the output of neuron

i after softmax is applied. In this case, a probability value is the estimated likelihood of an

input image belonging to the corresponding digit class.
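A direct C++ sketch of Equation 2.7 is shown below; the comment about subtracting the maximum describes a common numerical-stability refinement that is not part of the equation itself:

#include <cmath>
#include <cstdio>
#include <vector>

// Softmax activation (Equation 2.7): each of the K outputs becomes a
// probability, and the K probabilities sum to 1.
std::vector<double> softmax(const std::vector<double>& a) {
    double denom = 0.0;
    for (double aj : a) denom += std::exp(aj);
    std::vector<double> b(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        b[i] = std::exp(a[i]) / denom;
    // (Practical implementations usually subtract max(a) from every a_i first
    //  to avoid overflow in exp(); the result is mathematically identical.)
    return b;
}

int main() {
    std::vector<double> logits = {1.0, 2.0, 0.5};  // neuron outputs before activation
    auto p = softmax(logits);
    std::printf("%f %f %f (sum = %f)\n", p[0], p[1], p[2], p[0] + p[1] + p[2]);
    return 0;
}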

We trained the network for 100 epochs² and were able to obtain a validation accuracy of

99.34%. This accuracy serves as the baseline for our later evaluations.

2.3.2 AlexNet for The ImageNet Dataset

ImageNet [27] is a large image dataset organized primarily by the Stanford Vision Lab [4].

The dataset contains more than 15 million high-resolution images belonging to around 22,000

categories. The images are collected from the internet and labelled by humans using a crowd-

sourcing tool. This dataset has become an invaluable resource for computer vision and machine

learning researchers. The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [30]

uses a subset of ImageNet, containing ∼1.2 million training images and 50 thousand validation

images, with a roughly even number of images for each of the 1000 selected classes. The competi-

tion has been held annually for six years and has become a standard benchmark for large-scale

image recognition. Significant progress has been made by researchers around the world. The

² One epoch consists of one full training cycle on the entire training data set.


top-5 error rate³ of the image classification task, for instance, has been substantially reduced

from 28.2% in 2010 to 3.6% in 2015 (Figure 2.7).

!"#!$!%#"$

&'#($

&&#)$

'#)$

*#'$

+#+$

%#+$

&+#+$

&%#+$

!+#+$

!%#+$

*+#+$

!+&+ !+&& !+&! !+&* !+&( !+&%

,-./%0122-203456

7642

Figure 2.7: ILSVRC image classification error rate from 2010 to 2015.

Most notably, in 2012, Krizhevsky et al. proposed a large convolutional neural network

(CNN) architecture and showed significant improvement upon previous approaches [24]. This

CNN architecture is commonly known as AlexNet and has been widely used as a reference

model in many research papers. Figure 2.8 illustrates the architecture of AlexNet.

Figure 2.8: An illustration of the AlexNet architecture (The figure is taken from [24]).

The model has 8 trainable layers with adjustable weights. The first five are convolutional

layers and the last three are fully-connected. The output layer is activated by a 1000-way

softmax that produces probability distribution over the 1000 classes. The first two convolutional

layers are followed by local response normalization layers. Three maxpooling layers are placed

after the two normalization layers and the fifth convolutional layer. All maxpooling layers use

a 3×3 pooling window with a stride of 2. To speed up the training, the network is split into two

partitions so that each partition can be trained on one GPU. Communication between the two

GPUs only happens at certain layers. As the figure shows, the third convolutional layer applies

³ Top-5 error rate is the fraction of images whose correct labels are not matched by any of the top-5 predictions made by the model.


the filters on all feature maps in the second layer, and output neurons in fully-connected layers

are connected to all input neurons in the previous layers. The model has around 60 million

parameters and 650,000 neurons. It was trained on two NVIDIA GTX580 3GB GPUs for almost

six days.

In this research, we use a pre-trained AlexNet model provided by an open-source neural

network framework called Caffe [23]. This model has a minor variation from the original one

described above and achieved a top-1 accuracy of 57.4% and a top-5 accuracy of 80.4% [1].

2.4 Key Operation - Multiply and Accumulate (MAC)

As pointed out in [22], the most compute-intensive layers in typical neural networks are the

fully-connected and convolutional layers.

Figure 2.9: Computation time distribution of individual layers, on both GPUs and CPUs for the forward pass. This figure is taken from [22].

Figure 2.9 shows the run-time breakdown of the forward pass (inference) of AlexNet

on both GPUs and CPUs. Labels prefixed with fc and conv are the fully-connected and

convolutional layers. These two types of layers combined take 95% and 89% of time on GPUs

and CPUs, respectively. In both layers, most of the work for an output neuron is to compute the

dot-product between its connected input neurons and the associated weights. The underlying

operation is essentially multiply-accumulate (MAC).

Chapter 3

Using Low-Precision Fixed-point in

Neural Networks

Neural network computations are typically performed using 32- or 64-bit floating-point because

of their ease-of-use on general processors (CPUs or GPUs). However, in custom hardware,

we have the ability to use a more cost-efficient fixed-point representation and even tailor the

numerical precision to the minimum required for inference of the desired accuracy. We propose

to use heterogeneous fixed-point representations during neural network inference, in the hope of

maximizing the computational throughput of our custom hardware and also raising energy

efficiency.

In this chapter, we begin by reviewing the trade-offs between floating-point and fixed-point

representations in Section 3.1. Section 3.2 illustrates the positive impact of reducing fixed-

point precision on hardware performance, area and power cost. We then present the concept of

heterogeneous fixed-point representation in Section 3.3. In Section 3.4, we introduce a software

framework that was developed to experiment with low-precision fixed-point in neural networks.

In Section 3.5, our experiments show that neural network computations can be carried out

using fixed-point arithmetic and the reduced bit-width heterogeneous fixed-point format can be

used in neural network inference with minimal damage to accuracy. Lastly, we summarize

several related works in Section 3.6.


3.1 Floating-Point versus Fixed-Point

In computing, real-valued numbers are represented in two main categories of data types,

floating-point and fixed-point. The floating-point representation consists of three components,

a sign bit (S), exponent (E) and mantissa (M). For example, the single-precision floating-

point (SPFP) representation defined in IEEE standard 754 has 1 sign bit, 8 exponent bits

and 23 mantissa bits. The value of a floating-point number is formulated as (−1)S × (1 +

M × 2−23)× 2E−127. The exponent part allows the floating-point format to represent a wide

range of magnitudes, from 2−127 to 2128 in the case of SPFP, while the mantissa limits the

relative error by keeping the resolution at the most significant bits.

Fixed-point representation is essentially an integer with a fixed radix point position that determines the scaling factor. We refer to the bits on the left and right of the radix point as decimal

bits (D) and fraction bits (F), respectively. Given a [D.F] fixed-point representation, its

precision is limited by the smallest representable magnitude 2^{−F}, and the representable values

are bounded in the range [−2^{D−1}, 2^{D−1} − 2^{−F}].
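To make the [D.F] notation concrete, the following is a minimal C++ sketch of encoding and decoding such a value as a raw integer scaled by 2^F, with saturation at the range limits (the overflow protection discussed later in Section 3.3.3). The struct and helper names are illustrative and are not part of the software framework of Section 3.4:

#include <cmath>
#include <cstdint>
#include <cstdio>

// A signed [D.F] fixed-point value stored as a raw integer scaled by 2^F.
// D decimal (integer) bits including the sign, F fraction bits.
struct FixPt {
    int D, F;
    int64_t raw;  // real value = raw * 2^(-F)
};

FixPt encode(double x, int D, int F) {
    double scaled = std::round(x * std::pow(2.0, F));  // quantize to steps of 2^(-F)
    double lo = -std::pow(2.0, D + F - 1);             // -2^(D-1) expressed in raw units
    double hi =  std::pow(2.0, D + F - 1) - 1;         //  2^(D-1) - 2^(-F) in raw units
    if (scaled < lo) scaled = lo;                      // saturate instead of overflowing
    if (scaled > hi) scaled = hi;
    return {D, F, static_cast<int64_t>(scaled)};
}

double decode(const FixPt& v) { return v.raw * std::pow(2.0, -v.F); }

int main() {
    FixPt a = encode(2.71828, 4, 4);   // [4.4]: range [-8, 8 - 1/16], resolution 1/16
    FixPt b = encode(123.456, 4, 4);   // out of range -> saturates to 7.9375
    std::printf("a = %f, b = %f\n", decode(a), decode(b));
    return 0;
}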

In scientific computing, floating-point is more commonly used because of its ease-of-development

in most programming environments: 1) Floating-point can represent a wider range of values,

and hence programmers generally do not need to worry about overflow or underflow scenarios,

which are more likely when fixed-point is used; 2) Floating-point also provides better precision

than fixed-point in general; 3) Mathematical functions, such as exponential and power, are bet-

ter supported for floating-point by software libraries; for instance, the C math library math.h

supports only single- and double-precision floating-point data types.

While floating-point provides far better ease-of-use for software developers, the computational

complexity of fixed-point arithmetic is considerably less than that of floating-point. Fixed-point

arithmetic operators are generally faster and more area- and power-efficient. Given that we are

designing custom hardware, we have the opportunity to use custom fixed-point arithmetic in

order to maximize the hardware throughput, while maintaining a low power budget.

3.2 Impact of Bit-width on Hardware Operator

Fixed-point arithmetic is mostly the same as integer arithmetic with additional conversion

operations. The hardware cost of integer arithmetic operators, specifically multiplication and

addition, is tightly correlated with the data bit-width.


Modern FPGA architectures contain pre-fabricated DSP blocks for implementing multipliers

and adders in a wide range of precisions [5]. Using the Altera Stratix V FPGA as an example,

each DSP block can implement three independent 9-bit multiplies, a sum of two 18-bit multiplies

or one independent 27-bit multiply. These variable-precision DSP blocks can also be cascaded

to support higher precision modes, e.g., two DSP blocks can be chained to implement an

independent 36-bit multiply operation.¹

To illustrate the impact of operator bit-width on hardware performance, area and power

cost, we implement a dot-product operator with different bit-widths on the Altera Stratix V

device. Figure 3.1 shows the structure of the dot-product operator, consisting of 8 multipliers

and 7 adders. Inputs and outputs of the dot-product circuit are registered.

Figure 3.1: Schematic of a 16-input dot product operator.

All multiplier and adder outputs are two times wider than the multiplier inputs. We placed-and-routed the

circuit using Altera’s Quartus II synthesis tool and report the resource usage, clock frequency

and estimated power in Figure 3.2. The power consumption is estimated in vectorless

mode with a 12.5% input toggle rate² and a 100 MHz clock frequency. In the experiment, the DSP

blocks are used in three configurations: 1) when the input width is less than or equal to 18 bits,

the DSPs are used in “the sum of two 18-bit multiplies” mode and hence 4 DSPs are enough

to implement 8 multipliers; 2) when the input width ranges from 19 bits to 27 bits, the DSPs

are used in “one independent 27-bit multiply” mode, resulting in 1 DSP per multiply; and 3)

when the input width goes beyond 27 bits to 32 bits, two DSPs are cascaded to form an up-to

36-bit multiply. Combinational ALUT usage and estimated power consumption both increase

¹ Consider, for example, that a 36-bit multiply can be realized using four 18-bit multiplies, and each DSP block can implement two independent 18-bit multiplies.

² The power consumption results gathered from the vectorless estimation may not be accurate. However, the estimated results can accurately reflect the relative difference in power consumption when different input bit-widths are used by the dot-product operator.

Figure 3.2: The impact of input bit-width on the resource usage, FMax and power of a dot-product operator.

as the input gets wider, and have steeper increases when more DSPs are required. The circuit

Fmax degrades in general as bit-width increases and falls rapidly when more DSPs are used.

Based on these results, we can see that minimizing the operation bit-width can greatly improve

the hardware efficiency in terms of performance, area and power. This motivates us to explore

the possibility of using lower-precision fixed-point arithmetic.

3.3 Low-Precision Fixed-point

3.3.1 Heterogeneous Fixed-Point Representation

Overflow and insufficient precision are the major challenges to using a fixed-point representation.

Overflow can be avoided by allocating enough decimal width (to the left of the radix point) such

that the fixed-point format covers the value range of the data being represented. It is harder to analyze

the sufficiency of precision, but in general it is always better to use a wider fixed-point format.

In Section 3.5.2, our experiments show that different parts of a neural network can have very


diverse value ranges, e.g., one layer’s neuron values can range from −34.8 to 16.9, while another

layer’s bias only ranges within ±3.2 × 10^{−5}.

In scenarios with diverse precision requirements, it is wasteful to use a uniform fixed-point

format that prevents overflow and provides sufficient precision throughout the entire neural

network. Doing so would imply that many leading decimal bits of a small-value-range data

will be constant 0, while the provided precision may be “overkill” for large-value-range data.

Therefore, we propose to use a heterogeneous fixed-point representation, where each part of a

neural network can have its own decimal width and fraction width. It is worth noting that we

permit the custom decimal width to be a negative value for variables with small value ranges.

The decimal width needed to avoid overflow can be formulated as ⌈log2(MaxMagnitude)⌉+1.

For example, for variable ranges within (-0.25, 0.25), the required decimal width would be -1

since the first bit on the right of the radix point does not carry additional information.

3.3.2 Heterogeneous Fixed-Point Arithmetic

Heterogeneous fixed-point arithmetic is a generalized form of uniform fixed-point arithmetic.

We use the dot-product operation as an example to illustrate the design considerations in

heterogeneous fixed-point arithmetic implementation.

Given a dot-product operation y = \vec{A} · \vec{B} and the corresponding fixed-point formats

[Dy.Fy], [DA.FA] and [DB.FB], the operation can be done in the following steps:

1. For each pair of elements (a, b) in \vec{A} and \vec{B},

(a) Perform integer multiplication of a × b to obtain a temporary product t in the

format of [Dt.Ft] where Dt = Da +Db and Ft = Fa + Fb;

(b) Accumulate temporary product t to the sum s, whose fixed-point format is [Ds.Fs].

2. Convert the sum s from format [Ds.Fs] to format [Dy.Fy].

At the accumulation step (1.b), the accumulator s’s fraction width should be as wide as Ft to

preserve the best possible precision, while the decimal width needs to be Dt + ⌈log2(InputVectorLength)⌉

to prevent overflow completely. In our following experiment, the accumulator uses the same

fixed-point format as the temporary product, [Dt, Ft]. In this case, potential overflow could

happen. We discuss how we handle such cases and the last conversion step in the following

subsection.
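The two steps above can be illustrated with the following sketch, which operates directly on raw integer values and, as in our experiments, keeps the accumulator in the product format [Dt.Ft]. The names and the shift-based conversion helper are illustrative, and overflow protection is omitted for brevity:

#include <cstdint>
#include <cstdio>
#include <vector>

// Heterogeneous fixed-point dot product. Values are raw integers; a format
// is just (D, F). Multiplying raw values directly yields a product in the
// format [Da+Db . Fa+Fb].
struct Fmt { int D, F; };

// Convert a raw value from format `from` to format `to` by shifting the
// radix point (truncation when narrowing the fraction; no overflow check).
int64_t convert(int64_t raw, Fmt from, Fmt to) {
    int shift = to.F - from.F;
    return shift >= 0 ? (raw << shift) : (raw >> -shift);
}

int64_t dot(const std::vector<int64_t>& A, Fmt fa,
            const std::vector<int64_t>& B, Fmt fb, Fmt fy) {
    Fmt ft = {fa.D + fb.D, fa.F + fb.F};  // format of each temporary product
    int64_t s = 0;                        // accumulator kept in [Dt.Ft], as in the text
    for (std::size_t k = 0; k < A.size(); ++k)
        s += A[k] * B[k];                 // step 1: integer multiply and accumulate
    return convert(s, ft, fy);            // step 2: convert the sum to [Dy.Fy]
}

int main() {
    // A = {1.5, 2.0} in [4.4] (raw 24, 32); B = {0.25, 0.5} in [2.6] (raw 16, 32).
    std::vector<int64_t> A = {24, 32}, B = {16, 32};
    int64_t y = dot(A, {4, 4}, B, {2, 6}, {6, 4});  // result in [6.4]
    std::printf("y = %f\n", y / 16.0);              // 1.5*0.25 + 2.0*0.5 = 1.375
    return 0;
}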


3.3.3 Conversion Between Fixed-Point Formats

As mentioned above, fixed-point arithmetic requires conversion operations in addition to the

underlying integer arithmetic. When converting a fixed-point number to another format with

wider decimal width or fraction width, the conversion can be done by padding zeros before

or after the original number. When converting to a narrower decimal width format, potential

overflow can occur. To minimize conversion error due to overflow, overflow protection can

be implemented by setting the value to the maximum (or minimum if the original value is

negative) representable value of the target format.

Furthermore, truncation or rounding is needed when converting to a narrower fraction width.

Truncation simply discards the additional fraction bits and does not require extra operations or

hardware cost. Round-to-nearest converts the original value to its closest representable value

of the target format, resulting in less precision loss than truncation. Round-to-nearest can

be carried out by adding a bit 1 at the most significant position of the trailing fraction bits,

as shown in Figure 3.3.

Figure 3.3: Example of converting a fixed-point number from [3.5] format to [3.2] format using round-to-nearest and stochastic rounding schemes.

Stochastic rounding is another rounding method that can effectively

reduce precision loss. It probabilistically rounds the original value to the representable value

just above or just below the original value. The decision to round up or down is based on rolling

a weighted die, weighted by the distance from the original value to the two neighbouring values.

Concretely, given the above and below closest representable values u and l, the probability of

rounding the original value o to u is P = (o − l)/(u − l), and the probability of rounding

down to l is 1 − P. Statistically speaking, stochastic rounding preserves the most precision, as

the expected value of the rounded number equals the original value. Stochastic rounding

can be implemented by truncating the sum of the original number and a random value whose

bit-width equals the number of trailing fraction bits. This method will require a random

number generator and an adder in hardware.
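The three conversion schemes can be sketched as follows for the case of dropping r trailing fraction bits (e.g., converting from the [3.5] format to the [3.2] format drops r = 3 bits); the function names are illustrative:

#include <cstdint>
#include <cstdio>
#include <random>

// Drop r trailing fraction bits from a raw fixed-point value.
int64_t truncate(int64_t raw, int r) { return raw >> r; }

// Round-to-nearest: add a 1 at the most significant trailing position, then truncate.
int64_t round_nearest(int64_t raw, int r) { return (raw + (int64_t{1} << (r - 1))) >> r; }

// Stochastic rounding: add a uniform random value of r bits, then truncate, so the
// expected value of the result equals the original value.
int64_t round_stochastic(int64_t raw, int r, std::mt19937& rng) {
    std::uniform_int_distribution<int64_t> dist(0, (int64_t{1} << r) - 1);
    return (raw + dist(rng)) >> r;
}

int main() {
    std::mt19937 rng(42);
    int64_t x = 0b01001101;  // [3.5] value 2.40625
    std::printf("truncate: %lld, nearest: %lld\n",
                (long long)truncate(x, 3), (long long)round_nearest(x, 3));
    // truncate -> 0b01001 (2.25), nearest -> 0b01010 (2.50)
    std::printf("stochastic: %lld\n", (long long)round_stochastic(x, 3, rng));  // 9 or 10
    return 0;
}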

In the following experiments, we use overflow protection and truncation for fixed-point

format conversion.


3.4 Software Framework

A software framework was built to facilitate experiments that use heterogeneous fixed-point

in neural network computations. The framework supports both inference and training of any

feed-forward neural network architecture consisting of the layers described in Chapter 2. Users

can specify the neural network model in a standalone configuration file, which is read in as input

by the software at run-time. Computations can selectively be carried out in either floating-point

or heterogeneous fixed-point representation.

3.4.1 Object-Oriented Architecture

The software is written in C++ in an object-oriented style. Figure 3.4 shows several key classes

of the design and their relationships in a UML diagram.

Figure 3.4: Key components of the object-oriented software architecture.

From the lowest level, a Matrix class implements N-dimensional arrays of fixed-point or

floating-point data. The underlying container is a one-dimensional array of integers or floating-

point numbers. In the case of fixed-point, all elements in the matrix share the same format,

i.e., decimal width and fraction width.

In this design, the Layer class encapsulates a layer of neurons. It contains two Matrix

objects, one for the layer’s neuron values (named state) and the other one for the corresponding


error derivatives (named deriv). During the forward and backward pass, two member methods,

ApplyActivation and ApplyDerivOfActivation, are respectively invoked to apply activations

(e.g., ReLU) to state or apply the derivative of the activation function to deriv.

Edge is an abstract class that connects two adjacent layers of neurons in the network. It

contains two pointers pointing to the source (src) and destination (dst) Layers. Two vir-

tual methods corresponding to the forward propagation (FP) and backward propagation (BP)

must be implemented by its subclasses. For example, a Maxpooling layer’s FP implementation

downsamples the src layer and assigns values to the neurons in the dst layer.

EdgeWithWeights is also an abstract class, which itself is a subclass of Edge. It is used as

the base class for the Edges with trainable parameters, such as the fully-connected (FC) and

convolutional (Conv) edges. Four Matrix objects are used for storing the weights, bias and their

corresponding gradients. A member method called UpdateWeightsAndBias is implemented to

update the parameters based on the computed gradients from back propagation.

Lastly, the NN class is designed to construct the network based on the user-provided con-

figuration file and to orchestrate the neural network computations. During the forward pass,

this class goes through a series of alternating Layers and Edges, from the input layer to the

output layer, by calling the ApplyActivation and FP functions, respectively. In the backward

pass, this class invokes the Layers’ ApplyDerivOfActivation and Edges’ BP in reverse order,

and calls the UpdateWeightsAndBias method to update parameters.
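The class relationships described above can be summarized by the simplified C++ skeleton below. It mirrors the members listed in Figure 3.4, but the bodies are omitted and a vector of Edge pointers is used for polymorphic dispatch, so it should be read as an approximation of the framework rather than its exact source.

#include <vector>

class Matrix {                         // N-dimensional fixed- or floating-point array
    std::vector<int>   fixpt_matrix;   // used when is_fixpt is true
    std::vector<float> flopt_matrix;   // used otherwise
    bool is_fixpt;
    int  decimal_width, fraction_width;  // shared by all elements of the matrix
};

class Layer {                          // a layer of neurons
public:
    void ApplyActivation();            // applied to state in the forward pass
    void ApplyDerivOfActivation();     // applied to deriv in the backward pass
private:
    Matrix state, deriv;
};

class Edge {                           // abstract connection between adjacent layers
public:
    virtual void FP() = 0;             // forward propagation
    virtual void BP() = 0;             // backward propagation
    virtual ~Edge() {}
protected:
    Layer *src, *dst;
};

class EdgeWithWeights : public Edge {  // base class for edges with trainable parameters
public:
    void UpdateWeightsAndBias();
protected:
    Matrix weights, bias, grad_weights, grad_bias;
};

class FC_Edge      : public EdgeWithWeights { public: void FP() override; void BP() override; };
class Conv_Edge    : public EdgeWithWeights { public: void FP() override; void BP() override; };
class Maxpool_Edge : public Edge            { public: void FP() override; void BP() override; };
class LRN_Edge     : public Edge            { public: void FP() override; void BP() override; };

class NN {                             // builds the network and orchestrates computation
public:
    void ForwardPass();                // alternate ApplyActivation() and FP()
    void BackwardPass();               // ApplyDerivOfActivation() and BP(), then updates
    std::vector<Layer>  layers;
    std::vector<Edge*>  edges;         // pointers, for polymorphic dispatch
};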

3.4.2 Model Configuration File

The model configuration file is designed using Google protocol buffers [3], a language-neutral and

platform-neutral mechanism for serializing structured data. Our implementation of the model

configuration file leverages a subset of protocol buffer features, including the human-readable

text format, auto-generated source code for serializing and parsing between the configuration file and in-memory data objects, and auto-generated data-access classes that are easy to use programmatically.

Figure 3.5 shows a snippet of the configuration file for the last three layers of the MNIST

model described in Section 2.3.1. In the configuration file, a layer structure specifies the name,

dimensions, activation function type, and data representation for the layer of neurons. The

data representation is described by a structure with fields indicating whether a fixed-point

representation should be used and also the fixed-point format. For example, the layer named


edge {
  name: "conv_edge_2"
  src_name: "maxpooling_layer_1"
  dest_name: "conv_layer_2"
  edge_type: CONVOLUTIONAL
  filter_rows: 5
  filter_cols: 5
  stride: 1
  weights_data_rep { ... }
  bias_data_rep { ... }
}
edge {
  name: "maxpooling_edge_2"
  src_name: "conv_layer_2"
  dest_name: "maxpooling_layer_2"
  edge_type: MAXPOOL
  filter_rows: 2
  filter_cols: 2
  stride: 1
}
edge {
  name: "output_fc_edge"
  src_name: "maxpooling_layer_2"
  dest_name: "output_layer"
  edge_type: FC
  weights_data_rep {
    is_fixpt: true
    bit_width: 8
    fraction_width: 8
  }
  bias_data_rep { ... }
}
layer {
  name: "conv_layer_2"
  num_channels: 40
  num_image_rows: 8
  num_image_cols: 8
  activation: RECTIFIED_LINEAR
  state_data_rep {
    is_fixpt: true
    bit_width: 8
    fraction_width: 1
  }
}
layer {
  name: "maxpooling_layer_2"
  num_channels: 40
  num_image_rows: 4
  num_image_cols: 4
  activation: RECTIFIED_LINEAR
  state_data_rep {
    is_fixpt: true
    bit_width: 8
    fraction_width: 2
  }
}
layer {
  name: "output_layer"
  num_channels: 1
  num_image_rows: 1
  num_image_cols: 10
  activation: SOFTMAX
  state_data_rep {
    is_fixpt: true
    bit_width: 8
    fraction_width: 1
  }
}
layer {
  name: "maxpooling_layer_1"
  num_channels: 20
  num_image_rows: 12
  num_image_cols: 12
  activation: RECTIFIED_LINEAR
  state_data_rep {
    is_fixpt: true
    bit_width: 8
    fraction_width: 4
  }
}

Figure 3.5: A snippet of a model configuration file.

‘maxpooling layer 1’, which is the output of the first maxpooling edge in the MNIST model

(not shown in the snippet), has 20 12×12 feature maps that are represented in 8-bit fixed-

point with the 4 least-significant bits allocated for fraction bits. An edge structure specifies the

names of its source and destination layers, the edge type, additional parameters associated with

the specific edge type (e.g., the filter size and stride for convolutional and maxpooling edges),

and also the data representations to be used for weights and bias (if the edge contains trainable

parameters). For instance, the second convolutional layer in the MNIST model convolves its

input feature maps with 40 5×5×20 filters with a stride of 1. The edge named ‘conv edge 2’

in the configuration file describes such computation. The source and destination of the edge

are the ‘maxpooling layer 1’ and ‘conv layer 2’ layers. Since the depth of the filters (i.e., 20) always equals the number of input feature maps, and the number of filters (i.e., 40) is implied by the number of output feature maps, we can omit this information from the edge structure; it is already specified in the source and destination layer structures.
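To illustrate how such a text-format file might be consumed, the sketch below parses it with the protocol buffer library. The message type ModelConfig and the generated header name are hypothetical stand-ins for the auto-generated classes; only the TextFormat call reflects the actual protobuf API.

#include <fstream>
#include <sstream>
#include <string>
#include <google/protobuf/text_format.h>
#include "model_config.pb.h"   // hypothetical auto-generated header

// Read the human-readable text-format file and populate the in-memory message.
bool LoadModelConfig(const std::string &path, ModelConfig *config) {
    std::ifstream in(path);
    std::stringstream buffer;
    buffer << in.rdbuf();
    return google::protobuf::TextFormat::ParseFromString(buffer.str(), config);
}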


3.5 Experimental Study

3.5.1 Uniform Fixed-Point in Neural Network Training

We start by using uniform fixed-point representation to train the MNIST model described in

Section 2.3.1. All neurons’ states and derivatives, as well as all parameter values and gradients

are represented in a uniform fixed-point format with the same decimal width and fraction width.

All multiply and add operations are performed in fixed-point arithmetic. The only non-linear

function used in the MNIST model is the softmax function. It is performed in floating-point

format, where fixed-point inputs and outputs are converted to/from floating-point before and

after the calculation.
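The conversion around the softmax call amounts to scaling by 2^fraction_width; a minimal sketch (with helper names of our choosing) is:

#include <stdint.h>

/* Interpret a fixed-point integer with the given fraction width as a float. */
static inline float fixpt_to_float(int32_t x, int fraction_width)
{
    return (float)x / (float)(1 << fraction_width);
}

/* Convert a float back to fixed-point by scaling and truncating. */
static inline int32_t float_to_fixpt(float x, int fraction_width)
{
    return (int32_t)(x * (float)(1 << fraction_width));
}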

In the experiment, we fix the total bit-width at 32 bits and vary the radix point position across a range of configurations. Figure 3.6 reports the best validation accuracy achieved by each fixed-


Figure 3.6: Training of MNIST model using 32-bit fixed-point representations.

point configuration during 100 epochs of training. The red line corresponds to the baseline

accuracy achieved by using single-precision floating-point. Both [6.26] and [8.24] fixed-point

formats achieve the highest accuracy of 98.94%, which is very close to the 98.98% accuracy achieved using floating-point. In general, the model achieves better validation accuracy when better precision is preserved by allocating more bits to the fraction part. However, when the fraction width is increased to 28 bits, the model stops converging to a higher accuracy. Most likely this is the result of frequent overflow caused by the small decimal width of 4 bits. At the other end, the model trained with the [16.16] fixed-point format does not achieve accuracy higher than 85.22%, due to insufficient precision and potential underflow.


3.5.2 Value Range Profiling in Floating-Point Neural Network Training

To better understand the reason behind the lower accuracy obtained when using narrower

decimal width or fraction width, we examine each matrix’s value range during the training

with a floating-point representation. In Table 3.1, we report the maximum and minimum

Table 3.1: The value range of each matrix during training using floating-point.

Layer/Edge     Matrix             Max. Value   Min. Value   Decimal Width Needed
input layer    state              1            0            1
               deriv              3.28e+00     -2.89e+00    3
conv 1 layer   state              5.23e+00     -3.44e+00    4
               deriv              3.91e-01     -4.46e-01    0
conv 2 layer   state              1.69e+01     -3.48e+01    7
               deriv              5.31e-01     -5.34e-01    1
output layer   state              3.46e+01     -3.29e+01    7
               deriv              9.99e-01     -1.00e+00    1
conv 1 edge    weights value      7.72e-01     -4.98e-01    1
               weights gradient   1.12e-01     -1.28e-02    -2
               bias value         3.20e-05     -3.20e-05    -13
               bias gradient      5.20e-05     -6.30e-05    -12
conv 2 edge    weights value      2.32e-01     -2.26e-01    -1
               weights gradient   2.47e-01     -1.31e-01    -1
               bias value         6.10e-05     -6.20e-05    -12
               bias gradient      2.30e-05     -2.00e-05    -14
fc edge        weights value      3.25e-01     -4.32e-01    0
               weights gradient   2.22e-01     -9.51e-02    -1
               bias value         1.59e-04     -1.69e-04    -11
               bias gradient      2.70e-05     -2.90e-05    -14

values that have ever appeared in each matrix during the 100-th training epoch (not including

the first 99 training epochs). The last column reports the minimum decimal width required

to prevent overflow if fixed-point representation is used. The results of maxpooling layers are

skipped because their value ranges are the same as their previous convolutional layers. From the

table, we observe a wide swing in the required decimal width across matrices. The conv 2 layer

and output layer require the decimal width to be as wide as 7 bits. This confirms that the reason

for lower accuracy when using [4.28] format is indeed due to overflow caused by small decimal

width. On the other hand, the bias value and gradient matrices tend to have very small value ranges, requiring less than -12 bits of decimal width. Moreover, a parameter is updated during training by incrementing itself with the term Gradient × LearningRate. The Gradient is initially larger (than the values reported here at the 100-th epoch) and diminishes as learning proceeds, since the model gets closer to convergence. The LearningRate is set to 10^-3 in our experiment. This means that, when the [16.16] fixed-point format is used, the term


Gradient × LearningRate will underflow if the gradient has a magnitude smaller than 2^-16/10^-3 ≈ 1.526 × 10^-2, and there will be no update to the parameter. This explains why the accuracy does not converge any higher than 85.22% when the [16.16] fixed-point

format is used.
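One way to turn a profiled value range into the required decimal width, and to restate the underflow bound above, is sketched below; the helper names are ours, and the +1 term assumes the decimal width includes the sign bit, which is consistent with the widths reported in Table 3.1.

#include <math.h>

/* Minimum decimal width (including the sign bit) needed to avoid overflow
   for values whose magnitude does not exceed max_magnitude. */
int decimal_width_needed(double max_magnitude)
{
    return (int)ceil(log2(max_magnitude)) + 1;
}

/* Smallest gradient magnitude that still produces a non-zero parameter update
   when the format's resolution is 2^-fraction_width. */
double min_updatable_gradient(int fraction_width, double learning_rate)
{
    return pow(2.0, -fraction_width) / learning_rate;
    /* e.g., fraction_width = 16, learning_rate = 1e-3  ->  ~1.526e-2 */
}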

3.5.3 Heterogeneous Fixed-point in Neural Network Inference

The above experiments show that it is feasible to train a neural network in fixed-point repre-

sentation if a proper configuration is used. However, it is not trivial to find a good fixed-point

configuration without doing any test runs, e.g., training with floating-point representation and

profiling the value ranges. This could lead to practical difficulty when using fixed-point in

neural network training. On the other hand, neural network inference can take advantage of

knowing the value ranges of a pre-trained model and customizing the fixed-point representation

with reduced bit-width. We therefore propose to tailor the decimal width based on the profiled

value range and analyze the accuracy change while reducing the total bit-width by shrinking

the fraction part of the data representation. The experiment is conducted on both uniform

and heterogeneous fixed-point configurations. The decimal width of the uniform fixed-point

representation is set to 7 bits, the widest decimal width needed according to the profiling above. For hetero-

geneous fixed-point representation, the decimal width is tailored for each matrix based on its

value range. It is worth noting that the value ranges are profiled during the training process, where the training dataset is used. This means that these custom fixed-point configurations are not specialized to the validation dataset used in this inference experiment.

Figure 3.7 shows the validation accuracy at each bit-width setting. On the left end of the

figure, where the bit-width is as wide as 24 bits, both representations can achieve 98.98% of

validation accuracy, which is the same as that when using floating-point representation. For

uniform fixed-point representation, the accuracy does not decrease by more than 0.02% until

the bit-width is reduced to 13 bits. Then the accuracy starts to drop significantly as the bit-

width is further reduced. For heterogeneous representation, more bit-width can be reduced

without causing much accuracy degradation. Even when the bit-width is reduced to 8 bits, the

validation accuracy is only 0.21% worse than the floating-point baseline.



Figure 3.7: Inference of MNIST model using uniform and heterogeneous fixed-point representations.

3.5.4 Heterogeneous Fixed-point in AlexNet Inference

Given that the above data show that using a heterogeneous representation can effectively reduce the data bit-width, we applied the same approach to the AlexNet model described in Section 2.3.2.

Table 3.2 shows the matrix value range profiled from the pre-trained AlexNet. Based on

the profiled result, we customize each matrix’s decimal width to the minimum width required

to avoid overflow. We then perform inference of AlexNet using three bit-width settings. The

first two settings use 32-bit and 16-bit fixed-point representations for all matrices. In the

third setting, 16-bit fixed-point is used for neuron matrices (layer), while 8-bit fixed-point

is used for parameter matrices (edge). This setting is based on the fact that most neuron

matrices require more than 8 decimal bits, while none of the parameter matrices requires more

than 2 decimal bits. The inference accuracy achieved with each of the heterogeneous fixed-

point settings is reported in Table 3.3. Inference with 32-bit and 16-bit heterogeneous fixed-

point achieve similar results, corresponding to 1.7% and 1.2% degradations in top-1 and top-

5 accuracy versus floating-point, respectively. Using the 8-bit parameter and 16-bit neuron

setting, top-1 and top-5 accuracy are 3.2% and 2.7% worse than that of using floating-point.

Broadly speaking, the results presented in this chapter are encouraging in that they show

that fixed point representations can be used for neural network inference without significant

accuracy loss.


Table 3.2: Value range profiling of AlexNet.

Layer       Range Magnitude   Decimal Width Needed
input       1.61e+02          9
conv 1 1    3.09e+03          13
conv 1 2    2.17e+03          13
lrn 1 1     1.39e+02          9
lrn 1 2     1.39e+02          9
conv 2 1    7.30e+02          11
conv 2 2    6.18e+02          11
lrn 2 1     1.39e+02          9
lrn 2 2     1.39e+02          9
conv 3 1    3.75e+02          10
conv 3 2    4.44e+02          10
conv 4 1    2.53e+02          9
conv 4 2    3.54e+02          10
conv 5 1    2.24e+02          9
conv 5 2    2.85e+02          10
fc 6        1.31e+02          9
fc 7        1.65e+01          6
fc 8        1.11e+01          5

Edge        Matrix    Range Magnitude   Decimal Width Needed
conv 1 1    weights   4.03e-01          0
            bias      8.05e-01          1
conv 1 2    weights   3.71e-01          0
            bias      8.54e-01          1
conv 2 1    weights   3.79e-01          0
            bias      1.03e+00          2
conv 2 2    weights   4.16e-01          0
            bias      1.03e+00          2
conv 3 1    weights   3.22e-01          0
            bias      9.05e-02          -2
conv 3 2    weights   5.12e-01          1
            bias      1.04e-01          -2
conv 4 1    weights   3.22e-01          0
            bias      1.14e+00          2
conv 4 2    weights   3.53e-01          0
            bias      1.22e+00          2
conv 5 1    weights   2.54e-01          0
            bias      1.50e+00          2
conv 5 2    weights   3.15e-01          0
            bias      1.74e+00          2
fc 6        weights   4.84e-02          -3
            bias      1.06e+00          2
fc 7        weights   5.21e-02          -3
            bias      1.26e+00          2
fc 8        weights   6.74e-02          -2
            bias      4.25e-01          0

3.6 Related Work

Several approaches have been proposed recently to use low-precision data representations in

neural networks. The work highlighted here has been done concurrently with the research work

in this thesis.

In [19], Gupta et al. propose to train neural networks in uniform fixed-point with the use

of stochastic rounding. Their experiments show that the convolutional neural networks trained

with [2.14] or [4.12] 16-bit uniform fixed-point format do not converge when the round-to-nearest

scheme is used. Conversely, models trained with stochastic rounding can achieve test errors that are very close to the floating-point baseline. For the CIFAR10 dataset4, the authors also find that the model trained with the [4.12] fixed-point format stops learning after some point in the training process due to diminishing gradients. This aligns with our observation in Section 3.5.2,

where training with insufficient precision causes underflow in parameter updates and leads to

4 CIFAR10 is a collection of 32×32 colour images categorized into 10 classes. There are 50,000 training images and 10,000 test images.


Table 3.3: Inference Accuracy of AlexNet With Different Heterogeneous Fixed-point Representations.

Data Representation     Top-1 Accuracy   Top-5 Accuracy
floating-point          57.40%           80.40%
32-bit fixed-point      55.73%           79.19%
16-bit fixed-point      55.75%           79.18%
8/16-bit fixed-point    54.25%           77.69%

poor model convergence. To overcome this problem, they increment the fraction width by 4

bits to [4.16] format after 100 epochs of training to keep the learning going. The test error can

then be further reduced and converged near the floating point baseline.

In [18], Courbariaux et al. compare three types of data representations to be used in neural network training. In addition to floating-point and uniform fixed-point formats, they also propose to use a dynamic fixed-point format, which can be thought of as a dynamically-adjustable form of the heterogeneous fixed-point. Besides customizing each matrix's format, dynamic fixed-point allows one to update the decimal width during the training process. They keep track of the overflow rate for each matrix and periodically compare the overflow rate to an overflow threshold. The decimal width is then updated under two conditions: a) if a matrix's overflow rate is higher than the threshold, its decimal width is incremented by 1 to reduce overflow, and b) if the overflow rate is lower than half of the threshold, the decimal width is decremented by 1 to gain more precision. In order to work around the above-mentioned diminishing gradient

problem, they use a higher precision format for parameters during the updates than during the

forward- and backward-propagations so that the small parameter changes can be accumulated.

They are able to achieve validation accuracy that is very close to the floating point baseline

using 20-bit uniform fixed-point or 10-bit dynamic fixed-point (12-bit for parameters during

update).
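The overflow-driven adjustment rule can be paraphrased in a few lines of C-style code; the variable names and update cadence below are illustrative only, not taken from [18].

/* Periodically adjust a matrix's decimal width from its observed overflow rate. */
void adjust_decimal_width(int *decimal_width, double overflow_rate,
                          double overflow_threshold)
{
    if (overflow_rate > overflow_threshold)
        (*decimal_width)++;     /* too many overflows: widen the integer part  */
    else if (overflow_rate < overflow_threshold / 2.0)
        (*decimal_width)--;     /* very few overflows: reclaim a precision bit */
}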

Courbariaux et al. further explore the possibility of using a more restricted form of data representation to enable more efficient hardware designs. In [17], they introduced a technique called BinaryConnect that stochastically binarizes the real-valued weights to be either +1 or −1 during the forward propagation. The real-valued weights are limited to be within the range

from -1 to +1, and the probability of rounding a weight to +1 is proportional to how “close” the

weight is to +1. By doing so, each multiplication between a neuron and a binarized weight in

the forward pass can be converted into a conditional sign-inversion operation. The authors have


successfully trained DNNs with BinaryConnect on the MNIST, CIFAR10 and SVHN5 datasets

and achieved nearly state-of-the-art accuracy performance. We agree with the authors that

such a method can greatly speedup neural network computation, as well as improve area and

energy efficiency, especially for a specialized hardware implementation. However, the datasets

used by the authors are small compared to large-scale datasets, such as ImageNet. Thus, it is

still unknown if the BinaryConnect approach will be applicable to more sophisticated models (e.g.,

AlexNet).

5SVHN is a colour image dataset of street view house numbers.

Chapter 4

System Architecture

This chapter introduces the system architecture developed for accelerating the inference com-

putation of neural networks in an embedded environment. Section 4.1 discusses the design

considerations for the accelerator on an FPGA and proposes our accelerator design. Section 4.2

describes the overall system, structured as a processor-accelerator SoC system. In Section 4.3,

we then describe the following in detail: 1) the tiling software that decomposes a layer of

computation into smaller tiles based on the available on-FPGA storage, and 2) the hardware

controller that orchestrates the data access and computation of the accelerator pipeline.

4.1 Accelerator Design

A natural, intuitive implementation of a neural network accelerator would be to physically lay

out the entire neural network structure within the hardware circuit. In such a scheme, each

connection with a synaptic weight would require a multiplier, and each neuron would require an

activation unit and several adders to sum the multiplier outputs. This is not realistic for state-

of-the-art neural networks due to their large model sizes. For example, AlexNet has around

650K neurons and 60M weights. The corresponding requirements for hardware resources are far

beyond any modern FPGA. Therefore, the computations of a neural network must be divided

into smaller tasks, such that the accelerator can be time-shared among the tasks.

As discussed in Section 2.4, the computations of convolutional and fully-connected layers

require more than 90% of total run-time in both CPU and GPU implementations. Hence, the

focus of our design is on the acceleration of these two types of layers. The computations of

these two layers are essentially multiply and accumulate (MAC) operations (weights are mul-


tiplied by neuron outputs). Meanwhile, the operations used in max-pooling and local response

normalization layers are quite different. The max-pooling layer repeatedly compares neuron

values to find the largest neuron value in each pooling region, while the operations used in local

response normalization (LRN) layer include exponentiation, summation and division. This im-

plies that an accelerator circuit designed for convolutional and fully-connected layers cannot be

used to implement max-pooling or LRN. Since the majority of run-time is spent on the MAC

operations, we opted to focus on the MAC operations in our accelerator design and not have

accelerator support for the max-pooling and LRN layers. The max-pooling and LRN layers will

be implemented in software and executed on a processor (on the FPGA die).

The following subsections will review the computations of convolutional and fully-connected

layers, highlight the design considerations of the accelerator and then propose an accelerator

design.

4.1.1 Computation and Data Access

ofm[OFM_Y][OFM_X][OFM_Z];        // Output feature maps.
ifm[IFM_Y][IFM_X][IFM_Z];        // Input feature maps.
kernel[OFM_Z][K_Y][K_X][IFM_Z];  // Filter kernels.
bias[OFM_Z];                     // Bias associated with output feature map.

// Iterate through all neurons in output feature maps.
for (oy = 0; oy < OFM_Y; oy++) {
  for (ox = 0; ox < OFM_X; ox++) {
    for (oz = 0; oz < OFM_Z; oz++) {
      // Compute the dot-product of the associated kernel and the corresponding receptive field.
      for (ky = 0; ky < K_Y; ky++) {
        for (kx = 0; kx < K_X; kx++) {
          for (kz = 0; kz < IFM_Z; kz++) {
            ofm[oy][ox][oz] += kernel[oz][ky][kx][kz] *
                               ifm[stride_y * oy + ky][stride_x * ox + kx][kz];
      } } }
      // Accumulate with the bias associated with the output feature map.
      ofm[oy][ox][oz] += bias[oz];
      // Perform activation.
      ofm[oy][ox][oz] = activation(ofm[oy][ox][oz]);
} } }

Listing 4.1: Pseudo code of a convolutional layer.

Listing 4.1 shows the pseudo code for a convolutional layer. The three outer loops iterate

through the neurons in the output feature maps along the row, column and depth dimensions,


respectively. The three inner loops compute the dot-product of the associated kernel and the

corresponding receptive field in the input feature maps. The output neuron is then accumulated

with the associated bias, and passed through the activation function at the end.

From the pseudo code, we can see that the number of MAC operations is equal to the size of

output feature maps times the size of a receptive field, i.e., OFM_Y × OFM_X × OFM_Z × K_Y × K_X × IFM_Z. For the largest convolutional layer in AlexNet, the number of MAC operations

is as large as 75 million. Since each MAC operation accepts two inputs, the number of data

accesses may be two times the number of MAC operations (assuming no data re-use).

output[N_O];         // Output neurons.
input[N_I];          // Input neurons.
weights[N_O][N_I];   // Synaptic weights.
bias[N_O];           // Bias associated with output neurons.

for (o = 0; o < N_O; o++) {
  for (i = 0; i < N_I; i++)
    output[o] += weights[o][i] * input[i];
  output[o] += bias[o];
  // Perform activation.
  output[o] = activation(output[o]);
}

Listing 4.2: Pseudo code of a fully-connected layer.

The computations of a fully-connected layer also require a large number of data accesses and MAC operations. As shown in the pseudo code in Listing 4.2, a fully-connected layer with N_O output neurons and N_I input neurons will have N_O × N_I synaptic weights and N_O × N_I

MAC operations. As an example, for the first fully-connected layer in AlexNet, there are 9,216

input neurons and 4,096 output neurons, which correspond to 36 million weights and MAC

operations.

4.1.2 Data Reuse and On-Chip Storage

Recall that AlexNet has roughly 60 million parameters and 650 thousand neurons. The data

requirements are much more than the capacity of the RAM blocks available on modern FPGAs,

and hence, some data must be stored in off-chip SDRAM. The number of data accesses required

for inference will become a performance bottleneck if all data accesses must be from off-chip

SDRAM. Fortunately, there exists a data reuse opportunity for the weights, as well as for the

input and output neurons. As such, on-FPGA buffers can be used to temporarily store a part


of data for reuse to reduce the off-chip memory traffic.

Based on the above pseudo code, every filter in a convolutional layer is shared by all neurons

in its associated output feature map and hence is reused OFM_Y × OFM_X times. Every neuron in the input feature maps is accessed by all receptive fields that cover it, which leads to (K_Y × K_X)/(stride_y × stride_x) × OFM_Z reuses. The neurons in the output feature maps are also reused at each one of the K_Y × K_X × IFM_Z accumulation operations. For fully-connected layers, the input and output neurons are reused N_O and N_I times, respectively.

the reuse opportunities and minimize off-chip memory access overhead, the proposed hardware

accelerator contains three buffers to cache input neurons, output neurons and weights.

4.1.3 Accelerator Structure versus Data Width of On-Chip Buffer

Given the large number of MAC operations involved in convolutional and fully-connected layers,

the accelerator design should naturally focus on producing high MAC operation throughput.

This can be achieved by including a large number of multipliers and adders in the custom

hardware design. These operators can be pipelined so that a new set of MAC operations can be

launched every clock cycle. A simple structure for the accelerator design can be similar to the

dot-product operator shown in Section 3.2. If M multipliers are used in such a structure, the

accelerator will accumulate one output neuron with the dot-product of M pairs of weights and

inputs every clock cycle. This implies a data bandwidth requirement of as many as 2×M + 1

elements per cycle to feed the dot-product pipeline, where the additional 1 is to access the

current neuron output value. The data bandwidth will then require both the input neuron

buffer and weights buffer to have a data width of M elements. Since the input neurons are

reused by different output neurons, such a data width requirement can be reduced by caching a smaller set of input neurons and sharing them among several output neurons in parallel. We can distribute the M multipliers into M_O dot-product operators, where each dot-product operator contains M_I multipliers (M_O × M_I = M). The M_O dot-product operators will share a common set of M_I input neurons and accumulate M_O different output neurons in parallel. In this case, each dot-product operator will receive a different set of M_I weights and hence, the weights buffer still needs to provide M total weights every cycle. In so doing, the data bandwidth requirement can be reduced to M + M_O + M_I elements per cycle.
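As a purely illustrative sizing (not one used in this thesis), consider M = 64 multipliers organized as M_O = 8 dot-product operators of M_I = 8 multipliers each: the flat dot-product structure would require 2 × 64 + 1 = 129 elements per cycle, whereas the shared-input structure requires only 64 + 8 + 8 = 80 elements per cycle.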



Figure 4.1: Accelerator Design.

4.1.4 Accelerator Structure

Based on the above design considerations, we now present the accelerator design. Figure 4.1

shows the schematic of the accelerator. Three on-chip buffers are instantiated as explained

above to promote data reuse of weights, input neurons and output neurons. The data widths

of the on-chip buffers are matched with the data bandwidth requirements of the compute unit.

The input reader and weights reader will read a new set of inputs and weights from the buffers

and feed them to the compute unit every cycle. Since the input and weights buffers may not

store all necessary data for computing an output neuron, the partially-computed output (a

partial-sum of input and weights multiplies) will be temporarily stored in the output buffer

and get swapped back to the compute unit when the new inputs and weights become available.

Consequently, both read and write access are needed for the output buffer.

Delving further into the compute unit, the first part consists of M_O dot-product operators. In the accumulation stage, each dot-product operator output is added to one of the M_O partial sums. Multiplexers are used to choose the partial sums to be accumulated, between the

swapped-back values or the ones stored in the registers. If an output neuron has accumulated

all necessary multiplies, activation can be performed and the result will be written to the out-

put buffer. In the case of AlexNet, all the layers use a rectified linear unit as the activation

function, which can be implemented with a 2-to-1 multiplexer that selects between the original

value or zero, based on the sign bit of the input. Meanwhile, a partially-computed output can

also be swapped to the output buffer so that the compute unit can continue working on the

other output neurons using the available data in input buffer and weights buffer.

As shown in the figure, for data widths we use a 16-bit fixed-point representation for all

input neurons, weights and output neurons. This choice is mainly based on three reasons,

1) 16-bit heterogeneous fixed-point format is sufficient to perform inference of AlexNet with

nearly no accuracy degradation (Section 3.5.4); 2) 16-bit data can be represented by primitive

data types in software (e.g., short int), which makes it convenient to integrate the accelerator

in a processor-accelerator co-design system (to be described in Section 4.2); and 3) reducing

the input bit width does not further yield significant hardware benefit for the dot-product

operators, as long as the input bit-width is less than or equal to 18 bits (Section 3.2). During

the computation, the multiplier outputs and adder outputs are 32-bit wide to preserve higher

precision. Therefore, when partial sums or final outputs are swapped out of the compute unit,

they need to be converted back to 16-bit format so that they can be stored in the output buffers.

Similarly, partial sums swapped back from the output buffer also need to be converted from 16-bit format to 32-bit format. Both conversions are done by shifting. The shifting amount depends on the heterogeneous fixed-point formats used for the inputs, weights and outputs. The number of bits to be shifted to the left or right is given by FractionWidth_input + FractionWidth_weight − FractionWidth_output.
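A minimal C-style sketch of this conversion is given below; it assumes arithmetic right shifts and omits the saturation logic, so it only illustrates the shift-amount formula rather than the exact hardware behaviour.

#include <stdint.h>

/* Narrow a 32-bit partial sum (fraction width f_in + f_w after the multiply)
   to the 16-bit output format with fraction width f_out. */
static inline int16_t narrow_partial_sum(int32_t acc, int f_in, int f_w, int f_out)
{
    int shift = f_in + f_w - f_out;
    return (int16_t)(shift >= 0 ? (acc >> shift) : (acc << -shift));
}

/* Widen a stored 16-bit partial sum back to the 32-bit accumulator format. */
static inline int32_t widen_partial_sum(int16_t val, int f_in, int f_w, int f_out)
{
    int shift = f_in + f_w - f_out;
    return shift >= 0 ? ((int32_t)val << shift) : ((int32_t)val >> -shift);
}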

4.2 System Design

With the accelerator design in mind, we choose to realize our acceleration solution using a

System-on-Chip architecture comprising a processor and an FPGA device. The main idea be-

hind this is to perform the compute-intensive work within an FPGA accelerator, while the

remaining computational work is implemented in software and executed on the processor. For


example, the computations of the max-pooling and LRN layers are carried out on the proces-

sor. An off-chip SDRAM is shared by both the processor and FPGA to store all pre-trained

parameters (e.g. weights and neuron biases) and intermediate neuron values.

Application Layer (processor side):
    nn.init( input_model )
    nn.load_parameters()
    download_images()
    for_each image {
      preprocess_image()
      nn.forward()
    }

Translation Layer:
    void NN::forward() {
      ConvFpgaBackend(fm_1, fm_2, kernel)
      Maxpool(fm_2, fm_3)
    }
    // Implemented in software.
    Maxpool(fm_2, fm_3) {...}
    ConvFpgaBackend(ifm, ofm, kernel) {
      - Decompose into tiles
      - Move data DDR->FPGA
      - Send instructions
      - Move data FPGA->DDR
    }
    Accelerator Controller (FPGA side): translates each instruction into
    cycle-by-cycle control signals.

Accelerator Layer (FPGA side): input, weights and output buffers; input and
weights readers; output reader and writer; compute unit.

Figure 4.2: Three abstract layers in the overall system.

Conceptually, the proposed system can be divided into three abstract layers. From the

bottom up, the lowest layer corresponds to the accelerator design described above. We call

this layer the Accelerator Layer. To make use of the accelerator, we use a Translation Layer

that implements two back-end API functions for convolutional and fully-connected layers. Both APIs operate at a “per-layer” granularity, meaning that each call to the API corresponds to the

computation of an entire layer of the neural network. The first part of the translation layer is

implemented in software. It decomposes a layer of computation into tiles based on the available

on-FPGA storage (i.e., the buffers). For each tile, the software first issues requests to move the

corresponding input neurons and weights from off-chip SDRAM to on-chip buffers. Then, it

sends an instruction that specifies the tile information to the hardware on the FPGA. Following

this, after the accelerator finishes computation, the software will initiate a data transfer to move

the outputs from the on-chip buffer to off-chip SDRAM. The second part of the translation

layer is an Accelerator Controller, implemented as a hardware component on the FPGA. The

accelerator controller receives the instruction from the processor side and translates it into

cycle-by-cycle control signals to control the buffer readers and writers, as well as the compute

unit. We will discuss the tiling scheme, instruction content and accelerator controller in the

subsequent sections.


Lastly, an Application Layer is completely implemented in software and runs on the proces-

sor. It is responsible for constructing a neural network based on user-specified input, reading

pre-trained model parameters from disk to initialize the parameter matrices, preparing input

samples, which may be stored on disk or fed directly from network, and so on. The application

software will call backend APIs to off-load computations to the FPGA accelerator. It is worth

noting that all of these software implementations, including the backend APIs are integrated

into our software framework described in Section 3.4.

4.3 Translation Layer Details

In the proposed architecture, the computation work for fully-connected and convolutional layers

can be broken down into three levels. The first level refers to the tiling step of the translation

layer, where a layer of computation is divided into tiles based on the available buffer storage

on FPGA. After moving the required data from off-chip memory to the on-chip buffer, the

tiling software will send the accelerator controller an instruction describing the current tile. In

software, the instruction is simply organized as a C struct where each field represents specific

information. For example, the instruction contains a flag to indicate the layer type (fully-

connected or convolutional), two integers to represent the numbers of output and input neurons

in the current tile, etc. At the second level, the accelerator controller receives the instruction

and further breaks down the tile into smaller blocks such that the accelerator pipeline can start

a new block of computation every clock cycle. Each block will correspond to M_O × M_I MAC operations, involving M_O output neurons, M_I input neurons and M_O × M_I weights. At the last

level, the accelerator performs the corresponding operations based on the control signals issued

by the accelerator controller.
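A possible shape for that instruction struct is sketched below. The field names and widths are illustrative; the thesis only specifies that the instruction carries the layer type, the tile dimensions and related control flags.

#include <stdint.h>

/* One instruction describes one tile of work for the accelerator controller. */
typedef struct {
    uint8_t  layer_type;        /* fully-connected or convolutional             */
    uint8_t  apply_activation;  /* apply ReLU once all products are accumulated */
    uint16_t num_outputs;       /* output neurons in the current tile           */
    uint16_t num_inputs;        /* input neurons in the current tile            */
    uint16_t filter_rows;       /* convolutional tiles only                     */
    uint16_t filter_cols;       /* convolutional tiles only                     */
    uint8_t  shift_amount;      /* partial-sum format-conversion shift          */
} accel_instruction_t;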

We now explain the translation layer design with the help of pseudo-code organized into the

three-level structure described above.

4.3.1 Translation Layer for Fully-Connected Layers

We organize the pseudo-code for the fully-connected layer into three sets of loop nests as shown

in Listing 4.3. The first loop nest corresponds to the tiling step of the translation layer that

is implemented in software. We select the tiling sizes T_O and T_I for output and input neurons such that the required storage sizes (T_O output neurons, T_I input neurons and T_O × T_I weights)


 1  output[N_O];         // Output neurons.
 2  input[N_I];          // Input neurons.
 3  weights[N_O][N_I];   // Synaptic weights.
 4  bias[N_O];           // Bias associated with output neurons.

 6  // Loop nest 1: tiling step of the translation layer.
 7  for (ot = 0; ot < N_O; ot += T_O) {                 // Loop 1-1.
 8    // Move bias to output buffer as the initial values for
 9    // output neurons.
10    bias[ot : ot + T_O] -> output_buffer
11    for (it = 0; it < N_I; it += T_I) {               // Loop 1-2.
12      input[it : it + T_I] -> input_buffer
13      weights[ot : ot + T_O][it : it + T_I] -> weights_buffer
14      issue_instruction();

16      // Loop nest 2: accelerator controller breaks a tile into
17      // smaller blocks.
18      for (ob = ot; ob < ot + T_O; ob += M_O) {       // Loop 2-1.
19        for (ib = it; ib < it + T_I; ib += M_I) {     // Loop 2-2.
20          issue_control_signals();

22          // Loop nest 3: accelerator takes in M_I inputs
23          // and M_O * M_I weights and updates M_O partial
24          // sums every cycle.
25          for (o = ob; o < ob + M_O; o++) {           // Loop 3-1.
26            for (i = ib; i < ib + M_I; i++)           // Loop 3-2.
27              output[o] += input[i] * weights[o][i];
28            if (i == N_I - 1) output[o] = activation(output[o]);
29          }
30      } } // End of the loop nest 2.
31    } // End of input neuron tiling iteration.

33    // Move computed output neurons from output buffer to
34    // main memory.
35    output_buffer -> output[ot : ot + T_O];
36  }

Listing 4.3: Pseudo code of a fully-connected layer with tiling.


Figure 4.3: Tiling and traversal of fully-connected layers.

are smaller than or equal to the sizes of the corresponding on-chip buffers. T_O bias values, associated with the output neurons in the current tile, will be moved to the output buffer (line 10) and will be loaded into the compute unit as the initial partial-sum values. Since the buffer sizes are limited, it will take multiple iterations to move all required inputs and weights to compute the T_O outputs (lines 11-13). Every time a tile of inputs and weights is moved to the on-chip buffers, an instruction is issued by the tiling software to the accelerator controller to start the corresponding computation (line 14). In the case of a fully-connected layer, the instruction specifies: 1) the layer type, 2) the numbers of input and output neurons in the current tile, and 3) whether activation should be performed for output neurons after accumulating all products of inputs and weights in the current tile.
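As a purely illustrative example (the actual buffer sizes are an implementation choice, discussed in Chapter 6), if the input buffer held 1,024 elements, the output buffer 256 elements and the weights buffer 262,144 elements, one could choose T_I = 1,024 and T_O = 256: both neuron tiles fit their buffers, and the 256 × 1,024 = 262,144 weights of a tile exactly fill the weights buffer.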

Starting from line 16, the inner two loop nests are performed by the hardware accelerator

on FPGA. In the second loop nest, the hardware accelerator controller further decomposes the


tile into smaller blocks. Each block refers to the computation between M_O output neurons and M_I input neurons. At every clock cycle, a set of control signals is issued to the accelerator (line 20). For the buffer readers, addresses are provided to load the required data for the current

block. The output writer will receive a signal indicating whether the compute unit outputs

should be stored in the output buffer, as well as the address to store the data. For the compute

unit, the control signals will indicate: 1) whether the pipeline should swap in a new set of

partial sums, 2) whether activation should be performed, and 3) the shifting distance to be

used when converting partial sums between 16-bit and 32-bit formats. Lastly, the third loop

nest corresponds to a block of computation (line 25-29). The compute unit can start one such

block of computations every clock cycle.

We can visualize the three-level loop nest as shown in Figure 4.3. The two strips on the

left and top of the figure represent the 1-dimensional arrays of output and input neurons,

respectively. The rectangle represents the weights connecting every pair of output and input

neurons. The region highlighted in blue refers to the data involved in a tile of computation,

while the orange region represents the data used in a block of computation, which can be

launched by the accelerator every clock cycle. The blue and orange arrows show the loop

traversal directions in the tiling software and accelerator controller, respectively. For example,

loop 1-1 and loop 1-2 traverse along the dimensions of output neurons and input neurons.

Loop 2-1 & 2-2 in the accelerator controller iterate inside the tile to decompose a tile into

smaller blocks.

4.3.2 Translation Layer for Convolutional Layers

Tiling Step in Software

The translation layer implementation for a convolutional layer can also be organized in a similar

three-level loop-nest structure. We begin the discussion with the first loop nest that corresponds

to the tiling step of the translation layer. The dimensions of output and input feature maps,

kernels (a set of 3 dimensional filters) and biases are defined in Listing 4.4, and labelled in

Figure 4.4.

Our implementation performs tiling along the row (Y), column (X) and depth (Z) dimensions

of output feature maps. Loop 1-1 in the pseudo code first iterates through the depth dimension

of output feature maps, selecting T_OFM_Z output feature maps at a time. Since each output


ofm[OFM_Y][OFM_X][OFM_Z];        // Output feature maps.
ifm[IFM_Y][IFM_X][IFM_Z];        // Input feature maps.
kernel[OFM_Z][K_Y][K_X][IFM_Z];  // Filter kernels.
bias[OFM_Z];                     // Bias associated with output feature map.
S <- stride;

// Loop nest 1: tiling step of the translation layer.
for (ozt = 0; ozt < OFM_Z; ozt += T_OFM_Z) {            // Loop 1-1.
  // Move associated filters to weights buffer.
  kernel[ozt : ozt + T_OFM_Z][0 : K_Y][0 : K_X][0 : IFM_Z]
      -> weights_buffer
  for (oyt = 0; oyt < OFM_Y; oyt += T_OFM_Y) {          // Loop 1-2.
    for (oxt = 0; oxt < OFM_X; oxt += T_OFM_X) {        // Loop 1-3.
      // Move input feature map tile to input buffer.
      ifm[oyt * S : oyt * S + (T_OFM_Y - 1) * S + K_Y]
         [oxt * S : oxt * S + (T_OFM_X - 1) * S + K_X]
         [0 : IFM_Z] -> input_buffer
      // Move bias to output buffer as the initial values
      // for output neurons.
      bias[ozt : ozt + T_OFM_Z] -> output_buffer
      issue_instruction();

      /* Loop nest 2 and loop nest 3 are skipped here. */

      // Move computed output neurons from output buffer to
      // main memory.
      output_buffer ->
          ofm[oyt : oyt + T_OFM_Y]
             [oxt : oxt + T_OFM_X]
             [ozt : ozt + T_OFM_Z];
} } }

Listing 4.4: Tiling pseudo code for convolutional layer.


Figure 4.4: Traversal order in the tiling software.

feature map has an associated 3-D filter, the tiling along the depth dimension of output feature

maps implies a selection of corresponding 3-D filters. These selected 3-D filters (highlighted in

blue in Figure 4.4) are moved from off-chip memory to the weights buffer and are reused by

subsequent loops to compute neuron output values for a set of output feature maps.

Loops 1-2 & 1-3 further tile along the rows and columns of the output feature maps, with a 2-D patch size of T_OFM_Y × T_OFM_X. In the figure, the blue region in the output feature

maps represents the selected tile. Recall that each output feature map pixel has a receptive

field in the input feature maps, covering a local region on the Y- and X-dimensions but across

the entire Z-dimension. Hence, the tiling along the rows and columns of output feature maps

also implies a corresponding tiling along the row and column dimensions of the input feature

maps. Note that we do not perform tiling on the depth dimension for input feature maps or

filters – meaning that all the required inputs for computing the output feature map tile will be moved to the on-chip buffers and be used in one tile of computation.1 The blue region in

the input feature maps refers to the pixels that can be covered by the receptive fields of the

selected output feature map tile. This region of input feature maps needs to be moved to the

input buffer. We also move the associated biases to the output buffer and use them as the

initial values for output neurons.

After moving all the required inputs, the tiling software issues an instruction to the accel-

erator controller, specifying the size of the current output feature map tile, as well as the size

of 3-D filters. When the hardware on the FPGA finishes the tile of computation, the computed

outputs are then copied back from on-chip buffer to the off-chip memory.

// Loop nest 2: accelerator controller breaks a tile
// into smaller blocks.
for (oy = oyt; oy < oyt + T_OFM_Y; oy++) {              // Loop 2-1.
  for (ox = oxt; ox < oxt + T_OFM_X; ox++) {            // Loop 2-2.
    for (ozb = ozt; ozb < ozt + T_OFM_Z; ozb += M_O) {  // Loop 2-3.
      // Current block will work on
      // ofm[oy][ox][ozb : ozb + M_O].
      for (ky = 0; ky < K_Y; ky++) {                    // Loop 2-4.
        for (kx = 0; kx < K_X; kx++) {                  // Loop 2-5.
          for (kzb = 0; kzb < IFM_Z; kzb += M_I) {      // Loop 2-6.
            // Current block will work on
            // ifm[oy * S + ky][ox * S + kx][kzb : kzb + M_I].
            issue_control_signals();

            // Loop nest 3: accelerator takes in M_I inputs
            // and M_O * M_I weights and updates M_O partial
            // sums every cycle.
            for (oz = ozb; oz < ozb + M_O; oz++) {
              for (kz = kzb; kz < kzb + M_I; kz++)
                ofm[oy][ox][oz] += kernel[oz][ky][kx][kz] *
                                   ifm[oy * S + ky][ox * S + kx][kz];
              if (all_multiplies_are_accumulated)
                ofm[oy][ox][oz] = activation(ofm[oy][ox][oz]);
            } // End of loop nest 3.
}}}}}} // End of loop nest 2.

Listing 4.5: Accelerator controller pseudo code for convolutional layer.


Figure 4.5: Traversal order in the accelerator controller.

1 This tiling scheme does not work when the size of a 3-D filter (equivalent to the receptive field size of an output feature map pixel) is greater than the capacity of the weights buffer. The largest 3-D filter in the AlexNet model has a size of 256×3×3 = 2,304 weights. If weights are represented in 16-bit fixed-point, one such filter can be accommodated by three M10K RAM blocks on Altera's FPGA. Our implementation described in Chapter 6 makes sure that any 3-D filter in the AlexNet model can fit in the weights buffer.


Accelerator Controller

Once the accelerator controller receives an instruction describing a tile of computational work,

it needs to decompose the tile into smaller blocks that match the accelerator's compute bandwidth (the accelerator receives M_I inputs and M_O × M_I weights and updates M_O partial sums every clock cycle). In our design, the M_O partial sums refer to M_O pixels in the output feature map tile that are at the same <x,y> location across M_O consecutive output feature maps (see the highlighted pixels in the output feature map tile in Figure 4.5). The M_I inputs from the input feature map tile are positioned as the figure shows. The weights connecting the M_I inputs to the M_O outputs are therefore distributed among M_O 3-D filters, where each filter has M_I weights positioned according to the selected neurons in the input feature maps.

The decomposition of a tile starts by iterating along the row, column and depth dimensions

of the output feature map tile (see the first three loops in Listing 4.5, and the orange arrows

around the output feature map tile in Figure 4.5). At each step (of Loop 2-3), M_O output feature map pixels are selected. The <x,y> position of the M_O output feature map pixels defines the receptive field in the input feature maps (refer to the box with the dotted line), while the position on the z-dimension selects the associated filters (the highlighted filters). Then, the next three loops iterate through the row, column and depth dimensions of the receptive field and the associated filters. At each step (of Loop 2-6), the accelerator controller selects M_I inputs from the receptive field and M_O × M_I weights from the filters. A set of control signals is then issued to guide the accelerator to process the selected inputs, weights and outputs. The control signals are the same as those for fully-connected layers. Again, the computation in the third loop

nest is one block of computation that can be launched by the accelerator every clock cycle.

4.4 Summary

In summary, our overall system is implemented as a processor-accelerator SoC. We propose

an accelerator design that supports fully-connected and convolutional layers with focus on the

performance of the MAC operations. The accelerator is augmented with on-chip buffers to take

advantage of data reuse opportunities so that we can minimize the off-chip memory traffic. On

the processor side, application software invokes a backend API to off-load the computation of

fully-connected and convolutional layers to the accelerator. The backend APIs are implemented

in the translation layer where data is divided into tiles and moved to the on-chip buffers. A


hardware controller is designed to coordinate the cycle-by-cycle operation of the accelerator.

Application software is also responsible for initializing a neural network, loading pre-trained

parameters, preparing the input samples and so on. The max-pooling and LRN layers are also

implemented in software.

Chapter 5

Streaming Hardware Generation in

LegUp High-Level Synthesis

We adopt high-level synthesis (HLS) to generate the accelerator hardware from a high-level

description in standard software language, C. We use the LegUp open-source HLS tool [12] to

synthesize all the hardware circuits shown in Figure 4.1, including the accelerator controller,

compute unit, on-chip buffers, buffer readers and writer. As described in Section 4.1, the

accelerator circuit must be pipelined in order to maximize the compute throughput, while

using limited hardware resources. When this project commenced, LegUp HLS was not capable

of synthesizing pipelined functions. One of the contributions of this thesis is to add pipelined

function synthesis to the LegUp HLS tool.

Section 5.1 briefly summarizes an existing LegUp feature – loop pipelining, of which the

scheduling algorithm and datapath generation are re-used for the purpose of pipelined function

synthesis. From Section 5.2 to Section 5.4, we describe work done to support function pipelining

in LegUp, including the pipelined function interface, FIFO support and stall logic implemen-

tation. Lastly, Section 5.5 illustrates how we describe the neural network inference accelerator

circuit in software so that LegUp can generate the corresponding pipelined hardware.

5.1 Loop Pipelining

Although function pipelining was not supported, LegUp has a mature loop pipelining feature

that permits the synthesis of hardware to execute loop iterations in a pipeline manner. The


loop pipelining implementation uses the modulo SDC scheduling algorithm with a backtracking

mechanism [11] to find a valid schedule for the operations in a loop body. The objective of the

scheduling algorithm is to minimize the initiation interval (II), which is the number of cycles

between the launch of two consecutive loop iterations. The ideal value of II is 1, where a new

loop iteration is started every clock cycle.

In a pipelined circuit, a variable will require extra pipeline stage registers if its lifetime is

more than II cycles, where the lifetime of the variable is the cycle-count difference between the

stage where it is assigned a value and the last stage where it is used. The pipeline stage registers

are meant to preserve the computed value for its uses at a later pipeline stage, otherwise the

value will be overwritten by the subsequent iterations. For example, if the II is 1 (new inputs

are injected into the pipeline every clock cycle) and a variable is computed at stage-1 and

used at stage-4, pipeline registers will be required at stage-1, stage-2 and stage-3. The existing

functionality within the LegUp HLS tool automatically inserts such pipeline stage registers

into the datapath to ensure correct functionality. We use the same scheduling algorithm and

datapath generation flow to implement pipelined functions. However, there are still several

missing pieces to fully support a pipelined function in a complete system. In the following

sections, we will describe the additional work done as part of this thesis to support pipelined

function hardware synthesis.

5.2 Pipeline Function Interface

For a pipelined function with an ideal II of 1, a new set of inputs has to be provided to

the function every clock cycle to maximize computational throughput. For example, in the

case of our accelerator design, the compute unit pipeline accepts a new set of data inputs from

the buffer reader modules and control signals from the accelerator controller every clock cycle,

where the buffer reader modules are also pipelined functions that accept new addresses from

the accelerator controller every cycle. This will require a mechanism to stitch together multiple

pipelined functions in a streaming manner, where upstream and downstream functions execute

in parallel with data flowing down the connected pipeline every clock cycle.

The traditional module interface used in LegUp HLS is designed for sequential functions.

The module interface is shown in Figure 5.1 and the handshaking between a caller and a callee


Figure 5.1: Module interface of a sequential function.

Figure 5.2: Handshaking between sequential functions.

is shown in Figure 5.2. Function arguments and return values are presented as inputs and

output on the module interface. To invoke a callee function (module), the caller function

(module) asserts the start signal to start the function execution and also to indicate that the

two arguments on arg 1 and arg 2 have become valid. When the callee function completes

its execution, it asserts the finish signal to indicate completion and the return value on the

output port is valid. The start and finish notion of handshaking between sequential functions

is not sufficient for pipelined functions, where such functions continue to run in parallel, as long

as valid inputs are provided. Therefore, we need to define a new module interface for pipelined

functions synthesized into hardware by LegUp.

Figure 5.3: Ready-valid-data (RVD) interface.

Since a pipelined function can connect to more than one upstream function (e.g. the compute

unit in the accelerator design receives input from the accelerator controller and multiple buffer

readers), the inputs may not necessarily become valid at the same clock cycle. Hence, each

input of the pipelined function should be associated with a valid signal to indicate input validity

individually. Meanwhile, a pipelined function may not always be able to accept new inputs due

to a stall. Therefore, each input to a pipelined function should be bundled with a ready signal,

which is asserted by the downstream hardware only when it is ready to receive a new input. We

propose to use the ready-valid-data (RVD) handshaking interface where each input or output

of a pipelined function is associated with a valid signal and a ready signal with the directions

shown in Figure 5.3.


!"!# !$ !% !& !'

!"#$%&'"()

ready

valid

data

clock

!"#$%&'"(* !"#$%&'"(+

Figure 5.4: Handshaking between source and sink using RVD interface (dash-line represents the datatransfer phase).

Figure 5.4 shows the handshaking between the source and sink. The data transfer happens

at the clock edge when both valid and ready signals are high (Transfer-0 ). The valid sig-

nal remains high as long as the data is valid, and data is transferred once ready is asserted

(Transfer-1 ). From the source’s perspective, the ready signal acts as an acknowledgement that

valid data has been consumed by the sink. Also, the ready signal can be asserted as soon as

the sink is ready to receive the data, regardless of the state of the valid signal. If ready is

asserted, data transfer only happens when the valid goes high (Transfer-2 ).

The RVD interface allows data transfer to happen with 0-cycle latency. That is, if the valid

signal is high, the sink can use the data right away and assert the ready signal to expect the

next valid data to be presented at the immediately subsequent cycle. Similarly, if the ready

signal is high, the source can assert valid so that data will be transferred at the following

clock edge. This is a key advantage of using such an interface in a streaming design. Consider

an example wherein the sink is stalled and de-asserts the ready signal: the source module can

keep the valid signal asserted with valid data placed on the data port, but needs to be stalled

as well to avoid dropping data. When the sink resumes from stall, it can re-assert the ready

signal and use the valid data at the same cycle, allowing both source and sink to resume to

steady-state immediately (i.e., one data item flowing through the interface every cycle). In contrast, if

a 1-cycle latency interface were used, a pipelined function would require an extra cycle to resume

to steady-state after each stall.
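The handshaking rule can be summarized by the small behavioural model below; it is only an illustration of the protocol (the struct and function names are hypothetical), not code generated by LegUp: a transfer occurs on a clock edge at which both valid and ready are high, so data can move with 0-cycle latency.

#include <stdbool.h>

// Hypothetical behavioural model of one RVD channel, used only to illustrate
// the handshaking rule: a transfer occurs on a clock edge where both valid
// and ready are high, which is what allows 0-cycle-latency transfers.
typedef struct {
    bool valid;  // driven by the source
    bool ready;  // driven by the sink
    int  data;   // driven by the source
} rvd_channel;

// Returns true (and delivers the data) if a transfer happens this cycle.
static bool rvd_transfer(const rvd_channel *ch, int *received) {
    if (ch->valid && ch->ready) {
        *received = ch->data;
        return true;
    }
    return false; // source must hold data/valid; sink may keep ready asserted
}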

5.3 FIFO Support

5.3.1 First-Word-Fall-Through FIFO

As mentioned above, if pipelined functions are directly connected using the RVD interface,

any stall of a pipelined function will require all its upstream functions to stall as well. This


will negatively impact the throughput of the entire pipeline. Moreover, the stall logic will

require a high-fanout signal traversing all connected pipelined functions (through the ready

port), such that when the downstream hardware de-asserts the ready port, all the upstream

hardware will stall immediately. Such high-fanout signal can potentially impact the maximum

clock frequency. An approach to mitigate these issues is to insert FIFOs between the pipelined

functions. FIFOs can be leveraged to relax the backpressure by buffering the data and break

the chaining of stall signals between pipelined functions.

!"!#

write_datadata dataread_data

write_envalid validnot_empty

ready not_full read_en ready

$%&'()*+

!,-.'/0-

102-&'()*+

!,-.'/0-

Figure 5.5: RVD-compatible interface of FWFT FIFO.

To be compliant with the RVD interface, we use the first-word-fall-through (FWFT) FIFO,

also known as the show-ahead FIFO. Figure 5.5 shows the interface of a FWFT FIFO and how

it is connected to the upstream and downstream functions using the RVD interface. The main

difference between a FWFT FIFO and a normal FIFO is that the read data presented at the

downstream output port is always valid, as long as the not empty signal is high. Similar to the

ready signal from an upstream pipelined function’s perspective, the read en input of the FWFT

FIFO acts as an acknowledge signal, rather than a request signal as it is in a normal FIFO.

Assertion of the read en input tells the FIFO that the data on the output port is taken and it

can present new data on the output at the next clock cycle. On the upstream side, assertion

of the FIFO’s not full signal means that the FIFO is ready to accept new write data. The

handshaking on both the upstream and downstream interface of the FWFT FIFO is the same

as that of RVD interface seen in Figure 5.4. Consequently, as a result of the handshaking

signalling compatibility, the FWFT FIFO can be placed between any two pipelined functions

that use the RVD interface.

5.3.2 Software Support

To allow HLS users to express and use FIFOs in software design, we provide a software library

implementing the FIFO, as well as related API functions. This software library is designed to

be compilable by standard C compilers so that users can test their C implementation in software

before synthesizing it using LegUp HLS.


typedef struct {

// Bit width of the elements stored in the FIFO.

int width;

// Number of elements that can be stored in the FIFO.

int depth;

// Pointer to a data array holding the elements.

long long *mem;

// Keeps track of where in the array to write to.

unsigned write_index;

// Keeps track of where in the array to read from.

unsigned read_index;

} FIFO;

Listing 5.1: A C-struct definition of FIFO.

// Initialize a FIFO with the specified width and depth.

FIFO *fifo_malloc(int width, int depth);

// Write an element into the FIFO.

void fifo_write(FIFO *fifo, long long data);

// Read an element from the FIFO.

long long fifo_read(FIFO *fifo);

// Free the memory allocated by the FIFO.

void fifo_free(FIFO *fifo);

Listing 5.2: FIFO related APIs.

The FIFO is defined as a C struct as seen in Listing 5.1. The elements of the struct are

used to define the storage, its width/depth, and where to read/write from/to in the storage.

The data array is used as a circular buffer to create the FIFO behaviour. Its type is a long

long, making it capable of handling the largest standard C integer data type, though it can also

be used to hold anything smaller. Listing 5.2 shows the four most commonly used APIs. The

fifo malloc function is to create a new FIFO by initializing the depth and width parameters

and allocating the data array. This function returns a pointer to the newly-created FIFO. The

user software will use this pointer when reading/writing from/to the FIFO. The data argument

of fifo write refers to the write data, while the return value of fifo read corresponds to the

read data. fifo free frees the memory allocated for the data array and is ignored during

LegUp hardware synthesis.
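For illustration, a minimal software-only implementation of the library could look like the sketch below, using the data array of the FIFO struct from Listing 5.1 as a circular buffer. This is an assumption-laden sketch (for example, it omits full/empty checking); the actual LegUp library implementation may differ.

#include <stdlib.h>

// Minimal software-only sketch of the FIFO library (for illustration only;
// the real library may differ). The data array acts as a circular buffer.
// The width field is stored so that LegUp can parameterize the hardware FIFO.
FIFO *fifo_malloc(int width, int depth) {
    FIFO *fifo = (FIFO *)malloc(sizeof(FIFO));
    fifo->width = width;
    fifo->depth = depth;
    fifo->mem = (long long *)calloc(depth, sizeof(long long));
    fifo->write_index = 0;
    fifo->read_index = 0;
    return fifo;
}

void fifo_write(FIFO *fifo, long long data) {
    fifo->mem[fifo->write_index] = data;
    fifo->write_index = (fifo->write_index + 1) % fifo->depth;
}

long long fifo_read(FIFO *fifo) {
    long long data = fifo->mem[fifo->read_index];
    fifo->read_index = (fifo->read_index + 1) % fifo->depth;
    return data;
}

void fifo_free(FIFO *fifo) {
    free(fifo->mem);
    free(fifo);
}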

Note that LegUp does not synthesize an implementation of these API functions – they are

used strictly to permit execution in software. When a C program using the API is compiled

to hardware, LegUp detects its usage and instantiates the FIFO and parameterizes it based on

the width and depth arguments provided to the fifo malloc function call. Function calls to

fifo write and fifo read will be scheduled together with other operations by the scheduling

algorithm. The hardware circuit generated by LegUp will then perform the needed FIFO

read/write operations. On a call to fifo write by an upstream function, the function places

the data on the output data port and asserts the valid signal (connecting to write en). If the

associated ready signal (driven by not full) is low, the function has to stall until the FIFO

is not full. If the ready signal is high, the data is then considered successfully written to the

FIFO and the function continues execution. For fifo read, a downstream function asserts the

ready signal (connecting to read en) and checks the valid input (driven by not empty). If

valid is high, the input data is considered valid and can be used by the function’s datapath.


If valid is low, meaning that the FIFO is empty, the downstream function needs to stall and

wait for the input data to become valid.
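The following hypothetical producer/consumer pair illustrates how the API is used to connect two streaming functions in software. When compiled with a standard C compiler the calls simply execute sequentially for verification; after synthesis, the two functions would become pipelined hardware modules communicating through a FWFT FIFO.

// Hypothetical example of two streaming functions connected by a FIFO.
// In software the calls run sequentially for verification; after synthesis,
// Producer and Consumer become pipelined modules running concurrently.
void Producer(FIFO *out, short value) {
    fifo_write(out, value);           // stalls (in hardware) if the FIFO is full
}

void Consumer(FIFO *in, short *result) {
    *result = (short)fifo_read(in);   // stalls (in hardware) if the FIFO is empty
}

void test_stream(void) {
    FIFO *link = fifo_malloc(16, 4);  // 16-bit wide, 4 entries deep
    short result;
    for (short i = 0; i < 4; i++) {
        Producer(link, i);
        Consumer(link, &result);
    }
    fifo_free(link);
}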

5.4 Stall Logic and Datapath

As alluded to in the previous section, we must support the scenario when resources or data

become unavailable to the pipelined function. There are three primary cases that would require

a pipelined function to stall: 1) when the pipelined function shares a resource such as memory

(RAM block) or functional unit (e.g., a divider) with other functions, if the resource is presently

being used by other functions, the pipelined function has to stall until the resource becomes

available; 2) when the pipelined function is not receiving valid input from its upstream module,

the pipelined function needs to stall until all inputs become valid; and 3) when a pipelined

function’s downstream module is not ready to consume the pipelined function’s output, the

pipelined function also needs to stall to prevent losing data. The stall logic ensures the hardware

pipeline stalls appropriately and produces a functionally correct result. It is implemented as

a part of the circuitry that controls the pipelined datapath. Because the stall logic must stop

the pipeline immediately when a stall is encountered, it is normally implemented as combinational

circuitry with a high-fanout signal reaching many or all stages in

the pipeline. The stall logic design can directly impact the overall system throughput, timing

performance and area. Our implementation aims to reduce the stall circuitry by minimizing

the stall signal fanout to only the stages that absolutely need to stall. That is, if an operation

at a specific pipeline stage encounters a stall and cannot proceed, we only stall this and all the

prior (upstream) pipeline stages, but all the later pipeline stages do not need to stall and may

continue to execute. This is analogous to what is typically done in a pipelined processor, wherein

stages towards the front of the pipeline may stall (for example, due to a data hazard), while

the back-end stages may continue. We use an example to explain the design.

Figure 5.6 shows a pipelined function datapath with its control circuitry including the stall

logic. The pipelined circuit is a straight-line datapath with no control flow. Like other HLS

tools, we remove diverging branches with if-conversion and eliminate back edges by unrolling all the

loops. All sub-functions are also inlined. In the figure, there are two input FIFOs, a non-FIFO

argument input, and two output FIFOs. This pipeline can be stalled if any input FIFO is empty

or any output FIFO is full. The not empty signals from the input FIFOs and the not full


!"#$%

&$'!"#$%&#%

'()*+,#%

-#.*%

/-/01

-#.*%

/-/02

0*%.*%

/-/02

3")45

(,),# ,#

(,),# ,#

(,),# ,#

(,),# ,#

(,),# ,#

(,),# ,#

3")45

0*%.*%

/-/01

3")45

()*## ()*##

,#

,#

,#

,#

(+,-'.

(+,-'.

/'"##

#01$2

3435

/$16"#/

78

79

7:

7;

/'"1+

+6"&#+

<"2=>-?+//*?+ (,)

,# ,#

Figure 5.6: Pipeline circuit datapath and stall logic.

signals from the output FIFOs are the inputs to the stall logic. The S’s denote pipeline stages,

with registers at each stage to hold data. Each pipeline stage is associated with a valid bit,

indicating whether the pipeline stage contains valid data. It is used together with the stall

logic to control the pipeline datapath. A pipeline stage is enabled when both the valid bit and

the output of the stall logic AND gate are high. When the stall logic AND gate is high, it means

that none of the relevant input/output FIFOs is empty/full, and therefore, execution should

continue. The stall logic also affects the update of the valid bits. As shown in the figure, a

valid bit register retains its value when a stall occurs (the output of the stall logic AND gate

connects to the enable of the valid bit register). When there is no stall, the valid bit updates

its value to 1 if the previous stage was enabled; otherwise, the valid bit becomes 0 since there

is no valid data flowing down from the previous stage.

Let us now examine how the stall signals from the FIFOs are connected to the stall logic.

For Input FIFO0, whose data is needed at S0, only S0 will be stalled when this FIFO becomes

empty. Data from Input FIFO1 is needed at S1, so if this FIFO is empty, S1 and S0 stall. S0

also needs to stall in this case since its next stage is stalled (allowing it to continue will overwrite

the valid data in S1). Output FIFO0 is written from S2. Hence, when this FIFO is full, it stalls


S2, S1, and S0. When Output FIFO1 is full, the entire pipeline stalls. Thus, a FIFO can stall

the pipeline stage where its data is needed/written, and all of its previous pipeline stages, but

the FIFO does not stall any later pipeline stages. For instance, when S0 stalls due to Input

FIFO0 only, S1, S2, and S3 may continue. When Output FIFO0 is full, valid data in S3 can

continue and be written to the Output FIFO1 (assuming it is not full).

There are also cases where the stall circuitry is not necessary. For instance, a constant

argument (such as an integer value) is stored in registers when the module starts and remains

unchanged during its execution. We do not create any stall logic for this argument, as it will

not be overwritten during the execution. This helps to reduce circuit area and the fanout of

the stall signals, which can become large if there are many FIFOs and pipeline stages.
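The stall rule described above can be summarized with the behavioural sketch below (hypothetical code, not the generated RTL): fifo_blocked[i] stands for "a FIFO read or written at stage i is empty or full this cycle", a stage stalls if its own FIFO is blocked or any later stage is stalled, and valid bits only advance through stages that are not stalled.

#include <stdbool.h>

#define NUM_STAGES 4

// Behavioural sketch of the per-stage stall rule: a stage stalls if a FIFO it
// uses is blocked this cycle or if any later stage is stalled; otherwise the
// valid bits advance down the pipeline as described in the text.
void update_pipeline(const bool fifo_blocked[NUM_STAGES], bool valid[NUM_STAGES]) {
    bool stall[NUM_STAGES];
    bool downstream_stalled = false;

    // Compute stalls from the last stage back to the first.
    for (int i = NUM_STAGES - 1; i >= 0; i--) {
        stall[i] = fifo_blocked[i] || downstream_stalled;
        downstream_stalled = stall[i];
    }

    // Advance valid bits for stages that are not stalled (last stage first so
    // a value is not propagated twice in one update).
    for (int i = NUM_STAGES - 1; i >= 1; i--) {
        if (!stall[i])
            valid[i] = stall[i - 1] ? false : valid[i - 1];
    }
    if (!stall[0])
        valid[0] = true; // a new input enters stage 0 when it is not stalled
}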


5.5 Accelerator Design using LegUp's Pipelined Function Feature

5.5.1 Software Implementation of Accelerator Controller

1 void AcceleratorController(

2 /* The instruction received from host software. */

3 LayerType instr_layer_type, BackendActivationType instr_activation_type, int16_t instr_shifting_distance,

4 // The number of buffer entries that will be used when processing this tile.

5 unsigned short instr_ib_entry_count, unsigned short instr_ob_entry_count,

6 /* Send addresses to input reader, weights reader, output reader and output writer. */

7 FIFO *F_ib_entry_index, FIFO *F_wb_entry_index, FIFO *F_ob_read_entry_index, FIFO *F_ob_write_entry_index,

8 /* Send control signals to compute unit. */

9 FIFO *F_should_rotate_in_partial_sum, FIFO *F_should_rotate_out_output,

10 FIFO *F_activation_type, FIFO *F_shifting_distance) {

12 if (instr_layer_type == kFCLayer) { // Fully-connected layer.

13 unsigned short ob = 0, ib = 0, wb = 0;

14 for (ob = 0; ob < instr_ob_entry_count; ob++) { // Iterate through each output buffer entry.

15 loop: for (ib = 0; ib < instr_ib_entry_count; ib++, wb++) { // Iterate through each input buffer entry.

16 // Rotate in partial-sums or bias when start computing a new set of outputs.

17 bool should_rotate_in_partial_sum = (ib == 0);

18 // Rotate out the partial sums or the final result after factor in the last set of inputs.

19 bool should_rotate_out_output = (ib == (instr_ib_entry_count - 1));

21 // Send addresses to buffer readers and writer.

22 fifo_write(F_ib_entry_index, ib);

23 fifo_write(F_wb_entry_index, wb);

24 if (should_rotate_in_partial_sum) fifo_write(F_ob_read_entry_index, ob);

25 if (should_rotate_out_output) fifo_write(F_ob_write_entry_index, ob);

27 // Output to compute unit.

28 fifo_write(F_should_rotate_in_partial_sum, should_rotate_in_partial_sum);

29 fifo_write(F_should_rotate_out_output, should_rotate_out_output);

30 fifo_write(F_activation_type, instr_activation_type);

31 fifo_write(F_shifting_distance, instr_shifting_distance);

32 }

33 }

34 } else if (instr_layer_type == kConvLayer) { ... }

36 return;

37 }

Listing 5.3: C-implementation of accelerator controller.


Now we are ready to show some example code for the accelerator design. Listing 5.3 presents

a code snippet of the accelerator controller for the fully-connected layer. Recall that the accel-

erator controller receives an instruction from the tiling software and generates cycle-by-cycle

control signals for other accelerator components to process a tile of computation. The function

accepts each instruction field as an argument (line 2-5). The instruction fields include the layer

type, the activation function type, the shifting distance to be used when converting between

the 16-bit and 32-bit fixed-point formats, and the numbers of output and input neurons in the

current tile. The numbers of output and input neurons are represented as the number of buffer

entries used to store the data. Each input buffer entry contains MI input neurons, each output

buffer entry contains MO output neurons and each weights buffer entry contains MO × MI

weights. We eliminate the instruction fields that are only used for the convolutional layer. The

second part of the arguments are the FIFOs connecting to the buffer readers, buffer writer and

the compute unit (lines 7-10). The generated control signals will be sent through these FIFOs.

The nested loop in the implementation corresponds to the second loop nest described in

Section 4.3.1. The accelerator controller iterates along the output and input neurons to decom-

pose a tile into smaller blocks, each involving one entry of data in the input, output and weights

buffers (line 14-15). The accelerator can start one such block of computation every clock cycle.

The accelerator controller needs to generate a new set of control signals every cycle for all other

accelerator components. Within the loop, the accelerator controller first determines whether a

new set of partial sums should be swapped into the compute unit pipeline (line 16-17). If ib

is zero, it implies the start of computation for a different set of outputs and the corresponding

partial sums or bias should be rotated into the compute unit. The accelerator controller also

needs to determine if the current set of partial sums in the compute unit needs to be rotated out

(line 18-19). By the time the last set of inputs and weights (in the current tile) is factored into

the current set of outputs (ib == instr ib entry count - 1), the accumulated partial sums

should be rotated out from the compute unit and stored in the output buffer. For each compute

block, the accelerator controller needs to send addresses to the input and weights readers so

that they can feed the corresponding data to the compute unit (line 22-23). Addresses will

also be sent to the output reader and writer when rotate-in or rotate-out is necessary (line 24-

25). Lastly, the accelerator controller sends the control signals to the compute unit, specifying

whether rotate-in or rotate-out is needed, the activation function type and shifting distance

(line 27-31).


Function pipelining requires all loops to be unrolled; however, the loops in this function have

variable loop bounds and cannot be unrolled. Hence, we do not pipeline the entire function,

but only pipeline the inner loop (labelled “loop” on line 15) such that a new iteration of the

inner loop can start every clock cycle. In this case, the accelerator controller is synthesized as

a sequential function whose inner loop is pipelined using the loop pipelining feature.

5.5.2 Software Implementation of the Compute Unit

1 #define DType short int

2 #define WDType int

4 void ComputeUnit(

5 // The M_i input neurons from input reader.

6 FIFO* F_input_0, ..., FIFO* F_input_7,

7 // The M_o x M_i weights from weights reader.

8 FIFO* F_weight_0_0, FIFO* F_weight_0_1, ..., FIFO* F_weight_0_7, ..., FIFO* F_weight_7_7,

9 // Control signals from accelerator controller.

10 FIFO* F_should_rotate_in_partial_sum, FIFO* F_should_rotate_out_output,

11 FIFO* F_activation_type, FIFO* F_shifting_distance,

12 // The rotate in partial-sums from output reader.

13 FIFO* F_rotate_in_output_0, ..., FIFO* F_rotate_in_output_7,

14 // The rotate out intermediate values or final output values, to output writer.

15 FIFO* F_rotate_out_output_0, ..., FIFO* F_rotate_out_output_7) {

17 // Read inputs.

18 DType input_0 = fifo_read(F_input_0); ... DType input_7 = fifo_read(F_input_7);

19 // Read weights.

20 DType weight_0_0 = fifo_read(F_weight_0_0); DType weight_0_1 = ... DType weight_7_7 = fifo_read(F_weight_7_7);

21 // Read control signals.

22 bool should_rotate_in_partial_sum = fifo_read(F_should_rotate_in_partial_sum);

23 bool should_rotate_out_output = fifo_read(F_should_rotate_out_output);

24 BackendActivationType activation_type = fifo_read(F_activation_type);

25 int16_t shifting_distance = fifo_read(F_shifting_distance);

27 // Rotate-in partial sums.

28 WDType rotate_in_output_0, ..., rotate_in_output_7;

29 if (should_rotate_in_partial_sum) {

30 // Read rotate-in outputs.

31 rotate_in_output_0 = fifo_read(F_rotate_in_output_0);

32 ...

33 // Convert to 32-bit format.

34 rotate_in_output_0 <<= shifting_distance;

35 ...


36 }

38 // The registers storing accumulating partial sums.

39 static WDType accumulator_0 = 0, accumulator_1 = 0, ..., accumulator_7 = 0;

40 // Select the partial sums to be accumulated, between the register and rotate-in values.

41 WDType select_accumulator_0 = should_rotate_in_partial_sum ? rotate_in_output_0 : accumulator_0;

42 ...

44 // Accumulate with sum of products.

45 accumulator_0 = select_accumulator_0 +

46 SumOfProducts(input_0, input_1, input_2, input_3, input_4, input_5, input_6, input_7,

47 weight_0_0, weight_0_1, weight_0_2, weight_0_3, weight_0_4, weight_0_5, weight_0_6, weight_0_7);

48 ...

50 // Rotate out intermediate accumulator values or final output values.

51 if (should_rotate_out_output) {

52 // Need to shift the accumulator value back to the ‘right’ decimal place before rotating out from the pipeline.

53 DType rotate_out_output_0 = ShiftAndActivate(activation_type, shifting_distance, accumulator_0);

54 ...

55 // Send data to output writer.

56 fifo_write(F_rotate_out_output_0, rotate_out_output_0);

57 ...

58 }

59 }

61 WDType SumOfProducts(

62 DType a_0, DType a_1, DType a_2, DType a_3, DType a_4, DType a_5, DType a_6, DType a_7,

63 DType b_0, DType b_1, DType b_2, DType b_3, DType b_4, DType b_5, DType b_6, DType b_7) {

64 // Pair-wise multiply.

65 WDType product_0 = (WDType)a_0 * b_0; ... WDType product_7 = (WDType)a_7 * b_7;

66 // Sum up the products.

67 return product_0 + product_1 + product_2 + product_3 + product_4 + product_5 + product_6 + product_7;

68 }

70 DType ShiftAndActivate(BackendActivationType activation_type, int16_t shifting_distance, WDType accumulator_value) {

71 // Shifting.

72 WDType shifted_output = accumulator_value >> shifting_distance;

73 // Activation.

74 if (activation_type == kBackendActivationLinear) return shifted_output;

75 else if (activation_type == kBackendActivationReLU) return (shifted_output > 0 ? shifted_output : 0);

76 }

Listing 5.4: C-implementation of compute unit.

Listing 5.4 shows the software implementation of the compute unit, with both MO and MI


set to 8 (see Section 4.1.4). That is, the compute unit will receive 8 input neurons and 64

weights and accumulate 8 output neurons every clock cycle. Recall that we use 16-bit fixed-

point (DType) for all data stored in the buffers and use 32-bit fixed-point (WDType) for the

multiplier outputs and accumulators. This function is synthesized by LegUp with the function

pipelining feature. For the sake of clarity, repetitive code snippets are replaced with ....

The first set of function parameters are the FIFOs connecting to the input and weights

readers (line 5-8). Each FIFO is used for receiving one input or weight. The next four arguments

are FIFOs used to receive control signals from the accelerator controller (line 9-11). Eight FIFOs

are used for receiving the rotate-in partial sums from the output reader (line 13). Another eight

FIFOs are created for sending the rotate-out values to the output writer (line 15).

In the implementation, the compute unit first reads in the input neurons, weights and

control signals using the fifo read API (line 17-25). If rotate-in is necessary, the compute

unit will read the partial sums from the FIFOs and convert them from 16-bit format to 32-bit

format (line 27-36). At line 39, the accumulators are declared as static variables to mimic the

registers that retain the values across pipeline iterations. At line 41, we select the partial-sums

to be accumulated between the rotate-in values and the register values. We then accumulate

each selected partial-sum with the sum of pair-wise products of inputs and weights (line 44-

48). Implementation of the helper function SumOfProducts is listed at lines 61-68. If the

control signal should rotate out output indicates that the accumulated partial sums should

be rotated out and stored into the output buffer, we will need to convert the 32-bit accumulators

into 16-bit format, perform activations, and write the data to the FIFOs connected to the output

writer using the fifo write API (lines 50-59). Lines 70-76 show the implementation of the

conversion and activation.


5.5.3 Software Implementation of Buffer Readers and Writer

1 void InputReader(

2 FIFO* F_ib_entry_index, /* Input address from the AcceleratorController. */

3 FIFO* F_input_0, FIFO* F_input_1, ..., FIFO* F_input_7 /* Input neurons to the compute unit. */) {

4 unsigned entry_index = fifo_read(F_ib_entry_index);

5 fifo_write(F_input_0, input_buffer_0[entry_index]);

6 fifo_write(F_input_1, input_buffer_1[entry_index]);

7 ...

8 fifo_write(F_input_7, input_buffer_7[entry_index]);

9 }

Listing 5.5: C-implementation of input reader.

For completeness, we also show the C-implementation of the buffer readers and writer.

Listing 5.5 is the implementation for the input reader. The implementations for the weights

reader, the output reader and writer are similar to the input reader shown here. For the function

signature, the first parameter is an input FIFO specifying the read address from the accelerator

controller. The remaining 8 parameters are output FIFOs for sending the input neurons to

the compute unit. The implementation is straightforward: the input reader first reads in the

address from the FIFO (line 4), then fetches the input neurons from the buffer and pushes

them into the output FIFOs (line 5-8). It is worth noting that the input buffer is organized as

eight arrays (MI = 8) of the short int data type. This guides the LegUp HLS tool to create

eight independent RAM modules to form the buffer, thereby permitting eight parallel reads to

be performed every clock cycle to maintain the accelerator throughput. Similarly, the output

buffer is organized as MO arrays and the weights buffer is organized as MO × MI arrays.
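For reference, the buffers accessed by the readers and writer are simply global arrays in the C source, along the lines of the sketch below. The input buffer names follow Listing 5.5; the depth of 1024 anticipates the choice made in Section 6.2, and the output/weights array names are assumptions made for illustration.

#define BUFFER_DEPTH 1024  // assumed depth; see Section 6.2

// Sketch of the buffer declarations implied by the reader/writer code above.
// Declaring each bank as a separate array guides LegUp to instantiate an
// independent RAM per array, permitting parallel accesses every clock cycle.
short int input_buffer_0[BUFFER_DEPTH], input_buffer_1[BUFFER_DEPTH],
          /* ... */ input_buffer_7[BUFFER_DEPTH];                        // M_I = 8 arrays
short int output_buffer_0[BUFFER_DEPTH], /* ... */ output_buffer_7[BUFFER_DEPTH];      // M_O = 8 arrays
short int weights_buffer_0_0[BUFFER_DEPTH], /* ... */ weights_buffer_7_7[BUFFER_DEPTH]; // M_O x M_I = 64 arrays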

5.6 Summary

In this chapter, we described the function pipelining feature added to the LegUp HLS framework

as part of this research. We use the RVD handshaking interface for pipelined functions, enabling

them to be connected to one another in a streaming manner. RVD-compatible FWFT FIFOs

are used to connect pipelined functions. The purpose of using FIFOs is to relax the backpres-

sure between pipeline functions, and to break the potentially timing-critical stall signal which

would otherwise go through all the connected functions. We also provide a software FIFO

library for HLS users to “instantiate” the FIFOs in software with parameterizable width and

depth, and to perform FIFO read/write operations using the library functions. The software


library is compilable by any standard software compiler, allowing the software implementation

to be first verified in software before running high-level synthesis. LegUp does not synthesize

the software implementation of the library functions. Instead, it instantiates the FIFOs and

generates the FIFO read/write circuitry based on the function calls to the library functions. We

also presented the pipelined datapath architecture along with the stall logic, which is designed

with the objective of reducing the overhead on circuit throughput, timing and area. Finally,

we highlighted key aspects of the software design of the accelerator controller, the compute

unit and the input reader, which are synthesized by LegUp using loop pipelining and function

pipelining features. FIFOs are used to pass data and control signals between the accelerator

components. By leveraging LegUp HLS, we are able to implement the highly-pipelined hard-

ware accelerator completely in software, allowing easy experimentation with a wide variety of

hardware implementations, for example, having different numbers of pipeline stages, or with

different resource constraints.

Chapter 6

System Implementation

6.1 Introduction

This chapter provides an overview of the implementation details of the complete processor/accelerator hybrid

system for neural network inference. We also describe how the implementation was verified

functionally, present experimental results, and highlight related work.

6.2 SoC Device Overview

We implement the entire system on an Altera DE1-SoC board, which contains a Cyclone V

SoC device. This Cyclone V SoC device, as seen in Figure 6.1, comprises an ARM-based Hard

Processor System (HPS) along with FPGA fabric on the same die. The HPS has two ARM

Cortex-A9 processors, each with independent 32 KB L1 instruction and data caches. Cache

coherency between the separate L1 caches in the two processors is maintained by the snoop

control unit (SCU). There is also one 512 KB shared L2 cache connected to a DDR3 SDRAM

memory controller and the L3 interconnect. The L3 interconnect enables communication with

the memory-mapped peripherals of the HPS, such as the micro-SD card socket, Ethernet,

timers, etc. The L3 interconnect also enables communication to and from circuits implemented

within the FPGA fabric. Circuits in the FPGA can be slaves and/or masters attached to the

L3 interconnect via the FPGA bridge. Also, the FPGA can directly communicate with the

SDRAM memory controller. This DE1-SoC board has a 1 GB DDR3 SDRAM that can be

accessed using the SDRAM memory controller.

The memory layout, as seen by the processor, is divided into three main regions: SDRAM,


!"#$%&'()(*%!

!"

#$%&'()$$&(%

#$*%'+(%,)$-.&/,*%&'*

0

1+%2+%-

3',%&'

1+%2+%-

.&45&'

#$2+%-

.&45&'

3&,/6%*

.&45&'

1+%2+%-7+88&'

0 0

.9:

9

7

#$2+%-7+88&'

0 0

.9:

9

7

3&,/6%*-7+88&'

0 0

.9:

9

7

0%4%+*

;+88&'455'&**&*

,$2+%<;+88&'<&$%'=<()+$%

9((&>&'4%)'

?)$%')>>&'

>4=&'<%=2&

)+%2+%<;+88&'<&$%'=<()+$%?)@2+%&

A$,%

4(%,B4%,)$<%=2&

*6,8%,$/<5,*%4$(&

*6)+>5<')%4%&<,$

*6)+>5<')%4%&<)+%

C:9

: :

0

0

*%4'% 8,$,*6

DEF9

GE0: 0

5@4<4((&**

9.:-?)'%&HI9J-:E?)'&

?EA-K ?EA-L

!L-?4(6&* !L-?4(6&*

9?E 0?A

!M-?4(6&

0C.9:

?)$%')>>&'

0

188I(6,2

CC."

0C.9:

:

0

!"#$%&'()$%&*"+%

,-".%'()$%&*"+%

Figure 6.1: Overall system integration.

FPGA slaves and the HPS peripheral region. The lower 3 GB of addressable space is allocated

to the SDRAM, and 960 MB of space is allocated to address the FPGA slaves. The remaining

address space is available for the HPS peripherals. The FPGA’s view of memory is slightly

different. Addresses between 2 GB (0x8000 0000) and 3 GB (0xBFFF FFFF) are mapped to

the accelerator coherency port (ACP). The ACP allows FPGA peripherals to access data in a

cache-coherent manner, by routing transfers through the SCU and L2 cache.

To ease the software development, we run the Linux operating system (OS) on the HPS.

The 1 GB DDR memory is used as the main memory by the processor. In addition, a 64 GB

micro-SD card is available to the HPS and managed as part of the file system of the OS (i.e., it


functions similarly to a disk). Compared to a bare-metal system, a great advantage of using an

OS is the convenience of dynamic memory allocation and the support for virtual memory. This

eliminates the requirement that we manually manage the limited memory space. The OS also

allows us to use the C++ Standard Template Library to develop a better-engineered software

framework (Section 3.4). Similarly, the presence of the OS makes it convenient for us to leverage

several required libraries. Our software framework uses the Google protocol buffer to define

and parse the input model configuration file (Section 3.4.2). The trained model parameters

are saved in HDF5 file format [32], allowing efficient/flexible read/write access to the disk file

using the provided library. The OpenCV library [10] is used in our software framework for

pre-processing images in standard formats (JPEG or PNG) into a fixed size 3-channel (RGB)

matrix compliant to the input format of the AlexNet model.

On the FPGA side, the hardware resources include 87 variable-precision DSP blocks, 397

M10K RAM blocks, 85K logic elements and 128K registers. Given the available resources, we set

the accelerator parameters, MO and MI, both to 8. That is, at every clock cycle of computation,

the compute unit receives 8 input neurons and 64 weights from the input and weight buffers,

and accumulates 8 output neurons in parallel. When the accumulated partial sums need to be

rotated-in/out, a set of 8 output neuron partial sums will be moved from/to the output buffer.

For the buffers, we choose the depth to be 1024 entries. Recall that the input, output and

weight buffers are organized as MI , MO and MO × MI arrays in software, respectively, and

are synthesized by LegUp into the same number of RAM modules (Section 5.5.3). Each RAM

module will have a depth of 1024 entries. In our design, the data stored in the buffers are in

16-bit heterogeneous fixed-point format. Hence, each RAM module can be implemented with

two M10K RAM blocks, where each RAM block is configured in 1024×8 (depth×width) mode.

In total, the input, output and weights buffers will occupy 2 × (8 + 8 + 8 × 8) = 160 M10K RAM

blocks.

6.3 System Integration

We now describe the overall system integration, as shown in Figure 6.1. On the FPGA side,

the accelerator controller, compute unit, buffers and buffer writer and readers are designed

in C and synthesized by the LegUp HLS tool. Three additional modules, named instruction

registers, status and DMA were developed to enable the integration of the FPGA accelerator


with the HPS. They are implemented in hand-written Verilog. We use the Altera’s Qsys system

integration tool to connect these modules with the HPS, following the Avalon Memory Mapped

(AMM) interface specification. We elaborate further on the custom-designed modules below.

6.3.1 The Instruction Registers Module

As Section 5.5.1 described, the C-implementation of the accelerator controller receives each

instruction field as an individual function argument. To provide the input arguments to the

accelerator controller, the instruction registers module implements each instruction field as

a register and exposes these registers on the output port. These output registers are then

directly connected to the instruction argument ports of the accelerator controller (the arrows

connecting the instruction registers module and the accelerator controller in the upper-left

corner of Figure 6.1). Observe that the instruction registers module has an AMM slave interface,

allowing the HPS to read/write instructions in a memory-mapped style.

6.3.2 The Status Module

The status module is created for the HPS to start the accelerator execution and monitor execu-

tion status. The start and finish signals of the accelerator controller are directly connected

to the status module. The status module asserts the start signal when it receives a memory

mapped write (at a preset address) from the HPS. When the accelerator controller completes

execution and asserts the finish signal, the status module will update an internal register

representing the completion of the accelerator controller. The HPS can poll the value of this

completion register using a memory mapped read of the corresponding address.
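From the software side, the interaction with the status module therefore reduces to a memory-mapped write followed by polling, as in the hypothetical sketch below. Here, status_base is assumed to be a pointer obtained by mapping the module's base address, and the register offsets are made up for illustration.

#include <stdint.h>

// Hypothetical sketch of how host software starts the accelerator and polls
// for completion through the status module. status_base is assumed to point
// at the mapped base address of the module; the offsets (0 for start, 1 for
// completion) are illustrative only.
static void run_accelerator(volatile uint32_t *status_base) {
    status_base[0] = 1;            // memory-mapped write asserts start
    while (status_base[1] == 0)    // poll the completion register
        ;                          // busy-wait until the controller finishes
}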

6.3.3 The Accelerator Controller Interface

It is worthwhile to briefly explain the accelerator controller interface. Recall that the accel-

erator controller is synthesized by LegUp as a sequential function with a pipelined inner loop

(Section 5.5.1). The outputs of the accelerator controller are the control signals written to the

FIFOs connected to the other accelerator components. Hence, the interface of the accelerator

controller consists of the sequential function interface shown in Figure 5.1, and several RVD

interface ports connecting to the FIFOs. As seen in Figure 6.1, the sequential function inter-

face (black arrows) are the instruction arguments from the instruction registers module and the


start and finish signals connected to the status module. The RVD interface connecting to

the control signal FIFOs are shown as blue arrows.

6.3.4 The DMA Module

As mentioned earlier, the computation involved in neural network inference involve a large

number of model parameters and intermediate values (neuron values in the hidden layers). The

limited on-chip storage is not sufficient to contain all the data. So, we use the on-board (off-

chip) SDRAM as the main memory to store these parameters and intermediate values, with

the help of virtual memory in the case of insufficient physical space. To reduce the off-chip

memory traffic, a layer of computational work is partitioned into tiles, wherein the required

data in each tile can fit in the on-FPGA buffers and is reused during the tile’s computation. A

DMA module was created to speedup the data transfer between the off-chip memory (SDRAM)

and the on-chip buffers.

The DMA has two AMM master interfaces and one AMM slave interface. The first master

interface is connected to the off-chip memory on the HPS side and the other master interface

is connected to the on-chip buffers. The slave interface is to receive DMA requests from the

HPS. A DMA request specifies the following information: 1) the data transfer direction, from

off-chip memory to on-chip buffer or vice versa; 2) the size of data to be transferred; and 3)

the starting addresses in memory and buffers. We will explain the interface in the next section

and describe the implementation of the DMA in Section 6.4.

6.3.5 Interconnect

In the hybrid processor/accelerator system, there are three independent sets of interconnects,

each with one master interface (as labelled in Figure 6.1). All of the slave interfaces are

connected to one master only. To begin, consider the master interface on the L3 interconnect.

This master interface is used by the processor to read/write memory mapped slaves on the

FPGA, including the instruction registers, status, DMA (request slave) and buffers. Each slave

interface will be assigned a base address in Qsys. Since the memory space of the FPGA slaves

begins at the third gigabyte of the HPS memory layout, the HPS master needs to offset the

base addresses with 0xC000 0000 to access these slave components. Given that our software is

executed within the operating system, the software process only sees the virtual memory space

and does not have direct access to the physical memory. To get handle this, we use the Linux

Chapter 6. System Implementation 66

mmap system call to create mappings in the virtual address space as follows: For each slave,

a region of the process’s virtual memory will be mapped to the corresponding region in the

/dev/mem device file, which is an image of the processors’ physical memory space. Through

this approach, the software can access the slave components by reading/writing from/to the

mapped virtual addresses.

The DMA module has two master interfaces as mentioned above, one for the access to off-

chip memory and one for the on-chip buffers. All accesses to off-chip memory are routed to

the ACP port so that DMA can access cache-coherent data. This is done by offsetting the L3

interconnect address with 0x8000 0000 (Section 6.2).

Turning attention now to the slave interface of the on-chip buffers. The on-chip buffers

are implemented with M10K RAM blocks, which are dual-port memories. We allocate one of

the two ports to the buffer readers and writer, and the other port for the HPS master and

DMA master. To allow both HPS and DMA masters to access the same memory port, one

straightforward approach is to create one slave interface for the memory port and connect both

masters to the slave interface. However, for such an interconnect setup, the Qsys tool will

generate arbitration logic for the slave to avoid access contention between two masters. The

arbitration logic can cause overhead on circuit area, timing and latency, and is also unnecessary

since our software can always avoid access contention by making sure that only one master is

accessing the on-chip buffers at a time. Therefore, we opted for a different approach, shown

in Figure 6.1. We create two slave interfaces for each buffer and connect them separately to

the two masters. A multiplexer is inserted in front of the memory port to steer the accesses

from the two masters. The select signal of the multiplexers is an output from the DMA, named

dma access. The DMA unit can assert this signal to steer the DMA access to the memory

ports. When the DMA unit receives a request from the HPS, it will first assert the dma access

signal before starting data transfer, and lower the signal after the transfer completes. When

this signal is low, only the HPS master has the access to the memory port. Our software ensures

that the DMA transfer and the direct access from HPS master do not happen concurrently.

The interconnect setup described above allows us to transfer data between off-chip memory

and on-chip buffers in two ways: The first way is to use the connection from the HPS master

to the on-chip buffer slaves. In this case, the data transfer will be controlled by the software

running on the processor. Again, we can use the above mentioned mmap method to create a

mapping in the software process’s virtual address space for the on-chip buffers. The software

Chapter 6. System Implementation 67

can then use pointers to reference mapped virtual addresses and read/write from/to the on-

chip buffers by de-referencing the pointers or using functions such as memcpy or memset. The

other way to access the buffers is to use the DMA unit. In this case, the software running

on the HPS issues transfer requests to the DMA and the DMA transfers the data accordingly.

Comparing the two approaches, DMA transfer can provide more data bandwidth by pipelining

the transfers with customizable data width. However, due to two limitations of the DMA unit

(will be descibred in Section 6.4.2), there are scenarios where data transfer cannot be done

with the DMA. Therefore, the connections from L3 interconnect master to the on-chip buffer

slaves are retained in the system in order to perform data transfer for the scenarios that are

not supported by the DMA.

6.4 Data Movement Optimization

6.4.1 Addressing Scheme for the On-Chip Buffers.

In this section, we describe several design decisions related to data transfer. The first item that

we want to discuss is the addressing scheme on the slave interface of the on-chip buffers. There

is a speed advantage to transfer data between contiguous address spaces in both source and

destination storage for the following reasons: 1) spatial locality can improve cache performance,

2) successive accesses to an open row in the SDRAM can reduce read/write latency, and 3) since

each DMA request/descriptor can only specify the transfer between two contiguous address

spaces, a transfer between non-contiguous address spaces requires multiple requests between

contiguous segments. Hence, the DMA request overhead can also be reduced if data transfers

are organized as large contiguous addresses. As a consequence of these issues, the addressing

scheme on the slave interface of the on-chip buffers should be designed to enable large transfers

between contiguous address spaces.

Before we describe the addressing scheme, we first review the data layout in off-chip memory

and on-chip buffers. Figure 6.2 shows the data layout in off-chip memory for a fully-connected

layer. The regions of data with coloured highlights are involved in a tile of computation, and

are to be transferred to the on-chip buffers (Section 4.3.1). The data highlighted in the same

colour are stored at contiguous addresses in the off-chip memory. For example, the output or

input neurons used in a computation tile are stored at contiguous addresses. For the weights,

which are stored in memory as a 2-dimensional row-major order array, weights[No][Ni], the

Chapter 6. System Implementation 68

!"#$%&'( weights[No][Ni]

)#

*+,-&..+"-/0+'(input[Ni]

1-&,-&.+"-/0+'(output[No]

21

2#

)1

31

3#

Figure 6.2: Data layout in off-chip memoryfor a tile in fully-connected layer.

!"#$#!%

&'()'(#$#*+)'(,'--./0

!%

!"#1#!%

2.%34(5#,'--./0

Figure 6.3: Data layout in on-chip buffers fora tile in fully-connected layer.

consecutive weights associated to the same output neuron are stored contiguously in memory.

That is, each row of weights in the figure are stored in contiguous memory addresses (highlighted

in one colour), but different rows of weights are not stored contiguously (highlighted in different

colours).

Figure 6.3 shows how the data used in a tile of computation are stored in the on-chip buffers.

Squares highlighted in the same colour in both figures represent the same data. For example,

the first row of weights in Figure 6.2 is stored at the bottom-left corner of the weights buffer

in Figure 6.3. Recall that the data width of output, input and weights buffers are Mo, Mi and

Mo ×Mi elements, respectively. At every clock cycle, the accelerator needs to read/write one

entry of data from all three buffers (as highlighted in orange). Each entry in the output or

input buffer contains Mo or Mi consecutive neurons. The data in consecutive entries of the

output/input buffers are stored contiguously in the off-chip memory. As such, the addressing

scheme for the slave interfaces of the output and input buffers is straightforward. The data

width of the slave interface is as wide as the entry width. The consecutive entries in the buffer

can be addressed contiguously through the slave interface.

In the accelerator design, the weights buffer must provide Mo groups of weights every clock

cycle. Each group of weights corresponds to one of the Mo output neurons, and the weights in

the same group are associated with the connections from the output neuron to the Mi input

neurons in the preceding layer. Hence, the weights stored in one weights buffer entry (the orange

box in Figure 6.3) are associated to Mo output neurons, and are not laid out at contiguous

Chapter 6. System Implementation 69

addresses in the memory (the orange square in Figure 6.2). To encourage transfers between

contiguous address spaces in off-chip memory and on-chip buffers, we divide the weights buffer

into Mo buffer banks. Each buffer bank is associated with a slave interface with a data width

equal to Mi elements. The consecutive entries in a buffer bank are assigned with contiguous

addresses. By doing so, we can transfer a row of consecutive weights from off-chip memory to

a buffer bank using contiguous addresses on the slave interface.

!

Figure 6.4: Data layout in off-chip memory for a tile in convolutional layer (ifm[IFM_Y][IFM_X][IFM_Z], filters[OFM_Z][K_Y][K_X][IFM_Z], ofm[OFM_Y][OFM_X][OFM_Z]).

Figure 6.5: Data layout in on-chip buffers for a tile in convolutional layer.

shown in Figure 6.4 and Figure 6.5, respectively. The 3-dimensional input, output feature maps

and filters are stored as row-major-order arrays in memory, where the consecutive elements

(at the same <x,y> location) along the depth dimension are contiguous in memory. The

Chapter 6. System Implementation 70

colour-highlighted regions of data in Figure 6.4 are the ones used in a tile of convolutional

layer computation (Section 4.3.2). Again, data highlighted in the same colour are contiguous

in memory. Since we tile along the depth dimension of the output feature maps, only the

neurons at the same <x,y> location are contiguous in memory. For input feature maps, tiling

is performed on the row and column dimensions, but not on the depth dimension. Hence, the

input neurons in the same horizontal xz-plate of the tile are contiguous. Since we always move

entire filters to the on-chip buffer (not partial filters), the filter weights to be transferred are

contiguous.

On the on-chip buffer side, the input/output neurons that are contiguous in off-chip memory

are also stored at consecutive buffer entries. Therefore, the slave interface setup of input and

output buffers used for fully-connected layers is also suitable for convolutional layers. Similar

to the fully-connected layer, each entry in the weights buffer (the orange box in Figure 6.5)

includes Mo groups of filter weights. To allow the transfer of an entire filter to a contiguous

address space in weights buffer, we can use the same buffer bank approach as mentioned above.

As shown in the figure, an entire filter resides at consecutive entries of a buffer bank. With the

consecutive buffer bank entries being addressable using contiguous addresses, we can transfer

an entire filter between continuous address spaces in both off-chip memory and the on-chip

buffer.

To summarize, we create one slave interface for each of the output and input buffers. The

data widths are as wide as the entry widths of the buffers, which are Mo and Mi elements.

As we choose both Mo and Mi to be 8 and use 16-bit fixed-point format for the data stored

in the buffers, the data widths of the output and input buffers are 16 bytes. For the weights

buffers, a slave interface is created for each of the Mo buffer banks. The buffer bank entries

are Mi elements wide. Hence, the data width of the buffer bank interface is also 16 bytes wide.

For all these slave interfaces, consecutive buffer entries or buffer bank entries are accessed with

contiguous addresses. Since all the slave interfaces of on-chip buffers are 16 bytes wide, we also

set the data width of the L3 interconnect slave interface to 16 bytes so that no data width

conversion is needed (on the interconnect) and the entire interconnect data width is also 16

bytes wide.

Chapter 6. System Implementation 71

6.4.2 Custom DMA Design

We now describe the DMA unit design used in the system. Instead of using the existing DMA

IPs provided by the FPGA vendor (Altera), we opted to design a custom DMA unit. Altera

provides two types of DMAs, a basic DMA unit and a scatter-gather DMA unit [7]. For the

basic DMA unit, the data transferred must be located in contiguous addresses in both the

source and destination storage, whereas the scatter-gather DMA unit can transfer and merge

non-contiguous memory to a continuous address space, and vice versa. The scatter-gather

style is unnecessary for our system, since most data transfer is between contiguous addresses.

Moreover, the basic DMA unit provided by Altera had several limitations, elaborated upon

below.

Unaligned Starting Address

A first limitation is that Altera’s basic DMA unit requires both the read and write starting

addresses to be aligned with the size of the individual transfers. The individual transfer size

refers to the data size that is transferred at each clock cycle. The individual transfer size

is specified for each DMA transfer request, with the options being 1, 2, 4, 8 and 16 bytes.

The address alignment requirement can be a significant drawback on the transfer bandwidth,

depending on the starting addresses. For example, if we aim to transfer 160 bytes of data from

byte address 0x2 of the off-chip memory to byte address 0x0 of the input buffer, the largest

allowable individual transfer size will be 2 bytes, requiring a total of 80 individual transfers.

However, 160 bytes of data can be ideally transferred with just 10 transfers, using an individual

transfer size of 16 bytes, matching the interconnect data width. In our case, data is stored

in the software process’s virtual memory and managed by the OS and the processors’ memory

management unit. Our software does not have direct control of the data location in the physical

memory (details on virtual address to physical address translation will be discussed in the later

section). Meaning that, in our system, the starting address of the data in physical memory is

not always aligned to the interconnect data width of 16 bytes. To allow maximum utilization of

the available data width on the interconnect, our custom DMA unit is designed to workaround

the alignment requirement between starting address and individual transfer size.

Figure 6.6 shows our solution for unaligned address data transfer from off-chip memory

to on-chip buffers, implemented with two assumptions that are valid in our use case, 1) the

Chapter 6. System Implementation 72

!"#$%&'

()')*'+*+,$-.&/*%01123

45445!45"456457893):;8')3'&,<;)993=6>!?

@(ABC*9)')*)'*)993288*,@(ABC*9)')*)'*)993288*,D!E

45F45G45E

! ! !

! ! !

Figure 6.6: Implementation with shift registers and multiplexers to handle unaligned SDRAMstarting address.

starting address in the off-chip memory side is always a multiple of 2 (16-bit fixed-point data

type), and 2) the starting address on the on-chip buffer side is always a multiple of 16 (buffer

entries are 16 bytes wide). We use two registers to hold two consecutive words (each word is

16 bytes wide) read from the memory. The data to be written to the on-chip buffer is selected

from the registers based on the second to the fourth least significant bits of the starting address

of memory. Note that the select signal does not change during a DMA transfer. For the above

example where the memory starting address is 0x2, the DMA starts by reading the first word

at address 0x0 from the memory and the second word at address 0x10. With the first two

words placed in the registers, we can select the initial data transferred to on-chip buffers based

on the starting address, whose bits 3–1 are 0x1 (in orange). Since the DMA unit is pipelined,

a new word is expected to be read from the memory every clock cycle. The two registers are

implemented as shift registers where the word on the left will be shifted to the right and the

new incoming word will be placed on the left. To transfer 160 bytes of data starting at address

0x2, we will need to read 11 words from the off-chip memory. For the last word whose address

starts at 0xA0 in memory, only the first two bytes are written to the on-chip memory. The

solution for transferring data from on-chip buffers to an unaligned address in off-chip memory

is done in a similar way, but requires some additional logic to control the byte enable signals.
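As an illustration of this mechanism, the C++ sketch below models the behaviour of the shift registers and the byte-selection multiplexer. It is a software model only, not the actual RTL; the structure and names are illustrative.

    #include <cstdint>
    #include <cstring>

    // Behavioural model of the unaligned-read datapath: two 16-byte registers
    // hold consecutive words read from SDRAM, and the word written to the
    // on-chip buffer is assembled from both, selected by address bits 3..1
    // (expressed here as a byte offset that is constant for a whole transfer).
    struct UnalignedReader {
        uint8_t right[16]; // older word
        uint8_t left[16];  // newest word read from memory

        // Read the next 16-byte word: the previous "left" word shifts right.
        void shift_in(const uint8_t* new_word) {
            std::memcpy(right, left, 16);
            std::memcpy(left, new_word, 16);
        }

        // Assemble one output word; byte_offset = start_addr & 0xE.
        void select(unsigned byte_offset, uint8_t* out) const {
            for (unsigned j = 0; j < 16; ++j) {
                unsigned idx = byte_offset + j;
                out[j] = (idx < 16) ? right[idx] : left[idx - 16];
            }
        }
    };

For the earlier example (starting address 0x2), byte_offset is 2: the first output word takes bytes 2–15 of the word at address 0x0 and bytes 0–1 of the word at address 0x10, matching the description above.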

DMA Interface

Altera’s basic DMA unit has two master interfaces, one dedicated for read and the other dedi-

cated for write. To allow transfers in both directions between the off-chip memory and the on-

chip buffers, we will need to connect both master interfaces to the memory and on-chip buffers.

When a slave is connected to more than one master, Qsys needs to resolve the potential access


contention by inserting arbitration logic, which causes overhead in area and performance. In

our use case, another overhead of such a connection setup is the need for address span exten-

ders [8]. Since the DMA masters access the off-chip memory via the L3 slave interface whose

address span is already 4 GB, both master interfaces would need an address width of more than

32-bits in order to also connect to the on-chip buffers. However, the basic DMA unit provided

by Altera does not support master-interface addresses wider than 32 bits. Hence, an address

span extender would be required for each of the read and write master interfaces so that they

can access both the off-chip memory and the on-chip buffers.

Our custom DMA unit uses a different master interface design. We dedicate one master

interface for read/write access to the off-chip memory and another for read/write access to the

on-chip buffers. Since every slave interface is only connected to one master, we can avoid the

overhead of using arbitration logic. Also, the master interface that connects to the off-chip

memory does not connect to other slaves. We therefore do not need to use the address span

extender.

FIFO-Less Implementation

Altera’s basic DMA unit requires a shallow FIFO whose width equals the word width, having

a minimum depth of 4. In a DMA unit design, when the access to a slave component has

variable latency, FIFOs are useful to relax the backpressure and improve throughput. However,

given that our interconnect allows the DMA to have exclusive access to the on-chip buffers

(Section 6.3.5), the read/write access from the DMA to the on-chip buffers can always be

performed in a fixed latency of 1 clock cycle. As such, our custom DMA does not need to use

any FIFO and will not cause performance/throughput degradation.

Limitation of the Custom DMA Unit Design

Our DMA unit design has two limitations: 1) the minimum transfer size of a DMA request is

equal to the word size (16 bytes), and 2) the DMA unit does not support writing a sequence of

zeros to a destination address (also not supported by Altera’s basic DMA unit). For scenarios

wherein the data transfer size is less than 16 bytes or stale content in the on-chip buffers needs

to be cleared, we resort to having the processor handle these cases by calling the memcpy and

memset functions.
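The dispatch between the DMA unit and the processor can be sketched as follows. This is a simplified illustration; dma_transfer stands in for issuing a request to our custom DMA unit and is not an actual API of the system.

    #include <cstddef>
    #include <cstring>

    constexpr std::size_t kDmaWordBytes = 16;  // minimum DMA transfer granularity

    // Placeholder for writing a request to the custom DMA unit's request slave;
    // the body here only keeps the sketch self-contained.
    static void dma_transfer(void* dst, const void* src, std::size_t bytes) {
        std::memcpy(dst, src, bytes);
    }

    // Transfers smaller than one DMA word fall back to the processor.
    void copy_to_buffer(void* dst, const void* src, std::size_t bytes) {
        if (bytes < kDmaWordBytes)
            std::memcpy(dst, src, bytes);
        else
            dma_transfer(dst, src, bytes);
    }

    // The DMA unit cannot write a sequence of zeros, so clearing stale
    // buffer contents is always done by the processor.
    void clear_buffer(void* dst, std::size_t bytes) {
        std::memset(dst, 0, bytes);
    }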


6.4.3 Translation From Virtual Addresses to Physical Addresses

When the DMA unit transfers data between the off-chip memory and the on-chip buffers, it

needs to use physical addresses to access the data. We can configure the addresses of the on-

chip buffers using the Qsys integration tool. However, since our software runs on top of the OS,

where the data is allocated in the virtual address space, the physical addresses of relevant data

in off-chip memory are not directly accessible. In Linux, a software process’s virtual memory

space is divided into 4 KB virtual pages. The upper 20 bits of a 32-bit virtual address form the
virtual page number and the lower 12 bits are the byte offset inside the page. To obtain
the corresponding physical address of a virtual address, we need to translate the virtual page
number to the physical frame number and apply the same byte offset within the page/frame. Since

the OS manages the limited physical memory space using virtual memory and paging, when the

physical memory is full, a virtual page may be swapped to the disk (the micro-SD card) by the

OS. Therefore, we require a mechanism to lock the data to be transferred in physical memory,
before getting the corresponding physical addresses. This is needed because OS page swaps
may happen at any time, leading to physical address unpredictability without such a locking

mechanism.

We can use the Linux system calls, mlock and munlock, to lock and unlock, respectively, part

of the software process’s virtual memory space in the physical memory. These two functions

take in two arguments: 1) the starting address in virtual memory, and 2) the number of bytes

of data to be locked/unlocked. Once the mlock function is called, the virtual pages that contain

a part of the specified address range are guaranteed to remain in the memory (assuming the

function returns successfully), until the corresponding munlock function is called.

To translate a virtual address to a physical address, we utilize the page-map feature in

the Linux kernel. The Linux kernel provides a file for each process located at the path

/proc/pid/pagemap (pid refers to the process ID). It allows the process to find out which

physical frame each virtual page is mapped to. The file is in binary format and contains a list

of 64-bit values, one for each virtual page. These 64-bit values are listed in the order of virtual

page numbers. If a virtual page is present in the physical memory, the lower 55 bits of its

corresponding 64-bit value will be the physical frame number; and the leftmost bit (bit 63) will

be 1, indicating that the page is present. For example, to obtain the physical frame number for

the k-th virtual page, we read the 64-bit value stored at the byte address k × 8 and parse the


lower 55 bits.
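A minimal user-space sketch of this locking and translation step is shown below (error handling is omitted; on recent kernels, reading frame numbers from pagemap additionally requires elevated privileges):

    #include <cstddef>
    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Lock a buffer in physical memory, then translate its starting virtual
    // address to a physical address via /proc/self/pagemap.
    uint64_t virt_to_phys(const void* vaddr, std::size_t len) {
        const uint64_t page_size = sysconf(_SC_PAGESIZE);      // 4 KB pages
        mlock(vaddr, len);                                      // pin so the frame cannot change

        uint64_t vpn = reinterpret_cast<uintptr_t>(vaddr) / page_size;
        int fd = open("/proc/self/pagemap", O_RDONLY);
        uint64_t entry = 0;
        pread(fd, &entry, sizeof(entry), vpn * sizeof(entry));  // one 64-bit entry per page
        close(fd);

        if (!(entry >> 63))                                     // bit 63: page present
            return 0;
        uint64_t pfn = entry & ((1ULL << 55) - 1);              // lower 55 bits: frame number
        return pfn * page_size +
               reinterpret_cast<uintptr_t>(vaddr) % page_size;  // keep the in-page offset
    }

The corresponding munlock call would be issued once the DMA transfers that use this address have completed.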

It is desirable to reduce the overhead of page swap and page-map parsing as they are run-

time intensive. When computing a convolutional or fully-connected layer, some data needs to

be transferred from the off-chip memory to the on-chip buffer more than once. Instead of doing

page swap and page-map parsing before each DMA transfer, we choose to perform these steps

once at the beginning of the layer computation. For all the data that will be transferred to

the on-chip memory, we first lock all virtual pages containing the data, then translate virtual

page numbers to physical frame numbers and store the mapping. Before each DMA transfer,

the physical frame numbers can then be directly looked up from the mapping; page swap or

page-map parsing is not needed. At the end of a layer of computation, we unlock all the virtual

pages. Because of paging, the data in a contiguous virtual address range is not necessarily

stored contiguously in the physical memory; however, it is guaranteed that the data within the

same page will be stored contiguously. Thus, when transferring a large chunk of contiguous

data in virtual memory, we need to issue a separate DMA request for each virtual page/physical

frame. The maximum length of each DMA transfer is limited to 4 KB (the page size).
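The per-page splitting and the cached page-number mapping can be sketched as follows. Here frame_map would be filled once per layer by the translation step shown earlier, and issue_dma is a hypothetical stand-in for writing a request to our DMA unit.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    constexpr std::size_t kPageSize = 4096;  // Linux page size

    // Stand-in only: the real implementation writes a request descriptor to the
    // DMA unit's request slave interface.
    static void issue_dma(uint64_t /*phys_src*/, uint64_t /*buf_dst*/,
                          std::size_t /*bytes*/) {}

    // Transfer a contiguous virtual range by issuing one DMA request per page,
    // since consecutive virtual pages are generally not physically contiguous.
    void transfer_range(uintptr_t vaddr, std::size_t len, uint64_t buf_dst,
                        const std::unordered_map<uint64_t, uint64_t>& frame_map) {
        while (len > 0) {
            uint64_t    vpn    = vaddr / kPageSize;
            std::size_t offset = vaddr % kPageSize;
            std::size_t chunk  = std::min(len, kPageSize - offset);  // stop at the page boundary
            uint64_t    phys   = frame_map.at(vpn) * kPageSize + offset;
            issue_dma(phys, buf_dst, chunk);
            vaddr   += chunk;
            buf_dst += chunk;
            len     -= chunk;
        }
    }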

6.5 Clock Frequency Optimization

We now briefly describe several changes made to raise the clock frequency of the accelerator

and DMA unit. The first change is the target clock period for LegUp HLS synthesis. LegUp

accepts a configuration file as input, specifying the synthesis settings and constraints. One of

the constraints is the target clock period. We reduce the target clock period from the default

20 ns to 5 ns. With this change, the number of pipeline stages of the generated compute unit

circuit is increased from 6 to 21. Using Altera’s Quartus II synthesis tool to place and route the

generated circuit in isolation (targeting the DE1-SoC board), we observe the achievable Fmax

is increased from 63 MHz to 153.6 MHz (2.44×).

After this change, we observed the timing-critical path to be within the stall logic circuitry,

originating from several output FIFOs’ full signals to the compute unit’s data path. These

output FIFOs are used to pass partial sums from the compute unit to the output buffer writer.

In fact, the output buffer writer will never stall during execution and will not cause backpressure.

Consequently, these FIFOs will never be full, allowing us to manually “ground” the full output

signals of the FIFOs, eliminating the critical paths.


Another optimization was applied to the interconnect. There are two interconnect networks

wherein the master connects to a large number of slave components. One of the two DMA unit

master interfaces connects to all of the on-chip buffers, including two slave interfaces for input

and output buffers and eight slave interfaces for the weight buffer banks. The L3 interconnect

master interface is connected to the instruction registers, status module, DMA unit request

slave and all on-chip buffers. These connections appeared on the critical path as they required

large multiplexers on the response path (for read data returned from the slaves). The solution

was to insert pipeline bridges1 in the connections to break the big multiplexers down to smaller

ones. Given that the FPGA fabric uses 6-input LUTs, a 4-to-1 multiplexer can fit into a single

LUT (4 input bits and 2 select bits). Figure 6.7 gives an example of how an interconnect

from one master to ten slaves can be formed with 4-to-1 multiplexers by inserting pipeline

bridges. However, we cannot insert pipeline bridges within the interconnect from the DMA

master unit to the on-chip buffers. The reason is that the custom DMA unit is designed with

an assumption that read access to the on-chip buffers has a fixed latency of 1 clock cycle,

thereby allowing the DMA unit to be implemented without the use of FIFOs. Adding a pipeline
bridge to this interconnect would increase the latency and violate that assumption. To work around

this limitation, we will need to update the custom DMA design to support a parameterizable

but fixed read latency. We leave this to future work.

!" !# !$ !% !& !' !( !) !* !+

,-./01

!"#$%"&$'()"*+$', !"#$%"&$'()"*+$'-

Figure 6.7: Example of forming the response path of an interconnect with 4-to-1 multiplexers(s* stands for slave interfaces).

6.6 Additional Optimizations

6.6.1 Using the Direct SDRAM Interface

As will be shown in the experiment results in Section 6.8.1, we observed that improving the

clock frequency for the accelerator and DMA did not reduce data transfer time. To investigate

1 The pipeline bridge is a Qsys IP core that acts as a bridge between its connected master and slave components. Pipeline registers inside the IP can help to break timing-critical interconnect paths.


!"#$%&'()(*%!

!"#$%&'()$$&(%

*+%,+%-.+//&'

0

123

2

.

#$,+%-.+//&'

0

123

2

.

4&567%8-.+//&'

932

3 3

0

:;<2

=;03 0

213->)'%&?@2A-3;>)'&

>;B-C >;B-D

!D->E(7&8 !D->E(7&8

2>; 0>B

!F->E(7&

09123>)$%')GG&'

0

*//@(75,991"09123

0

123

2

.

>G)(H-5$('&E8&I-/')J-KC3=L-%)-DCC3=L

2G%&'$E%5M&N-+8&-%7&-I5'&(%-5$%&'/E(&-%)-

09123

.)%%G&$&(H-)$-IE%E-%'E$8/&'-OE$IP5I%7

Figure 6.8: Interconnect from DMA master to SDRAM.

this issue, we measured the DMA transfer bandwidth in isolation by repeatedly moving 4 KB of

data back and forth between on-chip buffers and HPS memory (at a contiguous physical address

in the same physical frame). Since 4 KB is well below the size of L1 cache (32 KB), we

expected that most of the data would be stored in the L1 cache during the transfer experiment.

Moreover, 4 KB (the page size in Linux) is also the maximum DMA transfer size that would

be allowed when virtual memory is used. Thus, such a setup reflects the ideal bandwidth that

can be achieved during actual execution. With the DMA and accelerator clocked at 50 MHz,

the transfer bandwidth from HPS memory to an on-chip buffer is 229 MB/s, while the transfer

bandwidth from an on-chip buffer to HPS memory is 367 MB/s. However, the maximum transfer

bandwidth that can be achieved on the interconnect from the DMA master to L3 interconnect

slave (highlighted with an orange dashed line in Figure 6.8) is equal to 800 MB/s (16 bytes data

width multiplied by 50 MHz clock). We thus concluded that the transfer bandwidth bottleneck

is on the L3 interconnect side (highlighted with red dashed line).

An alternative approach is to use the direct interface to the SDRAM controller (highlighted

with blue dashed line), which bypasses the L3 interconnect. We measured the transfer band-

width of this direct SDRAM interface between HPS memory and the on-chip buffers. The same

setup described above is used, except the DMA clock is set to 150 MHz so that the measured

bandwidth will not be limited by the interconnect bandwidth, which now is increased from 800

MB/s to 2.4 GB/s. The measured read/write bandwidths from/to the DDR3 SDRAM are 1.16


GB/s and 1.18 GB/s, corresponding to 5× and 3.2× more bandwidth than using the L3 slave

interface. However, with the direct communication between the FPGA fabric and the SDRAM

controller, the DMA unit can no longer access cache-coherent data, since the direct interface

does not go through the accelerator coherency port (ACP). To work around this, a solution

would be to clean the caches before read access by the DMA unit to the DDR3 memory, and

invalidate the caches after write access by the DMA unit to the DDR3 memory. Unfortunately,

such operations cannot be done by a user-space process, requiring a tedious process of updating

the OS kernel and creating system calls for our software framework to clean and invalidate the

caches. We therefore chose a more direct approach, which is to use the Linux mmap system call2

to allocate non-cacheable physical memory for storing the neurons and weights in the software

containers. This is done by creating a custom C++ template container class that allocates
memory using the mmap system call. It is worth noting that the memory allocated through the mmap

system call is not managed by the OS virtual memory. This brings two benefits: 1) We no

longer need to translate virtual addresses to physical addresses nor are page swaps required; 2)

The maximum DMA transfer size is no longer limited to the OS page size (4 KB). However, the

run-time of computation performed on the processor, namely maxpooling and local response

normalization in AlexNet, is significantly increased as the data is no longer cached.
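A heavily simplified sketch of such a container is shown below. The device node /dev/dma_buffer is a placeholder for the kernel-module interface that backs the mapping with dma_alloc_coherent memory; it is not a standard Linux device, and error handling and the physical-address query are omitted.

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Container whose storage is obtained with mmap instead of new/malloc, so
    // that neurons and weights land in physically-contiguous, non-cacheable
    // memory that the DMA unit can reach through the direct SDRAM interface.
    template <typename T>
    class DmaVector {
    public:
        explicit DmaVector(std::size_t n) : size_(n) {
            int fd = open("/dev/dma_buffer", O_RDWR);        // hypothetical device node
            data_ = static_cast<T*>(mmap(nullptr, n * sizeof(T),
                                         PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, 0));
            close(fd);                                       // the mapping remains valid
        }
        ~DmaVector() { munmap(data_, size_ * sizeof(T)); }

        T&       operator[](std::size_t i)       { return data_[i]; }
        const T& operator[](std::size_t i) const { return data_[i]; }
        std::size_t size() const { return size_; }

    private:
        T*          data_;
        std::size_t size_;
    };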

6.6.2 Control Loop Optimization in Accelerator Controller

This optimization aims to improve the efficiency of the control loops in the accelerator controller,

specifically for the control of convolutional layer computation. Recall that the accelerator

controller is first implemented in C and synthesized to Verilog using the LegUp HLS tool. The

C implementation consists of 6 nested loops: The first 3 nested loops iterate along the row,

column and depth dimensions of the output feature map tile to select the output neurons to

be computed. The next 3 nested loops iterate along the row, column and depth dimensions

of the receptive field in the input feature map tile and the 3-D filters (ref. Section 4.3.2). In

the innermost loop, the accelerator controller issues a new set of control signals (e.g. the buffer

addresses to load inputs, load weights, and store outputs) to the buffer readers and writers,

and the compute unit.

2 The mmap function uses the standard Linux kernel call dma_alloc_coherent to request physically-contiguous memory regions. In the default Linux kernel, dma_alloc_coherent does not allocate physically-contiguous memory more than 0.5 MB in size. To allow the allocation of large amounts of physically-contiguous memory, the Linux OS that we are using has enabled the contiguous memory allocator (CMA) feature of the Linux kernel, allowing the dma_alloc_coherent call to allocate up to 512 MB of physically-contiguous memory [6].


A common limitation of HLS tools is that a pipelined loop or a pipelined function

cannot contain any control flow (i.e. branch or loop). Therefore, only the innermost loop can

be pipelined in the HLS-generated hardware. In our case, the generated hardware will include

a finite-state machine (FSM) that realizes the behaviour of the 6 nested loops, and a pipelined

datapath that corresponds to the loop body of the pipelined inner-most loop. The hardware

functions as follows:

1. When the FSM reaches the state corresponding to the innermost loop, the pipelined

datapath starts execution by launching a new innermost loop iteration every clock cycle.

2. After the last iteration is launched, the FSM will wait for the pipeline to flush (i.e. the

final inner-most loop iteration to finish).

3. Then, the FSM proceeds and transitions among the states corresponding to the outer loops,

and eventually comes back to the innermost loop state again.

The above steps repeat until all outer-loop iterations complete. In such an implementation,

there are many clock cycles spent waiting for the pipeline to flush, as well as transitions among

the outer loop states, considering that these steps repeat as many times as the product of loop

counts of the five outer loops. During these cycles, the accelerator controller is not generating

new control signals for the compute unit, resulting in low utilization of the compute unit. To

eliminate these “wasted” cycles, we collapsed the 6 nested loops into a single loop, where the

loop count of the collapsed loop equals the product of the loop counts of the 6 nested loops.

With such an implementation, the generated hardware can continuously launch a new iteration

every clock cycle, without waiting for the pipeline to flush or the FSM to transition among the outer loops

that were originally present. Note that this change is done entirely in C.
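The transformation can be illustrated with the simplified sketch below, reduced to two loop dimensions for brevity (the real controller collapses all six dimensions; the names are illustrative). The collapsed form recovers the loop indices with increment-and-wrap logic rather than division, which is cheap in hardware and keeps the loop body free of extra control flow.

    // Placeholder for driving the buffer readers/writers and the compute unit.
    static void issue_control_signals(int /*r*/, int /*c*/) {}

    // Before: only the innermost loop is pipelined; the FSM waits for the
    // pipeline to flush and steps through outer-loop states between bursts.
    void control_nested(int ROWS, int COLS) {
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                issue_control_signals(r, c);
    }

    // After: one collapsed loop whose trip count is the product of the original
    // loop counts, so a new iteration can launch every clock cycle.
    void control_collapsed(int ROWS, int COLS) {
        int r = 0, c = 0;
        for (int i = 0; i < ROWS * COLS; ++i) {
            issue_control_signals(r, c);
            if (++c == COLS) { c = 0; ++r; }
        }
    }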

6.6.3 Reuse of Data Between Input Feature Map Tiles

This optimization is based on an observation that there exists a data re-use opportunity between

consecutive input feature map tiles. Recall that the tiling software traverses and tiles along

the row and column dimensions of the input feature map (as shown by the orange arrow in

Figure 6.9). As seen in the figure, there is a large region of data overlapped by the first input

feature map tile (highlighted in red) and the next input feature map tile (highlighted in orange

box). Given the overlap, after the first tile of computation finishes, the tiling software may not

need to transfer all of the data in the next tile to the on-chip input buffer. However, due to

the organization of data in the on-chip buffer (in row-major order), the retiring data from the


!"#$%

!"#$&

Figure 6.9: Previoustraversal order in tiling.

!"#$%

!"#$&

Figure 6.10: A data re-use-friendly traversal order.

!"#$%&$''()*

+,-(./

+,-(.0

+,-(.1

Figure 6.11: Using input buffer ina circular organization.

first tile (the left vertical slice in the first tile) would be interleaved among the on-chip buffer,

making it difficult to store the new data of the next tile (the right vertical slice in the orange

box) in the on-chip buffer. This problem disappears if we interchange the traversal order on

the x-, y- dimensions, such that the consecutive tiles are vertically aligned. As Figure 6.10

shows, the second tile (highlighed in green box) is one row below the first tile. The new data

required by the second tile (highlighted in light green) can be stored at the next empty entries

in the on-chip buffer, following the data in the first tile (Figure 6.11). All of the input feature

map data involved in the second tile of computation is still stored in contiguous buffer entries.

For the third tile, assuming the input buffer no longer has any empty entries, the new data

(highlighted in light yellow) can be stored at the beginning of input buffer, in a circular manner,

replacing the retired data. To cope with the circular input buffer, the accelerator controller

should be modified so that it will issue correct buffer-entry addresses to the input buffer reader.

The required changes include: 1) the instruction sent by the tiling software is updated with

a new field specifying the starting entry index in the input buffer for the current tile; 2) the

accelerator controller operates in the same way as if the starting index is zero, except that the

issuing address to the input buffer reader is first offset by the specified starting entry index

modulo the buffer depth.
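The address adjustment itself is a one-line computation; a sketch (with illustrative names) is:

    #include <cstdint>

    constexpr uint32_t kBufferDepth = 1024;  // input buffer entries in this version

    // Rotate the tile-local entry address by the starting entry index carried
    // in the instruction, wrapping around the circular input buffer.
    inline uint32_t input_buffer_addr(uint32_t logical_entry, uint32_t start_entry) {
        return (logical_entry + start_entry) % kBufferDepth;
    }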

6.7 Verification

We now highlight aspects of the verification flow for our acceleration solution, specifically veri-

fication of the convolutional and fully-connected layers. The verification flow is as follows:


1. Initialize a test.

(a) Randomly generate the dimensions for a convolutional or fully-connected layer. For

a fully-connected layer, the dimensions are the number of output and input neurons.

For a convolutional layer, the dimensions include the widths, heights and depths of

output feature maps and 3-dimensional filters, as well as the padding and stride to be

applied when performing convolution on the input feature maps. These parameters

will define the dimensions of the input feature maps.

(b) Randomly configure the heterogeneous fixed-point representation for each of the input,

output and weights matrices. The bit widths of all fixed-point representations will

be 16 bits wide so that they match with our hardware accelerator setup.

(c) Fill in random values for the input and weights matrices. The random values are

within the representable value ranges of the heterogeneous fixed-point format.

2. Generate the golden output using a reference software implementation that has been pre-

viously verified. The reference implementation refers to the software that we developed

for the fixed-point experiments described in Chapter 3. It has been verified in a separate

verification flow.

3. Obtain the test output computed by our acceleration hardware implementation.

4. Compare the test output against the golden output to confirm correct functionality.

Table 6.1: Design components being tested at each development/verification flow.

Design Components            Development/Verification Phase
under Testing                Software              RTL               Hardware + Software
                             Verification          Simulation        on SoC FPGA Board
Tiling Software              ✔                                       ✔
Instruction Generation       ✔                     ✔                 ✔
Data Layout in Buffers       ✔                     ✔                 ✔
Data Transfer                                                        ✔
Acceleration Design          HLS-synthesizable     HLS-generated     Hardware Circuit
                             C-implementation      Verilog           on FPGA

We adopt Google’s C++ test framework – Google Test [2] to implement the verification flow. The

implementation of the accelerator design can be divided into three phases. The same verification

flow can be used to test design components at each implementation phase, as shown in Table 6.1.

In the first phase, we design the accelerator as a LegUp-synthesizable C-implementation. Since
the C-implementation is compilable by a standard software compiler, we can compile and execute


this C-implementation along with the translation software that implements the two backend

API functions for convolutional and fully-connected layers (Section 4.2). This allows us to

perform the first phase verification completely in software. Given a randomly-generated test

input vector, the translation software will perform tiling, generate instructions, transfer data

to on-chip buffers, and invoke the C-implementation of the accelerator to compute each tile. Since

the custom DMA unit and interconnect are not implemented in software, we cannot test data

transfer in this verification flow. To mimic data transfer, we use memcpy to copy data from/to

the arrays that represent the on-chip buffers in C.
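A phase-one test case might look like the Google Test sketch below. Here reference_fc is a plain golden model, while accelerator_fc is a placeholder for invoking the translation software and the LegUp-synthesizable C-implementation of the accelerator; both names and signatures are illustrative, and the placeholder simply delegates so that the sketch is self-contained.

    #include <gtest/gtest.h>
    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Golden model: plain fully-connected layer with 16-bit inputs/weights.
    static std::vector<int32_t> reference_fc(const std::vector<int16_t>& in,
                                             const std::vector<int16_t>& w, int n_out) {
        std::vector<int32_t> out(n_out, 0);
        for (int o = 0; o < n_out; ++o)
            for (std::size_t i = 0; i < in.size(); ++i)
                out[o] += static_cast<int32_t>(in[i]) * w[o * in.size() + i];
        return out;
    }

    // Placeholder for the device under test (translation software plus the
    // accelerator C-implementation in the real flow).
    static std::vector<int32_t> accelerator_fc(const std::vector<int16_t>& in,
                                               const std::vector<int16_t>& w, int n_out) {
        return reference_fc(in, w, n_out);
    }

    TEST(FullyConnectedLayer, MatchesGoldenOutput) {
        const int n_in = 64, n_out = 32;
        std::vector<int16_t> in(n_in), w(n_in * n_out);
        for (auto& v : in) v = static_cast<int16_t>(std::rand() % 512 - 256);
        for (auto& v : w)  v = static_cast<int16_t>(std::rand() % 512 - 256);
        EXPECT_EQ(reference_fc(in, w, n_out), accelerator_fc(in, w, n_out));
    }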

The second verification phase refers to the RTL simulation. Since there is no simulation

model for the ARM processor, we choose to only simulate the LegUp-generated Verilog of our

accelerator at this step. Recall that the accelerator receives the instruction from the translation

software and performs a tile of computation accordingly. Therefore, the simulation can be done

at the granularity of a tile of computation. That is, each simulation test run corresponds to a

tile of computation, where the input test vector includes the instruction and the content of input

and weights buffers, while the test output will be the content of output buffers. The test input

and golden output can be obtained from running the above software verification flow. We insert

additional code into the translation software to save the instruction and buffer contents into

files for each tile of computation. For simulation, we prepare a simple testbench in Verilog.

The testbench does the following: 1) read the input test vector and golden output from the

files; 2) initialize the contents of RAM modules that implement the input and weights buffers;

3) present the instruction to the accelerator controller’s input ports; 4) start the accelerator’s

execution and wait for its completion; and 5) compare the content of output buffer against the

golden output. In this verification flow, we can verify the data layout in the on-chip buffers,

the generated instruction and the LegUp-generated Verilog of the accelerator.

The last verification flow is the complete system verification on the SoC FPGA board.

This test covers all of the final hardware and software implementation for the convolutional

and fully-connected layers. All the hardware components shown in Figure 6.1 are tested, in-

cluding the HPS, accelerator, custom DMA, instruction registers, status module and system

interconnect. On the software side, the tested components are the translation software and the

data-transfer implementation, including virtual page locking, virtual-to-physical address trans-

lation and DMA request generation. Note that the same verification software is used in this

test and executed on the HPS with the translation software invoking the hardware accelerator


on FPGA.

6.8 Experimental Results

In this section, we will present and discuss experimental results for the accelerator design. Sub-

section 6.8.1 reports the run-time (performance) results for the accelerator implementation with

optimizations successively applied. Area and power results are reported in Subsections 6.8.2

and 6.8.3. Our hardware system is synthesized using Altera's synthesis tool, Quartus II

15.0. The clock frequency and area results are obtained from Quartus II synthesis reports. The

run-time results are obtained by adding wall-clock time measurement code into the software

framework. For power measurements, we use a DC power generator to supply the required

DC voltage to the FPGA board and measure the power consumption based on the generator’s

current reading.

6.8.1 Performance Results

We present the run-time breakdown per image inference of AlexNet in Table 6.2. The run-

time of AlexNet inference can be broken down into two parts: 1) the computations on the

accelerator and 2) computations on the ARM processor. The ARM processor computes three

layer types: maxpooling, local response normalization and depth-concatenation3. Given that

the HPS contains a dual-core ARM processor, we divide the computation of both maxpooling

and local response normalization layers into two parts, each part is computed by one software

thread. The total run-time spent on these three layers is 90.3 ms. For the computation on

accelerator, we created several versions of the hardware, each having different optimizations

applied, as shown in the table and elaborated upon below.

Initial Version

We begin by looking at the initial version, which is our first fully functional version that supports

AlexNet. This implementation does not include the data movement optimization (Section 6.4)

or the clock frequency optimization (Section 6.5). Without the use of DMA, all data transfer is

3 Recall that the AlexNet model is split into two partitions, where in the original paper, each partition was trained on one GPU. Communication between the two GPUs only happens at certain layers. To support this model in our framework, when a convolutional layer requires input feature maps from both partitions, we need to copy two input feature map matrices into a single input feature map matrix. The depth-concatenation layer is created for this purpose and only does memory copy.


Table 6.2: Run-time breakdown per image inference of AlexNet (unit in milliseconds).

                                 Cyclone V SoC                      Arria V SoC
                            Initial  Data      Fmax     Direct     Increase     Control  Data Re-use
                            Version  Transfer  Opt.     SDRAM      Accelerator  Loop     Between
                                     Opt.               Interface  Size         Opt.     Conv. Tiles
Clock Frequency on FPGA     50 MHz   50 MHz    120 MHz  150 MHz    120 MHz      120 MHz  120 MHz

Convolutional (on FPGA)
  Accelerator               315.6    315.6     161      111.2      45.0         23.5     23.3
  Data Transfer             769      248.8     258.3    68.7       42.6         42.6     9.6
  Tiling (on ARM)           7.1      7.1       7.1      4.2        2.3          2.3      2.1
  Total                     1091.7   571.5     426.4    184.1      89.9         68.4     35.0

Fully-Connected (on FPGA)
  Accelerator               25.4     25.4      13.3     9.2        2.8          2.8      2.8
  Data Transfer             7174     534.9     561      66.0       41.3         41.3     40.7
  Tiling (on ARM)           8.1      8.1       8.1      3.5        2.2          2.2      2.0
  Total                     7207.5   568.4     582.4    78.7       46.3         46.3     45.5

Compute on ARM              (Cyclone V SoC)             (Arria V SoC)
  Maxpooling                43                          80.3
  LRN                       45.5                        51.8
  Depth-Concat.             1.8                         7.1
  Total                     90.3                        139.2

Total                       8389.5   1230.2    1099.1   402        275.4        253.9    219.7

done by the HPS processor calling the memcpy function. The clock frequency of the accelerator

was reported to be ∼60 MHz by static timing analysis, but we set the clock frequency at 50 MHz

for the measurements. In this version, the run-time for convolutional and fully-connected layers

are 1091.7 ms and 7207.5 ms, respectively. For both of these layers, the majority of the run-
time is spent on data transfers (70.4% for convolutional layers and 99.5% for fully-connected

layers). The data transfer time spent on the fully-connected layers is ∼10× more than that for

the convolutional layers, due to the fact that ∼90% of the AlexNet model parameters are the

weights in fully-connected layers.

Data Transfer Optimization

Table 6.3: Time spent on data transfer between memory and on-chip buffers (unit in milliseconds).

                          Initial   Address          Transfer w/ Custom DMA
                          Version   Scheme Setup     DMA Transfer   Page Swap &           Total
                                                                    Address Translation
Convolutional Layers      769       640.1 (1.2×)     237.5          9.3                   246.8 (3.1×)
Fully-connected Layers    7174      3774 (1.9×)      433.9          101                   534.9 (13.4×)

In the second version, we focus on improving data transfer performance. The first opti-

mization is the address scheme setup for the weights buffer described in Section 6.4.1. We

split the weights buffer into eight buffer banks, each with its own contiguous address space in


the memory-mapped interconnect. This change allows a larger chunk of consecutive weights

to be stored contiguously in the on-chip buffers, and hence improves data transfer efficiency.

As shown in Table 6.3, this change improves the data transfer performance by 1.2× and 1.9×

for convolutional and fully-connected layers, respectively. The second optimization is to use

the custom DMA unit. To do so, we need to explicitly swap in the virtual pages containing

the data to be transferred and parse the pagemap file to obtain the physical addresses. The

time spent on DMA transfer, as well as page swap and address translation are reported in the

table. The data transfer speedups over the initial version are 3.1× and 13.4× for convolutional
and fully-connected layers, respectively. The speedup difference between the two
layer types arises mainly because the data to be transferred for convolutional layers is more fragmented
than that for the fully-connected layers, and hence benefits less from DMA transfer. With

both optimizations to the data transfer, we are able to reduce the total run-time per image

inference to 1230.2 ms, corresponding to a 6.8× speedup.

Clock Period Optimization

The third implementation version improves the clock frequency for the accelerator and DMA

unit, which share the same clock domain. This change aims to speedup both data transfer

and accelerator computation. After applying the clock frequency optimizations described in

Section 6.5, the achievable Fmax is reported at ∼124 MHz by the Quartus II synthesis tool.

We observe that the achievable clock frequency is largely limited by congestion and high resource

Figure 6.12: Screenshot of the chip planner on the Cyclone V SoC device.

Figure 6.13: Screenshot of the chip planner on the Arria V SoC device.

usage of the FPGA fabric, as seen in Figure 6.12. According to the Quartus II synthesis report,


the design reaches 76% logic utilization and uses 70% of the RAM blocks and 74% of the DSP blocks. We

set the clock frequency to 120 MHz when measuring the run-time. As presented in Table 6.2,

the time spent on accelerator computation is proportionally reduced by ∼2.3×. However, the

time spent on data transfer does not decrease as expected and even increases slightly. This

is due to the fact that the data transfer bottleneck is not on the interconnect between DMA

unit and L3 interconnect slave, but on the path from L3 interconnect to ACP and caches (as

described in Section 6.6.1).

Port to Arria V SoC and Use of the Direct SDRAM Interface

Our fourth implementation aims to overcome the data transfer bottleneck limited by the L3

interconnect, as well as the congested resource usage on the FPGA. We port the design onto a

larger device, an Arria V SoC FPGA (on the Arria V SoC development board), which contains

the same HPS component and ∼5.5× more logic elements, ∼5.7× more RAM blocks and ∼12.5×

more DSPs than available on the Cyclone V SoC FPGA on the DE1-SoC board. We also

change the interconnect from the DMA unit to the HPS SDRAM to use the direct SDRAM

interface (the FPGA fabric connects directly to SDRAM without passing through the processor

subsystem). As described in Section 6.6.1, the DMA unit can no longer access cache-coherent

data via the direct SDRAM interface, and hence, the mmap system call is used to allocate non-

cacheable memory for storing the neurons and weights. With the system implemented on the

Arria V SoC device, we can achieve an Fmax of ∼152 MHz for the clock used by the accelerator

and DMA unit. The Fmax is improved because the congestion issue is alleviated by having

more resources available on this Arria V device, as seen in Figure 6.13. The run-time results

are reported in Table 6.2, with clock frequency set at 150 MHz for the accelerator and DMA.

We can see that the time spent on accelerator computation is again proportionally reduced

to 111.2 ms and 9.2 ms for convolutional and fully-connected layers, respectively. For data

transfer, compared to the second implementation (Column “Data Transfer Opt.”), the run-

time is reduced by 3.8× and 8.5× for convolutional and fully-connected layers, respectively. The

reduction of the data transfer time is not only due to the improved data transfer bandwidth from

using the direct SDRAM interface, but also because the time spent on page swap and virtual-to-

physical address translation is eliminated with the use of mmap system call for memory allocation

(Section 6.6.1). However, since the neurons and weights are no longer cached, the run-time of

the layers computed on the ARM processor is significantly increased, from a total of 90.3 ms


to 139.2 ms. In this implementation, the total run-time per image inference is reduced from
the previous 1099.1 ms to 402.0 ms.

Increased Accelerator Size

Given that more hardware resources are available on the Arria V SoC device, we opted to

increase the accelerator size by incorporating more multipliers and adders in the compute unit

and enlarging the on-chip buffers. We increase the compute unit parameters Mo and Mi from 8

to 16 such that the compute throughput per clock cycle is improved from 64 MAC operations to

256 MAC operations. That is, the compute unit now reads in 16 input neurons, 256 weights and

computes 16 output neurons every clock cycle. The on-chip buffer widths are also increased

accordingly, to match the data bandwidth requirement of the compute unit. The widths of input and

output buffers are increased from 8 elements to 16 elements. The number of weight buffer

banks is increased from 8 to 16, while the width of each buffer bank is also increased from

8 elements to 16 elements. We kept the same buffer depth of 1024 entries for all buffers. In

summary, the storage capacity of input, output and weights buffers are respectively increased

by 2×, 2× and 4×.

With the larger accelerator design, the reported Fmax of the clock for the accelerator and

DMA unit is reduced to ∼119 MHz. We over-clock slightly and set the clock frequency at

120 MHz. As such, the overall accelerator compute throughput is expected to be improved

by 3.2× (4 × 120/150). According to our run-time measurements (Table 6.2), the accelerator

compute time spent on fully-connected layers is reduced by 3.2×, proportional to the increase

of compute throughput. However, the accelerator compute time spent on convolutional layers is

only reduced by 2.5×. This is because the accelerator controller “wastes” a significant number

of clock cycles waiting for the control pipeline to flush and on the branches between the non-

pipelined nested loops, resulting in lower utilization of the compute unit (Section 6.6.2).

The data transfer time is also reduced by ∼1.6× for both convolutional and fully-connected

layers. The larger buffer size allows an increase in tile size and hence requires fewer tiles,

resulting in fewer input and weight transfers to on-chip buffers. In other words, the increase

in tile size permits more data reuse and therefore reduces the data transfer traffic. Fewer tiles

also implies less time spent in the tiling software.

Overall, this change reduces the total run-time per image inference from 402.0 ms to 275.4 ms.


Loop Optimization in the Accelerator Controller

This optimization overcomes the low utilization of the compute unit caused by the inefficient

loop implementation in the accelerator controller, specifically for convolutional layers. A loop

collapsing technique is applied to the loop nest in the accelerator controller, as described in

Section 6.6.2. This optimization allows the pipelined collapsed loop to continuously generate a

new set of control signals for the compute unit and buffer readers and writers every clock cycle,

during an entire tile of computation. With this change, the accelerator compute time spent on

convolutional layers is reduced from 45 ms to 23.5 ms, corresponding to a 1.9× speedup.

Data Re-use Between Input Feature Map Tiles

The last implementation incorporates the optimization described in Section 6.6.3, which allows

a fraction of an input feature map tile to be re-used by the subsequent tile. In addition, we

further increase the depths of all buffers from 1024 entries to 2048 entries, doubling the storage

capacity of all buffers. By doing so, the weights buffer can simultaneously store all the 3-D

filters of any of the five convolutional layers in AlexNet. This means that the tiling along

the depth dimension of the output feature maps is no longer needed. Consequently, an input

feature map tile can contribute to all of the output feature maps in one tile of computation,

eliminating the need for transferring the same input feature map tile a second time. Combining

both changes, the data transfer time spent on convolutional layers is reduced from 42.6 ms to

9.6 ms (4.4×), as shown in Table 6.2. Overall, the run-time per image inference is reduced to
219.7 ms.

6.8.2 Hardware Resource Usage

Results on the Cyclone V SoC FPGA Device

Table 6.4 shows the FPGA resource usage of the system on the Cyclone V SoC FPGA. The

results are obtained from the last version implemented on the DE1-SoC board, with optimiza-

tions on data movement and clock frequency. Several design entities have significant differences

in resource usage between the final and the initial versions. Their resource usage in the ini-

tial version are listed in the parentheses. In the accelerator, the compute unit and the FIFOs

connecting the pipelined modules consume the most logic elements and registers. In the last

version, the target clock period in LegUp HLS is set to 5 ns; the generated compute unit


Table 6.4: Resource usage of the final implementation on the Cyclone V SoC FPGA device (numbers in parentheses correspond to the first version on the Cyclone V SoC FPGA device).

Design Entity                        Logic Elements    Registers         Mem. Bits   M10Ks   DSPs
Accelerator
  Compute Unit                       3,500 (3,200)     9,600 (2,700)     -           -       56
  On-Chip Buffers                    -                 -                 1,310 K     160     -
  Buffer Readers & Writers           140               330               -           -       -
  Streaming FIFOs                    2,100             2,800             60 K        102     -
  Accelerator Controller             1,200             1,200             -           -       8
  Total                              7,000 (6,700)     14,000 (7,000)    1,370 K     262     64
Interconnect                         7,300 (12,100)    23,200 (9,200)    -           -       -
Custom DMA                           1,500 (-)         1,400 (-)         -           -       -
Status Module                        9                 4                 -           -       -
Instruction Registers                147               345               -           -       -
Total                                16,000 (16,900)   39,000 (18,000)   1,370 K     262     64

hardware has 21 pipeline stages and requires 3,500 logic elements and 9,600 registers, which are

1.1× and 3.6× more than that in the initial version (having 6 pipeline stages generated with

a 20 ns target clock period setting). The compute unit contains 64 16-bit multipliers, which

can be theoretically implemented with only 32 DSPs. However, to ease the circuit routing

and achieve better timing, the FPGA synthesis tool does not always use the most compact

implementation. In this case, the synthesis tool chooses to use 56 DSP units.

The high resource usage of streaming FIFOs is due to two reasons. First, LegUp currently

requires all the data that flows between pipelined functions to be passed through FIFOs. As

a result of this limitation, our design needs to use a total of 104 FIFOs. Second, we did not

fine-tune the FIFO depths in this version of the design and set the depth to 20 words for all

FIFOs. These FIFOs also use a significant amount of the 102 M10K RAM blocks.

The on-chip buffers use 160 M10K RAM blocks, all configured in 8 × 1024 (width × depth)

mode. Buffer readers and writers are small modules, which only use 140 logic elements and 330

registers in total. Besides the accelerator, the largest design entity is the system interconnect

generated by Qsys (Altera’s on-chip bus interface generator tool). In the initial version, 12,100

logic elements and 9,200 registers are used by the interconnect. With the addition of pipeline

bridges in the final version, we observe a 40% decrease in logic element usage and a 2.5× increase

in register usage.

In total, the last version of the implementation on the Cyclone V SoC FPGA device uses

16K logic elements, 39K registers, 262 M10K RAM blocks and 64 DSP blocks.


Results on Arria V SoC FPGA Device

Table 6.5: Resource usage of the final implementation on the Arria V SoC FPGA device (numbers in parentheses correspond to the first version on the Arria V SoC FPGA device).

Design Entity                    Logic Elements    Registers         Mem. Bits            M10Ks         DSPs
Accelerator
  Compute Unit                   8,500 (3,300)     39,000 (9,600)    -                    -             240 (56)
  On-Chip Buffers                -                 -                 9,440 K (1,310 K)    1,152 (160)   -
  Buffer Readers & Writers       70 (100)          560 (330)         -                    -             -
  Streaming FIFOs                990 (2,000)       1,870 (2,800)     2 K (30 K)           24 (102)      -
  Accelerator Controller         1,100 (1,200)     1,300 (1,180)     -                    -             9 (8)
  Total                          10,700 (6,600)    42,700 (13,900)   9,442 K (1,340 K)    1,176 (262)   249 (64)
Interconnect                     9,100 (6,900)     10,400 (6,000)    -                    -             -
Custom DMA                       2,800 (1,330)     2,770 (1,500)     -                    -             -
Status Module                    5                 3                 -                    -             -
Instruction Registers            147               342               -                    -             -
Total                            22,800 (15,000)   56,000 (22,000)   9,442 K (1,340 K)    1,176 (262)   249 (64)

Table 6.5 presents the hardware resource usage results of our final implementation on the

Arria V SoC FPGA Device. The numbers in the parentheses are from the first implementa-

tion on the Arria V SoC, corresponding to the fourth implementation in Table 6.2 (labelled

“Direct SDRAM Interface”). Comparing the final and initial versions on Arria V SoC, the

final version incorporates several additional optimizations, including the increased accelerator

size (Section 6.8.1), loop optimization in the accelerator controller (Section 6.6.2) and the data

re-use optimization between consecutive input feature map tiles (Section 6.6.3).

As the accelerator parameters Mo and Mi are increased from 8 to 16, the resource usage of

both the compute unit and on-chip buffers is significantly increased. For the compute unit,

the logic element usage is increased by 2.6×, the register usage is increased by 4.1× and DSP

usage is increased by 4.3×. For on-chip buffers, 7.2× more memory bits and M10K RAM blocks

are used.

In the final version, we made two design changes to reduce the resource usage of streaming

FIFOs. The first change is to remove the FIFOs that were previously present between the input
and weights buffer readers and the compute unit. This was done by inlining the C functions

for the input and weights buffer readers into the compute unit function. If these FIFOs had

not been eliminated, with the size of accelerator increased, the number of FIFOs used in the

accelerator would have increased from 104 to 312 (as the compute unit reads/writes more data

from/to buffers every clock cycle). With this change, the number of FIFOs dropped to 40. The

second change is to fine-tune the FIFO depths according to the pipeline schedule reported by

the LegUp HLS tool. The resource usage of FIFOs is greatly reduced for logic elements (2×),


registers (1.5×), memory bits (15×), and M10K RAM blocks (4.3×). The loop optimization

in the accelerator controller does not have much impact on the resource usage as seen in the

table. In total, the final version of the accelerator uses 10 K logic elements, 43 K registers, 1 K

M10K RAM blocks and 249 DSP units.

For the interconnect, the resource usage is also increased due to the change of accelerator

size. First, we changed the data width of the interconnect to match with the enlarged on-chip

buffer width. Also, more weights buffer slave components are added to the interconnect since

there are now more weights buffer banks. Resource usage of logic elements and registers are

respectively increased by 1.6× and 3.1×. We also increased the data width of custom DMA to

match with the interconnect data width, resulting in 1.3× and 1.7× higher logic element and

register usage, respectively.

The final implementation on Arria V SoC FPGA uses 23 K logic elements, 56 K registers,

1 K M10K RAM blocks and 249 DSP units.

6.8.3 Power Results

We use a DC power generator to measure the power consumption for our last implementation

on DE1-SoC (Cyclone V SoC FPGA), which provides a required 6 V DC voltage to the FPGA

board. The power is measured based on the current reading. When the board is power on

with the OS being idle on the HPS, the total power usage is 6 W. When the accelerator is

active, the total power usage increases slightly to 6.15 W. It is worth noting that there are

many components besides the FPGA device on the DE1-SoC board, including the Ethernet

PHY, VGA output, video decoder, etc. We expect that these components also consume a

considerable percentage of the total static power.

The Arria V SoC development board is equipped with a power measurement circuit. The

development kit includes power monitoring software that allows real-time measurements of

power consumption. Figure 6.14 shows a screenshot of the power monitoring software. Each

of the monitor screens measures one of the power rails of the HPS and FPGA fabric. The

four measuring rails on the left are for HPS and the top three rails on the right are for FPGA

fabric. Looking at the HPS core power and the HPS DDR3 device power, we can break our

system execution into two parts: 1) the software running on the HPS first initializes the neural

network model and loads pre-trained model parameters from disk; 2) the software and hardware

accelerator on FPGA performs inference execution. As seen in the figure, during initialization,


!"#$%&'($)&*('

+,-$./0$!"#$1123$0(45%(6

+,-$./0$!"#$5/7('/.8$./0$)('5)9('.8$

0(45%(6

+,-$./0$:";<$5/7('/.8$./0$

)('5)9('.8$0(45%(6

+,-$./0$:";<$112$0(45%(

:";<$%&'($)&*('=$7'./6%(54('=$./0$

%8&%>

+/575.85?.75&/ +/@('(/%($AB(%C75&/

Figure 6.14: Power monitoring of HPS and FPGA fabric on the Arria V SoC developmentboard.

both the HPS core power and the power of the I/O and DDR3 devices increase to ∼736

mW and ∼135 mW, respectively. During inference execution, the peak readings of HPS core

power, HPS DDR3 device power, and FPGA core power are ∼840 mW, ∼522 mW, ∼2725 mW,

respectively. Combining all power rails for HPS and FPGA, the peak power readings of the

HPS and FPGA are ∼2.5 W and ∼2.8 W, respectively.

6.9 Related Work

In this section, we discuss related work on hardware acceleration of neural network computation.

In [13], the authors propose an ASIC implementation of a neural network accelerator called

DianNao. As seen in Figure 6.15, the proposed accelerator architecture bears similarity to

our own. The compute unit (NFU ) is designed to concurrently accumulate 16 output neurons

with the sum of products of 16 inputs and 16 sets of weights. This is roughly equivalent to

our architecture with both Mo and Mi equal to 16. In addition to convolutional and fully-

connected layers, the proposed architecture also supports max-pooling layers. This is done

by adding comparator trees to form max operators on the side of the adder tree (at NFU-2).
Three on-chip buffers are used for storing the input neurons (NBin), weights (SB) and

output neurons (NBout). Each on-chip buffer is augmented with a DMA unit to improve

the data transfer speed from/to off-chip DRAM. They also created a control processor that

consists of three code generators – one for each of the three supported layer types. Each code

generator creates “instructions” for controlling the buffers and the compute unit. The authors

implemented the accelerator design in custom Verilog and synthesized the ASIC implementation

to a TSMC 65nm GP standard Vth library. The reported Fmax is at 980 MHz. The area and

power consumption are ∼3 mm2 and 485 mW. For run-time measurements, the cycle latencies

were gathered from a cycle-accurate C++ simulator, with the bandwidth to main memory (off-

chip DRAM) set to 250 GB/s. The run-time benchmarking is individually done on several

different neural network layers. Compared to a 128-bit 2GHz SIMD processor, the proposed

accelerator is 117.87× faster. For the largest layer in their experiment, the 5-th convolutional

layer in AlexNet, the proposed accelerator is about ∼500× faster than the SIMD baseline (the

actual run-time result is not reported in the paper).

Figure 6.15: Neural Network Accelerator Architecture of DianNao [13].

The authors further propose a multi-core implementation in [14]. This design aims to

improve the computational throughput and reduce the impact of limited memory bandwidth

to off-chip DRAM. One of the main design characteristics is that all neural network model

parameters (weights and biases) are stored in the on-chip storage. In this multi-core design, each

compute unit (NFU ) is associated with several dedicated RAM blocks for storing the weights.

The weights are kept stationary to the compute units, while the input and output neurons are

transferred among the compute units, since the number of weights is an order of magnitude

greater than the number of input and output neurons. Compared to an NVidia K20M GPU,

the performance speedups, on average, for the 16-, 64-, 256- and 1024-unit implementations


are respectively 21.4×, 79.8×, 216.7×, and 450.7×. For the four implementations, the average

energy reductions are 330.6×, 323.7×, 276× and 150.3×, respectively, compared to the GPU

baseline. The K20M GPU's area is about 550 mm2, while a 16-unit DaDianNao implementation

is 67.7 mm2. However, a major limitation is that their design does not support any model

having more weights than can be accommodated in on-chip storage. The DaDianNao design

nevertheless illustrates the performance benefits of reducing off-chip data transfer by storing

all the weights on chip. Given that the Arria V SoC has a large number of RAM blocks, we

should consider an implementation that maximizes the on-chip buffer capacity to store as many

weights as possible.

Figure 6.16: Neural Network Accelerator Architecture in [33].

For a DNN hardware accelerator design on FPGAs, the authors in [33] proposed the ac-

celerator architecture shown in Figure 6.16. The compute unit is also capable of processing

multiple outputs in parallel with multiple input neurons and weights. For on-chip storage, they

use two sets of buffers to implement double-buffering, so that data transfer can be overlapped

with computation. In addition, they obtain an optimized compute unit structure (the number

of outputs (Mo) and inputs (Mi) to be processed in parallel) for the five convolutional layers in

AlexNet, with the consideration of available DSP resources, buffer sizes and off-chip memory

bandwidth. They report that the optimal structure is to have Mo equal to 64 and Mi equal

to 7 (roughly 1.75 times bigger than our final implementation on Arria V SoC FPGA). For

the five convolutional layers, the design achieves a computation time of 21 ms, outperforming

a baseline 16-thread software implementation on an Intel XEON CPU (@2.2 GHz) by 3.64×,

with a board power consumption of 18.6 W.

Compared to this implementation, the run-time spent on convolutional layers in our final

implementation is 1.67× more than theirs. However, we use 1.75× fewer multipliers and adders.

Using the multiplier usage as area cost, the area-delay product of our implementation is 8,960


(256 multipliers × 35 ms), which is 1.05× better than their area-delay product of 9,408 (448

multipliers × 21 ms). Their implementation uses 32-bit floating-point multipliers and adders,

which require significantly more DSP units. Moreover, the actual compute time spent by our

accelerator is 23 ms. We believe the total run-time of our implementation can be further reduced

if the data transfer and tiling software execution are overlapped with accelerator computation.

Their design only supports convolutional layers, whereas our solution is a complete end-to-

end system that supports all the layer types in AlexNet and handles the problems one would

encounter in a real application, such as the virtual memory management in an OS.

6.10 Summary

This chapter presents the overall implementation of the complete processor/accelerator hybrid

system for neural network inference, including the system integration, custom DMA unit, inter-

connect, and memory system. We also present optimizations for improving data transfer, clock

frequency, and hardware utilization. A unified verification flow is used to test our implemen-

tation at different development stages. We present the run-time and resource usage results for

a series of incrementally optimized implementations. Lastly, we compared our proposed design

to related works.

Chapter 7

Conclusion

7.1 Summary of Contributions

We implemented a DNN inference accelerator, synthesized with LegUp HLS, that operates in

conjunction with an embedded ARM processor. The dual-core ARM processor, running the Linux

OS, executes software that decomposes the neural network computations into tiles. The neural

network accelerator is designed for massively parallel reduced-precision MAC operations. It

accepts DNN-specific instructions from the ARM processor. DMA is used to transfer DNN

neuron outputs and weights from off-chip DDR memory into the on-chip buffers inside the

DNN accelerator. We added the function pipelining feature to the LegUp HLS tool in order to

generate the accelerator hardware from a software design in C. We created a software framework

that integrates all the system components, including the hardware system, the backend APIs

for tiling control and memory transfer, and the software implementation of the inference and

training for all layer types in AlexNet.

Using Heterogeneous Fixed-Point Representation in Neural Network Computation

We investigate the impact on neural network model accuracy of using a variety of data formats

for representing the DNN neurons and weights (including floating-point, uniform fixed-point and

heterogeneous fixed-point). Our experiments show that the heterogeneous fixed-point represen-

tation can achieve a model accuracy close to that obtained using a floating-point representation,

while reducing bitwidth and hardware cost. In heterogeneous fixed-point, the precision of neurons

and weights can be tailored individually on a layer-by-layer basis, where the MAC computations

at a given layer are performed in fixed-point according to a configurable number of integer bits

and fraction bits.
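
As a concrete illustration, the sketch below shows how a layer's dot product might be evaluated under such a per-layer format: products are accumulated in a wide register and the result is rescaled back to the neuron format. The structure and function names (layer_format_t, fixed_point_dot) are hypothetical, chosen only for this example, and are not the framework's actual API.

```c
#include <stdint.h>

/* Hypothetical per-layer format: a value v is stored as round(v * 2^frac_bits). */
typedef struct {
    int neuron_frac_bits;   /* fraction bits used for neuron values */
    int weight_frac_bits;   /* fraction bits used for weight values */
} layer_format_t;

/* One output neuron: multiply-accumulate in a 64-bit register, then shift
 * the result back to the neuron format (rounding and bias omitted). */
static int16_t fixed_point_dot(const int16_t *neurons, const int16_t *weights,
                               int n, layer_format_t fmt)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)neurons[i] * (int32_t)weights[i];

    /* Each product carries neuron_frac_bits + weight_frac_bits fraction bits;
     * shifting right by weight_frac_bits returns to the neuron format. */
    acc >>= fmt.weight_frac_bits;

    /* Saturate to the 16-bit neuron range. */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```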

Software Framework

We have developed a C++ software framework that models reduced-precision DNN inference of a

generic neural network architecture specified in a user-provided configuration file. This frame-

work allows experimentation with a wide variety of DNN architectures and machine learning

applications. When deployed on our SoC system, the software framework divides the computa-

tion into tiles and off-loads the tiles to the hardware accelerator on the FPGA. The software is

also responsible for orchestrating the data transfer and generating custom instructions to guide

the accelerator execution.
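
To illustrate the kind of information the configuration file carries, the sketch below models a layer descriptor as C data; the type names, fields and numeric values are invented for this example and do not reflect the framework's real configuration schema.

```c
/* Hypothetical layer descriptor mirroring one configuration-file entry. */
typedef enum { LAYER_CONV, LAYER_POOL, LAYER_FC, LAYER_RELU } layer_type_t;

typedef struct {
    layer_type_t type;
    int in_channels, out_channels;
    int kernel_size, stride, padding;
    int neuron_frac_bits, weight_frac_bits;  /* per-layer fixed-point format */
} layer_config_t;

/* A miniature, purely illustrative two-layer network description of the kind
 * the framework could build after parsing its configuration file. */
static const layer_config_t example_net[] = {
    { LAYER_CONV, 3, 96, 11, 4, 0, 8, 10 },
    { LAYER_FC,  96, 10,  1, 1, 0, 6, 12 },
};
```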

Hardware Accelerator

The hardware accelerator consists of three major components: the accelerator controller, on-

chip reuse buffers and the compute unit. The accelerator controller is responsible for translating

the custom instructions into cycle-by-cycle control signals for the buffers and the compute unit.

The on-chip buffers are designed to take advantage of reuse opportunities and reduce the off-

chip memory traffic. The highly-pipelined compute unit is capable of performing 64 MAC

operations every clock cycle. We added a function pipelining feature to the LegUp HLS tool,

which permits the streaming hardware accelerator to be efficiently implemented in a high level

language (the C-language). The function pipelining support in LegUp HLS has been submitted

for publication to the 2016 IEEE International Conference on Application-specific Systems, Architectures

and Processors (ASAP).
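
To give a flavour of how such a unit is expressed for HLS, the sketch below shows a C function that consumes 64 neuron/weight pairs per invocation, reduces the products with an adder tree, and adds the result onto a running partial sum; when the HLS tool pipelines a function of this shape with an initiation interval of one, a new set of 64 MACs can begin every clock cycle. The function name, data widths and reduction structure are illustrative only and are not the exact accelerator code used in this work.

```c
#include <stdint.h>

#define NUM_LANES 64   /* MAC operations issued per invocation */

/* One streaming invocation: 64 parallel multiplies feeding a
 * log2(64) = 6-level adder tree, accumulated onto the partial sum
 * for one output neuron. */
int32_t compute_unit(const int16_t neurons[NUM_LANES],
                     const int16_t weights[NUM_LANES],
                     int32_t partial_sum)
{
    int32_t prod[NUM_LANES];

    for (int i = 0; i < NUM_LANES; i++)
        prod[i] = (int32_t)neurons[i] * (int32_t)weights[i];

    /* Balanced reduction: halve the number of live terms at each level. */
    for (int stride = NUM_LANES / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            prod[i] += prod[i + stride];

    return partial_sum + prod[0];
}
```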

A Complete End-to-End System

We implemented a complete and working system integrating the software framework along with

the hardware accelerator, allowing the accelerated inference of a neural network to be performed

based on a high-level configuration file. The software framework also includes the support

necessary for deployment on top of the Linux OS, which required page

locking and virtual-to-physical address translation. The complete system running on an OS

makes it readily usable for real-world neural network applications.

7.2 Future Work

Translation Software Optimization and Multi-Core Implementation

One useful future direction is to optimize the tiling scheme, specifically the tiling size and

traversal order. We believe the tiling scheme can be optimally customized for each layer of

computation in order to improve data reuse and minimize memory traffic. Another useful

improvement to our current implementation is to overlap data transfer with accelerator compu-

tation so that the latency can be reduced. Furthermore, we can implement a multi-core system

on a larger FPGA device where multiple accelerator cores are instantiated and perform compu-

tation in parallel. For such a multi-core system, we would need to update the tiling scheme and

add support to coordinate data transfer and schedule computation among multiple accelerator

cores. Data reuse and sharing between accelerator cores will be an important design considera-

tion to prevent the off-chip memory traffic from limiting the overall system performance. Since

the above optimization tasks are co-dependent, an optimal tiling or scheduling scheme cannot

be found without considering all related factors as a whole. Therefore, an interesting future

research project is to design a software simulator that explores the solution space and identifies

an optimal tiling and scheduling scheme for a specific neural network model.
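
As a point of reference for what such an exploration would tune, the sketch below expresses a fully-connected layer as a tiled loop nest; the tile sizes TILE_OUT and TILE_IN and the ordering of the two outer loops are precisely the knobs a tiling/scheduling simulator would search over, since they determine how often inputs, weights and partial sums are re-fetched. The layer and tile sizes are hypothetical and the code is a software illustration, not the tiling routine used in this work.

```c
#include <stdint.h>

#define N_OUT     1024   /* output neurons (hypothetical layer)             */
#define N_IN      1024   /* input neurons                                   */
#define TILE_OUT    64   /* outputs computed per tile                       */
#define TILE_IN    256   /* inputs per tile, bounded by on-chip buffer size */

void fc_layer_tiled(const int16_t weight[N_OUT][N_IN],
                    const int16_t input[N_IN], int32_t output[N_OUT])
{
    for (int o = 0; o < N_OUT; o++)
        output[o] = 0;

    /* The two outer loops define the traversal order: iterating input tiles
     * innermost keeps partial sums resident, while the opposite order reuses
     * an input-neuron tile across many output tiles. */
    for (int oo = 0; oo < N_OUT; oo += TILE_OUT)
        for (int ii = 0; ii < N_IN; ii += TILE_IN)
            /* One tile: the unit of work off-loaded to the accelerator. */
            for (int o = oo; o < oo + TILE_OUT && o < N_OUT; o++)
                for (int i = ii; i < ii + TILE_IN && i < N_IN; i++)
                    output[o] += (int32_t)weight[o][i] * (int32_t)input[i];
}
```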

Using a More Cost-Efficient Data Representation and Arithmetic

Recent research [17, 28] (Section 3.6) has shown that a neural network trained with the binary

value constraint (i.e., +1 or −1) on its weights can achieve nearly state-of-the-art accuracy.

Each such weight can be stored with just one bit, and the constraint converts the

multiplication of a weight and a neuron into a sign inversion. Another cost-efficient

data format for weights is to limit the number of bits set to 1 in a fixed-point representation.

For instance, we can limit all the weights in a neural network to have no more than two bits

set to 1, allowing the multiplication of a weight and a neuron to be performed with two shifts

and an add. This representation also permits a more compact data format, where a value can

be represented by the indices of its 1-bits within the wider word. Future work is to investigate the

feasibility of using such restricted weight representations in large-scale neural network models.

If the restricted weight representations prove to be feasible, it would be exciting to design a

custom hardware accelerator that takes advantage of the low-cost operations (inversion, shift

and add) and the reduced data storage and traffic. We have been mentoring a fourth-year

undergraduate student’s thesis project investigating this direction.
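
A minimal sketch of the two-1-bit idea is given below, assuming each weight is stored compactly as a sign plus the positions of its (at most two) 1-bits; the encoding and names are invented for illustration and are not a format used in this thesis.

```c
#include <stdint.h>

/* Hypothetical compact weight encoding: a sign and the positions of the
 * (at most) two 1-bits in the fixed-point magnitude. A position of 0xFF
 * marks an unused slot, so weights with a single 1-bit are representable. */
typedef struct {
    int8_t  sign;   /* +1 or -1 */
    uint8_t shift0; /* bit position of the first 1, or 0xFF if unused  */
    uint8_t shift1; /* bit position of the second 1, or 0xFF if unused */
} weight2_t;

/* Multiply a neuron by such a weight using at most two shifts and an add;
 * no hardware multiplier is required. */
static int32_t mul_weight2(int32_t neuron, weight2_t w)
{
    int32_t acc = 0;
    if (w.shift0 != 0xFF) acc += neuron << w.shift0;
    if (w.shift1 != 0xFF) acc += neuron << w.shift1;
    return (w.sign < 0) ? -acc : acc;
}
```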

Neural Network Compression

Recent studies [21, 20] have shown that the weights of deep neural networks

can be compressed significantly by pruning the weight connections, sharing weight values and

encoding the non-zero-value indices in sparse weight matrices. The experiments in [20] show

that the storage requirements for the AlexNet model can be reduced by 35×, down to 6.9 MB,

without affecting the accuracy. From our point of view, neural network compression techniques

can be exploited to reduce the off-chip memory traffic or even make it possible to store all the

weights in the hardware accelerator’s on-FPGA memory.
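
To make the decode cost concrete, the sketch below computes the dot product of one row of a pruned, weight-shared matrix, stored as (column-offset, codebook-index) pairs, with a dense input vector; the structure only loosely mirrors the scheme of [20], and all names and field widths here are illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* One surviving (non-pruned) connection: a small offset from the previous
 * non-zero column and an index into a shared-weight codebook. */
typedef struct {
    uint8_t col_offset;  /* distance from the previous non-zero column */
    uint8_t code;        /* index into the shared-weight codebook      */
} sparse_entry_t;

/* Dot product of one compressed matrix row with a dense input vector. */
static float sparse_row_dot(const sparse_entry_t *row, size_t nnz,
                            const float *codebook, const float *input)
{
    float  acc = 0.0f;
    size_t col = 0;
    for (size_t k = 0; k < nnz; k++) {
        col += row[k].col_offset;              /* reconstruct column index */
        acc += codebook[row[k].code] * input[col];
    }
    return acc;
}
```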

Support for Recurrent Neural Networks

In this project, we only considered feed-forward networks where all the layers are connected in a

uniform direction. Recurrent neural networks (RNNs), which have additional feedback connections

from a deeper layer to a shallower layer, have also shown promising performance in many

applications. The computation of RNNs is fairly similar to that of feed-forward networks,

and thus most of the work in this project is reusable if one were to build an RNN accelerator.

Bibliography

[1] BVLC CaffeNet model. https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.

[2] Google Test, Google’s C++ test framework. https://github.com/google/googletest.

[3] Protocol buffer. https://developers.google.com/protocol-buffers.

[4] Stanford Vision Lab. http://vision.stanford.edu.

[5] Altera Corporation. Enabling High-Performance DSP Applications with Stratix V Variable-

Precision DSP Blocks, May 2011. Version 1.1.

[6] Altera Corporation. Altera Cyclone V SoC Development Kit, Reference Platform Porting

Guide, November 2015.

[7] Altera Corporation. Embedded Peripherals IP User Guide, December 2015. Version

2015.12.16.

[8] Altera Corporation. Quartus Prime Standard Edition Handbook Volume 1: Design and

Synthesis, November 2015. Version 15.1.0.

[9] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new per-

spectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–

1828, 2013.

[10] G. Bradski. Dr. Dobb’s Journal of Software Tools.

[11] A. Canis, S.D. Brown, and J.H. Anderson. Modulo SDC scheduling with recurrence min-

imization in high-level synthesis. In Field Programmable Logic and Applications (FPL),

2014 24th International Conference on, pages 1–8, Sept 2014.

[12] A. Canis, Jongsok Choi, B. Fort, Ruolong Lian, Qijing Huang, N. Calagar, M. Gort,

Jia Jun Qin, M. Aldham, T. Czajkowski, S. Brown, and J. Anderson. From software to

accelerators with LegUp high-level synthesis. In Compilers, Architecture and Synthesis for

Embedded Systems (CASES), 2013 International Conference on, pages 1–9, Sept 2013.

[13] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier

Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-

learning. In Proceedings of the 19th International Conference on Architectural Support for

Programming Languages and Operating Systems, ASPLOS ’14, pages 269–284, New York,

NY, USA, 2014. ACM.

[14] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi

Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. DaDianNao: A machine-learning su-

percomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on

Microarchitecture, MICRO-47, pages 609–622, Washington, DC, USA, 2014. IEEE Com-

puter Society.

[15] J. Cong, Bin Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Zhiru Zhang. High-level

synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated

Circuits and Systems, IEEE Transactions on, 30(4):473–491, April 2011.

[16] Jason Cong and Yi Zou. FPGA-based hardware acceleration of lithographic aerial image

simulation. ACM Trans. Reconfigurable Technol. Syst., 2(3):17:1–17:29, September 2009.

[17] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training

deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28 (NIPS), pages 3105–3113, 2015.

[18] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic

for deep learning. ICLR Workshop, 2015.

[19] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep

learning with limited numerical precision. CoRR, abs/1502.02551, 2015.

[20] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural

network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149,

2015.

[21] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connec-

tions for efficient neural networks. CoRR, abs/1506.02626, 2015.

[22] Yangqing Jia. Learning Semantic Image Representations at a Large Scale. PhD thesis,

Electrical Engineering and Computer Sciences, University of California at Berkeley, May

2014.

[23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir-

shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast

feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with

deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q.

Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–

1105. Curran Associates, Inc., 2012.

[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to

document recognition. In Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.

[26] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits.

[27] Fei-Fei Li. ImageNet: crowdsourcing, benchmarking & other cool things. CMU VASC

Seminar, March 2010.

[28] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural

networks with few multiplications. CoRR, abs/1510.03009, 2015.

[29] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations

by back-propagating errors. In Neurocomputing: Foundations of Research, pages

696–699. MIT Press, Cambridge, MA, USA, 1988.

[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-

heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and

Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of

Computer Vision (IJCV), 115(3):211–252, 2015.

[31] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-

houcke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and

Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1–9, June 2015.

[32] The HDF Group. Hierarchical Data Format, version 5, 1997-NNNN.

http://www.hdfgroup.org/HDF5/.

[33] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimiz-

ing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings

of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,

FPGA ’15, pages 161–170, New York, NY, USA, 2015. ACM.