A Framework for FPGA-Based Acceleration of Neural Network
Inference with Limited Numerical Precision via High-Level Synthesis
with Streaming Functionality
by
Ruo Long Lian
A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2016 by Ruo Long Lian
Abstract
A Framework for FPGA-Based Acceleration of Neural Network Inference with Limited
Numerical Precision via High-Level Synthesis with Streaming Functionality
Ruo Long Lian
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2016
Deep neural networks (DNNs) are achieving state-of-the-art performance in many artificial intel-
ligence tasks, such as computer vision and speech recognition. Due to the high computational
requirements of DNNs, there is an increasing need to design custom hardware for accelerating
DNN computation with a low power budget.
This thesis proposes an FPGA-based acceleration solution for DNN inference, realized on a
SoC device where software controls the execution and off-loads compute-intensive operations to
the hardware accelerator. To minimize the hardware cost, limited precision data representations
are investigated for DNN computations, and incorporated in the accelerator design. Streaming
functionality is added to the LegUp high-level synthesis tool, allowing the hardware accelerator
to be designed entirely in C-language and synthesized to pipelined hardware. The accelerator
solution is not tied to a particular DNN architecture; rather, it is configurable in software,
permitting the acceleration of a range of DNN architectures proposed in the recent literature.
Acknowledgements
I am very grateful to have worked with many wonderful people throughout my M.A.Sc study
and research. First and foremost, I would like to express my sincere gratitude to my advisor
Professor Jason Anderson. This thesis would not have been possible without his guidance and
support. I admire, appreciate and am inspired by Jason’s enthusiasm, dedication and work
ethic on his teaching and research. Ever since my very first undergraduate courses on computer
fundamentals and digital systems, to my undergraduate summer research project, and to this
thesis work, Jason has always been a great teacher and mentor. Jason has not only introduced
me to the wonderful field of computer engineering, but has also helped me to vastly improve
my research ability and communication skills.
I would like to thank our collaborators from Samsung, John Brothers and Joohoon Lee
for the invaluable discussion and feedback. I would also like to thank my thesis committee
members, Professor Vaughn Betz and Professor Andreas Moshovos for their edits and feedback
on this work.
I would like to thank the people involved with the LegUp project; I was lucky to work with
such an amazing team: Blair Fort, Nazanin Calagar, Bain Syrowik, Joy Chen, and Julie Hsiao.
In particular, I want to thank Andrew Canis and Jongsok (James) Choi, with whom I spent
much time discussing ideas and improving LegUp. I would also like to thank my
roommate Max for the thesis edits and many stress-reducing talks.
I am deeply indebted to my parents for their unconditional support. I would not have been
able to achieve any of this without you guys. Last but not least, a special thanks to Isabella for all
the love and understanding.
Contents
1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Thesis Overview

2 Background
2.1 Primer on Neural Networks
2.1.1 General Structure
2.1.2 Neural Network Training and Inference
2.2 Major Layers
2.2.1 Fully-connected Layers
2.2.2 Convolutional Layers
2.2.3 Maxpooling Layers
2.2.4 Local Response Normalization (LRN) Layers
2.3 Benchmark Models
2.3.1 A Toy Model for The MNIST Dataset
2.3.2 AlexNet for The ImageNet Dataset
2.4 Key Operation - Multiply and Accumulate (MAC)

3 Using Low-Precision Fixed-point in Neural Networks
3.1 Floating-Point versus Fixed-Point
3.2 Impact of Bit-width on Hardware Operator
3.3 Low-Precision Fixed-point
3.3.1 Heterogeneous Fixed-Point Representation
3.3.2 Heterogeneous Fixed-Point Arithmetic
3.3.3 Conversion Between Fixed-Point Formats
3.4 Software Framework
3.4.1 Object-Oriented Architecture
3.4.2 Model Configuration File
3.5 Experimental Study
3.5.1 Uniform Fixed-Point in Neural Network Training
3.5.2 Value Range Profiling in Floating-Point Neural Network Training
3.5.3 Heterogeneous Fixed-point in Neural Network Inference
3.5.4 Heterogeneous Fixed-point in AlexNet Inference
3.6 Related Work

4 System Architecture
4.1 Accelerator Design
4.1.1 Computation and Data Access
4.1.2 Data Reuse and On-Chip Storage
4.1.3 Accelerator Structure versus Data Width of On-Chip Buffer
4.1.4 Accelerator Structure
4.2 System Design
4.3 Translation Layer Details
4.3.1 Translation Layer for Fully-Connected Layers
4.3.2 Translation Layer for Convolutional Layers
4.4 Summary

5 Streaming Hardware Generation in LegUp High-Level Synthesis
5.1 Loop Pipelining
5.2 Pipeline Function Interface
5.3 FIFO Support
5.3.1 First-Word-Fall-Through FIFO
5.3.2 Software Support
5.4 Stall Logic and Datapath
5.5 Accelerator Design using LegUp’s Pipelined Function Feature
5.5.1 Software Implementation of Accelerator Controller
5.5.2 Software Implementation of the Compute Unit
5.5.3 Software Implementation of Buffer Readers and Writer
5.6 Summary

6 System Implementation
6.1 Introduction
6.2 SoC Device Overview
6.3 System Integration
6.3.1 The Instruction Registers Module
6.3.2 The Status Module
6.3.3 The Accelerator Controller Interface
6.3.4 The DMA Module
6.3.5 Interconnect
6.4 Data Movement Optimization
6.4.1 Addressing Scheme for the On-Chip Buffers
6.4.2 Custom DMA Design
6.4.3 Translation From Virtual Addresses to Physical Addresses
6.5 Clock Frequency Optimization
6.6 Additional Optimizations
6.6.1 Using the Direct SDRAM Interface
6.6.2 Control Loop Optimization in Accelerator Controller
6.6.3 Reuse of Data Between Input Feature Map Tiles
6.7 Verification
6.8 Experimental Results
6.8.1 Performance Results
6.8.2 Hardware Resource Usage
6.8.3 Power Results
6.9 Related Work
6.10 Summary

7 Conclusion
7.1 Summary of Contributions
7.2 Future Work
List of Tables
3.1 Value range profiling of the MNIST model
3.2 Value range profiling of AlexNet
3.3 Inference Accuracy of AlexNet With Heterogeneous Fixed-point Representations
6.1 Design components being tested at each development/verification flow
6.2 Run-time breakdown per image inference of AlexNet
6.3 Time spent on data transfer between memory and on-chip buffers
6.4 Resource usage of the proposed system on the Cyclone V SoC FPGA
6.5 Resource usage of the proposed system on the Arria V SoC FPGA
List of Figures
2.1 A neuron in artificial neural networks
2.2 A simple deep neural network
2.3 Example activation functions
2.4 Illustration of a convolutional layer
2.5 Illustration of a maxpooling layer
2.6 A neural network model for MNIST dataset
2.7 ILSVRC image classification error rate from 2010 to 2015
2.8 An illustration of the AlexNet architecture
2.9 Computation time distribution of individual layers on GPU and CPU
3.1 Schematic of a 16-input dot product operator
3.2 The impact of input bit-width on the resource usage, FMax and power of a dot-product operator
3.3 Example of conversion between heterogeneous fixed-point representations
3.4 Key components of the object-oriented software architecture
3.5 A snippet of a model configuration file
3.6 Training of MNIST model using 32-bit fixed-point representations
3.7 Inference of MNIST model using uniform and heterogeneous fixed-point representations
4.1 Accelerator Design
4.2 Three abstract layers in the overall system
4.3 Tiling and traversal of fully-connected layers
4.4 Traversal order in the tiling software for convolutional layers
4.5 Traversal order in the accelerator controller for convolutional layers
5.1 Module interface of a sequential function
5.2 Handshaking between sequential functions
5.3 Ready-valid-data (RVD) interface
5.4 Handshaking between source and sink using RVD interface
5.5 RVD-compatible interface of FWFT FIFO
5.6 Pipeline circuit datapath and stall logic
6.1 Overall system integration
6.2 Data layout in off-chip memory for a tile in fully-connected layer
6.3 Data layout in on-chip buffers for a tile in fully-connected layer
6.4 Data layout in off-chip memory for a tile in convolutional layer
6.5 Data layout in on-chip buffers for a tile in convolutional layer
6.6 Implementation in the custom DMA unit that handles unaligned starting address
6.7 Example use of pipeline bridges in the interconnect
6.8 Interconnect from DMA master to SDRAM
6.9 Previous traversal order during the tiling of a convolutional layer
6.10 A data re-use-friendly traversal order
6.11 Using input buffer in a circular organization
6.12 Screen shot of chip planner on Cyclone V SoC device
6.13 Screen shot of chip planner on Arria V SoC device
6.14 Power monitoring of HPS and FPGA fabric on the Arria V SoC development board
6.15 Neural Network Accelerator Architecture of DianNao [13]
6.16 Neural Network Accelerator Architecture in [33]
Chapter 1
Introduction
1.1 Motivation
Deep neural networks (DNNs) have gained prominence recently by producing state-of-the-art
results in pattern recognition, speech synthesis, customer preference elicitation, and other ma-
chine learning tasks [9]. Like genetic algorithms and simulated annealing, DNNs are based on
an analogy with real-world biological/physical processes. A DNN is a computational model
inspired by the structure and function of the brain, wherein abstract “neurons” perform com-
putations on inputs received from other adjacent neurons, and produce outputs that are then
received and processed by other neurons in the network. The degree of influence a neuron
has on another neuron is reflected by a numerical weight. In simple terms, training a DNN is
the process of selecting values for the weights so that the overall neural network produces the
desired output for a given input. On the other hand, inference of a DNN refers to the use of
an already-trained model to make a prediction for an input that was not seen during training.
For example, in the case of image recognition, when the DNN is presented with an image of a
tree, the DNN’s outputs identify it as such. DNN research is very active in both academia
and industry, with companies such as Google, Facebook, and Baidu actively
publishing (e.g. [31]) and competing with one another. The future potential for DNNs, both
in big-data/data center applications and in low-power/mobile/IoT applications, appears to be
very strong. One can imagine, for example, DNNs in data centers being applied for analytics
on data gathered from social media, and DNNs applied in smartphones for image recognition
directly from the camera feed. In both contexts, there is a pressing need for both low
power and high speed.
In most applications, DNNs are first trained off-line with a large set of training data on
machine clusters or GPUs. Trained neural network models are then deployed for inference
tasks in data centres or in an embedded environment, serving a large set of end-users and
applications. For example, a pre-trained neural network model that predicts web-advertisement
click-through rates can be deployed on thousands of ad servers, making inferences for billions
of ad placements on websites and apps every day. Likewise, in the embedded context, an image-
recognition neural network could be deployed in self-driving cars. Generally speaking, neural
network inference is executed much more frequently than training – a trained network would be
used for inference many times. Therefore, this research focuses on optimizing inference.
The goal of the research is to investigate, design, develop and evaluate customized
hardware implementations for the inference of deep neural networks, with the aim of achieving
considerably better speed and energy efficiency than can be realized with standard processors
or GPUs.
Broadly speaking, there are two ways of implementing computations: in hardware or in
software. The latter approach is most frequently used, as it is more straightforward and software
development skills are widely available. The software is run on a standard processor, which
is a generic platform with fixed datapath widths, and high overhead associated with fetching
and decoding instructions, accessing memory, and so on. The hardware approach involves
the custom design of a circuit dedicated to the particular application needs. The functions
performed by the circuit, the degree of parallelism, and the datapath widths can all be tailored
precisely to the application requirements, reducing overhead considerably, and improving speed
and power versus using a processor. Indeed, such a customized computing approach can provide
order(s) of magnitude improvement in speed and energy over processors [16].
Although specialized hardware has the potential to provide huge acceleration at a fraction
of a processor’s energy, the main drawback is related to its design. On one hand, describing
hardware components in a hardware description language (HDL) (e.g. VHDL or Verilog) allows
a designer to adopt existing tools for RTL and logic synthesis into the target technology. On the
other hand, this requires the designer to specify functionality at a low level of abstraction, where
cycle-by-cycle behaviour is completely specified. The use of such languages requires advanced
hardware expertise, besides being cumbersome to develop in. This leads to longer development
times that can critically impact the time-to-market.
An interesting solution to realize customized computing and, at the same time, address the
time-to-market problem is the combination of reconfigurable hardware architectures, such as
field-programmable gate arrays (FPGAs) and high-level synthesis (HLS) tools [15]. FPGAs are
integrated circuits that can be configured by the end user to implement digital circuits. Most
FPGAs are also reconfigurable, allowing a relatively quick refinement and optimization of a
hardware design with no additional manufacturing costs. The designer can modify the HDL
description of the components and then use an FPGA vendor toolchain for the synthesis of the
bitstream to configure the FPGA. HLS tools start from a high-level software programmable
language (HLL) (e.g. C, C++, SystemC) to automatically produce a circuit specification
in HDL that performs the same function as the software. HLS offers benefits to software
engineers, enabling them to reap some of the speed and energy benefits of hardware, without
actually having to build up hardware expertise. HLS also offers benefits to hardware engineers,
by allowing them to design systems faster at a high level of abstraction and rapidly explore the
design space. This is crucial in the design of complex systems and especially suitable for FPGA
design where many alternative implementations can be easily generated, deployed onto the
target device, and compared.
1.2 Contributions
We design and implement a complete acceleration solution for the inference step of deep neural
networks, specifically in an embedded context. Our implementation is realized on a System-on-
Chip (SoC) FPGA device where a software framework accepts a generic (user-specified) neural
network model as input, then executes and accelerates the corresponding inference by off-
loading the computation to the hardware accelerator implemented on the FPGA. The principal
contributions of this thesis include:
• Using a heterogeneous fixed-point representation in the neural network computation,
which brings performance and area benefits to the custom hardware, while at the same
time, retaining the model accuracy.
• A high-throughput hardware design that accelerates the most compute-intensive part of
neural network inference.
• A function pipelining feature implemented as part of the LegUp high-level synthesis tool,
which allows a streaming design to be described in C-language.
• A complete system implementation that performs accelerated inference using an already-
trained neural network model.
• A working software and hardware framework that will enable many further optimizations
to be realized rapidly.
The function pipelining support in LegUp HLS has been submitted for publication to
the 2016 IEEE Int’l Conference on Application-specific Systems, Architectures and Processors
(ASAP).
1.3 Thesis Overview
Chapter 2 first introduces the basics of neural networks and two example models that serve as
target benchmarks in this project. Chapter 3 explores the feasibility of using a reduced-precision
data representation in neural network computation. Our system architecture is described in
Chapter 4. Chapter 5 describes the work done for adding pipeline function support to the LegUp
HLS tool. In Chapter 6, we explain the system implementation and present the experimental
results. Chapter 7 summarizes the thesis contributions and proposes suggestions for future
work.
Chapter 2
Background
This chapter provides basic background on neural networks. We begin by introducing their
general structure, as well as how they can be trained and then used for inference, in Section 2.1.
Four types of neural network layers are described in Section 2.2. Section 2.3 describes two
benchmark models that are used in this project. We highlight the key operation of neural
network computation, multiply and accumulate, in Section 2.4.
2.1 Primer on Neural Networks
2.1.1 General Structure
The neuron is the basic element in an artificial neural network (Figure 2.1). A neuron receives
inputs that may come from the observable properties of a given problem (e.g., image pixels, or
audio samples) or may be the outputs of other neurons. Each connection from an input to a
neuron is associated with a synaptic weight. The neuron sums up the products of all pairs
of inputs and synaptic weights. This weighted sum is typically offset with a bias term in order
to make the model more general. The bias term can be considered as an additional synaptic
weight that is always connected to a constant input of +1. Nonlinearity is introduced into
neural networks by applying an activation function at the neuron output. The activation
function transfers the weighted sum to an output value that defines the state of the neuron.
Example activation functions may be a sigmoid function that maps weighted sum values from
(−∞, +∞) to (0, 1) (Figure 2.3a), or a Rectified Linear Unit (ReLU), which clamps all negative
values to 0 and retains all positive values (Figure 2.3b).
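For concreteness, the following C++ sketch computes one neuron’s output with a ReLU activation; the function and variable names are ours, chosen for illustration only.

    #include <cstddef>
    #include <vector>

    // Minimal sketch of the neuron in Figure 2.1: a weighted sum of inputs
    // plus a bias, passed through an activation function (ReLU shown here; a
    // sigmoid would be 1.0f / (1.0f + std::exp(-z))).
    float neuron_output(const std::vector<float>& inputs,
                        const std::vector<float>& weights, float bias) {
        float z = bias;
        for (std::size_t k = 0; k < inputs.size(); ++k)
            z += weights[k] * inputs[k];  // sum of input-weight products
        return z > 0.0f ? z : 0.0f;       // ReLU activation
    }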
Figure 2.1: A neuron in artificial neural networks, $o = f(z) = f(\sum_{k=1}^{n} (w_k \cdot i_k) + b)$.
!"""""!"""""!
!""!""!
Output Layer
Hidden Layer
Input Layer
Figure 2.2: A simple deep neural network.
Figure 2.3: Example activation functions. (a) Sigmoid function, $y = 1/(1+e^{-x})$. (b) ReLU function, $y = (x > 0)\ ?\ x : 0$.
A neural network with only one layer of synaptic weights can only approximate linear
functions of the inputs. Therefore, intermediate layers are introduced to form more powerful
neural networks in order to approximate more complex models. These intermediate layers are
known as hidden layers since their states are usually not directly observable. Neural networks
with one or more hidden layers are referred to as deep neural networks (Figure 2.2). As shown in
the figure, a neural network can have more than one output. In the case of deep neural networks,
values of the neuron outputs are computed in topological order, from inputs to outputs – the
outputs from neurons in one layer are used as inputs to neurons in deeper layers.
2.1.2 Neural Network Training and Inference
Training of a neural network is the process of finding a set of parameters (weights and bias)
that minimize the model’s approximation error on the training dataset. The approximation
error can be calculated by a loss function, which is typically determined based on the task.
For example, the Euclidean loss function is a popular choice for real-valued regression tasks. It
computes the sum of squares of differences between each model output o and desired output t
as follows:
$$E = \frac{1}{2N} \cdot \sum_{i=1}^{N} (o_i - t_i)^2 \qquad (2.1)$$
where N is the total number of samples in the training dataset.
The combination of gradient descent and back-propagation [29] is the most commonly used
technique to train a neural network. Gradient descent aims to minimize the error, E. In
gradient descent, the synaptic weight w is updated by a value that is proportional to the
derivative of total error with respect to w, i.e.,
$$w = w - \alpha \cdot \partial E/\partial w \qquad (2.2)$$
where the term α is known as the learning rate.
The back-propagation algorithm can be divided into two phases, the forward pass and the
backward pass. Using the neuron model in Figure 2.1 as an example, the forward pass computes
the neuron output values layer by layer from input to output, with the form
$$o = f(z) = f\Big(\sum_{k=1}^{n} (w_k \cdot i_k) + b\Big) \qquad (2.3)$$
where f is the non-linear activation function.
The purpose of the backward pass is to find the error derivative with respect to each model
parameter ($\partial E/\partial w_k$ in Equation 2.2), so that the parameters can be updated using the gradient
descent method. Given the error between actual output and desired output, E, the error
gradient with respect to the weight can be computed as:
$$\partial E/\partial w_k = \partial E/\partial o \cdot \partial o/\partial z \cdot \partial z/\partial w_k = \partial E/\partial o \cdot \partial o/\partial z \cdot i_k \qquad (2.4)$$
where ∂E/∂o is the error derivative with respect to the output neuron, ∂o/∂z is the derivative
of the activation function f, and $\partial z/\partial w_k$ reduces to the input neuron value $i_k$ whose
connection to o is associated with $w_k$. When computing the error derivative with respect to
a weight in a hidden layer, the term ∂E/∂o refers to the error derivative of the hidden neuron.
The error derivative with respect to neurons in each hidden layer is computed by propagating
the error derivatives backward layer by layer, from output neurons to input neurons, with the
form:
$$\partial E/\partial i_k = \partial E/\partial o \cdot \partial o/\partial z \cdot \partial z/\partial i_k = \partial E/\partial o \cdot \partial o/\partial z \cdot w_k \qquad (2.5)$$
where $\partial E/\partial o$ and $\partial E/\partial i_k$ are the error derivatives with respect to the output and input neuron
of the current layer; the term $\partial z/\partial i_k$ is simplified to the weight $w_k$ that corresponds to the
connection between input $i_k$ and output o. When computing the error derivatives for the
preceding layer (in the sense of the forward pass), an input neuron $i_k$ in the current layer becomes
an output neuron of the preceding layer. Thus, the computed $\partial E/\partial i_k$ becomes $\partial E/\partial o$ in
Equation 2.4 when computing the error derivative with respect to the weights in the preceding
layer, and substitutes $\partial E/\partial o$ in Equation 2.5 for computing the error derivative with respect
to the input neurons in the preceding layer.
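To make the update concrete, below is a C++ sketch of Equations 2.2, 2.4 and 2.5 for a single neuron; the names are illustrative and this is not the thesis framework’s code.

    #include <cstddef>
    #include <vector>

    // Sketch of Equations 2.2, 2.4 and 2.5 for a single neuron. dE_do is the
    // error derivative with respect to the neuron output o, and df_dz is the
    // activation derivative evaluated at the weighted sum z.
    void backprop_one_neuron(const std::vector<float>& inputs,   // i_k
                             std::vector<float>& weights,        // w_k
                             std::vector<float>& dE_di,          // dE/di_k, pre-zeroed
                             float dE_do, float df_dz, float alpha) {
        for (std::size_t k = 0; k < weights.size(); ++k) {
            float dE_dw = dE_do * df_dz * inputs[k];   // Equation 2.4
            dE_di[k] += dE_do * df_dz * weights[k];    // Equation 2.5, summed over fan-out
            weights[k] -= alpha * dE_dw;               // Equation 2.2, gradient descent
        }
    }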
Using a neural network for inference means using a pre-trained model to approximate the
output for unseen data samples. Neural network inference is equivalent to the forward pass of
the back-propagation algorithm.
2.2 Major Layers
The following paragraphs describe the four types of layers used by the target benchmarks in
this research.
2.2.1 Fully-connected Layers
In a fully-connected layer, every neuron is connected to all the neurons in its previous layer.
Each connection between neurons is associated with a unique synaptic weight and each neuron
is associated with a bias term. For a fully-connected layer with M outputs and N inputs, there
will be as many as M × N trainable weights and M trainable biases.
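A minimal C++ sketch of the forward pass of such a layer follows; the names are illustrative and the activation function is omitted.

    // Fully-connected layer with N inputs and M outputs:
    // y[m] = b[m] + sum over n of W[m][n] * x[n], with W stored row-major.
    void fc_forward(const float* W, const float* b, const float* x, float* y,
                    int M, int N) {
        for (int m = 0; m < M; ++m) {
            float acc = b[m];                 // per-output bias
            for (int n = 0; n < N; ++n)
                acc += W[m * N + n] * x[n];   // one multiply-accumulate per connection
            y[m] = acc;
        }
    }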
2.2.2 Convolutional Layers
Unlike fully-connected layers, output neurons in a convolutional layer are only connected to a
local region of input neurons. Output neurons may share a common set of synaptic weights,
applying the same connectivity pattern to different regions of the input. The set of shared synaptic
weights can be thought of as a filter or a feature extractor. Consider the scenario where the input
to the neural network is an image with three colour channels. A 3D filter with dimensions of
K×K×3 is applied to a region of pixels in the input to extract the feature state at that particular
local region.

Figure 2.4: Illustration of a convolutional layer.

Such a local region is sometimes referred to as the output neuron’s receptive field in
its previous layer. As shown in Figure 2.4, the filter acts as a sliding window that moves across
a multi-channel image to extract a feature from all local regions. The extracted values at all
locations form a feature map representing the intensity of the feature at each location.
A convolutional layer is usually designed to capture more than one feature and hence,
multiple filters are used to construct a stack of output feature maps. For example, one filter
may be “looking” for horizontal lines in an input image; another filter may be “looking” for
diagonal lines. As with the filter weights, the bias term is shared by all output neurons in
the same feature map. The training process for a convolutional layer aims to adjust filter
weights and bias in order to extract features that can help to minimize approximation error of
the overall network. It is worth mentioning that filters may not be applied at every possible
starting position of the input feature maps; rather, a convolutional layer may have non-unit
stride. For example, with a stride of 2 in both dimensions, filters would be applied at positions
separated by 2 pixels.
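The following C++ sketch, with illustrative names and no input padding, computes one output feature map of such a layer:

    // One output feature map of a convolutional layer: a K x K x C filter
    // slides over a C-channel input of size H x W with the given stride.
    void conv_one_map(const float* in, int C, int H, int W,
                      const float* filt, int K, float bias, int stride,
                      float* out /* ((H-K)/stride+1) x ((W-K)/stride+1) */) {
        int OH = (H - K) / stride + 1, OW = (W - K) / stride + 1;
        for (int oy = 0; oy < OH; ++oy)
            for (int ox = 0; ox < OW; ++ox) {
                float acc = bias;                 // bias shared by the whole map
                for (int c = 0; c < C; ++c)       // all input channels
                    for (int fy = 0; fy < K; ++fy)
                        for (int fx = 0; fx < K; ++fx)
                            acc += filt[(c * K + fy) * K + fx] *
                                   in[(c * H + oy * stride + fy) * W + ox * stride + fx];
                out[oy * OW + ox] = acc;
            }
    }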
2.2.3 Maxpooling Layers
A maxpooling layer is an approach to sub-sampling and it does not have trainable parameters,
i.e., synaptic weights. It reduces the input dimensionality by extracting the maximum value
from a set of neighbouring inputs.

Figure 2.5: Illustration of a maxpooling layer.

For example, a maxpooling layer can be applied on a set
of input feature maps (Figure 2.5). For each individual feature map, it extracts the most
responsive neurons – the neurons with the highest output values – from the patches covered
by a sliding window. In this case, the number of output feature maps will be the same as the
number of input feature maps, but output feature maps are typically smaller on the X- and
Y-dimensions. During back-propagation, the error derivatives with respect to input neurons
unselected by the max operation in the forward pass are zero. For the selected input neurons,
their error derivatives are equal to the error derivatives with respect to their corresponding
output neurons.
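A minimal C++ sketch of 2×2, stride-2 maxpooling over one feature map (illustrative, assuming even H and W):

    #include <algorithm>

    // 2x2 non-overlapped maxpooling on a single H x W feature map.
    void maxpool2x2(const float* in, int H, int W, float* out) {
        for (int oy = 0; oy < H / 2; ++oy)
            for (int ox = 0; ox < W / 2; ++ox) {
                int y = 2 * oy, x = 2 * ox;
                out[oy * (W / 2) + ox] = std::max(
                    std::max(in[y * W + x],       in[y * W + x + 1]),
                    std::max(in[(y + 1) * W + x], in[(y + 1) * W + x + 1]));
            }
    }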
2.2.4 Local Response Normalization (LRN) Layers
In image recognition tasks, local response normalization layers are designed to normalize the
response of neurons at the same spatial location, but from different feature maps [24]. This is
loosely akin to averaging a neuron’s output with those of other neurons at the same location in
different feature maps. The computation can be formulated as follows:
$$b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^j_{x,y})^2 \Big)^{\beta} \qquad (2.6)$$
The terms $a^i_{x,y}$ and $b^i_{x,y}$ are the input and output neurons at location <x,y> of feature map i,
respectively. N refers to the total number of feature maps and n refers to the number of adjacent
feature maps to be considered. The constants k, n, α, and β are non-trainable parameters, also
known as hyper-parameters. Their values are selected via validation, where the best set of
values is chosen based on many trial runs. It is worth noting that the ordering of feature
maps is arbitrary and it is up to the network itself to adjust the feature extractors to adapt to
the normalizations. As pointed out in [24], this response normalization method bears some
similarity to the lateral inhibition behaviour found in biological neurons.
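To make Equation 2.6 concrete, a C++ sketch applying the normalization at one spatial location across all N feature maps (names are illustrative):

    #include <algorithm>
    #include <cmath>

    // LRN across feature maps at one <x,y> location: a[] holds the inputs
    // a^i for all N maps; k, alpha, beta and n are the hyper-parameters.
    void lrn_at_location(const float* a, float* b, int N,
                         float k, float alpha, float beta, int n) {
        for (int i = 0; i < N; ++i) {
            float sum = 0.0f;
            for (int j = std::max(0, i - n / 2); j <= std::min(N - 1, i + n / 2); ++j)
                sum += a[j] * a[j];                      // sum over adjacent maps
            b[i] = a[i] / std::pow(k + alpha * sum, beta);
        }
    }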
2.3 Benchmark Models
In this dissertation, we experiment with two image recognition benchmarks: a small model
designed to recognize the MNIST dataset [26] of handwritten digits and a more sophisticated
image classification model for the ImageNet dataset [30]. The MNIST dataset is useful for
validating some of our design ideas that will be discussed in Chapter 3. The ImageNet dataset
is more complex and reflects datasets used in real applications. Hence, this model would be
more representative when evaluating our overall hardware implementation.
2.3.1 A Toy Model for The MNIST Dataset
The MNIST dataset consists of 28 × 28 greyscale images of handwritten digits. The dataset
has a training set of 60,000 images and a test set of 10,000 images. The task is to classify each
image into one of the 10 digit classes from 0 to 9. For this dataset, we use a neural network
model that is similar to LeNet-5 proposed by Yann LeCun in 1998 [25].

Figure 2.6: The neural network architecture of the toy model for the MNIST dataset.

The model has two convolutional layers, each followed by a maxpooling layer, and one fully-connected layer used as
a classifier at the end of the network. The first convolutional layer applies 20 5 × 5 × 1 filters¹ on
the input image with a stride of 1, and creates 20 feature maps of size 24× 24. A maxpooling
layer downsamples the feature maps with a 2×2 non-overlapped (i.e., stride of 2) sliding window
¹Since the inputs are single-channel greyscale images, the third dimension of the 3D filter will have a size of 1.
to produce 20 12× 12 feature maps. In the next convolutional layer, the downsampled feature
maps are convolved with 40 independent 5 × 5 × 20 filters with a stride of 1, resulting in 40
8× 8 feature maps. The identical maxpooling layer is used again to downsample each of the 40
feature maps into a smaller size of 4× 4. Lastly, the fully-connected layer discards the spatial
information and takes in the feature maps as an input vector of 640 (4 × 4 × 40) neurons, and
produces 10 outputs: one corresponding to each digit class.
We use ReLU as the activation functions after the two convolutional layers and softmax
activation after the fully-connected layer. Softmax differs from the ReLU and sigmoid functions,
which map each input independently to a new value. Softmax takes all K neurons (K = 10 in
this case) and produces a probability value for each neuron, where the sum of all K probability
values equals 1 (Equation 2.7).
$$b_i = \frac{\exp(a_i)}{\sum_{j=1}^{K} \exp(a_j)} \qquad (2.7)$$
where $a_i$ is the output of neuron i before the softmax activation, and $b_i$ is the output of neuron
i after softmax is applied. In this case, a probability value is the estimated likelihood of an
input image belonging to the corresponding digit class.
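A C++ sketch of Equation 2.7 follows; the max-subtraction is a standard numerical-stability step that does not change the result and is not discussed in the text, and the names are illustrative.

    #include <cmath>

    // Softmax over K neuron outputs: b[i] = exp(a[i]) / sum_j exp(a[j]).
    void softmax(const float* a, float* b, int K) {
        float m = a[0];
        for (int i = 1; i < K; ++i) m = a[i] > m ? a[i] : m;   // max for stability
        float sum = 0.0f;
        for (int i = 0; i < K; ++i) { b[i] = std::exp(a[i] - m); sum += b[i]; }
        for (int i = 0; i < K; ++i) b[i] /= sum;               // normalize to sum 1
    }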
We trained the network for 100 epochs² and were able to obtain a validation accuracy of
99.34%. This accuracy serves as the baseline for our later evaluations.
2.3.2 AlexNet for The ImageNet Dataset
ImageNet [27] is a large image dataset organized primarily by the Stanford Vision Lab [4].
The dataset contains more than 15 million high-resolution images belonging to around 22,000
categories. The images are collected from the internet and labelled by humans using a crowd-
sourcing tool. This dataset has become an invaluable resource for computer vision and machine
learning researchers. The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [30]
uses a subset of ImageNet, containing ∼1.2 million training images and 50 thousand validation
images, with roughly even number of images for each of the 1000 selected classes. The competi-
tion has been held annually for six years and has become a standard benchmark for large-scale
image recognition. Significant progress has been made by researchers around the world. The
²One epoch consists of one full training cycle on the entire training data set.
top-5 error rate³ of the image classification task, for instance, has been substantially reduced
from 28.2% in 2010 to 3.6% in 2015 (Figure 2.7).
!"#!$!%#"$
&'#($
&&#)$
'#)$
*#'$
+#+$
%#+$
&+#+$
&%#+$
!+#+$
!%#+$
*+#+$
!+&+ !+&& !+&! !+&* !+&( !+&%
,-./%0122-203456
7642
Figure 2.7: ILSVRC image classification error rate from 2010 to 2015.
Most notably, in 2012, Krizhevsky et al. proposed a large convolutional neural network
(CNN) architecture and showed significant improvement upon previous approaches [24]. This
CNN architecture is commonly known as AlexNet and has been widely used as a reference
model in many research papers. Figure 2.8 illustrates the architecture of AlexNet.
Figure 2.8: An illustration of the AlexNet architecture (The figure is taken from [24]).
The model has 8 trainable layers with adjustable weights. The first five are convolutional
layers and the last three are fully-connected. The output layer is activated by a 1000-way
softmax that produces probability distribution over the 1000 classes. The first two convolutional
layers are followed by local response normalization layers. Three maxpooling layers are placed
after the two normalization layers and the fifth convolutional layer. All maxpooling layers use
a 3×3 pooling window with a stride of 2. To speed up the training, the network is split into two
partitions so that each partition can be trained on one GPU. Communication between the two
GPUs only happens at certain layers. As the figure shows, the third convolutional layer applies
³Top-5 error rate is the fraction of images whose correct labels are not matched by any of the top-5 predictions made by the model.
the filters on all feature maps in the second layer, and output neurons in fully-connected layers
are connected to all input neurons in the previous layers. The model has around 60 million
parameters and 650,000 neurons. It was trained on two NVIDIA GTX580 3GB GPUs for almost
six days.
In this research, we use a pre-trained AlexNet model provided by an open-source neural
network framework called Caffe [23]. This model has a minor variation from the original one
described above and achieved a top-1 accuracy of 57.4% and a top-5 accuracy of 80.4% [1].
2.4 Key Operation - Multiply and Accumulate (MAC)
As pointed out in [22], the most compute-intensive layers in typical neural networks are the
fully-connected and convolutional layers.
Figure 2.9: Computation time distribution of individual layers, on both GPUs and CPUs for the forward pass. This figure is taken from [22].
Figure 2.9 shows the run-time breakdown of the forward pass (inference) of AlexNet
on both GPUs and CPUs. Labels prefixed with fc and conv are the fully-connected and
convolutional layers. These two types of layers combined take 95% and 89% of time on GPUs
and CPUs, respectively. In both layers, most of the work for an output neuron is to compute the
dot-product between its connected input neurons and the associated weights. The underlying
operation is essentially multiply-accumulate (MAC).
Chapter 3
Using Low-Precision Fixed-point in
Neural Networks
Neural network computations are typically performed using 32- or 64-bit floating-point because
of their ease-of-use on general processors (CPUs or GPUs). However, in custom hardware,
we have the ability to use a more cost-efficient fixed-point representation and even tailor the
numerical precision to the minimum required for inference of the desired accuracy. We propose
to use heterogeneous fixed-point representations during neural network inference, in the hope
of maximizing the computational throughput of our custom hardware and also raising energy
efficiency.
In this chapter, we begin by reviewing the trade-offs between floating-point and fixed-point
representations in Section 3.1. Section 3.2 illustrates the positive impact of reducing fixed-
point precision on hardware performance, area and power cost. We then present the concept of
heterogeneous fixed-point representation in Section 3.3. In Section 3.4, we introduce a software
framework that was developed to experiment with low-precision fixed-point in neural networks.
In Section 3.5, our experiments show that neural network computations can be carried out
using fixed-point arithmetic and that the reduced bit-width heterogeneous fixed-point format can be
used in neural network inference with minimal loss of accuracy. Lastly, we summarize
several related works in Section 3.6.
3.1 Floating-Point versus Fixed-Point
In computing, real-valued numbers are represented in two main categories of data types,
floating-point and fixed-point. The floating-point representation consists of three components,
a sign bit (S), exponent (E) and mantissa (M). For example, the single-precision floating-
point (SPFP) representation defined in IEEE standard 754 has 1 sign bit, 8 exponent bits
and 23 mantissa bits. The value of a floating-point number is formulated as (−1)S × (1 +
M × 2−23)× 2E−127. The exponent part allows the floating-point format to represent a wide
range of magnitudes, from 2−127 to 2128 in the case of SPFP, while the mantissa limits the
relative error by keeping the resolution at the most significant bits.
Fixed-point representation is essentially an integer with a fixed radix point position that de-
termines the scaling factor. We refer to the bits on the left and right of the radix point as decimal
bits (D) and fraction bits (F), respectively. Given a [D.F] fixed-point representation, its
precision is limited by the smallest representable magnitude $2^{-F}$, and the representable values
are bound to the range $[-2^{D-1},\ 2^{D-1} - 2^{-F}]$.
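To make the [D.F] interpretation concrete, a small C++ sketch (helper names are ours, for illustration):

    #include <cmath>
    #include <cstdint>

    // A [D.F] fixed-point value is a signed integer with an implied scale of
    // 2^-F. With D = 3, F = 5 (8 bits total), the raw integer 77 (binary
    // 010.01101) represents 77 * 2^-5 = 2.40625.
    float fixpt_to_float(int32_t raw, int F) {
        return static_cast<float>(raw) * std::ldexp(1.0f, -F);   // raw * 2^-F
    }
    int32_t float_to_fixpt(float v, int F) {
        return static_cast<int32_t>(std::lround(v * std::ldexp(1.0f, F)));  // round to nearest
    }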
In scientific computing, floating-point is more commonly used because of its ease-of-development
in most programming environments: 1) Floating-point can represent a wider range of values,
and hence programmers generally do not need to worry about overflow or underflow scenarios,
which are more likely when fixed-point is used; 2) Floating-point also provides better precision
than fixed-point in general; 3) Mathematical functions, such as exponential and power, are bet-
ter supported for floating-point by software libraries; for instance, the C math library math.h
supports only single- and double-precision floating-point data types.
While floating-point provides far better ease-of-use for software developers, the computational
complexity of fixed-point arithmetic is considerably less than that of floating-point. Fixed-point
arithmetic operators are generally faster and more area- and power-efficient. Given that we are
designing custom hardware, we have the opportunity to use custom fixed-point arithmetic in
order to maximize the hardware throughput, while maintaining a low power budget.
3.2 Impact of Bit-width on Hardware Operator
Fixed-point arithmetic is mostly the same as integer arithmetic with additional conversion
operations. The hardware cost of integer arithmetic operators, specifically multiplication and
addition, is tightly correlated with the data bit-width.
Modern FPGA architectures contain pre-fabricated DSP blocks for implementing multipliers
and adders in a wide range of precisions [5]. Using the Altera Stratix V FPGA as an example,
each DSP block can implement three independent 9-bit multiplies, a sum of two 18-bit multiplies
or one independent 27-bit multiply. These variable-precision DSP blocks can also be cascaded
to support higher precision modes, e.g., two DSP blocks can be chained to implement an
independent 36-bit multiply operation¹.
To illustrate the impact of operator bit-width on hardware performance, area and power
cost, we implement a dot-product operator with different bit-widths on the Altera Stratix V
device. Figure 3.1 shows the structure of the dot-product operator, consisting of 8 multipliers
and 7 adders. Inputs and outputs of the dot-product circuit are registered.

Figure 3.1: Schematic of a 16-input dot product operator.

All multiplier
and adder outputs are two times wider than the multiplier inputs. We placed-and-routed the
circuit using Altera’s Quartus II synthesis tool and report the resource usage, clock frequency
and estimated power consumption in Figure 3.2. The power consumption is estimated in vectorless
mode with a 12.5% input toggle rate² and a 100 MHz clock frequency. In the experiment, the DSP
blocks are used in three configurations: 1) when the input width is less than or equal to 18 bits,
the DSPs are used in “the sum of two 18-bit multiplies” mode and hence 4 DSPs are enough
to implement 8 multipliers; 2) when the input width ranges from 19 bits to 27 bits, the DSPs
are used in “one independent 27-bit multiply” mode, resulting in 1 DSP per multiply; and 3)
when the input width goes beyond 27 bits to 32 bits, two DSPs are cascaded to form an up-to
36-bit multiply. Combinational ALUT usage and estimated power consumption both increase
as the input gets wider, and have steeper increases when more DSPs are required. The circuit
Fmax degrades in general as bit-width increases and falls rapidly when more DSPs are used.

Figure 3.2: The impact of input bit-width on the resource usage, Fmax and power of a dot-product operator.

Based on these results, we can see that minimizing the operation bit-width can largely improve
the hardware efficiency in terms of performance, area and power. This motivates us to explore
the possibility of using lower-precision fixed-point arithmetic.

¹Consider, for example, that a 36-bit multiply can be realized using four 18-bit multiplies, and each DSP block can implement two independent 18-bit multiplies.
²The power consumption results gathered from the vectorless estimation may not be accurate. However, the estimated results can accurately reflect the relative difference in power consumption when different input bit-widths are used by the dot-product operator.
3.3 Low-Precision Fixed-point
3.3.1 Heterogeneous Fixed-Point Representation
Overflow and insufficient precision are the major challenges to using a fixed-point representation.
Overflow can be avoided by allocating enough decimal width (west of the decimal point) such
that the fixed-point format covers the representing data’s value range. It is harder to analyze
the sufficiency of precision, but in general it is always better to use a wider fixed-point format.
In Section 3.5.2, our experiments show that different parts of a neural network can have very
diverse value ranges, e.g., one layer’s neuron values can range from −34.8 to 16.9, while another
layer’s bias only ranges within $\pm 3.2 \times 10^{-5}$.
In scenarios with diverse precision requirements, it is wasteful to use a uniform fixed-point
format that prevents overflow and provides sufficient precision throughout the entire neural
network. Doing so would imply that many leading decimal bits of a small-value-range data
will be constant 0, while the provided precision may be “overkill” for large-value-range data.
Therefore, we propose to use a heterogeneous fixed-point representation, where each part of a
neural network can have its own decimal width and fraction width. It is worth noting that we
permit the custom decimal width to be a negative value for variables with small value ranges.
The decimal width needed to avoid overflow can be formulated as $\lceil \log_2(MaxMagnitude) \rceil + 1$.
For example, for variable ranges within (-0.25, 0.25), the required decimal width would be -1
since the first bit on the right of the radix point does not carry additional information.
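A one-line C++ sketch of this rule (illustrative):

    #include <cmath>

    // D = ceil(log2(max_magnitude)) + 1; the +1 accounts for the sign bit.
    // For a range within (-0.25, 0.25): ceil(log2(0.25)) + 1 = -2 + 1 = -1,
    // matching the example in the text.
    int required_decimal_width(double max_magnitude) {
        return static_cast<int>(std::ceil(std::log2(max_magnitude))) + 1;
    }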
3.3.2 Heterogeneous Fixed-Point Arithmetic
Heterogeneous fixed-point arithmetic is a generalized form of uniform fixed-point arithmetic.
We use the dot-product operation as an example to illustrate the design considerations in
heterogeneous fixed-point arithmetic implementation.
Given a dot-product operation $y = \vec{A} \cdot \vec{B}$ and the corresponding fixed-point formats
$[D_y.F_y]$, $[D_A.F_A]$ and $[D_B.F_B]$, the operation can be done in the following steps:
1. For each pair of elements (a, b) in $\vec{A}$ and $\vec{B}$,
(a) Perform integer multiplication of $a \times b$ to obtain a temporary product t in the format $[D_t.F_t]$, where $D_t = D_a + D_b$ and $F_t = F_a + F_b$;
(b) Accumulate temporary product t to the sum s, whose fixed-point format is $[D_s.F_s]$.
2. Convert the sum s from format $[D_s.F_s]$ to format $[D_y.F_y]$.
At the accumulation step (1.b), the accumulator s’s fraction width should be as wide as $F_t$ to
preserve the best possible precision, while the decimal width needs to be $D_t + \lceil \log_2(InputVectorLength) \rceil$
to prevent overflow completely. In our following experiment, the accumulator uses the same
fixed-point format as the temporary product, $[D_t.F_t]$. In this case, potential overflow could
happen. We discuss how we handle such cases and the last conversion step in the following
subsection.
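A C++ sketch of these steps, with illustrative names and types (overflow protection and the rounding schemes of Section 3.3.3 are omitted here):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Heterogeneous fixed-point dot product. a and b hold raw integers in
    // [D_A.F_A] and [D_B.F_B] formats; products and the accumulator use
    // [D_t.F_t] with F_t = F_A + F_B; the final shift converts to [D_y.F_y].
    int32_t fixpt_dot(const std::vector<int16_t>& a, const std::vector<int16_t>& b,
                      int F_A, int F_B, int F_y) {
        int32_t s = 0;
        for (std::size_t k = 0; k < a.size(); ++k)
            s += static_cast<int32_t>(a[k]) * b[k];  // product in [D_t.F_t]
        int shift = (F_A + F_B) - F_y;               // convert the sum to [D_y.F_y]
        return shift >= 0 ? (s >> shift)             // narrower fraction: truncate
                          : (s << -shift);           // wider fraction: pad zeros
    }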
3.3.3 Conversion Between Fixed-Point Formats
As mentioned above, fixed-point arithmetic requires conversion operations in addition to the
underlying integer arithmetic. When converting a fixed-point number to another format with
wider decimal width or fraction width, the conversion can be done by padding zeros before
or after the original number. When converting to a narrower decimal width format, potential
overflow can occur. To minimize conversion error due to overflow, overflow protection can
be implemented by setting the value to the maximum (or minimum if the original value is
negative) representable value of the target format.
Furthermore, truncation or rounding is needed when converting to a narrower fraction width.
Truncation simply discards the additional fraction bits and does not require extra operations or
hardware cost. Round-to-nearest converts the original value to its closest representable value
of the target format, resulting in less precision loss than truncation.

Figure 3.3: Example of converting a fixed-point number from [3.5] format to [3.2] format using round-to-nearest and stochastic rounding schemes.

Round-to-nearest can
be carried out by adding a bit 1 at the most significant position of the trailing fraction bits,
as shown in Figure 3.3. Stochastic rounding is another rounding method that can effectively
reduce precision loss. It probabilistically rounds the original value to the representable value
just above or just below the original value. The decision to round up or down is based on rolling
a weighted die, weighted by the distance from the original value to the two neighbouring values.
Concretely, given the above and below closest representable values u and l, the probability of
rounding the original value o to u is P = (o − l)/(u − l), and the probability of rounding
down to l is 1− P . Statistically speaking, stochastic rounding preserves the most precision as
the expected value of the rounded number equals the original value. Stochastic rounding
can be implemented by truncating the sum of the original number and a random value, whose
bit-width equals the number of trailing fraction bits. This method will require a random
number generator and an adder in hardware.
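The three conversion schemes can be sketched in C++ as follows; this is illustrative, and std::rand() merely stands in for the hardware random number generator.

    #include <cstdint>
    #include <cstdlib>

    // Dropping `drop` trailing fraction bits (e.g., [3.5] -> [3.2] drops 3
    // bits, as in Figure 3.3) under each conversion scheme.
    int32_t truncate_bits(int32_t v, int drop)    { return v >> drop; }
    int32_t round_nearest(int32_t v, int drop)    { return (v + (1 << (drop - 1))) >> drop; }
    int32_t round_stochastic(int32_t v, int drop) {
        int32_t r = std::rand() & ((1 << drop) - 1);  // uniform value over the dropped bits
        return (v + r) >> drop;                       // truncate the randomized sum
    }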
In the following experiments, we use overflow protection and truncation for fixed-point
format conversion.
3.4 Software Framework
A software framework was built to facilitate experiments that use heterogeneous fixed-point
in neural network computations. The framework supports both inference and training of any
feed-forward neural network architecture consisting of the layers described in Chapter 2. Users
can specify the neural network model in a standalone configuration file, which is read in as input
by the software at run-time. Computations can selectively be carried out in floating-point
or heterogeneous fixed-point representation.
3.4.1 Object-Oriented Architecture
The software is written in C++ in an object-oriented style. Figure 3.4 shows several key classes
of the design and their relationships in a UML diagram.
Figure 3.4: Key components of the object-oriented software architecture.
At the lowest level, a Matrix class implements N-dimensional arrays of fixed-point or
floating-point data. The underlying container is a one-dimensional array of integers or floating-
point numbers. In the case of fixed-point, all elements in the matrix share the same format,
i.e., decimal width and fraction width.
In this design, the Layer class encapsulates a layer of neurons. It contains two Matrix
objects, one for the layer’s neuron values (named state) and the other one for the corresponding
error derivatives (named deriv). During the forward and backward pass, two member methods,
ApplyActivation and ApplyDerivOfActivation, are respectively invoked to apply activations
(e.g., ReLU) to state or apply the derivative of the activation function to deriv.
Edge is an abstract class that connects two adjacent layers of neurons in the network. It
contains two pointers pointing to the source (src) and destination (dst) Layers. Two vir-
tual methods corresponding to the forward propagation (FP) and backward propagation (BP)
must be implemented by its subclasses. For example, the maxpooling edge's FP implementation
downsamples the src layer and assigns values to the neurons in the dst layer.
EdgeWithWeights is also an abstract class, which itself is a subclass of Edge. It is used as
the base class for the Edges with trainable parameters, such as the fully-connected (FC) and
convolutional (Conv) edges. Four Matrix objects are used for storing the weights, bias and their
corresponding gradients. A member method called UpdateWeightsAndBias is implemented to
update the parameters based on the computed gradients from back propagation.
Lastly, the NN class is designed to construct the network based on the user-provided con-
figuration file and to orchestrate the neural network computations. During the forward pass,
this class goes through a series of alternating Layers and Edges, from the input layer to the
output layer, by calling the ApplyActivation and FP functions, respectively. In the backward
pass, this class invokes the Layers’ ApplyDerivOfActivation and Edges’ BP in reverse order,
and calls the UpdateWeightsAndBias method to update parameters.
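A minimal C++ sketch of this orchestration, using the class names from Figure 3.4 (we use
pointers and stub declarations for brevity; the framework's actual container types and
signatures may differ):

#include <cstddef>
#include <vector>

// Stub declarations mirroring Figure 3.4 (simplified).
struct Layer {
    void ApplyActivation() { /* apply e.g. ReLU to state */ }
    void ApplyDerivOfActivation() { /* apply activation derivative to deriv */ }
};
struct Edge {
    virtual ~Edge() = default;
    virtual void FP() = 0;  // forward propagation
    virtual void BP() = 0;  // backward propagation
};
struct EdgeWithWeights : Edge {
    void UpdateWeightsAndBias() { /* apply the computed gradients */ }
};

struct NN {
    std::vector<Layer *> layers;  // layers[i] feeds edges[i]
    std::vector<Edge *> edges;    // edges[i] feeds layers[i + 1]

    void ForwardPass() {
        // Alternate Layer activations and Edge forward propagations,
        // from the input layer to the output layer.
        for (std::size_t i = 0; i < edges.size(); i++) {
            layers[i]->ApplyActivation();
            edges[i]->FP();
        }
        layers.back()->ApplyActivation();
    }

    void BackwardPass() {
        // Walk the network in reverse, back-propagating the error and
        // updating parameters on edges that have them.
        for (std::size_t i = edges.size(); i-- > 0; ) {
            layers[i + 1]->ApplyDerivOfActivation();
            edges[i]->BP();
            if (EdgeWithWeights *e = dynamic_cast<EdgeWithWeights *>(edges[i]))
                e->UpdateWeightsAndBias();
        }
    }
};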
3.4.2 Model Configuration File
The model configuration file is designed using Google protocol buffers [3], a language-neutral and
platform-neutral mechanism for serializing structured data. Our implementation of the model
configuration file leverages a subset of protocol buffer features, including the human-readable
text format, auto-generated source code for serializing and parsing between the configuration
file and in-memory data objects, and auto-generated data-access classes that are easy to use
programmatically.
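For instance, parsing the text-format file into the in-memory objects reduces to a few calls
against the generated classes. A sketch follows; NetworkConfig and the generated header name
are hypothetical stand-ins, not necessarily the names used in the framework:

#include <fstream>
#include <sstream>
#include <string>
#include <google/protobuf/text_format.h>
#include "network_config.pb.h"  // hypothetical auto-generated header

// Read the human-readable configuration file and parse it into the
// auto-generated data-access object.
bool LoadModelConfig(const std::string &path, NetworkConfig *config) {
    std::ifstream in(path);
    std::stringstream buffer;
    buffer << in.rdbuf();  // slurp the whole text-format file
    return google::protobuf::TextFormat::ParseFromString(buffer.str(), config);
}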
Figure 3.5 shows a snippet of the configuration file for the last three layers of the MNIST
model described in Section 2.3.1. In the configuration file, a layer structure specifies the name,
dimensions, activation function type, and data representation for the layer of neurons. The
data representation is described by a structure with fields indicating whether a fixed-point
representation should be used and also the fixed-point format.
!"#!$ %
&'(!) *+,&-.!"#!./*
01+.&'(!) *('23,,45&#.4'6!1.7*
"!08.&'(!) *+,&-.4'6!1./*
!"#!.863!) 9:;<:=>?@:;A=
B548!1.1,C0) D
B548!1.+,40) D
0815"!) 7
C!5#E80."'8'.1!3 %$FFF$G
H5'0."'8'.1!3 %$FFF$G
G
!"#!$ %
&'(!) *('23,,45&#.!"#!./*
01+.&'(!) *+,&-.4'6!1./*
"!08.&'(!) *('23,,45&#.4'6!1./*
!"#!.863!) IAJK::=
B548!1.1,C0) /
B548!1.+,40) /
0815"!) 7
G
!"#!$ %
&'(!) *,L83L8.B+.!"#!*
01+.&'(!) *('23,,45&#.4'6!1./*
"!08.&'(!) *,L83L8.4'6!1*
!"#!.863!) M9
C!5#E80."'8'.1!3 %
50.B5238) 81L!
H58.C5"8E) N
B1'+85,&.C5"8E) N
G
H5'0."'8'.1!3 %$FFF$G
G
4'6!1$ %
&'(!) *+,&-.4'6!1./*
&L(.+E'&&!40) OP
&L(.5('#!.1,C0) N
&L(.5('#!.+,40) N
'+85-'85,&) QR9?@M@RS.=@;RAQ
08'8!."'8'.1!3 %
50.B5238) 81L!
H58.C5"8E) N
B1'+85,&.C5"8E) 7
G
G
4'6!1$ %
&'(!) *('23,,45&#.4'6!1./*
&L(.+E'&&!40) OP
&L(.5('#!.1,C0) O
&L(.5('#!.+,40) O
'+85-'85,&) QR9?@M@RS.=@;RAQ
08'8!."'8'.1!3 %
50.B5238) 81L!
H58.C5"8E) N
B1'+85,&.C5"8E) /
G
G
4'6!1$ %
&'(!) *,L83L8.4'6!1*
&L(.+E'&&!40) 7
&L(.5('#!.1,C0) 7
&L(.5('#!.+,40)$ 7P
'+85-'85,&) T:M?IAJ
08'8!."'8'.1!3 %
50.B5238) 81L!
H58.C5"8E) N
B1'+85,&.C5"8E) 7
G
G
4'6!1$ %
&'(!) *('23,,45&#.4'6!1.7*
&L(.+E'&&!40)$ /P
&L(.5('#!.1,C0)$ 7/
&L(.5('#!.+,40)$ 7/
'+85-'85,&) QR9?@M@RS.=@;RAQ
08'8!."'8'.1!3 %
50.B5238) 81L!
H58.C5"8E) N
B1'+85,&.C5"8E) O
G
G
Figure 3.5: A snippet of a model configuration file.
For example, the layer named ‘maxpooling layer 1’, which is the output of the first maxpooling
edge in the MNIST model (not shown in the snippet), has 20 12×12 feature maps that are
represented in 8-bit fixed-point with the 4 least-significant bits allocated for fraction bits. An
edge structure specifies the
names of its source and destination layers, the edge type, additional parameters associated with
the specific edge type (e.g., the filter size and stride for convolutional and maxpooling edges),
and also the data representations to be used for weights and bias (if the edge contains trainable
parameters). For instance, the second convolutional layer in the MNIST model convolves its
input feature maps with 40 5×5×20 filters with a stride of 1. The edge named ‘conv edge 2’
in the configuration file describes such computation. The source and destination of the edge
are the ‘maxpooling layer 1’ and ‘conv layer 2’ layers. Since the depth of each filter (i.e., 20)
always equals the number of input feature maps, and the number of filters (i.e., 40) is implied
by the number of output feature maps, we can omit this information from the edge structure,
as it is already specified in the source and destination layer structures.
3.5 Experimental Study
3.5.1 Uniform Fixed-Point in Neural Network Training
We start by using uniform fixed-point representation to train the MNIST model described in
Section 2.3.1. All neurons’ states and derivatives, as well as all parameter values and gradients
are represented in a uniform fixed-point format with the same decimal width and fraction width.
All multiply and add operations are performed in fixed-point arithmetic. The only non-linear
function used in the MNIST model is the softmax function. It is performed in floating-point
format, where fixed-point inputs and outputs are converted to/from floating-point before and
after the calculation.
In the experiment, we fix the total bit-width at 32 bits and vary the radix point position across
a range of configurations. Figure 3.6 reports the best validation accuracy achieved by each
fixed-point configuration during 100 epochs of training.

[Figure 3.6 is a bar chart of the best validation accuracy for each 32-bit fixed-point format,
from [4.28] to [16.16], with a horizontal line marking the 98.98% floating-point baseline.]

Figure 3.6: Training of MNIST model using 32-bit fixed-point representations.

The red line corresponds to the baseline
accuracy achieved by using single-precision floating-point. Both the [6.26] and [8.24] fixed-point
formats achieve the highest accuracy at 98.94%, which is very close to the 98.98% accuracy
achieved using floating-point. In general, the model achieves better validation accuracy when
more precision is preserved by allocating more bits to the fraction part. However, when the
fraction width is increased to 28 bits, the model stops converging to a higher accuracy.
Most likely this is the result of frequent overflow caused by the small decimal width of 4 bits.
At the other end, the model trained with the [16.16] fixed-point format does not achieve accuracy
any higher than 85.22%, due to insufficient precision and the resulting underflow.
3.5.2 Value Range Profiling in Floating-Point Neural Network Training
To better understand the reason behind the lower accuracy obtained when using narrower
decimal width or fraction width, we examine each matrix’s value range during the training
with a floating-point representation. In Table 3.1, we report the maximum and minimum
Table 3.1: The value range of each matrix during training using floating-point.

Layer/Edge     Matrix             Max. Value   Min. Value   Decimal Width Needed
input layer    state              1            0            1
               deriv              3.28e+00     -2.89e+00    3
conv 1 layer   state              5.23e+00     -3.44e+00    4
               deriv              3.91e-01     -4.46e-01    0
conv 2 layer   state              1.69e+01     -3.48e+01    7
               deriv              5.31e-01     -5.34e-01    1
output layer   state              3.46e+01     -3.29e+01    7
               deriv              9.99e-01     -1.00e+00    1
conv 1 edge    weights value      7.72e-01     -4.98e-01    1
               weights gradient   1.12e-01     -1.28e-02    -2
               bias value         3.20e-05     -3.20e-05    -13
               bias gradient      5.20e-05     -6.30e-05    -12
conv 2 edge    weights value      2.32e-01     -2.26e-01    -1
               weights gradient   2.47e-01     -1.31e-01    -1
               bias value         6.10e-05     -6.20e-05    -12
               bias gradient      2.30e-05     -2.00e-05    -14
fc edge        weights value      3.25e-01     -4.32e-01    0
               weights gradient   2.22e-01     -9.51e-02    -1
               bias value         1.59e-04     -1.69e-04    -11
               bias gradient      2.70e-05     -2.90e-05    -14
values that have ever appeared in each matrix during the 100-th training epoch (not including
the first 99 training epochs). The last column reports the minimum decimal width required
to prevent overflow if fixed-point representation is used. The results of maxpooling layers are
skipped because their value ranges are the same as their previous convolutional layers. From the
table, we observe a wide swing in the required decimal width across matrices. The conv 2 layer
and output layer require the decimal width to be as wide as 7 bits. This confirms that the reason
for lower accuracy when using [4.28] format is indeed due to overflow caused by small decimal
width. On the other hand, the bias value and gradient matrices tend to have very small
value ranges, requiring less than −12 bits of decimal width. Moreover, a parameter is updated
during training by incrementing itself with the term Gradient × LearningRate. The Gradient
is initially larger (than the values reported here at the 100-th epoch) and diminishes as
learning proceeds, since the model gets closer to convergence. The LearningRate is set to 10^-3
in our experiment. This means that, when the [16.16] fixed-point format is used, the term
Gradient × LearningRate will underflow if the gradient has a magnitude smaller than
2^-16/10^-3 ≈ 1.526 × 10^-2, and there will be no update to the parameter. This explains
why the accuracy does not converge any higher than 85.22% when the [16.16] fixed-point
format is used.
3.5.3 Heterogeneous Fixed-point in Neural Network Inference
The above experiments show that it is feasible to train a neural network in fixed-point repre-
sentation if a proper configuration is used. However, it is not trivial to find a good fixed-point
configuration without doing any test runs, e.g., training with floating-point representation and
profiling the value ranges. This could lead to practical difficulty when using fixed-point in
neural network training. On the other hand, neural network inference can take advantage of
knowing the value ranges of a pre-trained model and customizing the fixed-point representation
with reduced bit-width. We therefore propose to tailor the decimal width based on the profiled
value range and analyze the accuracy change while reducing the total bit-width by shrinking
the fraction part of the data representation. The experiment is conducted on both uniform
and heterogeneous fixed-point configurations. The decimal width of the uniform fixed-point
representation is set to 7 bits, the widest decimal width reported above. For the hetero-
geneous fixed-point representation, the decimal width is tailored for each matrix based on its
value range. It is worth noting that the value ranges are profiled during the training process,
where the training dataset is used; these custom fixed-point configurations are therefore not
specialized to the validation dataset that is used in this inference experiment.
Figure 3.7 shows the validation accuracy at each bit-width setting. At the left end of the
figure, where the bit-width is as wide as 24 bits, both representations achieve 98.98%
validation accuracy, the same as that obtained using the floating-point representation. For
uniform fixed-point representation, the accuracy does not decrease by more than 0.02% until
the bit-width is reduced to 13 bits. Then the accuracy starts to drop significantly as the bit-
width is further reduced. For the heterogeneous representation, the bit-width can be reduced
further without causing much accuracy degradation. Even when the bit-width is reduced to 8 bits, the
validation accuracy is only 0.21% worse than the floating-point baseline.
!" !# !! !$ !% $& $' $( $) $* $" $# $! $$ $% & ' (
+,-./01 &'2&'3 &'2&'3 &'2&'3 &'2&(3 &'2&(3 &'2&(3 &'2&(3 &'2&)3 &'2&)3 &'2&'3 &'2&(3 &'2'&3 &'2*(3 &$2"&3 #$2!$3 &2'%3 &2'%3 &2'%3
45650/75,5/89 &'2&'3 &'2&'3 &'2&'3 &'2&'3 &'2&'3 &'2&'3 &'2&'3 &'2&'3 &'2&'3 &'2&(3 &'2&(3 &'2&)3 &'2&*3 &'2&(3 &&2%$3 &'2&*3 &'2((3 $"2)'3
&'2*%3
&'2)%3
&'2(%3
&'2'%3
&'2&%3
&&2%%3
&&2$%3
:;<-=;6-/,>?@@80;@A
B-6CD-=6E
+,-./01 45650/75,5/89
Figure 3.7: Inference of MNIST model using uniform and heterogeneous fixed-point represen-tations.
3.5.4 Heterogeneous Fixed-point in AlexNet Inference
Given that the above data show that using the heterogeneous representation can effectively
reduce the data bit-width, we applied the same approach to the AlexNet model described in Section 2.3.2.
Table 3.2 shows the matrix value range profiled from the pre-trained AlexNet. Based on
the profiled result, we customize each matrix’s decimal width to the minimum width required
to avoid overflow. We then perform inference of AlexNet using three bit-width settings. The
first two settings use 32-bit and 16-bit fixed-point representations for all matrices. In the
third setting, 16-bit fixed-point is used for neuron matrices (layer), while 8-bit fixed-point
is used for parameter matrices (edge). This setting is based on the fact that most neuron
matrices require more than 8 decimal bits, while none of the parameter matrices requires more
than 2 decimal bits. The inference accuracy achieved with each of the heterogeneous fixed-
point settings is reported in Table 3.3. Inference with 32-bit and 16-bit heterogeneous fixed-
point achieve similar results, corresponding to 1.7% and 1.2% degradations in top-1 and top-
5 accuracy versus floating-point, respectively. Using the 8-bit parameter and 16-bit neuron
setting, top-1 and top-5 accuracy are 3.2% and 2.7% worse than that of using floating-point.
Broadly speaking, the results presented in this chapter are encouraging in that they show
that fixed point representations can be used for neural network inference without significant
accuracy loss.
Table 3.2: Value range profiling of AlexNet.

Layer       Range Magnitude   Decimal Width Needed
input       1.61e+02          9
conv 1 1    3.09e+03          13
conv 1 2    2.17e+03          13
lrn 1 1     1.39e+02          9
lrn 1 2     1.39e+02          9
conv 2 1    7.30e+02          11
conv 2 2    6.18e+02          11
lrn 2 1     1.39e+02          9
lrn 2 2     1.39e+02          9
conv 3 1    3.75e+02          10
conv 3 2    4.44e+02          10
conv 4 1    2.53e+02          9
conv 4 2    3.54e+02          10
conv 5 1    2.24e+02          9
conv 5 2    2.85e+02          10
fc 6        1.31e+02          9
fc 7        1.65e+01          6
fc 8        1.11e+01          5

Edge        Matrix    Range Magnitude   Decimal Width Needed
conv 1 1    weights   4.03e-01          0
            bias      8.05e-01          1
conv 1 2    weights   3.71e-01          0
            bias      8.54e-01          1
conv 2 1    weights   3.79e-01          0
            bias      1.03e+00          2
conv 2 2    weights   4.16e-01          0
            bias      1.03e+00          2
conv 3 1    weights   3.22e-01          0
            bias      9.05e-02          -2
conv 3 2    weights   5.12e-01          1
            bias      1.04e-01          -2
conv 4 1    weights   3.22e-01          0
            bias      1.14e+00          2
conv 4 2    weights   3.53e-01          0
            bias      1.22e+00          2
conv 5 1    weights   2.54e-01          0
            bias      1.50e+00          2
conv 5 2    weights   3.15e-01          0
            bias      1.74e+00          2
fc 6        weights   4.84e-02          -3
            bias      1.06e+00          2
fc 7        weights   5.21e-02          -3
            bias      1.26e+00          2
fc 8        weights   6.74e-02          -2
            bias      4.25e-01          0
3.6 Related Work
Several approaches have been proposed recently to use low-precision data representations in
neural networks. The work highlighted here has been done concurrently with the research work
in this thesis.
In [19], Gupta et al. propose to train neural networks in uniform fixed-point with the use
of stochastic rounding. Their experiments show that the convolutional neural networks trained
with [2.14] or [4.12] 16-bit uniform fixed-point format do not converge when the round-to-nearest
scheme is used. Conversely, models trained with stochastic rounding can achieve test errors
that are very close to the floating-point baseline. For the CIFAR10 dataset4, the authors also
find that the model trained with the [4.12] fixed-point format stops learning at some point in the
training process due to the diminishing gradients. This aligns with our observation in Section 3.5.2,
where training with insufficient precision causes underflow in parameter updates and leads to
4CIFAR10 is a collection of 32×32 colour images categorized into 10 classes. There are 50,000 training images
and 10,000 test images.
Table 3.3: Inference Accuracy of AlexNet With Different Heterogeneous Fixed-point Representations.

Data Representation    Top-1 Accuracy   Top-5 Accuracy
floating-point         57.40%           80.40%
32-bit fixed-point     55.73%           79.19%
16-bit fixed-point     55.75%           79.18%
8/16-bit fixed-point   54.25%           77.69%
poor model convergence. To overcome this problem, they increase the fraction width by 4
bits, to the [4.16] format, after 100 epochs of training to keep the learning going. The test error
can then be further reduced, converging near the floating-point baseline.
In [18], Courbariaux et al. compare three types of data representations to be used in
neural network training. In addition to the floating-point and uniform fixed-point formats, they
also propose to use a dynamic fixed-point format, which can be thought of as a dynamically-adjustable
form of heterogeneous fixed-point. Besides customizing each matrix’s format, dynamic fixed-
point allows one to update the decimal width during the training process. They keep track of
the overflow rate for each matrix and periodically compare the overflow rate to an overflow
threshold. The decimal width is then updated under two conditions: a) if a matrix’s
overflow rate is higher than the threshold, its decimal width is incremented by 1 to reduce
overflow; b) if the overflow rate is lower than half of the threshold, the decimal width is decremented
by 1 to gain more precision. To work around the above-mentioned diminishing gradient
problem, they use a higher-precision format for parameters during the updates than during the
forward- and backward-propagations so that the small parameter changes can be accumulated.
They are able to achieve validation accuracy that is very close to the floating point baseline
using 20-bit uniform fixed-point or 10-bit dynamic fixed-point (12-bit for parameters during
update).
Courbariaux et al. further explore the possibility of using a more restricted form of data
representation to enable more efficient hardware design. In [17], they introduce a technique
called BinaryConnect that stochastically binarizes the real-valued weights to either +1 or
−1 during the forward propagation. The real-valued weights are limited to the range
from −1 to +1, and the probability of rounding a weight to +1 is proportional to how “close” the
weight is to +1. By doing so, each multiplication between a neuron and a binarized weight in
the forward pass can be converted into a conditional sign-inversion operation. The authors have
successfully trained DNNs with BinaryConnect on the MNIST, CIFAR10 and SVHN5 datasets
and achieved nearly state-of-the-art accuracy. We agree with the authors that
such a method can greatly speed up neural network computation, as well as improve area and
energy efficiency, especially for a specialized hardware implementation; a sketch of the resulting
multiply-free accumulation follows below. However, the datasets used by the authors are small
compared to large-scale datasets, such as ImageNet. Thus, it is still unknown whether the
BinaryConnect approach is applicable to more sophisticated models (e.g., AlexNet).
5SVHN is a colour image dataset of street view house numbers.
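To make the hardware appeal concrete, the multiply-free accumulation can be sketched in C++
as follows (a simplification of BinaryConnect’s forward pass, with our own names and types):

#include <cstdint>

// With weights constrained to +1/-1, each multiply-accumulate reduces to a
// conditional sign inversion followed by an add; no multiplier is needed.
int32_t binary_dot(const int16_t *x, const int8_t *w_bin, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (w_bin[i] > 0) ? x[i] : -x[i];  // +1 adds, -1 subtracts
    return acc;
}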
Chapter 4
System Architecture
This chapter introduces the system architecture developed for accelerating the inference com-
putation of neural networks in an embedded environment. Section 4.1 discusses the design
considerations for the accelerator on an FPGA and proposes our accelerator design. Section 4.2
describes the overall system, structured as a processor-accelerator SoC system. In Section 4.3,
we then describe the following in detail: 1) the tiling software that decomposes a layer of
computation into smaller tiles based on the available on-FPGA storage, and 2) the hardware
controller that orchestrates the data access and computation of the accelerator pipeline.
4.1 Accelerator Design
A natural, intuitive implementation of a neural network accelerator would be to physically lay
out the entire neural network structure within the hardware circuit. In such a scheme, each
connection with a synaptic weight would require a multiplier, and each neuron would require an
activation unit and several adders to sum the multiplier outputs. This is not realistic for state-
of-the-art neural networks due to their large model sizes. For example, AlexNet has around
650K neurons and 60M weights. The corresponding hardware resource requirements are far
beyond those of any modern FPGA. Therefore, the computations of a neural network must be divided
into smaller tasks, such that the accelerator can be time-shared among the tasks.
As discussed in Section 2.4, the computations of convolutional and fully-connected layers
account for more than 90% of the total run-time in both CPU and GPU implementations. Hence, the
focus of our design is on the acceleration of these two types of layers. The computations of
these two layers are essentially multiply and accumulate (MAC) operations (weights are mul-
tiplied by neuron outputs). Meanwhile, the operations used in max-pooling and local response
normalization layers are quite different. The max-pooling layer repeatedly compares neuron
values to find the largest neuron value in each pooling region, while the operations used in local
response normalization (LRN) layer include exponentiation, summation and division. This im-
plies that an accelerator circuit designed for convolutional and fully-connected layers cannot be
used to implement max-pooling or LRN. Since the majority of run-time is spent on the MAC
operations, we opted to focus on the MAC operations in our accelerator design and not have
accelerator support for the max-pooling and LRN layers. The max-pooling and LRN layers will
be implemented in software and executed on a processor (on the FPGA die).
The following subsections will review the computations of convolutional and fully-connected
layers, highlight the design considerations of the accelerator and then propose an accelerator
design.
4.1.1 Computation and Data Access
ofm[OFM_Y][OFM_X][OFM_Z]; // Output feature maps.
ifm[IFM_Y][IFM_X][IFM_Z]; // Input feature maps.
kernel[OFM_Z][K_Y][K_X][IFM_Z]; // Filter kernels.
bias[OFM_Z]; // Bias associated with output feature map.
// Iterate through all neurons in output feature maps.
for (oy = 0; oy < OFM_Y; oy++) {
for (ox = 0; ox < OFM_X; ox++) {
for (oz = 0; oz < OFM_Z; oz++) {
// Compute the dot-product of associated kernel and the corresponding receptive field.
for (ky = 0; ky < K_Y; ky++) {
for (kx = 0; kx < K_X; kx++) {
for (kz = 0; kz < IFM_Z; kz++) {
ofm[oy][ox][oz] += kernel[oz][ky][kx][kz] * ifm[stride_y * oy + ky][stride_x * ox + kx][kz];
} } }
// Accumulate with the bias associated to the output feature map.
ofm[oy][ox][oz] += bias[oz];
// Perform activation.
ofm[oy][ox][oz] = activation(ofm[oy][ox][oz]);
} } }
Listing 4.1: Pseudo code of a convolutional layer.
Listing 4.1 shows the pseudo code for a convolutional layer. The three outer loops iterate
through the neurons in the output feature maps along the row, column and depth dimensions,
respectively. The three inner loops compute the dot-product of the associated kernel and the
corresponding receptive field in the input feature maps. The output neuron is then accumulated
with the associated bias, and passed through the activation function at the end.
From the pseudo code, we can see that the number of MAC operations is equal to the size of
the output feature maps times the size of a receptive field, i.e., OFM_Y × OFM_X × OFM_Z ×
K_Y × K_X × IFM_Z. For the largest convolutional layer in AlexNet, the number of MAC operations
is as large as 75 million. Since each MAC operation accepts two inputs, the number of data
accesses may be two times the number of MAC operations (assuming no data re-use).
output[N_O]; // Output neurons.
input[N_I]; // Input neurons.
weights[N_O][N_I]; // Synaptic weights.
bias[N_O]; // Bias associated with output neurons.
for (o = 0; o < N_O; o++) {
for (i = 0; i < N_I; i++)
output[o] += weights[o][i] * input[i];
output[o] += bias[o];
// Perform activation.
output[o] = activation(output[o]);
}
Listing 4.2: Pseudo code of a fully-connected layer.
The computation of a fully-connected layer also requires a large number of data accesses and
MAC operations. As shown in the pseudo code in Listing 4.2, a fully-connected layer with
N_O output neurons and N_I input neurons will have N_O × N_I synaptic weights and N_O × N_I
MAC operations. As an example, the first fully-connected layer in AlexNet has 9,216
input neurons and 4,096 output neurons, which correspond to roughly 38 million weights and MAC
operations.
4.1.2 Data Reuse and On-Chip Storage
Recall that AlexNet has roughly 60 million parameters and 650 thousand neurons. The data
requirements are much more than the capacity of the RAM blocks available on modern FPGAs,
and hence, some data must be stored in off-chip SDRAM. The number of data accesses required
for inference will become a performance bottleneck if all data accesses must be from off-chip
SDRAM. Fortunately, there exists a data reuse opportunity for the weights, as well as for the
input and output neurons. As such, on-FPGA buffers can be used to temporarily store a part
of data for reuse to reduce the off-chip memory traffic.
Based on the above pseudo code, every filter in a convolutional layer is shared by all neurons
in its associated output feature map and hence is reused OFM_Y × OFM_X times. Every
neuron in the input feature maps is accessed by all receptive fields that cover it, which leads
to (K_Y × K_X)/(stride_y × stride_x) × OFM_Z reuses. The neurons in the output feature maps
are also reused at each one of the K_Y × K_X × IFM_Z accumulation operations. For fully-connected
layers, the input and output neurons are reused N_O and N_I times, respectively. To take advantage of
the reuse opportunities and minimize off-chip memory access overhead, the proposed hardware
accelerator contains three buffers to cache input neurons, output neurons and weights.
4.1.3 Accelerator Structure versus Data Width of On-Chip Buffer
Given the large number of MAC operations involved in convolutional and fully-connected layers,
the accelerator design should naturally focus on producing high MAC operation throughput.
This can be achieved by including a large number of multipliers and adders in the custom
hardware design. These operators can be pipelined so that a new set of MAC operations can be
launched every clock cycle. A simple structure for the accelerator design can be similar to the
dot-product operator shown in Section 3.2. If M multipliers are used in such a structure, the
accelerator will accumulate one output neuron with the dot-product of M pairs of weights and
inputs every clock cycle. This implies a data bandwidth requirement of as many as 2 × M + 1
elements per cycle to feed the dot-product pipeline, where the additional 1 is to access the
current neuron output value. This data bandwidth requires both the input neuron
buffer and the weights buffer to have a data width of M elements. Since the input neurons are
reused by different output neurons, this data width requirement can be reduced by caching a
smaller set of input neurons and sharing them among several output neurons in parallel. We
can distribute the M multipliers into M_O dot-product operators, where each dot-product
operator contains M_I multipliers (M_O × M_I = M). The M_O dot-product operators will share a
common set of M_I input neurons and accumulate M_O different output neurons in parallel.
In this case, each dot-product operator receives a different set of M_I weights and hence,
the weights buffer still needs to provide M total weights every cycle. In so doing, the data
bandwidth requirement is reduced to M + M_O + M_I elements per cycle.
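As a concrete illustration with hypothetical sizes (not the dimensions used in later chapters): organizing M = 64 multipliers as M_O = 8 dot-product operators of M_I = 8 multipliers each reduces the requirement from 2 × 64 + 1 = 129 elements per cycle to 64 + 8 + 8 = 80 elements per cycle, a reduction of nearly 40%.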
!"#$%&'$(()* +),-.%/&'$(()*
0$%#$%&
'$(()*
0$%#$%&
1)23)*
0$%#$%&
+*,%)*
4, 5&67 48&5&4, 5&67
9:
48&5&67
48&5&67
9:
67
;; ;; !!
9:
+),-.%/&1)23)*!"#$%&1)23)*
9:
<< << <<
=8>#$%)&?",%
@8%A#*83$B%&8#)*2%8*/
CBB$>$D2%,8"
CB%,E2%,8"
F2*%,2DA/$>
Figure 4.1: Accelerator Design.
4.1.4 Accelerator Structure
Based on the above design considerations, we now present the accelerator design. Figure 4.1
shows the schematic of the accelerator. Three on-chip buffers are instantiated as explained
above to promote data reuse of weights, input neurons and output neurons. The data widths
of the on-chip buffers are matched with the data bandwidth requirements of the compute unit.
The input reader and weights reader will read a new set of inputs and weights from the buffers
and feed them to the compute unit every cycle. Since the input and weights buffers may not
store all necessary data for computing an output neuron, the partially-computed output (a
partial-sum of input and weights multiplies) will be temporarily stored in the output buffer
and get swapped back to the compute unit when the new inputs and weights become available.
Consequently, both read and write access are needed for the output buffer.
Delving further into the compute unit, the first part consists of M_O dot-product operators.
In the accumulation stage, each dot-product operator output is added to one of the M_O par-
tial sums. Multiplexers are used to choose the partial sums to be accumulated, between the
swapped-back values or the ones stored in the registers. If an output neuron has accumulated
all necessary multiplies, activation can be performed and the result will be written to the out-
put buffer. In the case of AlexNet, all the layers use a rectified linear unit as the activation
function, which can be implemented with a 2-to-1 multiplexer that selects between the original
value or zero, based on the sign bit of the input. Meanwhile, a partially-computed output can
also be swapped to the output buffer so that the compute unit can continue working on the
other output neurons using the available data in input buffer and weights buffer.
As shown in the figure, we use a 16-bit fixed-point data width for all
input neurons, weights and output neurons. This choice is based on three main reasons:
1) 16-bit heterogeneous fixed-point format is sufficient to perform inference of AlexNet with
nearly no accuracy degradation (Section 3.5.4); 2) 16-bit data can be represented by primitive
data types in software (e.g., short int), which makes it convenient to integrate the accelerator
in a processor-accelerator co-design system (to be described in Section 4.2); and 3) reducing
the input bit width does not further yield significant hardware benefit for the dot-product
operators, as long as the input bit-width is less than or equal to 18 bits (Section 3.2). During
the computation, the multiplier outputs and adder outputs are 32-bit wide to preserve higher
precision. Therefore, when partial sums or final outputs are swapped out of the compute unit,
they need to be converted back to 16-bit format so that they can be stored in the output buffers.
Similarly, partial sums swapped back from the output buffer need to be converted from 16-bit
back to 32-bit format. Both conversions are done by shifting. The shifting amount
depends on the heterogeneous fixed-point format used for the inputs, weights and outputs.
The number of bits to be shifted left/right is given by FractionWidth_input +
FractionWidth_weight − FractionWidth_output.
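A minimal C++ sketch of the two conversions under these conventions (the names are ours;
saturation on the narrowing conversion is omitted for brevity):

#include <cstdint>

// The accumulator carries fraction width (f_in + f_w); the output buffer
// stores fraction width f_out. A positive shift narrows, a negative widens.
int16_t to_output_format(int32_t acc, int f_in, int f_w, int f_out) {
    int shift = f_in + f_w - f_out;  // bits to drop on the right
    int32_t shifted = (shift >= 0) ? (acc >> shift) : (acc << -shift);
    return static_cast<int16_t>(shifted);  // store as 16-bit
}

int32_t to_accumulator_format(int16_t stored, int f_in, int f_w, int f_out) {
    int shift = f_in + f_w - f_out;  // inverse of the conversion above
    int32_t widened = static_cast<int32_t>(stored);
    return (shift >= 0) ? (widened << shift) : (widened >> -shift);
}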
4.2 System Design
With the accelerator design in mind, we choose to realize our acceleration solution using a
System-on-Chip architecture comprising a processor and an FPGA device. The main idea be-
hind this is to perform the compute-intensive work within an FPGA accelerator, while the
remaining computational work is implemented in software and executed on the processor. For
example, the computations of the max-pooling and LRN layers are carried out on the proces-
sor. An off-chip SDRAM is shared by both the processor and FPGA to store all pre-trained
parameters (e.g. weights and neuron biases) and intermediate neuron values.
[Figure 4.2 lays out the three abstract layers, split between the processor side and the FPGA side:

Application Layer (software):
    nn.init( input_model )
    nn.load_parameters()
    download_images()
    for_each image {
      preprocess_image()
      nn.forward()
    }

    void NN::forward() {
      ConvFpgaBackend(fm_1, fm_2, kernel)
      Maxpool(fm_2, fm_3)
      ...
    }
    // Implemented in software.
    Maxpool(fm_2, fm_3) {...}

Translation Layer (software and hardware):
    ConvFpgaBackend(ifm, ofm, kernel) {
      - Decompose into tiles
      - Move data DDR->FPGA
      - Send instructions
      - Move data FPGA->DDR
    }
    Accelerator Controller (on the FPGA):
      - Translate instruction into cycle-by-cycle control signals

Accelerator Layer (hardware):
    Input Buffer, Weights Buffer, Output Buffer;
    Input Reader, Weights Reader, Output Reader, Output Writer.]
Figure 4.2: Three abstract layers in the overall system.
Conceptually, the proposed system can be divided into three abstract layers. From the
bottom up, the lowest layer corresponds to the accelerator design described above. We call
this layer the Accelerator Layer. To make use of the accelerator, we use a Translation Layer
that implements two back-end API functions for convolutional and fully-connected layers. Both
APIs operate at a “per-layer” granularity, meaning that each call to the API corresponds to the
computation of an entire layer of the neural network. The first part of the translation layer is
implemented in software. It decomposes a layer of computation into tiles based on the available
on-FPGA storage (i.e., the buffers). For each tile, the software first issues requests to move the
corresponding input neurons and weights from off-chip SDRAM to on-chip buffers. Then, it
sends an instruction that specifies the tile information to the hardware on the FPGA. Once
the accelerator finishes computation, the software will initiate a data transfer to move
the outputs from the on-chip buffer to off-chip SDRAM. The second part of the translation
layer is an Accelerator Controller, implemented as a hardware component on the FPGA. The
accelerator controller receives the instruction from the processor side and translates it into
cycle-by-cycle control signals to control the buffer readers and writers, as well as the compute
unit. We will discuss the tiling scheme, instruction content and accelerator controller in the
subsequent sections.
Lastly, an Application Layer is completely implemented in software and runs on the proces-
sor. It is responsible for constructing a neural network based on user-specified input, reading
pre-trained model parameters from disk to initialize the parameter matrices, preparing input
samples, which may be stored on disk or fed directly from the network, and so on. The application
software will call backend APIs to off-load computations to the FPGA accelerator. It is worth
noting that all of these software implementations, including the backend APIs are integrated
into our software framework described in Section 3.4.
4.3 Translation Layer Details
In the proposed architecture, the computation work for fully-connected and convolutional layers
can be broken down into three levels. The first level refers to the tiling step of the translation
layer, where a layer of computation is divided into tiles based on the available buffer storage
on FPGA. After moving the required data from off-chip memory to the on-chip buffer, the
tiling software will send the accelerator controller an instruction describing the current tile. In
software, the instruction is simply organized as a C struct where each field represents specific
information. For example, the instruction contains a flag to indicate the layer type (fully-
connected or convolutional), two integers to represent the numbers of output and input neurons
in the current tile, etc. At the second level, the accelerator controller receives the instruction
and further breaks down the tile into smaller blocks such that the accelerator pipeline can start
a new block of computation every clock cycle. Each block will correspond to M_O × M_I MAC
operations, involving M_O output neurons, M_I input neurons and M_O × M_I weights. At the last
level, the accelerator performs the corresponding operations based on the control signals issued
by the accelerator controller.
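As an illustration, the tile instruction might look like the following C-style struct; the text
only specifies the layer-type flag and the tile's neuron counts, so the remaining fields and all
names here are hypothetical:

// A hypothetical sketch of the tile instruction; field names are ours.
typedef struct {
    int is_conv;        // layer type: 1 = convolutional, 0 = fully-connected
    int num_outputs;    // number of output neurons in the current tile
    int num_inputs;     // number of input neurons in the current tile
    int do_activation;  // apply activation after the final accumulation
    int shift_amount;   // shift used for the 16-/32-bit format conversion
} TileInstruction;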
We now explain the translation layer design with the help of pseudo-code organized into the
three-level structure described above.
4.3.1 Translation Layer for Fully-Connected Layers
We organize the pseudo-code for the fully-connected layer into three sets of loop nests as shown
in Listing 4.3. The first loop nest corresponds to the tiling step of the translation layer that
is implemented in software. We select the tiling sizes T_O and T_I for output and input neurons
such that the required storage sizes (T_O output neurons, T_I input neurons and T_O × T_I weights)
1 output[N_O]; // Output neurons.
2 input[N_I]; // Input neurons.
3 weights[N_O][N_I]; // Synaptic weights.
4 bias[N_O]; // Bias associated with output neurons.
6 // Loop nest 1: tiling step of the translation layer.
7 for (ot = 0; ot < N_O; ot += T_O) { // Loop 1-1.
8 // Move bias to output buffer as the initial values for
9 // output neurons.
10 bias[ot : ot + T_O] -> output_buffer
11 for (it = 0; it < N_I; it += T_I) { // Loop 1-2.
12 input[it : it + T_I] -> input_buffer
13 weights[ot : ot + T_O][it : it + T_I] -> weights_buffer
14 issue_instruction();
16 // Loop nest 2: accelerator controller breaks a tile into
17 // smaller blocks.
18 for (ob = ot; ob < ot + T_O; ob += M_O) { // Loop 2-1.
19 for (ib = it; ib < it + T_I; ib += M_I) { // Loop 2-2.
20 issue_control_signals();
22 // Loop nest 3: accelerator takes in M_I inputs
23 // and M_O * M_I weights and updates M_O partial
24 // sums every cycle.
25 for (o = ob; o < ob + M_O; o++) { // Loop 3-1.
26 for (i = ib; i < ib + M_I; i++) // Loop 3-2.
27 output[o] += input[i] * weights[o][i];
28 if (i == N_I - 1) output[o] = activation(output[o]);
29 }
30 } } // End of the loop nest 2.
31 } // End of input neuron tiling iteration.
33 // Move computed output neurons from output buffer to
34 // main memory.
35 output_buffer -> output[ot : ot + T_O];
36 }
Listing 4.3: Pseudo code of a fully-connected layer with tiling.
!"
#$%&'(($)&*+$,(
-&'%&'$)&*+$, .)"/0',
1++%232
1++%(434
1++%(234
1++%(432
1++%(232(5(234
1++%(432(5(434
6-
6"
!-7-
7"
Figure 4.3: Tiling and traversal of fully-connected layers.
are smaller than or equal to the sizes of the corresponding on-chip buffers. T_O bias values, associated
with the output neurons in the current tile, are moved to the output buffer (line 10) and
loaded into the compute unit as the initial partial-sum values. Since the buffer sizes are
limited, it takes multiple iterations to move all the inputs and weights required to compute the
T_O outputs (lines 11-13). Every time a tile of inputs and weights is moved to the on-chip
buffers, an instruction is issued by the tiling software to the accelerator controller to start
the corresponding computation (line 14). In the case of a fully-connected layer, the instruction
specifies: 1) the layer type, 2) the number of input and output neurons in the current tile, and
3) whether activation should be performed for output neurons after accumulating all products
of inputs and weights in the current tile.
Starting from line 16, the inner two loop nests are performed by the hardware accelerator
on FPGA. In the second loop nest, the hardware accelerator controller further decomposes the
tile into smaller blocks. Each block refers to the computation between M_O output neurons and
M_I input neurons. At every clock cycle, a set of control signals is issued to the accelerator
(line 20). For the buffer readers, addresses are provided to load the data required for the current
block. The output writer receives a signal indicating whether the compute unit outputs
should be stored in the output buffer, as well as the address at which to store the data. For the
compute unit, the control signals indicate: 1) whether the pipeline should swap in a new set of
partial sums, 2) whether activation should be performed, and 3) the shifting distance to be
used when converting partial sums between 16-bit and 32-bit formats. Lastly, the third loop
nest corresponds to a block of computation (lines 25-29). The compute unit can start one such
block of computations every clock cycle.
We can visualize the three-level loop nest as shown in Figure 4.3. The two strips on the
left and top of the figure represent the 1-dimensional arrays of output and input neurons,
respectively. The rectangle represents the weights connecting every pair of output and input
neurons. The region highlighted in blue refers to the data involved in a tile of computation,
while the orange region represents the data used in a block of computation, which can be
launched by the accelerator every clock cycle. The blue and orange arrows show the loop
traversal directions in the tiling software and the accelerator controller, respectively. For example,
loops 1-1 and 1-2 traverse along the dimensions of output neurons and input neurons.
Loops 2-1 and 2-2 in the accelerator controller iterate inside the tile to decompose it into
smaller blocks.
4.3.2 Translation Layer for Convolutional Layers
Tiling Step in Software
The translation layer implementation for a convolutional layer can also be organized in a similar
three-level loop-nest structure. We begin the discussion with the first loop nest that corresponds
to the tiling step of the translation layer. The dimensions of output and input feature maps,
kernels (a set of 3 dimensional filters) and biases are defined in Listing 4.4, and labelled in
Figure 4.4.
Our implementation performs tiling along the row (Y), column (X) and depth (Z) dimensions
of output feature maps. Loop 1-1 in the pseudo code first iterates through the depth dimension
of output feature maps, selecting T_OFM_Z output feature maps at a time. Since each output
ofm[OFM_Y][OFM_X][OFM_Z]; // Output feature maps.
ifm[IFM_Y][IFM_X][IFM_Z]; // Input feature maps.
kernel[OFM_Z][K_Y][K_X][IFM_Z]; // Filter kernels.
bias[OFM_Z]; // Bias associated with output feature map.
S <- stride;
// Loop nest 1: tiling step of the translation layer.
for (ozt = 0; ozt < OFM_Z; ozt += T_OFM_Z) { // Loop 1-1.
// Move associated filters to weights buffer.
kernel[ozt : ozt + T_OFM_Z][0 : K_Y][0 : K_X][0 : IFM_Z]
-> weights_buffer
for (oyt = 0; oyt < OFM_Y; oyt += T_OFM_Y) { // Loop 1-2.
for (oxt = 0; oxt < OFM_X; oxt += T_OFM_X) { // Loop 1-3.
// Move input feature map tile to input buffer.
ifm[oyt * S : oyt * S + (T_OFM_Y - 1) * S + K_Y]
[oxt * S : oxt * S + (T_OFM_X - 1) * S + K_X]
[0 : IFM_Z] -> input_buffer.
// Move bias to output buffer as the initial values.
// for output neurons.
bias[ozt : ozt + T_OFM_Z] -> output_buffer
issue_instruction();
/* Loop nest 2 and loop nest 3 are skipped here. */
// Move computed output neurons from output buffer to
// main memory.
output_buffer ->
ofm[oyt : oyt + T_OFM_Y]
[oxt : oxt + T_OFM_X]
[ozt : ozt + T_OFM_Z];
} } }
Listing 4.4: Tiling pseudo code for a convolutional layer.
!""#$%&' ($%&)
*+,-.+/
,-.+/
,-.+0*+,-.+0
!"#$"#
%&'#"(&
)'$*
111
111
!""#
%&%
2+0
2+/
,-.+3
+,-./01#&(*
*+,-.+3
4*+,-.+0$5 %6$7$8$9$2+0
4*+,-.+/$5 %6$7$8
9$2+/
!""#$%&' ($%&)
:-.+0
:-.+/
23$"#
%&'#"(&
)'$*
Figure 4.4: Traversal order in the tiling software.
feature map has an associated 3-D filter, the tiling along the depth dimension of output feature
maps implies a selection of corresponding 3-D filters. These selected 3-D filters (highlighted in
blue in Figure 4.4) are moved from off-chip memory to the weights buffer and are reused by
subsequent loops to compute neuron output values for a set of output feature maps.
Loops 1-2 and 1-3 further tile along the rows and columns of output feature maps, with a
2-D patch size of T_OFM_Y × T_OFM_X. In the figure, the blue region in the output feature
maps represents the selected tile. Recall that each output feature map pixel has a receptive
field in the input feature maps, covering a local region on the Y- and X-dimensions but across
the entire Z-dimension. Hence, the tiling along the rows and columns of output feature maps
also implies a corresponding tiling along the row and column dimensions of the input feature
maps. Note that we do not perform tiling on the depth dimension for input feature maps or
filters, meaning that all the required inputs for computing the output feature map tile will
be moved to the on-chip buffers and be used in one tile of computation.1 The blue region in
the input feature maps refers to the pixels that can be covered by the receptive fields of the
selected output feature map tile. This region of input feature maps needs to be moved to the
input buffer. We also move the associated biases to the output buffer and use them as the
initial values for output neurons.
After moving all the required inputs, the tiling software issues an instruction to the accel-
erator controller, specifying the size of the current output feature map tile, as well as the size
of 3-D filters. When the hardware on the FPGA finishes the tile of computation, the computed
outputs are then copied back from on-chip buffer to the off-chip memory.
// Loop nest 2: accelerator controller breaks a tile
// into smaller blocks.
for (oy = oyt; oy < oyt + T_OFM_Y; oy++) { // Loop 2-1.
for (ox = oxt; ox < oxt + T_OFM_X; ox++) { // Loop 2-2.
for (ozb = ozt; ozb < ozt + T_OFM_Z; ozb += M_O) { // 2-3.
// Current block will work on
// ofm[oy][ox][ozb : ozb + M_O].
for (ky = 0; ky < K_Y; ky++) { // Loop 2-4.
for (kx = 0; kx < K_X; kx++) { // Loop 2-5.
for (kzb = 0; kzb < IFM_Z; kzb += M_I) { // 2-6.
// Current block will work on
// ifm[oy * S + ky][ox * S + kx][kzb : kzb + M_I].
issue_control_signals();
// Loop nest 3: accelerator takes in M_I inputs
// and M_O * M_I weights and updates M_O partial
// sums every cycle.
for (oz = ozb; oz < ozb + M_O; oz++) {
for (kz = kzb; kz < kzb + M_I; kz++)
ofm[oy][ox][oz] += kernel[oz][ky][kx][kz] *
ifm[oy * S + ky][ox * S + kx][kz];
if (all_multiplies_are_accumulated)
ofm[oy][ox][oz] = activation(ofm[oy][ox][oz]);
} // End of loop nest 3.
}}}}}} // End of loop nest 2.
Listing 4.5: Accelerator controller pseudo code for a convolutional layer.
!""#$%&'
($%&%
!"#$"#%
&'(#")'%
*($%+,-'
)
)
)
!""#
%&*
!""#$%&+
($%&,
-./
-.0
12
./0%1,-#')2%(2234,(#'5%
#3%#6'%!&*%#,-'
!""#$%&+
($%&,
!""#$%&'
($%&%
78$"#
&'(#")'
*($%+,-'
-./
-.0
Figure 4.5: Traversal order in the accelerator controller.
1This tiling scheme does not work when the size of a 3-D filter (equivalent to the receptive field size of an
output feature map pixel) is greater than the capacity of the weights buffer. The largest 3-D filter in the AlexNet
model has a size of 256×3×3 = 2,304 weights. If weights are represented in 16-bit fixed-point, one such filter can
be accommodated by three M10K RAM blocks on Altera’s FPGA. Our implementation described in Chapter 6
makes sure that any 3-D filter in the AlexNet model can fit in the weights buffer.
Accelerator Controller
Once the accelerator controller receives an instruction describing a tile of computational work,
it needs to decompose the tile into smaller blocks that match the accelerator’s compute
bandwidth (the accelerator receives M_I inputs and M_O × M_I weights, and updates M_O partial sums
every clock cycle). In our design, the M_O partial sums refer to M_O pixels in the output feature
map tile that are at the same <x,y> location across M_O consecutive output feature maps (see
the highlighted pixels in the output feature map tile in Figure 4.5). The M_I inputs from the input feature
map tile are positioned as the figure shows. The weights connecting the M_I inputs to the
M_O outputs are therefore distributed among M_O 3-D filters, where each filter has M_I weights
positioned according to the selected neurons in the input feature maps.
The decomposition of a tile starts by iterating along the row, column and depth dimensions
of the output feature map tile (see the first three loops in Listing 4.5, and the orange arrows
around the output feature map tile in Figure 4.5). At each step (of Loop 2-3), M_O output
feature map pixels are selected. The <x,y> position of the M_O output feature map pixels defines
the receptive field in the input feature map (refer to the box with the dotted line), while the
position on the z-dimension selects the associated filters (highlighted filters). Then, the next
three loops iterate through the row, column and depth dimensions of the receptive field and
the associated filters. At each step (of Loop 2-6), the accelerator controller selects M_I inputs
from the receptive field and M_O × M_I weights from the filters. A set of control signals is then
issued to guide the accelerator to process the selected inputs, weights and outputs. The control
signals are the same as those for fully-connected layers. Again, the computation in the third loop
nest is one block of computation that can be launched by the accelerator every clock cycle.
4.4 Summary
In summary, our overall system is implemented as a processor-accelerator SoC. We propose
an accelerator design that supports fully-connected and convolutional layers, with a focus on the
performance of the MAC operations. The accelerator is augmented with on-chip buffers to take
advantage of data reuse opportunities so that we can minimize the off-chip memory traffic. On
the processor side, application software invokes backend APIs to off-load the computation of
fully-connected and convolutional layers to the accelerator. The backend APIs are implemented
in the translation layer, where data is divided into tiles and moved to the on-chip buffers. A
Chapter 4. System Architecture 44
hardware controller is designed to coordinate the cycle-by-cycle operation of the accelerator.
Application software is also responsible for initializing a neural network, loading pre-trained
parameters, preparing the input samples and so on. The max-pooling and LRN layers are also
implemented in software.
Chapter 5
Streaming Hardware Generation in
LegUp High-Level Synthesis
We adopt high-level synthesis (HLS) to generate the accelerator hardware from a high-level
description in standard software language, C. We use the LegUp open-source HLS tool [12] to
synthesize all the hardware circuits shown in Figure 4.1, including the accelerator controller,
compute unit, on-chip buffers, buffer readers and writer. As described in Section 4.1, the
accelerator circuit must be pipelined in order to maximize the compute throughput, while
using limited hardware resources. When this project commenced, LegUp HLS was not capable
of synthesizing pipelined functions. One of the contributions of this thesis is to add pipelined
function synthesis to the LegUp HLS tool.
Section 5.1 briefly summarizes an existing LegUp feature – loop pipelining, of which the
scheduling algorithm and datapath generation are re-used for the purpose of pipelined function
synthesis. From Section 5.2 to Section 5.4, we describe work done to support function pipelining
in LegUp, including the pipelined function interface, FIFO support and stall logic implemen-
tation. Lastly, Section 5.5 illustrates how we describe the neural network inference accelerator
circuit in software so that LegUp can generate the corresponding pipelined hardware.
5.1 Loop Pipelining
Although function pipelining was not supported, LegUp has a mature loop pipelining feature
that permits the synthesis of hardware to execute loop iterations in a pipelined manner. The
loop pipelining implementation uses the modulo SDC scheduling algorithm with a backtracking
mechanism [11] to find a valid schedule for the operations in a loop body. The objective of the
scheduling algorithm is to minimize the initiation interval (II), which is the number of cycles
between the launch of two consecutive loop iterations. The ideal value of II is 1, where a new
loop iteration is started every clock cycle.
In a pipelined circuit, a variable will require extra pipeline stage registers if its lifetime is
more than II cycles, where the lifetime of the variable is the cycle-count difference between the
stage where it is assigned a value and the last stage where it is used. The pipeline stage registers
are meant to preserve the computed value for its uses at a later pipeline stage, otherwise the
value would be overwritten by subsequent iterations. For example, if the II is 1 (new inputs
are injected into the pipeline every clock cycle) and a variable is computed at stage-1 and
used at stage-4, pipeline registers will be required at stage-1, stage-2 and stage-3. The existing
functionality within the LegUp HLS tool automatically inserts such pipeline stage registers
into the datapath to ensure correct functionality. We use the same scheduling algorithm and
datapath generation flow to implement pipelined functions. However, there are still several
missing pieces to fully support a pipelined function in a complete system. In the following
sections, we will describe the additional work done as part of this thesis to support pipelined
function hardware synthesis.
5.2 Pipeline Function Interface
For a pipelined function with an ideal II of 1, a new set of inputs has to be provided to
the function every clock cycle to maximize computational throughput. For example, in the
case of our accelerator design, the compute unit pipeline accepts a new set of data inputs from
the buffer reader modules and control signals from the accelerator controller every clock cycle,
where the buffer reader modules are also pipelined functions that accept new addresses from
the accelerator controller every cycle. This will require a mechanism to stitch together multiple
pipelined functions in a streaming manner, where upstream and downstream functions execute
in parallel with data flowing down the connected pipeline every clock cycle.
The traditional module interface used in LegUp HLS is designed for sequential functions.
The module interface is shown in Figure 5.1 and the handshaking between a caller and a callee
[Figure 5.1 shows the module interface ports: start, finish, arg_1, arg_2 and return_val.]

Figure 5.1: Module interface of a sequential function.
[Figure 5.2 shows a waveform of the start/finish handshake over the clock: the arguments on
arg_1 and arg_2 are valid while start is asserted, and return_val is valid when finish is
asserted.]

Figure 5.2: Handshaking between sequential functions.
is shown in Figure 5.2. Function arguments and return values are presented as inputs and
output on the module interface. To invoke a callee function (module), the caller function
(module) asserts the start signal to start the function execution and also to indicate that the
two arguments on arg 1 and arg 2 have become valid. When the callee function completes
its execution, it asserts the finish signal to indicate completion and the return value on the
output port is valid. The start and finish notion of handshaking between sequential functions
is not sufficient for pipelined functions, where such functions continue to run in parallel, as long
as valid inputs are provided. Therefore, we need to define a new module interface for pipelined
functions synthesized into hardware by LegUp.
[Figure 5.3 shows a port bundled with its valid signal (in the direction of the data) and its
ready signal (in the opposite direction).]

Figure 5.3: Ready-valid-data (RVD) interface.
Since a pipelined function can connect to more than one upstream function (e.g. the compute
unit in the accelerator design receives input from the accelerator controller and multiple buffer
readers), the inputs may not necessarily become valid at the same clock cycle. Hence, each
input of the pipelined function should be associated with a valid signal to indicate input validity
individually. Meanwhile, a pipelined function may not always be able to accept new inputs due
to a stall. Therefore, each input to a pipelined function should be bundled with a ready signal,
which is asserted by the downstream hardware only when it is ready to receive a new input. We
propose to use the ready-valid-data (RVD) handshaking interface where each input or output
of a pipelined function is associated with a valid signal and a ready signal with the directions
shown in Figure 5.3.
[Figure 5.4 shows a waveform of the ready, valid and data signals over the clock, marking
three data transfers: Transfer-0, Transfer-1 and Transfer-2.]

Figure 5.4: Handshaking between source and sink using the RVD interface (the dashed line
represents the data transfer phase).
Figure 5.4 shows the handshaking between the source and sink. The data transfer happens
at the clock edge when both valid and ready signals are high (Transfer-0 ). The valid sig-
nal remains high as long as the data is valid, and data is transferred once ready is asserted
(Transfer-1 ). From the source’s perspective, the ready signal acts as an acknowledgement that
valid data has been consumed by the sink. Also, the ready signal can be asserted as soon as
the sink is ready to receive the data, regardless of the state of the valid signal. If ready is
asserted, data transfer only happens when valid goes high (Transfer-2 ).
The RVD interface allows data transfer to happen with 0-cycle latency. That is, if the valid
signal is high, the sink can use the data right away and assert the ready signal to expect the
next valid data to be presented at the immediately subsequent cycle. Similarly, if the ready
signal is high, the source can assert valid so that data will be transferred at the following
clock edge. This is a key advantage of using such an interface in a streaming design. Consider an example wherein the sink is stalled and de-asserts the ready signal. The source module can keep the valid signal asserted with valid data placed on the data port, but needs to stall as well to avoid dropping data. When the sink resumes from the stall, it can re-assert the ready signal and use the valid data in the same cycle, allowing both source and sink to return to steady-state immediately (i.e. one datum flowing through the interface every cycle). In contrast, if a 1-cycle latency interface were used, a pipelined function would require an extra cycle to return to steady-state after each stall.
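To make the handshaking rule concrete, the following small C fragment simulates one RVD link (this is purely illustrative; the variable names are ours, and nothing like it is generated by LegUp). A word crosses the interface on a clock edge exactly when both valid and ready are high, and a stalled cycle simply holds the data in place, so no recovery cycle is needed:

#include <stdbool.h>
#include <stdio.h>

int main(void) {
    int src_data[] = {10, 20, 30, 40};   /* words the source wants to send */
    bool sink_ready[] = {true, false, true, true, true}; /* sink stalls in cycle 1 */
    int sent = 0;
    for (int cycle = 0; sent < 4; cycle++) {
        bool valid = true;               /* source holds data valid every cycle */
        bool ready = sink_ready[cycle];
        if (valid && ready) {            /* transfer occurs on this clock edge */
            printf("cycle %d: transfer %d\n", cycle, src_data[sent]);
            sent++;                      /* both sides advance immediately */
        } else {
            printf("cycle %d: stall, holding %d\n", cycle, src_data[sent]);
        }
    }
    return 0;
}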
5.3 FIFO Support
5.3.1 First-Word-Fall-Through FIFO
As mentioned above, if pipelined functions are directly connected using the RVD interface,
any stall of a pipelined function will require all its upstream functions to stall as well. This
will negatively impact the throughput of the entire pipeline. Moreover, the stall logic will
require a high-fanout signal traversing all connected pipelined functions (through the ready
port), such that when the downstream hardware de-asserts the ready port, all the upstream
hardware will stall immediately. Such a high-fanout signal can potentially degrade the maximum clock frequency. An approach to mitigating these issues is to insert FIFOs between the pipelined functions. FIFOs relax the backpressure by buffering data, and they break the chaining of stall signals between pipelined functions.
!"!#
write_datadata dataread_data
write_envalid validnot_empty
ready not_full read_en ready
$%&'()*+
!,-.'/0-
102-&'()*+
!,-.'/0-
Figure 5.5: RVD-compatible interface of FWFT FIFO.
To be compliant with the RVD interface, we use the first-word-fall-through (FWFT) FIFO,
also known as the show-ahead FIFO. Figure 5.5 shows the interface of a FWFT FIFO and how
it is connected to the upstream and downstream functions using the RVD interface. The main
difference between a FWFT FIFO and a normal FIFO is that the read data presented at the
downstream output port is always valid, as long as the not_empty signal is high. Similar to the ready signal from an upstream pipelined function's perspective, the read_en input of the FWFT FIFO acts as an acknowledge signal, rather than a request signal as it is in a normal FIFO. Assertion of the read_en input tells the FIFO that the data on the output port has been taken and that it can present new data on the output at the next clock cycle. On the upstream side, assertion of the FIFO's not_full signal means that the FIFO is ready to accept new write data. The
handshaking on both the upstream and downstream interfaces of the FWFT FIFO is the same as that of the RVD interface seen in Figure 5.4. As a result of this handshaking compatibility, the FWFT FIFO can be placed between any two pipelined functions that use the RVD interface.
5.3.2 Software Support
To allow HLS users to express and use FIFOs in software design, we provide a software library
implementing the FIFO, as well as related API functions. This software library is designed to
be compilable by standard C compilers so that users can test their C implementation in software
before synthesizing it using LegUp HLS.
typedef struct {
// Bit width of the elements stored in the FIFO.
int width;
// Number of elements that can be stored in the FIFO.
int depth;
// Pointer to a data array holding the elements.
long long *mem;
// Keeps track of where in the array to write to.
unsigned write_index;
// Keeps track of where in the array to read from.
unsigned read_index;
} FIFO;
Listing 5.1: A C-struct definition of FIFO.
// Initialize a FIFO with the specified width and depth.
FIFO *fifo_malloc(int width, int depth);
// Write an element into the FIFO.
void fifo_write(FIFO *fifo, long long data);
// Read an element from the FIFO.
long long fifo_read(FIFO *fifo);
// Free the memory allocated by the FIFO.
void fifo_free(FIFO *fifo);
Listing 5.2: FIFO related APIs.
The FIFO is defined as a C struct as seen in Listing 5.1. The elements of the struct are
used to define the storage, its width/depth, and where to read/write from/to in the storage.
The data array is used as a circular buffer to create the FIFO behaviour. Its type is a long
long, making it capable of handling the largest standard C integer data type, though it can also
be used to hold anything smaller. Listing 5.2 shows the four most commonly used APIs. The fifo_malloc function creates a new FIFO by initializing the depth and width parameters and allocating the data array; it returns a pointer to the newly-created FIFO. The user software uses this pointer when reading/writing from/to the FIFO. The data argument of fifo_write is the write data, while the return value of fifo_read corresponds to the read data. fifo_free frees the memory allocated for the data array and is ignored during LegUp hardware synthesis.
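For illustration, one plausible software implementation of these APIs, consistent with the struct in Listing 5.1, is sketched below (this is our sketch of the library's behaviour, not necessarily its exact code). The data array acts as a circular buffer whose indices wrap modulo the depth:

#include <stdlib.h>

FIFO *fifo_malloc(int width, int depth) {
    FIFO *fifo = malloc(sizeof(FIFO));
    fifo->width = width;
    fifo->depth = depth;
    fifo->mem = malloc(depth * sizeof(long long));
    fifo->write_index = fifo->read_index = 0;
    return fifo;
}

void fifo_write(FIFO *fifo, long long data) {
    fifo->mem[fifo->write_index] = data;
    fifo->write_index = (fifo->write_index + 1) % fifo->depth;  /* wrap around */
}

long long fifo_read(FIFO *fifo) {
    long long data = fifo->mem[fifo->read_index];
    fifo->read_index = (fifo->read_index + 1) % fifo->depth;    /* wrap around */
    return data;
}

void fifo_free(FIFO *fifo) {
    free(fifo->mem);
    free(fifo);
}

In software, two functions then communicate simply by sharing a FIFO pointer, e.g. FIFO *f = fifo_malloc(16, 8); followed by fifo_write(f, 42) in one function and fifo_read(f) in another.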
Note that LegUp does not synthesize an implementation of these API functions – they are
used strictly to permit execution in software. When a C program using the API is compiled
to hardware, LegUp detects its usage and instantiates the FIFO and parameterizes it based on
the width and depth arguments provided to the fifo_malloc function call. Function calls to fifo_write and fifo_read will be scheduled together with other operations by the scheduling algorithm. The hardware circuit generated by LegUp will then perform the needed FIFO read/write operations. On a call to fifo_write by an upstream function, the function places the data on the output data port and asserts the valid signal (connecting to write_en). If the associated ready signal (driven by not_full) is low, the function has to stall until the FIFO is not full. If the ready signal is high, the data is considered successfully written to the FIFO and the function continues execution. For fifo_read, a downstream function asserts the ready signal (connecting to read_en) and checks the valid input (driven by not_empty). If valid is high, the input data is considered valid and can be used by the function's datapath.
If valid is low, meaning that the FIFO is empty, the downstream function needs to stall and
wait for the input data to become valid.
5.4 Stall Logic and Datapath
As alluded to in the previous section, we must support the scenario when resources or data
become unavailable to the pipelined function. There are three primary cases that would require
a pipelined function to stall: 1) when the pipelined function shares a resource such as memory
(RAM block) or functional unit (e.g., a divider) with other functions, if the resource is presently
being used by other functions, the pipelined function has to stall until the resource becomes
available; 2) when the pipelined function is not receiving valid input from its upstream module,
the pipelined function needs to stall until all inputs become valid; and 3) when a pipelined
function’s downstream module is not ready to consume the pipelined function’s output, the
pipelined function also needs to stall to prevent losing data. The stall logic ensures the hardware
pipeline stalls appropriately and produces a functionally correct result. It is implemented as
a part of the circuitry that controls the pipelined datapath. Because the stall logic must stop the pipeline immediately when a stall is encountered, it is normally implemented as combinational circuitry with a high fanout to many or all stages of the pipeline. The stall logic design can directly impact the overall system throughput, timing
performance and area. Our implementation aims to reduce the stall circuitry by minimizing
the stall signal fanout to only the stages that absolutely need to stall. That is, if an operation
at a specific pipeline stage encounters a stall and cannot proceed, we only stall this and all the
prior (upstream) pipeline stages, but all the later pipeline stages do not need to stall and may
continue to execute. This is analogous to what is typically done in a pipelined processor, wherein
stages towards the front of the pipeline may stall (for example, due to a data hazard), while
the back-end stages may continue. We use an example to explain the design.
Figure 5.6 shows a pipelined function datapath with its control circuitry including the stall
logic. The pipelined circuit is a straight-line datapath with no control flow. Like other HLS
tools, we remove any diverging branches with if-conversion and back edges by unrolling all the
loops. All sub-functions are also inlined. In the figure, there are two input FIFOs, a non-FIFO
argument input, and two output FIFOs. This pipeline can be stalled if any input FIFO is empty
or any output FIFO is full. The not_empty signals from the input FIFOs and the not_full
!"#$%
&$'!"#$%&#%
'()*+,#%
-#.*%
/-/01
-#.*%
/-/02
0*%.*%
/-/02
3")45
(,),# ,#
(,),# ,#
(,),# ,#
(,),# ,#
(,),# ,#
(,),# ,#
3")45
0*%.*%
/-/01
3")45
()*## ()*##
,#
,#
,#
,#
(+,-'.
(+,-'.
/'"##
#01$2
3435
/$16"#/
78
79
7:
7;
/'"1+
+6"&#+
<"2=>-?+//*?+ (,)
,# ,#
Figure 5.6: Pipeline circuit datapath and stall logic.
signals from the output FIFOs are the inputs to the stall logic. The S’s denote pipeline stages,
with registers at each stage to hold data. Each pipeline stage is associated with a valid bit,
indicating whether the pipeline stage contains valid data. It is used together with the stall
logic to control the pipeline datapath. A pipeline stage is enabled when both the valid bit and
the output of the stall logic AND gate are high. When the stall logic AND gate is high, it means
that none of the relevant input/output FIFOs is empty/full, and therefore, execution should
continue. The stall logic also affects the update of the valid bits. As shown in the figure, a
valid bit register retains its value when a stall occurs (the output of the stall logic AND gate
connects to the enable of the valid bit register). When there is no stall, the valid bit updates
its value to 1 if the previous stage was enabled; otherwise, the valid bit becomes 0 since there
is no valid data flowing down from the previous stage.
Let us now examine how the stall signals from the FIFOs are connected to the stall logic.
For Input FIFO0, whose data is needed at S0, only S0 will be stalled when this FIFO becomes
empty. Data from Input FIFO1 is needed at S1, so if this FIFO is empty, S1 and S0 stall. S0
also needs to stall in this case since its next stage is stalled (allowing it to continue will overwrite
the valid data in S1). Output FIFO0 is written from S2. Hence, when this FIFO is full, it stalls
S2, S1, and S0. When Output FIFO1 is full, the entire pipeline stalls. Thus, a FIFO can stall
the pipeline stage where its data is needed/written, and all of its previous pipeline stages, but
the FIFO does not stall any later pipeline stages. For instance, when S0 stalls due to Input
FIFO0 only, S1, S2, and S3 may continue. When Output FIFO0 is full, valid data in S3 can
continue and be written to the Output FIFO1 (assuming it is not full).
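The resulting per-stage enable condition can be summarized with the following illustrative model (the names and the array-based framing are ours; the real circuit also gates each stage with its valid bit, omitted here). fifo_ok[s] is high when the FIFO attached at stage s is neither empty (for an input FIFO) nor full (for an output FIFO); stages with no attached FIFO are always ok:

#include <stdbool.h>
#include <stdio.h>

#define STAGES 4

void ComputeEnables(const bool fifo_ok[STAGES], bool enable[STAGES]) {
    /* Walk backwards from the last stage: a blocked FIFO stalls its own
       stage and every stage upstream of it, but no stage after it. */
    bool ok_from_here_on = true;
    for (int s = STAGES - 1; s >= 0; s--) {
        ok_from_here_on = ok_from_here_on && fifo_ok[s];
        enable[s] = ok_from_here_on;
    }
}

int main(void) {
    /* Output FIFO0 (attached at S2) is full: S0-S2 stall, S3 continues. */
    bool fifo_ok[STAGES] = {true, true, false, true};
    bool enable[STAGES];
    ComputeEnables(fifo_ok, enable);
    for (int s = 0; s < STAGES; s++)
        printf("S%d: %s\n", s, enable[s] ? "enabled" : "stalled");
    return 0;
}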
There are also cases where the stall circuitry is not necessary. For instance, a constant argument (such as an integer value) is stored in registers when the module starts and remains unchanged during its execution. We do not create any stall logic for this argument, as it will
not be overwritten during the execution. This helps to reduce circuit area and the fanout of
the stall signals, which can become large if there are many FIFOs and pipeline stages.
5.5 Accelerator Design using LegUp's Pipelined Function Feature
5.5.1 Software Implementation of Accelerator Controller
1 void AcceleratorController(
2 /* The instruction received from host software. */
3 LayerType instr_layer_type, BackendActivationType instr_activation_type, int16_t instr_shifting_distance,
4 // The number of buffer entries that will be used when processing this tile.
5 unsigned short instr_ib_entry_count, unsigned short instr_ob_entry_count,
6 /* Send addresses to input reader, weights reader, output reader and output writer. */
7 FIFO *F_ib_entry_index, FIFO *F_wb_entry_index, FIFO *F_ob_read_entry_index, FIFO *F_ob_write_entry_index,
8 /* Send control signals to compute unit. */
9 FIFO *F_should_rotate_in_partial_sum, FIFO *F_should_rotate_out_output,
10 FIFO *F_activation_type, FIFO *F_shifting_distance) {
12 if (instr_layer_type == kFCLayer) { // Fully-connected layer.
13 unsigned short ob = 0, ib = 0, wb = 0;
14 for (ob = 0; ob < instr_ob_entry_count; ob++) { // Iterate through each output buffer entry.
15 loop: for (ib = 0; ib < instr_ib_entry_count; ib++, wb++) { // Iterate through each input buffer entry.
16 // Rotate in partial-sums or bias when start computing a new set of outputs.
17 bool should_rotate_in_partial_sum = (ib == 0);
18 // Rotate out the partial sums or the final result after factoring in the last set of inputs.
19 bool should_rotate_out_output = (ib == (instr_ib_entry_count - 1));
21 // Send addresses to buffer readers and writer.
22 fifo_write(F_ib_entry_index, ib);
23 fifo_write(F_wb_entry_index, wb);
24 if (should_rotate_in_partial_sum) fifo_write(F_ob_read_entry_index, ob);
25 if (should_rotate_out_output) fifo_write(F_ob_write_entry_index, ob);
27 // Output to compute unit.
28 fifo_write(F_should_rotate_in_partial_sum, should_rotate_in_partial_sum);
29 fifo_write(F_should_rotate_out_output, should_rotate_out_output);
30 fifo_write(F_activation_type, instr_activation_type);
31 fifo_write(F_shifting_distance, instr_shifting_distance);
32 }
33 }
34 } else if (instr_layer_type == kConvLayer) { ... }
36 return;
37 }
Listing 5.3: C-implementation of accelerator controller.
Now we are ready to show some example code for the accelerator design. Listing 5.3 presents
a code snippet of the accelerator controller for the fully-connected layer. Recall that the accel-
erator controller receives an instruction from the tiling software and generates cycle-by-cycle
control signals for other accelerator components to process a tile of computation. The function
accepts each instruction field as an argument (lines 2-5). The instruction fields include the layer
type, the activation function type, the shifting distance to be used when converting between
the 16-bit and 32-bit fixed-point formats, and the numbers of output and input neurons in the
current tile. The numbers of output and input neurons are represented as the number of buffer
entries used to store the data. Each input buffer entry contains MI input neurons, each output
buffer entry contains MO output neurons and each weights buffer entry contains MO × MI
weights. We omit the instruction fields that are used only for the convolutional layer. The
second part of the arguments are the FIFOs connecting to the buffer readers, buffer writer and
the compute unit (lines 7-10). The generated control signals will be sent through these FIFOs.
The nested loop in the implementation corresponds to the second loop nest described in Section 4.3.1. The accelerator controller iterates along the output and input neurons to decompose a tile into smaller blocks, each involving one entry of data in the input, output and weights buffers (lines 14-15). The accelerator can start one such block of computation every clock cycle. The accelerator controller needs to generate a new set of control signals every cycle for all other accelerator components. Within the loop, the accelerator controller first determines whether a new set of partial sums should be swapped into the compute unit pipeline (lines 16-17). If ib is zero, it implies the start of computation for a different set of outputs, and the corresponding partial sums or bias should be rotated into the compute unit. The accelerator controller also needs to determine if the current set of partial sums in the compute unit needs to be rotated out (lines 18-19). By the time the last set of inputs and weights (in the current tile) is factored into the current set of outputs (ib == instr_ib_entry_count - 1), the accumulated partial sums should be rotated out from the compute unit and stored in the output buffer. For each compute block, the accelerator controller needs to send addresses to the input and weights readers so that they can feed the corresponding data to the compute unit (lines 22-23). Addresses will also be sent to the output reader and writer when rotate-in or rotate-out is necessary (lines 24-25). Lastly, the accelerator controller sends the control signals to the compute unit, specifying whether rotate-in or rotate-out is needed, the activation function type and the shifting distance (lines 27-31).
Function pipelining requires all loops to be unrolled; however, the loops in this function have
variable loop bounds and cannot be unrolled. Hence, we do not pipeline the entire function,
but only pipeline the inner loop (labelled “loop” on line 15) such that a new iteration of the
inner loop can start every clock cycle. In this case, the accelerator controller is synthesized as a sequential function whose inner loop is pipelined using the loop pipelining feature.
5.5.2 Software Implementation of the Compute Unit
1 #define DType short int
2 #define WDType int
4 void ComputeUnit(
5 // The M_i input neurons from input reader.
6 FIFO* F_input_0, ..., FIFO* F_input_7,
7 // The M_o x M_i weights from weights reader.
8 FIFO* F_weight_0_0, FIFO* F_weight_0_1, ..., FIFO* F_weight_0_7, ..., FIFO* F_weight_7_7,
9 // Control signals from accelerator controller.
10 FIFO* F_should_rotate_in_partial_sum, FIFO* F_should_rotate_out_output,
11 FIFO* F_activation_type, FIFO* F_shifting_distance,
12 // The rotate in partial-sums from output reader.
13 FIFO* F_rotate_in_output_0, ..., FIFO* F_rotate_in_output_7,
14 // The rotate out intermediate values or final output values, to output writer.
15 FIFO* F_rotate_out_output_0, ..., FIFO* F_rotate_out_output_7) {
17 // Read inputs.
18 DType input_0 = fifo_read(F_input_0); ... DType input_7 = fifo_read(F_input_7);
19 // Read weights.
20 DType weight_0_0 = fifo_read(F_weight_0_0); DType weight_0_1 = ... DType weight_7_7 = fifo_read(F_weight_7_7);
21 // Read control signals.
22 bool should_rotate_in_partial_sum = fifo_read(F_should_rotate_in_partial_sum);
23 bool should_rotate_out_output = fifo_read(F_should_rotate_out_output);
24 BackendActivationType activation_type = fifo_read(F_activation_type);
25 int16_t shifting_distance = fifo_read(F_shifting_distance);
27 // Rotate-in partial sums.
28 WDType rotate_in_output_0, ..., rotate_in_output_7;
29 if (should_rotate_in_partial_sum) {
30 // Read rotate-in outputs.
31 rotate_in_output_0 = fifo_read(F_rotate_in_output_0);
32 ...
33 // Convert to 32-bit format.
34 rotate_in_output_0 <<= shifting_distance;
35 ...
36 }
38 // The registers storing accumulating partial sums.
39 static WDType accumulator_0 = 0, accumulator_1 = 0, ..., accumulator_7 = 0;
40 // Select the partial sums to be accumulated, between the register and rotate-in values.
41 WDType select_accumulator_0 = should_rotate_in_partial_sum ? rotate_in_output_0 : accumulator_0;
42 ...
44 // Accumulate with sum of products.
45 accumulator_0 = select_accumulator_0 +
46 SumOfProducts(input_0, input_1, input_2, input_3, input_4, input_5, input_6, input_7,
47 weight_0_0, weight_0_1, weight_0_2, weight_0_3, weight_0_4, weight_0_5, weight_0_6, weight_0_7);
48 ...
50 // Rotate out intermediate accumulator values or final output values.
51 if (should_rotate_out_output) {
52 // Need to shift the accumulator value back to the ‘right’ decimal place before rotating out from the pipeline.
53 DType rotate_out_output_0 = ShiftAndActivate(activation_type, shifting_distance, accumulator_0);
54 ...
55 // Send data to output writer.
56 fifo_write(F_rotate_out_output_0, rotate_out_output_0);
57 ...
58 }
59 }
61 WDType SumOfProducts(
62 DType a_0, DType a_1, DType a_2, DType a_3, DType a_4, DType a_5, DType a_6, DType a_7,
63 DType b_0, DType b_1, DType b_2, DType b_3, DType b_4, DType b_5, DType b_6, DType b_7) {
64 // Pair-wise multiply.
65 WDType product_0 = (WDType)a_0 * b_0; ... WDType product_7 = (WDType)a_7 * b_7;
66 // Sum up the products.
67 return product_0 + product_1 + product_2 + product_3 + product_4 + product_5 + product_6 + product_7;
68 }
70 DType ShiftAndActivate(BackendActivationType activation_type, int16_t shifting_distance, WDType accumulator_value) {
71 // Shifting.
72 WDType shifted_output = accumulator_value >> shifting_distance;
73 // Activation.
74 if (activation_type == kBackendActivationLinear) return shifted_output;
75 else return (shifted_output > 0 ? shifted_output : 0); // kBackendActivationReLU (default case avoids a missing return)
76 }
Listing 5.4: C-implementation of compute unit.
Listing 5.4 shows the software implementation of the compute unit, with both MO and MI
set to 8 (see Section 4.1.4). That is, the compute unit will receive 8 input neurons and 64
weights and accumulate 8 output neurons every clock cycle. Recall that we use 16-bit fixed-
point (DType) for all data stored in the buffers and use 32-bit fixed-point (WDType) for the
multiplier outputs and accumulators. This function is synthesized by LegUp with the function
pipelining feature. For the sake of clarity, repetitive code snippets are replaced with ....
The first set of function parameters are the FIFOs connecting to the input and weights readers (lines 5-8). Each FIFO is used for receiving one input or weight. The next four arguments are FIFOs used to receive control signals from the accelerator controller (lines 9-11). Eight FIFOs are used for receiving the rotate-in partial sums from the output reader (line 13). Another eight FIFOs are created for sending the rotate-out values to the output writer (line 15).
In the implementation, the compute unit first reads in the input neurons, weights and control signals using the fifo_read API (lines 17-25). If rotate-in is necessary, the compute unit will read the partial sums from the FIFOs and convert them from the 16-bit format to the 32-bit format (lines 27-36). At line 39, the accumulators are declared as static variables to mimic the registers that retain their values across pipeline iterations. At line 41, we select the partial sums to be accumulated, between the rotate-in values and the register values. We then accumulate each selected partial sum with the sum of pair-wise products of inputs and weights (lines 44-48). The implementation of the helper function SumOfProducts is listed at lines 61-68. If the control signal should_rotate_out_output indicates that the accumulated partial sums should be rotated out and stored into the output buffer, we convert the 32-bit accumulators into the 16-bit format, perform activations, and write the data to the FIFOs connected to the output writer using the fifo_write API (lines 50-59). Lines 70-76 show the implementation of the conversion and activation.
5.5.3 Software Implementation of Buffer Readers and Writer
1 void InputReader(
2 FIFO* F_ib_entry_index, /* Input address from the AcceleratorController. */
3 FIFO* F_input_0, FIFO* F_input_1, ..., FIFO* F_input_7 /* Input neurons to the compute unit. */) {
4 unsigned entry_index = fifo_read(F_ib_entry_index);
5 fifo_write(F_input_0, input_buffer_0[entry_index]);
6 fifo_write(F_input_1, input_buffer_1[entry_index]);
7 ...
8 fifo_write(F_input_7, input_buffer_7[entry_index]);
9 }
Listing 5.5: C-implementation of input reader.
For completeness, we also show the C-implementation of the buffer readers and writer.
Listing 5.5 is the implementation for the input reader. The implementations for the weights
reader, the output reader and writer are similar to the input reader shown here. For the function
signature, the first parameter is an input FIFO specifying the read address from the accelerator
controller. The remaining 8 parameters are output FIFOs for sending the input neurons to
the compute unit. The implementation is straightforward: the input reader first reads in the
address from the FIFO (line 4), then fetches the input neurons from the buffer and pushes them into the output FIFOs (lines 5-8). It is worth noting that the input buffer is organized as
eight arrays (MI = 8) of the short int data type. This guides the LegUp HLS tool to create
eight independent RAM modules to form the buffer, thereby permitting eight parallel reads to
be performed every clock cycle to maintain the accelerator throughput. Similarly, the output
buffer is organized as MO arrays and the weights buffer is organized as MO × MI arrays.
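For instance, the input buffer declaration could look as follows (a sketch following the naming in Listing 5.5; the 1024-entry depth is the one chosen in Chapter 6):

#define BUFFER_DEPTH 1024

/* MI = 8 separate arrays rather than one 2-D array, so that LegUp maps
   each array to its own RAM module and all eight elements of a buffer
   entry can be read in the same clock cycle. */
short int input_buffer_0[BUFFER_DEPTH];
short int input_buffer_1[BUFFER_DEPTH];
...
short int input_buffer_7[BUFFER_DEPTH];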
5.6 Summary
In this chapter, we described the function pipelining feature added to the LegUp HLS framework
as part of this research. We use the RVD handshaking interface for pipelined functions, enabling
them to be connected to one another in a streaming manner. RVD-compatible FWFT FIFOs
are used to connect pipelined functions. The purpose of using FIFOs is to relax the backpressure between pipelined functions, and to break the potentially timing-critical stall signal which
would otherwise go through all the connected functions. We also provide a software FIFO
library for HLS users to “instantiate” the FIFOs in software with parameterizable width and
depth, and to perform FIFO read/write operations using the library functions. The software
library is compilable by any standard software compiler, allowing the software implementation
to be first verified in software before running high-level synthesis. LegUp does not synthesize
the software implementation of the library functions. Instead, it instantiates the FIFOs and
generates the FIFO read/write circuitry based on the function calls to the library functions. We
also presented the pipelined datapath architecture along with the stall logic, which is designed
with the objective of reducing the overhead on circuit throughput, timing and area. Finally,
we highlighted key aspects of the software design of the accelerator controller, the compute
unit and the input reader, which are synthesized by LegUp using the loop pipelining and function pipelining features. FIFOs are used to pass data and control signals between the accelerator
components. By leveraging LegUp HLS, we are able to implement the highly-pipelined hard-
ware accelerator completely in software, allowing easy experimentation with a wide variety of
hardware implementations, for example, having different numbers of pipeline stages, or with
different resource constraints.
Chapter 6
System Implementation
6.1 Introduction
This chapter overviews the implementation details of the complete processor/accelerator hybrid
system for neural network inference. We also describe how the implementation was verified
functionally, present experimental results, and highlight related work.
6.2 SoC Device Overview
We implement the entire system on an Altera DE1-SoC board, which contains a Cyclone V
SoC device. This Cyclone V SoC device, as seen in Figure 6.1, comprises an ARM-based Hard
Processor System (HPS) along with FPGA fabric on the same die. The HPS has two ARM
Cortex-A9 processors, each with independent 32 KB L1 instruction and data caches. Cache
coherency between the separate L1 caches in the two processors is maintained by the snoop
control unit (SCU). There is also one 512 KB shared L2 cache connected to a DDR3 SDRAM
memory controller and the L3 interconnect. The L3 interconnect enables communication with
the memory-mapped peripherals of the HPS, such as the micro-SD card socket, Ethernet,
timers, etc. The L3 interconnect also enables communication to and from circuits implemented
within the FPGA fabric. Circuits in the FPGA can be slaves and/or masters attached to the
L3 interconnect via the FPGA bridge. Also, the FPGA can directly communicate with the
SDRAM memory controller. This DE1-SoC board has a 1 GB DDR3 SDRAM that can be
accessed using the SDRAM memory controller.
The memory layout, as seen by the processor, is divided into three main regions: SDRAM,
!"#$%&'()(*%!
!"
#$%&'()$$&(%
#$*%'+(%,)$-.&/,*%&'*
0
1+%2+%-
3',%&'
1+%2+%-
.&45&'
#$2+%-
.&45&'
3&,/6%*
.&45&'
1+%2+%-7+88&'
0 0
.9:
9
7
#$2+%-7+88&'
0 0
.9:
9
7
3&,/6%*-7+88&'
0 0
.9:
9
7
0%4%+*
;+88&'455'&**&*
,$2+%<;+88&'<&$%'=<()+$%
9((&>&'4%)'
?)$%')>>&'
>4=&'<%=2&
)+%2+%<;+88&'<&$%'=<()+$%?)@2+%&
A$,%
4(%,B4%,)$<%=2&
*6,8%,$/<5,*%4$(&
*6)+>5<')%4%&<,$
*6)+>5<')%4%&<)+%
C:9
: :
0
0
*%4'% 8,$,*6
DEF9
GE0: 0
5@4<4((&**
9.:-?)'%&HI9J-:E?)'&
?EA-K ?EA-L
!L-?4(6&* !L-?4(6&*
9?E 0?A
!M-?4(6&
0C.9:
?)$%')>>&'
0
188I(6,2
CC."
0C.9:
:
0
!"#$%&'()$%&*"+%
,-".%'()$%&*"+%
Figure 6.1: Overall system integration.
FPGA slaves and the HPS peripheral region. The lower 3 GB of addressable space is allocated to the SDRAM, and 960 MB of space is allocated to addressing FPGA slaves. The remaining
address space is available for the HPS peripherals. The FPGA’s view of memory is slightly
different. Addresses between 2 GB (0x8000 0000) and 3 GB (0xBFFF FFFF) are mapped to
the accelerator coherency port (ACP). The ACP allows FPGA peripherals to access data in a
cache-coherent manner, by routing transfers through the SCU and L2 cache.
To ease the software development, we run the Linux operating system (OS) on the HPS.
The 1 GB DDR memory is used as the main memory by the processor. In addition, a 64 GB micro-SD card is available to the HPS and managed as part of the file system of the OS (i.e., it
functions similarly to a disk). Compared to a bare-metal system, a great advantage of using an
OS is the convenience of dynamic memory allocation and the support for virtual memory. This
eliminates the requirement that we manually manage the limited memory space. The OS also
allows us to use the C++ Standard Template Library to develop a better-engineered software
framework (Section 3.4). Similarly, the presence of the OS makes it convenient for us to leverage
several required libraries. Our software framework uses the Google protocol buffer to define
and parse the input model configuration file (Section 3.4.2). The trained model parameters
are saved in HDF5 file format [32], allowing efficient/flexible read/write access to the disk file
using the provided library. The OpenCV library [10] is used in our software framework for
pre-processing images in standard formats (JPEG or PNG) into a fixed size 3-channel (RGB)
matrix compliant to the input format of the AlexNet model.
On the FPGA side, the hardware resources include 87 variable-precision DSP blocks, 397
M10K RAM blocks, 85K logic elements and 128K registers. Given the available resources, we set the accelerator parameters, MO and MI, both to 8. That is, at every clock cycle of computation,
the compute unit receives 8 input neurons and 64 weights from the input and weight buffers,
and accumulates 8 output neurons in parallel. When the accumulated partial sums need to be
rotated-in/out, a set of 8 output neuron partial sums will be moved from/to the output buffer.
For the buffers, we choose the depth to be 1024 entries. Recall that the input, output and
weight buffers are organized as MI , MO and MO × MI arrays in software, respectively, and
are synthesized by LegUp into the same number of RAM modules (Section 5.5.3). Each RAM
module will have a depth of 1024 entries. In our design, the data stored in the buffers are in
16-bit heterogeneous fixed-point format. Hence, each RAM module can be implemented with
two M10K RAM blocks, where each RAM block is configured in 1024×8 (depth×width) mode.
In total, the input, output and weights buffers will occupy 2 × (8 + 8 + 8 × 8) = 160 M10K RAM
blocks.
6.3 System Integration
We now describe the overall system integration, as shown in Figure 6.1. On the FPGA side,
the accelerator controller, compute unit, buffers and buffer writer and readers are designed
in C and synthesized by the LegUp HLS tool. Three additional modules, named instruction
registers, status and DMA were developed to enable the integration of the FPGA accelerator
with the HPS. They are implemented in hand-written Verilog. We use Altera's Qsys system
integration tool to connect these modules with the HPS, following the Avalon Memory Mapped
(AMM) interface specification. We elaborate further on the custom-designed modules below.
6.3.1 The Instruction Registers Module
As Section 5.5.1 described, the C-implementation of the accelerator controller receives each
instruction field as an individual function argument. To provide the input arguments to the
accelerator controller, the instruction registers module implements each instruction field as
a register and exposes these registers on the output port. These output registers are then
directly connected to the instruction argument ports of the accelerator controller (the arrows
connecting the instruction registers module and the accelerator controller in the upper-left
corner of Figure 6.1). Observe that the instruction registers module has an AMM slave interface,
allowing the HPS to read/write instructions in a memory-mapped style.
6.3.2 The Status Module
The status module is created for the HPS to start the accelerator execution and monitor execu-
tion status. The start and finish signals of the accelerator controller are directly connected
to the status module. The status module asserts the start signal when it receives a memory
mapped write (at a preset address) from the HPS. When the accelerator controller completes
execution and asserts the finish signal, the status module will update an internal register
representing the completion of the accelerator controller. The HPS can poll the value of this
completion register using a memory mapped read of the corresponding address.
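Host-side use of the status module therefore reduces to a memory-mapped write followed by a polling loop, roughly as sketched below (the register offsets and the status pointer are hypothetical placeholders; the pointer would come from the mmap-based mapping described in Section 6.3.5):

#include <stdint.h>

#define STATUS_REG_START 0  /* hypothetical offset: write to assert start */
#define STATUS_REG_DONE  1  /* hypothetical offset: non-zero once finish seen */

void RunAcceleratorTile(volatile uint32_t *status) {
    status[STATUS_REG_START] = 1;        /* memory-mapped write asserts start */
    while (status[STATUS_REG_DONE] == 0)
        ;                                /* poll the completion register */
}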
6.3.3 The Accelerator Controller Interface
It is worthwhile to briefly explain the accelerator controller interface. Recall that the accel-
erator controller is synthesized by LegUp as a sequential function with a pipelined inner loop
(Section 5.5.1). The outputs of the accelerator controller are the control signals written to the
FIFOs connected to the other accelerator components. Hence, the interface of the accelerator
controller consists of the sequential function interface shown in Figure 5.1, and several RVD
interface ports connecting to the FIFOs. As seen in Figure 6.1, the sequential function interface (black arrows) comprises the instruction arguments from the instruction registers module and the
start and finish signals connected to the status module. The RVD interfaces connecting to the control signal FIFOs are shown as blue arrows.
6.3.4 The DMA Module
As mentioned earlier, the computation involved in neural network inference involves a large
number of model parameters and intermediate values (neuron values in the hidden layers). The
limited on-chip storage is not sufficient to contain all the data. So, we use the on-board (off-
chip) SDRAM as the main memory to store these parameters and intermediate values, with
the help of virtual memory in the case of insufficient physical space. To reduce the off-chip
memory traffic, a layer of computational work is partitioned into tiles, wherein the required
data in each tile can fit in the on-FPGA buffers and is reused during the tile’s computation. A
DMA module was created to speed up the data transfer between the off-chip memory (SDRAM)
and the on-chip buffers.
The DMA has two AMM master interfaces and one AMM slave interface. The first master
interface is connected to the off-chip memory on the HPS side and the other master interface
is connected to the on-chip buffers. The slave interface receives DMA requests from the
HPS. A DMA request specifies the following information: 1) the data transfer direction, from
off-chip memory to on-chip buffer or vice versa; 2) the size of data to be transferred; and 3)
the starting addresses in memory and buffers. We will explain the interface in the next section
and describe the implementation of the DMA in Section 6.4.
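Conceptually, a request therefore carries four fields, as in the illustrative struct below (the field names and widths are ours; the actual register encoding is fixed by the hand-written Verilog):

#include <stdint.h>

typedef struct {
    uint32_t to_buffer;   /* 1: off-chip memory to on-chip buffer; 0: reverse */
    uint32_t size_bytes;  /* amount of data to transfer */
    uint32_t mem_addr;    /* physical starting address in the SDRAM */
    uint32_t buf_addr;    /* starting address in the on-chip buffers */
} DmaRequest;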
6.3.5 Interconnect
In the hybrid processor/accelerator system, there are three independent sets of interconnects,
each with one master interface (as labelled in Figure 6.1). All of the slave interfaces are
connected to one master only. To begin, consider the master interface on the L3 interconnect.
This master interface is used by the processor to read/write memory mapped slaves on the
FPGA, including the instruction registers, status, DMA (request slave) and buffers. Each slave
interface will be assigned a base address in Qsys. Since the memory space of the FPGA slaves
begins at the third gigabyte of the HPS memory layout, the HPS master needs to offset the
base addresses with 0xC000 0000 to access these slave components. Given that our software is
executed within the operating system, the software process only sees the virtual memory space
and does not have direct access to the physical memory. To handle this, we use the Linux
mmap system call to create mappings in the virtual address space as follows: For each slave,
a region of the process’s virtual memory will be mapped to the corresponding region in the
/dev/mem device file, which is an image of the processor's physical memory space. Through
this approach, the software can access the slave components by reading/writing from/to the
mapped virtual addresses.
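A sketch of this mapping step is shown below; SLAVE_PHYS_BASE and SLAVE_SPAN are placeholders for a particular slave's Qsys base address (offset by 0xC000 0000) and address span:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define SLAVE_PHYS_BASE 0xC0000000u  /* placeholder slave base address */
#define SLAVE_SPAN      0x1000u      /* placeholder slave span */

/* Map one FPGA slave into this process's virtual address space. */
volatile uint32_t *MapSlave(void) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, SLAVE_SPAN, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, SLAVE_PHYS_BASE);
    close(fd);  /* the mapping remains valid after the fd is closed */
    return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}

Reads and writes through the returned pointer then reach the slave's registers or buffer entries directly.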
The DMA module has two master interfaces as mentioned above, one for the access to off-
chip memory and one for the on-chip buffers. All accesses to off-chip memory are routed to
the ACP port so that DMA can access cache-coherent data. This is done by offsetting the L3
interconnect address with 0x8000 0000 (Section 6.2).
We now turn to the slave interface of the on-chip buffers. The on-chip buffers
are implemented with M10K RAM blocks, which are dual-port memories. We allocate one of
the two ports to the buffer readers and writer, and the other port for the HPS master and
DMA master. To allow both HPS and DMA masters to access the same memory port, one
straightforward approach is to create one slave interface for the memory port and connect both
masters to the slave interface. However, for such an interconnect setup, the Qsys tool will
generate arbitration logic for the slave to avoid access contention between two masters. The
arbitration logic can cause overhead on circuit area, timing and latency, and is also unnecessary
since our software can always avoid access contention by making sure that only one master is
accessing the on-chip buffers at a time. Therefore, we opted for a different approach, shown
in Figure 6.1. We create two slave interfaces for each buffer and connect them separately to
the two masters. A multiplexer is inserted in front of the memory port to steer the accesses
from the two masters. The select signal of the multiplexers is an output from the DMA, named dma_access. The DMA unit can assert this signal to steer the DMA access to the memory ports. When the DMA unit receives a request from the HPS, it will first assert the dma_access signal before starting data transfer, and lower the signal after the transfer completes. When this signal is low, only the HPS master has access to the memory port. Our software ensures
that the DMA transfer and the direct access from HPS master do not happen concurrently.
The interconnect setup described above allows us to transfer data between off-chip memory
and on-chip buffers in two ways: The first way is to use the connection from the HPS master
to the on-chip buffer slaves. In this case, the data transfer will be controlled by the software
running on the processor. Again, we can use the above mentioned mmap method to create a
mapping in the software process’s virtual address space for the on-chip buffers. The software
can then use pointers to reference mapped virtual addresses and read/write from/to the on-
chip buffers by de-referencing the pointers or using functions such as memcpy or memset. The
other way to access the buffers is to use the DMA unit. In this case, the software running
on the HPS issues transfer requests to the DMA and the DMA transfers the data accordingly.
Comparing the two approaches, DMA transfer can provide more data bandwidth by pipelining
the transfers with customizable data width. However, due to two limitations of the DMA unit (described in Section 6.4.2), there are scenarios where data transfer cannot be done
with the DMA. Therefore, the connections from L3 interconnect master to the on-chip buffer
slaves are retained in the system in order to perform data transfer for the scenarios that are
not supported by the DMA.
6.4 Data Movement Optimization
6.4.1 Addressing Scheme for the On-Chip Buffers.
In this section, we describe several design decisions related to data transfer. The first item that
we want to discuss is the addressing scheme on the slave interface of the on-chip buffers. There
is a speed advantage to transferring data between contiguous address spaces in both source and
destination storage for the following reasons: 1) spatial locality can improve cache performance,
2) successive accesses to an open row in the SDRAM can reduce read/write latency, and 3) since
each DMA request/descriptor can only specify the transfer between two contiguous address
spaces, a transfer between non-contiguous address spaces requires multiple requests between
contiguous segments. Hence, the DMA request overhead can also be reduced if data transfers
are organized as large contiguous addresses. As a consequence of these issues, the addressing
scheme on the slave interface of the on-chip buffers should be designed to enable large transfers
between contiguous address spaces.
Before we describe the addressing scheme, we first review the data layout in off-chip memory
and on-chip buffers. Figure 6.2 shows the data layout in off-chip memory for a fully-connected
layer. The regions of data with coloured highlights are involved in a tile of computation, and
are to be transferred to the on-chip buffers (Section 4.3.1). The data highlighted in the same
colour are stored at contiguous addresses in the off-chip memory. For example, the output or
input neurons used in a computation tile are stored at contiguous addresses. For the weights,
which are stored in memory as a 2-dimensional row-major order array, weights[No][Ni], the
!"#$%&'( weights[No][Ni]
)#
*+,-&..+"-/0+'(input[Ni]
1-&,-&.+"-/0+'(output[No]
21
2#
)1
31
3#
Figure 6.2: Data layout in off-chip memoryfor a tile in fully-connected layer.
!"#$#!%
&'()'(#$#*+)'(,'--./0
!%
!"#1#!%
2.%34(5#,'--./0
Figure 6.3: Data layout in on-chip buffers fora tile in fully-connected layer.
consecutive weights associated with the same output neuron are stored contiguously in memory. That is, each row of weights in the figure is stored at contiguous memory addresses (highlighted
in one colour), but different rows of weights are not stored contiguously (highlighted in different
colours).
Figure 6.3 shows how the data used in a tile of computation are stored in the on-chip buffers.
Squares highlighted in the same colour in both figures represent the same data. For example,
the first row of weights in Figure 6.2 is stored at the bottom-left corner of the weights buffer
in Figure 6.3. Recall that the data widths of the output, input and weights buffers are Mo, Mi and Mo × Mi elements, respectively. At every clock cycle, the accelerator needs to read/write one
entry of data from all three buffers (as highlighted in orange). Each entry in the output or
input buffer contains Mo or Mi consecutive neurons. The data in consecutive entries of the
output/input buffers are stored contiguously in the off-chip memory. As such, the addressing
scheme for the slave interfaces of the output and input buffers is straightforward. The data
width of the slave interface is as wide as the entry width. The consecutive entries in the buffer
can be addressed contiguously through the slave interface.
In the accelerator design, the weights buffer must provide Mo groups of weights every clock
cycle. Each group of weights corresponds to one of the Mo output neurons, and the weights in
the same group are associated with the connections from the output neuron to the Mi input
neurons in the preceding layer. Hence, the weights stored in one weights buffer entry (the orange
box in Figure 6.3) are associated with Mo output neurons, and are not laid out at contiguous
addresses in the memory (the orange square in Figure 6.2). To encourage transfers between
contiguous address spaces in off-chip memory and on-chip buffers, we divide the weights buffer
into Mo buffer banks. Each buffer bank is associated with a slave interface with a data width
equal to Mi elements. The consecutive entries in a buffer bank are assigned with contiguous
addresses. By doing so, we can transfer a row of consecutive weights from off-chip memory to
a buffer bank using contiguous addresses on the slave interface.
Figure 6.4: Data layout in off-chip memory for a tile in a convolutional layer (filters[OFM_Z][K_Y][K_X][IFM_Z], ifm[IFM_Y][IFM_X][IFM_Z] and ofm[OFM_Y][OFM_X][OFM_Z] arrays).
Figure 6.5: Data layout in on-chip buffers for a tile in a convolutional layer (input, output and weights buffers of widths Mi, Mo and Mo × Mi).
For the convolutional layers, the data layouts in off-chip memory and on-chip buffers are
shown in Figure 6.4 and Figure 6.5, respectively. The 3-dimensional input, output feature maps
and filters are stored as row-major-order arrays in memory, where the consecutive elements
(at the same <x,y> location) along the depth dimension are contiguous in memory. The
colour-highlighted regions of data in Figure 6.4 are the ones used in a tile of convolutional
layer computation (Section 4.3.2). Again, data highlighted in the same colour are contiguous
in memory. Since we tile along the depth dimension of the output feature maps, only the
neurons at the same <x,y> location are contiguous in memory. For input feature maps, tiling
is performed on the row and column dimensions, but not on the depth dimension. Hence, the
input neurons in the same horizontal xz-plane of the tile are contiguous. Since we always move
entire filters to the on-chip buffer (not partial filters), the filter weights to be transferred are
contiguous.
On the on-chip buffer side, the input/output neurons that are contiguous in off-chip memory
are also stored at consecutive buffer entries. Therefore, the slave interface setup of input and
output buffers used for fully-connected layers is also suitable for convolutional layers. Similar
to the fully-connected layer, each entry in the weights buffer (the orange box in Figure 6.5)
includes Mo groups of filter weights. To allow the transfer of an entire filter to a contiguous
address space in weights buffer, we can use the same buffer bank approach as mentioned above.
As shown in the figure, an entire filter resides at consecutive entries of a buffer bank. With the
consecutive buffer bank entries being addressable using contiguous addresses, we can transfer
an entire filter between contiguous address spaces in both off-chip memory and the on-chip
buffer.
To summarize, we create one slave interface for each of the output and input buffers. The
data widths are as wide as the entry widths of the buffers, which are Mo and Mi elements.
As we choose both Mo and Mi to be 8 and use 16-bit fixed-point format for the data stored
in the buffers, the data widths of the output and input buffers are 16 bytes. For the weights
buffers, a slave interface is created for each of the Mo buffer banks. The buffer bank entries
are Mi elements wide. Hence, the data width of the buffer bank interface is also 16 bytes wide.
For all these slave interfaces, consecutive buffer entries or buffer bank entries are accessed with
contiguous addresses. Since all the slave interfaces of on-chip buffers are 16 bytes wide, we also
set the data width of the L3 interconnect slave interface to 16 bytes so that no data width
conversion is needed (on the interconnect) and the entire interconnect data width is also 16
bytes wide.
6.4.2 Custom DMA Design
We now describe the DMA unit design used in the system. Instead of using the existing DMA
IPs provided by the FPGA vendor (Altera), we opted to design a custom DMA unit. Altera
provides two types of DMAs, a basic DMA unit and a scatter-gather DMA unit [7]. For the
basic DMA unit, the data transferred must be located in contiguous addresses in both the
source and destination storage, whereas the scatter-gather DMA unit can transfer and merge
non-contiguous memory to a contiguous address space, and vice versa. The scatter-gather
style is unnecessary for our system, since most data transfer is between contiguous addresses.
Moreover, the basic DMA unit provided by Altera had several limitations, elaborated upon
below.
Unaligned Starting Address
A first limitation is that Altera’s basic DMA unit requires both the read and write starting
addresses to be aligned with the size of the individual transfers. The individual transfer size
refers to the data size that is transferred at each clock cycle. The individual transfer size
is specified for each DMA transfer request, with the options being 1, 2, 4, 8 and 16 bytes.
The address alignment requirement can be a significant drawback on the transfer bandwidth,
depending on the starting addresses. For example, if we aim to transfer 160 bytes of data from
byte address 0x2 of the off-chip memory to byte address 0x0 of the input buffer, the largest
allowable individual transfer size will be 2 bytes, requiring a total of 80 individual transfers.
However, 160 bytes of data can be ideally transferred with just 10 transfers, using an individual
transfer size of 16 bytes, matching the interconnect data width. In our case, data is stored
in the software process’s virtual memory and managed by the OS and the processors’ memory
management unit. Our software does not have direct control of the data location in the physical
memory (virtual-to-physical address translation is discussed in Section 6.4.3). This means that, in our system, the starting address of the data in physical memory is not always aligned to the interconnect data width of 16 bytes. To allow maximum utilization of the available data width on the interconnect, our custom DMA unit is designed to work around the alignment requirement between the starting address and the individual transfer size.
Figure 6.6 shows our solution for unaligned address data transfer from off-chip memory
to on-chip buffers, implemented with two assumptions that are valid in our use case: 1) the
!"#$%&'
()')*'+*+,$-.&/*%01123
45445!45"456457893):;8')3'&,<;)993=6>!?
@(ABC*9)')*)'*)993288*,@(ABC*9)')*)'*)993288*,D!E
45F45G45E
! ! !
! ! !
Figure 6.6: Implementation with shift registers and multiplexers to handle unaligned SDRAMstarting address.
starting address in the off-chip memory side is always a multiple of 2 (16-bit fixed-point data
type), and 2) the starting address on the on-chip buffer side is always a multiple of 16 (buffer
entries are 16 bytes wide). We use two registers to hold two consecutive words (each word is
16 bytes wide) read from the memory. The data to be written to the on-chip buffer is selected
from the registers based on the second to the fourth least significant bits of the starting address
of memory. Note that the select signal does not change during a DMA transfer. For the above
example where the memory starting address is 0x2, the DMA starts by reading the first word
at address 0x0 from the memory and the second word at address 0x10. With the first two
words placed in the registers, we can select the initial data transferred to on-chip buffers based
on the starting address, whose bits 3-1 are 0x1 (in orange). Since the DMA unit is pipelined,
a new word is expected to be read from the memory every clock cycle. The two registers are
implemented as shift registers where the word on the left will be shifted to the right and the
new incoming word will be placed on the left. To transfer 160 bytes of data starting at address
0x2, we will need to read 11 words from the off-chip memory. For the last word whose address
starts at 0xA0 in memory, only the first two bytes are written to the on-chip memory. The
solution for transferring data from on-chip buffers to an unaligned address in off-chip memory
is done in a similar way, but requires some additional logic to control the byte enable signals.
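In software terms, the multiplexer selection of Figure 6.6 behaves like the sketch below (our model, not the Verilog itself), where older and newer are the two buffered 16-byte words:

#include <stdint.h>
#include <string.h>

/* Select the next 16-byte output word from two consecutive memory words.
   Bit 0 of the starting address is always 0 (16-bit data), so bits 3-1
   give the byte offset of the first valid byte within a word. */
void SelectWord(const uint8_t older[16], const uint8_t newer[16],
                uint32_t start_addr, uint8_t out[16]) {
    unsigned offset = start_addr & 0xEu;        /* bits 3-1 as a byte offset */
    memcpy(out, older + offset, 16 - offset);   /* tail of the older word */
    memcpy(out + (16 - offset), newer, offset); /* head of the newer word */
}

For the running example (starting address 0x2), offset is 2: each output word is the last 14 bytes of one memory word followed by the first 2 bytes of the next.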
DMA Interface
Altera’s basic DMA unit has two master interfaces, one dedicated for read and the other dedi-
cated for write. To allow transfers in both directions between the off-chip memory and the on-
chip buffers, we will need to connect both master interfaces to the memory and on-chip buffers.
When a slave is connected to more than one master, Qsys needs to resolve the potential access
contention by inserting arbitration logic, which causes overhead in area and performance. In
our use case, another overhead of such a connection setup is the need for address span exten-
ders [8]. Since the DMA masters access the off-chip memory via the L3 slave interface whose
address span is already 4 GB, both master interfaces would need an address width of more than
32 bits in order to also connect to the on-chip buffers. However, the basic DMA unit provided by Altera does not support addresses wider than 32 bits on its master interfaces. Hence, an address
span extender would be required for each of the read and write master interfaces so that they
can access both the off-chip memory and the on-chip buffers.
Our custom DMA unit uses a different master interface design. We dedicate one master
interface for read/write access to the off-chip memory and another for read/write access to the
on-chip buffers. Since every slave interface is only connected to one master, we can avoid the
overhead of using arbitration logic. Also, the master interface that connects to the off-chip
memory does not connect to other slaves. We therefore do not need to use the address span
extender.
FIFO-Less Implementation
Altera’s basic DMA unit requires a shallow FIFO whose width equals the word width, having
a minimum depth of 4. In a DMA unit design, when the access to a slave component has
variable latency, FIFOs are useful to relax the backpressure and improve throughput. However,
given that our interconnect allows the DMA to have exclusive access to the on-chip buffers
(Section 6.3.5), the read/write access from the DMA to the on-chip buffers can always be
performed in a fixed latency of 1 clock cycle. As such, our custom DMA does not need to use
any FIFO and will not cause performance/throughput degradation.
Limitation of the Custom DMA Unit Design
Our DMA unit design has two limitations: 1) the minimum transfer size of a DMA request is
equal to the word size (16 bytes), and 2) the DMA unit does not support writing a sequence of
zeros to a destination address (also not supported by Altera’s basic DMA unit). For scenarios
wherein the data transfer size is less than 16 bytes or stale content in the on-chip buffers needs
to be cleared, we resort to having the processor handle these cases by calling the memcpy and
memset functions.
6.4.3 Translation From Virtual Addresses to Physical Addresses
When the DMA unit transfers data between the off-chip memory and the on-chip buffers, it
needs to use physical addresses to access the data. We can configure the addresses of the on-
chip buffers using the Qsys integration tool. However, since our software runs on top of the OS,
where the data is allocated in the virtual address space, the physical addresses of relevant data
in off-chip memory are not directly accessible. In Linux, a software process’s virtual memory
space is divided into 4 KB virtual pages. The upper 20 bits of a 32-bit virtual address are the
virtual page number and the lower 12 bits are the byte-address offset inside the page. To obtain
the corresponding physical address of a virtual address, we need to translate the virtual page
number to the physical frame number and apply the same byte offset within the page/frame. Since
the OS manages the limited physical memory space using virtual memory and paging, when the
physical memory is full, a virtual page may be swapped to the disk (the micro-SD card) by the
OS. Therefore, we require a mechanism to lock the data to be transferred in physical memory,
before getting the corresponding physical addresses. This is needed because OS page swaps
may happen at anytime, leading to physical address unpredictability without such a locking
mechanism.
We can use the Linux system calls, mlock and munlock, to lock and unlock, respectively, part
of the software process’s virtual memory space in the physical memory. These two functions
take in two arguments: 1) the starting address in virtual memory, and 2) the number of bytes
of data to be locked/unlocked. Once the mlock function is called, the virtual pages that contain
a part of the specified address range are guaranteed to remain in the memory (assuming the
function returns successfully), until the corresponding munlock function is called.
To translate a virtual address to a physical address, we utilize the page-map feature in
the Linux kernel. The Linux kernel provides a file for each process located at the path
/proc/pid/pagemap (pid refers to the process ID). It allows the process to find out which
physical frame each virtual page is mapped to. The file is in binary format and contains a list
of 64-bit values, one for each virtual page. These 64-bit values are listed in the order of virtual
page numbers. If a virtual page is present in the physical memory, the lower 55 bits of its
corresponding 64-bit value will be the physical frame number; and the leftmost bit (bit 63) will
be 1, indicating that the page is present. For example, to obtain the physical frame number for
the k-th virtual page, we read the 64-bit value stored at the byte address k × 8 and parse the
lower 55 bits.
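As an illustration, a minimal user-space translation routine could look as follows. This is our sketch of the mechanism just described, not the thesis code; error handling is reduced to returning 0.

    #include <cstdint>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    // Translate a virtual address of the calling process to a physical address.
    // The page is first locked with mlock so that the mapping remains valid.
    uint64_t virt_to_phys(const void *virt) {
        const uint64_t page_size = static_cast<uint64_t>(sysconf(_SC_PAGESIZE));
        const uint64_t vaddr = reinterpret_cast<uint64_t>(virt);

        if (mlock(virt, page_size) != 0)  // pin the page in physical memory
            return 0;

        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0)
            return 0;

        // One 64-bit entry per virtual page, stored at offset (page number) * 8.
        uint64_t entry = 0;
        ssize_t n = pread(fd, &entry, sizeof(entry), (vaddr / page_size) * 8);
        close(fd);

        if (n != static_cast<ssize_t>(sizeof(entry)) || !(entry & (1ULL << 63)))
            return 0;  // read failed or page not present (bit 63 clear)

        uint64_t pfn = entry & ((1ULL << 55) - 1);   // lower 55 bits: frame number
        return pfn * page_size + vaddr % page_size;  // add the in-page byte offset
    }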
It is desirable to reduce the overhead of page swap and page-map parsing as they are run-
time intensive. When computing a convolutional or fully-connected layer, some data needs to
be transferred from the off-chip memory to the on-chip buffer more than once. Instead of doing
page swap and page-map parsing before each DMA transfer, we choose to perform these steps
once at the beginning of the layer computation. For all the data that will be transferred to
the on-chip memory, we first lock all virtual pages containing the data, then translate virtual
page numbers to physical frame numbers and store the mapping. Before each DMA transfer,
the physical frame numbers can then be directly looked up from the mapping; page swap or
page-map parsing is not needed. At the end of a layer of computation, we unlock all the virtual
pages. Because of paging, the data in a contiguous virtual address range is not necessarily
stored contiguously in the physical memory; however, it is guaranteed that the data within the
same page will be stored contiguously. Thus, when transferring a large chunk of contiguous
data in virtual memory, we need to issue a separate DMA request for each virtual page/physical
frame. The maximum length of each DMA transfer is limited to 4 KB (the page size).
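The per-page request generation can be sketched as follows; lookup_frame and issue_dma_request are hypothetical names standing in for the stored page-to-frame mapping and the DMA request interface of our framework (handling of the DMA unit's 16-byte minimum transfer size is omitted).

    #include <cstddef>
    #include <cstdint>

    constexpr uint64_t kPageSize = 4096;  // Linux page size

    uint64_t lookup_frame(uint64_t virtual_page);                 // assumed hook
    void issue_dma_request(uint64_t phys_addr, std::size_t len);  // assumed hook

    // Split a contiguous virtual-address range into one DMA request per page,
    // since consecutive virtual pages need not be physically contiguous.
    void dma_copy_range(uint64_t vaddr, std::size_t len) {
        while (len > 0) {
            std::size_t in_page = kPageSize - (vaddr % kPageSize);  // bytes left in page
            std::size_t chunk = (len < in_page) ? len : in_page;
            uint64_t paddr = lookup_frame(vaddr / kPageSize) * kPageSize
                             + vaddr % kPageSize;
            issue_dma_request(paddr, chunk);
            vaddr += chunk;
            len -= chunk;
        }
    }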
6.5 Clock Frequency Optimization
We now briefly describe several changes made to raise the clock frequency of the accelerator
and DMA unit. The first change is the target clock period for LegUp HLS synthesis. LegUp
accepts a configuration file as input, specifying the synthesis settings and constraints. One of
the constraints is the target clock period. We reduce the target clock period from the default
20 ns to 5 ns. With this change, the number of pipeline stages of the generated compute unit
circuit is increased from 6 to 21. Using Altera’s Quartus II synthesis tool to place and route the
generated circuit in isolation (targeting the DE1-SoC board), we observe the achievable Fmax
is increased from 63 MHz to 153.6 MHz (2.44×).
After this change, we observed the timing-critical paths to be within the stall logic circuitry,
originating from several output FIFOs’ full signals to the compute unit’s data path. These
output FIFOs are used to pass partial sums from the compute unit to the output buffer writer.
In fact, the output buffer writer will never stall during execution and will not cause backpressure.
Consequently, these FIFOs will never be full, allowing us to manually “ground” the full output
signals of the FIFOs, eliminating the critical paths.
Another optimization was applied to the interconnect. There are two interconnect networks
wherein the master connects to a large number of slave components. One of the two DMA unit
master interfaces connects to all of the on-chip buffers, including two slave interfaces for input
and output buffers and eight slave interfaces for the weight buffer banks. The L3 interconnect
master interface is connected to the instruction registers, status module, DMA unit request
slave and all on-chip buffers. These connections appeared on the critical path as they required
large multiplexers on the response path (for read data returned from the slaves). The solution
was to insert pipeline bridges1 in the connections to break the big multiplexers down to smaller
ones. Given that the FPGA fabric uses 6-input LUTs, a 4-to-1 multiplexer can fit into a single
LUT (4 input bits and 2 select bits). Figure 6.7 gives an example of how an interconnect
from one master to ten slaves can be formed with 4-to-1 multiplexers by inserting pipeline
bridges. However, we cannot insert pipeline bridges within the interconnect from the DMA
master unit to the on-chip buffers. The reason is that the custom DMA unit is designed with
an assumption that read access to the on-chip buffers has a fixed latency of 1 clock cycle,
thereby allowing the DMA unit to be implemented without the use of FIFOs. Adding a pipeline
bridge to the interconnect would increase the latency and violate this assumption. To work around
this limitation, we will need to update the custom DMA design to support a parameterizable
but fixed read latency. We leave this to future work.
!" !# !$ !% !& !' !( !) !* !+
,-./01
!"#$%"&$'()"*+$', !"#$%"&$'()"*+$'-
Figure 6.7: Example of forming the response path of an interconnect with 4-to-1 multiplexers(s* stands for slave interfaces).
6.6 Additional Optimizations
6.6.1 Using the Direct SDRAM Interface
As will be shown in the experiment results in Section 6.8.1, we observed that improving the
clock frequency for the accelerator and DMA did not reduce data transfer time. To investigate
1Pipeline bridge is a Qsys IP core that acts as a bridge between its connected master and slave components. Pipeline registers inside the IP can help to break timing-critical interconnect paths.
!"#$%&'()(*%!
!"#$%&'()$$&(%
*+%,+%-.+//&'
0
123
2
.
#$,+%-.+//&'
0
123
2
.
4&567%8-.+//&'
932
3 3
0
:;<2
=;03 0
213->)'%&?@2A-3;>)'&
>;B-C >;B-D
!D->E(7&8 !D->E(7&8
2>; 0>B
!F->E(7&
09123>)$%')GG&'
0
*//@(75,991"09123
0
123
2
.
>G)(H-5$('&E8&I-/')J-KC3=L-%)-DCC3=L
2G%&'$E%5M&N-+8&-%7&-I5'&(%-5$%&'/E(&-%)-
09123
.)%%G&$&(H-)$-IE%E-%'E$8/&'-OE$IP5I%7
Figure 6.8: Interconnect from DMA master to SDRAM.
this issue, we measured the DMA transfer bandwidth in isolation by repeatedly moving 4 KB of
data back-and-forth between on-chip buffers and HPS memory (at contiguous physical addresses
in the same physical frame). Since 4 KB is well below the size of L1 cache (32 KB), we
expected that most of the data would be stored in the L1 cache during the transfer experiment.
Moreover, 4 KB (the page size in Linux) is also the maximum DMA transfer size that would
be allowed when virtual memory is used. Thus, such a setup reflects the ideal bandwidth that
can be achieved during actual execution. With the DMA and accelerator clocked at 50 MHz,
the transfer bandwidth from HPS memory to an on-chip buffer is 229 MB/s, while the transfer
bandwidth from an on-chip buffer to HPS memory is 367 MB/s. However, the maximum transfer
bandwidth that can be achieved on the interconnect from the DMA master to L3 interconnect
slave (highlighted with an orange dashed line in Figure 6.8) is equal to 800 MB/s (16 bytes data
width multiplied by 50 MHz clock). We thus concluded that the transfer bandwidth bottleneck
is on the L3 interconnect side (highlighted with red dashed line).
An alternative approach is to use the direct interface to the SDRAM controller (highlighted
with blue dashed line), which bypasses the L3 interconnect. We measured the transfer band-
width of this direct SDRAM interface between HPS memory and the on-chip buffers. The same
setup described above is used, except the DMA clock is set to 150 MHz so that the measured
bandwidth will not be limited by the interconnect bandwidth, which now is increased from 800
MB/s to 2.4 GB/s. The measured read/write bandwidths from/to the DDR3 SDRAM are 1.16
GB/s and 1.18 GB/s, corresponding to 5× and 3.2× more bandwidth than using the L3 slave
interface. However, with the direct communication between the FPGA fabric and the SDRAM
controller, the DMA unit can no longer access cache-coherent data, since the direct interface
does not go through the accelerator coherency port (ACP). To work around this, a solution
would be to clean the caches before read access by the DMA unit to the DDR3 memory, and
invalidate the caches after write access by the DMA unit to the DDR3 memory. Unfortunately,
such operations cannot be done by a user-space process, requiring a tedious process of updating
the OS kernel and creating system calls for our software framework to clean and invalidate the
caches. We therefore chose a more direct approach, which is to use the Linux mmap system call2
to allocate non-cacheable physical memory for storing the neurons and weights in the software
containers. This is done by creating a custom C++ template container class that allocates mem-
ory using the mmap system call. It is worth noting that the memory allocated through the mmap
system call is not managed by the OS virtual memory. This brings two benefits: 1) We no
longer need to translate virtual addresses to physical addresses nor are page swaps required; 2)
The maximum DMA transfer size is no longer limited to the OS page size (4 KB). However, the
run-time of computation performed on the processor, namely maxpooling and local response
normalization in AlexNet, is significantly increased as the data is no longer cached.
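A minimal sketch of such a container is shown below. The real framework maps physically-contiguous, non-cacheable memory provided through a kernel driver (backed by dma_alloc_coherent; see the footnote); MAP_ANONYMOUS is used here only to keep the example self-contained, so the sketch illustrates the container structure rather than the exact mapping flags.

    #include <cstddef>
    #include <new>
    #include <sys/mman.h>

    // Container whose storage is obtained from mmap instead of the heap
    // (illustrative; the actual class would map a driver file descriptor).
    template <typename T>
    class MmapBuffer {
    public:
        explicit MmapBuffer(std::size_t n) : bytes_(n * sizeof(T)) {
            void *p = mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, /*fd=*/-1, /*offset=*/0);
            if (p == MAP_FAILED) throw std::bad_alloc();
            data_ = static_cast<T *>(p);
        }
        ~MmapBuffer() { munmap(data_, bytes_); }
        T &operator[](std::size_t i) { return data_[i]; }
        const T &operator[](std::size_t i) const { return data_[i]; }
        T *data() { return data_; }
    private:
        T *data_;
        std::size_t bytes_;
    };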
6.6.2 Control Loop Optimization in Accelerator Controller
This optimization aims to improve the efficiency of the control loops in the accelerator controller,
specifically for the control of convolutional layer computation. Recall that the accelerator
controller is first implemented in C and synthesized to Verilog using LegUp HLS tool. The
C implementation consists of 6 nested loops: The first 3 nested loops iterate along the row,
column and depth dimensions of the output feature map tile to select the output neurons to
be computed. The next 3 nested loops iterate along the row, column and depth dimensions
of the receptive field in the input feature map tile and the 3-D filters (ref. Section 4.3.2). In
the innermost loop, the accelerator controller issues a new set of control signals (e.g. the buffer
addresses to load inputs, load weights, and store outputs) to the buffer readers and writers,
and compute unit.
2The mmap function uses the standard Linux kernel call dma_alloc_coherent to request physically-contiguous memory regions. In the default Linux kernel, dma_alloc_coherent does not allocate physically-contiguous memory more than 0.5 MB in size. To allow the allocation of large amounts of physically-contiguous memory, the Linux OS that we are using has enabled the contiguous memory allocator (CMA) feature of the Linux kernel, allowing the dma_alloc_coherent call to allocate up to 512 MB of physically-contiguous memory [6].
A common limitation in any HLS tool is that a pipelined loop or a pipelined function
cannot contain any control flow (i.e. branch or loop). Therefore, only the innermost loop can
be pipelined in the HLS-generated hardware. In our case, the generated hardware will include
a finite-state machine (FSM) that realizes the behaviour of the 6 nested loops, and a pipelined
datapath that corresponds to the loop body of the pipelined inner-most loop. The hardware
functions as such:
1. When the FSM reaches the state corresponding to the innermost loop, the pipelined
datapath starts execution by launching a new innermost loop iteration every clock cycle.
2. After the last iteration is launched, the FSM will wait for the pipeline to flush (i.e. the
final inner-most loop iteration to finish).
3. Then, the FSM proceeds and transits among the states corresponding to the outer loops,
and eventually comes back to the innermost loop state again.
The above steps repeat until all outer-loop iterations complete. In such an implementation,
there are many clock cycles spent waiting for the pipeline to flush, as well as transitions among
the outer loop states, considering that these steps repeat as many times as the product of loop
counts of the five outer loops. During these cycles, the accelerator controller is not generating
new control signals for the compute unit, resulting in low utilization of the compute unit. To
eliminate these “wasted” cycles, we collapsed the 6 nested loops into a single loop, where the
loop count of the collapsed loop equals the product of the loop counts of the 6 nested loops.
With such an implementation, the generated hardware can continuously launch a new iteration
every clock cycle, without waiting for pipeline to flush or FSM to transit among the outer loops
that were originally present. Note that this change is done entirely in C.
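The transformation can be sketched as follows; the loop bounds and the control-signal helper are hypothetical names standing in for the actual controller code.

    void issue_control_signals(int orow, int ocol, int odep,
                               int krow, int kcol, int kdep);  // assumed hook

    // Collapsed control loop: one pipelined loop whose trip count equals the
    // product of the six original loop counts. The if-chain advances the
    // six-dimensional index in place, so the body contains no inner loops
    // and the HLS tool can launch a new iteration every clock cycle.
    void control_loop(int OR, int OC, int OD, int KR, int KC, int KD) {
        long total = (long)OR * OC * OD * KR * KC * KD;
        int orow = 0, ocol = 0, odep = 0, krow = 0, kcol = 0, kdep = 0;
        for (long i = 0; i < total; ++i) {
            issue_control_signals(orow, ocol, odep, krow, kcol, kdep);
            if (++kdep == KD) { kdep = 0;
                if (++kcol == KC) { kcol = 0;
                    if (++krow == KR) { krow = 0;
                        if (++odep == OD) { odep = 0;
                            if (++ocol == OC) { ocol = 0; ++orow; } } } } }
        }
    }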
6.6.3 Reuse of Data Between Input Feature Map Tiles
This optimization is based on an observation that there exists a data re-use opportunity between
consecutive input feature map tiles. Recall that the tiling software traverses and tiles along
the row and column dimensions of the input feature map (as shown by the orange arrow in
Figure 6.9). As seen in the figure, there is a large region of data overlap between the first input
feature map tile (highlighted in red) and the next input feature map tile (highlighted in orange
box). Given the overlap, after the first tile of computation finishes, the tiling software may not
need to transfer all of the data in the next tile to the on-chip input buffer. However, due to
the organization of data in the on-chip buffer (in row-major order), the retiring data from the
!"#$%
!"#$&
Figure 6.9: Previoustraversal order in tiling.
!"#$%
!"#$&
Figure 6.10: A data re-use-friendly traversal order.
!"#$%&$''()*
+,-(./
+,-(.0
+,-(.1
Figure 6.11: Using input buffer ina circular organization.
first tile (the left vertical slice in the first tile) would be interleaved among the on-chip buffer,
making it difficult to store the new data of the next tile (the right vertical slice in the orange
box) in the on-chip buffer. This problem disappears if we interchange the traversal order on
the x-, y- dimensions, such that the consecutive tiles are vertically aligned. As Figure 6.10
shows, the second tile (highlighted in green box) is one row below the first tile. The new data
required by the second tile (highlighted in light green) can be stored at the next empty entries
in the on-chip buffer, following the data in the first tile (Figure 6.11). All of the input feature
map data involved in the second tile of computation is still stored in contiguous buffer entries.
For the third tile, assuming the input buffer no longer has any empty entries, the new data
(highlighted in light yellow) can be stored at the beginning of input buffer, in a circular manner,
replacing the retired data. To cope with the circular input buffer, the accelerator controller
should be modified so that it will issue correct buffer-entry addresses to the input buffer reader.
The required changes include: 1) the instruction sent by the tiling software is updated with
a new field specifying the starting entry index in the input buffer for the current tile; 2) the
accelerator controller operates in the same way as if the starting index is zero, except that the
issuing address to the input buffer reader is first offset by the specified starting entry index,
modulo the buffer depth, as sketched below.
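A minimal sketch of the offset computation (names hypothetical):

    // "start_entry" comes from the new instruction field; "local_addr" is the
    // entry index the controller would issue if the tile started at entry 0.
    unsigned circular_addr(unsigned start_entry, unsigned local_addr,
                           unsigned buffer_depth) {
        return (start_entry + local_addr) % buffer_depth;
    }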
6.7 Verification
We now highlight aspects of the verification flow for our acceleration solution, specifically veri-
fication of the convolutional and fully-connected layers. The verification flow is as follows:
1. Initialize a test.
(a) Randomly generate the dimensions for a convolutional or fully-connected layer. For
a fully-connected layer, the dimensions are the number of output and input neurons.
For a convolutional layer, the dimensions include the widths, heights and depths of
output feature maps and 3-dimensional filters, as well as the padding and stride to be
applied when performing convolution on the input feature maps. These parameters
will define the dimensions of the input feature maps.
(b) Randomly configure the heterogeneous fixed-point representation for each of the input,
output and weights matrices. The bit widths of all fixed-point representations will
be 16 bits wide so that they match with our hardware accelerator setup.
(c) Fill in random values for the input and weights matrices. The random values are
within the representable value ranges of the heterogeneous fixed-point format.
2. Generate the golden output using a reference software implementation that has been pre-
viously verified. The reference implementation refers to the software that we developed
for the fixed-point experiments described in Chapter 3. It has been verified in a separate
verification flow.
3. Obtain the test output computed by our acceleration hardware implementation.
4. Compare the test output against the golden output to confirm correct functionality.
Table 6.1: Design components being tested at each development/verification flow.

Design Components            Software            RTL                Hardware + Software
under Testing                Verification        Simulation         on SoC FPGA Board
Tiling Software              ✔                   -                  ✔
Instruction Generation       ✔                   ✔                  ✔
Data Layout in Buffers       ✔                   ✔                  ✔
Data Transfer                -                   -                  ✔
Acceleration Design          HLS-synthesizable   HLS-generated      Hardware Circuit
                             C-implementation    Verilog            on FPGA
We adopt Google’s C++ test framework – Google Test [2] to implement the verification flow. The
implementation of the accelerator design can be divided into three phases. The same verification
flow can be used to test design components at each implementation phase, as shown in Table 6.1.
In the first phase, we design the accelerator as a LegUp-synthesizable C implementation. Since
this C implementation is compilable by a standard software compiler, we can compile and execute
this C-implementation along with the translation software that implements the two backend
API functions for convolutional and fully-connected layers (Section 4.2). This allows us to
perform the first phase verification completely in software. Given a randomly-generated test
input vector, the translation software will perform tiling, generate instructions, transfer data
to on-chip buffers, and invoke the C implementation of the accelerator to compute each tile. Since
the custom DMA unit and interconnect are not implemented in software, we cannot test data
transfer in this verification flow. To mimic data transfer, we use memcpy to copy data from/to
the arrays that represent the on-chip buffers in C.
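To give a flavour of this phase, one randomized test could be structured as follows with Google Test. The struct and helper functions below are hypothetical stand-ins for the framework's test setup, the verified reference implementation, and the path through the translation software into the C implementation of the accelerator.

    #include <gtest/gtest.h>
    #include <cstdint>
    #include <vector>

    struct LayerConfig {  // dimensions plus fixed-point formats (simplified)
        int num_inputs;
        int num_outputs;
    };

    // Assumed framework hooks, not the thesis code.
    LayerConfig RandomFcLayerConfig();
    std::vector<int16_t> RandomInputs(const LayerConfig &);
    std::vector<int16_t> RandomWeights(const LayerConfig &);
    std::vector<int16_t> ReferenceFcLayer(const LayerConfig &,
                                          const std::vector<int16_t> &,
                                          const std::vector<int16_t> &);
    std::vector<int16_t> AcceleratorModelFcLayer(const LayerConfig &,
                                                 const std::vector<int16_t> &,
                                                 const std::vector<int16_t> &);

    TEST(FullyConnectedLayer, RandomTestMatchesGolden) {
        LayerConfig cfg = RandomFcLayerConfig();        // step 1: random test
        auto inputs = RandomInputs(cfg);
        auto weights = RandomWeights(cfg);

        auto golden = ReferenceFcLayer(cfg, inputs, weights);       // step 2
        auto test = AcceleratorModelFcLayer(cfg, inputs, weights);  // step 3

        ASSERT_EQ(golden.size(), test.size());                      // step 4
        for (std::size_t i = 0; i < golden.size(); ++i)
            EXPECT_EQ(golden[i], test[i]) << "mismatch at output neuron " << i;
    }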
The second verification phase refers to the RTL simulation. Since there is no simulation
model for the ARM processor, we choose to only simulate the LegUp-generated Verilog of our
accelerator at this step. Recall that the accelerator receives the instruction from the translation
software and performs a tile of computation accordingly. Therefore, the simulation can be done
at the granularity of a tile of computation. That is, each simulation test run corresponds to a
tile of computation, where the input test vector includes the instruction and the content of input
and weights buffers, while the test output will be the content of output buffers. The test input
and golden output can be obtained from running the above software verification flow. We insert
additional code into the translation software to save the instruction and buffer contents into
files for each tile of computation. For simulation, we prepare a simple testbench in Verilog.
The testbench does the following: 1) read the input test vector and golden output from the
files; 2) initialize the contents of RAM modules that implement the input and weights buffers;
3) present the instruction to the accelerator controller’s input ports; 4) start the accelerator’s
execution and wait for its completion; and 5) compare the content of output buffer against the
golden output. In this verification flow, we can verify the data layout in the on-chip buffers,
the generated instruction and the LegUp-generated Verilog of the accelerator.
The last verification flow is the complete system verification on the SoC FPGA board.
This test covers all of the final hardware and software implementation for the convolutional
and fully-connected layers. All the hardware components shown in Figure 6.1 are tested, in-
cluding the HPS, accelerator, custom DMA, instruction registers, status module and system
interconnect. On the software side, the tested components are the translation software and the
data-transfer implementation, including virtual page locking, virtual-to-physical address trans-
lation and DMA request generation. Note that the same verification software is used in this
test and executed on the HPS with the translation software invoking the hardware accelerator
on FPGA.
6.8 Experimental Results
In this section, we will present and discuss experimental results for the accelerator design. Sub-
section 6.8.1 reports the run-time (performance) results for the accelerator implementation with
optimizations successively applied. Area and power results are reported in Subsections 6.8.2
and 6.8.3. Our hardware system is synthesized using Altera's synthesis tool, Quartus II
15.0. The clock frequency and area results are obtained from Quartus II synthesis reports. The
run-time results are obtained by adding wall-clock time measurement code into the software
framework. For power measurements, we use a DC power generator to supply the required
DC voltage to the FPGA board and measure the power consumption based on the generator’s
current reading.
6.8.1 Performance Results
We present the run-time breakdown per image inference of AlexNet in Table 6.2. The run-
time of AlexNet inference can be broken down into two parts: 1) the computations on the
accelerator and 2) computations on the ARM processor. The ARM processor computes three
layer types: maxpooling, local response normalization and depth-concatenation3. Given that
the HPS contains a dual-core ARM processor, we divide the computation of both maxpooling
and local response normalization layers into two parts, each computed by one software
thread. The total run-time spent on these three layers is 90.3 ms. For the computation on
accelerator, we created several versions of the hardware, each having different optimizations
applied, as shown in the table and elaborated upon below.
Initial Version
We begin by looking at the initial version, which is our first fully functional version that supports
AlexNet. This implementation does not include the data movement optimization (Section 6.4)
or the clock frequency optimization (Section 6.5). Without the use of DMA, all data transfer is
3Recall that the AlexNet model is split into two partitions, where in the original paper, each partition was trained on one GPU. Communication between the two GPUs only happens at certain layers. To support this model in our framework, when a convolutional layer requires input feature maps from both partitions, we need to copy two input feature map matrices into a single input feature map matrix. The depth-concatenation layer is created for this purpose and only does memory copy.
Table 6.2: Run-time breakdown per image inference of AlexNet (unit in milliseconds).

                                     Cyclone V SoC                          Arria V SoC
                           Initial   Data       Fmax      Direct     Increase   Control   Data Re-use
                           Version   Transfer   Opt.      SDRAM      Accel.     Loop      Between
                                     Opt.                 Interface  Size       Opt.      Conv. Tiles
Clock Frequency on FPGA    50 MHz    50 MHz     120 MHz   150 MHz    120 MHz    120 MHz   120 MHz
Convolutional on FPGA:
  Accelerator              315.6     315.6      161       111.2      45.0       23.5      23.3
  Data Transfer            769       248.8      258.3     68.7       42.6       42.6      9.6
  Tiling (on ARM)          7.1       7.1        7.1       4.2        2.3        2.3       2.1
  Total                    1091.7    571.5      426.4     184.1      89.9       68.4      35.0
Fully-Connected on FPGA:
  Accelerator              25.4      25.4       13.3      9.2        2.8        2.8       2.8
  Data Transfer            7174      534.9      561       66.0       41.3       41.3      40.7
  Tiling (on ARM)          8.1       8.1        8.1       3.5        2.2        2.2       2.0
  Total                    7207.5    568.4      582.4     78.7       46.3       46.3      45.5
Compute on ARM (Cyclone V SoC / Arria V SoC):
  Maxpooling               43 / 80.3
  LRN                      45.5 / 51.8
  Depth-Concat.            1.8 / 7.1
  Total                    90.3 / 139.2
Total                      8389.5    1230.2     1099.1    402.0      275.4      253.9     219.7
done by the HPS processor calling the memcpy function. The clock frequency of the accelerator
was reported to be ∼60 MHz by static timing analysis, but we set the clock frequency at 50 MHz
for the measurements. In this version, the run-times for convolutional and fully-connected layers
are 1091.7 ms and 7207.5 ms, respectively. For both of these layers, the majority of the run-
time is spent on data transfers (70.4% for convolutional layers and 99.2% for fully-connected
layers). The data transfer time spent on the fully-connected layers is ∼10× more than that for
the convolutional layers, due to the fact that ∼90% of the AlexNet model parameters are the
weights in fully-connected layers.
Data Transfer Optimization
Table 6.3: Time spent on data transfer between memory and on-chip buffers (unit in milliseconds).

                                                          Transfer w/ Custom DMA
Implementation           Initial   Address           DMA        Page Swap &            Total
Versions                 Version   Scheme Setup      Transfer   Address Translation
Convolutional Layers     769       640.1 (1.2×)      237.5      9.3                    246.8 (3.1×)
Fully-connected Layers   7174      3774 (1.9×)       433.9      101                    534.9 (13.4×)
In the second version, we focus on improving data transfer performance. The first opti-
mization is the address scheme setup for the weights buffer described in Section 6.4.1. We
split the weights buffer into eight buffer banks, each with its own contiguous address space in
the memory-mapped interconnect. This change allows a larger chunk of consecutive weights
to be stored contiguously in the on-chip buffers, and hence improves data transfer efficiency.
As shown in Table 6.3, this change improves the data transfer performance by 1.2× and 1.9×
for convolutional and fully-connected layers, respectively. The second optimization is to use
the custom DMA unit. To do so, we need to explicitly swap in the virtual pages containing
the data to be transferred and parse the pagemap file to obtain the physical addresses. The
time spent on DMA transfer, as well as page swap and address translation are reported in the
table. The data transfer speedups over the initial version are 3.1× and 13.4× for convolutional
and fully-connected layers, respectively. The reason for the speedup difference between the two
layer types is mainly because data to be transferred for convolutional layers is more fragmented
than that for the fully-connected layers, and hence, it benefits less from DMA transfer. With
both optimizations to the data transfer, we are able to reduce the total run-time per image
inference to 1230.2 ms, corresponding to a 6.8× speedup.
Clock Period Optimization
The third implementation version improves the clock frequency for the accelerator and DMA
unit, which share the same clock domain. This change aims to speed up both data transfer
and accelerator computation. After applying the clock frequency optimizations described in
Section 6.5, the achievable Fmax is reported at ∼124 MHz by the Quartus II synthesis tool.
Figure 6.12: Screenshot of chip planner on the Cyclone V SoC device.
Figure 6.13: Screenshot of chip planner on the Arria V SoC device.
We observe that the achievable clock frequency is largely limited by congestion and high resource
usage of the FPGA fabric, as seen in Figure 6.12. According to the Quartus II synthesis report,
the design reaches 76% logic utilization and uses 70% RAM blocks and 74% DSP blocks. We
set the clock frequency to 120 MHz when measuring the run-time. As presented in Table 6.2,
the time spent on accelerator computation is proportionally reduced by ∼2.3×. However, the
time spent on data transfer does not decrease as expected and even increases slightly. This
is due to the fact that the data transfer bottleneck is not on the interconnect between DMA
unit and L3 interconnect slave, but on the path from L3 interconnect to ACP and caches (as
described in Section 6.6.1).
Port to Arria V SoC and Use of the Direct SDRAM Interface
Our fourth implementation aims to overcome the data transfer bottleneck limited by the L3
interconnect, as well as the congested resource usage on the FPGA. We port the design onto a
larger device, an Arria V SoC FPGA (on the Arria V SoC development board), which contains
the same HPS component and ∼5.5× more logic elements, ∼5.7× more RAM blocks and ∼12.5×
more DSPs than available on the Cyclone V SoC FPGA on the DE1-SoC board. We also
change the interconnect from the DMA unit to the HPS SDRAM to use the direct SDRAM
interface (the FPGA fabric connects directly to SDRAM without passing through the processor
subsystem). As described in Section 6.6.1, the DMA unit can no longer access cache-coherent
data via the direct SDRAM interface, and hence, the mmap system call is used to allocate non-
cacheable memory for storing the neurons and weights. With the system implemented on the
Arria V SoC device, we can achieve an Fmax of ∼152 MHz for the clock used by the accelerator
and DMA unit. The Fmax is improved because the congestion issue is alleviated by having
more resources available on this Arria V device, as seen in Figure 6.13. The run-time results
are reported in Table 6.2, with clock frequency set at 150 MHz for the accelerator and DMA.
We can see that the time spent on accelerator computation is again proportionally reduced
to 111.2 ms and 9.2 ms for convolutional and fully-connected layers, respectively. For data
transfer, compared to the second implementation (Column "Data Transfer Opt."), the run-
time is reduced by 3.8× and 8.5× for convolutional and fully-connected layers, respectively. The
reduction of the data transfer time is not only due to the improved data transfer bandwidth from
using the direct SDRAM interface, but also because the time spent on page swap and virtual-to-
physical address translation is eliminated with the use of mmap system call for memory allocation
(Section 6.6.1). However, since the neurons and weights are no longer cached, the run-time of
the layers computed on the ARM processor is significantly increased, from a total of 90.3 ms
to 139.2 ms. In this implementation, the total run-time per image inference is reduced from
the previous 1099.1 ms to 402.0 ms.
Increased Accelerator Size
Given that more hardware resources are available on the Arria V SoC device, we opted to
increase the accelerator size by incorporating more multipliers and adders in the compute unit
and enlarging the on-chip buffers. We increase the compute unit parameters Mo and Mi from 8
to 16 such that the compute throughput per clock cycle is improved from 64 MAC operations to
256 MAC operations. That is, the compute unit now reads in 16 input neurons, 256 weights and
computes 16 output neurons every clock cycle. The on-chip buffer widths are also increased
according to the data bandwidth requirement of the compute unit. The widths of the input and
output buffers are increased from 8 elements to 16 elements. The number of weight buffer
banks is increased from 8 to 16, while the width of each buffer bank is also increased from
8 elements to 16 elements. We kept the same buffer depth of 1024 entries for all buffers. In
summary, the storage capacity of input, output and weights buffers are respectively increased
by 2×, 2× and 4×.
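Behaviourally, one clock cycle of the enlarged compute unit corresponds to the following sketch (a software model of the hardware, with Mo = Mi = 16):

    #include <cstdint>

    constexpr int Mo = 16;  // output neurons computed in parallel
    constexpr int Mi = 16;  // input neurons consumed in parallel

    // One cycle: each of the Mo accumulators adds the dot product of the
    // Mi inputs with its own Mi weights, i.e. Mo x Mi = 256 MACs per cycle.
    void mac_cycle(const int16_t in[Mi], const int16_t w[Mo][Mi], int32_t acc[Mo]) {
        for (int o = 0; o < Mo; ++o) {
            int32_t sum = 0;
            for (int i = 0; i < Mi; ++i)
                sum += static_cast<int32_t>(in[i]) * w[o][i];
            acc[o] += sum;
        }
    }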
With the larger accelerator design, the reported Fmax of the clock for the accelerator and
DMA unit is reduced to ∼119 MHz. We over-clock slightly and set the clock frequency at
120 MHz. As such, the overall accelerator compute throughput is expected to be improved
by 3.2× (4 × 120/150). According to our run-time measurements (Table 6.2), the accelerator
compute time spent on fully-connected layers is reduced by 3.2×, proportional to the increase
of compute throughput. However, the accelerator compute time spent on convolutional layers is
only reduced by 2.5×. This is because the accelerator controller “wastes” a significant number
of clock cycles waiting for the control pipeline to flush and on the branches between the non-
pipelined nested loops, resulting in lower utilization of the compute unit (Section 6.6.2).
The data transfer time is also reduced by ∼1.6× for both convolutional and fully-connected
layers. The larger buffer size allows an increase in tile size and hence requires fewer tiles,
resulting in fewer input and weight transfers to on-chip buffers. In other words, the increase
in tile size permits more data reuse and therefore reduces the data transfer traffic. Fewer tiles
also implies less time spent in the tiling software.
Overall, this change reduces the total run-time per image inference from 402.0 ms to 275.4 ms.
Loop Optimization in the Accelerator Controller
This optimization is to overcome the low utilization of the compute unit caused by the inefficient
loop implementation in the accelerator controller, specifically for convolutional layers. A loop
collapsing technique is applied to the loop nest in the accelerator controller, as described in
Section 6.6.2. This optimization allows the pipelined collapsed loop to continuously generate a
new set of control signals for the compute unit and buffer readers and writers every clock cycle,
during an entire tile of computation. With this change, the accelerator compute time spent on
convolutional layers is reduced from 45 ms to 23.5 ms, corresponding to a 1.9× speedup.
Data Re-use Between Input Feature Map Tiles
The last implementation incorporates the optimization described in Section 6.6.3, which allows
a fraction of an input feature map tile to be re-used by the subsequent tile. In addition, we
further increase the depths of all buffers from 1024 entries to 2048 entries, doubling the storage
capacity of all buffers. By doing so, the weights buffer can simultaneously store all the 3-D
filters of any of the five convolutional layers in AlexNet. This means that the tiling along
the depth dimension of the output feature maps is no longer needed. Consequently, an input
feature map tile can be factored into all of the output feature maps in one tile of computation,
eliminating the need for transferring the same input feature map tile a second time. Combining
both changes, the data transfer time spent on convolutional layers is reduced from 42.6 ms to
9.6 ms (4.4×), as shown in Table 6.2. Overall, the run-time per image inference is reduced to
219.7 ms.
6.8.2 Hardware Resource Usage
Results on the Cyclone V SoC FPGA Device
Table 6.4 shows the FPGA resource usage of the system on the Cyclone V SoC FPGA. The
results are obtained from the last version implemented on the DE1-SoC board, with optimiza-
tions on data movement and clock frequency. Several design entities have significant differences
in resource usage between the final and the initial versions. Their resource usage in the ini-
tial version are listed in the parentheses. In the accelerator, the compute unit and the FIFOs
connecting the pipelined modules consume the most logic elements and registers. In the last
version, the target clock period setting in LegUp HLS is set to 5 ns, the generated compute unit
Table 6.4: Resource usage of the final implementation on the Cyclone V SoC FPGA device (numbers in parentheses correspond to the first version on the Cyclone V SoC FPGA device).

Design Entity                    Logic Elements    Registers         Mem. Bits  M10Ks  DSPs
Accelerator:
  Compute Unit                   3,500 (3,200)     9,600 (2,700)     -          -      56
  On-Chip Buffers                -                 -                 1,310 K    160    -
  Buffer Readers & Writers       140               330               -          -      -
  Streaming FIFOs                2,100             2,800             60 K       102    -
  Accelerator Controller         1,200             1,200             -          -      8
  Total                          7,000 (6,700)     14,000 (7,000)    1,370 K    262    64
Interconnect                     7,300 (12,100)    23,200 (9,200)    -          -      -
Custom DMA                       1,500 (-)         1,400 (-)         -          -      -
Status Module                    9                 4                 -          -      -
Instruction Registers            147               345               -          -      -
Total                            16,000 (16,900)   39,000 (18,000)   1,370 K    262    64
hardware has 21 pipeline stages and requires 3,500 logic elements and 9,600 registers, which are
1.1× and 3.6× more than that in the initial version (having 6 pipeline stages generated with
a 20 ns target clock period setting). The compute unit contains 64 16-bit multipliers, which
can be theoretically implemented with only 32 DSPs. However, to ease the circuit routing
and achieve better timing, the FPGA synthesis tool does not always use the most compact
implementation. In this case, the synthesis tool chooses to use 56 DSP units.
The high resource usage of streaming FIFOs is due to two reasons. First, LegUp currently
requires all the data that flows between pipelined functions to be passed through FIFOs. As
a result of this limitation, our design needs to use a total of 104 FIFOs. Second, we did not
fine-tune the FIFO depths in this version of the design and set the depth to 20 words for all
FIFOs. These FIFOs also use a significant amount of the 102 M10K RAM blocks.
The on-chip buffers use 160 M10K RAM blocks, all configured in 8 × 1024 (width × depth)
mode. Buffer readers and writers are small modules, which only use 140 logic elements and 330
registers in total. Besides the accelerator, the largest design entity is the system interconnect
generated by Qsys (Altera’s on-chip bus interface generator tool). In the initial version, 12,100
logic elements and 9,200 registers are used by the interconnect. With the addition of pipeline
bridges in the final version, we observe a 40% decrease in logic element usage and a 2.5× increase
in register usage.
In total, the last version of the implementation on the Cyclone V SoC FPGA device uses
16 K logic elements, 39 K registers, 262 M10K RAM blocks and 64 DSP blocks.
Results on Arria V SoC FPGA Device
Table 6.5: Resource usage of the final implementation on the Arria V SoC FPGA device (numbers in parentheses correspond to the first version on the Arria V SoC FPGA device).

Design Entity                    Logic Elements    Registers         Mem. Bits           M10Ks        DSPs
Accelerator:
  Compute Unit                   8,500 (3,300)     39,000 (9,600)    -                   -            240 (56)
  On-Chip Buffers                -                 -                 9,440 K (1,310 K)   1,152 (160)  -
  Buffer Readers & Writers       70 (100)          560 (330)         -                   -            -
  Streaming FIFOs                990 (2,000)       1,870 (2,800)     2 K (30 K)          24 (102)     -
  Accelerator Controller         1,100 (1,200)     1,300 (1,180)     -                   -            9 (8)
  Total                          10,700 (6,600)    42,700 (13,900)   9,442 K (1,340 K)   1,176 (262)  249 (64)
Interconnect                     9,100 (6,900)     10,400 (6,000)    -                   -            -
Custom DMA                       2,800 (1,330)     2,770 (1,500)     -                   -            -
Status Module                    5                 3                 -                   -            -
Instruction Registers            147               342               -                   -            -
Total                            22,800 (15,000)   56,000 (22,000)   9,442 K (1,340 K)   1,176 (262)  249 (64)
Table 6.5 presents the hardware resource usage results of our final implementation on the
Arria V SoC FPGA Device. The numbers in the parentheses are from the first implementa-
tion on the Arria V SoC, corresponding to the fourth implementation in Table 6.2 (labelled
“Direct SDRAM Interface”). Comparing the final and initial versions on Arria V SoC, the
final version incorporates several additional optimizations, including the increased accelerator
size (Section 6.8.1), loop optimization in the accelerator controller (Section 6.6.2) and the data
re-use optimization between consecutive input feature map tiles (Section 6.6.3).
As the accelerator parameters Mo and Mi are increased from 8 to 16, the resource usage of
both the compute unit and on-chip buffers are significantly increased. For the compute unit,
the logic element usage is increased by 2.6×, the register usage is increased by 4.1× and DSP
usage is increased by 4.3×. For on-chip buffers, 7.2× more memory bits and M10K RAM blocks
are used.
In the final version, we made two design changes to reduce the resource usage of streaming
FIFOs. The first change is to remove the FIFOs that were previously present between input
and weights buffer readers to the compute unit. This was done by inlining the C functions
for the input and weights buffer readers into the compute unit function. If these FIFOs had
not been eliminated, with the size of accelerator increased, the number of FIFOs used in the
accelerator would have increased from 104 to 312 (as the compute unit reads/writes more data
from/to buffers every clock cycle). With this change, the number of FIFOs dropped to 40. The
second change is to fine-tune the FIFO depths according to the pipeline schedule reported by
the LegUp HLS tool. The resource usage of FIFOs is greatly reduced for logic elements (2×),
registers (1.5×), memory bits (15×), and M10K RAM blocks (4.3×). The loop optimization
in the accelerator controller does not have much impact on the resource usage as seen in the
table. In total, the final version of the accelerator uses 10 K logic elements, 43 K registers, 1 K
M10K RAM blocks and 249 DSP units.
For the interconnect, the resource usage is also increased due to the change of accelerator
size. First, we changed the data width of the interconnect to match the enlarged on-chip
buffer width. Also, more weights buffer slave components are added to the interconnect since
there are now more weights buffer banks. Resource usage of logic elements and registers are
respectively increased by 1.6× and 3.1×. We also increased the data width of the custom DMA to
match the interconnect data width, resulting in 1.3× and 1.7× higher logic element and
register usage, respectively.
The final implementation on Arria V SoC FPGA uses 23 K logic elements, 56 K registers,
1 K M10K RAM blocks and 249 DSP units.
6.8.3 Power Results
We use a DC power generator to measure the power consumption of our last implementation
on the DE1-SoC (Cyclone V SoC FPGA); the generator supplies the required 6 V DC voltage
to the FPGA board, and the power is measured based on its current reading. When the board
is powered on with the OS idle on the HPS, the total power usage is 6 W. When the accelerator is
active, the total power usage increases slightly to 6.15 W. It is worth noting that there are
many components besides the FPGA device on the DE1-SoC board, including the Ethernet
PHY, VGA output, video decoder, etc. We expect that these components also consume a
considerable percentage of the total static power.
The Arria V SoC development board is equipped with a power measurement circuit. The
development kit includes power monitoring software that allows real-time measurements of
power consumption. Figure 6.14 shows a screenshot of the power monitoring software. Each
of the monitor screens measures one of the power rails of the HPS and FPGA fabric. The
four measuring rails on the left are for HPS and the top three rails on the right are for FPGA
fabric. Looking at the HPS core power and HPS DDR3 devices power, we can break our
system execution into two parts: 1) the software running on the HPS first initializes the neural
network model and loads pre-trained model parameters from disk; 2) the software and hardware
accelerator on FPGA performs inference execution. As seen in the figure, during initialization,
!"#$%&'($)&*('
+,-$./0$!"#$1123$0(45%(6
+,-$./0$!"#$5/7('/.8$./0$)('5)9('.8$
0(45%(6
+,-$./0$:";<$5/7('/.8$./0$
)('5)9('.8$0(45%(6
+,-$./0$:";<$112$0(45%(
:";<$%&'($)&*('=$7'./6%(54('=$./0$
%8&%>
+/575.85?.75&/ +/@('(/%($AB(%C75&/
Figure 6.14: Power monitoring of HPS and FPGA fabric on the Arria V SoC developmentboard.
both the HPS core power and the HPS I/O and DDR3 device power increase to ∼736
mW and ∼135 mW, respectively. During inference execution, the peak readings of the HPS core
power, HPS DDR3 device power, and FPGA core power are ∼840 mW, ∼522 mW, ∼2725 mW,
respectively. Combining all power rails for HPS and FPGA, the peak power readings of the
HPS and FPGA are ∼2.5 W and ∼2.8 W, respectively.
6.9 Related Work
In this section, we discuss related work on hardware acceleration of neural network computation.
In [13], the authors propose an ASIC implementation of a neural network accelerator called
DianNao. As seen in Figure 6.15, the proposed accelerator architecture bears similarity to
our own. The compute unit (NFU ) is designed to concurrently accumulate 16 output neurons
with the sum of products of 16 inputs and 16 sets of weights. This is roughly equivalent to
our architecture with both Mo and Mi equal to 16. In addition to convolutional and fully-
connected layers, the proposed architecture also supports max-pooling layers. This is done
by adding comparator trees to form max operators on the side of the adder tree (at NFU-2).
Three on-chip buffers are used for storing the input neurons (NBin), weights (SB) and
output neurons (NBout). Each on-chip buffer is augmented with a DMA unit to improve
the data transfer speed from/to off-chip DRAM. They also created a control processor that
consists of three code generators – one for each of the three supported layer types. Each code
generator creates “instructions” for controlling the buffers and the compute unit. The authors
implemented the accelerator design in custom Verilog and synthesized the ASIC implementation
to a TSMC 65nm GP standard Vth library. The reported Fmax is at 980 MHz. The area and
power consumption are ∼3 mm² and 485 mW. For run-time measurements, the cycle latencies
were gathered from a cycle-accurate C++ simulator, with the bandwidth to main memory (off-
chip DRAM) set to 250 GB/s. The run-time benchmarking is individually done on several
different neural network layers. Compared to a 128-bit 2GHz SIMD processor, the proposed
accelerator is 117.87× faster. For the largest layer in their experiment, the 5-th convolutional
layer in AlexNet, the proposed accelerator is about ∼500× faster than the SIMD baseline (the
actual run-time result is not reported in the paper).
Figure 6.15: Neural Network Accelerator Architecture of DianNao [13].
The authors further propose a multi-core implementation in [14]. This design aims to
improve the computational throughput and reduce the impact of limited memory bandwidth
to off-chip DRAM. One of the main design characteristics is that all neural network model
parameters (weights and biases) are stored in the on-chip storage. In this multi-core design, each
compute unit (NFU ) is associated with several dedicated RAM blocks for storing the weights.
The weights are kept stationary to the compute units, while the input and output neurons are
transferred among the compute units, since the number of weights is an order of magnitude
greater than the number of input and output neurons. Compared to an NVidia K20M GPU,
the performance speedups, on average, for the 16-, 64-, 256- and 1024-unit implementations
are respectively 21.4×, 79.8×, 216.7×, and 450.7×. For the four implementations, the average
energy reductions are 330.6×, 323.7×, 276× and 150.3×, respectively, compared to the GPU
baseline. The K20M GPU's area is about 550 mm², while a 16-unit DaDianNao implementation
is 67.7 mm². However, a major limitation is that their design does not support any model
having more weights than can be accommodated in on-chip storage. The DaDianNao design
nevertheless illustrates the performance benefits of reducing off-chip data transfer by storing
all the weights on chip. Given that the Arria V SoC has a large number of RAM blocks, we
should consider an implementation that maximizes the on-chip buffer capacity to store as many
weights as possible.
Figure 6.16: Neural Network Accelerator Architecture in [33].
For a DNN hardware accelerator design on FPGAs, the authors in [33] proposed the ac-
celerator architecture shown in Figure 6.16. The compute unit is also capable of processing
multiple outputs in parallel with multiple input neurons and weights. For on-chip storage, they
use two sets of buffers to implement double-buffering, so that data transfer can be overlapped
with computation. In addition, they obtain an optimized compute unit structure (the number
of outputs (Mo) and inputs (Mi) to be processed in parallel) for the five convolutional layers in
AlexNet, with the consideration of available DSP resources, buffer sizes and off-chip memory
bandwidth. They report that the optimal structure is to have Mo equal to 64 and Mi equal
to 7 (roughly 1.75 times bigger than our final implementation on Arria V SoC FPGA). For
the five convolutional layers, the design achieves a computation time of 21 ms, outperforming
a baseline 16-thread software implementation on an Intel Xeon CPU (2.2 GHz) by 3.64×,
with a board power consumption of 18.6 W.
Compared to this implementation, the run-time spent on convolutional layers in our final
implementation is 1.67× more than theirs. However, we use 1.75× fewer multipliers and adders.
Using the multiplier usage as area cost, the area-delay product of our implementation is 8,960
(256 multipliers × 35 ms), which is 1.05× better than their area-delay product of 9,408 (448
multipliers × 21 ms). Their implementation uses 32-bit floating-point multipliers and adders,
which require significantly more DSP units. Moreover, the actual compute time spent by our
accelerator is 23 ms. We believe the total run-time of our implementation can be further reduced
if the data transfer and tiling software execution are overlapped with accelerator computation.
Their design only supports convolutional layers, whereas our solution is a complete end-to-
end system that supports all the layer types in AlexNet and handles the problems one would
encounter in a real application, such as the virtual memory management in an OS.
6.10 Summary
The chapter presents the overall implementation of the complete processor/accelerator hybrid
system for neural network inference, including the system integration, custom DMA unit, inter-
connect, and memory system. We also present optimizations for improving data transfer, clock
frequency, and hardware utilization. A unified verification flow is used to test our implemen-
tation at different development stages. We present the run-time and resource usage results for
a series of incrementally optimized implementations. Lastly, we compared our proposed design
to related works.
Chapter 7
Conclusion
7.1 Summary of Contributions
We implemented a DNN inference accelerator, synthesized with LegUp HLS, that operates in
conjunction with an embedded ARM processor. A dual-core ARM processor running Linux
OS executes software that decomposes the neural network computations into tiles. The neural
network accelerator is designed for massively parallel reduced-precision MAC operations. It
accepts DNN-specific instructions from the ARM processor. DMA is used to transfer DNN
neuron outputs and weights from off-chip DDR memory into the on-chip buffers inside the
DNN accelerator. We added the function pipelining feature to the LegUp HLS tool in order to
generate the accelerator hardware from a software design in C. We created a software framework
that integrates all the system components, including the hardware system, the backend APIs
for tiling control and memory transfer, and the software implementation of the inference and
training for all layer types in AlexNet.
Using Heterogeneous Fixed-Point Representation In Neural Network Computation
We investigate the impact on neural network model accuracy of using a variety of data formats
for representing the DNN neurons and weights (including floating-point, uniform fixed-point and
heterogeneous fixed-point). Our experiments show that the heterogeneous fixed-point represen-
tation can achieve a model accuracy close to that obtained using a floating-point representation,
reducing bitwidth and hardware cost. In heterogeneous fixed-point, the precision of neurons
and weights can be tailored individually on a layer-by-layer basis, where the MAC computations
at a given layer are performed in fixed-point according to a configurable number of integer bits
and fraction bits.
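As a reminder of the arithmetic involved, a value x stored with f fraction bits is represented as round(x · 2^f), and a heterogeneous fixed-point multiply rescales the product to the destination format. The following minimal sketch (our illustration, assuming the output format has no more fraction bits than the product) shows the idea:

    #include <cstdint>

    // Multiply two 16-bit fixed-point values with f_a and f_b fraction bits,
    // returning a result with f_out fraction bits (names illustrative).
    int32_t fx_mul(int16_t a, int16_t b, int f_a, int f_b, int f_out) {
        int64_t p = static_cast<int64_t>(a) * b;  // product has f_a + f_b fraction bits
        return static_cast<int32_t>(p >> (f_a + f_b - f_out));
    }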
Software Framework
We have developed a C++ software framework that models reduced-precision DNN inference of a
generic neural network architecture specified in a user-provided configuration file. This frame-
work allows experimentation with a wide variety of DNN architectures and machine learning
applications. When deployed on our SoC system, the software framework divides the computa-
tion into tiles and off-loads the tiles to the hardware accelerator on the FPGA. The software is
also responsible for orchestrating the data transfer and generating custom instructions to guide
the accelerator execution.
Hardware Accelerator
The hardware accelerator consists of three major components: the accelerator controller, on-
chip reuse buffers and the compute unit. The accelerator controller is responsible for translating
the custom instructions into cycle-by-cycle control signals for the buffers and the compute unit.
The on-chip buffers are designed to take advantage of reuse opportunities and reduce the off-
chip memory traffic. The highly-pipelined compute unit is capable of performing 64 MAC
operations every clock cycle. We added a function pipelining feature to the LegUp HLS tool,
which permits the streaming hardware accelerator to be efficiently implemented in a high-level
language (the C-language). The function pipelining support in LegUp HLS has been submitted
for publication to the 2016 IEEE International Conference on Application-specific Systems, Architectures
and Processors (ASAP).
A Complete End-to-End System
We implemented a complete and working system integrating the software framework along with
the hardware accelerator, allowing the accelerated inference of a neural network to be performed
based on a high-level configuration file. The software framework also includes the necessary
support for it to be deployed running on top of the Linux OS, which required support for page
locking and virtual-to-physical address translation. The complete system running on an OS
makes it readily usable for real-world neural network applications.
7.2 Future Work
Translation Software Optimization and Multi-Core Implementation
One useful future direction is to optimize the tiling scheme, specifically the tiling size and
traversal order. We believe the tiling scheme can be optimally customized for each layer of
computation in order to improve data reuse and minimize memory traffic. Another useful
improvement to our current implementation is to overlap data transfer with accelerator compu-
tation so that the latency can be reduced. Furthermore, we can implement a multi-core system
on a larger FPGA device where multiple accelerator cores are instantiated and perform compu-
tation in parallel. For such a multi-core system, we would need to update the tiling scheme and
add support to coordinate data transfer and schedule computation among multiple accelerator
cores. Data reuse and sharing between accelerator cores will be an important design considera-
tion to prevent the off-chip memory traffic from limiting the overall system performance. Since
the above optimization tasks are co-dependent, an optimal tiling or scheduling scheme cannot
be found without considering all related factors as a whole. Therefore, an interesting future
research project is to design a software simulator that explores the solution space and identifies
optimal tiling and scheduling schemes for a specific neural network model.
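To give a flavour of what such a simulator would evaluate, below is a hypothetical first-order off-chip-traffic model for a tiled fully-connected layer; the formula and parameters are simplifying assumptions, not measured behaviour.

    #include <cstdint>

    // First-order off-chip traffic estimate for a fully-connected layer
    // tiled into groups of tile_out outputs: every weight is streamed once,
    // but the input neurons are re-fetched once per output tile, so a larger
    // tile_out trades on-chip buffer capacity for less input traffic.
    uint64_t fc_traffic_bytes(uint64_t out_n, uint64_t in_n,
                              uint64_t tile_out, uint64_t bytes_per_word) {
        uint64_t out_tiles = (out_n + tile_out - 1) / tile_out;
        uint64_t weights = out_n * in_n;       // streamed exactly once
        uint64_t inputs  = in_n * out_tiles;   // reloaded per output tile
        uint64_t outputs = out_n;              // written back once
        return (weights + inputs + outputs) * bytes_per_word;
    }

A simulator along these lines would sweep tile_out (subject to the buffer-capacity constraint) and the traversal order, and pick the minimum-traffic configuration per layer.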
Using a More Cost-Efficient Data Representation and Arithmetic
Recent research [17, 28] (Section 3.6) has shown that a neural network trained with the binary
value constraint (i.e., +1 or -1) on its weights can achieve nearly state-of-the-art accuracy. Such a weight can be stored in a single bit, and the multiplication of a weight and a neuron reduces to a sign inversion. Another cost-efficient approach is to limit the number of 1-bits in a fixed-point weight representation. For instance, if all the weights in a neural network are restricted to at most two 1-bits, the multiplication of a weight and a neuron can be performed with two shifts and an add. This restriction also permits a more compact storage format, where a value is represented by the indices of its 1-bits within the wider word. Future work is to investigate the
feasibility of using such restricted weight representations in large-scale neural network models.
If the restricted weight representations prove to be feasible, it would be exciting to design a
custom hardware accelerator that takes advantage of the low-cost operations (inversion, shift
and add) and the reduced data storage and traffic. We have been mentoring a fourth-year
undergraduate student's thesis project investigating this direction.
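As a concrete illustration of these low-cost multiplies (a sketch under assumed encodings, not a fixed design), consider:

    #include <cstdint>

    // Binary weight (+1/-1): one bit of storage, and the "multiply" becomes
    // a conditional sign inversion of the neuron value.
    int32_t mul_binary(bool w_is_pos, int32_t neuron) {
        return w_is_pos ? neuron : -neuron;
    }

    // Weight restricted to at most two 1-bits: store the positions of the
    // 1s (plus a sign), so |weight| = 2^hi + 2^lo and the multiply reduces
    // to two shifts and an add.
    int32_t mul_two_ones(int hi, int lo, bool neg, int32_t neuron) {
        uint32_t mag = (uint32_t)(neuron < 0 ? -neuron : neuron);
        int32_t p = (int32_t)((mag << hi) + (mag << lo));
        if (neuron < 0) p = -p;    // restore the neuron's sign
        return neg ? -p : p;       // apply the weight's sign
    }

Neither routine needs a hardware multiplier, which is the source of the expected area and power savings.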
Neural Network Compression
Recent studies [21, 20] have shown that the numerical values used in deep neural networks
can be compressed significantly by pruning the weight connections, sharing weight values and
encoding the non-zero-value indices in sparse weight matrices. The experiments in [20] show
that the storage requirements for the AlexNet model can be reduced by 35×, down to 6.9 MB,
without affecting the accuracy. From our point of view, neural network compression techniques
can be exploited to reduce the off-chip memory traffic or even make it possible to store all the
weights in the hardware accelerator’s on-FPGA memory.
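As a small illustration of the index-encoding idea (a sketch in the spirit of, but not identical to, the scheme in [20]), a pruned weight row can be stored as value/offset pairs and consumed directly by the dot product:

    #include <cstdint>
    #include <vector>

    // Sparse-row encoding for pruned weights: only non-zero values are kept,
    // each paired with its distance from the previous non-zero entry; small
    // gaps need only a few bits, so uint8_t is used here for illustration.
    struct SparseRow {
        std::vector<int16_t> value;   // quantized non-zero weights
        std::vector<uint8_t> offset;  // zero-run length before each value
    };

    // Dot product of a sparse weight row with a dense neuron vector:
    // only the non-zero weights are visited.
    int32_t sparse_dot(const SparseRow& row, const int16_t* neurons) {
        int32_t acc = 0;
        int col = -1;
        for (std::size_t k = 0; k < row.value.size(); ++k) {
            col += row.offset[k] + 1;  // skip the run of pruned (zero) weights
            acc += static_cast<int32_t>(row.value[k]) * neurons[col];
        }
        return acc;
    }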
Support for Recurrent Neural Networks
In this project, we only considered feed-forward networks where all the layers are connected in a
uniform direction. Recurrent neural networks (RNNs), which have additional feedback connections from deeper layers to shallower layers, have also shown promising performance in many applications. The computation of an RNN is fairly similar to that of a feed-forward network, and thus most of the work in this project is reusable if one were to build an RNN accelerator.
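Concretely, a basic recurrent layer computes (in standard notation, not notation defined in this thesis)

    h_t = f(W x_t + U h_{t-1} + b),

so each time step is still a set of matrix-vector MAC operations; the only new requirement is routing the previous output h_{t-1} back as an input, which the existing MAC-oriented compute unit and reuse buffers could accommodate.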
Bibliography
[1] BVLC CaffeNet model. https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.
[2] Google Test, Google’s C++ test framework. https://github.com/google/googletest.
[3] Protocol buffer. https://developers.google.com/protocol-buffers.
[4] Stanford Vision Lab. http://vision.stanford.edu.
[5] Altera Corporation. Enabling High-Performance DSP Applications with Stratix V Variable-
Precision DSP Blocks, May 2011. Version 1.1.
[6] Altera Corporation. Altera Cyclone V SoC Development Kit, Reference Platform Porting
Guide, November 2015.
[7] Altera Corporation. Embedded Peripherals IP User Guide, December 2015. Version
2015.12.16.
[8] Altera Corporation. Quartus Prime Standard Edition Handbook Volume 1: Design and
Synthesis, November 2015. Version 15.1.0.
[9] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new per-
spectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–
1828, 2013.
[10] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[11] A. Canis, S.D. Brown, and J.H. Anderson. Modulo SDC scheduling with recurrence minimization in high-level synthesis. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pages 1–8, Sept 2014.
[12] A. Canis, Jongsok Choi, B. Fort, Ruolong Lian, Qijing Huang, N. Calagar, M. Gort,
Jia Jun Qin, M. Aldham, T. Czajkowski, S. Brown, and J. Anderson. From software to
accelerators with LegUp high-level synthesis. In Compilers, Architecture and Synthesis for
Embedded Systems (CASES), 2013 International Conference on, pages 1–9, Sept 2013.
[13] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier
Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-
learning. In Proceedings of the 19th International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS ’14, pages 269–284, New York,
NY, USA, 2014. ACM.
[14] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi
Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. DaDianNao: A machine-learning su-
percomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO-47, pages 609–622, Washington, DC, USA, 2014. IEEE Com-
puter Society.
[15] J. Cong, Bin Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Zhiru Zhang. High-level
synthesis for FPGAs: From prototyping to deployment. Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, 30(4):473–491, April 2011.
[16] Jason Cong and Yi Zou. FPGA-based hardware acceleration of lithographic aerial image
simulation. ACM Trans. Reconfigurable Technol. Syst., 2(3):17:1–17:29, September 2009.
[17] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems 28, pages 3105–3113, 2015.
[18] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic
for deep learning. ICLR Workshop, 2015.
[19] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep
learning with limited numerical precision. CoRR, abs/1502.02551, 2015.
[20] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural
network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149,
2015.
[21] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connec-
tions for efficient neural networks. CoRR, abs/1506.02626, 2015.
[22] Yangqing Jia. Learning Semantic Image Representations at a Large Scale. PhD thesis,
Electrical Engineering and Computer Sciences, University of California at Berkeley, May
2014.
[23] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir-
shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast
feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with
deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q.
Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc., 2012.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. In Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
[26] Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits.
[27] Fei-Fei Li. ImageNet: crowdsourcing, benchmarking & other cool things. CMU VASC
Seminar, March 2010.
[28] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural
networks with few multiplications. CoRR, abs/1510.03009, 2015.
[29] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988.
[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-
heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of
Computer Vision (IJCV), 115(3):211–252, 2015.
[31] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-
houcke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and
Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1–9, June 2015.
[32] The HDF Group. Hierarchical Data Format, version 5, 1997-NNNN.
http://www.hdfgroup.org/HDF5/.
[33] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimiz-
ing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings
of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays,
FPGA ’15, pages 161–170, New York, NY, USA, 2015. ACM.