An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs
Jainik Kathiara
Analog Devices, Inc., 3 Technology Way,
Norwood, MA USA
Email: [email protected]
Miriam Leeser
Dept. of Electrical and Computer Engineering, Northeastern University
Boston, MA USA
Email: [email protected]
Abstract: We present a Floating Point Vector Coprocessor (FPVC)
that works with the Xilinx embedded processors. The FPVC is
completely autonomous from the embedded processor, exploiting
parallelism and exhibiting greater speedup than alternative
vector processors. The FPVC supports scalar computation so that
loops can be executed independently of the main embedded
processor. Floating point addition, multiplication, division and
square root are implemented with the Northeastern University
VFLOAT library. The FPVC is parameterized so that the number of
vector lanes and maximum vector length can be easily modified.
We have implemented the FPVC on a Xilinx Virtex 5 connected via
the Processor Local Bus (PLB) to the embedded PowerPC. Our
results show more than five times improved performance over the
PowerPC augmented with the Xilinx Floating Point Unit on
applications from linear algebra: QR and Cholesky decomposition.
Keywords: floating point; vector processing; FPGA
I. INTRODUCTION
There is increased interest in using embedded processing
on FPGAs including for applications that make use of float-
ing point operations. The current design practice for both
Xilinx and Altera FPGAs is to generate an auxiliary floating
point processing unit. These FPUs rely on the embedded
processor for fetching instructions, which inherently limits
the parallelism in the implementation and hurts performance.
We have implemented a floating point co-processor, the
floating point vector/scalar co-processor (FPVC), that runs
independent of the main embedded processor. The FPVC has
its own local instruction memory (IRAM) and data memory
(DRAM) under DMA control. The main processor initiates
the DMA of instructions to IRAM and then starts the FPVC.
The FPVC achieves performance by fetching and decoding
instructions in parallel with other operations on the FPGA.
In addition, scalar instructions are supported. This allows
all loop control to be handled locally without requiring
intervention of the main processor.
Vector processing has several advantages for an FPGA
implementation. A much smaller program is required to
implement a program, reducing the static instruction count.
Fewer instructions need to be decoded dynamically, which
simplifies instruction execution. Hazards only need to be
checked at the start of an instruction. The FPVC design
takes advantage of the reconfigurable nature of FPGAs by
exposing design time parameters, including the maximum
vector length (MVL) supported in hardware, the number of
vector lanes implemented, and the sizes of the local
instruction and data memories. Details of the FPVC and its
implementation can be found in [1].

Figure 1. The Floating Point Vector Co-Processor
II. THE FLOATING POINT VECTOR CO-PROCESSOR
The FPVC (Fig. 1) has the following features:
- Complete autonomy from the main processor
- Support for single precision floating point and 32-bit
  integer arithmetic operations
- Four stage RISC pipeline for integer arithmetic and
  memory access
- Variable length RISC pipeline for floating point
  arithmetic
- Unified vector/scalar general purpose register file
- Support for a modified Harvard style memory architecture
  with separate level 1 instruction and data RAM and
  unified level 2 memory
The FPVC uses the Processor Local Bus (PLB) as the
system bus interface. The FPVC has one slave (SLV) port
for communicating with the main processor and one mas-
ter (MST) port for main memory accesses. The memory
architecture is divided into two levels: main memory and
local memory. Both types of memory are implemented in on-
chip BlockRAM. Main memory is connected to the FPVC
through the master port of the system bus while local
memory sits between the bus and the processing core. All
memory transfers are under program control; no caching
is implemented. Instruction memory is loaded under DMA
control before program execution begins. Data memory
is loaded from main memory using DMA under FPVC
program control. The local memories are part of system
address space and can be accessed by any master on the
system bus.
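For illustration, the host-side launch sequence described above
can be sketched in C. The register map, offsets, and flag names
below are assumptions made for this example, not the FPVC's
documented slave interface.

/* Hypothetical host-side control flow for launching the FPVC.
 * All addresses, offsets and bit names are illustrative
 * assumptions, not the actual FPVC register map. */
#include <stdint.h>

#define FPVC_BASE      0x80000000u  /* assumed PLB slave base address */
#define FPVC_IRAM_OFF  0x0000u      /* local IRAM is in system address space */
#define FPVC_CTRL_OFF  0x8000u      /* assumed control register offset */
#define FPVC_STAT_OFF  0x8004u      /* assumed status register offset */
#define FPVC_GO        0x1u
#define FPVC_DONE      0x1u

static void fpvc_run(const uint32_t *prog, unsigned nwords)
{
    volatile uint32_t *iram = (volatile uint32_t *)(FPVC_BASE + FPVC_IRAM_OFF);
    volatile uint32_t *ctrl = (volatile uint32_t *)(FPVC_BASE + FPVC_CTRL_OFF);
    volatile uint32_t *stat = (volatile uint32_t *)(FPVC_BASE + FPVC_STAT_OFF);

    /* Copy the kernel into local instruction memory; the paper
       uses DMA for this step, a plain copy loop stands in here. */
    for (unsigned i = 0; i < nwords; i++)
        iram[i] = prog[i];

    *ctrl = FPVC_GO;                 /* start autonomous execution */
    while ((*stat & FPVC_DONE) == 0)
        ;                            /* PowerPC could do useful work here */
}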
Figure 2. The Vector Scalar Register File

A. Vector Scalar Instruction Set Architecture
We have designed a new Instruction Set Architecture
(ISA) inspired by the VIRAM vector ISA [2] as well as
the Microblaze and Power ISAs. The Vector-Scalar ISA is
a 32-bit instruction set. The instruction encoding allows
for 32 vector-scalar registers with variable vector length.
As shown in Fig. 2, the top short vector of each vector
register can be used as a scalar register. This allows us
to freely mix vector and scalar registers without requiring
communication with the host processor. The vector reg-
ister file supports a configurable lane width. The vector-
scalar ISA supports a maximum vector length (MVL) of
C_NUM_OF_LANE * C_NUM_OF_VECTOR. These pa-
rameters are design time parameters. Instructions are classi-
fied into three major classes: Arithmetic instructions, mem-
ory instructions and inter-lane communication instructions
(expand and compress).
Arithmetic Instructions: Fig. 3 shows vector and scalar
arithmetic operations. We implement several floating point
instructions: add, multiply, divide and square root; as well as
basic integer arithmetic, and compare and shift instructions
which operate on the full data word (32 bits). The integer
multiply instruction operates on the lower 16-bits of the
operands and produces a 32 bit result. All scalar operations
are performed on the first element of the first short vector
of each register and the result is replicated to all lanes and
stored on the first short vector of the destination register.
Vector instructions which require scalar data can reference
the top of each register.
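A small C model makes this scalar replication rule concrete;
the LANES constant and the names below are illustrative, not
part of the ISA.

/* Reference model of the scalar-add semantics described above,
 * for an assumed LANES-wide short vector. */
#define LANES 4

typedef struct { float sv[LANES]; } short_vec;  /* one short vector */

/* Scalar add: operate on element 0 of the first short vector of
   each source, then replicate the result across all lanes of the
   destination's first short vector. */
static void scalar_add(short_vec *dst, const short_vec *a,
                       const short_vec *b)
{
    float r = a->sv[0] + b->sv[0];
    for (int lane = 0; lane < LANES; lane++)
        dst->sv[lane] = r;
}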
Figure 3. Vector and Scalar Arithmetic Operation
Figure 4. Memory Access Patterns

Memory Instructions: The memory instructions
(load.xxx and store.xxx) support strided memory
access (both unit and non-unit stride), permuted access,
indexed access and rake access. Fig. 4 shows the rake
access pattern, which can be described by a single vector
instruction. It requires a vector base address, vector index
address and immediate offset value (the distance between
two neighbor elements). All neighbor elements within
each rake are stored in a single short vector while each
rake of elements is stored in a different short vector. The
same instruction can be used for unit stride and non-unit
stride access by setting the immediate offset value to the
distance between two vector elements in memory.
Permutation and look-up table access classes are realized
by setting the immediate offset to zero and providing an
index register. Scalar accesses are supported with the same
memory instructions.
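The following C sketch models the address generation implied by
this description; the parameter values, array layout and names
are assumptions made for the example.

/* Sketch of rake-access address generation: rake r starts at
 * base + index[r]; its neighbor elements are 'offset' apart and
 * fill one short vector. LANES and NUM_SV are assumed
 * design-parameter values. */
#define LANES  4   /* elements per short vector */
#define NUM_SV 8   /* short vectors per register */

static void rake_load(float *reg[NUM_SV],        /* one short vector per rake */
                      const float *mem,
                      unsigned base,
                      const unsigned index[NUM_SV],
                      unsigned offset)            /* immediate offset */
{
    for (int r = 0; r < NUM_SV; r++)     /* each rake -> one short vector */
        for (int e = 0; e < LANES; e++)  /* neighbor elements in a rake */
            reg[r][e] = mem[base + index[r] + e * offset];
}
/* Making index[r] an arithmetic sequence yields unit or non-unit
   stride; offset == 0 with an index register gives the permuted
   and look-up table access classes described in the text. */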
Interlane Communication Instructions: We implement
vector compression and expansion instructions. Compress
instructions select the subset of an input vector marked by a
flag vector and pack these together into contiguous elements
at the start of a destination vector. Expand instructions
perform the reverse operation to unpack a vector, placing
source elements into a destination vector at locations marked
by bits in a flag register. Previous vector processors [2],
[3] implement a crossbar switch to perform inter-lane com-
munication. The FPVC compress and expand instructions
implement the same functionality with lower hardware cost
but slower access to the vector elements.
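The semantics of the two instructions can be captured by a short
C reference model; VLEN and the function names are assumptions
for the example.

/* Reference semantics for compress and expand over a flag
 * vector, written for a flat vector of VLEN elements. */
#define VLEN 32

/* Pack flagged source elements into contiguous elements at the
   start of the destination. */
static void vcompress(float dst[VLEN], const float src[VLEN],
                      const unsigned char flag[VLEN])
{
    int k = 0;
    for (int i = 0; i < VLEN; i++)
        if (flag[i])
            dst[k++] = src[i];
}

/* Reverse operation: place consecutive source elements at the
   destination positions marked by the flag bits. */
static void vexpand(float dst[VLEN], const float src[VLEN],
                    const unsigned char flag[VLEN])
{
    int k = 0;
    for (int i = 0; i < VLEN; i++)
        if (flag[i])
            dst[i] = src[k++];
}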
Figure 5. Vector Scalar Pipeline
B. The Vector Scalar Pipeline
The FPVC pipeline (Fig. 5) is based on the classic in-
order issue, out-of-order completion RISC pipeline. The four
stages are Instruction Fetch, Decode, Execute and Write
Back. The pipeline is intentionally kept short so integer
vector instructions can complete in a small number of cycles
and floating point instructions spend most of their time in
the floating point unit, optimizing the execution latency. As
both scalar and vector instructions are executed from the
same instruction pipeline, both types of instructions are freely
mixed in the program execution and stored in the same local
instruction memory.
Floating point operations are implemented using the
VFloat library [4]. Each functional unit implements IEEE
754 single precision floating point. Arithmetic operations
all have different latencies: each unit is fully pipelined
so that a new operation can be started every clock cycle.
Normalization and rounding are implemented along with
the multiplexers shown in the datapath. Due to the different
latencies of different operations, instructions are issued in
order but can complete out of order. Hence, a structural
hazard may occur if more than one instruction completes
in the same clock cycle. We have implemented an arbiter,
which can commit one result each cycle, between the end of
the execution stage and the write back stage of the pipeline
to eliminate this hazard. When multiple results are available
at the same time, one will be written to the register file, and
the rest will be stalled.
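A cycle-level C sketch of this commit policy is shown below; the
number of units and the fixed-priority scan are modeling
assumptions, since the paper does not specify the arbitration
order.

/* Sketch of the writeback arbiter: several functional units may
 * have a result ready in the same cycle, but only one result is
 * committed to the register file; the others stall. */
#define NUNITS 4   /* e.g. add, mul, div, sqrt pipelines */

struct fu { int ready; unsigned dest; float value; };

/* Returns the index of the unit granted writeback this cycle, or
   -1 if none is ready. A fixed-priority scan stands in for
   whatever policy the real arbiter uses. */
static int arbiter_grant(struct fu u[NUNITS])
{
    for (int i = 0; i < NUNITS; i++) {
        if (u[i].ready) {
            u[i].ready = 0;   /* this result commits now */
            return i;         /* remaining ready units stay stalled */
        }
    }
    return -1;
}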
Design Time Parameters: We take advantage of
the flexibility of the reconfigurable fabric to provide
design time reconfigurable parameters as well as runtime
parameters. At design time, the implementer can choose
the number of vector lanes (C_NUM_OF_LANE), the number
of short vectors supported (C_VECTOR_LENGTH), and the
number of bytes of local BRAM memory
(C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE), as well as
the bitwidth of the floating point components.
For the experiments described here, we always implement
single precision. The maximum vector length supported is
the number of lanes times the number of short vectors.
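For concreteness, the design-time parameters named above and the
MVL relation can be collected in a small C sketch; the struct and
function are illustrative only, not generated configuration code.

/* The design-time parameters from the text, gathered into one
 * struct for illustration. */
struct fpvc_params {
    unsigned c_num_of_lane;     /* C_NUM_OF_LANE: vector lanes */
    unsigned c_vector_length;   /* C_VECTOR_LENGTH: short vectors per register */
    unsigned c_instr_mem_size;  /* C_INSTR_MEM_SIZE: bytes of local IRAM */
    unsigned c_data_mem_size;   /* C_DATA_MEM_SIZE: bytes of local DRAM */
};

static unsigned max_vector_length(const struct fpvc_params *p)
{
    /* MVL = number of lanes x number of short vectors */
    return p->c_num_of_lane * p->c_vector_length;
}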
III. EXPERIMENTS AND RESULTS

Figure 6. Experimental Setup

The FPVC is implemented in VHDL and synthesized
using Xilinx ISE 10.1 CAD tools targeting Virtex-5 FPGAs.
We compare the FPVC against the PowerPC 440 with Xilinx
FPU using linear algebra kernels as examples. All code runs
on the Xilinx ML510 board. We connect the FPVC via
a PLB to the PowerPC and to the main on-chip memory
(BRAM, Fig. 6). We also connect the PowerPC to the
Xilinx FPU via the Fabric Co-processor Bus (FCB). For the
experiments either the PowerPC plus FPU or the PowerPC
and FPVC are used. The performance metric is the number
of clock cycles between the start and the end of a kernel.
Clock cycles are counted using the PowerPC's internal timer
and results are compared to the runtime of the PowerPC plus
FPU. Local IRAM and DRAM can be configured
for various sizes using the parameters
C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE and
C_MPLB_DWIDTH at design compile time. For all of the
results presented, we have set the instruction and data
memory sizes to 64KB each and the PLB to 32 bits. We
vary the number of vector lanes and length of short vectors.
For running on the PowerPC, the linear algebra kernels were
written in C and compiled using gcc with -O2 optimization.
The FPVC based kernels are written in machine code.
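As a point of reference, a baseline kernel of the kind compiled
for the PowerPC might resemble the textbook Cholesky
factorization below; this is an illustrative sketch, not the
authors' actual code. Note that the diagonal requires a square
root and the off-diagonal a divide, exercising the FPVC's full
set of floating point units.

#include <math.h>

/* Textbook Cholesky decomposition A = L * L^T for an n x n
 * symmetric positive definite matrix stored row-major; written
 * in the style of the C baseline kernels, not taken from the
 * paper's sources. */
static void cholesky(const float *a, float *l, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            float s = a[i * n + j];
            for (int k = 0; k < j; k++)
                s -= l[i * n + k] * l[j * n + k];
            if (i == j)
                l[i * n + j] = sqrtf(s);         /* diagonal: square root */
            else
                l[i * n + j] = s / l[j * n + j]; /* off-diagonal: divide */
        }
    }
}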
Program and data are stored in the 64KB main memory
shown in Fig. 6. The FPVC system bus interface is used
to load instructions into the local IRAM of the FPVC. We
test our FPVC for performance on floating point numerical
linear systems solvers. Here we present results for QR and
Cholesky decomposition. While our runtimes are not as
fast as a custom datapath, they consume less area and are
more flexible than previously published vector processors
while running faster than the Xilinx PowerPC embedded
processor with auxiliary FPU. The FPGA resources used
for a range of the design time parameter values are shown in
Fig. 7. The factor that has the largest influence on the
number of resources used is the number of vector lanes.

Figure 7. Resources Used
Figure 8. Results for QR Decomposition
We implemented QR and Cholesky Decomposition on the
FPVC and compared the results to a PowerPC connected
via the APU interface to the Xilinx FPU; this baseline setup
is assigned a normalized performance of one.
Fig. 8 shows the results for QR on the FPVC. For the
FPVC implementations, the short vector length was kept
constant at 32; the number of vector lanes was varied. The
FPVC outperforms the Xilinx FPU even with only one
lane implemented. This is because there are significantly
fewer vector instructions to decode, so most of the time is
spent executing floating point operations.
QR has plenty of parallelism, so increasing the number of
lanes improves the speedup. The best speedup is achieved with 8
lanes on a 24 by 24 matrix; the speedup is over five times
the performance of the Xilinx solution.
Results comparing Cholesky on FPVC to PowerPC plus
FPU, varying the number of lanes, are shown in Fig. 9. For
the implementations shown, Cholesky runs more than three
times faster on the FPVC. Cholesky does not exhibit as
much parallelism as QR, hence increasing lanes does not
give that much improvement. Increasing the short vector size
to 32 did result in significant improvement. The maximum
performance improvement achieved is over 5x for 8 lanes.
However the four lane solution, which uses significantly
fewer resources, achieves nearly as good performance and
is the best choice for Cholesky.

Figure 9. Results for Cholesky Decomposition
IV. CONCLUSIONS AND FUTURE WORK
The completely autonomous floating point vector/scalar
co-processor exhibits speedup on important linear algebra
kernels when compared to the implementation used by most
practitioners: the embedded processor FPU provided by Xilinx.
The FPVC is easier to implement at the cost of a decrease
in performance compared to a custom datapath. Hence the
FPVC occupies a middle ground in the range of designs
that make use of floating point. The FPVC is completely
autonomous. Thus, the PowerPC can be doing independent
work while the FPVC is computing floating point solutions.
We have not yet exploited this concurrency.
The FPVC is configurable at design time. The number
of lanes, size of vectors and local memory sizes can be
configured to fit the application. Results above show that for
QR decomposition, the designer may choose 8 lanes while
for Cholesky, 4 lanes at 32 element short vectors is more
efficient. The bitwidth of the FPVC datapath can also easily
be modified. We plan to implement double precision in the
near future. We also plan to add instruction caching so that
larger programs can be implemented on the FPVC, and to
provide tools such as assemblers and compilers to make the
FPVC easier to use.
REFERENCES
[1] J. Kathiara, "The Unified Floating Point Vector Co-processor
    for Reconfigurable Hardware," Master's thesis, Northeastern
    University Dept. of ECE, Boston, MA, 2011. [Online]. Available:
    http://www.coe.neu.edu/Research/rcl/publications.php#theses
[2] C. Kozyrakis and D. Patterson, "Overcoming the limitations
    of conventional vector processors," in Intl. Symposium on
    Computer Architecture, June 2003, pp. 399-409.
[3] J. Yu, C. Eagleston et al., "Vector processing as a soft
    processor accelerator," ACM Trans. Reconfigurable Technol.
    Syst., vol. 2, pp. 12:1-12:34, June 2009.
[4] X. Wang and M. Leeser, "VFloat: a variable precision fixed-
    and floating-point library for reconfigurable hardware," ACM
    Trans. Reconfigurable Technol. Syst., vol. 3, pp. 16:1-16:34,
    September 2010.