An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs
Jainik Kathiara
Analog Devices, Inc., 3 Technology Way,
Norwood, MA USA
Email: [email protected]
Miriam Leeser
Dept. of Electrical and Computer Engineering, Northeastern University
Boston, MA USA
Email: [email protected]
Abstract: We present a Floating Point Vector Coprocessor (FPVC)
that works with the Xilinx embedded processors. The FPVC is
completely autonomous from the embedded processor, exploiting
parallelism and exhibiting greater speedup than alternative
vector processors. The FPVC supports scalar computation so that
loops can be executed independently of the main embedded
processor. Floating point addition, multiplication, division and
square root are implemented with the Northeastern University
VFLOAT library. The FPVC is parameterized so that the number of
vector lanes and maximum vector length can be easily modified.
We have implemented the FPVC on a Xilinx Virtex 5 connected via
the Processor Local Bus (PLB) to the embedded PowerPC. Our
results show more than five times improved performance over the
PowerPC augmented with the Xilinx Floating Point Unit on
applications from linear algebra: QR and Cholesky decomposition.
Keywords: floating point; vector processing; FPGA
I. INTRODUCTION
There is increased interest in using embedded processing
on FPGAs including for applications that make use of float-
ing point operations. The current design practice for both
Xilinx and Altera FPGAs is to generate an auxiliary floating
point processing unit. These FPUs rely on the embedded
processor for fetching instructions, which inherently limits
the parallelism in the implementation and hurts performance.
We have implemented a floating point co-processor, the
floating point vector/scalar co-processor (FPVC), that runs
independent of the main embedded processor. The FPVC has
its own local instruction memory (IRAM) and data memory
(DRAM) under DMA control. The main processor initiates
the DMA of instructions to IRAM and then starts the FPVC.
The FPVC achieves performance by fetching and decoding
instructions in parallel with other operations on the FPGA.
In addition, scalar instructions are supported. This allows
all loop control to be handled locally without requiring
intervention of the main processor.
Vector processing has several advantages for an FPGA
implementation. A much smaller program is required to
implement a program, reducing the static instruction count.
Fewer instructions need to be decoded dynamically, which
simplifies instruction execution. Hazards only need to be
checked at the start of an instruction. The FPVC design
takes advantage of the reconfigurable nature of FPGAs by
exposing design time parameters, including the maximum
vector length (MVL) supported in hardware, the number of
vector lanes implemented, and the sizes of the local
instruction and data memories. Details of the FPVC and its
implementation can be found in [1].

Figure 1. The Floating Point Vector Co-Processor
II. THE FLOATING POINT VECTOR CO-PROCESSOR
The FPVC (Fig. 1) has the following features:
- Complete autonomy from the main processor
- Support for single precision floating point and 32-bit
  integer arithmetic operations
- Four stage RISC pipeline for integer arithmetic and
  memory access
- Variable length RISC pipeline for floating point
  arithmetic
- Unified vector/scalar general purpose register file
- Support for a modified Harvard style memory architecture
  with separate level 1 instruction and data RAM and
  unified level 2 memory
The FPVC uses the Processor Local Bus (PLB) as the
system bus interface. The FPVC has one slave (SLV) port
for communicating with the main processor and one mas-
ter (MST) port for main memory accesses. The memory
architecture is divided into two levels: main memory and
local memory. Both types of memory are implemented in on-
chip BlockRAM. Main memory is connected to the FPVC
through the master port of the system bus while local
memory sits between the bus and the processing core. All
memory transfers are under program control; no caching
is implemented. Instruction memory is loaded under DMA
control before program execution begins. Data memory
is loaded from main memory using DMA under FPVC
program control. The local memories are part of system
address space and can be accessed by any master on the
system bus.
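For illustration, the host-side launch sequence described above
can be sketched in C. The register map, offsets, and flag names
below are assumptions made for this example, not the FPVC's
documented slave interface.

/* Hypothetical host-side control flow for launching the FPVC.
 * All addresses, offsets and bit names are illustrative
 * assumptions, not the actual FPVC register map. */
#include <stdint.h>

#define FPVC_BASE      0x80000000u  /* assumed PLB slave base address */
#define FPVC_IRAM_OFF  0x0000u      /* local IRAM is in system address space */
#define FPVC_CTRL_OFF  0x8000u      /* assumed control register offset */
#define FPVC_STAT_OFF  0x8004u      /* assumed status register offset */
#define FPVC_GO        0x1u
#define FPVC_DONE      0x1u

static void fpvc_run(const uint32_t *prog, unsigned nwords)
{
    volatile uint32_t *iram = (volatile uint32_t *)(FPVC_BASE + FPVC_IRAM_OFF);
    volatile uint32_t *ctrl = (volatile uint32_t *)(FPVC_BASE + FPVC_CTRL_OFF);
    volatile uint32_t *stat = (volatile uint32_t *)(FPVC_BASE + FPVC_STAT_OFF);

    /* Copy the kernel into local instruction memory; the paper
       uses DMA for this step, a plain copy loop stands in here. */
    for (unsigned i = 0; i < nwords; i++)
        iram[i] = prog[i];

    *ctrl = FPVC_GO;                 /* start autonomous execution */
    while ((*stat & FPVC_DONE) == 0)
        ;                            /* PowerPC could do useful work here */
}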
Figure 2. The Vector Scalar Register File

A. Vector Scalar Instruction Set Architecture
We have designed a new Instruction Set Architecture
(ISA) inspired by the VIRAM vector ISA [2] as well as
the Microblaze and Power ISAs. The Vector-Scalar ISA is
a 32-bit instruction set. The instruction encoding allows
for 32 vector-scalar registers with variable vector length.
As shown in Fig. 2, the top short vector of each vector
register can be used as a scalar register. This allows us
to freely mix vector and scalar registers without requiring
communication with the host processor. The vector reg-
ister file supports a configurable lane width. The vector-
scalar ISA supports a maximum vector length (MVL) of
C_NUM_OF_LANE * C_NUM_OF_VECTOR. These pa-
rameters are design time parameters. Instructions are classi-
fied into three major classes: Arithmetic instructions, mem-
ory instructions and inter-lane communication instructions
(expand and compress).
Arithmetic Instructions: Fig. 3 shows vector and scalar
arithmetic operations. We implement several floating point
instructions: add, multiply, divide and square root; as well as
basic integer arithmetic, and compare and shift instructions
which operate on the full data word (32 bits). The integer
multiply instruction operates on the lower 16-bits of the
operands and produces a 32 bit result. All scalar operations
are performed on the first element of the first short vector
of each register and the result is replicated to all lanes and
stored on the first short vector of the destination register.
Vector instructions which require scalar data can reference
the top of each register.
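A small C model makes this scalar replication rule concrete;
the LANES constant and the names below are illustrative, not
part of the ISA.

/* Reference model of the scalar-add semantics described above,
 * for an assumed LANES-wide short vector. */
#define LANES 4

typedef struct { float sv[LANES]; } short_vec;  /* one short vector */

/* Scalar add: operate on element 0 of the first short vector of
   each source, then replicate the result across all lanes of the
   destination's first short vector. */
static void scalar_add(short_vec *dst, const short_vec *a,
                       const short_vec *b)
{
    float r = a->sv[0] + b->sv[0];
    for (int lane = 0; lane < LANES; lane++)
        dst->sv[lane] = r;
}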
Figure 3. Vector and Scalar Arithmetic Operation
Figure 4. Memory Access Patterns

Memory Instructions: The memory instructions
(load.xxx and store.xxx) support strided memory
access (both unit and non-unit stride), permuted access,
indexed access and rake access. Fig. 4 shows the rake
access pattern, which can be described by a single vector
instruction. It requires a vector base address, vector index
address and immediate offset value (the distance between
two neighbor elements). All neighbor elements within
each rake are stored in a single short vector while each
rake of elements is stored in a different short vector. The
same instruction can be used for unit stride and non-unit
stride access by setting the immediate offset value to the
distance between two vector elements in memory.
Permutation and look-up table access classes are realized
by setting the immediate offset to zero and providing an
index register. Scalar accesses are supported with the same
memory instructions.
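The following C sketch models the address generation implied by
this description; the parameter values, array layout and names
are assumptions made for the example.

/* Sketch of rake-access address generation: rake r starts at
 * base + index[r]; its neighbor elements are 'offset' apart and
 * fill one short vector. LANES and NUM_SV are assumed
 * design-parameter values. */
#define LANES  4   /* elements per short vector */
#define NUM_SV 8   /* short vectors per register */

static void rake_load(float *reg[NUM_SV],        /* one short vector per rake */
                      const float *mem,
                      unsigned base,
                      const unsigned index[NUM_SV],
                      unsigned offset)            /* immediate offset */
{
    for (int r = 0; r < NUM_SV; r++)     /* each rake -> one short vector */
        for (int e = 0; e < LANES; e++)  /* neighbor elements in a rake */
            reg[r][e] = mem[base + index[r] + e * offset];
}
/* Making index[r] an arithmetic sequence yields unit or non-unit
   stride; offset == 0 with an index register gives the permuted
   and look-up table access classes described in the text. */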
Interlane Communication Instructions: We implement
vector compression and expansion instructions. Compress
instructions select the subset of an input vector marked by a
flag vector and pack these together into contiguous elements
at the start of a destination vector. Expand instructions
perform the reverse operation to unpack a vector, placing
source elements into a destination vector at locations marked
by bits in a flag register. Previous vector processors [2],
[3] implement a crossbar switch to perform inter-lane com-
munication. The FPVC compress and expand instructions
implement the same functionality with lower hardware cost
but slower access to the vector elements.
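The semantics of the two instructions can be captured by a short
C reference model; VLEN and the function names are assumptions
for the example.

/* Reference semantics for compress and expand over a flag
 * vector, written for a flat vector of VLEN elements. */
#define VLEN 32

/* Pack flagged source elements into contiguous elements at the
   start of the destination. */
static void vcompress(float dst[VLEN], const float src[VLEN],
                      const unsigned char flag[VLEN])
{
    int k = 0;
    for (int i = 0; i < VLEN; i++)
        if (flag[i])
            dst[k++] = src[i];
}

/* Reverse operation: place consecutive source elements at the
   destination positions marked by the flag bits. */
static void vexpand(float dst[VLEN], const float src[VLEN],
                    const unsigned char flag[VLEN])
{
    int k = 0;
    for (int i = 0; i < VLEN; i++)
        if (flag[i])
            dst[i] = src[k++];
}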
Figure 5. Vector Scalar Pipeline
B. The Vector Scalar Pipeline
The FPVC pipeline (Fig. 5) is based on the classic in-
order issue, out-of-order completion RISC pipeline. The four
stages are Instruction Fetch, Decode, Execute and Write
Back. The pipeline is intentionally kept short so integer
vector instructions can complete in a small number of cycles
and floating point instructions spend most of their time in
the floating point unit, optimizing the execution latency. As
both scalar and vector instructions are executed from the
same instruction pipeline, both types of instructions are freely
mixed in the program execution and stored in the same local
instruction memory.
Floating point operations are implemented using the
VFloat library [4]. Each functional unit implements IEEE
754 single precision floating point. Arithmetic operations
all have different latencies: each unit is fully pipelined
so that a new operation can be started every clock cycle.
Normalization and rounding are implemented along with
the multiplexers shown in the datapath. Due to the different
latencies of different operations, instructions are issued in
order but can complete out of order. Hence, a structural
hazard may occur if more than one instruction completes
in the same clock cycle. We have implemented an arbiter,
which can commit one result each cycle, between the end of
the execution stage and the write back stage of the pipeline
to eliminate this hazard. When multiple results are available
at the same time, one will be written to the register file, and
the rest will be stalled.
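A cycle-level C sketch of this commit policy is shown below; the
number of units and the fixed-priority scan are modeling
assumptions, since the paper does not specify the arbitration
order.

/* Sketch of the writeback arbiter: several functional units may
 * have a result ready in the same cycle, but only one result is
 * committed to the register file; the others stall. */
#define NUNITS 4   /* e.g. add, mul, div, sqrt pipelines */

struct fu { int ready; unsigned dest; float value; };

/* Returns the index of the unit granted writeback this cycle, or
   -1 if none is ready. A fixed-priority scan stands in for
   whatever policy the real arbiter uses. */
static int arbiter_grant(struct fu u[NUNITS])
{
    for (int i = 0; i < NUNITS; i++) {
        if (u[i].ready) {
            u[i].ready = 0;   /* this result commits now */
            return i;         /* remaining ready units stay stalled */
        }
    }
    return -1;
}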
Design Time Parameters: We take advantage of
the flexibility of the reconfigurable fabric to provide
design time reconfigurable parameters as well as runtime
parameters. At design time, the implementer can choose
the number of vector lanes (C_NUM_OF_LANE), the number
of short vectors supported (C_VECTOR_LENGTH), and the
number of bytes of local BRAM memory
(C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE), as well as
the bitwidth of the floating point components.
For the experiments described here, we always implement
single precision. The maximum vector length supported is
the number of lanes times the number of short vectors.
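For concreteness, the design-time parameters named above and the
MVL relation can be collected in a small C sketch; the struct and
function are illustrative only, not generated configuration code.

/* The design-time parameters from the text, gathered into one
 * struct for illustration. */
struct fpvc_params {
    unsigned c_num_of_lane;     /* C_NUM_OF_LANE: vector lanes */
    unsigned c_vector_length;   /* C_VECTOR_LENGTH: short vectors per register */
    unsigned c_instr_mem_size;  /* C_INSTR_MEM_SIZE: bytes of local IRAM */
    unsigned c_data_mem_size;   /* C_DATA_MEM_SIZE: bytes of local DRAM */
};

static unsigned max_vector_length(const struct fpvc_params *p)
{
    /* MVL = number of lanes x number of short vectors */
    return p->c_num_of_lane * p->c_vector_length;
}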
III. EXPERIMENTS AND RESULTS

Figure 6. Experimental Setup

The FPVC is implemented in VHDL and synthesized
using Xilinx ISE 10.1 CAD tools targeting Virtex-5 FPGAs.
We compare the FPVC against the PowerPC 440 with Xilinx
FPU using linear algebra kernels as examples. All code runs
on the Xilinx ML510 board. We connect the FPVC via
a PLB to the PowerPC and to the main on-chip memory
(BRAM, Fig. 6). We also connect the PowerPC to the
Xilinx FPU via the Fabric Co-processor Bus (FCB). For the
experiments either the PowerPC plus FPU or the PowerPC
and FPVC are used. The performance metric is the number
of clock cycles between the start and the end of a kernel.
Clock cycles are counted using the PowerPC's internal timer
and results are compared to the runtime of the PowerPC plus
FPU. Local IRAM and DRAM can be configured
for various sizes using the parameters
C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE and
C_MPLB_DWIDTH at design compile time. For all of the
results presented, we have set the instruction and data
memory sizes to 64KB each and the PLB to 32 bits. We
vary the number of vector lanes and length of short vectors.
For running on the PowerPC, the linear algebra kernels were
written in C and compiled using gcc with -O2 optimization.
The FPVC based kernels are written in machine code.
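As a point of reference, a baseline kernel of the kind compiled
for the PowerPC might resemble the textbook Cholesky
factorization below; this is an illustrative sketch, not the
authors' actual code. Note that the diagonal requires a square
root and the off-diagonal a divide, exercising the FPVC's full
set of floating point units.

#include <math.h>

/* Textbook Cholesky decomposition A = L * L^T for an n x n
 * symmetric positive definite matrix stored row-major; written
 * in the style of the C baseline kernels, not taken from the
 * paper's sources. */
static void cholesky(const float *a, float *l, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            float s = a[i * n + j];
            for (int k = 0; k < j; k++)
                s -= l[i * n + k] * l[j * n + k];
            if (i == j)
                l[i * n + j] = sqrtf(s);         /* diagonal: square root */
            else
                l[i * n + j] = s / l[j * n + j]; /* off-diagonal: divide */
        }
    }
}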
Program and data are stored in the 64KB main memory
shown in Fig. 6. The FPVC system bus interface is used
to load instructions into the local IRAM of the FPVC. We
test our FPVC for performance on floating point numerical
linear systems solvers. Here we present results for QR and
Cholesky decomposition. While our runtimes are not as
fast as a custom datapath, they consume less area and are
more flexible than previously published vector processors
while running faster than the Xilinx PowerPC embedded
processor with auxiliary FPU. The FPGA resources used
for a range of the design time parameter values are shown in
Fig. 7. The factor that has the largest influence on the
number of resources used is the number of vector lanes.

Figure 7. Resources Used
Figure 8. Results for QR Decomposition
We implemented QR and Cholesky Decomposition on the
FPVC and compared the results to a PowerPC connected
via the APU interface to the Xilinx FPU; this baseline setup
is assigned a normalized performance of one.
Fig. 8 shows the results for QR on the FPVC. For the
FPVC implementations, the short vector length was kept
constant at 32; the number of vector lanes was varied. The
FPVC outperforms the Xilinx FPU even with only one
lane implemented. This is because there are significantly
fewer vector instructions to decode, so most of the time is
spent executing floating point operations.
QR has plenty of parallelism, so increasing the number of
lanes improves the speedup. The best speedup is achieved with 8
lanes on a 24 by 24 matrix; the speedup is over five times
the performance of the Xilinx solution.
Results comparing Cholesky on FPVC to PowerPC plus
FPU, varying the number of lanes, are shown in Fig. 9. For
the implementations shown, Cholesky runs more than three
times faster on the FPVC. Cholesky does not exhibit as
much parallelism as QR, hence increasing lanes does not
give that much improvement. Increasing the short vector size
to 32 did result in significant improvement. The maximum
performance improvement achieved is over 5x for 8 lanes.
However the four lane solution, which uses significantly
fewer resources, achieves nearly as good performance and
is the best choice for Cholesky.

Figure 9. Results for Cholesky Decomposition
IV. CONCLUSIONS AND FUTURE WORK
The completely autonomous floating point vector/scalar
co-processor exhibits speedup on important linear algebra
kernels when compared to the implementation used by most
practitioners: the embedded processor FPU provided by Xilinx.
The FPVC is easier to implement at the cost of a decrease
in performance compared to a custom datapath. Hence the
FPVC occupies a middle ground in the range of designs
that make use of floating point. The FPVC is completely
autonomous. Thus, the PowerPC can be doing independent
work while the FPVC is computing floating point solutions.
We have not yet exploited this concurrency.
The FPVC is configurable at design time. The number
of lanes, size of vectors and local memory sizes can be
configured to fit the application. Results above show that for
QR decomposition, the designer may choose 8 lanes while
for Cholesky, 4 lanes at 32 element short vectors is more
efficient. The bitwidth of the FPVC datapath can also easily
be modified. We plan to implement double precision in the
near future. We also plan to add instruction caching so that
larger programs can be implemented on the FPVC, and to
provide tools such as assemblers and compilers to make the
FPVC easier to use.
REFERENCES
[1] J. Kathiara, "The Unified Floating Point Vector Co-processor
    for Reconfigurable Hardware," Master's thesis, Northeastern
    University Dept. of ECE, Boston, MA, 2011. [Online]. Available:
    http://www.coe.neu.edu/Research/rcl/publications.php#theses
[2] C. Kozyrakis and D. Patterson, "Overcoming the limitations
    of conventional vector processors," in Intl. Symposium on
    Computer Architecture, June 2003, pp. 399-409.
[3] J. Yu, C. Eagleston et al., "Vector processing as a soft
    processor accelerator," ACM Trans. Reconfigurable Technol.
    Syst., vol. 2, pp. 12:1-12:34, June 2009.
[4] X. Wang and M. Leeser, "VFloat: a variable precision fixed-
    and floating-point library for reconfigurable hardware," ACM
    Trans. Reconfigurable Technol. Syst., vol. 3, pp. 16:1-16:34,
    September 2010.