
    An Autonomous Vector/Scalar Floating Point Coprocessor for FPGAs

    Jainik Kathiara

Analog Devices, Inc., 3 Technology Way,

    Norwood, MA USA

    Email: [email protected]

    Miriam Leeser

Dept. of Electrical and Computer Engineering, Northeastern University

    Boston, MA USA

    Email: [email protected]

Abstract: We present a Floating Point Vector Coprocessor (FPVC) that works with Xilinx embedded processors. The FPVC is completely autonomous from the embedded processor, exploiting parallelism and exhibiting greater speedup than alternative vector processors. The FPVC supports scalar computation so that loops can be executed independently of the main embedded processor. Floating point addition, multiplication, division and square root are implemented with the Northeastern University VFLOAT library. The FPVC is parameterized so that the number of vector lanes and the maximum vector length can be easily modified. We have implemented the FPVC on a Xilinx Virtex 5 connected via the Processor Local Bus (PLB) to the embedded PowerPC. Our results show more than five times improved performance over the PowerPC augmented with the Xilinx Floating Point Unit on applications from linear algebra: QR and Cholesky decomposition.

Keywords: floating point; vector processing; FPGA

    I. INTRODUCTION

There is increased interest in using embedded processing on FPGAs, including for applications that make use of floating point operations. The current design practice for both Xilinx and Altera FPGAs is to generate an auxiliary floating point processing unit (FPU). These FPUs rely on the embedded processor for fetching instructions, which inherently limits the parallelism in the implementation and hurts performance.

We have implemented a floating point co-processor, the floating point vector/scalar co-processor (FPVC), that runs independently of the main embedded processor. The FPVC has its own local instruction memory (IRAM) and data memory (DRAM) under DMA control. The main processor initiates the DMA of instructions to IRAM and then starts the FPVC. The FPVC achieves performance by fetching and decoding instructions in parallel with other operations on the FPGA. In addition, scalar instructions are supported. This allows all loop control to be handled locally without requiring intervention of the main processor.

Vector processing has several advantages for an FPGA implementation. A much smaller program is required to implement an algorithm, reducing the static instruction count. Fewer instructions need to be decoded dynamically, which simplifies instruction execution. Hazards only need to be checked at the start of an instruction. The FPVC design takes advantage of the reconfigurable nature of FPGAs by including design time parameters such as the maximum vector length (MVL) supported in hardware, the number of vector lanes implemented, and the size of the local instruction and data memories. Details of the FPVC and its implementation can be found in [1].

Figure 1. The Floating Point Vector Co-Processor

II. THE FLOATING POINT VECTOR CO-PROCESSOR

The FPVC (Fig. 1) has the following features:

- Complete autonomy from the main processor
- Support for single precision floating point and 32-bit integer arithmetic operations
- Four stage RISC pipeline for integer arithmetic and memory access
- Variable length RISC pipeline for floating point arithmetic
- Unified vector/scalar general purpose register file
- Support for a modified Harvard style memory architecture with separate level 1 instruction and data RAM and unified level 2 memory

The FPVC uses the Processor Local Bus (PLB) as the system bus interface. The FPVC has one slave (SLV) port for communicating with the main processor and one master (MST) port for main memory accesses. The memory architecture is divided into two levels: main memory and local memory. Both types of memory are implemented in on-chip BlockRAM. Main memory is connected to the FPVC through the master port of the system bus, while local memory sits between the bus and the processing core.

All memory transfers are under program control; no caching is implemented. Instruction memory is loaded under DMA control before program execution begins. Data memory is loaded from main memory using DMA under FPVC program control. The local memories are part of the system address space and can be accessed by any master on the system bus.

Figure 2. The Vector Scalar Register File

    A. Vector Scalar Instruction Set Architecture

We have designed a new Instruction Set Architecture (ISA) inspired by the VIRAM vector ISA [2] as well as the MicroBlaze and Power ISAs. The Vector-Scalar ISA is a 32-bit instruction set. The instruction encoding allows for 32 vector-scalar registers with variable vector length. As shown in Fig. 2, the top short vector of each vector register can be used as a scalar register. This allows us to freely mix vector and scalar registers without requiring communication with the host processor. The vector register file supports a configurable lane width. The vector-scalar ISA supports a maximum vector length (MVL) of C_NUM_OF_LANE * C_NUM_OF_VECTOR; both are design time parameters. Instructions are classified into three major classes: arithmetic instructions, memory instructions and inter-lane communication instructions (expand and compress).
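To make the unified vector/scalar register file concrete, here is a minimal C model of one plausible layout. Only the parameter names C_NUM_OF_LANE and C_NUM_OF_VECTOR and the rule MVL = C_NUM_OF_LANE * C_NUM_OF_VECTOR come from the text; the types, names and example values are ours.

```c
#include <stdint.h>

#define C_NUM_OF_LANE    4   /* example: lanes = elements per short vector */
#define C_NUM_OF_VECTOR  8   /* example: short vectors per register */
#define MVL (C_NUM_OF_LANE * C_NUM_OF_VECTOR)   /* maximum vector length */

/* 32-bit element, shared by floating point and integer instructions. */
typedef union {
    float   f32;
    int32_t i32;
} elem_t;

/* One of the 32 architectural registers: C_NUM_OF_VECTOR short
 * vectors of C_NUM_OF_LANE elements each. Row 0 is the "top" short
 * vector, which doubles as the scalar view of the register. */
typedef struct {
    elem_t sv[C_NUM_OF_VECTOR][C_NUM_OF_LANE];
} vreg_t;

static vreg_t regfile[32];

/* A scalar read takes element 0 of the top short vector. */
static inline elem_t read_scalar(int r) { return regfile[r].sv[0][0]; }
```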

Arithmetic Instructions Fig. 3 shows vector and scalar arithmetic operation. We implement several floating point instructions: add, multiply, divide and square root; as well as basic integer arithmetic, compare and shift instructions, which operate on the full data word (32 bits). The integer multiply instruction operates on the lower 16 bits of the operands and produces a 32-bit result. All scalar operations are performed on the first element of the first short vector of each register, and the result is replicated to all lanes and stored in the first short vector of the destination register. Vector instructions which require scalar data can reference the top of each register.

Figure 3. Vector and Scalar Arithmetic Operation
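A minimal sketch of those scalar semantics, using the hypothetical register file model above (the function name and the choice of a floating point add are ours):

```c
/* Scalar add: operate on element 0 of each source's top short
 * vector, then replicate the result across all lanes of the
 * destination's top short vector, as described in the text. */
static void scalar_fadd(int rd, int ra, int rb)
{
    float r = regfile[ra].sv[0][0].f32 + regfile[rb].sv[0][0].f32;
    for (int lane = 0; lane < C_NUM_OF_LANE; lane++)
        regfile[rd].sv[0][lane].f32 = r;
}
```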

Memory Instructions The memory instructions (load.xxx and store.xxx) support strided memory access (both unit and non-unit stride), permuted access, indexed access and rake access. Fig. 4 shows the rake access pattern, which can be described by a single vector instruction. It requires a vector base address, a vector index address and an immediate offset value (the distance between two neighboring elements). All neighboring elements within each rake are stored in a single short vector, while each rake of elements is stored in a different short vector. The same instruction can be used for unit stride and non-unit stride access by setting the immediate offset value to an equal distance between two vector elements in memory. Permutation and look-up table access classes are realized by setting the immediate offset to zero and providing an index register. Scalar accesses are supported with the same memory instructions.

Figure 4. Memory Access Patterns
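As a rough illustration of the rake pattern, the sketch below folds the vector base and index addresses into precomputed per-rake element offsets; all names are ours, not the FPVC encoding:

```c
/* Rake load: rake r starts at element offset rake_base[r]; offset is
 * the distance between neighboring elements of a rake. Neighbors
 * within a rake fill one short vector; each rake lands in a
 * different short vector of the destination register. */
static void load_rake(vreg_t *dst, const float *mem,
                      const uint32_t rake_base[C_NUM_OF_VECTOR],
                      uint32_t offset)
{
    for (int r = 0; r < C_NUM_OF_VECTOR; r++)
        for (int lane = 0; lane < C_NUM_OF_LANE; lane++)
            dst->sv[r][lane].f32 = mem[rake_base[r] + lane * offset];
}
```

With the per-rake bases advancing uniformly and offset set to the distance between consecutive vector elements, the same loop degenerates to unit or non-unit strided access, matching the text.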

Interlane Communication Instructions We implement vector compression and expansion instructions. Compress instructions select the subset of an input vector marked by a flag vector and pack these together into contiguous elements at the start of a destination vector. Expand instructions perform the reverse operation to unpack a vector, placing source elements into a destination vector at locations marked by bits in a flag register. Previous vector processors [2], [3] implement a crossbar switch to perform inter-lane communication. The FPVC compress and expand instructions implement the same functionality with lower hardware cost but slower access to the vector elements.
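The compress and expand semantics can be sketched over a flat MVL-element view of a register (illustrative only; the hardware operates per lane):

```c
/* Compress: pack the elements marked by flag contiguously at the
 * start of the destination. Expand is the inverse: scatter packed
 * source elements to the marked destination slots. */
static void compress(elem_t dst[MVL], const elem_t src[MVL],
                     const uint8_t flag[MVL])
{
    int k = 0;
    for (int i = 0; i < MVL; i++)
        if (flag[i])
            dst[k++] = src[i];
}

static void expand(elem_t dst[MVL], const elem_t src[MVL],
                   const uint8_t flag[MVL])
{
    int k = 0;
    for (int i = 0; i < MVL; i++)
        if (flag[i])
            dst[i] = src[k++];
}
```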


    Figure 5. Vector Scalar Pipeline

    B. The Vector Scalar Pipeline

The FPVC pipeline (Fig. 5) is based on the classic in-order issue, out-of-order completion RISC pipeline. The four stages are Instruction Fetch, Decode, Execute and Write Back. The pipeline is intentionally kept short so that integer vector instructions can complete in a small number of cycles and floating point instructions spend most of their time in the floating point unit, optimizing the execution latency. As both scalar and vector instructions are executed from the same instruction pipeline, both types of instructions are freely mixed in the program execution and stored in the same local instruction memory.

Floating point operations are implemented using the VFloat library [4]. Each functional unit implements IEEE 754 single precision floating point. Arithmetic operations all have different latencies; each unit is fully pipelined so that a new operation can be started every clock cycle. Normalization and rounding are implemented along with the multiplexers shown in the datapath. Due to the different latencies of different operations, instructions are issued in order but can complete out of order. Hence, a structural hazard may occur if more than one instruction completes in the same clock cycle. To eliminate this hazard, we have implemented an arbiter between the end of the execute stage and the write back stage of the pipeline, which can commit one result each cycle. When multiple results are available at the same time, one will be written to the register file and the rest will be stalled.
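A toy software model of that write-back arbitration (our own abstraction of the described behavior, not the RTL; the fixed priority order and the scalar commit path are assumptions):

```c
#define NUM_FU 4   /* illustrative: add, multiply, divide, sqrt units */

typedef struct {
    int    ready;      /* result waiting at the end of execute */
    int    dest_reg;   /* destination register number */
    elem_t value;      /* computed result (scalar case shown) */
} fu_result_t;

/* Commit at most one result per cycle; other ready results keep
 * their ready flag set, i.e. they stall until a later cycle. */
static int writeback_arbiter(fu_result_t fu[NUM_FU])
{
    for (int i = 0; i < NUM_FU; i++) {
        if (fu[i].ready) {
            regfile[fu[i].dest_reg].sv[0][0] = fu[i].value;
            fu[i].ready = 0;
            return i;   /* index of the unit that committed */
        }
    }
    return -1;          /* nothing committed this cycle */
}
```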

Design Time Parameters We take advantage of the flexibility of the reconfigurable fabric to provide design time reconfigurable parameters as well as runtime parameters. At design time, the implementer can choose the number of vector lanes (C_NUM_OF_LANE), the number of short vectors supported (C_VECTOR_LENGTH), and the number of bytes of local BRAM memory (C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE), as well as the bitwidth of the floating point components. For the experiments described here, we always implement single precision. The maximum vector length supported is the number of lanes times the vector length.
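Gathering the generics named above into one illustrative record (the struct and example accessor are ours; C_VECTOR_LENGTH here plays the role of the C_NUM_OF_VECTOR count used earlier for MVL):

```c
/* Design time generics of the FPVC, as named in the text. */
typedef struct {
    unsigned num_of_lane;     /* C_NUM_OF_LANE: vector lanes */
    unsigned vector_length;   /* C_VECTOR_LENGTH: short vectors per register */
    unsigned instr_mem_size;  /* C_INSTR_MEM_SIZE: local IRAM bytes */
    unsigned data_mem_size;   /* C_DATA_MEM_SIZE: local DRAM bytes */
} fpvc_params_t;

/* MVL = lanes x short vectors, per the text. */
static unsigned max_vector_length(const fpvc_params_t *p)
{
    return p->num_of_lane * p->vector_length;
}
```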

III. EXPERIMENTS AND RESULTS

The FPVC is implemented in VHDL and synthesized using Xilinx ISE 10.1 CAD tools targeting Virtex-5 FPGAs. We compare the FPVC against the PowerPC 440 with the Xilinx FPU using linear algebra kernels as examples. All code runs on the Xilinx ML510 board. We connect the FPVC via a PLB to the PowerPC and to the main on-chip memory (BRAM, Fig. 6). We also connect the PowerPC to the Xilinx FPU via the Fabric Co-processor Bus (FCB). For the experiments, either the PowerPC plus FPU or the PowerPC and FPVC are used. The performance metric is the number of clock cycles between the start and the end of a kernel. Clock cycles are counted using the PowerPC's internal timer, and results are compared to the runtime of the PowerPC plus FPU.

Figure 6. Experimental Setup
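A measurement of this kind could be taken with the time base routines of the Xilinx standalone BSP; the sketch below assumes that environment (xtime_l.h and XTime_GetTime), and run_kernel is a hypothetical stand-in for the kernel under test:

```c
#include "xtime_l.h"   /* Xilinx standalone BSP: PowerPC time base */

extern void run_kernel(void);   /* hypothetical kernel under test */

/* Return the elapsed time base ticks around one kernel run. */
unsigned long long time_kernel(void)
{
    XTime t0, t1;
    XTime_GetTime(&t0);   /* read the time base before the kernel */
    run_kernel();
    XTime_GetTime(&t1);   /* and after it completes */
    return (unsigned long long)(t1 - t0);
}
```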

Local IRAM and DRAM can be configured for various sizes using the parameters C_INSTR_MEM_SIZE, C_DATA_MEM_SIZE and C_MPLB_DWIDTH at design compile time. For all of the results presented, we have set the instruction and data memory sizes to 64KB each and the PLB to 32 bits. We vary the number of vector lanes and the length of short vectors. For running on the PowerPC, the linear algebra kernels were written in C and compiled using gcc with -O2 optimization. The FPVC based kernels are written in machine code. Program and data are stored in the 64KB main memory shown in Fig. 6. The FPVC system bus interface is used to load instructions into the local IRAM of the FPVC.

We test our FPVC for performance on floating point numerical linear systems solvers. Here we present results for QR and Cholesky decomposition. While our runtimes are not as fast as a custom datapath, they consume less area and are more flexible than previously published vector processors, while running faster than the Xilinx PowerPC embedded processor with auxiliary FPU. The FPGA resources used for a range of the parameter values are shown in Fig. 7.


    Figure 7. Resources Used

    Figure 8. Results for QR Decomposition

The factor with the largest influence on the number of resources used is the number of vector lanes.

We implemented QR and Cholesky decomposition on the FPVC and compared the results to a PowerPC connected via the APU interface to the Xilinx FPU; this baseline setup is normalized to a performance of one.

Fig. 8 shows the results for QR on the FPVC. For the FPVC implementations, the short vector length was kept constant at 32 and the number of vector lanes was varied. The FPVC outperforms the Xilinx FPU even with only one lane implemented. This is because there are significantly fewer vector instructions to decode, so most of the time is spent executing floating point operations. QR has plenty of parallelism, so increasing the number of lanes improves the speedup. The best speedup is achieved with 8 lanes on a 24 by 24 matrix; the speedup is over five times the performance of the Xilinx solution.

Results comparing Cholesky on the FPVC to the PowerPC plus FPU, varying the number of lanes, are shown in Fig. 9. For the implementations shown, Cholesky runs more than three times faster on the FPVC. Cholesky does not exhibit as much parallelism as QR, hence increasing the number of lanes does not give as much improvement. Increasing the short vector size to 32 did, however, result in significant improvement. The maximum performance improvement achieved is over 5x for 8 lanes. However, the four lane solution, which uses significantly fewer resources, achieves nearly as good performance and is the best choice for Cholesky.

Figure 9. Results for Cholesky Decomposition

    IV. CONCLUSIONS AND FUTURE WORK

The completely autonomous floating point vector/scalar co-processor exhibits speedup on important linear algebra kernels when compared to the implementation used by most practitioners: the embedded processor FPU provided by Xilinx. The FPVC is easier to implement than a custom datapath, at the cost of a decrease in performance. Hence the FPVC occupies a middle ground in the range of designs that make use of floating point. The FPVC is completely autonomous; thus, the PowerPC can be doing independent work while the FPVC is computing floating point solutions. We have not yet exploited this concurrency.

The FPVC is configurable at design time. The number of lanes, the size of vectors and the local memory sizes can be configured to fit the application. The results above show that for QR decomposition the designer may choose 8 lanes, while for Cholesky, 4 lanes with 32 element short vectors is more efficient. The bitwidth of the FPVC datapath can also easily be modified. We plan to implement double precision in the near future. We also plan to add instruction caching so that larger programs can be run on the FPVC, and to provide tools such as assemblers and compilers to make the FPVC easier to use.

    REFERENCES

[1] J. Kathiara, "The Unified Floating Point Vector Co-processor for Reconfigurable Hardware," Master's thesis, Northeastern University Dept. of ECE, Boston, MA, 2011. [Online]. Available: http://www.coe.neu.edu/Research/rcl/publications.php#theses

[2] C. Kozyrakis and D. Patterson, "Overcoming the limitations of conventional vector processors," in Intl. Symposium on Computer Architecture, June 2003, pp. 399-409.

[3] J. Yu, C. Eagleston et al., "Vector processing as a soft processor accelerator," ACM Trans. Reconfigurable Technol. Syst., vol. 2, pp. 12:1-12:34, June 2009.

[4] X. Wang and M. Leeser, "VFloat: a variable precision fixed- and floating-point library for reconfigurable hardware," ACM Trans. Reconfigurable Technol. Syst., vol. 3, pp. 16:1-16:34, September 2010.
