
    DESIGN, DEVELOPMENT AND PERFORMANCE

    EVALUATION OF MULTIPROCESSOR SYSTEMS ON

    FPGA

    A dissertation submitted in partial fulfillment of the requirements

    for the degree of

    Master of Technology

    In

    Computer Application

    By

    Somen Barma

    (2005JCA2428)

    Under the guidance of

    Dr. Kolin Paul

    (Department of Computer Science and Engineering)

    Indian Institute of Technology Delhi

    May 2007


    CERTIFICATE

    This is to certify that the project entitled Design, development and performance evaluation of multiprocessor systems on FPGA, submitted by Somen Barma in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Applications to the Indian Institute of Technology, Delhi, is a record of bona fide work carried out by him under my supervision and guidance.

    Dr. Kolin Paul

    Department of Computer Science and Engineering

    Indian Institute of Technology, New Delhi


    ACKNOWLEDGEMENT

    It is a pleasure and a privilege to express my deep sense of gratitude, indebtedness and thankfulness to my guide, Dr. Kolin Paul, for his guidance, constant supervision, continuous inspiration and support throughout the course of this work. His valuable suggestions and critical evaluation have greatly helped me in the successful completion of the work.

    I am also thankful to Prof. M. Balakrishnan for showing keen interest in solving critical problems in this project.

    I am also thankful to all those who helped me directly or indirectly in the completion of this work.

    New Delhi                                Somen Barma
    20th May, 2007                           2005JCA2428
                                             IIT, Delhi.


    CONTENTS

    CERTIFICATE
    ACKNOWLEDGEMENT
    CONTENTS
    ABSTRACT

    CHAPTER 1 INTRODUCTION
        1.1 Background
        1.2 Microblaze
        1.3 Power PC
        1.4 Stand alone board support package
        1.5 uClinux

    CHAPTER 2 METHODOLOGY AND WORK DONE
        2.1 Designing a two-processor system
        2.2 Matrix addition on the system
        2.3 Two Microblaze system with DDR RAM
        2.4 Building of uClinux
        2.5 Creating new application for Ethernet packet handling
        2.6 Microblaze shared memory system
        2.7 Modified two Microblaze system
        2.8 Heterogeneous multiprocessor system with Power PC

    CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM
        3.1 Running of uClinux
        3.2 Speed up obtained with Decryption algorithm

    CHAPTER 4 PARALLELIZATION OF SMITH WATERMAN ALGORITHM
        4.1 Local Alignment: Smith Waterman algorithm
        4.2 Parallelism in Smith Waterman and multiple processors System
        4.3 Speed up obtained with Smith-Waterman algorithm

    CHAPTER 5 DISCUSSION

    CHAPTER 6 CONCLUSION AND FUTURE SCOPE

    REFERENCES

    GLOSSARY


    ABSTRACT

    In embedded systems, multiple processors can be used to enhance performance. Moreover, on FPGAs the cost and risk of developing such a system using soft-core processors are much lower. The target of the project was to build a multiprocessor system that can be used for many embedded applications with enhanced performance.

    We started with a single-processor system using Xilinx's soft-core processor, MicroBlaze. We then developed a two-processor system and sorted out certain resource conflict issues. At that stage the two processors were each supported by an independent standalone OS with limited capability and functionality. On this system we tried out a matrix addition application. Data was transferred between the processors through a shared BRAM (Block RAM) sitting on the OPB.

    To get more functionality at the OS level we chose uClinux as our OS. The OS was built for our system using the available tool chain. We added DDR RAM to the system for the OS to reside in. At present the system also has an Ethernet interface and can handle network packets.

    Finally we obtained both a homogeneous and a heterogeneous system. To evaluate the performance of these systems we chose two applications: decryption and the Smith-Waterman algorithm for local alignment.


    CHAPTER 1 INTRODUCTION

    During the past decade, there has been a dramatic increase in the number of applications

    within the commercial, medical and military market requiring very high input/output

    bandwidth and real-time processing power. Powerful embedded systems, also offering

    flexible configurability and cost-effective features, are needed to support particular

    requirements of such applications.

    The predominant solution to this need is the multiprocessor system, for several reasons: the possibility of using the most suitable processing element for each function; the possibility of using off-the-shelf components; the possibility of allocating tasks with different timing characteristics separately (periodic and sporadic tasks, tasks with hard and soft real-time constraints) and so using the most appropriate local scheduling policy for each; the possibility of minimizing communication by allocating cooperating tasks to the same subsystem; and the possibility of placing processing elements close to their related sensors, so that distributed data can be handled, as far as possible, in a distributed manner.

    A system in which all the processors are identical and are treated alike is called a symmetric multiprocessor (SMP) system. Here an attempt has been made to develop an SMP system on an FPGA.

    An FPGA (Field Programmable Gate Array) is a programmable logic device. Field programmable means that the FPGA's function is defined by a user's program rather than by the manufacturer of the device. A typical integrated circuit performs a particular function defined at the time of manufacture; the FPGA's function, in contrast, is defined by a program written by someone other than the device manufacturer. Depending on the particular device, the program is either 'burned' in permanently or semi-permanently as part of a board assembly process, or is loaded from an external memory each time the device is powered up. This user programmability gives the user access to complex integrated designs without the high engineering costs associated with application-specific integrated circuits.

    The CLB (Configurable Logic Block) is the basic logic unit in an FPGA. Exact numbers and features vary from device to device, but every CLB consists of a configurable switch matrix with 4 or 6 inputs, some selection circuitry (MUX, etc.), and flip-flops.

    While the CLB provides the logic capability, flexible interconnect routing carries signals between CLBs and to and from the I/Os. Routing comes in several flavors: local interconnect between CLBs, fast horizontal and vertical long lines spanning the device, and global low-skew routing for clocks and other global signals. The design software hides the interconnect routing task from the user unless specified otherwise, thus significantly reducing design complexity.

    We have used an FPGA board from Xilinx. The device on the board is an XC2VP30 in the FF896 package, speed grade -7. Xilinx also ships the EDK software, which can be used for functional specification, synthesis, place and route, and finally to download and debug the system.

    1.1 Background:

    Today many embedded products spread their solution over several chips, making it bigger, more expensive and more power-hungry. The System-on-Chip (SoC) solution has existed for a long time on Application Specific Integrated Circuit (ASIC) boards but is rather new on FPGA boards. FPGA boards have become bigger, faster and cheaper and are now able to handle an SoC solution, including a soft processor, which is an Intellectual Property (IP) core implemented using logic primitives. A key benefit is configurability: it is possible to add only what is needed in the design. The trade-off is performance: a hard processor is faster but less configurable and more expensive [1]. More and more companies are therefore looking into the possibility of using an SoC on an FPGA board with a soft processor, which makes it easier to develop and evaluate the solution.


    1.2 Microblaze:

    The soft-core processor used for this project is Microblaze [2]. The MicroBlaze

    embedded processor soft core is a reduced instruction set computer (RISC), 5 stage

    pipeline, optimized for implementation in Xilinx field programmable gate arrays

    (FPGAs). Figure 1.2.1 shows a functional block diagram of the MicroBlaze core.

    Many aspects of the MicroBlaze can be configured at compile time owing to the

    configurable nature of FPGAs. Cache structure, peripherals, and interfaces can be

    customized to the application. In addition, hardware support for certain operations, such

    as multiplication, division, and floating-point arithmetic, can be added or removed.

    Figure 1.2.1: MicroBlaze core block diagram

    MicroBlaze does not have a memory management unit. It can run at speeds of up to 150 MHz.

    The processor's fixed feature set includes:

    - Thirty-two 32-bit general-purpose registers
    - 32-bit instruction word with three operands and two addressing modes
    - 32-bit address bus
    - Single-issue pipeline

    The following additional features can be added to the MicroBlaze [12]:

    - Hardware barrel shifter: a digital circuit that can shift data by any number of bits in one operation; a vital component in floating-point operations
    - Hardware divider
    - Instruction and data cache
    - On-chip Peripheral Bus (OPB)
    - Local Memory Bus (LMB)
    - Fast Simplex Link (FSL)
    - Xilinx CacheLink

    1.2.1 Registers:

    MicroBlaze provides two kinds of registers: general-purpose registers and special-purpose registers [2].

    - Volatile registers (caller-save) are temporary registers and do not retain their values across function calls. They are registers R3-R12: R3 and R4 are used for returning values to the calling function, and R5-R12 are used to pass parameters.
    - Non-volatile registers (callee-save) keep their values across function calls. They are registers R19-R31.
    - Dedicated registers are the remaining registers. R14-R17 store return addresses from interrupts, subroutines, traps and exceptions. R0 always holds the value 0 and R1 holds the stack pointer. These registers should not be used for anything else.
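    The register grouping above can be written down as a small lookup. The following is only an illustrative sketch in Python (used here as a host-side notation, not as code from the project); the function name register_class is hypothetical, and the ranges are taken directly from the description above.

```python
def register_class(n: int) -> str:
    """Classify general-purpose register Rn using the ranges given in the text."""
    if not 0 <= n <= 31:
        raise ValueError("general-purpose registers are R0..R31")
    if 3 <= n <= 12:
        return "volatile"        # caller-save: R3-R4 return values, R5-R12 parameters
    if 19 <= n <= 31:
        return "non-volatile"    # callee-save
    return "dedicated"           # everything else, e.g. R0 (zero), R1 (stack pointer)


for n in (0, 1, 4, 12, 15, 19, 31):
    print(f"R{n}: {register_class(n)}")
```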


    1.2.2 Bus Interface:

    MicroBlaze has several bus interfaces for different purposes. It follows the Harvard architecture, in which separate paths are used for data and instruction accesses. An advantage of the Harvard architecture is that instructions and data can be read from memory at the same time [2].

    On-chip Peripheral Bus

    The OPB is a fully synchronous bus that provides access to both on-chip and off-chip peripherals. The bus is not intended to connect directly to the processor [3].

    Local Memory Bus

    The LMB is a fast local bus used to connect MicroBlaze to high-speed peripherals, mainly Block RAM (BRAM). The LMB makes it possible to access BRAM in one clock cycle [12].

    Fast Simplex Link Bus

    FSL is a one-way point-to-point communication bus used between an output FIFO device and an input FIFO device. It has support for up to eight master and slave interfaces, and data can be transferred in two clock cycles [12].

    Xilinx CacheLink

    The Xilinx CacheLink (XCL) interface is a high-speed bus for external memory communication and is only available when the caches are enabled. XCL can be combined with an OPB, where one cache uses XCL and the other uses the OPB. Memory located outside the cacheable area is accessed through the OPB or LMB [12].

    Debug Interface

    The debug interface is used with the Microprocessor Debug Module (MDM) and is controlled through the JTAG port by the Xilinx Microprocessor Debugger (XMD) [12].

    An interesting comparison between the synthesizable processors MicroBlaze, LEON2 and OpenRISC 1200 is presented in a master's thesis from Chalmers University [3]. It compares performance, configurability and usability. The MicroBlaze version used is 2.10.a; it performs well in the benchmarks, but it was found not to follow the floating-point standard.


    1.3 Power PC 405:

    The PowerPC 405 [17] is a 32-bit implementation of the PowerPC embedded-environment architecture, which is derived from the PowerPC architecture and tailored to meet the requirements of embedded system development. The original PowerPC architecture is a 64-bit architecture with a 32-bit subset, whereas the PowerPC 405 is a 32-bit processor. Some other features of this processor are as follows:

    1. Memory management optimized for embedded software environments.

    2. Cache-management instructions for optimizing performance and memory control in complex applications, which are graphically and numerically intensive.

    3. A device-control-register address space for managing on-chip peripherals such

    as memory controllers.

    4. A dual-level interrupt structure and interrupt-control instructions.

    5. Multiple timer resources.

    1.3.1 PowerPC 405 Hardware Organization:

    1. Central Processing unit: The PowerPC 405 central-processing unit (CPU) implements

    a 5-stage instruction pipeline consisting of fetch, decode, execute, write-back, and load

    write-back stages. The fetch and decode logic sends a steady flow of instructions to the

    execute unit. All instructions are decoded before they are forwarded to the execute unit.

    Instructions are queued in the fetch queue if execution stalls. Up to two branches are

    processed simultaneously by the fetch and decode logic. If a branch cannot be resolved

    prior to execution, the fetch and decode logic predicts how that branch is resolved.

    2. Memory Management Unit: The PowerPC 405 supports 4 GB of flat (non-segmented)

    address space. The memory management unit (MMU) provides address translation,

    protection functions, and storage attribute control for this address space. The MMU

    supports demand-paged virtual memory using multiple page sizes of 1 KB, 4 KB, 16 KB,


    64 KB, 256 KB, 1 MB, 4 MB and 16 MB. Multiple page sizes can improve memory

    efficiency and minimize the number of TLB misses.
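    The effect of page size on TLB pressure can be seen with a little arithmetic. The sketch below is illustrative only: the 16 MB region is an assumed example, and the helper name entries_needed is hypothetical. It counts how many page mappings (and hence TLB entries) are needed to cover one contiguous region at each of the page sizes listed above.

```python
PAGE_SIZES = {
    "1 KB": 1 << 10, "4 KB": 4 << 10, "16 KB": 16 << 10, "64 KB": 64 << 10,
    "256 KB": 256 << 10, "1 MB": 1 << 20, "4 MB": 4 << 20, "16 MB": 16 << 20,
}


def entries_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to cover the region (ceiling division)."""
    return -(-region_bytes // page_bytes)


region = 16 << 20  # assumed example: one contiguous 16 MB region
for name, size in PAGE_SIZES.items():
    print(f"{name:>6} pages: {entries_needed(region, size)} mappings")
```

    With 4 KB pages the region needs 4096 mappings, while a single 16 MB page covers it with one, which is why larger pages reduce the number of TLB misses for large regions.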

    1.4 Stand alone board support package:

    Initially the MicroBlazes were tested with the standalone [4] Board Support Package (BSP). In the later systems too, every processor other than the MicroBlaze running uClinux runs on the standalone BSP. It provides certain standard APIs that applications written in C may use. The standalone BSP is designed for use when an application accesses board or processor features directly (without an intervening OS layer).

    Figure 1.3.1.1 Block Diagram of Power PC [17]


    1.5 uClinux:

    uClinux (pronounced you-see-linux) is a port of regular Linux intended for microprocessors that do not have a Memory Management Unit (MMU). It is a soft real-time OS [10]. In short, an MMU translates logical addresses into physical addresses. All requests for data are sent to the MMU, which then decides whether the data is in RAM or needs to be fetched from disk. It also decides whether the process has the right to access the memory it is trying to reach [25]. Without an MMU there is no memory protection or virtual memory, which leaves the programmer with greater responsibility not to write over other processes' memory. The most noticeable effect for the programmer is that vfork() is used instead of fork(). Both vfork() and fork() create a child process that differs from the parent process only in its PID and parent PID [9]; with vfork(), however, the child shares the parent's memory and the parent is suspended until the child calls exec or exits. uClinux has been ported to many processor architectures; Motorola's ColdFire and DragonBall, Blackfin, ARM7TDMI and MicroBlaze are the ones most used. It exists as a derivative of Linux kernels 2.0, 2.4 and 2.6, but it is the 2.4 derivative that has been ported to the largest number of microprocessors [13].

    On Linux with virtual memory, whenever an application tries to write beyond the top of the stack, an exception is raised and more memory is mapped in at the top of the stack to allow the stack to grow. Under uClinux no such luxury is available, as the stack must be allocated at compile time, so a stack overflow may cause a crash. Also, instead of a dynamic heap, uClinux uses a global memory pool, which is basically the kernel's free memory pool.

    ISRs (interrupt service routines) and kernel tasks share a common counting semaphore. Therefore, if a kernel task is holding the semaphore, an ISR may have to wait a long time before it is executed. Secondly, kernel tasks are non-preemptive, so even a high-priority user application may have to wait. This makes the response time of uClinux longer, and so it is not a hard real-time OS.

    However, a critical issue is that we cannot use a single uClinux instance for all the MicroBlazes. This is because the system does not have provision for a single interrupt controller for all of them. Secondly, there is a cache coherency problem, for which we do not have a hardware solution.


    CHAPTER 2 METHODOLOGY AND WORK DONE

    2.1 Designing a two-processor system:

    Initially a system was built with a single microprocessor (MicroBlaze). The MicroBlaze is instantiated with its local BRAM. As the MicroBlaze follows the Harvard architecture, the BRAM is connected to it through both an instruction BRAM controller and a data BRAM controller [5]. The MicroBlaze sits on the OPB (On-chip Peripheral Bus).

    Once this system was working, the next step was to add the second processor. It was added using the wizard, although the same result could have been obtained by editing the MHS file. The new processor must first be provided with its own BRAM, where the data and instructions local to that processor reside. Consequently, two more controllers are required for that BRAM too. A block diagram of the resulting system is shown in Figure 2.1.1.

    Figure 2.1.1 Basic two Microblaze system


    The BRAM sitting on the OPB acts as a shared memory for the two processors. Any address on the OPB is visible to both MicroBlazes. MicroBlaze uses memory-mapped I/O, so both of them can access any resource on the OPB.

    But this ability causes another problem: resource conflict. The bus arbiter grants the bus to the MicroBlazes in round-robin fashion. So it may happen that while one MicroBlaze is transferring data, say to the UART, the other one gets hold of the bus and starts transmitting too. This can be solved with the bus parking facility of the bus, so that the other MicroBlaze does not get the bus until the first one finishes.

    However, this scheme has a flaw. Suppose there are multiple resources on the OPB and the two processors are using different resources. If we block the bus, one of the two has to wait unnecessarily. This drawback is removed by introducing a custom IP.

    The sync IP solves the problem. Whenever a process on a MicroBlaze wants to use a resource, it registers the processor ID and the process ID in a software-addressable register for that particular resource, i.e. it locks the resource. Later on this processor can release the lock. However, with this approach there is a chance of deadlock. In the above system both MicroBlazes have BSP support but no operating system.
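    The locking behaviour of the sync IP can be illustrated with a small software model. This is only a sketch under stated assumptions: the class SyncRegisters, its method names and the resource names are hypothetical, and the model is single-threaded, whereas the real IP would have to make the check-and-set of the register atomic in hardware.

```python
class SyncRegisters:
    """Software model of one lock register per shared resource (hypothetical layout)."""

    def __init__(self, resources):
        # None means the resource is free; otherwise (cpu_id, pid) of the owner.
        self._owner = {name: None for name in resources}

    def try_lock(self, resource, cpu_id, pid):
        """Register (cpu_id, pid) as owner if the resource is free."""
        if self._owner[resource] is None:
            self._owner[resource] = (cpu_id, pid)
            return True
        return self._owner[resource] == (cpu_id, pid)

    def release(self, resource, cpu_id, pid):
        """Only the registered owner may clear the register."""
        if self._owner[resource] != (cpu_id, pid):
            raise PermissionError("resource is not held by this process")
        self._owner[resource] = None


sync = SyncRegisters(["uart", "shared_bram"])
got_uart_0 = sync.try_lock("uart", cpu_id=0, pid=1)         # MicroBlaze 0 takes the UART
got_uart_1 = sync.try_lock("uart", cpu_id=1, pid=1)         # MicroBlaze 1 is refused
got_bram_1 = sync.try_lock("shared_bram", cpu_id=1, pid=1)  # a different resource is free
sync.release("uart", cpu_id=0, pid=1)
print(got_uart_0, got_uart_1, got_bram_1)
```

    Unlike bus parking, the two processors block each other only when they want the same resource. The deadlock risk mentioned above arises if each processor holds one resource and then waits for the other's.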

    After the hardware platform design is complete, one can generate an FPGA configuration bitstream. We use XPS to build the bitstream and the netlist. At this point we have only a "hardware bitstream", which is not ready to be applied to the FPGA until the software component of the embedded system is included.

    After the embedded software development is complete, we can choose one of the following ways to run it on the hardware:

    1. If the application executable resides in on-chip memory regions, it is possible to merge the Executable and Linkable Format (ELF) file into the hardware bitstream so that it gets loaded into on-chip memory, ready to execute, every time the FPGA is configured.

    2. During prototyping, XPS can dynamically download the executable to the board

    via the JTAG cable connected to the FPGA. In this case, we select a bootloop to

    be merged into the bitstream to initialize on-chip memory so that the processor

    remains in a static state until software downloading can be completed.

    3. For production systems, one can store executables residing in off-chip memory regions in a non-volatile memory device, such as flash Programmable ROM (PROM), or along with the configuration bitstream in a System ACE device. In this case, one would configure a bootloader executable to be merged into the hardware bitstream to initialize on-chip memory. Then each time the FPGA is configured or reset, the bootloader copies the application executable to a suitable (volatile) memory device and starts it running.

    We have used JTAG to download the ELF file along with the bitstream to the board. The standard output for the system is the UART.

    2.2 Matrix addition on the system:

    The system has been tested with matrix addition and works as follows. The first MicroBlaze holds the data on which the matrix addition is to be performed and writes it to the shared BRAM.

    Until the writing is complete, the other MicroBlaze waits, repeatedly checking whether a particular bit has been set to flag that the data has been written completely.

    MicroBlaze 2 then reads the data, calculates the sum and writes the result back to a different location in the BRAM. The first MicroBlaze then picks up the result and displays it over the UART (on HyperTerminal).
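    The handshake described above can be mimicked on a host machine with two threads standing in for the two MicroBlazes and a dictionary standing in for the shared BRAM. This is a minimal sketch, not the project code: the names bram, data_ready and result_ready are hypothetical, and the second flag (result_ready) is an assumption, since the text only describes the flag for the input data.

```python
import threading
import time


def microblaze_1(bram, a, b, out):
    bram["a"], bram["b"] = a, b
    bram["data_ready"] = 1            # set the flag only after the data is written
    while not bram["result_ready"]:   # poll until the sum is available
        time.sleep(0.001)
    out.extend(bram["sum"])           # stands in for printing on the UART


def microblaze_2(bram):
    while not bram["data_ready"]:     # poll the flag bit
        time.sleep(0.001)
    a, b = bram["a"], bram["b"]
    bram["sum"] = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    bram["result_ready"] = 1


def run(a, b):
    bram = {"data_ready": 0, "result_ready": 0, "a": None, "b": None, "sum": None}
    out = []
    t2 = threading.Thread(target=microblaze_2, args=(bram,))
    t1 = threading.Thread(target=microblaze_1, args=(bram, a, b, out))
    t2.start()
    t1.start()
    t1.join()
    t2.join()
    return out


print(run([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
```

    The important ordering is that each side sets its flag only after the corresponding data has been written, so the reader never sees a partially written matrix.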


    2.3 Two Microblaze system with DDR RAM:

    The system described above has been extended by adding a DDR SDRAM. This RAM is required to accommodate the OS (uClinux). The RAM is also attached to the OPB. The caches for MicroBlaze 1 have now been instantiated.

    Figure 2.2.1: Schematic representation of matrix addition

    2.3.1 Cache

    The cache consists of an instruction cache and a data cache, each controlled by one bit in the Machine Status Register (MSR) [12]. They are 1-way associative (direct mapped): each block (a collection of data containing the requested word) can be placed at only one place in the cache [6]. The memory can be divided into cacheable and non-cacheable segments, making it possible to specify exactly what to cache. The only address space that cannot be cached is the LMB address space. The data cache uses write-through, where the cache is mirrored in main memory by writing to memory on each cache write [6]. The cache can be used with either the OPB interface or the dedicated XCL interface, or with a combination of the two. The differences between those interfaces are [12]:

    - CacheLink uses a 4-word cache block (critical word first). It takes the requested word and the next three, which increases the hit rate; OPB uses a single-word cache block.
    - CacheLink uses a dedicated bus interface for memory accesses. This reduces traffic on the OPB.
    - The CacheLink interface requires a specialized memory controller interface. The OPB interface uses standard OPB memory controllers.
    - CacheLink allows posted write accesses on write misses. OPB caches require the write access to be completed before execution is resumed (data cache only).
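    The effect of the block size on the hit rate can be shown with a toy model of a direct-mapped cache. The sketch below is an illustration only: the cache size (16 lines) and the access pattern (64 sequential word reads) are assumed values, and the function name hit_rate is hypothetical.

```python
def hit_rate(word_addresses, num_lines, words_per_block):
    """Direct-mapped cache model: every block maps to exactly one line."""
    tags = [None] * num_lines
    hits = 0
    for addr in word_addresses:
        block = addr // words_per_block   # which memory block holds this word
        line = block % num_lines          # the only line this block may occupy
        tag = block // num_lines          # identifies which block is in the line
        if tags[line] == tag:
            hits += 1
        else:
            tags[line] = tag              # miss: the whole block is fetched
    return hits / len(word_addresses)


sequential = list(range(64))              # 64 consecutive word reads
print(hit_rate(sequential, num_lines=16, words_per_block=1))  # single-word blocks
print(hit_rate(sequential, num_lines=16, words_per_block=4))  # 4-word blocks
```

    For this sequential pattern the single-word block never hits (0.0), while the 4-word block hits on three of every four reads (0.75), because each miss also brings in the next three words.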

    2.3.2 XMD debug module:

    To be able to debug the MicroBlazes and their programs, the second MicroBlaze must also be connected to the XMD debug module [8]. This is done by connecting the following ports of the debug module to the MicroBlaze:

    PORT DBG_CAPTURE_0 = DBG_CAPTURE_s
    PORT DBG_CLK_0 = DBG_CLK_s
    PORT DBG_REG_EN_0 = DBG_REG_EN_s
    PORT DBG_TDI_0 = DBG_TDI_s
    PORT DBG_TDO_0 = DBG_TDO_s
    PORT DBG_UPDATE_0 = DBG_UPDATE_s


    - Supports unicast, multicast, and broadcast transmit and receive modes, as well as promiscuous and 64-entry Content Addressable Memory (CAM) based receive modes
    - Automatic source address field insertion, overwrite, or pass-through for transmission

    The EMAC [10] Interface design is a soft intellectual property (IP) core designed for

    implementation in several Xilinx FPGAs. It supports the IEEE Std. 802.3 Media

    Independent Interface (MII) to industry standard Physical Layer (PHY) devices and

    communicates to a processor via an IBM On-Chip Peripheral Bus (OPB) interface. The

    design provides a 10 Megabits per second (Mbps) and 100 Mbps (also known as Fast

    Ethernet) EMAC Interface. This design includes many of the functions and the flexibility

    found in dedicated Ethernet controller devices currently on the market.

    2.4 Building of uClinux:

    In this step, the uClinux auto-configure mechanism is used to map/export the EDK

    hardware design to the uClinux kernel build mechanism. This is done by configuring and

    generating the BSP. The library generator uses the configuration information and the

    hardware design database to export an auto-config.in file, which contains all the

    information about the hardware design. This auto-config.in file is in the uClinux

    configuration file format. The uClinux sources, makefiles, and other scripts are built with

    conditional code to generate the correct software based on the hardware described in the

    auto-config.in file. This flow effectively means that retargeting uClinux for any hardware

    setup can be done quickly. To retarget uClinux, regenerate the BSP for the new hardware

    setup. By rebuilding uClinux with the newly generated auto-config.in file, an image targeting the new hardware setup is compiled by the uClinux tools. The tools and related sources can be obtained from PetaLogix [15]. The uClinux image has to be generated on a Linux system. Once the auto-config.in file has been placed in the proper location in the uClinux distribution, the following steps generate the image using the tool chain.


    1. Go to the uClinux distribution directory and run:

        cd home/devel/uclinux/src/uClinux-dist
        make clean
        make xconfig

    2. Set the vendor option to Xilinx.

    3. Choose uClinux auto as the OS.

    4. Select the other libraries.

    Once the kernel image is ready, we take it back to the host system. From the host system we download the image to the RAM using the hardware debug module, with the following XMD commands:

        cd binaries
        dow -data tmicro.bin 0x30000000
        mwr 0x100 0
        rwr 5 0x100
        rwr pc 0x30000000
        con

    That is, we load the kernel image tmicro.bin at memory location 0x30000000, write zero to address 0x100, set register 5 to 0x100, set the program counter to 0x30000000 and ask the processor to start.

    2.5 Creating new application for Ethernet packet handling:

    Applications [14] can be developed separately or as a software application project in XPS; they can also be created in the Software Development Kit (SDK) by importing the XPS design. All applications need to link to the Xilkernel library in order to get the kernel functionality [16]. The kernel and its applications can then be downloaded to the target from within the EDK and SDK.

    The applications in Xilkernel can be run in two ways. One method is to build the

    applications into the kernel, having only one executable file. The other method is to have

    each application as a separate executable file, the same method that is used in a


    conventional operating system where the kernel is a separate image file and the

    applications are separate files.

    We first built the application by placing it in the source path and updating the
    makefile. Later on, we built only the specific application. To download the
    application, we connected the host machine to the MicroBlaze system over the network
    and ran an FTP session from the host to the system. We downloaded the application to
    the directory /var/tmp, changed the mode of the file, and executed it. This approach
    spared us the tedious job of rebuilding the OS image every time.

    The application accepts packets over the network using the TCP/IP protocol. The
    payload is then copied to a particular location in DDR RAM. The other MicroBlaze picks
    the data up from that location and performs the required operation in parallel.
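The hand-off through DDR RAM described above can be pictured with a small simulation. This is only a sketch under stated assumptions: the bytearray stands in for a region of the shared DDR RAM, and the mailbox layout (a ready flag and a length word in front of the payload), the function names, and the offset 0x40 are all hypothetical and do not come from the project's application.

```python
import struct

# Hypothetical mailbox header: ready flag and payload length, stored as two
# 32-bit big-endian words. The layout is an assumption of this sketch.
HEADER = struct.Struct(">II")


def master_publish(shared: bytearray, base: int, payload: bytes) -> None:
    """Master side: copy the payload first, then raise the ready flag."""
    start = base + HEADER.size
    if start + len(payload) > len(shared):
        raise ValueError("payload does not fit in the shared region")
    shared[start:start + len(payload)] = payload
    HEADER.pack_into(shared, base, 1, len(payload))


def slave_poll(shared: bytearray, base: int):
    """Slave side: return the payload if one is ready, otherwise None."""
    ready, length = HEADER.unpack_from(shared, base)
    if not ready:
        return None
    start = base + HEADER.size
    data = bytes(shared[start:start + length])
    HEADER.pack_into(shared, base, 0, 0)  # hand the mailbox back to the master
    return data


ddr = bytearray(256)                 # stand-in for a region of shared DDR RAM
print(slave_poll(ddr, 0x40))         # nothing has been published yet
master_publish(ddr, 0x40, b"payload from a TCP packet")
print(slave_poll(ddr, 0x40))
```

Writing the flag after the payload keeps the slave from reading a half-written buffer in this simulation. On the real system the shared region would presumably also have to bypass or flush the caches, since MicroBlaze has no hardware cache coherency.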

    2.6 MicroBlaze shared memory Multiprocessing:

    The following are possible ways of creating a shared-memory multiprocessor system
    (taken from John Williams' report). We evaluate the pros and cons and finally choose a
    custom design for our implementation.

    2.6.1 Implicit Multiprocessing:

    Using this mechanism, parallelism is hidden by the OS and the hardware. A single copy
    of the operating system runs and controls all the processors in the system. The
    following figure shows an implementation of an N-processor SMP (Symmetric
    Multiprocessor) system.

    The MicroBlaze soft-core processor does not have hardware support for cache coherency,
    and a software implementation of cache coherency has a severe performance impact. The
    MicroBlaze core also lacks distributed interrupt management.


    the rest of the Microblazes on another OPB. This system also has the scope of further

    scaling up by attaching more buses and Microblazes on them.

    In order to achieve parallelism with minimum memory overhead and bus scaling issues,
    we have developed a system with a master-slave relationship. The master MicroBlaze is
    the only MicroBlaze in the system that runs the uClinux OS. The remaining MicroBlaze
    processors act as slaves and use the standalone BSP. The MicroBlaze running uClinux is
    responsible for providing the benefits of an advanced OS to the complete system and
    for various synchronization activities. In the application explored in this project,
    the master processor collects data over the network and makes it available to the
    other processors. The above figure shows the architecture of the system.

    Bridges (OPB2PLB and PLB2OPB) allow the MicroBlazes to access the DDR RAM and other
    peripherals that are not on their own OPB but are mapped into their address space.

    2.7.1 Processor Local Bus (PLB):

    The Xilinx 64-bit Processor Local Bus (PLB) consists of a bus control unit, a watchdog

    timer, and separate address, write, and read data path units with a three-cycle only

    arbitration feature.

    The following operations are specific to the PLB arbitration logic:

    1. Single Read Transfer Bus Time-Out,

    2. Single Write Transfer Bus Time-Out,

    3. Line Read Transfer Bus Time-Out,

    4. Line Write Transfer Bus Time-Out,

    5. Burst Read Transfer Bus Time-Out,

    6. Burst Write Transfer Bus Time-Out,

    7. Pipelined Read Transfer,

    8. Pipelined Write Transfer


    Arbitration Priority: The Xilinx PLB implements fixed priority when two or more
    masters have the same priority inputs. The priority order in this case is Master 0,
    Master 1, Master 2, ..., Master N.
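The tie-breaking rule above can be illustrated with a short sketch. It is not the arbiter's actual logic: the function name is hypothetical, and the convention that a larger number means a higher priority input is an assumption of this example. Only the tie-break order (Master 0 first) is taken from the text.

```python
def arbitrate(requests, priorities):
    """Return the index of the master granted the bus, or None if idle.

    requests   -- one boolean per master, True if it is requesting the bus
    priorities -- one integer per master; larger means more urgent
                  (this direction is an assumption of the sketch)
    Ties are resolved in fixed order: Master 0, Master 1, Master 2, ...
    """
    winner = None
    for master, (requesting, priority) in enumerate(zip(requests, priorities)):
        if not requesting:
            continue
        # Strict comparison: an equal priority never displaces a lower index.
        if winner is None or priority > priorities[winner]:
            winner = master
    return winner


# Masters 0 and 1 request with equal priority inputs; Master 2 is idle.
print(arbitrate([True, True, False], [1, 1, 3]))
```

Because the comparison is strict, a later master wins only when its priority input is genuinely higher, which reproduces the fixed order for equal inputs.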

    Clock and Power Management: The IBM PLB Arbiter Core supports clock and power
    management by gating clocks to all internal registers and providing a sleep request
    signal to a central clock and power management unit in the system. The IBM PLB Arbiter
    asserts this sleep request signal to indicate when it is permissible to shut off
    clocks to the arbiter.

    2.7.2 OPB TO PLB (OPB2PLB) Bridge:

    The On-Chip Peripheral Bus (OPB) to Processor Local Bus (PLB) Bridge translates OPB
    transactions into PLB transactions. It functions as a slave on the OPB side and a
    master on the PLB side. Since the MicroBlazes sit on the OPB, we added this OPB2PLB
    bridge so that they can communicate with the DDR RAM, which serves as the shared
    memory. The bridge performs both the address translation and the bus protocol
    translation for the MicroBlazes.

    High Level Description: The following figure shows a schematic diagram of the bridge

    Figure 2.7.2.1 schematic diagram of the OPB2PLB Bridge


    The OPB interface is designed with a pipelined architecture to improve timing and to
    support high clock frequencies. Input and output signals to the OPB are driven through
    flip-flops for better timing. Pipelining introduces some additional latency in the
    design, since some signals are delayed through registers. However, pipelining trades
    this transaction latency for higher clock frequencies.

    Address resolution and deadlock prevention:

    The bridge decoder ensures that the same location is not addressed both by a master on
    the PLB and by the bridge itself. To avoid such a situation, certain user-changeable
    parameters are provided.

    The bridge provides four channels, through which the master on the OPB side can
    address four contiguous memory regions on the PLB side. Each channel has a base
    address parameter and a high address parameter (C_RNG0_BASEADDR and C_RNG0_HIGHADDR
    for the first channel), which together define the address range that the bridge
    accesses on the PLB through that channel. The PLB base addresses are those obtained
    after negating the channel address for the OPB.
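The per-channel address ranges can be pictured as a simple decode step. The sketch below is illustrative only: the address values are invented, the helper names are hypothetical, and it models just the range check, not the bridge's address translation.

```python
# Invented (base, high) pairs for the four channels. The first pair stands
# for C_RNG0_BASEADDR / C_RNG0_HIGHADDR; the other three are assumed to
# follow the same pattern for the remaining channels.
RANGES = [
    (0x30000000, 0x3FFFFFFF),
    (0x40000000, 0x4000FFFF),
    (0x50000000, 0x5000FFFF),
    (0x60000000, 0x6000FFFF),
]


def find_channel(address, ranges=RANGES):
    """Return the channel whose range contains the address, or None."""
    for channel, (base, high) in enumerate(ranges):
        if base <= address <= high:
            return channel
    return None


def ranges_are_disjoint(ranges=RANGES):
    """Check that no two channel ranges overlap."""
    ordered = sorted(ranges)
    return all(prev_high < base
               for (_, prev_high), (base, _) in zip(ordered, ordered[1:]))


print(find_channel(0x30001000))   # falls inside the first range
print(find_channel(0x20000000))   # outside every range
print(ranges_are_disjoint())
```

A disjointness check of this kind is one way to catch configuration mistakes in which two channels claim the same addresses.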

    2.7.2 PLB TO OPB (PLB2OPB) Bridge:

    This Bridge acts as a master for the OPB bus and as a slave on the PLB bus. The user

    changeable parameters are similar to that of OPB2PLB Bridge. This bridge translates the

    transaction on the PLB bus to OPB bus.


    2.7.3 System View:

    Figure 2.7.3.1 System View


    2.8 Heterogeneous multiprocessor system with Power PC:

    The homogeneous multiprocessor system described previously has been further modified
    into a heterogeneous system by adding a PowerPC to it. The PowerPC sits on the PLB. We
    tested the system with some simple programs, which reported that all the hardware
    units were working.

    Figure 2.8.1 Heterogeneous multiprocessor system


    CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM

    3.1 Running uClinux:

    Once the hardware system was built, we began experimenting with it. We started with
    the goal of running the compiled kernel image discussed in section 2.4 on the system
    discussed in section 2.7 (the modified two-MicroBlaze system). The MicroBlaze running
    uClinux gives the flexibility of an operating system, whereas the standalone
    MicroBlaze in the same system provides dedicated computational power.

    We obtained the following output in HyperTerminal. It is the boot sequence of
    uClinux.

    Linux version 2.4.32-uc0 (jca052428@vindhyachal) (gcc version 3.4.1 ( Xilinx EDK

    8.1 Build EDK_I.17 090206 )) #1 Sun Nov 26 15:54:08 IST 2006

    On node 0 totalpages: 65536

    zone(0): 65536 pages.

    zone(1): 0 pages.

    zone(2): 0 pages.

    CPU: MICROBLAZE

    Kernel command line:

    Console: xmbserial on UARTLite

    Calibrating delay loop... 49.86 BogoMIPS

    Memory: 256MB = 256MB total

    POSIX conformance testing by UNIFIX

    xgpio #1 at 0x40020000 mapped to 0x40020000

    xgpio #2 at 0x40040000 mapped to 0x40040000

    Xilinx GPIO registered

    RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize

    eth0: using sgDMA mode.

    eth0: Xilinx EMAC #0 at 0x40C00000 mapped to 0x40C00000, irq=1

    eth0: id 2.0l; block id 11, type 1

    uclinux[mtd]: RAM probe address=0x3016209c size=0xd8000

    uclinux[mtd]: root filesystem index=0

    NET4: Linux TCP/IP 1.0 for NET4.0


    IP Protocols: ICMP, UDP, TCP

    IP: routing cache hash table of 2048 buckets, 16Kbytes

    TCP: Hash tables configured (established 16384 bind 32768)

    NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.

    VFS: Mounted root (romfs filesystem) readonly.

    Freeing init memory: 52K

    flatfsd: Nonexistent or bad flatfs (-114), creating new one...

    /bin/flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.

    flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.

    Abort

    Setting hostname:

    Setting up interface lo:

    Starting DHCP client:

    Starting thttpd:eth0: Link carrier lost.

    uclinux-auto login:

    3.2 Speed up obtained with Decryption algorithm:

    An application using the DES cryptography algorithm was used in our setup to measure
    the gains obtained with the architecture described above. The application accepting
    network packets, described in section 2.5, was executed on the MicroBlaze running
    uClinux. The standalone MicroBlaze ran the DES algorithm, encrypting portions of data
    placed in RAM by the master MicroBlaze. The following run times were achieved with
    varying numbers of MicroBlaze processors:

    Figure 3.2.1 Time taken with different numbers of processors


    Figure 3.2.2 Speed up obtained with different number of processors
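Speed-up figures of this kind are conventionally derived from run times as the ratio of the single-processor time to the N-processor time. The sketch below assumes that conventional definition. The function names are hypothetical and the timing values are placeholders chosen for the arithmetic, not measurements from this work.

```python
def speedup(time_single, time_parallel):
    """Speed-up: single-processor run time over N-processor run time."""
    return time_single / time_parallel


def efficiency(time_single, time_parallel, processors):
    """Speed-up per processor; 1.0 would mean perfectly linear scaling."""
    return speedup(time_single, time_parallel) / processors


# Placeholder run times in arbitrary units, keyed by processor count.
# These are NOT measured values from the experiments.
run_times = {1: 8.0, 2: 4.0, 4: 2.5}

for n, t in sorted(run_times.items()):
    print(n, speedup(run_times[1], t), efficiency(run_times[1], t, n))
```

Efficiency below 1.0 indicates overheads such as shared-memory traffic or bus contention growing with the processor count.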


    CHAPTER 4 PARALLELIZATION OF SMITH WATERMAN ALGORITHM

    4.1 Local Alignment: Smith Waterman algorithm:

    The most important fact of biological sequence analysis is that in biomolecular
    sequences (DNA, RNA or amino acid sequences), high sequence similarity usually implies
    significant functional or structural similarity.

    In many applications two strings may not be highly similar in their entirety but may
    contain regions that are highly similar. The task is to find and extract a pair of
    regions, one from each of the two given strings, that exhibit high similarity. This is
    called the local alignment or local similarity problem. The Smith-Waterman algorithm
    is one approach to this problem.

    The Smith-Waterman algorithm is a database search algorithm developed by T.F. Smith
    and M.S. Waterman, based on an earlier algorithm proposed by Needleman and Wunsch. The
    Smith-Waterman algorithm uses dynamic programming to find the best local alignment
    between any two given sequences. Based on certain criteria, usually a scoring matrix,
    scores and weights are assigned to each character-to-character comparison: positive
    for exact matches/substitutions, and usually negative for insertions/deletions. The
    exact scores are taken from the scoring matrix. The scores are then added together and
    the highest-scoring alignment is reported.

    The Smith-Waterman algorithm (the generic one) [18] is given below. Here s(xi,yj) is
    the score for aligning character xi with character yj, and d is the gap penalty.

    1 Declare an n x n similarity matrix F

    2 Initialize the top row (i=0) and left column (j=0) with 0

    3 for i = 1; i < length(sequence); i++ do

    4 for j = 1; j < length(sequence); j++ do

    5 F(i,j) = max(0, F(i-1,j-1) + s(xi,yj), F(i-1,j) - d, F(i,j-1) - d)

    6 Save index of term that contributed to the calculated value in F(i,j)

    7 end for

    8 end for

    9 Find maximum value in the n x n matrix

    10 Using saved indices in (6), trace back to first 0 encountered
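The ten steps above can be written out as a short runnable sketch. It is a plain sequential version for illustration, not the implementation used on the MicroBlaze system. The scoring values (match +2, mismatch -1, gap penalty 1) are assumptions of this example, and the matrix has one extra row and column so that index 0 holds the zero border.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Return (best score, aligned part of x, aligned part of y)."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]         # steps 1-2: zero border
    back = [[None] * cols for _ in range(rows)]   # step 6: saved indices
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):                      # steps 3-8
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),                              # start a new alignment
                (F[i - 1][j - 1] + s, (i - 1, j - 1)),  # match or substitution
                (F[i - 1][j] - gap, (i - 1, j)),        # gap in y
                (F[i][j - 1] - gap, (i, j - 1)),        # gap in x
            ]
            F[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:                    # step 9: track the maximum
                best, best_pos = F[i][j], (i, j)

    # Step 10: follow the saved indices back until a 0 is reached.
    out_x, out_y = [], []
    i, j = best_pos
    while F[i][j] > 0:
        pi, pj = back[i][j]
        out_x.append(x[i - 1] if pi == i - 1 else "-")
        out_y.append(y[j - 1] if pj == j - 1 else "-")
        i, j = pi, pj
    return best, "".join(reversed(out_x)), "".join(reversed(out_y))


print(smith_waterman("GGACGTCC", "TTACGTAA"))
```

Keeping the best position while filling the matrix avoids a second pass to find the maximum, at the cost of reporting only the first of several equally good alignments.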


    4.2.1 The parallelization: a Locally Sequential, Globally Parallel (LSGP) approach:

    To calculate new values on a fresh diagonal, all data dependencies come from the
    preceding diagonals. The following figure shows how two processors can be used for
    parallel computation.

    The dotted lines represent the cells computed by processor one and the solid lines
    represent the cells computed by the other processor; this is, in essence, the LSGP
    approach. For the first two diagonals, both processors compute all three cells. After
    that, the work is divided. The first processor keeps track of the number of cells to
    be computed by itself and by the other processor. Both processors compute their corner
    points first and then calculate the other points. Processor one (P1) calculates its
    rightmost cell and writes the result back to the shared memory. Similarly, the second
    processor (P2) does the same for the left corner point of its part of the diagonal.
    Although the data does not need to be transmitted every time, doing so brings symmetry
    to the process, so at the cost of some extra overhead the programming complexity is
    greatly reduced. Along with the count of the cells, P1 also tells P2 the index of the
    rightmost point, so that P2 can start working from that end. The indices are easy to
    calculate because all the cells that need to be calculated lie on the same diagonal.

    are on the same diagonal. This approach can be extended for further number of

    Figure 4.2.1.1 an illustration of LSGP approach

  • 8/14/2019 DESIGN, DEVELOPMENT AND PERFORMANCE EVALUATION OF MULTIPROCESSOR SYSTEMS ON FPGA

    39/48

    39

    processors. But with higher number of processors the pattern for incrementing or

    decrementing the number of cells calculated by each processor, for each diagonal has to

    follow a specific pattern. Here we have proposed a pattern, which we would like to

    explain with a three-processor system example.

    Incrementing pattern                    Decrementing pattern
    P1   P2   P3   Number of cells          P1   P2   P3   Number of cells
    2    1    1    4                        5    5    5    15
    2    2    1    5                        5    5    4    14
    2    2    2    6                        5    4    4    13
    3    2    2    7                        4    4    4    12
    3    3    2    8                        4    4    3    11

    4.2.2 Pseudo code for the processors

    We assume that the function Smith_Waterman(i,j) calculates the value of a cell when
    provided with its index, and that the function Trace_back() finally gives the required
    alignment. The variables number_of_cells_for_P1 and number_of_cells_for_P2 keep track
    of the number of cells each processor has to calculate on each diagonal. The two
    functions write_to_SM() and Read_from_SM() allow the processors to write data to and
    read data from the shared memory, respectively. The processors have data dependencies
    on each other. Each processor consults its local table to determine which cell values
    are required and have to be brought in from the shared memory. If the required data is
    already available locally, nothing needs to be done; otherwise the processor fetches
    the data, stores it in its local memory, and then proceeds with the other
    computations.

    Processor 1:

    // the total table is of size n x n

    Calculate the first two diagonals locally.

    number_of_cells_for_P1 = 2;

    number_of_cells_for_P2 = 1;

    For (i = 2 to n-1)
    {
        If (i mod 2 = 0) Then
        {
            number_of_cells_for_P1 = number_of_cells_for_P1 + 1;
        }
        If (i mod 3 = 0) Then
        {
            number_of_cells_for_P2 = number_of_cells_for_P2 + 1;
            write_to_SM(number_of_cells_for_P2);
        }
        row_index_for_P2 = i - number_of_cells_for_P1 - 1;
        column_index_for_P2 = i + number_of_cells_for_P1;
        write_to_SM(row_index_for_P2);
        write_to_SM(column_index_for_P2);
        While (iteration NOT EQUAL TO number_of_cells_for_P1)
        {
            result = Smith_Waterman(row_index, column_index);
            write_to_SM(result);
            row_index = row_index + 1;
            column_index = column_index - 1;
            iteration = iteration + 1;
        }
    }
    // Up to this point, P1 calculates cells from the upper triangle of the table.
    For (k = 0 to n-1)
    {
        // decrement number_of_cells_for_P2 and increment number_of_cells_for_P1
        if (result required for calculation of the cell
            (n-1, (n-1) - number_of_cells_for_P1) is not available)


        }
    } // this part is for the upper half of the table.
    For (k = 0 to n-1)
    {
        Read_from_SM(number_of_cells_for_P2);
        Read_from_SM(row_index_for_P2);
        Read_from_SM(column_index_for_P2);
        While (iteration NOT EQUAL TO number_of_cells_for_P2)
        {
            result = Smith_Waterman(row_index, column_index);
            write_to_SM(result);
            row_index = row_index + 1;
            column_index = column_index - 1;
            iteration = iteration + 1;
        }
    }
    } // code completed for the 2nd processor.
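The diagonal-by-diagonal division of work in section 4.2 can be simulated in a few lines. This is a sketch rather than the project's code: it runs on a single thread and only records which processor would own each cell. The helper names and the scoring values are assumptions. The split rule (an equal share, with leftover cells going to the lowest-numbered processors) is one simple rule that reproduces the three-processor pattern of section 4.2.1, for example 4 cells split as 2, 1, 1 and 8 cells as 3, 3, 2.

```python
def split_cells(total, processors):
    """Cells per processor on one diagonal; leftovers go to P1, P2, ..."""
    base, extra = divmod(total, processors)
    return [base + (1 if p < extra else 0) for p in range(processors)]


def cell_value(F, x, y, i, j, gap=1):
    """Smith-Waterman recurrence for one cell (match +2, mismatch -1)."""
    s = 2 if x[i - 1] == y[j - 1] else -1
    return max(0, F[i - 1][j - 1] + s, F[i - 1][j] - gap, F[i][j - 1] - gap)


def sequential_fill(x, y):
    """Reference: fill the matrix row by row."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            F[i][j] = cell_value(F, x, y, i, j)
    return F


def wavefront_fill(x, y, processors=2):
    """Fill the matrix one anti-diagonal at a time, splitting each diagonal
    into contiguous chunks, one chunk per (simulated) processor."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    work = [0] * processors                  # cells computed per processor
    for d in range(2, rows + cols - 1):      # d = i + j is constant on a diagonal
        first = max(1, d - cols + 1)
        last = min(rows - 1, d - 1)
        cells = [(i, d - i) for i in range(first, last + 1)]
        start = 0
        for p, count in enumerate(split_cells(len(cells), processors)):
            for i, j in cells[start:start + count]:
                # Only cells on earlier diagonals are read here.
                F[i][j] = cell_value(F, x, y, i, j)
            work[p] += count
            start += count
    return F, work


F, work = wavefront_fill("GGACGTCC", "TTACGTAA", processors=3)
print(F == sequential_fill("GGACGTCC", "TTACGTAA"), work)
```

Because every cell on a diagonal depends only on earlier diagonals, the chunks of one diagonal could be computed concurrently. On the real system only the boundary cells between neighbouring chunks would need to travel through the shared memory.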

    4.3 Speed up obtained with Smith-Waterman algorithm:

    [Figure: "Speed Up Comparisons" plot of number of cycles (y-axis, 0 to 10000) against
    string length (x-axis, 0 to 14) for P1, P2 and P3.]

    Figure 4.3.1 Speed up in case of Smith-Waterman algorithm


    The graph above was obtained when we implemented the parallelization described in
    section 4.2 for the Smith-Waterman algorithm.


    REFERENCES

    [1] What is a soft processor?, Xilinx, 2006-01-. http://www.xilinx.com/ipcenter/processor
    central/microblaze/doc/mb faq.pdf.

    [2] MicroBlaze Processor Reference Guide, Embedded Development Kit EDK 8.2i
    (UG081 v6.0), June 1, 2006.

    [3] D. Mattson and M. Christensson. Evaluation of synthesizable CPU cores. Master's
    thesis, 2004.

    [4] Standalone Board Support Package, EDK 8.2i, June 23, 2006.

    [5] LMB BRAM Interface Controller (v1.00b), DS452, February 22, 2006.

    [6] On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, December 2, 2005.

    [7] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative
    Approach, Third Edition. Morgan Kaufmann, May 2002.

    [8] Microprocessor Debug Module (MDM) (v2.00a), DS450, February 22, 2006.

    [9] Linux Operating System and Linux Distributions. http://linux.about.com.

    [10] Linux for Real Time Requirements, SoftTech Solutions P Ltd. www.isofttech.com/
    downloads/Linux-RTOS.pdf.

    [11] D. McCullough. uClinux for Linux programmers. In Linux Journal, Volume 2004,
    Issue 123 (July 2004), Page 7. ISSN 1075-3583. ACM Press, July 2004.

    [12] MicroBlaze Processor Reference Guide, Xilinx. http://www.xilinx.com
    /ise/embedded/edk7 1docs/mb ref guide.pdf.


    [13] MicroBlaze uClinux FAQ, John Williams, 2006-02-16.
    http://www.itee.uq.edu.au/-jwilliams/mblazeuclinux/Documentation/FAQ.html.

    [14] D. Stepner, N. Rajan, and D. Hui. Embedded application design using a real-time
    OS. In Proceedings of the 36th ACM/IEEE Conference on Design Automation, pages
    151-156. ACM Press, August 1999.

    [15] J. Williams. What does PetaLogix mean for the MicroBlaze uClinux community,
    PetaLogix, 2006-02-16. http://www.petalogix.org/ news events/petalogix announce

    [16] OS and Libraries Document Collection, Xilinx. http: //www.xilinx.com
    /ise/embedded /edk71 docs/oslibs rm.pdf.

    [17] PowerPC 405 Processor Block Reference Guide, UG018 (v2.0), August 20, 2004.

    [18] A Parallel Implementation of the Smith-Waterman Algorithm for Massive
    Sequences Searching, Hsien-Yu Liao, Meng-Lai Yin, Yi Cheng, Electrical and Computer
    Engineering Department, California State Polytechnic University, Pomona.

    [19] Pruning algorithm to reduce the search space of the Smith-Waterman algorithm,
    Farhan Ahmed, Department of Electrical and Computer Engineering, Lafayette College,
    Easton, PA.

    [20] A Parallel Implementation of Smith-Waterman Sequence Comparison Algorithm,
    Brian Hang Wai Yang, December 6, 2002.


    GLOSSARY

    FPGA Field Programmable Gate Array

    SMP Symmetric Multiprocessor System

    CLB Configurable Logic Block

    EDK Embedded Development Kit

    IP Intellectual Property

    SoC System on Chip

    RISC Reduced Instruction Set Computer

    OPB On-Chip Peripheral Bus

    LMB Local Memory Bus

    FSL Fast Simplex Link

    BRAM Block RAM

    XCL Xilinx Cache Link

    MDM Microprocessor Debug Module

    MMU Memory Management Unit

    TLB Translation Lookaside Buffer

    BSP Board Support Package

    ELF Executable and Linkable Format

    MHS Microprocessor Hardware Specification

    MSS Microprocessor Software Specification

    OPB2PLB OPB to PLB bridge