

DESIGN, DEVELOPMENT AND PERFORMANCE

EVALUATION OF MULTIPROCESSOR SYSTEMS ON

FPGA

A dissertation submitted in partial fulfillment of the requirements

for the degree of

Master of Technology

In

Computer Application

By

Somen Barma

(2005JCA2428)

Under the guidance of

Dr. Kolin Paul

(Department of Computer Science and Engineering)

Indian Institute of Technology Delhi

May 2007



CERTIFICATE

This is to certify that the project entitled "Design, Development and Performance Evaluation of Multiprocessor Systems on FPGA", submitted by Somen Barma in partial fulfillment of the requirements for the award of the degree of Master of Technology in Computer Applications to the Indian Institute of Technology, Delhi, is a record of bona fide work carried out by him under my supervision and guidance.

Dr. Kolin Paul

Department of Computer Science and Engineering

Indian Institute of Technology, New Delhi



ACKNOWLEDGEMENT

It is my pleasure and privilege to express my deep sense of gratitude, indebtedness and thankfulness to my guide, Dr. Kolin Paul, for his guidance, constant supervision, continuous inspiration and support throughout the course of this work. His valuable suggestions and critical evaluation have greatly helped me in the successful completion of the work.

I am also thankful to Prof. M. Balakrishnan for showing keen interest in solving critical

problems in this project.

I am also thankful to all those who helped me, directly or indirectly, in the completion of this work.

New Delhi
20th May, 2007

Somen Barma
2005JCA2428
IIT Delhi



CONTENTS

CERTIFICATE
ACKNOWLEDGEMENT
CONTENTS
ABSTRACT

CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Microblaze
1.3 Power PC
1.4 Stand alone board support package
1.5 uClinux

CHAPTER 2 METHODOLOGY AND WORK DONE
2.1 Designing two multiprocessor system
2.2 Matrix addition on the system
2.3 Two Microblaze system with DDR RAM
2.4 Building of uClinux
2.5 Creating new application for Ethernet packet handling
2.6 Microblaze shared memory system
2.7 Modified two Microblaze system
2.8 Heterogeneous multiprocessor system with Power PC

CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM
3.1 Running of uClinux
3.2 Speed up obtained with Decryption algorithm

CHAPTER 4 PARALLELIZATION OF SMITH WATERMAN ALGORITHM
4.1 Local Alignment: Smith Waterman algorithm
4.2 Parallelism in Smith Waterman and multiple processors System
4.3 Speed up obtained with Smith-Waterman algorithm

CHAPTER 5 DISCUSSION

CHAPTER 6 CONCLUSION AND FUTURE SCOPE

REFERENCES

GLOSSARY



ABSTRACT

In embedded systems, multiple processors can be used to enhance performance. Moreover, on FPGAs the cost and risk involved in developing such a system using soft-core processors are much lower. The aim of the project was to build a multiprocessor system that can be used for many embedded applications with enhanced performance.

We started with a single-processor system using Xilinx's soft-core processor, MicroBlaze. We then developed a two-processor system and resolved certain resource conflict issues. Up to that point the two processors were supported by independent standalone OSs, with limited capability and functionality. On this system we tried out a matrix addition application. Data was transferred between the processors through a shared BRAM (Block RAM) sitting on the OPB.

To get more functionality at the OS level we chose uClinux as our OS. The OS was built for our system using the available tool chain. For the OS to reside in, we added DDR RAM to the system. At present the system also has an Ethernet card and can handle network packets.

Finally we obtained both a homogeneous and a heterogeneous system. To evaluate the performance of the system we chose two applications: decryption and the Smith-Waterman local alignment algorithm.



CHAPTER 1 INTRODUCTION

During the past decade, there has been a dramatic increase in the number of applications within the commercial, medical and military markets requiring very high input/output bandwidth and real-time processing power. Powerful embedded systems that also offer flexible configurability and cost-effective features are needed to support the particular requirements of such applications.

The predominant way to meet these requirements is a multiprocessor system. There are several reasons for this: the possibility of using the best processing element for a particular function; the possibility of using off-the-shelf components; the possibility of allocating tasks with different timing characteristics separately (periodic and sporadic tasks, tasks with hard and soft real-time constraints) and so using the most appropriate local scheduling policy; the possibility of minimizing communication by allocating cooperating tasks to the same subsystem; and the possibility of placing processing elements close to the related sensors, so that distributed data can be managed, as far as possible, in a distributed manner.

A system in which all the processors are identical and are treated alike is called a symmetric multiprocessor (SMP) system. Here an attempt has been made to develop an SMP system on an FPGA.

An FPGA (Field Programmable Gate Array) is a programmable logic device. "Field programmable" means that the FPGA's function is defined by a user's program rather than by the manufacturer of the device. A typical integrated circuit performs a particular function defined at the time of manufacture. In contrast, the FPGA's function is defined by a program written by someone other than the device manufacturer. Depending on the particular device, the program is either "burned in" permanently or semi-permanently as part of a board assembly process, or is loaded from an external memory each time the device is powered up. This user programmability gives the user access to complex integrated designs without the high engineering costs associated with application-specific integrated circuits.

The CLB (configurable logic block) is the basic logic unit in an FPGA. Exact numbers and features vary from device to device, but every CLB consists of a configurable switch matrix with 4 or 6 inputs, some selection circuitry (MUX, etc.), and flip-flops.

While the CLB provides the logic capability, flexible interconnect routing carries the signals between CLBs and to and from the I/Os. Routing comes in several flavors, from that designed to interconnect CLBs, to fast horizontal and vertical long lines spanning the device, to global low-skew routing for clocking and other global signals. The design software hides the interconnect routing task from the user unless specified otherwise, thus significantly reducing design complexity.

We have used an FPGA board from Xilinx. The device on the board is an XC2VP30 in the ff896 package with speed grade -7. Xilinx also ships the EDK software with it, which can be used for functional specification, synthesis, place and route, and finally to download and debug the system.

1.1 Background:

Today many embedded products spread their solution over several chips, making it bigger, more expensive and more power hungry. The System on Chip (SoC) solution has long existed on Application Specific Integrated Circuits (ASICs) but is rather new on FPGA boards. FPGA boards have become bigger, faster and cheaper and are now able to handle an SoC solution, including a soft processor, which is an Intellectual Property (IP) core implemented using logic primitives. A key benefit is configurability: it is possible to add only what is needed in the design. The trade-off is performance; a hard processor is faster but less configurable and more expensive [1]. More and more companies are therefore looking into the possibility of using an SoC on an FPGA board with a soft processor, which makes it easier to develop and evaluate the solution.



1.2 Microblaze:

The soft-core processor used for this project is MicroBlaze [2]. The MicroBlaze embedded processor soft core is a reduced instruction set computer (RISC) with a 5-stage pipeline, optimized for implementation in Xilinx field programmable gate arrays (FPGAs). Figure 1.2.1 shows a functional block diagram of the MicroBlaze core.

Many aspects of the MicroBlaze can be configured at compile time owing to the configurable nature of FPGAs. Cache structure, peripherals, and interfaces can be customized to the application. In addition, hardware support for certain operations, such as multiplication, division, and floating-point arithmetic, can be added or removed.

Figure 1.2.1 MicroBlaze core block diagram

MicroBlaze does not have a memory management unit. It can run at a clock speed of 150 MHz.

The processor's fixed feature set includes:



Thirty-two 32-bit general purpose registers

32-bit instruction word with three operands and two addressing modes.

32-bit address bus.

Single issue pipeline.

The list below consists of some additional features that can be added to the MicroBlaze

[12].

Hardware barrel shifter: a digital circuit that can shift data by any number of bits in one operation, and a vital component in floating-point operations

Hardware divider

Instruction and data cache

On-chip peripheral bus (OPB)

Local memory bus (LMB)

Fast Simplex Link (FSL)

Xilinx CacheLink

1.2.1 Registers:

MicroBlaze provides two kinds of registers, general-purpose registers and special-purpose registers [2].

Volatile registers (caller-save) are temporary registers and do not retain their values across function calls. The volatile registers are R3-R12; R3 and R4 are used for returning values to the caller, and R5-R12 are used to pass parameters.

Non-volatile registers (callee-save) keep their values across function calls. The non-volatile registers are R19-R31.

Dedicated registers are the remaining registers. Registers R14-R17 are used to store return addresses from interrupts, subroutines, traps and exceptions. R0 always holds the value 0 and R1 is used to store the stack pointer. These registers should not be used for anything else.



1.2.2 Bus Interface:

MicroBlaze has several bus interfaces for use in different areas. It follows the Harvard architecture, in which separate paths are used for data and instruction accesses. An advantage of the Harvard architecture is that it makes it possible to read both instructions and data from memory at the same time [2].

On-chip Peripheral Bus

The OPB is a fully synchronous bus that provides access to both on-chip and off-chip peripherals. The bus is not intended to connect directly to the processor [3].

Local Memory Bus

The LMB is a fast local bus used to connect MicroBlaze to high-speed peripherals, mainly Block RAM (BRAM). LMB makes it possible to access BRAM in one clock cycle [12].

Fast Simplex Link Bus

FSL is a one-way point-to-point communication bus used between an output FIFO device and an input FIFO device. It has support for up to eight master and slave interfaces, and data can be transferred in two clock cycles [12].

Xilinx CacheLink

The Xilinx CacheLink (XCL) interface is a high-speed bus for external memory communication and is only available when the caches are enabled. XCL can be combined with an OPB, where one cache uses XCL and the other uses the OPB. Memory located outside the cacheable area is accessed through OPB or LMB [12].

Debug Interface

The debug interface is used with the Microprocessor Debug Module (MDM) and is controlled through the JTAG port by the Xilinx Microprocessor Debugger (XMD) [12].

An interesting comparison between the synthesizable processors MicroBlaze, LEON2 and OpenRISC 1200 is presented in a master's thesis from Chalmers University [3]. It compares their performance, configurability and usability. The MicroBlaze version used is 2.10.a; it performs well in the benchmarks, but it was found not to follow the floating-point standard.



1.3 Power PC 405:

The PowerPC 405 [17] is a 32-bit implementation of the PowerPC embedded-environment architecture, which is derived from the PowerPC architecture and is tailored to the requirements of embedded system development. The original PowerPC architecture is a 64-bit architecture with a 32-bit subset, whereas the PowerPC 405 is a 32-bit processor. Some other features of this processor are as follows:

1. Memory management optimized for embedded software environments.
2. Cache-management instructions for optimizing performance and memory control in complex applications that are graphically and numerically intensive.
3. A device-control-register address space for managing on-chip peripherals such as memory controllers.
4. A dual-level interrupt structure and interrupt-control instructions.
5. Multiple timer resources.

1.3.1 PowerPC 405 Hardware Organization:

1. Central Processing unit: The PowerPC 405 central-processing unit (CPU) implements

a 5-stage instruction pipeline consisting of fetch, decode, execute, write-back, and load

write-back stages. The fetch and decode logic sends a steady flow of instructions to the

execute unit. All instructions are decoded before they are forwarded to the execute unit.

Instructions are queued in the fetch queue if execution stalls. Up to two branches are

processed simultaneously by the fetch and decode logic. If a branch cannot be resolved

prior to execution, the fetch and decode logic predicts how that branch is resolved.

2. Memory Management Unit: The PowerPC 405 supports 4 GB of flat (non-segmented) address space. The memory management unit (MMU) provides address translation, protection functions, and storage attribute control for this address space. The MMU supports demand-paged virtual memory using multiple page sizes of 1 KB, 4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB and 16 MB. Multiple page sizes can improve memory efficiency and minimize the number of TLB misses.

1.4 Stand alone board support package:

Initially the MicroBlazes were tested with the standalone board support package (BSP) [4]. In the later systems too, every MicroBlaze other than the one running uClinux runs on the standalone BSP. The standalone BSP provides certain standard APIs that applications written in C may use. It is designed for use when an application accesses board or processor features directly (without an intervening OS layer).

Figure 1.3.1.1 Block Diagram of Power PC [17]



1.5 uClinux:

uClinux (pronounced "you-see-Linux") is a port of regular Linux intended for microprocessors that do not have a Memory Management Unit (MMU). It is a soft real-time OS [10]. In short, an MMU translates logical addresses into physical addresses. All requests for data are sent to the MMU, which then decides whether the data is in RAM or needs to be fetched from disk. It also decides whether the process has the right to access the memory it is trying to reach [25]. Without an MMU there is no memory protection or virtual memory, which leaves a bigger responsibility with the programmer not to write over the memory of other processes. The most noticeable effect for the programmer is that vfork() is used instead of fork(). Both create a child process that differs from the parent process only in its PID and parent PID [9]; with vfork(), however, the child shares the parent's memory and the parent is suspended until the child calls an exec function or _exit(). uClinux has been ported to many processor architectures; Motorola's ColdFire and DragonBall, Blackfin, ARM7TDMI and MicroBlaze are the ones most used. It exists as a derivative of the Linux 2.0, 2.4 and 2.6 kernels, but it is the 2.4 derivative that has been ported to the largest number of microprocessors [13].
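The use of vfork() in place of fork() can be illustrated with a minimal sketch. It is not taken from the project sources; the function name run_child is hypothetical, and the code assumes a POSIX-like C library that still provides vfork(), as the uClinux and Linux libraries do.

```c
#define _DEFAULT_SOURCE        /* expose vfork() in <unistd.h> on glibc */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the program at `path` with arguments `argv` and return its exit status,
 * or -1 if the child could not be created or did not exit normally. */
int run_child(const char *path, char *const argv[])
{
    int status;
    pid_t pid;

    assert(path != NULL && argv != NULL);
    pid = vfork();                 /* parent is suspended until exec/_exit */
    if (pid < 0)
        return -1;                 /* vfork failed */
    if (pid == 0) {
        /* Child: it borrows the parent's memory, so it may only call an
         * exec function or _exit(), and must never return from here. */
        execv(path, argv);
        _exit(127);                /* reached only if execv failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child runs in the parent's address space, it does nothing between vfork() and execv() that could modify the parent's variables or stack; this is the discipline that a system without an MMU imposes on the programmer.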

On Linux with virtual memory, whenever an application tries to write off the top of the stack, an exception is flagged and some more memory is mapped in at the top of the stack to allow the stack to grow. Under uClinux no such luxury is available, as the stack size must be fixed at compile time, so a stack overflow may cause crashes. Also, instead of a dynamic heap, uClinux uses a global memory pool, which is basically the kernel's free memory pool.

Interrupt service routines (ISRs) and kernel tasks share a common counting semaphore. Therefore, if a kernel task is holding the semaphore, an ISR may have to wait a long time before it is executed. Secondly, kernel tasks are non-preemptive, so even a high-priority user application may have to wait. This makes the response time of uClinux longer; it is therefore not a hard real-time OS.

A critical issue, however, is that we cannot use a single uClinux instance for all the MicroBlazes. This is because the system has no provision for a single interrupt controller serving all of them. Secondly, there is always a cache coherency problem, for which we do not have a hardware solution.



CHAPTER 2 METHODOLOGY AND WORK DONE

2.1 Designing two multiprocessor system:

Initially a system was built with a single processor (MicroBlaze). The MicroBlaze is instantiated with its local BRAM. As the MicroBlaze has a Harvard architecture, the BRAM is connected to it through both an instruction BRAM controller and a data BRAM controller [5]. The MicroBlaze sits on the OPB (On-chip Peripheral Bus).

Once this system was working, the next step was to add the second processor. It was added to the system using the wizard, although the same result could have been obtained by editing the MHS file. The new processor first has to be provided with its own BRAM, where the data and instructions local to the processor reside. Consequently, two more controllers are required for that BRAM too. A block diagram of the resulting system is shown in Figure 2.1.1.

Figure 2.1.1 Basic two Microblaze system



The BRAM sitting on the OPB acts as a shared memory for the two processors. Any address on the OPB is visible to both MicroBlazes. MicroBlaze uses memory-mapped I/O, so both of them can access any resource on the OPB.

This ability, however, causes another problem: resource conflict. The bus arbiter grants the bus to the MicroBlazes in round-robin fashion. So it may happen that while one MicroBlaze is transferring data, say to the UART, the other one gets hold of the bus and starts transmitting too. This can be solved with the bus parking facility of the bus, so that the other MicroBlaze does not get a chance until the first one finishes.

This scheme has a flaw, however. Suppose there are multiple resources on the OPB and the two processors are using different resources. If we block the bus in that case, one of the two has to wait unnecessarily. This is removed by introducing a custom IP.

The sync IP solves the problem. Whenever a process on a MicroBlaze wants to use a resource on the bus, it registers its processor ID and process ID in a software-addressable register for that particular resource, i.e. it locks the resource. Later on, the same processor can release the lock. With this approach, however, there is a chance of deadlock. In the above system both MicroBlazes have BSP support but no operating system.
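The locking protocol of the sync IP can be sketched as follows. The register layout and the hardware write rule are assumptions made for illustration, since the text does not give them: one 32-bit word per resource, zero meaning free, and a claim accepted by the hardware only while the word is free. The function sync_hw_write is a host-side stand-in for that assumed hardware behaviour, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SYNC_FREE 0u

/* One lock word per shared resource (hypothetical layout). */
typedef struct {
    volatile uint32_t owner;   /* SYNC_FREE, or (cpu_id << 16) | process_id */
} sync_reg_t;

/* Owner tag; cpu_id starts at 1 so that a valid tag is never SYNC_FREE. */
static uint32_t sync_tag(uint16_t cpu_id, uint16_t process_id)
{
    assert(cpu_id != 0);
    return ((uint32_t)cpu_id << 16) | process_id;
}

/* Stand-in for the assumed hardware rule: a claim is accepted only while
 * the register is free, and only the current owner may clear it. */
static void sync_hw_write(sync_reg_t *reg, uint32_t writer, uint32_t value)
{
    if (value != SYNC_FREE && reg->owner == SYNC_FREE)
        reg->owner = value;
    else if (value == SYNC_FREE && reg->owner == writer)
        reg->owner = SYNC_FREE;
}

/* Try to lock the resource: returns 1 on success, 0 if another owner holds it. */
int sync_try_lock(sync_reg_t *reg, uint16_t cpu_id, uint16_t process_id)
{
    uint32_t tag = sync_tag(cpu_id, process_id);
    sync_hw_write(reg, tag, tag);
    return reg->owner == tag;      /* read back to see who won */
}

/* Release the resource; ignored unless the caller is the current owner. */
void sync_unlock(sync_reg_t *reg, uint16_t cpu_id, uint16_t process_id)
{
    sync_hw_write(reg, sync_tag(cpu_id, process_id), SYNC_FREE);
}
```

With one register per resource, a processor printing on the UART does not block another processor that is using a different peripheral. The deadlock risk appears when two processors each hold one lock and wait for the other's; acquiring locks in a fixed global order is one common way to avoid it.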

After the hardware platform design is complete, one can generate an FPGA configuration bitstream. We use XPS to build the bitstream and the netlist. At this point we have only a "hardware bitstream", which is not ready to be applied to an FPGA until the software component of the embedded system is included.

After the embedded software development is complete, we can choose one of the following ways to run it on the hardware:

1. If the application executable resides in on-chip memory regions, it is possible to merge the Executable and Linkable Format (ELF) file into the hardware bitstream so that it gets loaded into on-chip memory, ready to execute, every time the FPGA is configured.

2. During prototyping, XPS can dynamically download the executable to the board via the JTAG cable connected to the FPGA. In this case, we select a bootloop to be merged into the bitstream to initialize on-chip memory so that the processor remains in a static state until software downloading can be completed.

3. For production systems, one can store executables residing in off-chip memory regions in a non-volatile memory device, such as a flash Programmable ROM (PROM), or along with the configuration bitstream in a System ACE device. In this case, one would configure a bootloader executable to be merged into the hardware bitstream to initialize on-chip memory. Then each time the FPGA is configured or reset, the bootloader copies the application executable to a suitable (volatile) memory device and starts it running.

We have used JTAG to download the ELF file along with the bitstream to the board. The standard output of the system is the UART.

2.2 Matrix addition on the system:

The system has been tested with matrix addition, and works as follows. The first MicroBlaze holds the data on which the matrix addition is to be performed. It writes the data to the shared BRAM.

Until the writing is complete, the other MicroBlaze waits and keeps checking a particular bit, which is set to flag that the data has been written completely.

MicroBlaze 2 then starts reading the data. Once this is done it calculates the sum and writes the result back to a different location in the BRAM. The first MicroBlaze then picks the result up and displays it on the UART (HyperTerminal).
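The handshake described above can be sketched in C as follows. The memory layout, the matrix size and the function names are assumptions made for illustration; in the real system the structure would be placed at the address of the shared BRAM, and each function would run on the processor named in its comment.

```c
#include <assert.h>
#include <stdint.h>

#define MAT_N 4   /* matrix dimension: illustrative, not from the project */

/* Hypothetical layout of the shared BRAM region. */
typedef struct {
    volatile uint32_t data_ready;     /* set by processor 1 after writing a and b */
    volatile uint32_t result_ready;   /* set by processor 2 after writing sum */
    volatile int32_t a[MAT_N][MAT_N];
    volatile int32_t b[MAT_N][MAT_N];
    volatile int32_t sum[MAT_N][MAT_N];
} shared_mem_t;

/* Processor 1: publish the operands, then raise the flag. */
void producer_write(shared_mem_t *shm, int32_t a[MAT_N][MAT_N], int32_t b[MAT_N][MAT_N])
{
    assert(shm != 0);
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++) {
            shm->a[i][j] = a[i][j];
            shm->b[i][j] = b[i][j];
        }
    shm->result_ready = 0;
    shm->data_ready = 1;          /* written last: the operands are complete */
}

/* Processor 2: one polling step. Returns 1 if it performed the addition. */
int consumer_poll(shared_mem_t *shm)
{
    if (!shm->data_ready)
        return 0;                 /* data not yet complete, keep waiting */
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++)
            shm->sum[i][j] = shm->a[i][j] + shm->b[i][j];
    shm->data_ready = 0;
    shm->result_ready = 1;        /* written last: the result is complete */
    return 1;
}

/* Processor 1: collect the result once it is flagged. Returns 1 on success. */
int producer_read(shared_mem_t *shm, int32_t out[MAT_N][MAT_N])
{
    if (!shm->result_ready)
        return 0;
    for (int i = 0; i < MAT_N; i++)
        for (int j = 0; j < MAT_N; j++)
            out[i][j] = shm->sum[i][j];
    return 1;
}
```

On the target, processor 2 would call consumer_poll() in a loop until it returns 1, and processor 1 would do the same with producer_read(). The flags are written only after the data they guard, so a reader that sees a flag set also sees complete data, provided the shared region is not cached.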


Figure 2.2.1 Schematic representation of matrix addition

2.3 Two Microblaze system with DDR RAM:

The system described above has been extended by adding DDR SDRAM. This RAM is required to accommodate the OS (uClinux). The RAM is also attached to the OPB. The caches of MicroBlaze 1 have now been instantiated.

2.3.1 Cache

The cache consists of an instruction cache and a data cache, each controlled by one bit in the Machine Status Register (MSR) [12]. They are 1-way associative (direct mapped): each block (a collection of data containing the requested word) can be placed at only one location in the cache [6]. The memory can be divided into cacheable and non-cacheable segments, making it possible to tell exactly what to cache. The only address space that cannot be cached is the LMB address space. The data cache uses write-through, where the cache is mirrored in main memory by writing to memory on each cache write [6]. The cache can be used with either the OPB interface or the dedicated XCL interface, or with a combination of the two. The differences between these interfaces are [12]:

CacheLink uses a 4-word cache block (critical word first). It takes the requested word and the next three, which increases the hit rate; OPB uses a single-word cache block.

CacheLink uses a dedicated bus interface for memory accesses. This reduces traffic on the OPB.

The CacheLink interface requires a specialized memory controller interface. The OPB interface uses standard OPB memory controllers.

CacheLink allows posted write accesses on write misses. OPB caches require the write access to be completed before execution is resumed (data cache only).
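What "direct mapped" means for a lookup can be shown with a short sketch. The cache size below is an illustrative assumption, the 16-byte block corresponds to the 4-word CacheLink block mentioned above, and the names are hypothetical; the code models only the hit/miss decision, not the MicroBlaze cache itself.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES  16u                    /* 4 words per block, as with CacheLink */
#define CACHE_BYTES 8192u                  /* illustrative cache size */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
} dm_cache_t;

/* The one line that an address can occupy in a direct-mapped cache. */
uint32_t dm_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_LINES;
}

/* The remaining upper address bits, stored to identify the cached block. */
uint32_t dm_tag(uint32_t addr)
{
    return addr / CACHE_BYTES;
}

/* Look up addr; on a miss the block replaces whatever was in that line.
 * Returns 1 on a hit and 0 on a miss. */
int dm_access(dm_cache_t *c, uint32_t addr)
{
    uint32_t i;

    assert(c != 0);
    i = dm_index(addr);
    if (c->valid[i] && c->tag[i] == dm_tag(addr))
        return 1;
    c->valid[i] = 1;
    c->tag[i] = dm_tag(addr);
    return 0;
}
```

Two addresses that are a multiple of the cache size apart map to the same line, so alternating accesses to them evict each other every time. This is the price of the simple one-place-per-block organization.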

2.3.2 XMD debug module:

To be able to debug the MicroBlazes and the programs, the second MicroBlaze also has to be connected to the XMD debug module [8]. This is done by connecting the following ports of the debug module to the MicroBlaze:

PORT DBG_CAPTURE_0 = DBG_CAPTURE_s
PORT DBG_CLK_0 = DBG_CLK_s
PORT DBG_REG_EN_0 = DBG_REG_EN_s
PORT DBG_TDI_0 = DBG_TDI_s
PORT DBG_TDO_0 = DBG_TDO_s
PORT DBG_UPDATE_0 = DBG_UPDATE_s



Supports unicast, multicast, and broadcast transmit and receive modes as well as promiscuous and 64-entry Content Addressable Memory (CAM) based receive modes

Auto source address field insertion or overwrite or pass through for transmission

The EMAC [10] Interface design is a soft intellectual property (IP) core designed for

implementation in several Xilinx FPGAs. It supports the IEEE Std. 802.3 Media

Independent Interface (MII) to industry standard Physical Layer (PHY) devices and

communicates to a processor via an IBM On-Chip Peripheral Bus (OPB) interface. The

design provides a 10 Megabits per second (Mbps) and 100 Mbps (also known as Fast

Ethernet) EMAC Interface. This design includes many of the functions and the flexibility

found in dedicated Ethernet controller devices currently on the market.

2.4 Building of uClinux:

In this step, the uClinux auto-configure mechanism is used to map/export the EDK

hardware design to the uClinux kernel build mechanism. This is done by configuring and

generating the BSP. The library generator uses the configuration information and the

hardware design database to export an auto-config.in file, which contains all the

information about the hardware design. This auto-config.in file is in the uClinux

configuration file format. The uClinux sources, makefiles, and other scripts are built with

conditional code to generate the correct software based on the hardware described in the

auto-config.in file. This flow effectively means that retargeting uClinux for any hardware

setup can be done quickly. To retarget uClinux, regenerate the BSP for the new hardware

setup. By rebuilding uClinux with the newly generated auto-config.in file, an image targeting the new hardware setup is compiled by the uClinux tools. The tools and related sources can be obtained from PetaLogix [15]. To generate the uClinux image we have to use a Linux system. Once the auto-config.in file is placed in the proper location in the uClinux distribution, the following commands can be used from the tool chain to generate the image.



1. Go to the uClinux distribution directory and run:

cd home/devel/uclinux/src/uClinux-dist
make clean
make xconfig

2. Set the vendor option to Xilinx.

3. Choose the OS as uClinux auto.

4. Select the other libraries.

Once the kernel image is ready, we take it back to the host system. From the host system we download the image to the RAM using the hardware debug module. We used the following commands:

cd binaries
dow -data tmicro.bin 0x30000000
mwr 0x100 0
rwr 5 0x100
rwr pc 0x30000000
con

That is, we load the kernel image tmicro.bin at memory location 0x30000000, write 0 to address 0x100, set register R5 to 0x100, set the program counter to the load address and ask the processor to start.

2.5 Creating new application for Ethernet packet handling:

Applications [14] can be developed separately or as a software application project in XPS; they can also be created in the Software Development Kit (SDK) by importing the XPS design. All applications need to link to the Xilkernel library in order to get the kernel functionality [16]. The kernel and its applications can then be downloaded to the target from within EDK and SDK.

The applications in Xilkernel can be run in two ways. One method is to build the applications into the kernel, giving a single executable file. The other method is to have each application as a separate executable file, as in a conventional operating system, where the kernel is a separate image file and the applications are separate files.

Here we first built the application by putting it into the source path and then updating the makefile. Later on, however, we built only the specific application. To download the application, we connect the host to the MicroBlaze system through the network and run an FTP session from the host machine to the system. We downloaded the application to the directory /var/tmp, changed the mode of the file to make it executable, and executed it. This approach helped us avoid the tedious job of rebuilding the OS image every time.

The application accepts packets over the network using the TCP/IP protocol. The payload is then copied to a particular location in DDR RAM. The other MicroBlaze picks the data up from that location and performs the required operation in parallel.
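The receiving side of such an application might look like the sketch below. It is not the project's code: the buffer layout, the 1024-byte capacity and the convention that a payload ends when the peer closes the connection are all assumptions, and shared_buf_t merely stands in for the agreed location in DDR RAM. The descriptor fd is assumed to be an already connected TCP socket.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical shared region: a length word followed by the payload bytes. */
typedef struct {
    volatile uint32_t length;      /* written last; nonzero means data is ready */
    uint8_t payload[1024];
} shared_buf_t;

/* Read from the connected descriptor until the peer closes or the buffer is
 * full, then publish the byte count. Returns bytes stored, or -1 on error. */
long receive_payload(int fd, shared_buf_t *shared)
{
    size_t used = 0;

    assert(shared != 0);
    shared->length = 0;                        /* mark the region as not ready */
    while (used < sizeof shared->payload) {
        ssize_t n = read(fd, shared->payload + used,
                         sizeof shared->payload - used);
        if (n < 0) {
            if (errno == EINTR)
                continue;                      /* interrupted: retry the read */
            return -1;
        }
        if (n == 0)
            break;                             /* peer closed the connection */
        used += (size_t)n;
    }
    shared->length = (uint32_t)used;           /* publish after the data */
    return (long)used;
}
```

The slave processor would poll the length word and start working once it becomes nonzero, in the same way as the flag is polled in the matrix addition test.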

2.6 MicroBlaze shared memory Multiprocessing:

The following are possible ways of creating a shared memory multiprocessor system (taken from John Williams' report). We evaluate the pros and cons and finally choose a custom design for our implementation.

2.6.1 Implicit Multiprocessing:

With this mechanism, parallelism is hidden by the OS and the hardware. A single copy of the operating system runs and controls all the processors in the system. The following figure shows an implementation of an N-processor SMP (Symmetric Multiprocessor).

The MicroBlaze soft-core processor does not have hardware support for cache coherency, and a software implementation of cache coherency has a severe performance impact. The MicroBlaze core also lacks distributed interrupt management.



the rest of the Microblazes on another OPB. This system also has the scope of further

scaling up by attaching more buses and Microblazes on them.

In order to achieve parallelism with minimum memory overhead and bus scaling issues, we have developed a system with a master-slave relationship. The master MicroBlaze is the only MicroBlaze in the system running the uClinux OS. The remaining MicroBlaze processors act as slaves and use the standalone BSP. The MicroBlaze running uClinux is responsible for providing all the benefits of an advanced OS to the complete system and for various synchronization activities. In the application explored in this project, the master processor collects data over the network and makes it available to the other processors. The above figure shows the architecture of the system.

Bridges (OPB2PLB and PLB2OPB) allow the MicroBlazes to access the DDR RAM and other peripherals that are not on their own OPB but are mapped into their address space.

2.7.1 Processor Local Bus (PLB):

The Xilinx 64-bit Processor Local Bus (PLB) consists of a bus control unit, a watchdog timer, and separate address, write, and read data path units, with a three-cycle only arbitration feature.

Some operations that are specific to the PLB arbitration logic are:

1. Single Read Transfer Bus Time-Out
2. Single Write Transfer Bus Time-Out
3. Line Read Transfer Bus Time-Out
4. Line Write Transfer Bus Time-Out
5. Burst Read Transfer Bus Time-Out
6. Burst Write Transfer Bus Time-Out
7. Pipelined Read Transfer
8. Pipelined Write Transfer



Arbitration Priority: The Xilinx PLB implements fixed priority when two or more

masters have the same priority inputs. Priority order in this case is Master 0, Master 1,

Master 2, and Master N.
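The tie-breaking rule can be expressed as a small sketch. It is a software model only, with hypothetical names; it assumes that a numerically larger priority input means a more urgent request, which the text does not state.

```c
#include <assert.h>

#define NO_MASTER (-1)

/* Pick the master to grant. requesting[i] is nonzero if master i requests the
 * bus and priority[i] is its priority input (larger wins: an assumption).
 * Equal priorities fall back to the fixed order Master 0, Master 1, ... */
int plb_arbitrate(const int requesting[], const unsigned priority[], int num_masters)
{
    int winner = NO_MASTER;

    assert(num_masters > 0);
    for (int i = 0; i < num_masters; i++) {
        if (!requesting[i])
            continue;
        /* the strict '>' keeps the lower-numbered master on a tie */
        if (winner == NO_MASTER || priority[i] > priority[winner])
            winner = i;
    }
    return winner;
}
```

For example, if masters 1, 2 and 3 all request the bus with the same priority input, the function returns 1, matching the fixed order given above.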

Clock and Power Management: The IBM PLB Arbiter Core supports clock and power management by gating clocks to all internal registers and providing a sleep request signal to a central clock and power management unit in the system. This sleep request signal is asserted by the IBM PLB Arbiter to indicate when it is permissible to shut off clocks to the arbiter.

2.7.2 OPB TO PLB (OPB2PLB) Bridge:

The On-Chip Peripheral Bus (OPB) to Processor Local Bus (PLB) Bridge translates OPB transactions into PLB transactions. It functions as a slave on the OPB side and a master on the PLB side. Since the MicroBlazes are on the OPB, we have added an OPB2PLB bridge so that they can communicate with the DDR RAM, the shared memory. This bridge provides both the address translation and the bus protocol translation for the MicroBlazes.

High Level Description: The following figure shows a schematic diagram of the bridge.

Figure 2.7.2.1 Schematic diagram of the OPB2PLB Bridge



The OPB interface is designed with a pipelined architecture to improve timing and to

support high clock frequencies. Input and Output signals to the OPB are designed to be

driven through flip-flops for better timing. Pipelining introduces some additional latency
in the design, since some signals are delayed through registers. However, pipelining
trades this added transaction latency for higher clock frequencies.

Address resolution and deadlock prevention:

The bridge decoder ensures that the same location is not addressed both by a master on
the PLB and by the bridge itself. To avoid such a situation, certain user-changeable
parameters are provided.

The bridge provides four channels through which the master on the OPB side can address
four contiguous memory regions on the PLB side. Each channel has a base-address and a
high-address parameter (for example C_RNG0_BASEADDR and C_RNG0_HIGHADDR), which together
define the address range that the bridge accesses on the PLB through that channel. The
PLB base addresses are those obtained after negating the channel address for the OPB.
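The range check performed per channel can be sketched as follows. This is only an illustration of base/high address decoding, not the bridge's implementation: the address values are made up, and `match_channel` and `ranges_overlap` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

# Made-up (base, high) pairs standing in for the four channel ranges.
CHANNEL_RANGES: List[Tuple[int, int]] = [
    (0x10000000, 0x1FFFFFFF),
    (0x20000000, 0x2000FFFF),
    (0x30000000, 0x300000FF),
    (0x40000000, 0x4FFFFFFF),
]


def match_channel(addr: int, ranges: List[Tuple[int, int]]) -> Optional[int]:
    """Return the index of the channel whose range contains addr, else None."""
    for channel, (base, high) in enumerate(ranges):
        if base <= addr <= high:
            return channel
    return None  # the bridge would not respond to this address


def ranges_overlap(ranges: List[Tuple[int, int]]) -> bool:
    """True if any two ranges share an address (a configuration error)."""
    ordered = sorted(ranges)
    return any(prev[1] >= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


print(match_channel(0x20000010, CHANNEL_RANGES))  # 1
print(match_channel(0x50000000, CHANNEL_RANGES))  # None
print(ranges_overlap(CHANNEL_RANGES))             # False
```

Checking for overlaps at configuration time is one simple way to catch a setup in which two channels would claim the same address.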

2.7.2 PLB TO OPB (PLB2OPB) Bridge:

This bridge acts as a master on the OPB and as a slave on the PLB. Its user-changeable
parameters are similar to those of the OPB2PLB bridge. It translates transactions on the
PLB into transactions on the OPB.



2.7.3 System View:

Figure 2.7.3.1 System View



2.8 Heterogeneous multiprocessor system with Power PC:

The previously described homogeneous multiprocessor system was further extended into a
heterogeneous system by adding a PowerPC, which sits on the PLB. We tested the system
with some simple programs, which reported that all the hardware units are working.

Figure 2.8.1 Heterogeneous multiprocessor system



CHAPTER 3 EXPERIMENTS ON THE DEVELOPED SYSTEM

3.1 Running of UClinux:

Once the hardware system was built, we began experimenting with it. Our first goal was to
run the compiled kernel image discussed in section 2.4 on the modified two-MicroBlaze
system discussed in section 2.7. The MicroBlaze running uClinux provides the flexibility
of an operating system, whereas the standalone MicroBlaze in the same system provides raw
computational power.

We obtained the following output in HyperTerminal. It is the boot sequence of uClinux.

Linux version 2.4.32-uc0 (jca052428@vindhyachal) (gcc version 3.4.1 ( Xilinx EDK 8.1 Build EDK_I.17 090206 )) #1 Sun Nov 26 15:54:08 IST 2006

On node 0 totalpages: 65536

zone(0): 65536 pages.

zone(1): 0 pages.

zone(2): 0 pages.

CPU: MICROBLAZE

Kernel command line:

Console: xmbserial on UARTLite

Calibrating delay loop... 49.86 BogoMIPS

Memory: 256MB = 256MB total

POSIX conformance testing by UNIFIX

xgpio #1 at 0x40020000 mapped to 0x40020000

xgpio #2 at 0x40040000 mapped to 0x40040000

Xilinx GPIO registered

RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize

eth0: using sgDMA mode.

eth0: Xilinx EMAC #0 at 0x40C00000 mapped to 0x40C00000, irq=1

eth0: id 2.0l; block id 11, type 1

uclinux[mtd]: RAM probe address=0x3016209c size=0xd8000

uclinux[mtd]: root filesystem index=0

NET4: Linux TCP/IP 1.0 for NET4.0



IP Protocols: ICMP, UDP, TCP

IP: routing cache hash table of 2048 buckets, 16Kbytes

TCP: Hash tables configured (established 16384 bind 32768)

NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.

VFS: Mounted root (romfs filesystem) readonly.

Freeing init memory: 52K

flatfsd: Nonexistent or bad flatfs (-114), creating new one...

/bin/flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.

flatfsd: mtd.c: 156: flat_dev_close: Assertion `flatinfo.fd != -1' failed.

Abort

Setting hostname:

Setting up interface lo:

Starting DHCP client:

Starting thttpd:eth0: Link carrier lost.

uclinux-auto login:

3.2 Speed up obtained with Decryption algorithm:

An application using the DES cryptography algorithm was used in our setup to measure the
gains from the architecture described above. The application accepting network packets,
as described in section 2.5, was executed on the MicroBlaze running uClinux. The
standalone MicroBlaze ran the DES algorithm to encrypt portions of data placed in RAM by
the master MicroBlaze. The following run times were achieved using varying numbers of
MicroBlaze processors:

Figure 3.2.1 Time taken by different numbers of processors



Figure 3.2.2 Speed up obtained with different number of processors
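For reference, speed-up is conventionally the single-processor run time divided by the run time on p processors, and parallel efficiency is that ratio divided by p. The sketch below only shows the arithmetic; the run times in it are placeholders chosen for illustration and are not the measurements behind the figures.

```python
def speedup(t_single, t_parallel):
    """Speed-up S = T(1) / T(p)."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, processors):
    """Parallel efficiency E = S / p (1.0 means ideal scaling)."""
    return speedup(t_single, t_parallel) / processors


# Placeholder run times in arbitrary units, NOT measured values.
run_times = {1: 12.0, 2: 6.5, 3: 4.8}

for p, t in run_times.items():
    s = speedup(run_times[1], t)
    e = efficiency(run_times[1], t, p)
    print(f"{p} processor(s): speed-up {s:.2f}, efficiency {e:.2f}")
```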



CHAPTER 4 PARALLELIZATION OF SMITH WATERMAN ALGORITHM

4.1 Local Alignment: Smith Waterman algorithm:

The most important fact of biological sequence analysis is that in biomolecular sequences
(DNA, RNA or amino acid sequences), high sequence similarity usually implies
significant functional or structural similarity.

In many applications two strings may not be highly similar in their entirety but may
contain regions that are highly similar. The task is to find and extract a pair of regions,
one from each of the two given strings, that exhibit high similarity. This is called the
local alignment or local similarity problem. The Smith-Waterman algorithm is one approach
to it.

The Smith-Waterman algorithm is a database search algorithm developed by T.F. Smith
and M.S. Waterman, and is based on the earlier algorithm of Needleman and Wunsch. It uses
dynamic programming to find the best local alignment between any two given sequences.
Based on a certain criterion, usually a scoring matrix, scores and weights are assigned to
each character-to-character comparison: positive for exact matches/substitutions, and
usually negative for insertions/deletions. The scores are then added together and the
highest-scoring alignment is reported.

The generic Smith-Waterman algorithm [18] is given below, where s(xi,yj) is the
substitution score for characters xi and yj and d is the gap penalty:

1 Declare a nxn similarity matrix
2 Initialize the top row (i=0) and left column (j=0) with 0
3 for i = 1; i < length(sequence); i++ do
4 for j = 1; j < length(sequence); j++ do
5 F(i,j) = max(0, F(i-1,j-1) + s(xi,yj), F(i-1,j) - d, F(i,j-1) - d)
6 Save index of term that contributed to the calculated value in F(i,j)
7 end for
8 end for
9 Find maximum value in nxn matrix
10 Using saved indices in (6), trace back to first 0 encountered
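The ten steps above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the scoring values (match +2, mismatch -1, gap penalty 1) and the function name `smith_waterman` are assumptions, and a simple match/mismatch score stands in for a full scoring matrix.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Return (best_score, aligned_x, aligned_y) for the best local alignment.

    The scoring values are illustrative assumptions, not taken from the text.
    """
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]         # steps 1-2: row 0, column 0 stay 0
    back = [[None] * cols for _ in range(rows)]   # step 6: which term produced F[i][j]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):                      # steps 3-8: fill the matrix
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (0, None),                        # start a fresh local alignment
                (F[i - 1][j - 1] + s, "diag"),    # match or substitution
                (F[i - 1][j] - gap, "up"),        # gap in y
                (F[i][j - 1] - gap, "left"),      # gap in x
            ]
            F[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:                    # step 9: track the maximum
                best, best_pos = F[i][j], (i, j)

    i, j = best_pos                               # step 10: trace back to the first 0
    ax, ay = [], []
    while back[i][j] is not None:
        move = back[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("XXACGTXX", "YYACGTYY"))  # (8, 'ACGT', 'ACGT')
```

Because a cell whose maximum is 0 stores no back-pointer, the traceback stops exactly at the first 0 it meets, as step 10 requires.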



4.2.1 The parallelization: a Locally Sequential, Globally Parallel (LSGP) approach:

To calculate new values on a fresh diagonal, all the data dependencies are on the previous
diagonals, so the cells of a single diagonal can be computed independently of one another.
The following figure shows how two processors can be used for parallel computation.

Figure 4.2.1.1 An illustration of the LSGP approach

The dotted lines represent the cells that are computed by processor one and the solid
lines represent the cells that are computed by the other processor; this is, in essence,
the LSGP approach. For the first two diagonals, both processors compute all three cells.
After that the division of work begins. The first processor keeps track of the number of
cells to be computed by itself and by the other processor. Both of them compute their
corner points first and then calculate the other points. Processor number one (P1)
calculates its rightmost cell and writes the result back to the shared memory. Similarly,
the second processor (P2) does the same for the left corner point of its part of the
diagonal. The data does not need to be transmitted every time, but doing so brings
symmetry to the process, so at the cost of some extra overhead the programming complexity
is reduced a great deal. Along with the count of the cells, P1 also lets P2 know the index
of the rightmost point, so that the latter can start working from that end. The indices
are easy to calculate because all the cells that need to be calculated lie on the same
diagonal. This approach can be extended to a larger number of processors. With more
processors, however, the number of cells calculated by each processor has to be
incremented or decremented from diagonal to diagonal following a specific pattern. Here we
propose such a pattern, which we explain with a three-processor example.

Incrementing pattern    Number of cells    Decrementing pattern    Number of cells
P1  P2  P3                                 P1  P2  P3
2   1   1               4                  5   5   5               15
2   2   1               5                  5   5   4               14
2   2   2               6                  5   4   4               13
3   2   2               7                  4   4   4               12
3   3   2               8                  4   4   3               11

4.2.2 Pseudo code for the processors

We assume that the function Smith_Waterman(i,j) calculates the value of a cell when
provided with its index, and that the function Trace_back() finally gives the required
alignment. The variables number_of_cells_for_P1 and number_of_cells_for_P2 keep track of
the number of cells each processor has to calculate for each diagonal. The two functions
write_to_SM() and Read_from_SM() allow the processors to write data to and read data from
the shared memory, respectively. The processors have data dependencies on each other. Each
processor determines which cell's data is required and has to be brought in from the
shared memory by looking at its local table. If the required data is already available,
nothing needs to be done; otherwise the processor fetches the data, stores it in its local
memory, and then continues with the other computations.

Processor 1:

// the total table is of n x n size
Calculate the first two diagonals locally.
number_of_cells_for_P1 = 2;
number_of_cells_for_P2 = 1;



For (i = 2 to n-1)
{
    If (i mod 2 = 0)
    Then
    {
        number_of_cells_for_P1 = number_of_cells_for_P1 + 1;
    }
    If (i mod 3 = 0)
    Then
    {
        number_of_cells_for_P2 = number_of_cells_for_P2 + 1;
        write_to_SM(number_of_cells_for_P2);
    }
    row_index_for_P2 = i - number_of_cells_for_P1 - 1;
    column_index_for_P2 = i + number_of_cells_for_P1;
    write_to_SM(row_index_for_P2);
    write_to_SM(column_index_for_P2);
    iteration = 0;   // restart the cell counter for every diagonal
    While (iteration NOT EQUAL TO number_of_cells_for_P1)
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
}
// up to this part P1 calculates cells from the upper triangle of the table.
For (k = 0 to n-1)
{
    // decrement of number_of_cells_for_P2 and increment of number_of_cells_for_P1
    if (result required for calculation of the cell (n-1, (n-1)-number_of_cells_for_P1) not available)



}

}// this part is for the upper half of the table.

For (k = 0 to n-1)
{
    Read_from_SM(number_of_cells_for_P2);
    Read_from_SM(row_index_for_P2);
    Read_from_SM(column_index_for_P2);
    row_index = row_index_for_P2;
    column_index = column_index_for_P2;
    iteration = 0;
    While (iteration NOT EQUAL TO number_of_cells_for_P2)
    {
        result = Smith_Waterman(row_index, column_index);
        write_to_SM(result);
        row_index = row_index + 1;
        column_index = column_index - 1;
        iteration = iteration + 1;
    }
}
} // code completed for 2nd processor.
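The diagonal-by-diagonal scheme of section 4.2.1 can be simulated sequentially to check the dependency argument. The sketch below is not the implementation used in this work: `split_cells` and `wavefront_fill` are hypothetical names, the scoring values are assumptions, and the "processors" are simulated by slicing each diagonal inside a single thread instead of exchanging data through shared memory. The even split with the remainder given to the earlier processors is our reading of the incrementing pattern in the three-processor table (2 1 1, 2 2 1, 2 2 2, ...).

```python
def split_cells(total, workers):
    """Balanced share of `total` cells; earlier workers take the remainder."""
    base, extra = divmod(total, workers)
    return [base + 1 if w < extra else base for w in range(workers)]


def wavefront_fill(x, y, match=2, mismatch=-1, gap=1, workers=2):
    """Fill the Smith-Waterman matrix one anti-diagonal at a time."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    for d in range(2, rows + cols - 1):          # d = i + j over interior cells
        first = max(1, d - cols + 1)
        last = min(rows - 1, d - 1)
        cells = [(i, d - i) for i in range(first, last + 1)]
        start = 0
        for share in split_cells(len(cells), workers):
            # One simulated worker's chunk; it reads only diagonals d-1 and d-2.
            for i, j in cells[start:start + share]:
                s = match if x[i - 1] == y[j - 1] else mismatch
                F[i][j] = max(0,
                              F[i - 1][j - 1] + s,
                              F[i - 1][j] - gap,
                              F[i][j - 1] - gap)
            start += share
    return F


print([split_cells(t, 3) for t in range(4, 9)])
# [[2, 1, 1], [2, 2, 1], [2, 2, 2], [3, 2, 2], [3, 3, 2]]
print(max(max(row) for row in wavefront_fill("ACGT", "ACGT", workers=3)))  # 8
```

Because every cell of a diagonal reads only cells from the two preceding diagonals, the order in which the chunks are processed within a diagonal does not affect the result, which is what allows the chunks to be handed to different processors.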

4.3 Speed up obtained with Smith-Waterman algorithm:

[Plot: Speed Up Comparisons. x-axis: String Length (0 to 14); y-axis: Number of cycles (0 to 10000); series: P1, P2, P3]

Figure 4.3.1 Speed up in case of Smith-Waterman algorithm



The graph above (Figure 4.3.1) was obtained when we implemented the parallelization
scheme described in section 4.2 for the Smith-Waterman algorithm.



REFERENCES

[1] What is a soft processor?, Xilinx, 2006-01-. http://www.xilinx.com/ipcenter/processor central/microblaze/doc/mb faq.pdf.

[2] MicroBlaze Processor Reference Guide, Embedded Development Kit EDK 8.2i (UG081 v6.0), June 1, 2006.

[3] D. Mattson and M. Christensson. Evaluation of synthesizable CPU cores. Master's thesis, 2004.

[4] Standalone Board Support Package, EDK 8.2i, June 23, 2006.

[5] LMB BRAM Interface Controller (v1.00b), DS452, February 22, 2006.

[6] On-Chip Peripheral Bus V2.0 with OPB Arbiter (v1.10c), DS401, December 2, 2005.

[7] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, Third Edition. Morgan Kaufmann, May 2002.

[8] Microprocessor Debug Module (MDM) (v2.00a), DS450, February 22, 2006.

[9] Linux Operating System and Linux Distributions. http://linux.about.com.

[10] Linux for Real Time Requirements, SoftTech Solutions P Ltd. www.isofttech.com/downloads/Linux-RTOS.pdf.

[11] D. McCullough. uClinux for Linux programmers. Linux Journal, Volume 2004, Issue 123, page 7, ISSN 1075-3583. ACM Press, July 2004.

[12] MicroBlaze Processor Reference Guide, Xilinx. http://www.xilinx.com/ise/embedded/edk7 1docs/mb ref guide.pdf.

[13] MicroBlaze uClinux FAQ, John Williams, 2006-02-16. http://www.itee.uq.edu.au/-jwilliams/mblazeuclinux/Documentation/FAQ.html.

[14] D. Stepner, N. Rajan, and D. Hui. Embedded application design using a real-time OS. In Proceedings of the 36th ACM/IEEE Conference on Design Automation, pages 151-156. ACM Press, August 1999.

[15] J. Williams. What does PetaLogix mean for the MicroBlaze uClinux community, PetaLogix, 2006-02-16. http://www.petalogix.org/ news events/petalogix announce

[16] OS and Libraries Document Collection, Xilinx. http: //www.xilinx.com/ise/embedded /edk71 docs/oslibs rm.pdf.

[17] PowerPC 405 Processor Block Reference Guide, UG018 (v2.0), August 20, 2004.

[18] Hsien-Yu Liao, Meng-Lai Yin, and Yi Cheng. A Parallel Implementation of the Smith-Waterman Algorithm for Massive Sequences Searching. Electrical and Computer Engineering Department, California State Polytechnic University, Pomona.

[19] Farhan Ahmed. Pruning algorithm to reduce the search space of the Smith-Waterman algorithm. Department of Electrical and Computer Engineering, Lafayette College, Easton, PA.

[20] Brian Hang Wai Yang. A Parallel Implementation of Smith-Waterman Sequence Comparison Algorithm. December 6, 2002.



GLOSSARY

FPGA Field Programmable Gate Array

SMP Symmetric Multiprocessor System

CLB Configurable Logic Block

EDK Embedded Development Kit

IP Intellectual Property

SoC System on Chip

RISC Reduced Instruction Set Computer

OPB On-chip Peripheral Bus

LMB Local Memory Bus

FSL Fast Simplex Link

BRAM Block RAM

XCL Xilinx Cache Link

MDM Microprocessor Debug Module

MMU Memory Management Unit

TLB Translation Lookaside Buffer

BSP Board Support Package

ELF Executable and Linkable Format

MHS Microprocessor Hardware Specification

MSS Microprocessor Software Specification

OPB2PLB OPB to PLB bridge
