parallel computing using fpga - massey universitymjjohnso/notes/59735/seminars/07217579.doc · web...

159.735 – Studies in parallel and distributed systems

Parallel Computing Using FPGA

Submitted to: Dr. Martin Johnson

Submitted by: Sohaib Ahmed (07217579)

Introduction

Field-programmable gate arrays (FPGAs) are emerging in many areas of high-performance
computing, whether as tailor-made signal processors, embedded algorithm implementations,
systolic arrays, software accelerators or application-specific architectures. FPGAs are
flexible and reconfigurable enough to perform massively parallel operations explicitly
tailored to the problem at hand, and there are many paradigms for putting them to work in
a high-performance computing environment. A number of limitations keep FPGAs from
reaching the performance of application-specific integrated circuits (ASICs), but they
make it easy to change the hardware design while still outpacing software implementations
on general-purpose processors.

FPGA

The FPGA was invented in the mid-1980s by Ross Freeman, one of the founders of Xilinx
(www.xilinx.com). It is a semiconductor device made up of a number of logic elements,
interconnects and input/output blocks (IOBs), all of which are user-configurable, so it
can implement complex digital circuits. The IOBs form a ring around the outer edge of the
chip, and a rectangular array of logic blocks lies inside the IOB ring. Each IOB provides
selectable I/O access to one of the pins on the exterior of the FPGA package, as shown in
figure 1 (Buell et al., 2007).

Figure 1: FPGA internal structure based on the Xilinx architecture style (Buell et al., 2007)

A typical FPGA logic block consists of a four-input lookup table (LUT) and a flip-flop.
Current FPGAs also include DSP blocks with high-level functionality embedded in the
silicon, high-speed IOBs, embedded memories and processors. Their configurable logic
blocks (CLBs) are each made up of multiple slices, where a slice is a small set of
building blocks. A modern FPGA consists of tens of thousands of CLBs and a programmable
interconnection network arranged in a rectangular grid (Buell et al., 2007).
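
The logic block described above can be illustrated with a small software model. The
sketch below is only a toy, not a description of any vendor's hardware: the class
LogicBlock and the helper configure are hypothetical names introduced here, and the
model assumes that a four-input LUT is simply a 16-entry truth table whose output is
captured by a flip-flop on each clock.

```python
class LogicBlock:
    """Toy model of one logic block: a 4-input LUT feeding a D flip-flop."""

    def __init__(self, truth_table):
        if len(truth_table) != 16:
            raise ValueError("a 4-input LUT needs exactly 16 entries")
        # The 16 stored bits play the role of the block's configuration.
        self.truth_table = [int(bool(bit)) for bit in truth_table]
        self.q = 0  # registered (flip-flop) output

    def lut(self, a, b, c, d):
        """Combinational output: the four inputs select one stored bit."""
        index = (a << 3) | (b << 2) | (c << 1) | d
        return self.truth_table[index]

    def clock(self, a, b, c, d):
        """On a clock edge the flip-flop captures the LUT output."""
        self.q = self.lut(a, b, c, d)
        return self.q


def configure(function):
    """Build the 16-entry table for any Boolean function of four inputs."""
    table = []
    for index in range(16):
        a, b, c, d = (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1
        table.append(function(a, b, c, d))
    return table


# The same block becomes a different circuit when it is given a different table.
xor4 = LogicBlock(configure(lambda a, b, c, d: a ^ b ^ c ^ d))
and4 = LogicBlock(configure(lambda a, b, c, d: a & b & c & d))
```

Reconfiguring the block amounts to loading a different 16-bit table, which is the
sense in which the logic is user-configurable.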

An ASIC (application-specific integrated circuit) typically performs a single function
throughout the lifetime of the chip, whereas an FPGA can be reprogrammed to perform a new
function within microseconds. The function is described in source code written in a
hardware description language (HDL) such as Verilog or VHDL. A synthesis process
generates a technology-mapped netlist (figure 2). A map, place and route process then
fits the netlist to the actual FPGA architecture and produces a bitstream, which is used
to reconfigure the FPGA. Timing analysis, post-synthesis and functional simulation, and
verification methodologies can validate the map, place and route results (Buell et al.,
2007).

Figure 2: Typical design flow of FPGA (Buell et al, 2007)

FPGAs support the notion of reconfigurable computing and provide on-chip parallelism that
can be mapped directly from the dataflow characteristics of an application's parallel
algorithm. Higher performance has recently been achieved through a hybrid approach that
builds a complex system on a programmable chip. Examples are Xilinx devices such as the
Virtex-II Pro and Virtex-4. The most recent success of FPGAs in high-performance
computing came with the Tsubame cluster in Tokyo, where FPGAs increased performance by an
additional 25% (Wu-Feng & Manocha, 2007).

Reconfigurable Computing

High-performance reconfigurable computing (HPRC) systems are parallel computing systems
that contain multiple microprocessors and multiple FPGAs. The FPGAs act as co-processors,
deployed to execute the small portion of the application that takes most of the time:
under the 10-90 rule, the 10 percent of the code that takes 90 percent of the execution
time (Buell et al., 2007).
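
The effect of the 10-90 rule on whole-application performance can be estimated with
Amdahl's law. The sketch below is illustrative only: the function name overall_speedup
and the kernel speedups tried are assumptions made here, not figures from Buell et al.
It assumes that 90 percent of the run time is moved to the FPGA and the remaining 10
percent stays on the microprocessor unchanged.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: speedup of the whole application when only part is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


# Hypothetical kernel speedups for the 90 percent of run time given to the FPGA.
for kernel_speedup in (2, 10, 100, 1000):
    total = overall_speedup(0.9, kernel_speedup)
    print(f"kernel {kernel_speedup:>4}x faster -> application {total:.2f}x faster")
```

However fast the FPGA kernel becomes, the application as a whole cannot run more than
ten times faster while 10 percent of the time remains on the processor.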

Reconfigurable computing is also known as configurable computing or custom computing. It
can deliver impressive performance, as the following example shows. For a key size of 270
bits, a point multiplication can be computed in 0.36 ms by a reconfigurable computing
design implemented in an XC2V6000 FPGA at 66 MHz, whereas an optimized software
implementation takes 196.71 ms on a dual-Xeon computer at 2.6 GHz. The reconfigurable
design is therefore more than 540 times faster, even though its clock speed is almost 40
times slower than that of the Xeon processors (Todman et al., 2005).
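
The two ratios quoted above follow directly from the reported figures, as the short
calculation below shows. The cycle counts it derives are not given by Todman et al. and
are only a rough indication, since they count clock cycles of a single 2.6 GHz clock and
ignore the fact that the software ran on a dual-processor machine.

```python
# Figures quoted in the text (Todman et al., 2005).
fpga_time_ms = 0.36
cpu_time_ms = 196.71
fpga_clock_mhz = 66.0
cpu_clock_mhz = 2600.0

# Ratios stated in the text.
speedup = cpu_time_ms / fpga_time_ms          # "more than 540 times faster"
clock_ratio = cpu_clock_mhz / fpga_clock_mhz  # "almost 40 times slower"

# Derived here, not in the source: clock cycles spent on one point multiplication.
fpga_cycles = fpga_time_ms * 1e-3 * fpga_clock_mhz * 1e6
cpu_cycles = cpu_time_ms * 1e-3 * cpu_clock_mhz * 1e6

print(f"speedup:      {speedup:.1f}x")
print(f"clock ratio:  {clock_ratio:.1f}x")
print(f"FPGA cycles:  {fpga_cycles:.0f}")
print(f"CPU cycles:   {cpu_cycles:.0f}")
```

The gap in cycle counts shows that the advantage comes from doing far more work per
clock cycle, not from a faster clock.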

Progress in hardware system and programming software

As the technology has matured, many hardware systems have begun to resemble parallel
computers. Early systems were not designed for scalability, because they consisted of a
single board with one or more microprocessors connected to one or more FPGA devices. More
recently, the SRC-6 and SRC-7 have adopted a parallel architecture built around a
crossbar switch that can be stacked for further scalability. Traditional high-performance
computing vendors, specifically Silicon Graphics Inc. (SGI), Cray and Linux Networx, have
also incorporated FPGAs into their parallel architectures (Buell et al., 2007).

From the software perspective, developers can create the hardware kernel using hardware
description languages such as VHDL and Verilog. Other systems allow high-level languages,
including SRC Computers' Carte C and Carte Fortran, Impulse Accelerated Technologies'
Impulse C, Mitrion-C from Mitrionics, and Celoxica's Handel-C. Annapolis Micro Systems'
CoreFire, Starbridge Systems' Viva, Xilinx System Generator and DSPlogic's Reconfigurable
Computing Toolbox are high-level graphical programming development tools (El-Ghazawi et
al., 2008; Buell et al., 2007).

Reconfigurable logic and traditional processing

Reconfigurable logic consists of a programmable computational matrix together with a
programmable interconnection network used within that matrix. The basic differences
between reconfigurable logic and traditional processing are listed below (Bondalapati &
Prasanna, 2002); figure 3 compares the performance trends of FPGAs and microprocessors.

Spatial computation: the data is processed by distributing the computations spatially
rather than sequencing them temporally.

Configurable data path: the functionality of the computational units and the
interconnection network can be adapted at run time.

Distributed control: the computational units process data based on local configuration
rather than on an instruction broadcast to all the functional units.

Distributed resources: the resources required for computation, such as computational
units and memory, are distributed throughout the device.
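
The difference between temporal and spatial computation can be sketched by counting
steps for a dot product. This is a simplified model under assumptions made here, not a
description from Bondalapati and Prasanna: the temporal version reuses one
multiply-accumulate unit once per element, while the spatial version assumes one
multiplier per element followed by a tree of adders, with every unit in a level working
in the same step. The function names are hypothetical.

```python
def temporal_dot(xs, ws):
    """One multiply-accumulate unit reused step after step, as on a processor."""
    acc, steps = 0, 0
    for x, w in zip(xs, ws):
        acc += x * w
        steps += 1
    return acc, steps


def spatial_dot(xs, ws):
    """One multiplier per element, then an adder tree laid out in space."""
    level = [x * w for x, w in zip(xs, ws)]  # all products formed in one step
    steps = 1
    while len(level) > 1:
        # Each adder in this level of the tree works in the same step.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element passes through unchanged
        level = paired
        steps += 1
    return (level[0] if level else 0), steps


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ws = [1, 1, 1, 1, 1, 1, 1, 1]
print(temporal_dot(xs, ws))  # same result, one step per element
print(spatial_dot(xs, ws))   # same result, one step per tree level
```

The spatial version trades hardware for time: it needs as many multipliers as there
are elements, but the number of steps grows with the depth of the adder tree, not
with the length of the vectors.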

Figure 3 Performance trends of FPGAs and microprocessors (Smith et al, 2007)

Advantages of FPGAs

FPGAs offer a number of advantages, including speed and reduced energy and power
consumption. In reconfigurable computing the hardware circuit is optimised for the
application, so its power consumption tends to be much lower than that of a
general-purpose processor. Other advantages include reductions in size and component
count (and hence cost), improved time-to-market, and improved flexibility and
extensibility. These advantages are especially important for embedded applications
(Todman et al., 2005).

Limitations of FPGAs

Implementing reconfigurable computing poses a number of challenges. Todman et al. (2005)
describe three such challenges, each a matter of efficiency:

1. The structure of the reconfigurable fabric must be efficient

2. The interfaces between the fabric and the processor(s) must be efficient

3. The interface to memory must be efficient

There is another challenge in developing computer-aided design and compilation tools that
map an application onto a reconfigurable computing system. The difficulty lies in
deciding which parts of the application should be mapped to the fabric and which should
be mapped to the processor (Todman et al., 2005).
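
To make the mapping problem concrete, the sketch below shows one very simple heuristic.
It is not a method described by Todman et al. The function partition, the greedy rule
(favour the kernels that account for the most run time per unit of fabric area) and all
of the profile figures are hypothetical, chosen only for illustration.

```python
def partition(kernels, fabric_budget):
    """Greedy split of kernels between the fabric and the processor.

    kernels: list of (name, time_share, area) tuples, where time_share is the
    fraction of run time spent in the kernel and area is its estimated cost in
    fabric resources. All values are hypothetical profile data.
    """
    ranked = sorted(kernels, key=lambda k: k[1] / k[2], reverse=True)
    fabric, processor, used = [], [], 0
    for name, _share, area in ranked:
        if used + area <= fabric_budget:
            fabric.append(name)   # fits in the remaining fabric
            used += area
        else:
            processor.append(name)  # stays in software
    return fabric, processor


# Hypothetical application profile: (name, share of run time, fabric area).
profile = [
    ("point_multiply", 0.70, 60),
    ("hash", 0.20, 50),
    ("io_parse", 0.05, 30),
    ("logging", 0.05, 40),
]
on_fabric, on_processor = partition(profile, fabric_budget=100)
print("fabric:", on_fabric)
print("processor:", on_processor)
```

A real tool must also weigh communication cost between fabric and processor, which
this sketch ignores entirely.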

A few limitations of FPGAs in high-performance computing remain to be addressed. They
include the need for programming tools that address the overall architecture, and for
profiling and debugging tools for parallel and reconfigurable systems. Application
portability issues also need to be explored (Buell et al., 2007).

FPGAs power applications performance

FPGAs offer tremendous performance potential. They can support a number of different
parallel computations, implemented so that they execute in a single clock cycle. Because
FPGAs are reprogrammable, a single chip can provide this facility for a number of
applications. On-chip memory means that the co-processor logic's memory access bandwidth
is not restricted by the number of I/O pins on the device. The memory is also closely
coupled to the algorithm logic, so no external high-speed memory cache is needed, and
power-consuming cache access and coherency problems are avoided. The use of internal
memory also means that no additional I/O pins are required to increase the accessible
memory size, which simplifies design scaling (Altera, 2007).

With their defined structures and available resources, today's high-performance FPGAs
(e.g. the Altera Stratix III family) can serve as hardware for a wide range of
applications. As shown in Table 1 (Altera, 2007), practical examples achieve at least a
tenfold improvement in algorithm execution time compared with a single processor.

Table 1 FPGA Algorithm Acceleration (Altera, 2007)

FPGA application design techniques

Herbordt et al. (2007) describe application design techniques that enable substantial
FPGA acceleration. These methods are as follows:

1. Use an algorithm optimal for FPGAs

2. Use a computing mode appropriate for FPGAs

3. Use appropriate FPGA structures

4. Living with Amdahl’s law

5. Hide latency of independent functions

6. Use rate-matching to remove bottlenecks

7. Take advantage of FPGA-specific hardware

8. Use appropriate arithmetic precision

9. Use appropriate arithmetic mode

10. Minimize use of high-cost arithmetic operations

11. Create families of applications, not point solutions

12. Scale application for maximal use of FPGA hardware
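
As an illustration of the eighth technique in the list above, appropriate arithmetic
precision, the sketch below compares a dot product computed in fixed-point arithmetic
at two word lengths with the floating-point result. The helper names and the sample
values are assumptions made here and are not taken from Herbordt et al.; the point is
only that a narrow fixed-point format, which costs far less FPGA logic than floating
point, may already be accurate enough for a given application.

```python
def to_fixed(value, frac_bits):
    """Quantise a real number to a signed integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def fixed_dot(xs, ws, frac_bits):
    """Dot product using integer multiplies and adds only, as FPGA logic would."""
    acc = sum(to_fixed(x, frac_bits) * to_fixed(w, frac_bits) for x, w in zip(xs, ws))
    # Each product carries 2 * frac_bits fractional bits; scale back at the end.
    return acc / float(1 << (2 * frac_bits))


# Hypothetical sample data.
xs = [0.5, -0.25, 0.125, 0.75]
ws = [0.3, 0.6, -0.9, 0.1]
exact = sum(x * w for x, w in zip(xs, ws))

error_8 = abs(fixed_dot(xs, ws, 8) - exact)
error_16 = abs(fixed_dot(xs, ws, 16) - exact)
print(f"8 fractional bits:  error {error_8:.6f}")
print(f"16 fractional bits: error {error_16:.6f}")
```

Whether the smaller error of the wider format is worth the extra logic depends on the
application, which is why the precision is treated as a design choice.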

Need for highly parallel FPGAs

A common approach to increasing performance beyond that of a single unit is to build a
cluster of low-cost, off-the-shelf machines. Clusters have disadvantages, however, such
as low communication bandwidth and high cost. Most clusters are loosely coupled through
external peripherals, which makes them expensive, and they cannot be tailored to solve a
specific problem in parallel. Another approach is to use parallel computers that share
their resources, which allows very fast data throughput but imposes system design
constraints if the system is to remain scalable. COPACOBANA (Cost-Optimized Parallel Code
Breaker) utilizes up to 120 Xilinx Spartan-3 FPGAs connected through a parallel backplane
and interfaces with the outside world through a dedicated controller FPGA with an
Ethernet interface and a MicroBlaze soft-processor core running uClinux. It is used for
cryptanalysis, and further work is needed to apply its architecture and framework to
other scientific problems (Guneysu et al., 2007).

COPACOBANA Architecture

COPACOBANA consists of three basic blocks: one controller module, up to 20 FPGA modules,
and a backplane that provides the interconnection between the controller and the FPGA
modules (figure 4).

Figure 4 Architecture of COPACOBANA (Guneysu et al., 2007)

The FPGAs are directly connected to a common 64-bit data bus on board the FPGA module,
which is interfaced to the backplane data bus via transceivers with three-state outputs.
While disconnected from the backplane bus, the FPGAs can communicate locally via the
internal 64-bit bus on the DIMM module. The DIMM format allows a very compact component
layout, which is important for connecting the modules closely by a bus. Every FPGA module
is assigned a unique hardware address by a Generic Array Logic (GAL) device attached to
every DIMM socket. Hence all FPGA cores can have the same configuration and all FPGA
modules can have the same layout, so a module can easily be replaced in case of a defect
(Guneysu et al., 2007).
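
Because every FPGA carries the same configuration, work can be divided purely by
address. The sketch below shows one way this might look for an exhaustive key search.
It is a hypothetical illustration, not the scheme used by Guneysu et al.: the figure of
6 FPGAs per module is inferred here from the stated maxima (120 FPGAs on 20 modules),
and the function names and the equal, contiguous split of the key space are assumptions.

```python
MODULES = 20          # maximum number of FPGA modules, from the text
FPGAS_PER_MODULE = 6  # inferred: 120 FPGAs / 20 modules
TOTAL_FPGAS = MODULES * FPGAS_PER_MODULE


def fpga_address(flat_index):
    """Map a flat FPGA number to a (module address, chip on module) pair."""
    if not 0 <= flat_index < TOTAL_FPGAS:
        raise IndexError("FPGA index out of range")
    return divmod(flat_index, FPGAS_PER_MODULE)


def key_range(flat_index, key_bits):
    """Half-open range [start, end) of keys assigned to one FPGA."""
    if not 0 <= flat_index < TOTAL_FPGAS:
        raise IndexError("FPGA index out of range")
    total_keys = 1 << key_bits
    start = flat_index * total_keys // TOTAL_FPGAS
    end = (flat_index + 1) * total_keys // TOTAL_FPGAS
    return start, end


# Example with a deliberately tiny 16-bit key space.
for index in (0, 1, 119):
    module, chip = fpga_address(index)
    print(f"FPGA {index:>3} = module {module:>2}, chip {chip}: keys {key_range(index, 16)}")
```

Since each FPGA searches its own range independently, very little communication is
needed during the search itself.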

Conclusion

FPGAs offer a number of paradigms for speeding up calculations in a hardware/software
co-design environment. They are relatively cost-effective compared with ASICs and,
because they are flexible, their hardware resources can be used effectively. However,
much work remains to be done to achieve high-performance parallel computing with FPGAs.

References

Altera Corporation (2007). Accelerating high-performance computing with FPGAs. White
paper, October 2007.

Bondalapati, K., & Prasanna, V. (2002). Reconfigurable computing systems. Proceedings of
the IEEE, Vol. 90, No. 7, July 2002.

Buell, D., El-Ghazawi, T., Gaj, K., & Kindratenko, V. (2007). High-performance
reconfigurable computing. IEEE Computer Society, March 2007.

El-Ghazawi, T., El-Araby, E., Huang, M., Gaj, K., Kindratenko, V., & Buell, D. (2008).
The promise of high-performance reconfigurable computing. IEEE Computer Society, February
2008, pp. 69-76.

Guneysu, T., Paar, C., Pelzl, J., Pfeiffer, G., Schimmler, M., & Schleiffer, C. (2007).
Parallel computing with low-cost FPGAs: A framework for COPACOBANA.

Herbordt, M.C., VanCourt, T., Gu, Y., Sukhwani, B., Conti, A., Model, J., & DiSabello, D.
(2007). Achieving high performance with FPGA-based computing.

Smith, M.C., Vetter, J.S., & Alam, S.R. (2005). Scientific computing beyond CPUs: FPGA
implementations of common scientific kernels. MAPLD/187.

Todman, T.J., Constantinides, G.A., Wilton, S.J.E., Luk, W., & Cheung, P.Y.K. (2005).
Reconfigurable computing: architectures and design methods. IEE Proceedings: Computers
and Digital Techniques, Vol. 152, No. 2, March 2005.

Wu-Feng, & Manocha, D. (2007). High performance computing using accelerators. Parallel
Computing, 33 (2007), pp. 645-647.
