6 month technical report

Upload: satyajeet-mukherjee

Post on 07-Apr-2018






  • 8/3/2019 6 Month Technical Report








    Research Group:


    Submission Date:

    FRIDAY 22ND APRIL, 2005



    2 Literature Survey

    This section surveys the literature on the use of FPGAs alongside graphics hardware for video processing.

    Firstly, I will look at current architectures using GPUs (Graphics Processing Units) and FPGAs for video processing,

    followed by interconnect structures (including Network on Chip). FPGA architectural features which make it adaptable to such

    applications will then be shown, along with tools used for debugging and programming.

    The implementation of video processing applications is moving away from being predominantly software based to a more

    hardware based solution. This can be seen by comparison with the first video cards, where all image processing was done before the data reached

    the card. Today much more of the processing, such as lighting effects, is performed on the card. Graphics hardware has

    progressed from being display hardware to allowing user programmability, which has led to non-graphics applications.

    There have also been advances in research into interconnect structures: the bus is no longer seen as the only way to

    connect hardware cores together. Switch-boxes and networks are emerging, and it is likely that more topologies will be

    considered in the future, with some of today's new ideas becoming commonplace.

    2.1 Current Architectures

    This section is split into the use of GPUs and FPGAs individually for Video applications and discusses the possibility of

    interlinking these modules.

    GPU Architectures:

    Emerging Field: Research into the use of GPUs for general-purpose computation began on the Ikonas machine[1] which

    was developed in 1978. This was used for the Genesis Planet Sequence in Star Trek: The Wrath of Khan and has also been

    used in TV commercials and the Superman III video game. This highlighted early on the potential for graphics hardware

    to do much more than output successive images. GPUs were first used for non-graphics applications (www.GPGPU.org)

    by Lengyel et al in 1990 for robot motion planning, which marked the start of the era of GPGPU (General Purpose GPUs).

    Recent Developments: Trendall and Stewart in 2000[2] gave a summary of the possible calculations available on a GPU with

    a real-time calculation of refractive caustics. These capabilities have progressed further since, with more on-board memory




    and larger processing ability. More recently Moreland and Angel[3] implemented an FFT routine using GPUs with FFT,

    filtering and IFFT, performed on a 512x512 image, in under 1 second with a GeForce 5800. This has been made possible by

    graphics hardware manufacturers (Nvidia / ATI being the largest) allowing programmers more control over the GPU. This

    is facilitated through shader programs (released in 2001) written in languages such as Cg (C for Graphics) by Nvidia or

    DirectX by Microsoft, which have been enhanced since for greater user control (DirectX 9.0 now allows 128-bit precision

    i.e. 32-bit/RGBA pixel component[4]). There is still room for future progress: some hardware functionality is still hidden

    from the user, there is no pointer support and debugging is difficult. Recognition of non-graphical applications of GPUs was

    given at the Siggraph / Eurographics Hardware Workshop in San Diego (2003), showing its emergence as a recognised field.

    Nvidia: The intentions of the manufacturers are clear: in an article in July 2002[5], Nvidia's CEO announced teaming with

    AMD to develop nForce. This will be capable of handling multimedia tasks and will bring theatre-style DVD playback to the

    home computer. Previously the CPU offloaded tasks, such as video processing, onto plug-in cards which were later shrunk

    and encapsulated within the CPU. This minimisation was beneficial to the likes of Intel and less so to GPU producers such

    as Nvidia. The implementation of multimedia applications on graphics cards (and their importance to the customer) means

    the screen is now seen as the computer, rather than the network it sits on. In development of Microsoft's Xbox more money

    was given to Nvidia than to Intel; this trend is likely to continue, showing a power shift to graphics card manufacturers.

    Performance: The rate of increase of processing performance of GPUs has been 2.8 times/year since 1993[4] (compared to

    2 times/1.5 years for CPUs according to Moore's Law), a trend which is expected to continue until 2013. The GeForce 5900

    performs at 20 GFLOPS, which is equivalent to a 10 GHz Pentium processor[4]. This shows the potential of graphics hardware

    to out-perform CPUs, with a new generation being unveiled every 6 months. It is expected that TFLOP performance will be

    seen from graphics hardware in 2005. For example, Strzodka and Garbe's implementation of motion estimation[6]

    out-performs a P4 2GHz processor by 4 to 5 times (with a GeForce 5800 Ultra).
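    To put the two growth rates quoted above on a common footing, the following minimal sketch converts both to a per-year factor. It uses only the figures from the text (2.8 times/year for GPUs, 2 times per 1.5 years for CPUs); the function names are illustrative assumptions, and the extrapolation holds only if both trends continue unchanged.

```python
def annual_factor(multiple: float, period_years: float) -> float:
    """Convert 'multiple x every period_years' into a per-year growth factor."""
    return multiple ** (1.0 / period_years)


# Figures quoted in the text; everything derived from them is an extrapolation.
GPU_PER_YEAR = annual_factor(2.8, 1.0)  # 2.8 times/year [4]
CPU_PER_YEAR = annual_factor(2.0, 1.5)  # 2 times/1.5 years (Moore's Law figure)


def relative_gain(years: float) -> float:
    """Factor by which the GPU/CPU performance ratio grows after `years`."""
    return (GPU_PER_YEAR / CPU_PER_YEAR) ** years


if __name__ == "__main__":
    print(f"CPU growth per year: {CPU_PER_YEAR:.2f}x")
    print(f"GPU growth per year: {GPU_PER_YEAR:.2f}x")
    print(f"GPU/CPU ratio growth over 5 years: {relative_gain(5):.1f}x")
```

    On these figures the CPU trend is roughly 1.6 times/year, so the gap between the two would widen by a factor of about 1.8 each year.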

    Increases in performance also benefit filtering operations: in the previously mentioned FFT implementation[3] the potential

    for frequency domain filtering was shown. The number of computations required for filtering is reduced by performing

    them in the frequency domain, from an O(NM^2) problem to an O(NM) + FFT + IFFT (about O(MN(log(M) + log(N))))

    one. Moreland and Angel implemented clever tricks with indexing (dynamic programming), frequency compression and




    splitting a 2D problem into two 1D problems to achieve this speed up. With the rapidly increasing power of graphics cards it can be

    expected that the computation time will be reduced from 1 second to allow real-time processing. A final factor which aids

    this is that 32-bit precision calculations, which are vital for such processing, are now possible on GPUs.
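    The frequency domain filtering idea described above (transform, multiply point by point, transform back) can be sketched in one dimension as follows. This is an illustration only and not Moreland and Angel's GPU implementation: it uses a naive O(n^2) DFT in place of an FFT, assumes circular (wrap-around) convolution, and all function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform; an FFT would replace this in practice."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve_direct(signal, kernel):
    """Spatial-domain reference: every output sample visits every kernel tap."""
    n = len(signal)
    return [sum(signal[(i - j) % n] * kernel[j] for j in range(n))
            for i in range(n)]


def circular_convolve_freq(signal, kernel):
    """Frequency-domain version: DFT, pointwise multiply, inverse DFT."""
    spectrum = [s * k for s, k in zip(dft(signal), dft(kernel))]
    return [v.real for v in dft(spectrum, inverse=True)]


if __name__ == "__main__":
    # Kernel zero-padded to the signal length (a simple 3-tap smoothing filter).
    signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
    kernel = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(circular_convolve_direct(signal, kernel))
    print(circular_convolve_freq(signal, kernel))
```

    Both routes give the same samples up to rounding; the saving comes from the pointwise multiply being linear in the number of samples once the transforms are computed with a fast algorithm.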

    Cost: Another benefit of the GPU is cost: a top end graphics card (capable of near TFLOP performance) can be purchased

    for less than 300. Such a graphics card can perform equivalently to image generators costing 1000s in 1999[4]. This gives

    the opportunity for real-time video processing capabilities on a standard workstation.

    Parallelism: The architecture of graphics hardware is equivalent to that of stretch computers (designed for fast floating point

    arithmetic). They use stream processing: requiring a sequence of data in some order. This method exploits the dataflow in

    the organisation of the processing elements to reduce caching (CPUs are typically 60% cache[4].) Other features are the

    exploitation of spatial parallelism of images and that pixels are generally independent.
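    The stream processing model described above can be sketched as a kernel applied independently to each element of a pixel stream. This is a software illustration only, assuming RGB pixels as float triples; the names stream_map and to_grey and the luminance weights are assumptions chosen for the example, not taken from the cited work.

```python
from typing import Callable, Iterable, Iterator, Tuple

Pixel = Tuple[float, float, float]  # (R, G, B), assumed to lie in [0, 1]


def stream_map(kernel: Callable[[Pixel], Pixel],
               pixels: Iterable[Pixel]) -> Iterator[Pixel]:
    """Apply `kernel` to each pixel in order; no pixel depends on another,
    so the calls could equally run in parallel on separate processing elements."""
    for p in pixels:
        yield kernel(p)


def to_grey(p: Pixel) -> Pixel:
    """Example per-pixel kernel: weighted sum of the colour channels."""
    r, g, b = p
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return (y, y, y)


if __name__ == "__main__":
    frame = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    for out in stream_map(to_grey, frame):
        print(out)
```

    Because the kernel reads only its own input element, no intermediate results need to be cached between pixels, which is the property the dataflow organisation exploits.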

    Strzodka and Garbe in their paper on motion estimation / visualisation on graphics cards[6] show how a parallel computer

    application can be implemented in graphics hardware. They identify GPUs as not the best solution, but as having a better

    price-performance ratio than other hardware solutions. In such applications the data-stream controls the flow rather than

    instructions, facilitating the above cache benefit. Moreland and Angel[3] go further in branding the GPU as no longer a fixed

    pipeline architecture but a SIMD (Single Instruction-stream, Multiple Data-stream) parallel processor, which highlights the

    flexibility in the eyes of the programmer.

    How to program: There are 2 programming sources in graphics hardware programming[6]:

    Flowware: Assembly and Direction of dataflow

    Configware: Configuration of processing elements

    In comparison these 2 features are implemented together in an FPGA, however in graphics hardware these are explicitly

    different. Careful implementation of GPU code is necessary for platform (e.g. DX 9.0) and system (e.g. C++) independence.

    APIs also handle Flowware and Configware separately. This becomes important if considering programming FPGAs and

    GPUs simultaneously.




    FPGA Architectures:

    ASIC solutions for processing tasks are optimal in speed, power and size[7], but are expensive and inflexible. DSPs

    allow for more flexibility and can be reused for many applications, but are energy inefficient and can cause delays if

    not optimised per task. For these reasons it is often favourable to implement such applications in a reconfigurable unit.

    An Alternative: When deciding which hardware to use for a graphics sub-system there is a trade-off between operating speed

    and flexibility. To maximise its benefit, an FPGA implementation must: give more flexibility than custom graphics processors

    and be faster than a general purpose processor. The need for flexibility is justified as one may need to change an algorithm

    (e.g. compression standard) post manufacture. By utilising its re-programmability a small FPGA can appear as a large and

    efficient device.

    Example: Singh and Bellec in 1994[8] implemented three graphics applications on an FPGA, namely outline of circle, filled

    circle and a fast sphere algorithm. They found a RAM based FPGA was favourable due to the large storage requirements.

    The performance in drawing the circle was seen to be satisfactory out-performing a general purpose display processor

    (TMS34020) by a factor of 6 (achieving 16 million pixels/sec.) It was however worse in the application of fast sphere

    rendering at only 2627 spheres/sec vs. 14,300 from a custom hardware block. Improvements are however expected with new

    FPGAs, such as the Virtex 4, which have larger on-chip storage and more processing power. FPGAs today also have more

    built-in blocks to speed up operations such as multiplication (these will be considered later.)

    Sonic / Ultra-Sonic: These are two further ways to accelerate video processing, and they highlight the benefits of a

    re-configurable architecture. The hardware is systolic: 1 data item is clocked in and 1 out of each module on every clock cycle, which maintains

    a high throughput rate although latency can vary. The challenges involved, highlighted in [9, 10] are: Correct hardware and

    software partitioning, spatial and temporal resolution, hardware integration with software, keeping memory accesses low

    and real-time throughput. Sonic approaches these challenges with PIPEs (Plug in Processing Elements), each with 3 main

    components: an engine (for computations), a router (for routing, formatting and data access) and memory for storing video data.

    Typical applications are the removal of distortions introduced by watermarking an image[10], 2D filtering[11] and 2D

    convolution[12]. In the latter an implementation at 1/2 the clock rate of state-of-the-art technology was adequate suggesting




    a lower power solution. 2D filtering was split into two 1D filters and showed a 5.5 times speed up when using 1 PIPE, and

    greater speed up with more PIPEs.
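    The split of a 2D filter into two 1D filters mentioned above can be illustrated in software as follows. This is a minimal sketch, not the Sonic implementation: it assumes a separable kernel (the outer product of a vertical and a horizontal tap vector), odd tap counts, zero padding at the image edges, and hypothetical function names.

```python
def filter_rows(image, taps):
    """Apply a 1D filter along each row (zero padding, output same size)."""
    half = len(taps) // 2
    out = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, t in enumerate(taps):
                xx = x + k - half
                if 0 <= xx < width:
                    acc += t * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def separable_filter(image, taps_h, taps_v):
    """Two 1D passes: rows first, then columns (via a transpose)."""
    horizontal = filter_rows(image, taps_h)
    return transpose(filter_rows(transpose(horizontal), taps_v))


def full_2d_filter(image, taps_h, taps_v):
    """Reference: direct 2D filter with kernel[j][i] = taps_v[j] * taps_h[i]."""
    h, w = len(image), len(image[0])
    hh, hv = len(taps_h) // 2, len(taps_v) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j, tv in enumerate(taps_v):
                for i, th in enumerate(taps_h):
                    yy, xx = y + j - hv, x + i - hh
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += tv * th * image[yy][xx]
            out[y][x] = acc
    return out
```

    For a K x K kernel the direct form needs K*K multiplies per pixel whereas the two passes need 2K, and only the memory access pattern differs between the row pass and the column pass.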

    Bottlenecks: In contrast to memory, the FPGA's bottleneck isn't bus speed but configuration time. Configurations can be

    stored in a memory bank and copied into a local cache as required. Singh and Bellec[8] propose partitioning the FPGA

    into zones, each with good periphery access to the network and a different size. The capability of partial reconfiguration is

    important here: if a new task is required only a section of the FPGA need be reconfigured, leaving other sections untouched

    for later reuse. A practical example of this is seen with the above Sonic architecture: the router and engine are

    implemented on separate FPGA elements. If a different application required only a different memory access pattern (e.g. a 2 x 1D

    implementation of a 2D filter[11]), only the router need be reconfigured. This separation also provides abstraction. Another

    architecture where the bus bottleneck problem is reduced is seen in [7] where a daughter board is incorporated to perform

    D/A conversion. Sharing the data and configuration control path reduces the bottleneck; data loss occurs during the

    configuration phase, but this is seen as acceptable.

    Parallelism: Task-level parallelism is often ignored in designs; by proposing a design method focused on the system

    dataflow, Sedcole et al[12] hope to overcome this. Taking Sonic as an example: spatial parallelism is achieved by

    distributing parts of each frame across multiple hardware blocks (PIPEs in this case). Temporal parallelism can be exploited by

    distributing entire frames over these blocks. Further, these elements can also be grouped to perform bigger tasks; Singh

    and Bellec[8] similarly suggest grouping the zones of a partitioned FPGA in a design. The following are general issues that

    Sedcole et al propose should be considered in such large scale implementations:

    Design Complexity

    Modularisation - allocation / management of resources

    Connectivity / Communication between modules

    Power Minimisation (ties in with low memory accesses)

    Hardware or Software: The benefits of a software implementation are seen with irregular, data-dependent or floating point

    calculations. A hardware solution is beneficial for regular, parallel computation at high speeds[7]. Tasks must be split

    optimally between these 2 methods. Advancements in hardware mean that some of the problems with floating point calculations

    and the like have been overcome. Hardware can now perform equally or even better than software. The software designer needs




    a good software model of the hardware and the hardware designer requires good abstraction[11]. Hardware acceleration is

    particularly suited to video processing applications due to the parallelism and relatively simple calculations.

    In the Sonic example: PIPEs act as plug-ins, analogous to software plug-ins, which provides an easy path for Sonic into

    software. This overcomes a previous problem with re-configurable hardware that there were no good software models.

    Co-operation: Another way to look at the use of FPGAs in a graphics system is to extend the instruction set of a host-

    processor as virtual hardware. This idea is approached by Vermeulen et al[13] where a processor is mixed with some

    hardware to extend its instruction set. In general this hardware could be another processor or an ASIC component, again

    there are issues with finding ways to get the components to communicate and work together.

    The requirements of a reconfigurable implementation are therefore to be flexible, powerful, low cost, run-time / partial

    reconfigurable and to fit in well with software. The current FPGA limitations highlighted by papers [9, 14] are: configuration

    speed, debugging, number of gates, partial reconfiguration (Altera previously had no support) and PCI Bus Bottleneck.

    These considerations would be important if considering an FPGA implementation with other hardware, and some / all of

    these requirements may also apply to this mixed system.

    2.2 Interconnects

    Interconnects currently used for graphics card to processor communications will be discussed, followed by a look at some

    System-on-Chip (SOC) and Network-on-Chip (NOC) architectures.

    GPU view: GPU components are implemented in conjunction with CPUs acting as graphics sub-systems and working as

    co-processors with the CPUs. To do this a high speed interface is required as GPUs can process large amounts of data in

    parallel, doubling in required bandwidth every 2 years[15]. The AGP standard has progressed through 1x to the current 8x

    model (peaking at 2.1 GBytes/sec); however, with new GPUs working at higher bit precisions (128bit/RGBA in the GeForce

    6800 series) greater throughput was required. AGP uses parallel point to point interconnections with timing relative to the

    source. As the transfer speed increased, the capacitance and inductance on connectors needed to be reduced; this became

    restrictive past 8x. A new transfer method was required: serial differential point to point offers a high speed interconnect at




    The Network: The advantages of a network are that it has high performance / bandwidth, modularity, can handle concurrent

    communications and has better electrical properties than a bus or switch. As the size of chips increases global synchrony

    becomes infeasible as it takes a signal several clock cycles to travel across a chip. The NOC overcomes this problem by being

    a GALS (Globally-Asynchronous Locally-Synchronous) architecture.

    Dally and Towles[18] propose a mesh structured NOC interconnect as a general purpose interconnect structure. The advan-

    tage of being general purpose is that the frequency of use would be greater justifying more effort in design, the disadvantage

    is that one could do better by optimising for a certain application (though this may not be financially viable).

    In Dally and Towles' example they divide a chip into an array of 16 tiles, numbered 0 through 3 in each axis.

    Interconnections between tiles are made as a folded torus topology (i.e. in the order 0,2,3,1). This attempts to minimise the number

    of tiles a packet must pass through to reach its destination. Each tile therefore has a N,S,E,W connection and each has

    an input and output path to put data into the network or take it out respectively. The data, address and control signals are

    grouped and sent as a single flit. Area is dominated by buffers (6.6% of tile area in their example.) The limitations are

    opposite to computer networks: less constraint on number of interconnections, but more on buffer space. The network could

    be run at 4GB/s (at least twice the speed of tiles) to increase efficiency, however this would increase space required for buffers.
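    The folded torus ordering can be explored with a small sketch. It rests on my reading of the text, which is an assumption: tiles sit at physical positions 0 to 3 along an axis and are connected cyclically in the order 0,2,3,1. The function names are hypothetical, and the sketch considers one axis only.

```python
def link_lengths(ring_order):
    """Physical length (in tile pitches) of each link when the tiles at the
    listed positions are connected cyclically in that order."""
    n = len(ring_order)
    return [abs(ring_order[i] - ring_order[(i + 1) % n]) for i in range(n)]


def ring_hops(ring_order, src, dst):
    """Minimum number of links crossed between two tiles on the ring."""
    n = len(ring_order)
    d = abs(ring_order.index(src) - ring_order.index(dst))
    return min(d, n - d)


def mesh_hops(src, dst):
    """Hops along one axis of a plain mesh (no wrap-around link)."""
    return abs(src - dst)


FOLDED = [0, 2, 3, 1]  # connection order quoted in the text
PLAIN = [0, 1, 2, 3]   # unfolded ring, with a wrap-around link from 3 back to 0

if __name__ == "__main__":
    print("folded link lengths:", link_lengths(FOLDED))
    print("plain link lengths: ", link_lengths(PLAIN))
    pairs = [(s, d) for s in range(4) for d in range(4)]
    print("worst-case ring hops:", max(ring_hops(FOLDED, s, d) for s, d in pairs))
    print("worst-case mesh hops:", max(mesh_hops(s, d) for s, d in pairs))
```

    Under this reading, folding keeps every link to at most two tile pitches (the unfolded ring needs one link spanning three), while the ring still bounds the worst-case hop count per axis at two, compared with three for a mesh without wrap-around.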

    The disadvantage of the above example is that the tiles are not always going to be the same size and thus space would be

    wasted for smaller designs. Jantsch[19] proposes a solution which overcomes this, using a similar mesh structure. The main

    differences are that he no longer uses the torus topology but a standard connection to a tile's neighbours. He also provides a

    region wrapper, around a block considerably larger than others, which emulates the original network being present.

    Jantsch suggests 2 possibilities for the future: many NOC designs for many applications (expensive in design time) or 1 NOC

    design for many applications (inflexible). The latter would justify the design cost; however, one would need to decide on the

    correct architecture (mix of CPU, DSP etc.), Language (to configure NOC), Operating System (for when running) and design

    method for a set of tasks.




    There are other suggested interconnect methods: Hemani et al[20] suggest a honeycomb structure where each component

    connects to 6 others. Benini and De Micheli[21] introduce the SPIN (Scalable, Programmable, Interconnect Network) with

    a tree structure of routers and the nodes being the leaves of the tree. Dobkin et al[22] propose a similar mesh structure to

    Jantsch however include bit-serial long-range links. They use a non-return to zero method for the bit-serial connection and

    believe it to be best for NOC. This gives a snapshot of the NOC ideas, for which there are possibly as many topology ideas as

    for our standard computer networks today.

    2.3 FPGA Internal Structure

    FPGAs were first designed to be as programmable as possible, comprising configurable logic blocks and interconnects. As

    they have developed, manufacturers have introduced standard components into them, such as embedded memory blocks and

    in some of the latest Xilinx FPGAs Power PC Processors. There is potential for future work in this area in the development

    of new blocks, which could be placed into an FPGA, to improve functionality. In this section interesting modules will be

    considered, which could be used within FPGAs in the future.

    Multipliers: The motivation for use of embedded multipliers is that implementation of binary multiplication in FPGAs is

    often too large and slow. A possible solution is Programmable Array Modules (PAMs); these are fixed in size, however, and waste

    space if small bit length multiplications are required. Other solutions are trees or pre-processing methods, although these are

    difficult to generalise. A better solution is presented by Haynes and Cheung[23] to use reconfigurable multiplier blocks. They

    designed a Flexible Array Block (FAB) capable of multiplication of two 4 bit numbers, FABs combine to multiply numbers

    of lengths 4n and 4m. The 2 input numbers can be independently signed or unsigned. The speed of the FABs is comparable

    to that of non-configurable blocks at a cost of them being twice the size and having twice the number of interconnects. The

    latter isn't a problem due to the many metal layers in an FPGA; they are also smaller than a pure FPGA implementation.
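    The way small multiplier blocks combine into a larger multiplication can be sketched arithmetically. This is not the FAB circuit of Haynes and Cheung: it is an illustrative model that handles unsigned operands only (the signed modes are not modelled), and mul4x4, nibbles and tiled_multiply are hypothetical names.

```python
def mul4x4(a: int, b: int) -> int:
    """Stand-in for one 4-bit by 4-bit unsigned multiplier block."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def nibbles(value: int, count: int):
    """Split `value` into `count` 4-bit digits, least significant first."""
    return [(value >> (4 * i)) & 0xF for i in range(count)]


def tiled_multiply(a: int, b: int, n: int, m: int) -> int:
    """Multiply a 4n-bit by a 4m-bit unsigned number using n*m 4x4 blocks.

    Each block produces a partial product which is shifted to its weight
    (4 bits per digit position) and accumulated.
    """
    assert 0 <= a < (1 << (4 * n)) and 0 <= b < (1 << (4 * m))
    total = 0
    for i, a_digit in enumerate(nibbles(a, n)):
        for j, b_digit in enumerate(nibbles(b, m)):
            total += mul4x4(a_digit, b_digit) << (4 * (i + j))
    return total


if __name__ == "__main__":
    # 16-bit by 8-bit product built from 4 * 2 = 8 blocks.
    print(hex(tiled_multiply(0xBEEF, 0x7B, 4, 2)), hex(0xBEEF * 0x7B))
```

    The number of blocks grows as n*m, which is why an array sized for large operands wastes area when only short multiplications are needed.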

    A modification was proposed later by Haynes, Ferrari and Cheung[24] with a design based on the radix-4 overlapped

    multiple-bit scanning algorithm, which was more speed and area efficient. The MFAB (Modified FAB) multiplies 2 numbers of length

    8 together, or less with redundancy. The length must be greater than 7 to make a space saving on the FAB. The blocks are

    1/30th the size of the equivalent pure FPGA implementation and need only 40% usage to make them a worthwhile asset.




    Function Evaluation: A more specific block is one for functional evaluation such as that proposed by Sidahao, Constanti-

    nides and Cheung[25]. Previously a Lookup Table (LUT) approach was used however their architecture provides a lower

    area solution at the cost of execution speed.

    Memory: In Video applications storage of frames of data is important, therefore it is useful to be able to store this data

    in memory efficiently. Embedded Dual-Port RAMs, currently available in devices such as the Xilinx Virtex II Pro family,

    enable two accesses concurrently. It is likely this technology will progress further, perhaps to an Autonomous Memory Block

    (AMB), proposed by Melis, Cheung and Luk[26], which can create its own memory address.

    2.4 Debugging tools / coding

    The testing of a hardware module can be split into 2 areas: pre-load and post-load. A downside to FPGAs over ASICs is in

    pre-load: specifically back-annotated compared with initial testing. In ASIC design only wiring capacitance is missing from

    pre-synthesis tests, whereas in FPGA design the module placement is decided at synthesis, drastically affecting timing.

    Pre-load: The most widely known pre-load test environments are ModelSim (Xilinx) and Quartus (Altera.) COMPASS

    (Avant) is an automated design tool, creating a level of abstraction for the user. The benefits are highlighted by Singh and

    Bellec in 1994[8]: the user can enter a design as a state machine or dataflow and therefore implement at the system rather

    than lower (e.g. VHDL) level.

    Post-load: The issue of post-load testing is currently approached by using part of the FPGA space for a debugging environ-

    ment, invoked during on-board test. A previously popular test strategy was Bed of Nails where pins are connected directly

    to the chip and a logic analyser. Due to the large pin count on today's devices this is impractical; even if possible it would

    significantly alter the timing. Following this was Boundary Scanning by JTAG (Joint Test Action Group); however, this only

    probed external signals. Better still is Xilinx Chipscope: an embedded black box which resides inside the FPGA as a probe

    unit. The downside is that it uses the slow JTAG interface to communicate readings.

    An example of an on-chip debugging environment, which uses the faster interface (PCI Bus), is the SONICmole[27] used

    with UltraSonic[14]. This takes up only 4% of a Virtex XCV1000 chip (512 slices). Its function is to act as a logic analyser,




    viewing and driving signals, whilst being as small as possible and having a good software interface. This uses the PIPE mem-

    ory to store signal captures. It has been implemented at the UltraSonic maximum frequency of 66MHz[27] and is portable to

    other reconfigurable systems.

    Coding: Firstly, coding for FPGAs: these can be programmed through well known languages such as VHDL and Verilog

    at the lower level, and MATLAB (System Generator) and, more recently, SystemC (see systemc.org) and Handel-C at the higher

    level. The focus of this sub-section will be on programming the GPU, as FPGA coding is widely understood.

    Cg Language: Cg[28] was developed by Nvidia for developers to program GPUs in a C-like manner. The features of C

    beneficial for an equivalent GPU programming tool are: performance, portability, generality and user control over machine

    level operations. The main difference to C is the stream processing model for parallelism in GPUs.

    Cg supports high level programming, but is linkable with assembly code for optimised units, giving the programmer

    more control. Cg supports user defined compound types (e.g. arrays and structures) which are useful for non-graphics

    applications. It also allows vectors of floating point numbers up to size 4 (e.g. RGBA), along with matrices up to size 4x4

    (for operations on the vectors). A downside is that Cg doesn't support pointers or recursive calls (as there is no stack structure).

    Pointers may be implemented at a later date.

    Nvidia separates programming of the 2 GPU processors (vertex and fragment) to avoid branching and loop problems, and so

    they are accessed independently. The downside is that optimisations across this boundary aren't possible; a solution is to use

    a meta-programming system to merge this boundary. Nvidia introduces the concept of profiling for handling differences in

    generations of GPUs. Each GPU era has a profile of what it is capable of implementing. There is also a profile level for all

    GPUs necessary for portable code.

    In development of the PlayStation 2 Sony supported a full C implementation with the on-chip GPUs combined with off chip

    resources. This shows the trend towards more user programmability. When developing Cg Nvidia worked closely with other

    companies (such as Microsoft) who were developing similar tools. An aim of Cg was to support non-shading uses of the GPU,

    which is of particular interest. (Fernando and Kilgard[29] provide a tutorial on using Cg to program graphics hardware.) For




    the non-programmable parts of a GPU CgFX[28] handles the configuration settings and parameters.

    2.5 Literature Survey Conclusions

    In summary, some current architectural uses for GPUs and FPGAs have been considered, including an FFT routine on the

    GPU and some graphics routines on an FPGA. The Sonic architecture was examined, particularly how it is used as a hardware

    accelerator for graphics applications. This was followed by interconnect structures looking at buses, switches and networks:

    specifically their advantages and disadvantages. The internal structure of an FPGA was then considered investigating embed-

    ded components that could be useful in video applications such as multipliers, memory and function solvers. Finally tools

    used in pre and post device function load and in device programming were analysed, specifically the Cg language.

    3 Research Questions

    The interconnect between cores in a design is a common bottleneck. It is important to have a good model of the interconnect,

    to either eliminate or reduce this delay. There have been many architectures proposed / developed for module interconnects

    (groupable as bus, switch and network) discussed in the Literature Survey. This leads to the first research question: investigate

    suitable interconnect architectures for mixed core hardware blocks and find adequate ways to model interconnect behaviour.

    A model is important to decide the best interconnect for a task without the need for full implementation.

    The potential of Graphics Hardware has long been exploited in the gaming industry, focusing on its high pixel throughput

    and fast processing. It has been shown to be particularly efficient where there is no inter-dependence between pixels. Pro-

    gramming this hardware was historically difficult: One could use assembly level language in which it takes a long time to

    prototype. The alternative is an API, such as OpenGL, which limits a programmer's choice to a set of functions. In 2003

    Nvidia produced a language called Cg, allowing high level programming without losing the control of assembly level coding.

    Following this, non-graphical applications were explored, for example Moreland and Angel's FFT algorithm[3].

    The adaptability of graphics hardware, to non-standard tasks, leads to the second research question: to further investigate

    graphics hardware used in a mixed core architecture. This takes advantage of the price-performance ratio of graphics hard-

    ware, whilst maintaining current benefits of using FPGA / Processor cores. FPGA cores allow for high levels of parallelism




    and flexibility as many designs can be implemented on the same hardware. Processors can be optimised for certain types of

    instructions and run many permutations of them without costly reprogramming associated with FPGAs.

    When one wishes to resize an image there are two possibilities for determining new values for pixels: filtering or interpo-

    lation. Filtering could be a FIR (Finite Impulse Response) Low-Pass filter with complexity variation in the number of taps.

    Interpolation could be a Bi-linear, Bi-Cubic or a spline method, each of varying complexity. The final research question

    is: investigate the perceived quality versus computational complexity of the 2 methods. Theory suggests FIR filtering, of a

    long enough tap length, should produce a smoother result: this may not however be perceptively the best or could be too

    computationally complex.
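    As a point of reference for the interpolation side of this comparison, the following is a minimal sketch of bi-linear resizing. It shows the low-complexity end of the range only (4 source pixels per output pixel); the FIR alternative is not shown. The function names and the corner-aligned sampling grid are assumptions made for the example.

```python
def bilinear_sample(image, x, y):
    """Interpolate `image` (list of rows) at fractional position (x, y),
    where 0 <= x <= width - 1 and 0 <= y <= height - 1."""
    h, w = len(image), len(image[0])
    x0 = min(int(x), w - 1)
    y0 = min(int(y), h - 1)
    x1 = min(x0 + 1, w - 1)  # clamp at the right/bottom edge
    y1 = min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def resize_bilinear(image, new_w, new_h):
    """Resize so that the corner pixels of input and output coincide."""
    h, w = len(image), len(image[0])
    sx = (w - 1) / (new_w - 1) if new_w > 1 else 0.0
    sy = (h - 1) / (new_h - 1) if new_h > 1 else 0.0
    return [[bilinear_sample(image, i * sx, j * sy) for i in range(new_w)]
            for j in range(new_h)]


if __name__ == "__main__":
    small = [[0, 10], [20, 30]]
    for row in resize_bilinear(small, 3, 3):
        print(row)
```

    Each output pixel costs a fixed handful of multiplies regardless of the scale factor, whereas the cost of an FIR filter grows with its tap length, which is the complexity side of the trade-off posed above.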

    4 Interconnect Model

    My first task was to implement a high level model of the ARM AMBA Bus. This would model its performance for varying

    numbers of masters and slaves and be cycle accurate. SystemC, a relatively new hardware modelling library, was used for this.

    The motivation came from a paper by Vermeulen and Catthoor[13] where an ARM7 processor was used, in addition to cus-

    tom hardware, to allow for up to 10% post manufacture functional modification.

    A multiply function, for a communicating processor and memory, was modelled: Two values, to be multiplied, are loaded in

    consecutive cycles, multiplied then returned to memory using an interconnect. This consists of data plus control signals as a

    simple bus model. This demonstrates how to display and debug the results of a hardware model. SystemC is used to create

    a VCD (Value Change Dump) file, which can be displayed in a waveform viewer such as ModelSim's. The results are seen

    Figure 1. Waveform for multiplier implementation



    Figure 3. Test output showing reset and bus request / grant procedure

    A number of meetings were held with Ray Cheung from Computing (currently modelling processors) to discuss possible interoperability between an AMBA bus model and a processor model. A fully flexible bus and processor model was suggested, which could later be extended to include other hardware blocks such as FPGAs.

    Following this, my attention turned to the design of such a bus model. A physical interpretation of how the AMBA AHB bus blocks fit together can be seen in figure 2. Missing from figure 2 are the global clock and reset signals, which are routed to each block. HWDATA and HRDATA are the write and read data respectively; the H prefix denotes the AHB bus as opposed to the ASB. The control signals are requests from masters and split (resume transfer) signals from slaves. The complexity in coding the multiplexer blocks lay in making them general. Constants were used in place of literal numbers for the data and address signal widths throughout. The master multiplexer uses a delayed master select signal from the arbiter to pipeline the address and data buses, so that one master can use the data bus whilst another controls the address bus.

    For the decoder, an assumption was made about how a slave is chosen. The number of address bits used to decide which slave to select is log2(number of slaves), rounded up. These bits are taken from the MSBs of the address. The value of the resulting binary number indicates which slave to use, i.e. 01 would be slave 1.
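    The decoder rule can be sketched as follows. This is a minimal Python illustration rather than the SystemC model itself; the function names, the 32-bit address width in the example and the choice to raise an error for an unmapped slave index are assumptions.

```python
def select_bits(num_slaves):
    """Number of address MSBs needed to choose between num_slaves slaves."""
    # (n - 1).bit_length() equals log2(n) rounded up for n >= 1, using only
    # integer arithmetic, so there is no floating-point rounding to worry about.
    return (num_slaves - 1).bit_length()


def decode_slave(address, addr_width, num_slaves):
    """Return the index of the slave selected by the MSBs of address."""
    bits = select_bits(num_slaves)
    index = address >> (addr_width - bits) if bits else 0
    if index >= num_slaves:
        # Assumption: an index with no slave attached is treated as an error.
        raise ValueError("address does not map to a slave")
    return index
```

    With three slaves on a 32-bit address bus, two MSBs are used, so decode_slave(0x40000000, 32, 3) returns 1, matching the case above where 01 selects slave 1.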


    A test procedure was produced which loads stimulus from a text file, with the result viewed as a waveform, as with the multiplier example. The file consists of lines which are either a variable and value pair, or the word tick followed by a number of cycles to run for. Initially, simple tests were carried out to check for correct reset behaviour and that the two multiplexers worked (with a setup of 1 master and 2 slaves). An example of a test output is shown in figure 3.
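    A stimulus file of this kind could be read and replayed along the following lines. The exact file syntax is not given here, so the whitespace-separated layout, the example signal values and the function names are assumptions; the sketch records signal values per cycle instead of driving a SystemC model.

```python
def parse_stimulus(text):
    """Turn stimulus text into ('set', name, value) and ('tick', cycles) events."""
    events = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        key, value = line.split()
        if key.lower() == "tick":
            events.append(("tick", int(value)))
        else:
            # int(value, 0) accepts decimal or 0x-prefixed hexadecimal values
            events.append(("set", key, int(value, 0)))
    return events


def replay(events):
    """Replay events, returning the signal values present at each clock cycle."""
    signals, trace = {}, []
    for event in events:
        if event[0] == "set":
            signals[event[1]] = event[2]
        else:
            for _ in range(event[1]):
                trace.append(dict(signals))
    return trace


EXAMPLE = """
HRESETn 0
tick 2
HRESETn 1
HBUSREQ 1
tick 1
"""
```

    Replaying EXAMPLE yields three cycles: two with reset asserted, then one with the bus request raised.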

    In the example, as the HSEL signals change at the bottom of the waveform, the two read data signals are multiplexed. On reset, all outputs are set to zero irrespective of the inputs, as would be expected. When the master requests the bus, the arbiter waits until HREADY goes high before granting access through HGRANT. In the case of more than one master, the HMASTER signal changes immediately (with HBUSREQ) to the correct master, allowing for multiplexing and letting the slaves know which master is communicating.
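    The grant behaviour described above can be mimicked with a small cycle-stepped sketch. The class name, the fixed-priority choice between competing masters and the list-of-booleans interface are assumptions; HMASTER is not modelled, and the sketch only shows that a grant changes when HREADY is high.

```python
class Arbiter:
    """Toy bus arbiter: the grant is updated only on cycles where HREADY is high."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        self.granted = None  # index of the master holding HGRANT, or None

    def clock(self, reset, hbusreq, hready):
        """Advance one cycle and return the HGRANT line for each master."""
        if reset:
            self.granted = None          # all outputs cleared on reset
        elif hready:
            requesting = [m for m in range(self.num_masters) if hbusreq[m]]
            # Assumption: the lowest-numbered requesting master wins.
            self.granted = requesting[0] if requesting else None
        return [m == self.granted for m in range(self.num_masters)]
```

    Calling clock with a request while hready is False leaves the grant unchanged; the grant moves to the requesting master on the first cycle where hready is True.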

    The model was further tested with 2 masters and 2 slaves, a common configuration, and the correct, cycle-accurate results were seen. Within this, the sending of packets consisting of single and multiple data items was experimented with, along with split transfers and error responses from slaves. The waveforms for these become large and complicated very quickly, but are of a similar form to figure 3.

    5 Primary Colour Correction

    Primary Colour Correction is, as with the FFT on a GPU algorithm discussed above, a non-graphical application. I will now discuss my optimised version of it. The algorithm performs three main transformations per pixel: Input Correction, Histogram Equalisation and Colour Balancing (see figure 4).

    Input Correction and Colour Balancing require the RGB signal to be converted to HSL (Hue, Saturation and Luminance) space. In my optimisations, I converted only part of the way, to a luma and chroma representation (YCbCr), and implemented the algorithm at this level, which showed a considerable speed-up.
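    For reference, a conversion between RGB and YCbCr can be sketched as below. The coefficients are the ITU-R BT.601 luma weights, which are an assumption here; the exact matrix used in the optimised implementation is not stated in this report. Inputs are taken to be in the range [0, 1], giving Cb and Cr in [-0.5, 0.5].

```python
# BT.601 luma weights (an assumption; the report does not give its matrix)
KR, KG, KB = 0.299, 0.587, 0.114


def rgb_to_ycbcr(r, g, b):
    """Convert RGB in [0, 1] to luma Y in [0, 1] and chroma Cb, Cr in [-0.5, 0.5]."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2 * (1 - KB))   # scaled blue-difference chroma
    cr = (r - y) / (2 * (1 - KR))   # scaled red-difference chroma
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Invert rgb_to_ycbcr."""
    r = y + 2 * (1 - KR) * cr
    b = y + 2 * (1 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return r, g, b
```

    The conversion is a single matrix multiply per pixel with no conditional statements, which is one reason a representation of this kind suits the vector units of a GPU.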

    Other key optimisations were to perform calculations in vector space and to remove, where possible, conditional statements, which are inefficient on GPUs. The lessons learnt are summarised below, after figure 4:






    Figure 4. Primary Colour Correction Block Diagram (surviving labels: a 2D Texture input and three "Fix to range [0,1]" stages)

    Perform calculations in vectors and matrices

    Use in-built functions to replace complex maths and conditional statements

    Pre-compute uniform inputs where possible, avoiding repetition for each pixel

    Consider what is happening at the assembly code level, and decipher the code if necessary

    Do not convert between colour spaces unless explicitly required
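    The second lesson can be illustrated with the "fix to range [0,1]" step of figure 4. In the sketch below Python stands in for shader code, so it shows only the shape of the rewrite and not any speed-up; the function names are hypothetical.

```python
def clamp_with_branches(x):
    """Clamp x to [0, 1] using conditional statements."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x


def clamp_with_builtins(x):
    """Clamp x to [0, 1] using only min and max, with no branches in the source."""
    return min(max(x, 0.0), 1.0)


def saturate(pixel):
    """Clamp every component of a pixel vector in one expression."""
    return tuple(min(max(c, 0.0), 1.0) for c in pixel)
```

    Both clamps return the same values; the second form corresponds to the min, max or clamp style of in-built function that shader languages provide and that maps to straight-line code.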

    Table 1 shows the performance results for the initial and optimised designs on various generations of GPU. There is a large variation in the throughput rates of the devices, although there are only 2-3 years between them. For more information on the optimisation of the primary colour correction algorithm see [30].

    Architecture    Throughput (Final), MP/s    Throughput (Initial), MP/s
    6800 Ultra      116.36                      44.14
    6800 GT         101.82                      38.62
    6600            72.73                       27.59
    5700 Ultra      12.67                       2.12
    5200 Ultra      7.08                        1.24

    Table 1: Performance Comparison on GeForce architectures for the Optimised (Final) and Initial Designs
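    The gain from the optimisations can be read off table 1 by dividing the final throughput by the initial throughput for each device. The short calculation below does this; the variable and function names are illustrative, and the figures are copied from the table.

```python
# (architecture, final MP/s, initial MP/s), copied from table 1
TABLE_1 = [
    ("6800 Ultra", 116.36, 44.14),
    ("6800 GT", 101.82, 38.62),
    ("6600", 72.73, 27.59),
    ("5700 Ultra", 12.67, 2.12),
    ("5200 Ultra", 7.08, 1.24),
]


def speedups(rows):
    """Return the final / initial throughput ratio for each architecture."""
    return {name: final / initial for name, final, initial in rows}


if __name__ == "__main__":
    for name, ratio in speedups(TABLE_1).items():
        print(f"{name}: {ratio:.2f}x")
```

    On these figures the 6800 and 6600 cards each gain a factor of about 2.6, while the 5700 Ultra and 5200 Ultra gain factors of about 6.0 and 5.7 respectively.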


    For efficient optimisation of an algorithm it is important to understand the performance penalty of each section. A detailed breakdown of the above primary colour correction algorithm, in terms of delay, was carried out. Compare and clamp operations were among the performance bottlenecks in the implementation. The Colour Balancing function, which includes many of each, was seen to be the slowest of the three main blocks. The conversion between colour spaces was seen to have a large delay penalty, due mainly to the conversion from RGB to XYL space. In Histogram Equalisation, the pow function was also seen to add greatly to the delay, accounting for almost 50% of it (0.00089 s/MP).

    The register usage, although minimal, was seen to be larger in calculations than in compare operations. This is due to the large number of min-terms in the calculations and to less intermediate storage being required in compares. In this case register usage was not a limiting factor for the implementation; however, it may be for other algorithms. The breakdown of delay for each block can be seen in table 2; for more detail see [31].

    Block                Cycles    Rregs    Hregs    Instructions    Throughput (MP/s)    Delay (s/MP)
    Input Correction     16        3        1        35              350.00               0.00286
    Histogram Correct    12        2        1        25              466.67               0.00214
    Colour Balancing     23        3        1        56              243.47               0.00411

    Table 2: Effect on Performance of Each Block of the Primary Colour Correction Algorithm
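    The columns of table 2 are related by simple arithmetic, which the check below makes explicit: the delay is the reciprocal of the throughput, and the product of cycles and throughput is close to 5600 in every row. The second relation is an observation drawn from the tabulated figures, not a statement made in this report, and the names used are illustrative.

```python
# (block, cycles, throughput MP/s, delay s/MP), copied from table 2
TABLE_2 = [
    ("Input Correction", 16, 350.00, 0.00286),
    ("Histogram Correct", 12, 466.67, 0.00214),
    ("Colour Balancing", 23, 243.47, 0.00411),
]


def delay_s_per_mp(throughput_mp_per_s):
    """Delay per megapixel is the reciprocal of the throughput."""
    return 1.0 / throughput_mp_per_s


def cycle_throughput_product(cycles, throughput_mp_per_s):
    """Cycles per pixel times throughput; roughly constant across table 2."""
    return cycles * throughput_mp_per_s


if __name__ == "__main__":
    for name, cycles, mp_per_s, listed in TABLE_2:
        print(name, round(delay_s_per_mp(mp_per_s), 5),
              round(cycle_throughput_product(cycles, mp_per_s), 1))
```

    A roughly constant product suggests that, for a given device, throughput falls in inverse proportion to the cycle count of the block, so removing cycles from Colour Balancing gives the largest return.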

    6 Plan of Work Leading to Transfer

    The next step in the modelling of interconnects is to consider a general bus structure. This can consist of multiple masters and slaves, varying methods of arbitration, different clock speeds, shared or individual read and write lines, and so on. This requires a more abstract implementation, which the SystemC library allows for. Models of cross-bar switches and of a network-on-chip structure are other possibilities for future work on interconnect modelling.

    The next stage on the question of graphics hardware is to implement the primary colour correction algorithm on a Pentium processor and on an FPGA. An optimised implementation in MATLAB, running on a Pentium 4, completed the computation for a 512x512 image in 2.3 seconds. This equates to approximately 0.1 MP/s, which is much slower than the graphics card. An implementation in C / C++ is expected to perform better, but still to be 1-2 orders of magnitude slower than the graphics card. The FPGA implementation is expected to out-perform both, if a large enough device is used. When limited to a device of equivalent cost to a graphics card, the FPGA is expected to perform worse than the graphics card but better than the CPU.
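    The MATLAB figure quoted above follows from the image size and run time, as the short calculation below shows. It assumes that one megapixel is 10^6 pixels and uses the optimised 6800 Ultra throughput from table 1 for comparison; the names are illustrative.

```python
PIXELS_PER_MP = 1_000_000  # assumption: 1 MP = 10^6 pixels


def throughput_mp_per_s(width, height, seconds):
    """Throughput in megapixels per second for one image processed in seconds."""
    return width * height / PIXELS_PER_MP / seconds


matlab_mp_per_s = throughput_mp_per_s(512, 512, 2.3)
gpu_mp_per_s = 116.36  # optimised 6800 Ultra figure from table 1
ratio = gpu_mp_per_s / matlab_mp_per_s

if __name__ == "__main__":
    print(f"MATLAB: {matlab_mp_per_s:.3f} MP/s, GPU advantage: {ratio:.0f}x")
```

    This gives roughly 0.114 MP/s for MATLAB, consistent with the 0.1 MP/s quoted, and places the optimised 6800 Ultra about three orders of magnitude ahead.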

    A comparison of the visual differences between filtering and interpolation will be performed, along with the computation time required by each. The algorithms will be tried on the graphics hardware and any limitations of the interconnect, either on or off board, noted. Implementations may also be prototyped on an FPGA device and a Pentium 4 processor for further comparison of computational capabilities. The literature survey will also be updated to include documents relating to interpolation and filtering algorithms, particularly in hardware.

    An updated Gantt chart of my work intentions, up to transfer, can be found in Appendix 1 at the rear of this document. This relates to the aims above.

    7 Conclusion

    A literature survey of work related to my chosen research area has been presented, highlighting possibilities for work in the areas of interconnects and the utilisation of graphics hardware in a mixed core system. My three main research questions were then explained: investigating interconnects and their modelling; the use of graphics hardware for video processing; and the comparison of FIR filtering and interpolation. The work covered to date on interconnect modelling and on the implementation of Primary Colour Correction on a graphics card was summarised, followed by a plan of my future work including a Gantt chart.



    [1] J.N. England, A system for interactive modelling of physical curved surface objects, SIGGRAPH 78, 1978, pp. 336-340

    [2] Chris Trendall and A. James Stewart, General calculations using graphics hardware, with applications to interactive caustics, 2000

    [3] Kenneth Moreland and Edward Angel, The FFT on a GPU, in The Eurographics Association, 2003, pp. 112-136

    [4] Michael Macedonia, The GPU Enters Computing's Mainstream, Entertainment Computing, pp. 106-108, 2003

    [5] Jeffrey M. O'Brien, Nvidia, www.wired.com Issue 10.07, 2002

    [6] Robert Strzodka and Christoph Garbe, Real-Time Motion Estimation and Visualisation on Graphics Cards, University of Duisburg,

    [7] Wayne Luk, P. Andreou, A. Derbyshire, F. Dupont-De-Dinechin, J. Rice, N. Shirazi and D. Siganos, A Reconfigurable Engine for Real-Time Video Processing, Lecture Notes in Computer Science, 1998

    [8] Satnam Singh and Pierre Bellec, Virtual Hardware for Graphics Applications Using FPGAs, FCCM 1994

    [9] Simon Haynes, John Stone, Peter Cheung and Wayne Luk, Video Image Processing with the Sonic Architecture, Computer, pp. 50-57, 2000

    [10] Wim Melis, Peter Cheung and Wayne Luk, Image Registration of Real-Time Broadcast Video Using the UltraSONIC Reconfigurable Computer, FPL, pp. 1148-1151, 2002

    [11] Simon Haynes, John Stone, Peter Cheung and Wayne Luk, SONIC - A Plug in Architecture for Video Processing, FPGA, pp. 21-30,

    [12] Pete Sedcole, Peter Cheung, G.A. Constantinides and Wayne Luk, A Reconfigurable Platform for Real-Time Embedded Video Image Processing, FPGA, 2003

    [13] Fredrick Vermeulen and Francky Catthoor, Power-Efficient Flexible Processor Architecture for Embedded Applications, IEEE Transactions on VLSI Systems Vol 11, pp. 376-385, 2003

    [14] Simon Haynes, Sonic - A reconfigurable image processing architecture, Poster - IEEE Symposium on FPGAs for Custom Computing Machines, 1999

    [15] Intel Developers Network for PCI Express Architecture, Why PCI Express Architectures for Graphics, www.express-lane.org,



    [16] AMBA Specification (Rev 2.0), ARM, 1999

    [17] Jiang Xu, Wayne Wolf, Joerg Henkel, Srimat Chakradhar and Tiehan Lv, A case study in Networks-on-Chip Design for Embedded Video, Automation and Test European Conference, 2004

    [18] William J. Dally and Brian Towles, Route Packets, Not Wires: On-Chip Interconnection Networks, DAC, 2001

    [19] Axel Jantsch, Networks on Chip, 2002

    [20] Ahmed Hemani, Axel Jantsch, Shashi Kumar, Adam Postula, Johnny Oberg, Mikael Millberg and Dan Lindqvist, Network on Chip: Architecture for billion transistor era, Proceedings of the IEEE NorChip Conference, 2000

    [21] Luca Benini and Giovanni De Micheli, Networks on Chips: A New SOC Paradigm, Computer, pp. 70-78, 2002

    [22] Rostislav Dobkin, Israel Cidon, Ran Ginosar, Avinoam Kolodny and Arkadiy Morgenshtein, Fast Asynchronous Bit-Serial Interconnects for Network-on-Chip, 2004

    [23] Simon Haynes and Peter Cheung, A Reconfigurable Multiplier Array for Video Image Processing Tasks, Suitable for Embedded In An FPGA Structure, IEEE Symposium on Field-Programmable Custom Computing, 1998

    [24] Simon Haynes, Antonio Ferrari and Peter Cheung, Flexible Reconfigurable Multiplier Blocks Suitable for Enhancing the Architecture of FPGAs, Proceedings of Custom Integrated Circuit Conference, 1999

    [25] Nalin Sidahao, George Constantinides and Peter Cheung, Architectures for Function Evaluation on FPGAs, IEEE Symposium on Circuits and Systems, pp. 804-807, 2003

    [26] Wim Melis, Peter Cheung and Wayne Luk, Autonomous Memory Block for Reconfigurable Computing, ISCAS, pp. 581-584, 2004

    [27] T. Wiangtong, C.T. Ewe and P.Y.K. Cheung, SONICmole: A Debugging Environment for the UltraSONIC Reconfigurable Computer, ISCAS, pp. 808-811, 2003

    [28] William R. Mark, R. Stephen Glanville, Kurt Akeley and Mark J. Kilgard, Cg: A system for programming graphics hardware in a C-like language, ACM Transactions on Graphics, pp. 896-907, 2003

    [29] R. Fernando and M.J. Kilgard, The Cg Tutorial: The definitive guide to programming real-time graphics, Addison Wesley, 2003

    [30] Ben Cope, Efficient Implementation of Primary Colour Correction on Graphics Hardware, available from author, 2005

    [31] Ben Cope, Breakdown of performance for Primary Colour Correction, available from author, 2005
