coprocessor approach to accelerating multimedia … · a coarse grain reconfigurable coarse-grained...

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI]

Processor Design

Lecture Objectives

• Background
• Need for accelerators
• Accelerators and different types of parallelism
• Processor architectures and different approaches to acceleration
• Requirements of applications for hardware coprocessors
• Numeric coprocessors
• Various types of reconfigurable accelerators
• Milk coprocessor
• Butter accelerator

How to improve the performance of a microprocessor system?

• Choose a faster version of your microprocessor
• Add additional computational units that perform special functions:
  • Standard component (graphics processor)
  • Coprocessor (floating-point processor)
  • Additional microprocessor
  • Hardware accelerator

Hardware Accelerator

If the overall performance of a uni-processor system is too slow, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator!

The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.

An Accelerator is NOT a COPROCESSOR

• A co-processor is connected to the CPU and executes special instructions. Instructions are dispatched by the CPU.
• An accelerator appears as a device on the bus.

Accelerators and different types of parallelism

One of the key properties that can be exploited is the parallelism:
• Instruction level parallelism
• Loop level parallelism
• Task level parallelism
• Program level parallelism
• Data parallelism

Processor architectures and different approaches to acceleration

• DSP processors
• RISC microprocessors
• CISC microprocessors

Applications and protocols change fast, so having a programmable core in the system is recommended to guarantee general validity and flexibility of the platform.

One possible way of accelerating a programmable core is to exploit the instruction and/or data parallelism of applications by providing the processor with VLIW or SIMD extensions.

Another way consists in adding special functional units (MAC circuits, barrel shifter, other special components designed to speed up the execution of DSP algorithms) in the datapath of the programmable core.

The design and verification issues related to coprocessors can be faced independently from the ones related to the main processor: this way it is possible to parallelize the design activities, thus saving time.
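As a rough illustration of why a MAC unit helps DSP code, the inner loop of a FIR filter is one multiply-accumulate per tap. The sketch below is plain Python and does not model any particular processor; the function name `fir_mac` and the sample values are illustrative assumptions.

```python
def fir_mac(samples, coeffs):
    """Compute one FIR output sample as a chain of multiply-accumulate steps.

    Each loop iteration is the operation a hardware MAC unit would perform
    in a single step: acc = acc + sample * coefficient.
    """
    if len(samples) != len(coeffs):
        raise ValueError("samples and coeffs must have the same length")
    acc = 0
    for x, c in zip(samples, coeffs):
        acc += x * c  # one MAC operation
    return acc


# Hypothetical 4-tap example: 1*4 + 2*3 + 3*2 + 4*1 = 20
result = fir_mac([1, 2, 3, 4], [4, 3, 2, 1])
```

Without a MAC unit, each tap costs a separate multiply and a separate add; with one, the loop body maps to a single operation.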

Requirements of applications for hardware coprocessors

Different application domains call for different kinds of accelerators.

For example, applications such as robotics, automation, Dolby digital audio and 3D graphics require floating-point computation, thus making the insertion of an FPU very useful and sometimes even necessary.

A very effective way of solving this problem, which is widely accepted nowadays, is to make those architectures run-time reconfigurable.

This means that the hardware is designed so that the datapath of the architecture can be changed by modifying the value of special bits, named configuration bits or configware.
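The idea of configuration bits selecting the datapath can be sketched in software as a lookup from a bit pattern to an operation. The 2-bit encoding, the `OPERATIONS` table and the `run_datapath` name below are hypothetical and are not taken from any real device.

```python
import operator

# Hypothetical 2-bit configuration encoding; not taken from any real device.
OPERATIONS = {
    0b00: operator.add,
    0b01: operator.sub,
    0b10: operator.mul,
    0b11: lambda a, b: a << (b & 0x1F),  # shift left by up to 31 positions
}

MASK32 = 0xFFFFFFFF


def run_datapath(config_bits, a, b):
    """Apply the operation selected by config_bits to two 32-bit operands."""
    op = OPERATIONS[config_bits & 0b11]
    return op(a, b) & MASK32  # wrap to 32 bits like a hardware register


# Same "hardware", two different configurations.
as_adder = run_datapath(0b00, 6, 7)
as_multiplier = run_datapath(0b10, 6, 7)
```

Changing `config_bits` changes what the same unit computes, which is the essence of run-time reconfiguration.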

Numeric coprocessors: floating-point units

Commonly required: floating-point arithmetic, which leads to higher complexity.

P.S. The area of FPUs is usually quite large; this point has usually discouraged designers from including them in their systems.

There are different existing typologies of FPU, ranging from proprietary to open-source ones, supporting the IEEE-754 standard or not, capable of single-precision or double-precision computation, for usage with CISC or RISC machines.

Numeric coprocessors: floating-point units [cont.]

For RISC cores, one of the most important examples is given by the FPUs for ARM, called VFP-9, VFP-10 and VFP-11: pipelined, powerful vector FPUs with some software-configurable functions, supporting also double precision to enhance accuracy in calculation.

MEIKO is an FPU developed at SUN. It is used with the Leon processor, an open-source RISC core developed at Gaisler Research.

The FPU from Jidan Al-Eryani is a complete coprocessor, which features hardware logic to handle denormal operands, even though it does not support parallel execution of the instructions.

Various types of reconfigurable accelerators

Butter Co-Processor [overview]

NxM array of reconfigurable processing elements (cells)

Each cell features integer and floating-point arithmetic operations, shift and LUT-based operations

Flexible interconnect schemes between the cells, providing nearest-neighbor and global communication
• Nearest-neighbor interconnections are anyway sufficient to implement the simplest DSP algorithms
• The global ones are more useful for matrix multiplications and 3D graphics algorithms

Dedicated input and output in addition to the system bus (or network!) interface, which is mainly used for configuration purposes

Butter Accelerator

A coarse-grain reconfigurable accelerator

Coarse-grained parallelism:
• Infrequent data communication, after larger amounts of computation

Maximizing the performance in the elaboration of multimedia, signal processing, 3D applications.

A parametric VHDL model

IMPACT: however, the mapping of VHDL onto standard-cell technologies implies
• more area on chip
• lower clock frequencies

Butter Accelerator [cont.]

execution of applications detecting the parts specialized hardware

Butter is a coprocessor attached to the system bus. Configuration bits are stored in a dedicated memory inside Butter, and can be written by the core or via DMA transfers.

Direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit.

Butter Processing Element: Cells

Butter is organized as a matrix of processing elements called cells. Each cell has:
• two input ports to read 32-bit wide operands
• a 6-bit wide input port (configuration bits)
• reset and enable inputs, used to control the internal registers of the cell
• two 32-bit output ports, carrying either the 64-bit result of a 32-bit multiplication or a generic 32-bit result coming from another functional unit

Input registers inside the cells are used to sample the operands:
• they introduce the pipeline
• they can be disabled to avoid useless dynamic power consumption

A special input register is used to keep constant values inside the cell, so that they can be used during the elaboration with no need to re-route them.

Butter Processing Element: Cells [cont.]

Inside each cell there are three functional units:
• a multiplier
• an adder
• a barrel shifter

In addition, each cell contains:
• a small memory (4 entries, 32-bit wide) used as a lookup table (LUT)
• a special functional unit for floating-point multiplications

3D graphics benefit from fast, low-precision floating-point operations.

A dedicated block inside the cells (with three portions) takes the results produced by the adder and the multiplier, rounding them to be stored in the floating-point format. It:
• calculates the amount of leading zeros for each of the operands
• calculates the sign of the result
• packs the internal number into the final format
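The three steps named above (leading-zero count, sign calculation, packing) can be sketched as small helper functions. This is only an illustration: the text does not state the exact floating-point format used inside Butter, so the IEEE-754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits) is an assumption, and all function names are hypothetical.

```python
def count_leading_zeros(value, width=32):
    """Number of zero bits above the most significant set bit of value."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return width - value.bit_length()


def product_sign(sign_a, sign_b):
    """Sign bit of a product: the XOR of the operand sign bits."""
    return (sign_a ^ sign_b) & 1


def pack_single(sign, biased_exponent, fraction):
    """Pack fields into an assumed IEEE-754 single-precision bit pattern."""
    return ((sign & 1) << 31) | ((biased_exponent & 0xFF) << 23) | (fraction & 0x7FFFFF)


# 1.0 is sign 0, biased exponent 127, fraction 0.
one_bits = pack_single(0, 127, 0)
```

The leading-zero count tells a normalizer how far an operand must be shifted so that its most significant set bit lands in the expected position.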

Internal Architecture of a Cell of Butter Accelerator

The first row of cells read their operands from global vertical interconnections. The results of the elaboration are put as output accessible from the underlying rows. The final result can be read externally of Butter either from its last row at the bottom of the device, or from the rightmost column: results can be accessed as soon as they are produced, with no need to wait for them to go through all the rows.

Different kinds of interconnection inside Butter

Interconnections in Butter

The interleaved interconnection is useful (for example) to propagate the 64-bit result of multiplications, splitting their processing over two adjacent rows. Interleaved connections are useful in easing and enhancing the mapping of some algorithms, and in reducing the number of cells used.

Thanks to the interleaved connections it is possible to implement the FIR algorithm using only three rows of the array:
• the first row executes the multiplications,
• the second row the additions of the least significant bits of the products,
• the third row the addition of the most significant bits.
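The three-row FIR mapping described above can be mimicked in software by splitting each 64-bit product into two 32-bit halves and summing the halves separately. The sketch below assumes unsigned 32-bit operands, and it assumes that carries out of the low-half sum are folded into the high half when the partial results are recombined; the text does not say how Butter handles those carries. The name `fir_three_rows` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fir_three_rows(samples, coeffs):
    """Model a FIR output computed in three stages on unsigned 32-bit data."""
    # Row 1: 32x32-bit multiplications, each giving a 64-bit product.
    products = [(x & MASK32) * (c & MASK32) for x, c in zip(samples, coeffs)]

    # Row 2: add the least significant 32-bit halves of the products.
    low_sum = sum(p & MASK32 for p in products)

    # Row 3: add the most significant 32-bit halves of the products.
    high_sum = sum(p >> 32 for p in products)

    # Assumption: carries out of the low-half sum are folded into the
    # high half when the two partial results are recombined.
    return (high_sum << 32) + low_sum
```

Because the low-half sum is kept at full width here, the recombined value equals the direct sum of products.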

Global Interconnections: connecting the output of each cell to every input of the cells lying on the row below; useful for algorithms like matrix–matrix multiplications and matrix–vector multiplications.

Butter Co-Processor Requirements

Butter was synthesized on FPGA: operating frequency 57 MHz

90 nm standard-cell technology: operating frequency 280 MHz

Thanks to its wide datapath, high parallelism and pipelined nature, Butter can run algorithms using a very limited number of clock cycles; for example:
• an FIR filter takes 16 cycles,
• a matrix–vector multiplication takes 4 cycles,
• a 2D IDCT takes 54 cycles.
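To put the cycle counts in perspective, they can be converted to time at the two quoted clock frequencies using time = cycles / frequency. The helper below is a minimal sketch; it ignores configuration loading and data transfer time, which the text does not quantify, and the name `execution_time_ns` is illustrative.

```python
def execution_time_ns(cycles, frequency_mhz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / frequency_mhz


# Cycle counts and clock frequencies quoted in the text.
CYCLES = {"FIR filter": 16, "matrix-vector multiplication": 4, "2D IDCT": 54}
FREQUENCIES_MHZ = {"FPGA": 57, "90 nm standard cells": 280}

for name, cycles in CYCLES.items():
    for target, mhz in FREQUENCIES_MHZ.items():
        print(f"{name} on {target}: {execution_time_ns(cycles, mhz):.1f} ns")
```

For instance, 16 cycles at 280 MHz is roughly 57 ns.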

Milk Coprocessor

Design and Verification of a VHDL Model of a Floating-Point Unit for a RISC Microprocessor

Solutions to Improve Performance

• Pipelined architecture, to deliver up to one result per clock cycle
• Parallel elaboration of instructions: some fast instructions can be run while a heavier one is still in progress; the compiler can then provide a significant improvement in the execution of algorithms by scheduling the instructions well, thus reducing unused clock cycles and increasing global computation efficiency
• High parallelism: different functional units commit their elaboration simultaneously, and a multi-port register file allows the concurrent write back of their results
• Fast internal bus switching
• Hardware support for denormal operands handling: any non-zero number whose magnitude is smaller than the smallest normal number is 'denormal'
• Scalability & adaptability: functional units can be inserted or removed from the architecture in an immediate way
• Modularity of the functional units
• Hardware logic for "register locking" and to stall the core
• The GCC compiler's support
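The definition of a denormal number given above can be checked at the bit level. The sketch assumes the IEEE-754 single-precision format, in which a denormal has an all-zero exponent field and a non-zero fraction, and the smallest normal magnitude is 2^-126; the function name `is_denormal_single` is illustrative.

```python
import struct


def is_denormal_single(value):
    """Return True if value, stored as IEEE-754 single precision, is denormal."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return exponent == 0 and fraction != 0


SMALLEST_NORMAL = 2.0 ** -126  # smallest positive normal single-precision value
```

Zero is excluded because its fraction field is also zero, matching the "non-zero" condition in the definition.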

Milk co-processor external interface

The Coffee RISC core supports up to four coprocessors; two signals (c_index[1..0]) are used to select which coprocessor is currently being addressed.

Pins interfacing:
1. wr_cop
2. rd_cop
3. c_index[1,0]
4. r_index[3,0]
5. cop_exc
6. data(31,0)

The wr_cop and rd_cop signals specify the data exchange direction (input or output).

It has a 4-bit address (r_index) used for internal register addressing:
• bit r_index[3] logical high: a special register is being indexed (r_index[0] then selects among status register or control register)
• bit r_index[3] logical low: one among the eight general purpose registers is being indexed

The signal cop_exc indicates internal exceptions. They are:
• concurrent writes on the coprocessor register file, by the internal functional units and the processor core
• arithmetical exceptions: overflow, underflow, inexact result, invalid operand, division by zero
• illegal instruction code (the current instruction is not supported by the coprocessor)
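The register addressing scheme described above can be sketched as a small decoder. The text says that r_index[0] selects between the status and control registers but not which value selects which, so the mapping used here (0 for status, 1 for control) is an assumption, as is the function name `decode_r_index`.

```python
def decode_r_index(r_index):
    """Decode a 4-bit register index into a (kind, register) pair."""
    if not 0 <= r_index <= 0xF:
        raise ValueError("r_index must fit in 4 bits")
    if r_index & 0b1000:  # r_index[3] high: a special register is indexed
        # Assumption: r_index[0] == 0 selects status, 1 selects control.
        return ("special", "control" if r_index & 1 else "status")
    # r_index[3] low: the low three bits pick one of eight general registers.
    return ("general", r_index & 0b0111)
```

For example, `decode_r_index(0b0101)` addresses general purpose register 5.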

Milk Coprocessor Internal Architecture

MILK Co-Processor Requirements

• It requires 105 K gates; the operating frequency is 400 MHz on a 90 nm standard-cell technology
• 20K Logic Elements running at 67 MHz on an Altera Stratix FPGA
• It is capable of completing instructions in a very small number of clock cycles: 3 for multiplications, 5 for additions, 8 for square root, 11 for divisions, 2 for conversions and 1 for all the other ones
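Using the latencies listed above, a rough model shows what running fast instructions while a heavier one is still in progress can buy. The sketch compares strictly sequential execution with an idealized overlapped case. It assumes one instruction issued per cycle, no data dependencies and no contention for functional units or register-file ports; these are simplifications and not a description of Milk's actual scheduling. The operation names and the sample `program` are hypothetical.

```python
# Latencies in clock cycles, as quoted in the text.
LATENCY = {"mul": 3, "add": 5, "sqrt": 8, "div": 11, "conv": 2, "other": 1}


def sequential_cycles(program):
    """Cycles if each instruction starts only after the previous one completes."""
    return sum(LATENCY[op] for op in program)


def overlapped_cycles(program):
    """Cycles if one instruction issues per cycle and all run independently.

    Assumes no data dependencies and no contention for functional units
    or register-file ports, which is an idealization.
    """
    return max((issue + LATENCY[op] for issue, op in enumerate(program)), default=0)


program = ["div", "mul", "add", "conv"]
sequential = sequential_cycles(program)
overlapped = overlapped_cycles(program)
```

In this idealized case the division dominates: the three shorter instructions complete while it is still in progress.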

QUESTIONS?
