COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI]
Processor Design
Lecture Objectives
• Background
• Need for Accelerator
• Accelerators and different types of parallelism
• Processor architecture and different approaches to acceleration
• Requirements of applications for hardware coprocessors
• Numeric coprocessors
• Various types of reconfigurable accelerators
• Milk coprocessor
• Butter accelerator
How to improve the performance of a microprocessor system?
Choose a faster version of your microprocessor
Add additional computational units that perform special functions:
• Standard Component (Graphics Processor)
• Coprocessor (Floating-Point Processor)
• Additional Microprocessor
• Hardware Accelerator
Hardware Accelerator
If the overall performance of a uni-processor system is too slow, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator!
The hardware accelerator is a component that works together with the processor and executes key functions much faster than the processor.
An Accelerator is NOT a COPROCESSOR
A co-processor is connected to the CPU and executes special instructions. Instructions are dispatched by the CPU.
An accelerator appears as a device on the bus.
Accelerators and different types of parallelism
One of the key properties that can be exploited is parallelism:
• Instruction level parallelism
• Loop level parallelism
• Task level parallelism
• Program level parallelism
• Data parallelism
Processor architectures and different approaches to acceleration
DSP processors, RISC microprocessors, CISC microprocessors
Applications and protocols change fast, so having a programmable core in the system is recommended to guarantee the general validity and flexibility of the platform.
One possible way of accelerating a programmable core is to exploit the instruction and/or data parallelism of applications by providing the processor with VLIW or SIMD extensions;
another way consists in adding special functional units (MAC circuits, barrel shifters, or other special components designed to speed up the execution of DSP algorithms) to the datapath of the programmable core.
The design and verification issues related to coprocessors can be faced independently from the ones related to the main processor: this way it is possible to parallelize the design activities, saving time.
Requirements of applications for hardware coprocessors
Different application domains call for different kinds of accelerators:
For example, many applications require floating-point computation:
• robotics,
• automation,
• Dolby digital audio,
• 3D graphics,
thus making the insertion of an FPU very useful and sometimes even necessary.
A very effective way of solving this problem, which is widely accepted nowadays, is to make those architectures run-time reconfigurable.
This means that the hardware is designed so that the datapath of the architecture can be changed by modifying the value of special bits, named configuration bits or configware.
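As a rough illustration of configware, the sketch below models a tiny 32-bit datapath whose operation is chosen by a configuration word. The 2-bit encoding and the name `datapath` are hypothetical, chosen only for this example, and do not describe any real architecture.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit wide datapath


def datapath(config_bits: int, a: int, b: int) -> int:
    """Apply the operation selected by a 2-bit configuration word.

    The encoding below is an assumption made for illustration only.
    """
    if config_bits == 0b00:
        result = a + b            # adder
    elif config_bits == 0b01:
        result = a * b            # multiplier (low 32 bits kept)
    elif config_bits == 0b10:
        result = a << (b & 31)    # shift left
    elif config_bits == 0b11:
        result = a >> (b & 31)    # shift right
    else:
        raise ValueError("configuration word must fit in 2 bits")
    return result & MASK32


# The same "hardware" behaves differently once the configuration bits change.
print(datapath(0b00, 6, 7))
print(datapath(0b01, 6, 7))
```

Changing only `config_bits` switches the function computed on the same operands, which is the essence of run-time reconfiguration.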
Numeric coprocessors: floating-point units
Commonly required: floating-point arithmetic, which leads to higher complexity
P.S. The area of FPUs is usually quite large; this has usually discouraged designers from including them in their systems
There are different types of FPU, ranging from proprietary to open-source ones, supporting the IEEE-754 standard or not, capable of single-precision or double-precision computation, for usage with CISC or RISC machines
Numeric coprocessors: floating-point units [cont.]
For RISC cores, one of the most important examples is given by the FPUs for ARM, called VFP-9, VFP-10 and VFP-11: pipelined, powerful vector FPUs with some software-configurable functions,
also supporting double precision to enhance accuracy in calculation
MEIKO is an FPU developed at SUN
Used with the Leon processor, an open-source RISC core developed at Gaisler Research
The FPU from Jidan Al-Eryani is a complete coprocessor, which features hardware logic to handle denormal operands, even though it does not support parallel execution of instructions.
Various types of reconfigurable accelerators
Butter Co-Processor [overview]
N×M array of reconfigurable processing elements (cells)
Each cell features integer and floating-point arithmetic operations, shift and LUT-based operations
Flexible interconnect schemes between the cells, providing nearest-neighbor and global communication
Nearest-neighbor interconnections are sufficient to implement the simplest DSP algorithms,
while the global ones are more useful for matrix multiplications and 3D graphics algorithms
Dedicated input and output in addition to the system bus (or network!) interface which is mainly used for configuration purposes
Butter Accelerator
A coarse-grain reconfigurable accelerator
Maximizing the performance in the elaboration of multimedia, signal processing and 3D applications.
Coarse-Grained Parallelism
• Infrequent data communication, after larger amounts of computation
A parametric VHDL model
IMPACT: however, the mapping of VHDL on standard-cells technologies implies
• more area on chip
• lower clock frequencies
Butter Accelerator [cont.]
Execution of applications: detecting the parts to run on specialized hardware
Butter is a coprocessor attached to the system bus. Configuration bits are stored in a dedicated memory inside Butter, and can be written by the core or via DMA transfers.
Direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit.
Butter Processing Element: Cells
Butter is organized as a matrix of processing elements called cells. Each cell has:
• two input ports to read 32-bit wide operands;
• a 6-bit wide input port (configuration bits);
• reset and enable inputs, used to control the internal registers of the cell.
• two 32-bit output ports for each cell, carrying either the 64-bit result of a 32-bit multiplication, or
a generic 32-bit result coming from another functional unit
Input registers inside the cells are used to sample the operands:
• they introduce the pipeline
• they can be disabled to avoid useless dynamic power consumption
A special input register is used to keep constant values inside the cell, so that they can be used during the elaboration with no need to re-route them.
Butter Processing Element: Cells [cont.]
Inside each cell there are three functional units: a multiplier, an adder, a barrel shifter
A small memory (4 cells, 32-bit wide) used as lookup table (LUT)
A special functional unit (floating-point multiplications)
3D graphics benefit from fast, low precision floating-point operations
A dedicated block inside the cells processes the results produced by the adder and the multiplier, rounding them to be stored in the floating-point format. It has three portions:
• one calculates the amount of leading zeros for each of the operands,
• one calculates the sign of the result,
• one packs the internal number into the final format.
Internal Architecture of a Cell of Butter Accelerator
The first row of cells read their operands from global vertical interconnections. The results of the elaboration are put as output accessible from the underlying rows.
The final result can be read from outside Butter either from its last row at the bottom of the device, or from the rightmost column:
results can be accessed as soon as they are produced, with no need to wait for them to go through all the rows.
Different kinds of interconnection inside Butter
Interconnections in Butter
The interleaved interconnection is useful (for example) to propagate the 64-bit result of multiplications, splitting their processing over two adjacent rows. Interleaved connections ease and enhance the mapping of some algorithms, and
reduce the amount of cells used.
Thanks to the interleaved connections it is possible to implement the FIR algorithm using only three rows of the array:
the first row executes the multiplications,
the second row the additions of the least significant bits of the products,
the third row the addition of the most significant bits.
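The three-row FIR mapping can be mimicked in software. This is a minimal sketch, assuming unsigned 32-bit samples and coefficients and assuming that the carry out of the low-word additions is forwarded to the high-word additions (the slides do not say how the carry is handled); the function name is illustrative.

```python
MASK32 = 0xFFFFFFFF


def fir_output_three_rows(samples, coeffs):
    """Compute one FIR output sample as three 'rows' of 32-bit operations.

    Assumes every sample and coefficient is an unsigned 32-bit integer.
    Returns the 64-bit accumulated result (modulo 2**64).
    """
    # Row 1: 32x32 -> 64-bit multiplications, each split into two 32-bit words.
    products = [s * c for s, c in zip(samples, coeffs)]
    low_words = [p & MASK32 for p in products]
    high_words = [(p >> 32) & MASK32 for p in products]

    # Row 2: add the least significant words and remember the carry.
    low_sum = sum(low_words)
    carry = low_sum >> 32
    low_result = low_sum & MASK32

    # Row 3: add the most significant words plus the carry from row 2.
    high_result = (sum(high_words) + carry) & MASK32

    return (high_result << 32) | low_result


samples = [0xFFFFFFFF, 3, 0x80000000]
coeffs = [0xFFFFFFFF, 5, 2]
reference = sum(s * c for s, c in zip(samples, coeffs)) & 0xFFFFFFFFFFFFFFFF
print(fir_output_three_rows(samples, coeffs) == reference)
```

Splitting each product into two words is what lets 32-bit adders in two adjacent rows accumulate a 64-bit sum.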
Global Interconnections: connecting the output of each cell to every input of the cells lying on the row below. Useful for algorithms like matrix–matrix multiplications and matrix–vector multiplications.
Butter Co-Processor Requirements
Butter was synthesized on FPGA: operating frequency 57 MHz
90 nm standard-cells technology: operating frequency 280 MHz
Thanks to its wide datapath, high parallelism and pipelined nature, Butter can run algorithms using a very limited amount of clock cycles;
for example, an FIR filter takes 16 cycles,
a matrix–vector multiplication takes 4 cycles, and a 2D IDCT 54 cycles.
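To put the cycle counts in perspective, they can be converted to execution times at the two reported clock frequencies. This is a back-of-the-envelope sketch: it assumes the cycle counts are the same on both targets and ignores configuration and data-transfer time, neither of which is stated in the slides.

```python
# Figures taken from the slide above.
CYCLES = {"FIR filter": 16, "matrix-vector multiplication": 4, "2D IDCT": 54}
CLOCK_MHZ = {"FPGA": 57, "90 nm standard cells": 280}


def execution_time_ns(cycles: int, clock_mhz: float) -> float:
    """Time in nanoseconds for `cycles` clock cycles at `clock_mhz` MHz."""
    return cycles * 1000.0 / clock_mhz


for target, mhz in CLOCK_MHZ.items():
    for kernel, cycles in CYCLES.items():
        print(f"{kernel} on {target}: {execution_time_ns(cycles, mhz):.1f} ns")
```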
Milk Coprocessor: Design and Verification of a VHDL Model of a Floating-Point Unit for a RISC Microprocessor
Solutions to Improve Performance
• pipelined architecture, to deliver up to one result per clock cycle
• parallel elaboration of instructions
• High Parallelism: different functional units commit their elaboration simultaneously; a multi-port register file allows the concurrent write back of their results.
Parallel elaboration of instructions is made so that some fast instructions can be run while a heavier one is still in progress; the compiler can then provide a significant improvement in the execution of algorithms by making a good scheduling of the instructions, reducing this way unused clock cycles and increasing global computation efficiency.
• fast internal bus switching
• hardware support for denormal operands handling
Scalability & Adaptability: functional units can be inserted or removed from the architecture in an immediate way
Modularity to the Functional Unit
Any non-zero number which is smaller in magnitude than the smallest normal number is 'denormal'.
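The definition of a denormal can be checked directly on the bit pattern of a number. The sketch below assumes the IEEE-754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits); the helper names are illustrative and not part of Milk.

```python
import struct


def float_to_bits(value: float) -> int:
    """Return the IEEE-754 single-precision bit pattern of `value`."""
    return struct.unpack(">I", struct.pack(">f", value))[0]


def classify_float32(bits: int) -> str:
    """Classify a 32-bit pattern as zero, denormal, normal, infinity or NaN."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:
        # A zero exponent field with a non-zero fraction marks a denormal.
        return "zero" if fraction == 0 else "denormal"
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "NaN"
    return "normal"


print(classify_float32(float_to_bits(1.0)))
print(classify_float32(float_to_bits(1e-40)))  # below the smallest normal, 2**-126
```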
Hardware logic for “register locking” and to stall the core
The GCC compiler’s support.
Milk co-processor external interface
Coffee RISC core supports up to four coprocessors: two signals (c_index[1..0]) are used to select which coprocessor is currently being addressed.
Pins:
1. wr_cop
2. rd_cop
3. c_index[1,0]
4. r_index[3,0]
5. cop_exc
6. data(31,0)
Interfacing:
wr_cop and rd_cop specify the data exchange direction (input or output).
It has a 4-bit address (r_index) used for internal register addressing:
• bit r_index[3] logical high: a special register is being indexed (r_index[0] then selects between the status register and the control register)
• bit r_index[3] logical low: one among the eight general purpose registers is being indexed
Signal cop_exc indicates internal exceptions: they are
• concurrent writes on the coprocessor register file, by the internal functional units and the processor core
• arithmetical exceptions: overflow, underflow, inexact result, invalid operand, division by zero
• illegal instruction code (the current instruction is not supported by the coprocessor).
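The register addressing rule can be written as a small decoder. This is only a sketch: the slides do not state which value of r_index[0] selects the status register and which the control register, so the mapping below (0 for status, 1 for control) is an assumption, as is the function name.

```python
def decode_r_index(r_index: int) -> str:
    """Decode the 4-bit register index of the coprocessor interface."""
    if not 0 <= r_index <= 0b1111:
        raise ValueError("r_index is a 4-bit value")
    if r_index & 0b1000:
        # r_index[3] high: a special register is being indexed.
        # Assumption: r_index[0] = 0 -> status register, 1 -> control register.
        return "control register" if r_index & 0b0001 else "status register"
    # r_index[3] low: the low three bits pick one of eight general purpose registers.
    return f"general purpose register {r_index & 0b0111}"


print(decode_r_index(0b0101))
print(decode_r_index(0b1000))
```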
Milk Coprocessor Internal Architecture
MILK Co-Processor Requirements
It requires 105 K gates. The operating frequency is 400 MHz on a 90 nm standard-cells technology.
20K Logic Elements running at 67 MHz on an Altera Stratix FPGA.
It is capable of completing instructions in a very small number of clock cycles:
• 3 for multiplications,
• 5 for additions,
• 8 for square root,
• 11 for divisions,
• 2 for conversions,
• 1 for all the other ones
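Using these latencies, the benefit of running fast instructions while a heavier one is still in progress can be estimated. The model below is a deliberate simplification and not a description of Milk's actual issue logic: it assumes all instructions are independent, one instruction is issued per cycle, and no functional unit or register-file port is ever contended.

```python
# Latencies in clock cycles, taken from the slide above.
LATENCY = {"mul": 3, "add": 5, "sqrt": 8, "div": 11, "conv": 2}


def serial_cycles(program):
    """Cycles needed if each instruction waits for the previous one to finish."""
    return sum(LATENCY[op] for op in program)


def overlapped_cycles(program):
    """Cycles needed if one independent instruction is issued every cycle.

    Instruction i is issued at cycle i and completes at cycle i + latency;
    the program ends when the last result is written back.
    """
    return max((i + LATENCY[op] for i, op in enumerate(program)), default=0)


program = ["div", "mul", "add", "mul"]
print(serial_cycles(program), overlapped_cycles(program))
```

In this idealized model the three short instructions hide entirely behind the division, which is the kind of gain a well-scheduled instruction stream aims for.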
QUESTIONS?