Post on 20-Dec-2015
ARCES University of Bologna
Reconfigurable Architectures
Andrea Lodi
SoC trends
• Increasing mask cost (~ 3M$)
• Increasing design complexity
• Increasing design time (~ 3M$)
• Rapidly changing communication standards
• Low-power design in wireless environment
• Increasing algorithmic complexity requirements
Product life cycle
[Figure: sales versus time over the Growth, Maturity and Decrease phases, comparing "time-to-market met" with "time-to-market failed"; the difference is labelled LOSS]
Trends in wireless systems
• Increased on-chip Transistor density
• Increased design complexity
[Figure: millions of transistors per chip and technology (nm), 1997 to 2009]
• Increased Algorithmic complexity
• Low battery capacity growth
[Figure: algorithm complexity, Moore's law and battery capacity, 1997 to 2009]
• Demand for reusability and flexibility
• Demand for high performance and energy efficiency
Digital architecture design space
Parallelism in computation
• Thread level parallelism
• Instruction level parallelism (ILP)
• Pipeline (loop level)
• Fine-grain parallelism (bit/byte-level)
Instruction level parallelism
[Figure: data-flow graph over the inputs a, b, c, d, e and the constant 3, built from adders, multipliers and a subtractor; shown again as an ASIC Implementation]
Spatial vs. Temporal Computing
Spatial (ASIC): $Ax^2 + Bx + C$
Temporal (Processor): $(Ax + B)x + C$
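The two forms of the polynomial can be sketched in C. This is only an illustration: the function names and the use of `int` are assumptions, not part of the slides. The first function writes out every operator separately, as a spatial datapath would instantiate them; the second reuses one running value step by step, as a processor would.

```c
/* Spatial form, A*x*x + B*x + C: each operator is written out once,
   as if it had its own hardware unit (three multiplies, two adds). */
static int poly_spatial(int a, int b, int c, int x) {
    int x2  = x * x;      /* multiplier 1 */
    int bx  = b * x;      /* multiplier 2, independent of multiplier 1 */
    int ax2 = a * x2;     /* multiplier 3, needs x2 */
    return ax2 + bx + c;  /* two adders */
}

/* Temporal form, (A*x + B)*x + C: one running value is updated in
   four successive steps, each a single multiply or add. */
static int poly_temporal(int a, int b, int c, int x) {
    int t = a * x;  /* step 1 */
    t = t + b;      /* step 2 */
    t = t * x;      /* step 3 */
    return t + c;   /* step 4 */
}
```

Both functions return the same value for the same arguments; only the schedule of operations differs.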
Superscalar/VLIW processors
• FU limitations
• Register file size limitation
• Crossbar inefficiency
Byte-level parallelism in processors
• MMX technology: 57 new instructions
• Byte and half word parallel computation
• SIMD execution model
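The SIMD execution model can be emulated in portable C. The sketch below is not MMX code; it only mimics, on a 32-bit word, what a packed byte add does in hardware. The function name `add_bytes_packed` is an assumption made for this example.

```c
#include <stdint.h>

/* Add the four unsigned bytes packed in a and b lane by lane.
   Each byte wraps around on overflow and no carry crosses into
   the neighbouring byte. */
static uint32_t add_bytes_packed(uint32_t a, uint32_t b) {
    const uint32_t high = 0x80808080u;             /* top bit of every byte */
    uint32_t low_sum = (a & ~high) + (b & ~high);  /* add the low 7 bits; carries stay inside each byte */
    return low_sum ^ ((a ^ b) & high);             /* add the top bits without carry-out */
}
```

One ordinary 32-bit addition thus performs four byte additions at once, which is the idea behind byte-wide concurrent operations.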
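Bit-level parallelism
/* Reverse the lowest WIDTH bits of v (WIDTH is defined elsewhere). */
unsigned reverse(unsigned v) {
    unsigned r = 0;
    for (int x = 0; x < WIDTH; x++) {
        r = (r << 1) | (v & 1);  /* append the lowest bit of v to r */
        v = v >> 1;
    }
    return r;
}
[Figure: hardware datapath from v to r]
/* Count the bits set in v; unsigned so that the shift always terminates. */
int popcount(unsigned v) {
    int r = 0;
    while (v) {
        if (v & 1) r++;
        v = v >> 1;
    }
    return r;
}
[Figure: adder tree from v to r]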
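The adder tree in the figure has a well-known software counterpart. The sketch below sums neighbouring bit fields level by level, so every level corresponds to one row of adders. It assumes a 32-bit input; the name `popcount_tree` is chosen for this example only.

```c
#include <stdint.h>

/* Population count as a tree of additions on a 32-bit word. */
static unsigned popcount_tree(uint32_t v) {
    v = (v & 0x55555555u) + ((v >> 1) & 0x55555555u);  /* 16 fields, each the sum of 2 bits */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 8 fields, each the sum of 4 bits */
    v = (v & 0x0F0F0F0Fu) + ((v >> 4) & 0x0F0F0F0Fu);  /* 4 fields, each the sum of 8 bits */
    v = (v & 0x00FF00FFu) + ((v >> 8) & 0x00FF00FFu);  /* 2 fields, each the sum of 16 bits */
    v = (v & 0x0000FFFFu) + (v >> 16);                 /* final sum of all 32 bits */
    return v;
}
```

The loop version needs one iteration per bit, while the tree needs five levels for 32 bits; in hardware all adders of one level work in parallel.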
Pipeline parallelism
for (j = 0; j < MAX; j++)
    b[j] = popcount(a[j]);
[Figure: popcount adder tree from v to r, with pipeline registers marked]
FPGA
FPGA (Field-Programmable Gate Array) composed of 2 elements:
• Array of CLBs (configurable logic blocks) composed of:
– 1 or a few small LUTs (4:1 or 3:1, i.e. 4 or 3 inputs and 1 output)
– Control logic: mux controlled by configuration bits
– Dedicated computational logic (carry chain …)
• Configurable routing network connecting CLBs composed of:
– Different length wires
– Connection blocks connecting CLBs to the routing network
– Switch blocks connecting routing wires
The LUT contents and the configuration bits that program the CLBs and the routing network make up the FPGA configuration, which determines the implemented function
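A look-up table can be modelled in a few lines of C. This is a minimal sketch of a single 4-input, 1-output LUT; the type `lut4_t` and the function `lut4_eval` are names invented for this example and do not correspond to any vendor API.

```c
#include <stdint.h>

/* A 4-input, 1-output LUT: the 16 configuration bits are its truth table. */
typedef struct {
    uint16_t config;
} lut4_t;

/* The four inputs form an index 0..15 that selects one configuration bit. */
static unsigned lut4_eval(lut4_t lut, unsigned a, unsigned b, unsigned c, unsigned d) {
    unsigned index = (a & 1u) | ((b & 1u) << 1) | ((c & 1u) << 2) | ((d & 1u) << 3);
    return (lut.config >> index) & 1u;
}
```

Loading `config` with 0x6996 turns the LUT into a 4-input XOR, while 0x8000 turns it into a 4-input AND: the same hardware implements either function depending only on its configuration bits.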
Configurable logic block
Xilinx CLB
• Xilinx CLB 4000 series:
– 11 input 4 output bits
– 3 LUTs
– Carry logic
– 2 output registers
Configurable routing network
Example
Density Comparison
FPGA vs. Processor
FPGA (computing in space)
• Parallel execution
• Configurable in 10^2-10^3 cycles
• Fine-grained data
• Application specific operators
• Large area (switches, SRAM)
• Entire applications don't fit
• Slow synthesis, P&R tools
Processor (computing in time)
• Sequential execution
• Programmable every cycle
• Fixed-size operands
• Basic operators (ALU)
• Compact
• Handles complex control flow
• Fast compilers
Reconfigurable processors
But:
• 90% of execution time is spent in computational kernels:
– FPGAs 10-100x speed-up over processors
– FPGAs 10-100x denser than processors (bit-ops/λ²·s)
• Reconfigurable processor: RISC + FPGA
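To see what these figures imply for a whole application, one can apply Amdahl's law. The slides do not cite it; it is used here as an assumption, together with the hypothetical function name `overall_speedup`.

```c
/* Amdahl's law: overall speed-up when a fraction of the run time
   (kernel_fraction, between 0 and 1) is accelerated by kernel_speedup. */
static double overall_speedup(double kernel_fraction, double kernel_speedup) {
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

With 90% of the time in kernels, a 10x kernel speed-up gives about 5.3x overall and a 100x kernel speed-up about 9.2x. Under this model the remaining 10% of the code bounds the overall gain at 10x, so the part left on the processor still matters.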
Reconfigurable processor architecture
• Hybrid architectures:
– RISC processor
– FPGA
Computational models
• RC Array: IO Processor/Interface logic
• Attached processor
– Piperench, T-Recs
• ISA Extension
– Function unit: PRISC, OneChip, Chimaera
– Coprocessor: Garp, NAPA, Molen
IO Processor/Interface Logic
• Logic used in place of
– ASIC environment customization
– external FPGA/PLD devices
• Looks like IO peripheral to processor
• Example
– protocol handling
– stream computation
• compression, encrypt
– peripherals
– sensors, actuators
• Case for:
– Always have some system adaptation to do
– Modern chips have capacity to hold processor + glue logic
– reduce part count
– Glue logic varies
– many protocols, services
– only need few at a time
Example: Interface/Peripherals
• Triscend E5
Instruction Set Extension
• Instruction Bandwidth
– Processor can only describe a small number of basic computations in a cycle
• I bits → $2^I$ operations
– This is a small fraction of the operations one could do even in terms of $w \times w \to w$ Ops
• $w\,2^{2^{2w}}$ operations
– Processor could have to issue $w\,2^{(2^{2w} - I)}$ operations just to describe some computations
– An a priori selected base set of functions could be very bad for some applications
Instruction Set Extension
• Idea:
– provide a way to augment the processor's instruction set
– with operations needed by a particular application
Architectural Models for ISA extension
PLEIADES (Zhang et al., 2000)
CPU surrounded by a collection of application-specific custom computing devices
• High performance
• Overdesigned for most applications
• Difficult to program
XTENSA (Tensilica Inc., 2002)
RISC CPU featuring application-specific function units optionally inserted in the processor pipeline
• Good performance
• Easy to program
• Configured at mask level
Dynamic ISA Extension models
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically re-mapped depending on the performed algorithm
1: Coprocessor model
2: Function unit model
Coprocessor model: Garp
• Explicit instructions moving data to and from the array
• High communication overhead (long latency array operations)
• Processor stalled each time the array is active
• Array performs at TASK level (very coarse grain)
• 10-20x on stream, feed-forward operations
• 2-3x when data dependencies limit pipelining
Callahan, Hauser, Wawrzynek, 2000
Function unit model: PRISC
Razdan, Smith 1994
• Array fits in the RISC pipeline
• No communication overhead
• Some degree of parallelism between function units
• Gate array performs combinatorial instructions ONLY (very fine grain)
• Low speedup figures (2x/3x)
Function Unit Model: pros
• No communication overhead:
– Strict synergy between FPGA and other function units
– FPGA can be used frequently even for small functions
– Small reconfigurable array area
• Flow control handled by the core
• Memory access handled by the core
• Easy instruction set extension
• Configuration streams compiled from C
EXTENDIBLE INSTRUCTION SET RISC ARCHITECTURE
32-bit load/store RISC architecture (5-stage pipeline)
VLIW Elaboration
• Concurrent fetch and execution of two 32-bit instructions per cycle
• Fully bypassed, to minimize pipeline stalls (Average of 10/20% for most computational cores)
Set of specialized functional units
• Multiply/MAC Unit
• Branch/Decrement Unit
• ALU featuring "MMX" byte-wide concurrent operations
Embedded reconfigurable device for dynamic ISA extension
• DSP-oriented reconfigurable functional unit (PiCoGA)
• Fully configurable at execution time
• Elaboration and configuration controlled by asm instructions inserted in C source code
• PiCoGA used as a programmable datapath with independent pipeline structure
XiRisc Architecture
Dynamic Instruction Set Extension
Dynamic Instruction Set Extension
…
pgaload …
…
pgaop $3,$4,$5
…
Add $8, $3
[Figure: Register File and Configuration Memory]
PiCoGA Architecture
PiCoGA (Pipelined Configurable Gate Array): embedded datapath for dynamic ISA extension
• Dynamically reconfigurable
• Structured in rows activated in data-flow fashion by the PiCoGA control unit
• Can hold a state
• pGA-op latency depends on the specific mapped function
• Functionality is determined from DFG extracted from C code
[Figure: PiCoGA array made of PicoRows (synchronous elements), with the PiCoGA Control Unit and the Processor Interface]
Pico-cell Description
4x32-bit input data from Reg File
2x32-bit output data to Reg File
[Figure: RLC with input logic, LUT 16x2, carry chain, output logic and registers, EN, loop-back, PiCoGA control unit signals and configuration bus; input connect blocks, output connect blocks and a switch block; 12 global lines to/from Reg File]
Computing on PiCoGA
[Figure: a Data Flow Graph mapped onto the PiCoGA as Pga_op1 and Pga_op2, with Data in, Data out and the PiCoGA Control Unit]
Multi-context Array
[Figure: Configuration Cache holding Func. 1, Func. 2, Func. 3, Func. 4, … Func. n, next to the PiCoGA]
• Four configuration planes are available, one of them executing
• Plane switch takes just 1 clock cycle
• While a plane is executing another may be reconfigured → No reconfiguration time overhead
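The multi-context behaviour can be pictured with a small software model. This is only a sketch: the type `picoga_model_t`, the functions `load_plane` and `switch_plane`, and the rule that the executing plane cannot be reloaded are all assumptions made for illustration, not the actual PiCoGA interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PLANES 4  /* four configuration planes, as stated in the slide */

typedef struct {
    uint32_t function_id[NUM_PLANES];  /* function held by each plane (0 = empty) */
    unsigned active;                   /* plane currently executing */
} picoga_model_t;

/* Load a function into a plane that is not executing.
   Returns false for an invalid plane or for the active one (model assumption). */
static bool load_plane(picoga_model_t *p, unsigned plane, uint32_t function_id) {
    if (plane >= NUM_PLANES || plane == p->active)
        return false;
    p->function_id[plane] = function_id;
    return true;
}

/* Make another plane the executing one.
   Returns the cost in clock cycles: 1 for a valid switch, 0 if nothing changed. */
static unsigned switch_plane(picoga_model_t *p, unsigned plane) {
    if (plane >= NUM_PLANES)
        return 0;
    p->active = plane;
    return 1;
}
```

In this model the load of the next function overlaps with execution on the active plane, so the only visible cost of changing function is the single-cycle switch.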
Architecture Flexibility
[Figure: Yes/No decision chart over the questions below, leading to the two speed-up outcomes]
• Parallelism to exploit? (Ex: Turbo Decod., Motion Est.)
• Bit-level operations? (Ex: DES, Reed-Solomon)
• MAC intensive? (Ex: FFT, Scalar product)
• Memory intensive? (Ex: DCT, Motion Est.)
Outcomes:
• Speed-up from pGA (5x – 100x)
• Speed-up from DSP instructions and VLIW (1.5x – 2x)
Improvements for a large number of Data & Signal Processing algorithms
Programming XiRisc: Restrictions
• Fixed-point algorithms
• Variable size specification at the bit level
Not supported yet:
• Dynamic memory allocation
• Math library
• Operating System
XiRisc Compilation Flow
[Figure: compilation flow with the blocks File.c, C Compiler, Profiler, PiCoGA Configurator, Configuration Bit stream, PiCoGAop, Configuration Library and Software Simulation]
Example: Motion Estimation
Sum of Absolute Difference (SAD)
High instruction-level and inter-iteration parallelism
Data Flow Graph
• Pixel-pixel absolute difference: Abs(p1[i] – p2[i]), where p1[i], p2[i] are pixels
• Absolute difference sum tree
…
SAD
[Figure: absolute-difference units AD1, AD2, AD3, AD4 take operands from the Register File; their outputs feed the Sum of Absolute Difference (SAD8) and the result is written back to the Register File]
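For reference, the computation that the data-flow graph describes can be written as plain C. This is a minimal software sketch, assuming 8-bit pixels; the function name `sad` and its signature are choices made for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of Absolute Differences between two pixel arrays of length n. */
static unsigned sad(const uint8_t *p1, const uint8_t *p2, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned)abs((int)p1[i] - (int)p2[i]);  /* Abs(p1[i] - p2[i]) */
    return sum;
}
```

Every absolute difference is independent of the others, which is the instruction-level parallelism the array exploits; only the final summation needs the sum tree.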
Griffy Compiler
[Figure: flow with the blocks High-Level C Compiler, DFG-based description, Mapping, Place & Route, Configuration Bits, and Emulation Function with Latency and Issue Delay]
Performance evaluation
• Emulation function
• Latency and Issue-Delay back-annotation
• Profiling
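How latency and issue delay could combine into a cycle estimate can be sketched as follows. The formula is the usual one for a pipelined unit and is an assumption here, as is the reading of "issue delay" as the minimum number of cycles between two consecutive issues; the slides give neither. The name `pga_cycles` is hypothetical.

```c
/* Estimated cycles for n_ops back-to-back operations on a pipelined unit:
   the first result appears after 'latency' cycles and each further
   operation adds 'issue_delay' cycles. */
static unsigned long pga_cycles(unsigned long n_ops, unsigned latency, unsigned issue_delay) {
    if (n_ops == 0)
        return 0;
    return latency + (n_ops - 1) * (unsigned long)issue_delay;
}
```

For example, with a latency of 5 and an issue delay of 2, four consecutive operations would take 5 + 3 × 2 = 11 cycles under this model.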
Motion Estimation: Results
Motion estimation:
• 16 SAD operations in parallel
• PiCoGA occupation: ~100%
• Speed-up: 7x (with respect to standard XiRisc)
MPEG preliminary result:
• H.261 standard QCIF (176x144): 10 frames/sec
Reed-Solomon Encoder: Results
Encoder RS(15,9): 4-bit symbols
• PiCoGA occupation: ~25%
• Speed-up: 37x
• Throughput: 70.6 Mb/sec
Encoder RS(255,239), widely used: 8-bit symbols
• PiCoGA occupation: ~60%
• Speed-up: 135x
• Throughput: 187.1 Mb/sec
Speed-up and Power Consumption
Algorithm            Energy consumption reduction   Speed-up
                     (vs. std. XiRisc)              (vs. std. XiRisc)
DES encryption       89%                            13.5x
Turbo decoder        75%                            11.7x
Motion prediction    46%                            4.5x
Median filter        60%                            7.7x
CRC                  49%                            4.3x