A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems
Andrea Cappelli
F. Campi, R. Guerrieri, A. Lodi, M. Toma, A. La Rosa, L. Lavagno, C. Passerone, R. Canegallo
Nice, France
April 22, 2003
Outline
Motivations
XiRisc: a VLIW Processor
PiCoGA: a Pipelined Configurable Gate Array
Software Development Environment
Results & Measurements
Conclusions
Motivations
Increased on-chip transistor density
Increased integration costs
Strong limitations in power supply
Severe power consumption constraints
[Figure: millions of transistors per chip and technology (nm, axis from 0 to 400) versus year, 1997 to 2009]
Increased algorithmic complexity
Quest for performance and flexibility
[Figure: algorithm complexity, Moore's law, and battery capacity versus year, 1997 to 2009]
Embedded Systems Algorithms Analysis
90% of computational complexity is concentrated in small kernels covering small parts of the overall code
Many algorithms show significant instruction-level parallelism: performance is improved by multiple parallel data paths
Operand granularity is typically different from 32 bits: a traditional ALU is power-inefficient
Significant improvements can be obtained by extending embedded processors with application-specific function units
Reconfigurable computing to achieve maximum flexibility
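The observation that most of the work sits in small kernels can be made concrete with Amdahl's law, which the slides do not cite but which captures the same reasoning: if a fraction f of the run time is spent in kernels and only those kernels are accelerated by a factor s, the overall speed-up is 1 / ((1 - f) + f / s). The sketch below is illustrative only. The function name is hypothetical, and treating "90% of computational complexity" as 90% of run time is an assumption.

```c
#include <assert.h>

/* Overall speed-up when a fraction `kernel_fraction` of the run time
 * is accelerated by a factor `kernel_speedup` (Amdahl's law).
 * Hypothetical helper, not part of the XiRisc toolchain. */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    assert(kernel_fraction >= 0.0 && kernel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

For instance, overall_speedup(0.9, 10.0) is about 5.26: once the kernels are fast, the remaining 10% of the code starts to dominate.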
Existing Architectures
Standard processor coupled with embedded programmable logic, where application-specific functions are dynamically remapped depending on the algorithm being performed
1: Coprocessor model
2: Function unit model
XiRisc: a VLIW Processor
32-bit load/store RISC architecture (5-stage pipeline)
VLIW elaboration: concurrent fetch and execution of two 32-bit instructions per cycle
Extended instruction set RISC architecture: set of specialized function units implementing DSP-specific operations
Function unit approach: the reconfigurable device fits in a classical RISC pipeline
Low communication overhead
Exploits very high resource parallelism
Architecture
Duplicated instruction decode logic (2 symmetrical data channels)
Duplicated commonly used function units (ALU and shifter)
All other function units are shared (DSP operations, memory handler)
A tightly coupled pipelined configurable gate array
Dynamic Instruction Set Extension
pGA-load: specific operation to transfer data from a configuration cache to the PiCoGA
Instruction fields: pGA-load, region specification, configuration specification
pGA-op: 32-bit and 64-bit operations to launch execution inside the PiCoGA (data exchange through the register file)
32-bit instruction fields: pGA-op, Source 1, Source 2, Dest 1, Dest 2, operation specification
64-bit instruction fields: pGA-op, Source 1, Source 2, Source 3, Source 4, Dest 1, Dest 2, operation specification
PiCoGA: a Pipelined Configurable Gate Array
Embedded function unit for dynamic extension of the instruction set
Two-dimensional array of LUT-based reconfigurable logic cells
Each row implements a possible stage of a customized pipeline, independent of and concurrent with the processor
Up to 4 x 32-bit input data and up to 2 x 32-bit output data from/to the register file
[Figure: PiCoGA]
DFG-based elaboration
Row elaboration is activated by an embedded control unit
Execution enable signal for each pipeline stage
PiCoGA operation latency depends on the operation performed
[Figure: configuration cache and PiCoGA]
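The row-per-stage organization described above can be pictured with a small software model. This is only a sketch under stated assumptions: the type and function names, the array depth MAX_ROWS, and the rule "one cycle per active row" are all hypothetical, and the model pushes a single operand set through the rows rather than overlapping several as a real pipeline would.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ROWS 8   /* hypothetical array depth, not taken from the slides */

/* Up to four 32-bit values travel from row to row. */
typedef struct { uint32_t v[4]; } row_data;
typedef row_data (*row_fn)(row_data in);

typedef struct {
    row_fn rows[MAX_ROWS]; /* one function per configured row */
    int    n_rows;         /* rows used; taken as the latency in this model */
} pga_operation;

static row_data row_shift(row_data d) { d.v[0] <<= 2;      return d; }
static row_data row_add(row_data d)   { d.v[0] += d.v[1];  return d; }

/* Pass one operand set through every configured row in order.
 * Returns the modeled latency, which depends on the operation. */
int pga_execute(const pga_operation *op, row_data in, row_data *out)
{
    assert(op->n_rows >= 1 && op->n_rows <= MAX_ROWS);
    for (int r = 0; r < op->n_rows; r++)
        in = op->rows[r](in);  /* stands in for the row's enable signal firing */
    *out = in;
    return op->n_rows;
}

/* Two-row mapping of (a << 2) + b: a in v[0], b in v[1], result in v[0]. */
pga_operation make_shift_add(void)
{
    pga_operation op = { { row_shift, row_add }, 2 };
    return op;
}
```

Running the operation returned by make_shift_add() on inputs 3 and 5 leaves 17 in v[0] and reports a latency of 2, which illustrates why latency varies with the mapped operation.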
PiCoGA Configuration
Goal: to reduce cache misses due to PiCoGA configuration
Multi-context programming (4 cache layers/planes inside the array)
Dedicated configuration cache with a high-bandwidth bus to the PiCoGA (192 bits)
Partial run-time reconfiguration (a region is configured while another one is active)
Configuration is completely concurrent with processor elaboration
[Figure: PiCoGA mapping over four configuration layers, Layer 1 to Layer 4]
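A rough way to see how multi-context programming avoids reloads is to model the four layers as slots that each hold one configuration. The sketch below is an assumption-laden illustration: the names, the round-robin replacement policy, and the use of integer configuration ids are hypothetical, and region-level partial reconfiguration is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define N_LAYERS 4          /* four configuration layers, as on the slide */
#define EMPTY   (-1)

typedef struct {
    int resident[N_LAYERS]; /* configuration id held by each layer */
    int active;             /* layer currently executing, or EMPTY */
    int next_victim;        /* round-robin pointer (assumed policy) */
} pga_contexts;

void pga_init(pga_contexts *c)
{
    for (int i = 0; i < N_LAYERS; i++) c->resident[i] = EMPTY;
    c->active = EMPTY;
    c->next_victim = 0;
}

/* Returns true when `config_id` was already resident (no reload needed);
 * otherwise loads it into a layer that is not executing and returns false. */
bool pga_load(pga_contexts *c, int config_id)
{
    assert(config_id >= 0);
    for (int i = 0; i < N_LAYERS; i++)
        if (c->resident[i] == config_id) return true;

    /* Never overwrite the layer that is currently active. */
    if (c->next_victim == c->active)
        c->next_victim = (c->next_victim + 1) % N_LAYERS;
    c->resident[c->next_victim] = config_id;
    c->next_victim = (c->next_victim + 1) % N_LAYERS;
    return false;
}

/* Switch execution to the layer holding `config_id`; false if not loaded. */
bool pga_activate(pga_contexts *c, int config_id)
{
    for (int i = 0; i < N_LAYERS; i++)
        if (c->resident[i] == config_id) { c->active = i; return true; }
    return false;
}
```

In this model a load never touches the active layer, which mirrors the idea that configuration proceeds while the array keeps executing.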
The Software Development Environment
[Figure: development flow with blocks labeled initial C code, profiling, computation kernel extraction, pGA-op, latency information, assembler-level scheduler, and executable code]
Software Simulation
Goals: check the correctness of the algorithm and evaluate performance
In the source code, a pGA-op is described using a pragma directive:
    #pragma pGA shift_add 0x12 5 c a b
    c = ( a << 2 ) + b;
    #pragma end
    /* Shift_add mapped on PiCoGA */
    #if defined(PiCoGA)
    ...
    asm("pGA-op 0x12 ...");
    ...
    /* Emulation function _shift_add */
    #else
    void _shift_add(){
        ...
        c = ( a << 2 ) + b;
        ...
    }
    #endif
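The two-branch structure above can be tried on a host machine with a compilable variant. This is a minimal sketch, not the code the toolchain generates: the function is renamed shift_add and takes explicit parameters (the slide's _shift_add() relies on elided context), and because the pGA-op operands are elided in the source, the PiCoGA branch is left as a deliberate build error rather than a guessed encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Computes (a << 2) + b, either on the array or by emulation. */
uint32_t shift_add(uint32_t a, uint32_t b)
{
#if defined(PiCoGA)
    /* Target build: the slide issues asm("pGA-op 0x12 ...") here, but the
     * operands are not given, so this branch is intentionally unbuildable. */
#   error "pGA-op operand encoding is not given in the slides"
#else
    /* Host build: emulation with the same arithmetic as the pragma body. */
    return (a << 2) + b;
#endif
}

/* Small host-side check of the emulation path (hypothetical helper). */
void shift_add_self_test(void)
{
    assert(shift_add(3u, 5u) == 17u);
    assert(shift_add(0u, 7u) == 7u);
}
```

Compiling without defining PiCoGA selects the emulation path, which is what allows the algorithm to be checked before any hardware mapping exists.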
Software Simulation
Two special instructions are defined to support emulation:
    ...
    topga ...
    jal _shift_add
    fmpga ...
    ...
topga saves the current state and passes arguments to the emulation function. The clock cycle count is halted while the function runs
fmpga copies the emulation function result(s) and restores registers; the cycle count is incremented by the latency value of the pGA-op
Overall performance is evaluated by counting elaboration cycles
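The cycle accounting described for topga and fmpga can be sketched as a tiny simulator state machine. Everything here is an assumption for illustration: the names are hypothetical, state saving and argument passing are not modeled, and whether topga and fmpga themselves cost a cycle is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t cycles;    /* simulated elaboration cycles so far */
    bool     counting;  /* false between topga and fmpga */
} sim_state;

/* Called once per simulated processor instruction. */
void sim_tick(sim_state *s)
{
    if (s->counting) s->cycles++;
}

/* topga: enter the emulation function and halt the cycle count. */
void sim_topga(sim_state *s)
{
    assert(s->counting);
    s->counting = false;
}

/* fmpga: leave the emulation function, charge the pGA-op latency,
 * and resume counting. */
void sim_fmpga(sim_state *s, unsigned latency)
{
    assert(!s->counting);
    s->cycles += latency;
    s->counting = true;
}
```

With this scheme, however many host instructions the emulation function executes, the program is charged only the stated latency of the pGA-op, so the final count reflects the target hardware rather than the emulation.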
Results and Measurements
Speed-ups for several signal processing cores:
DES: 13.5x
CRC: 4.3x
Median Filter: 7.7x
Motion Estimation: 12.4x
Motion Prediction: 4.5x
Turbo Codes: 12x
75% of the energy consumption of a VLIW architecture is due to accesses to instruction and data memory
Strong reduction of accesses to instruction memory
[Figure: normalized energy histogram, VLIW only versus VLIW + PiCoGA, vertical axis from 0 to 1. Values shown for VLIW + PiCoGA: DES 0.076, CRC 0.27, Median Filter 0.15, Motion Prediction 0.22]
Conclusions
XiRisc: VLIW RISC architecture enhanced by a run-time reconfigurable function unit
PiCoGA: pipelined, run-time configurable, row-oriented array of LUT-based cells
Specific software development toolchain
Speed-ups range from 4.3x to 13.5x
Up to 93% energy consumption reduction