ENG6530 Reconfigurable Computing Systems
Application Specific Instruction Processors
"ASIPs" or "Reconfigurable Processors"
ENG6530 RCS 2
Topics
• ASIPs: Definition
• Motivation
• How to customize ASIPs
• Tools for ASIPs
• Approaches
• Conclusions

References
1. "Engineering the Complex SOC: Fast, Flexible Design with Configurable Processors", Chris Rowen, 2004.
2. "Xtensa Architecture and Performance", Tensilica Inc, Sep 2002.
3. "Configurable Processors: What, Why, How?", Tensilica Inc, June 2007.

Microprocessors and ASICs
• For the ultimate in flexibility, programmers map the application onto a general-purpose microprocessor.
• For the ultimate in performance, logic designers map the application into a custom circuit.
[Figure: the application is mapped by programmers onto a microprocessor and by logic designers onto an ASIC or FPGA]

Classic Options for Systems-on-Chip
[Figure label: Design Gap!]

General Purpose Processors

A Case for Customization
• General-purpose processors: flexible, but they tend to force the application to be customized to the architecture!
• ASICs: high performance, but expensive; they customize the architecture to the application.
• We need to find a technology that can customize the architecture to the application and at the same time be flexible and cheap!

Processor Specialization: Get the Best of Both Options
[Figure label: Gains!]

Motivations: reduce size
• A Pentium 4 die can fit about 50 ARM9 processors at 0.13 um, and 80 at 0.10 um.
• ARM9 at 0.13 um = 3 mm2; Pentium 4 at 0.13 um = 144 mm2 (12 mm x 12 mm).
• At 0.13 um and a 250 MHz clock, an ARM9 dissipates 0.1 W, so 50 ARM9s = 5 W.
• Cost, power, and size are important for embedded applications!
• Processing vs. dedicated hardware (ASIC)?
• System-on-a-Chip concept

Programmable Processors
• Past: Microprocessor, Microcontroller, DSP, Graphics Processor
• Now / Future: Network Processor, Sensor Processor, Crypto Processor, Game Processor, Wearable Processor, Mobile Processor

A Case for Customization
General-purpose processors handle many applications fairly well, but:
• Each application has different requirements.
• The instruction set is fixed!
• The data path width may not suit your application!
• The cache size/configuration may not be optimal.
• The register file is either too small or …
• Functional units might be missing or …
• Internal buses are slow or too narrow …

Processor Customizations
• Specialized instructions: optimization, searching, classification, …
• Specialized functional units: MAC units, special comparators, sorting units
• Parameterized buses and datapaths: 8-bit, 16-bit, synchronous/asynchronous buses
• Parameterized register files
• Parameterized caches: cache size, replacement strategy, …
[Figure: processor (P) with register file, D/I caches, and functional units FU1, FU2, FU3]

Application-specific instruction processors
• An ASIP is a stored-memory CPU whose architecture is tailored for a particular set of applications.
• The instruction sets are tailored to specific applications or application domains.
• Customized functional units within the data path provide high performance.
• Programmability allows changes to the implementation, so an ASIP can be used in several different products.
• The application-specific architecture provides smaller silicon area, higher speed, and lower power consumption.

Recall: Different levels of coupling
[Figure: a workstation with CPU, memory caches, and I/O interface; reconfigurable logic appears as a functional unit (FU), a coprocessor, an attached processing unit, or a standalone processing unit, ranging from tightly coupled to loosely coupled]

Application Specific Instruction Processors
[Figure: FPGA, ASIC, and processor (P) compared on design cost, time-to-market, flexibility, determinism, power, and performance]

Embedded Applications Requirements
[Figure: FPGA, ASIC, processor (P), and ASCP compared on design cost, time-to-market, flexibility, determinism, power, and performance]
Application-Specific Customizable Embedded Processor:
– Helps preserve the benefits of generality
– Alleviates the drawbacks of general-purpose processors

Performance vs. Flexibility
[Figure: flexibility plotted against performance for GPP, DSP, RCS, ASIPs, and ASIC]

ASIPs: Advantages
Tailor the processor for specific applications by:
• Customizing the instruction set.
• Adding customized execution units that efficiently perform task-specific algorithms.
• Adding special registers sized to the natural data types of the tasks to be performed.
Instructions will often execute in one or two clock cycles, which keeps clock rates low and thus energy consumption low as well.
You can further customize the processor as your application evolves with time.

ASIP Design Methodology
[Figure: an application mapped onto a design-time configurable microprocessor]
• Profile the application.
• Create custom hardware and instructions to accelerate critical application sections.
• Most of the application runs as execution of general-purpose instructions.

ASIP based approach: Reconfigurable Instruction Set Processors
[Figure: Compiler Structure. C Code, C Parsing, Optimizations, Inst. Identification, Inst. Selection, Config. Scheduling, Code Generation, Assembly Code; a Hardware Estimator; Hardware Generation producing Configuration bits]

Instruction Set Extension
• Idea: provide a way to augment the processor's instruction set with operations needed by a particular application.

Determinants of CPU Performance
CPU time = Instruction_count x CPI x clock_cycle

                          Instruction_count   CPI   clock_cycle
Algorithm                         X            X
Programming language              X            X
Compiler                          X            X
ISA                               X            X         X
Processor organization                         X         X
Technology                                               X

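The CPU time relation can be evaluated directly when comparing a baseline processor with a customized one. The following is a minimal sketch in C, not part of the original slides; the function names `cpu_time_seconds` and `speedup` are hypothetical helpers introduced for illustration.

```c
#include <stdint.h>

/* CPU time = instruction_count x CPI x clock_cycle, in seconds.
 * clock_hz is the clock frequency, so clock_cycle = 1 / clock_hz. */
double cpu_time_seconds(uint64_t instruction_count, double cpi, double clock_hz)
{
    double clock_cycle = 1.0 / clock_hz;   /* seconds per cycle */
    return (double)instruction_count * cpi * clock_cycle;
}

/* Speedup of a customized configuration relative to a baseline. */
double speedup(double baseline_seconds, double custom_seconds)
{
    return baseline_seconds / custom_seconds;
}
```

With illustrative numbers, one million instructions at a CPI of 2.0 on a 1 MHz clock take about 2 seconds. A custom instruction that halves the instruction count while leaving CPI and clock unchanged halves that time, which is the lever that instruction set extension pulls.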
Instruction Specialization
• The instruction set determines the functions directly implemented in hardware and the operations which can be performed in parallel.
How to improve the instruction set?
• Operations which can frequently be scheduled concurrently should be coded in the same instruction.
• Operations which can often be chained should be coded in the same way (in one instruction):
  – Multiply-accumulation
  – Vector operations

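Multiply-accumulate is the classic pair of chained operations. The sketch below, in plain C, uses a helper named `mac` as a stand-in for a single fused instruction; the helper and `dot_product` are hypothetical names for illustration, and on a real ASIP the compiler or an intrinsic would map the helper to the custom instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a hypothetical fused multiply-accumulate instruction:
 * returns acc + a * b. Here it is ordinary C, not a real intrinsic. */
static inline int64_t mac(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* Dot product written so that each loop iteration issues exactly one MAC
 * instead of a separate multiply followed by an add. */
int64_t dot_product(const int32_t *x, const int32_t *y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac(acc, x[i], y[i]);
    return acc;
}
```

Written this way, the loop body contains one chained operation per element, so a processor with a MAC unit executes one instruction where a plain RISC core would execute two.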
Instruction Set Customization
• Computationally demanding parts of applications run on special hardware.
• New instructions use the special hardware.
[Figure: a graph of standard operations (XOR, MPY, LD, SHR, AND, MOV) with a CUSTOM instruction]

Automatically Collapsing Clusters of Instructions into New Ones
• One ad-hoc complex operation replaces a long sequence of standard ones.
• If the ad-hoc functional unit completes the job faster than the original sequence, the result is a gain.

Function Unit and Data Path Specialization
To reduce power consumption and increase performance:
• Word length adaptation
• Implementation of application-specific HW functions:
  – String manipulation
  – String matching
  – Pixel operation
  – Multiplication-accumulation
• Special consideration: clock frequency. It may be better to use a slower clock in embedded systems.

Customized Function Units
• Goal: support important computation subgraphs.
• Add specialized units within the data path of the processor.
• Exploits subgraph parallelism.
• Allows natural data propagation.
[Figure: a pipeline (Fetch, Issue, ALUs, WB) with a CCA unit built from rows of FUs fed by inputs IN 1, IN 2, …]

Interconnect Specialization
Specialization can be done with respect to:
• Interconnect of functional modules:
  – A reduced bus instead of a standard system bus, to save cost or power consumption
  – Dedicated connections between registers (accumulator) and memories, to increase parallelism
• Protocol used for the communication between components:
  – Synchronous
  – Asynchronous
  – Semi-synchronous

Optimizing Power in ASIPs
Configurable processors have a deep influence on low-power design in two ways:
• Compared to hardwired logic, software-based design allows for more sophisticated algorithms and control of operating modes. In many applications, the software can be much smarter than custom RTL about when to run and how fast.
• ASIPs pack the same work into far fewer cycles than GPPs, allowing the SOC to run at a lower clock frequency (How?).

Optimizing Power in ASIPs
$E = \alpha C V^{2} n$
• E is the energy use due to active switching in CMOS logic.
• C is the total capacitance of all the switched nodes in the circuit.
• V is the voltage.
• alpha is the average fraction of circuit nodes switching between one and zero each cycle.
• n is the number of cycles required to execute the function.

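The energy relation can be turned into a small calculator for comparing configurations. This is a minimal sketch in C under the assumption that all four factors are known or estimated; `switching_energy` and `energy_saving_fraction` are hypothetical helper names, and any numbers fed to them are illustrative rather than measured.

```c
/* E = alpha * C * V^2 * n: active switching energy in CMOS logic.
 * alpha: average fraction of nodes switching per cycle
 * capacitance_f: total switched capacitance (farads)
 * voltage_v: supply voltage (volts)
 * cycles: number of cycles needed to execute the function */
double switching_energy(double alpha, double capacitance_f,
                        double voltage_v, double cycles)
{
    return alpha * capacitance_f * voltage_v * voltage_v * cycles;
}

/* Fraction of energy saved by a customized configuration
 * relative to a baseline (0.75 means 75% less energy). */
double energy_saving_fraction(double baseline_energy, double custom_energy)
{
    return 1.0 - custom_energy / baseline_energy;
}
```

Because E is linear in both C and n, a configuration that raises C by 10% but needs one tenth of the cycles uses about 0.11 times the baseline energy at the same voltage and switching activity.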
Optimizing Power (insight)
• The impact of a good processor configuration is to sharply reduce ‘n’, while increasing ‘C’ only slightly relative to a baseline processor.
• ASIPs can be quite smart about activating execution units only when necessary. The processor generator can determine the combinations of logic blocks that must be active at each stage of the pipeline and create logic for fine-granularity clock gating, thereby reducing ‘alpha’.

Tools?

Tensilica
Tensilica has two main product lines of 32-bit processor cores for SOC design (IP):
1. Diamond Standard processors (not modifiable)
2. Xtensa processors (can be modified)
Tensilica also has several CAD tool flows to extend the instruction sets:
• TIE Language
• XPRES Compiler

1. Tensilica Diamond Processor
• A set of off-the-shelf synthesizable cores (fixed and not configurable), directly available from Tensilica and foundry partners, that range from area-efficient, low-power controllers to an audio processor, a high-performance DSP, and a video processor.
• Diamond Standard processors come with a comprehensive software tool set: compilers, assemblers, debuggers, …

2. Tensilica Xtensa Processor
• Tensilica's Xtensa processors are synthesizable processors that are configurable and extensible!

Xtensa Processors Architecture
The Xtensa Instruction Set Architecture (ISA) is a 32-bit RISC architecture featuring a compact instruction set optimized for embedded designs.
RISC?
• A small number of memory addressing modes
• Large uniform register files for computation operations
• Fixed-size instruction words
• Optimized pipelined architecture
• Simple and fixed instruction-field encoding
• Memory access via loads and stores of registers

Xtensa Processors Architecture
The architecture has:
• a 32-bit ALU;
• 16, 32 or 64 general-purpose physical registers;
• six special-purpose registers;
• Cache: up to 32 KB and 1-, 2-, 3-, or 4-way set associative cache? Replacement policy? Write back vs. write through?

Xtensa Processors Architecture
The architecture has:
• a 32-bit ALU;
• 16, 32 or 64 general-purpose physical registers;
• six special-purpose registers;
• 5- or 7-stage pipelines:
  – 5-stage: power usage 47 uW/MHz @ 350 MHz
  – 7-stage: power usage 57 uW/MHz @ 400 MHz

Tensilica Xtensa Architecture

Xtensa Processor Generator
• The designer can select from a broad selection of predefined standard RISC microprocessor options and can add instructions and register extensions to the tailored processor.
• Or the designer can use Tensilica's XPRES Compiler to automatically tailor the processor to optimize existing C/C++ code.
• The Xtensa Processor Generator then creates the complete processor solution set: a pre-verified processor hardware description in source RTL (Verilog or VHDL), plus supporting hardware implementation methodology scripts.
• This complete package includes software development tools, including commercial RTOS support, and comprehensive system modeling and co-verification support.

XPRES Compiler

Tensilica Instruction Extension (TIE)
• TIE is a Verilog-like language used to describe desired custom instructions.
• TIE helps you get orders-of-magnitude performance increases out of your processor design through:
1. Fusion
2. SIMD (Single Instruction Multiple Data)
3. FLIX (Flexible Length Instruction Xtensions)

TIE Extensions

(I) Fusion

Effect of TIE Instructions

TIE Flow

Fusion Example

Exploiting Parallelism

Creating SIMD TIE Execution Units

FLIX Acceleration

Creating FLIX (VLIW) Acceleration
• An Xtensa processor can become a multi-issue VLIW processor.
• The Xtensa C/C++ compiler can aggressively extract instruction-level parallelism from the code and schedule multiple operations in a single VLIW instruction.
• By allowing two or three instructions to execute simultaneously, FLIX lets the processor act as a 2- or 3-issue VLIW CPU, accelerating general-purpose code by 40-60%.

FLIX

Estimation (energy)

Example: MPEG Acceleration
• One of the most difficult parts of encoding MPEG-4 video streams is motion estimation, which searches adjacent video frames for similar pixel blocks as part of the MPEG-4 compression algorithm.
• The search algorithm's inner loop contains a SAD (sum of absolute differences) computation consisting of:
  – Subtraction
  – Absolute value operation
  – Addition of the resulting value to previously computed values
• For a QCIF (quarter common intermediate format) image frame, a 15-Hz frame rate, and an exhaustive-search motion estimation scheme, the SAD computation requires slightly more than 641 million operations/sec.

MPEG Acceleration
• Combining all three SAD component operations (subtraction, absolute value, addition) into one operation that executes in one clock cycle, and executing 16 single-pixel SAD operations in one SIMD SAD instruction during the same clock cycle, reduces the count from 641 million operations/sec to 14 million instructions/sec, a 98% reduction.

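The reduction can be checked by counting operations. The sketch below, in plain C, models the scalar SAD loop (three basic operations per pixel) and a fused 16-lane SIMD SAD instruction. The names `sad_scalar`, `simd_sad16`, and `sad_simd` are hypothetical; `simd_sad16` is a software model of what one custom instruction would do, not Tensilica's actual TIE definition, and the sketch assumes the pixel count is a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SAD_LANES 16  /* pixels handled by one modeled SIMD SAD instruction */

/* Scalar reference: subtract, absolute value, add = 3 operations per pixel.
 * *ops is incremented by the number of basic operations executed. */
uint32_t sad_scalar(const uint8_t *a, const uint8_t *b, size_t n, uint64_t *ops)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)a[i] - (int)b[i];  /* subtraction */
        diff = abs(diff);                  /* absolute value */
        acc += (uint32_t)diff;             /* accumulation */
        *ops += 3;
    }
    return acc;
}

/* Software model of one fused SIMD SAD instruction over 16 pixels. */
static uint32_t simd_sad16(uint32_t acc, const uint8_t *a, const uint8_t *b)
{
    for (int lane = 0; lane < SAD_LANES; lane++)
        acc += (uint32_t)abs((int)a[lane] - (int)b[lane]);
    return acc;
}

/* SAD using the modeled instruction; n is assumed to be a multiple of 16.
 * *insns is incremented once per modeled SIMD SAD instruction. */
uint32_t sad_simd(const uint8_t *a, const uint8_t *b, size_t n, uint64_t *insns)
{
    uint32_t acc = 0;
    for (size_t i = 0; i + SAD_LANES <= n; i += SAD_LANES) {
        acc = simd_sad16(acc, a + i, b + i);
        *insns += 1;
    }
    return acc;
}
```

Each modeled instruction replaces 3 x 16 = 48 basic operations, so 641 million operations/sec become roughly 641 / 48, or about 13.4 million instructions/sec, which is in line with the figure of 14 million quoted on the slide.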
MPEG Acceleration
• The full MPEG-4 decoder adds approximately 100,000 gates to the base processor and implements a 2-way (coder and decoder) QCIF video codec that operates at 15 frames/sec.
• When instructions are added to accelerate all of these MPEG-4 decoding tasks, creating an MPEG-4 SIMD engine within the tailored processor, the results can be quite surprising.
• The resulting SIMD engine drops the number of cycles required to decode the MPEG-4 video clips from billions to millions, and the required processor operating frequency drops by roughly 30x to around 10 MHz (power dissipation!).

How Xtensa Compares

Reconfigurable Instruction Set Processors

Two roads to customization
• Augment GPPs with programmable logic:
  – Couple a standard processor (ARM, MIPS) with an FPGA fabric
  – Fixed processor instruction set
  – The FPGA implements custom instructions
• Implement them (the processors) in FPGAs:
  – Customize instructions at compile time or at run time

Reconfigurable Instruction Set Processors
• Duplicated instruction decode logic (2 symmetrical data channels)
• Duplicated commonly used function units (ALU and shifter)
• All other function units are shared (DSP operations, memory handler)
• A tightly coupled pipelined configurable gate array

Dynamic Instruction Set Extension (1)
Original code:
for (i = 0; i < 16; i++) {
    temp = abs(v1[i] - v2[i]);
    out = out + temp;
}
Optimized XiRisc code:
for (i = 0; i < 16; i++) {
    pgaop(out, v1[i], v2[i]);
}
[Figure: the PiCoGA sits alongside the register file, ALUs & multiplier, and memory unit; it implements pgaop as A-B and B-A units feeding a MUX and an accumulator]

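The datapath in the figure (A-B, B-A, MUX, accumulator) can be modeled in software to see why a single pgaop replaces the subtract, abs, and add of the original loop. This is a hedged sketch in C: `pgaop_model` and `sad16` are hypothetical names, and the model only mirrors the blocks named on the slide, not the actual XiRisc instruction encoding or the real PiCoGA configuration.

```c
#include <stdint.h>

/* Software model of the datapath on the slide: compute A-B and B-A,
 * let a MUX pick the non-negative difference, then accumulate it. */
static inline int32_t pgaop_model(int32_t acc, int32_t a, int32_t b)
{
    int32_t a_minus_b = a - b;
    int32_t b_minus_a = b - a;
    int32_t selected = (a >= b) ? a_minus_b : b_minus_a;  /* MUX */
    return acc + selected;                                /* accumulator */
}

/* The 16-element loop from the slide, with the model standing in for pgaop. */
int32_t sad16(const int32_t v1[16], const int32_t v2[16])
{
    int32_t out = 0;
    for (int i = 0; i < 16; i++)
        out = pgaop_model(out, v1[i], v2[i]);
    return out;
}
```

Computing both differences in parallel and selecting one avoids a data-dependent branch, which is the kind of structure that maps naturally onto a pipelined gate array.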
Summary
• Configurable and extensible (tailorable) processor cores are a combination of hardware and software IP that gives system developers the ability to tailor processors for better performance in specific applications.
• The main difference between GPPs and ASIPs is specialization. It is important to note that specialization must not compromise flexibility!
• Advantages:
  – Faster, more power efficient, less silicon area
  – No other company will have your version of that task-specific processor.
  – No one else will have the matching compiler and software tool chain.

Conclusion
• ASIPs are related to the hardware/software co-design methodology, since a general-purpose processor is involved along with hardware accelerators in the form of specialized functional units.
• Tensilica provides all the necessary tools to automatically create Application Specific Instruction Set Processors in minimum time.
• The designer can rely on the TIE language to manually extend the instruction set of the newly created processor.
• Another option is to rely on the Tensilica XPRES compiler to automatically create the processor and all the necessary software development tools, such as compilers, debuggers, …
• The designer can extend the capabilities of the processor by changing the cache, ports, queues, register files, functional units, …
• It is worth using the Tensilica tools to perform some type of design exploration for your application before you attempt to custom-build hardware accelerators.