Bauer, Shafique, Henkel, Invited Talk @ SPP-RR Colloquium, 9/25/2009, http://ces.univ-karlsruhe.de/RISPP
RISPP: Rotating Instruction Set Processing Platform
Lars Bauer, Muhammad Shafique
and Jörg Henkel
Chair for Embedded Systems (CES)
University of Karlsruhe (TH)
Development of Embedded Systems
Typical: Static analysis of hot spots
Building a tightly optimized system
Nowadays: Increasing complexity
More functionality
Problem: A statically chosen design point has to match all requirements
Typically inefficient for individual components (e.g. tasks or hot spots)
Possible Solution: Extensible Processors
[Figure: design spectrum from “hardware solution” to “software solution”. Horizontal axis: flexibility, 1/time-to-market, …; vertical axis: efficiency ($/Mips, mW/MHz, Mips/area, …).]
ASIC: non-programmable, highly specialized
ASIP (extensible processor): instruction set extension, parameterization, inclusion/exclusion of functional blocks
General purpose processor
Reconfigurable Computing: processor with a reconfigurable ISA, i.e. reconfigurable Special Instructions
Realizing Reconfigurable SIs
Partition the reconfigurable fabric into so-called SI Containers
An SI may be loaded into any free container
Problems:
Fragmentation (internal and external)
Relatively long reconfiguration time
[Figure: core pipeline (scaled down) next to the reconfigurable area, which is partitioned into Special Instruction Containers (SICs).]
Corresponds to Chimaera, OneChip, Molen, Proteus, etc.
Analysis of Special Instruction Execution
[Figure: accumulated SI executions (in thousands, 0 to 35) over execution time (0 to 2000 K cycles). Curves: no cISA exec.; with cISA exec.; with cISA exec. & smaller SIs; with cISA exec. & upgrades.]
cISA: core Instruction Set Architecture (i.e. the ISA that is statically available in the pipeline)
Our RISPP approach: modular Special Instructions supporting upgrades
Fundamental Processor Extension: Atom / Molecule Model
Atom: elementary data path (smaller granularity)
Molecule: combination of Atoms (bigger granularity)
Special Instruction: application-specific assembly instruction
[Figure: three-level hierarchy of Special Instructions (SIs), Molecules, and Atoms. SI A has the Molecules A1, A2, A3, AcISA; SI B has B1, B2, BcISA; SI C has C1, C2, CcISA; the Molecules draw on Atoms 1 to 6.]
The numbers in the figure denote the number of Atom instances required for a Molecule.
An SI can be implemented by any of its Molecules.
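The hierarchy above can be sketched in a few lines of Python. This is a minimal illustration, not the RISPP implementation: the Atom types, the instance counts, and all latencies are made-up values, and the class and method names (`Molecule`, `SpecialInstruction`, `fastest_available`) are hypothetical. The only ideas taken from the slide are that a Molecule needs a number of instances per Atom type and that an SI can run on any of its Molecules, including the cISA one.

```python
from dataclasses import dataclass

# Hypothetical Atom types, used only for this illustration.
ATOM_TYPES = ("Repack", "Transform", "SAV")


@dataclass(frozen=True)
class Molecule:
    """One implementation of an SI: Atom-instance counts plus a latency."""
    name: str
    atoms: tuple    # instances required per entry of ATOM_TYPES
    latency: int    # cycles per SI execution (made-up numbers)

    def fits(self, loaded: tuple) -> bool:
        # Usable once every required Atom instance is loaded.
        return all(need <= have for need, have in zip(self.atoms, loaded))


@dataclass(frozen=True)
class SpecialInstruction:
    name: str
    molecules: tuple    # hardware Molecules
    cisa: Molecule      # core-ISA fallback, needs no Atoms

    def fastest_available(self, loaded: tuple) -> Molecule:
        # The cISA Molecule is always a candidate, so the SI can always run.
        candidates = [m for m in self.molecules if m.fits(loaded)]
        candidates.append(self.cisa)
        return min(candidates, key=lambda m: m.latency)


si = SpecialInstruction(
    name="SI A",
    molecules=(
        Molecule("A1", (1, 1, 1), 24),
        Molecule("A2", (1, 2, 1), 18),
        Molecule("A3", (2, 2, 2), 12),
    ),
    cisa=Molecule("AcISA", (0, 0, 0), 300),
)
```

With no Atoms loaded, `si.fastest_available((0, 0, 0))` falls back to `AcISA`; as more Atom instances become available, the same SI moves to faster Molecules without any change to the application code.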
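Example Atom
[Figure: Atom data path with inputs X00, X10, X20, X30 and outputs Y00, Y10, Y20, Y30, built from adders, subtractors, and shifters (>> 1, << 1), with the control signals DCT and HT.]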
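Example Special Instruction
[Figure: SI data flow from INPUT to OUTPUT through the Atoms QSub, Repack, Transform, and SAV (Sum of Absolute Values), combined by adders. Control settings shown for Transform: DCT=0, HT=0 and DCT=0, HT=1.]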
Example Molecule
[Figure: Molecule built from Repack (2 instances), Transform (2 instances), and SAV (2 instances), combined by adders; the axis is numbered 10 to 17.]
Supporting Modular SIs
[Figure label: Core Pipeline]
There is no predetermined maximum of supported SIs
Multiple SIs may share common data paths (i.e. reuse them)
SIs can be upgraded (due to multiple available Molecules)
Significantly reduced fragmentation problem
The decision of how many Atom Containers to spend on which SI can adapt at run time. This demands a run-time system.
Dynamic System Behavior
Extensible Processor: selecting points in design space at design time
Reconfigurable Processors: typically fix at compile time when and how to deploy reconfigurable hardware
How to handle situations that are unknown at design time and compile time?
For instance, situations that depend on input data (e.g. different computational paths in a video encoder)
Example: Execution Flow of an H.264 Video Encoder
Iterates on Macro Blocks (MBs), i.e. 16x16 pixels
2 different MB types → different computational paths
Intra-frame prediction: I-MB
Inter-frame prediction: P-MB
[Figure: execution flow of the H.264 encoder in three loops over the MBs.
Loop over MB: ME: SA(T)D; RD: MB-Type Decision (I or P) and Mode Decision (for I or P).
Encoding Engine, MB Encoding Loop (loop over MB): branch "If MB_Type = P_MB then ... else ..." over the blocks MC, IPRED, DCT / Q, DCT / HT / Q, IDCT / IQ, IDCT / IHT / IQ, followed by CAVLC.
Loop over MB: In-Loop De-Blocking Filter.]
Problem: Input-Dependent Dynamic Application Behavior
[Figure: distribution of I-MBs per frame [%] (0% to 100%) over the frame number (0 to 700) in a CIF (352x288: 396 MBs) video scene.]
The RISPP run-time system (Rotation Manager) can adapt the SI performance (choosing different Molecules) depending on the requirements
Run-time System: Simplified Overview and Connections
[Figure: block diagram of the Run-time System and its connections.
Blocks: Decode, Monitoring, Prediction, Selection, Scheduling, Replacing, Execution Control.
Connected components: Core Pipeline; Instruction Memory including Special Instructions and Forecasts; Reconfigurable HW.
Connection labels: Instruction; Special Instructions & Forecasts; Status / Control; Reconfigure.]
Online Monitoring & Prediction
[Figure: exemplary control flow graph (nodes are Base Blocks) with an exemplary outer loop, an exemplary inner loop executing SATD, and potentially other inner loops, etc.; it marks the time for reconfiguration and the error back-propagation.]
FC1: Forecasting the future usage of SATD
FC2: Forecasting that SATD is no longer required in this loop, potentially forecasting other SIs
Monitor the amount of SI executions between FC1 and FC2
Calculate the Error between Prediction and Monitoring
Back-propagate the weight error (based on the Temporal Difference Scheme)
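The monitor, compare, and correct cycle above can be sketched as a simple error-driven update. This is a deliberately reduced illustration and an assumption on our part: it corrects a single forecast value with a learning rate `alpha` and does not model the back-propagation across several forecast points that the Temporal Difference scheme provides. The function names `update_forecast` and `track` are hypothetical.

```python
def update_forecast(forecast: float, monitored: int, alpha: float) -> float:
    """Move the forecast toward the monitored SI execution count."""
    error = monitored - forecast      # prediction error for this pass
    return forecast + alpha * error   # alpha: learning rate in (0, 1]


def track(monitored_counts, alpha, initial=0.0):
    """Return the forecast that was issued before each pass through the loop."""
    forecast, issued = initial, []
    for count in monitored_counts:
        issued.append(forecast)
        forecast = update_forecast(forecast, count, alpha)
    return issued
```

A larger `alpha` follows changes in the monitored count faster but also reacts more strongly to single outliers; a smaller `alpha` gives a smoother, slower forecast.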
Measured Forecast Adaptation from Hardware Prototype
[Figure: forecast value (expected amount of I-MBs, 0 to 200) over the frame number (150 to 500). Curves: actually executed I-MBs; predicted I-MBs for α = 0.6; predicted I-MBs for α = 0.3; predicted I-MBs for α = 0.1.]
Formalized Molecule Selection
Input to the Selection: requested SIs and their implementing Molecules
Selection: choose a subset S of Molecules
Constraint: choose exactly one Molecule to implement each SI
Constraint: stay within the capacity of the reconfigurable hardware
Complexity:
● Our Selection has similarities to the Knapsack problem
● However, due to Atom sharing it is not identical
● A polynomial reduction from Knapsack to Selection is given in the DAC’08 paper
Selection is NP-hard
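The two constraints and the effect of Atom sharing can be made concrete with a small exhaustive search. The sketch below is an illustration under stated assumptions, not the algorithm from the DAC’08 paper: the objective (minimize expected cycles), the data layout, and the names `atoms_needed` and `select` are hypothetical. Atom sharing is modeled by taking the element-wise maximum over the chosen Molecules, which is what separates the problem from a plain Knapsack where item sizes simply add up.

```python
from itertools import product


def atoms_needed(molecules):
    """Element-wise maximum: Atoms shared between Molecules count only once."""
    return tuple(max(column) for column in zip(*molecules))


def select(sis, capacity):
    """Exhaustive search: one Molecule per SI, total Atoms <= capacity.

    sis maps an SI name to (expected_executions, {molecule: (atoms, latency)}).
    Returns (total_cycles, {si_name: molecule_name}).
    """
    names = list(sis)
    best = None
    for choice in product(*(sis[n][1].items() for n in names)):
        vectors = [atoms for _, (atoms, _) in choice]
        if sum(atoms_needed(vectors)) > capacity:
            continue  # violates the hardware capacity constraint
        cycles = sum(sis[n][0] * lat for n, (_, (_, lat)) in zip(names, choice))
        if best is None or cycles < best[0]:
            best = (cycles, {n: m for n, (m, _) in zip(names, choice)})
    return best


# Made-up example with two Atom types; all numbers are illustrative only.
sis = {
    "SATD": (1000, {"cISA": ((0, 0), 300), "m1": ((1, 1), 40), "m2": ((2, 2), 20)}),
    "DCT": (200, {"cISA": ((0, 0), 250), "m1": ((0, 2), 30)}),
}
```

In this example `select(sis, 4)` picks `m2` for SATD together with `m1` for DCT: added up separately they would need 6 Atom instances, but with sharing they need only 4. The search space grows exponentially with the number of SIs, which is why an exhaustive search like this is only useful for tiny instances.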
Greedy vs. Optimal Selection
For many parameter pairs, greedy finds the same solution as optimal
In some (not relevant) cases, greedy is even faster
Solving the Selection optimally does not necessarily lead to the fastest execution (more problems need to be solved, and the performance still depends on the actual SI execution frequency)
[Figures: results for the Greedy selection and for the Optimal selection.]
Determining Atom Reconf. Sequence
[Figure: the Molecules m1, m2, m3 plotted by # instances of Atom A1 (1 to 3) and # instances of Atom A2 (1 to 3), marking the selected Molecule and its upgrade candidates; a second plot shows the fastest available Molecule over the number of loaded Atoms (1 to 6).]
Problem: Reconfiguration is slow
Constraint: At most one reconfiguration at a time
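Because only one Atom can be reconfigured at a time, the order in which Atoms are loaded decides which smaller Molecules become usable on the way to the selected one. The sketch below illustrates this under assumptions of our own: Molecules are tuples of instance counts for (A1, A2), an upgrade candidate is taken to be a Molecule that needs no more Atoms of any type than the selected Molecule, and the function names are hypothetical.

```python
def is_upgrade_candidate(molecule, selected):
    """A smaller Molecule whose Atoms are all part of the selected one."""
    return molecule != selected and all(a <= b for a, b in zip(molecule, selected))


def missing_atoms(loaded, target):
    """Atom instances still to be reconfigured, per Atom type."""
    return tuple(max(t - l, 0) for l, t in zip(loaded, target))


def load_sequence(loaded, steps):
    """Load one Atom at a time; steps lists Atom-type indices in load order."""
    state = list(loaded)
    for atom_type in steps:
        state[atom_type] += 1
        yield tuple(state)
```

For example, `list(load_sequence((0, 0), [0, 1, 1]))` walks through the intermediate states `(1, 0)`, `(1, 1)`, `(1, 2)`; a Molecule that needs `(1, 2)` is already usable after three reconfigurations, long before a selected Molecule needing `(3, 3)` is complete.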
Comparing Different Scheduling Methods for 2 Selected SIs
Scheduling Methods:
“First Select First Reconfigure” (FSFR)
“Avoid Software First” (ASF)
“Smallest Job First” (SJF)
[Figure: Molecules plotted by # instances of Atom A1 (1 to 5) and # instances of Atom A2 (1 to 5), marking the selected Molecule for SI1 (m1), the selected Molecule for SI2 (m2), and the upgrade candidates for SI2.]
Comparing our Proposed Scheduling Schemes
[Figure: execution time [million cycles] (200 to 500) over the amount of reconfigurable hardware [# Atom Containers] (5 to 24). Curves: Avoid Software First (ASF); First Select First Reconfigure (FSFR); Smallest Job First (SJF); Highest Efficiency First (HEF).]
Encoding 140 frames (352x288 resolution) with H.264
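The slides name the scheduling schemes but do not spell out their rules, so the following is only one possible reading of the name "Smallest Job First", not the authors' definition: among the Molecules that are not yet fully loaded, always reconfigure the one that needs the fewest additional Atoms. The function names and the example vectors are hypothetical.

```python
def additional_atoms(loaded, molecule):
    """Atom instances a Molecule still needs, per Atom type."""
    return tuple(max(m - l, 0) for l, m in zip(loaded, molecule))


def smallest_job_first(loaded, candidates):
    """Return Atom-type indices in the order they would be reconfigured."""
    loaded = list(loaded)
    pending = list(candidates)
    order = []
    while pending:
        # Drop Molecules that are already fully loaded.
        pending = [m for m in pending if sum(additional_atoms(loaded, m)) > 0]
        if not pending:
            break
        # Assumed rule: the job with the fewest missing Atoms goes first.
        job = min(pending, key=lambda m: sum(additional_atoms(loaded, m)))
        for atom_type, count in enumerate(additional_atoms(loaded, job)):
            order.extend([atom_type] * count)
            loaded[atom_type] += count
        pending.remove(job)
    return order
```

Starting from an empty fabric with the candidates `(2, 2)`, `(1, 1)`, and `(1, 2)`, this reading loads the Atoms for `(1, 1)` first, then the one Atom missing for `(1, 2)`, and finally the one missing for `(2, 2)`. A scheme in the spirit of HEF would additionally weigh each job by its expected benefit, which this sketch leaves out.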
Detailed Analysis of HEF scheduler
[Figure: bars show the number of SI executions per 100K cycles (0 to 4,000) for DCT, MC, SATD, and SAD; lines show the SI latency [cycles] (log scale, 1 to 10,000) for DCT, MC, SATD, and SAD; horizontal axis: execution time [100K cycles] (0 to 24).]
The continuations of the latency lines for SAD and SATD are omitted for clarity
Infrastructure for Modular SIs
[Figure: chain of Atom Containers (reconfigurable, scaled down for clarity) attached to Bus Connectors (non-reconfigurable), with Input and Output.]
FPGA-based Prototype
Xilinx Virtex-4 LX 160 on Silica/Avnet Board
Audio/Video Module, CF-Card, Touch-Screen LCD
SDRAM, DDR-DRAM, SRAM, Reconfiguration EEPROM
PCB for Reconfiguration EEPROM and Peripherals
FPGA Floorplan/PlanAhead
Leon 2 core Instruction Set Architecture
Static Atoms (i.e. non-reconfigurable) for typical operations like data repacking etc.
10 dynamically reconfigurable Atoms
Periphery IP-Core for Video-In and Video-Out, additionally providing video buffers and a memory-mapped interface to access the buffers
Periphery IP-Core for I2C (touch-screen LCD)
Reconfiguration IP-Core: external EEPROM → FIFO → ICAP (Internal Configuration Access Port)
Atom Framework
Rotation Manager: currently implemented as a hardware block for the Forecasts / Prediction and a MicroBlaze for Selection, Scheduling and Replacement
FPGA Floorplan/PlanAhead (cont’d)
RISPP Simulator: Components, Connections, and GUI
SystemC-based simulator
Input for pipeline is obtained from Instruction Set Simulator (ArchC)
SI information is semi-automatically derived at compile time
[Figure: UML class diagram of the simulator (associations, aggregations, compositions, generalizations), in three groups.
Pipeline & run-time system: Application Binary, Instruction Set Arch., Branch trace, Core Pipeline, SI Execution Unit (manageSIexec()), Prefetching Unit, Online Monitoring, DP loading queue (pushNextDataPath()).
SI management: Special Instruction (getFastestAvailableImpl()), SI Implementation (getRequiredDPs()), Data Path (isAvailableOnFPGA()). An XML file defines the SIs (including instruction format), implementations and data paths.
FPGA management: FPGA, SIC FPGA, DPC FPGA, Special Instruction Container, Data Path Container.
Relation labels: input, has many, requires multiple, currently contains, is available on FPGA, contains, knows, triggers, stalls, observes, asks, fills, reconfigures.]
Overall System Evaluation
[Figure: execution flow of the H.264 video encoder (ME and MB-type decision loop, MB Encoding Loop, In-Loop De-Blocking Filter).]
Overall System Evaluation
                            Special Instruction   Implemented Atoms
Motion Estimation (ME)      SAD                   SAD_16
                            SATD                  QSub, HT_4, Repack, SAV
(Inverse) Transform         (I)DCT                DCT_4, Repack, (QSub)
                            (I)HT_2x2             HT_2
                            (I)HT_4x4             HT_4, Repack
Motion Compensation (MC)    MC_Hz_4               PointFilter, BytePack, Clip3
Intra Prediction (IPred)    IPred_HDC             PackLBytes, CollapseAdd
                            IPred_VDC             CollapseAdd
Loop Filter (LF)            LF_BS4                Cond, LF_4
Compared to Leon 2 GPP: 26.6x faster
Conservative comparison to a reconfigurable processor with monolithic SIs: still 1.24x faster
Depending on the size/granularity of the SIs, it can be > 7x (e.g. for Proteus; 2.38x in comparison to Molen)
Our approach additionally provides:
Adaptivity for changing control flow (due to input data)
Adaptivity for multi-tasking scenarios (tasks sharing the reconfigurable hardware)
Summary & Conclusion
Hierarchical Special Instruction composition with different area-performance trade-offs → modular SIs
Solved the problem “Parallelism vs. Reconfiguration Overhead”: we can provide both by upgrading the SIs
Achieving noticeably better performance:
Comparison to GPP (Leon-2): 26.6x (using 8 Atom Containers)
Comparison to state-of-the-art ASIPs: up to 3.1x
Comparison to state-of-the-art reconfigurable processor: up to 7.19x (2.38x in comparison to Molen)
Providing very high adaptivity that is demanded for changing control flow or multi-tasking environments
There is a large potential for improving the way current Extensible Processors work
RISPP Publication Excerpt
[ICCAD’09] Run-time Energy Minimization Scheme using a dynamically power-gated instruction set
[CODES’09] Replacement Policy for run-time reconfigurable accelerators
[DATE’09] Cross-Architectural Design-Space Exploration Tool
[JSPS’09] Describing and optimizing the H.264 video encoder appl.
[FPL’08] Hardware infrastructure that allows reconfiguring Atoms and implementing different Molecules
[TVLSI’08] General overview and comparison with state-of-the-art ASIP
[DAC’08] Determining the Molecule Selection and comparing our approach with Proteus
[DATE’08] Determining the Atom reconfiguration sequence for the selected Molecules and comparing our approach with Molen
[SASO’07] Online monitoring and fine-tuning the predicted Special Instruction execution frequencies
[DAC’07] Presentation of RISPP concept and compile-time preparations (when to start prefetching)