A Framework for Run-Time Reconfigurable Computing
DANIEL LLAMOCCA
Electrical and Computer Engineering Department,
Oakland University
October, 24th, 2014
OutlineMotivation
General Approach
Implementation Details
Application Examples:o Pixel Processor
o 2D FIR Filter
Research Plans
Teaching Plans
Conclusions
Motivation
Dynamic Reconfigurable Computing Management will enable us to deliver:
A dynamically self-adaptive system (by dynamic allocation of computational resources and dynamic frequency control) that satisfies time-varying Multi-variable requirements (or constraints).
Optimal hardware realizations: We want to investigate optimal solutions that can meet time-varying Multi-variable requirements .For example, if the variables were Energy, Performance, and Accuracy, then the system should minimize energy consumption, and at the same time maximize performance and precision, whilesatisfying the given multi-variable requirements.
Digital systems can be characterized by a series of properties:
Energy, Performance, Precision, Resource Usage, Bandwidth, etc.
The controlling of these variables at run-time is defined as Dynamic Reconfigurable Computing Management.
Digital
signal/image/video
DIGITALSYSTEM
Digital
signal/image/videoor features
Control
Energy
PerformancePrecision
Resources
...
... Parameters
IN OUT
Motivation
Examples:
Task 1: A video processing system is asked to deliver real time performance at 30 frames per second (fps) on limited battery life that will also need to operate for at least 10 hours. This is a multi-objective optimization problem. If solutions are found, pick the system realization that delivers the highest accuracy.
Task 2: Now, suppose that we are asked to deliver performance at 100 frames per second (fps) at some minimum level of accuracy (60dB). In this case, we select the hardware realization with the lowest energy requirements while meeting the performance and accuracy constraints.
The system can then carry out independent tasks in time:
t1 t2 t3 time...
DIGITALSYSTEM
Set 1 of
Parameters
Performance 30 fps
Energy available for at least 10 hours
Performance 100 fps
Accuracy 60 dB
DIGITALSYSTEM
Set 2 of
Parameters
TASK 1 TASK 2
Multi-variable Space:
Energy-Performance-Accuracy
Time-Varyingconstraints
MotivationDynamic Reconfigurable Computing Management can rely on: Dynamic Partial Reconfiguration and Dynamic Frequency Control on FPGAs.
Dynamic Partial Reconfiguration (DPR)
DPR technology enables the adaptation of hardware resources by modifying or switching off portions of the FPGA while the rest remains intact, continuing its operation.
A Partial Reconfiguration Region (PRR) is a region whose hardware configuration can be modified at run-time.
Xilinx® devices: the PRR is dynamically reconfigured via the internal c0nfiguration access port (ICAP).
Dynamic Frequency Control
Digital Clock Managers (DCMs) inside FPGAs provide a wide range of clock management features.
The Dynamic Reconfiguration Port (DRP) of the DCM enables dynamic control of the frequency and phase. We can dynamically adjust the frequency without reloading a new bitstream to the FPGA.
clkfx
clkin
rst
clkfb clk0
daddr[6:0]
di[15:0]
dwe
dclk
den
do[15..0]
drdy
DRP
DCM_ADV
DC
R s
lave
inte
rface
processor
DCR MASTER
BUFG
BUFG
reference clock
DC
R B
US
DCR SLAVE
dcr_a_bus
dcr_sl_dbus
dcr_read
dcr_write
sl_dcrack
sl_dcr_dbus
clkfx
Dynamic FrequencyControl on Virtex-4 FPGAs
VariableClock
FPGA
Module 1
Module 2
Module n
t1
t2
tn
PRR
static region
dynamicregion
General approach (1/7)For a given system, Dynamic Reconfigurable Computing
Management is carried in the following manner:
1) Definition of Objective Functions
2) Development of efficient cores
3) Parameterization of Hardware Cores
4) Multi-objective Pareto Optimization in the Multi-Variable Space
5) Dynamic management based on real-time multi-variable constraints
General approach (2/7)1) Definition of objective functions:
A wide range of quantities (e.g., energy, performance, precision, hardware usage) can considered as the objective functions of system parameters. These properties may have a slightly different definition depending on the application:
Energy can be measured as the total energy spent during the system operation, or the energy spent during an operation (e.g., energy per video frame). In some instances, measuring Power is more useful.
Performance can be measured by: Megasamples per second, frames per second, Megabytes per second, etc.
Precision can be measured by: numerical representation, or accuracy with respect to an idealized result (e.g., PSNR).
There can be objective functions that pertain to an specific application. For example, in image compression, the bitrate metric evaluates the compression efficiency of a hardware architecture (e.g. JPEG processor). Or bandwidth for communication networks.
General approach (3/7)2) Development of efficient cores: The hardware architectures should
use techniques that:i) Minimize the amount of computational resources (e.g. LUT-based
approaches, Distributed Arithmetic),ii) Exploit parallelism and pipelining so as to obtain high performance
architectures.iii)Make intensive use of DPR and/or take advantage of DPR.
The cores should be described using Hardware Description Language (HDL) at the Register Transfer Level (RTL). The best effort must be made so that these cores remain portable across devices and vendors.
+++
++
+
X(0) X(1) X(2) X(3) X(4) X(5) X(6)
Yo
PIPELINED ADDER TREE
x[M
-1]
s[0]s[M-2]s[M-1]
X_in B
x[N-1] x[N-2] x[0]x[1]
X_in
x[0]
x[N-1] x[M+1]
x[M]x[M-1]x[1]
x[N-2]B
only whenN is evenB+1
E E E E
E
E E E
E E E
E
B B B B
x[M]...
...
...
...for anti-
symmetry
B+1 B+1
E
+
E E E E E E...
E_COF
COFNH ...
E_COF
COFNH
NH+B NH+B NH+B NH+B
Y
Addertree +
NH+B+1 NH+B+1 NH+B+1
Y
Addertree
NON-SYMMETRIC FILTER SYMMETRIC/ANTI-SYMMETRIC FILTER
General approach (4/7)3) Parameterization of hardware cores:
Fine control of the objective functions (e.g., energy, performance, accuracy) is greatly helped by realistic parameterization of the hardware cores (e.g., I/O bit-width, number of parallel cores).
Parameterized HDL code allows us to quickly create a set of hardware realizations by varying the parameters. Each realization comes with different values for the objective functions, which we ultimately control by varying the hardware parameters.
Example: Parameterization of the ‘Pixel processor’ architecture:
NC (number of cores), NI (number of input bits per pixel),
NO (number of output bits per pixel), F (function to be implemented),
LUT values (text file with LUT values)
PIXEL
PROCESSOR
NC LUT values(from text file)
NI NO F
NINC NONC
General approach (5/7)4) Multi-objective Optimization in the Multi-Variable Space:
The Energy-Performance-Accuracy space is shown along with the Pareto- optimalpoints.In some cases, we may want to explore a space of just 2 variables, e.g., the Energy-Accuracy space.
Multi-variable space: Represented by a set of hardware realizations along with their objective function values. We create it by varying the system parameters.Optimality: A hardware realization is defined to be optimal in the multi-objective (Pareto) sense if it is not possible to improve on all criteria without deteriorating in at least one of them. Objective 1
Obje
ctive
2
SET OF HARDWARE
REALIZATIONS
Parameters Objective FunctionsSet 1 Values 1Set 2 Values 2Set 3 Values 3... ....
Multi-variable
Space
Example 3-variable space: Energy-Performance-Accuracy Space: An optimal hardware realization is defined as one that minimizes energy, while maximizing performance and accuracy.
En
erg
y
-Accuracy
Pareto front
-Accuracy -Performance
En
erg
y
Pareto front
General approach (6/7)5) Dynamic management based on real-time multi-variable constraints:
Once the Pareto front has been extracted, we can cast optimization problems based on multi-variable constraints. We will use the Energy-Performance-Accuracy Space to explain this idea.Example: We are given constraints on the 3 variables. The feasible set is then represented by the golden points. We prioritize energy consumption, so we select the realization from the feasible set that also minimizes energy consumption. This can be cast as the following optimization problem:
𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒𝑅𝑖
𝐸𝑛𝑒𝑟𝑔𝑦(𝑅𝑖) 𝑠𝑢𝑏𝑗𝑒𝑐𝑡 𝑡𝑜:
𝐸𝑛𝑒𝑟𝑔𝑦(𝑅𝑖) ≤ 2𝑚𝐽𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦(𝑅𝑖) ≥ 50𝑑𝐵
𝑃𝑒𝑟𝑓𝑜𝑟𝑚𝑎𝑛𝑐𝑒 𝑅𝑖 ≥ 30𝑓𝑝𝑠
-Performance-Accuracy
constr
ain
ts
Energ
y
Energ
y
-Accuracy
constraints
Circled point: Realization from the feasible set that minimizes energy consumption.
If we ignore one variable (say, performance), we have a 2D optimization problem.
General approach (7/7)5) Dynamic management based on real-time multi-variable
constraints: Now, we are given time-varying simultaneous constraints: different set of constraints are applied at different moments in time.The system receives stimuli in the form of multi-variable constraints and reconfigures itself via DPR and/or Dynamic Frequency Control to meet the multi-variable constraints. The figure shows examples with 3 and 2 simultaneous constraints.
EnergyAccuracy (psnr)
Performance (fps)
min50dB
200
30 mJmax
300
----
max
t1 t2 t3
En
erg
y
-Accuracy
PO 1
Pareto front
PO 3
PO m
PO 4
PO 2
'm' Pareto points
-Performance
-Accuracy
Pareto front
En
erg
y PO 1
PO 3
PO m
PO 4
PO 2
'm' Pareto points
EnergyAccuracy (psnr)
min--
30 mJmax
--max
t1 t2 t3
30mJ
Implementation Details (1/2)Embedded FPGA system that supports Dynamic Partial Reconfiguration and Dynamic Frequency Control:Pareto-optimal point: Represented by <bitstream, frequency of operation>- Hardware realization that becomes active in the FPGA via Dynamic Partial Reconfiguration (DPR) and/or Dynamic Frequency Control. If the system receives a multi-variable constraint: It looks for a solution in the Pareto-optimal set: <bitstream*, freq*> It reconfigures the FPGA dynamic region(s) and /or frequency of operation, so as to meet the multi-variable constraints.Example (one Dynamic Region): The PRM (Partial Reconfigurable Module) is a hardware core that performs an specific task and that can be modified at run-time.
PPC/
mBlaze/ARM
PLB/AXI
System ACE
memoryEthernet
MAC
PLB/AXI Peripheral
PRM
PRR
ICA
Pport
'n' bitstreamsin memory
DPRDMAcore
M S
ICAPcore
frequency& PR ctrl
iFIFOin
terf
ace
CFcard
oFIFO
clkfx
PR_done
Module
1
Module
2
Module
n
...n m n ≤ m
Energ
y
-Accuracy
PO 1
Pareto front
PO 3
PO m
PO 4
PO 2
'm' Pareto pointsModule 1
'n' uniquebitstreams
freq.
f1
f2
f2
f3
f1
-Performance
Module 2
Module 1
Module 1
Module n
...
...
Mu
lti-
vari
ab
leco
ns
train
ts
*PRR: Partial Reconfigurable Region
Implementation Details (2/2) A Pareto point is represented by ‘k’ bistreams and the frequency of operation if: The system has ‘k’ dynamic regions. The requested operation requires the
dynamic regions to have a unique combination of bitstreams. The system has one dynamic region, but the requested operation requires
reconfiguring the dynamic region ‘k’ times.Generalization: A Pareto point is represented by:
<bitstream1, bitstream2, …, bitstreamk, frequency of operation>
n ≤ m, p ≤ m
f1
f2
f1
f3
fp
...
'n' unique sets of partial bitstreams
prm 11 prm 21 prm k1...
prm 1n prm 2n prm kn...
prm 12 prm 22 prm k2...
prm 13 prm 23 prm k3
prm 12 prm 22 prm k2
...
partial bitstreams
Hardware core
prm 1 prm 2 prm k
Static Region
...
...
...
Design Parameters
clockfrequency
...
freq.Objective 1
Pareto front
Obje
ctive
2
PO 1
PO 3
PO m
PO 4
PO 2
'm' Pareto points
'k' Dynamic Regions
In some cases, adifferent Pareto pointcan exist where only
the frequenqychanges
n ≤ m, p ≤ m
f1
f2
f1
f3
fp
...
'n' unique sets of partial bitstreams
prm 11 prm 21 prm k1...
prm 1n prm 2n prm kn...
prm 12 prm 22 prm k2...
prm 13 prm 23 prm k3
prm 12 prm 22 prm k2
...
partial bitstreams
Hardware core
prm
Static Region
...
...
Design Parameters
clockfrequency
...
freq.Objective 1
Pareto front
Obje
ctive
2PO 1
PO 3
PO m
PO 4
PO 2
'm' Pareto points
One Dynamic Region
In some cases, adifferent Pareto pointcan exist where only
the frequenqychanges
Digital signal, image, and video processing applications
The following systems are discussed:
Pixel Processor and Dynamic PPA/EPA Management
2D Separable FIR Filter Dynamic EPA Management.
Pixel Processor (1/4) LUT-based architecture.
Single-pixel operations (e.g., gamma correction, Huffman encoding, histogram equalization, contrast stretching) can be dynamically swapped. Parameter F modifies the function.
In addition to dynamically modifying the input-output function, we allow for the modification of:
Input pixel bitwidth (NI)
Output pixel bitwidth (NO),
Number of parallel processing elements (NC)
t1t2t3
Input Frame
Inv. Gamma correction
gamma = 2
Contrast Stretching
alpha = 2, m = 0.5
Gamma correction
gamma = 0.5
Module 3 Module 2 Module 1
FPGA
PRR
PIXEL
PROCESSOR
NC LUT values
(from text f ile)
NI NO F
NINC NONC
* VHDL IP core available at:
dllamocca.org
Pixel Processor (2/4) Embedded System: One Dynamic Region. Pareto point represented by: <bitstream, frequency>. Pixel processor interface: PLB (Processor Local Bus) slave burst interface. The
figure shows a PRR with NC=4, NI=NO=8. The system dynamically reconfigures: NC, NI, NO, FUNCTION, under the
following constraints: NINC32, and NONC 32 (because of the 32-bit PLB) Five ‘clkfx’ frequencies allowed: 100.00, 66.66, 50.0, 40.00, and 33.33 MHz. FIFOs required to properly isolate different clock regions (PLB clk= 100 MHz and
‘clkfx’).
ord
en
ird
en
LUT8-to-8
LUT8-to-8
LUT8-to-8
32 32
upix_ip
LUT8-to-8
Bus2IP
_D
ata
SlaveReg 0
oFIFO
FWFT
DOrden
IP2B
us_D
ata
32 IP
2B
us_W
rack
IP2B
us_R
dack
Bus2IP
_W
rCE
(0)
Bus2IP
_R
dC
E(1
)
FSM
Bus2IP
_C
lkDIwren
PLB Bus
PLB Interface
512x32
Bus2IP
_B
E
rdwr
IP2B
us_A
dd
rAck
PLB46 Slave Burst IP
Bus2IP
_R
dR
eq
Bus2IP
_W
rReq
full
em
pty
ow
ren
ofull oempty
iFIFO
FWFT
DOrden
DIwren
full
em
pty
iwre
nifull iempty
FSM
clkfx
iwren
SlaveReg 1
iwre
n
512x32
PowerPC
PLB
System ACE
memoryEthernet
MACICAPcore
ICAPport
CFcard
freq. ctrlDCR Bus clkfx
EP
Aco
nstr
ain
ts
'n' bitstreamsin memory
Mo
dule
n
Mo
dule
2
Mo
dule
1
DMAcore
M S
Hardware core
PixelProcessor
Inte
rface
PRRPRR
Implemented on a ML405 Development BoardDevice: XC4VFX20-11FF672
Pixel Processor (3/4)
Multi-objective optimization of the Power-Performance-Accuracy space:
Function: log(1 + I) + Full Histogram stretching. Image: Lena (VGA: 640x480).
8-bit input image (NI=8 fixed). Pareto points are clustered as a function of NO (# of output bits). A similar trend occurs with NC (# of cores) (not shown)
Left side shows how power and performance depend on frequency.
0
2000
400055 60 65 70 75 80 85
0
20
40
60
80
100
fps
EPP Results -Function 7
psnr(dB)
Pow
er(
mW
)
02000
400060 80100
0
50
100
fps
EPP Results -Function 7
psnr(dB)
Pow
er(
mW
)
NO = 8
NO = 9
NO = 10
NO = 11
NO = 12
0
2000
400055 60 65 70 75 80 85
0
20
40
60
80
100
fps
PPA Results -Function 7
psnr(dB)
Pow
er(
mW
)
100
66.66
50
40
33.33
NC
=4
NC
=6
NC=8
NC=10
NO=12
NC
=2
NO=8
HA
HP
LPw
NO Accuracy, Power
NCPower, fps
Pixel Processor (4/4) Dynamic Energy-Performance-Accuracy management: We show two
examples on 2D (ignoring performance) with time-varying constraints:
Video sequence:
foreman (CIF: 352x288)
Function:
Gamma correction:
I, = 1/0.45
Video sequence:
missamerica(QCIF: 176x144)
Function:
Nonlinear contrast
Stretching: 1/(1 + m/I),
m = 0.5, =2
* Published in IEEE Transactions on Circuits
and Systems for Video Technology
0 50 100 150 200 250 30045
50
55
60
65
70
75
80
85
Frame #
PS
NR
(dB
)
frame #30
avg=59.07std=0.05
diff. w/ lena=0.25%
avg=70.48std=0.06
diff. w/ lena=0.21%
frame #120
avg=82.58std=0.21
diff. w/lena=0.47%
frame #200
avg=48.53std=0.06
diff. w/lena=3.23%
frame #270
304050607080902
3
4
5
6
7
8
9
10
psnr(dB)
Energ
y p
er
fram
e(u
J)
EPA Results -Function 6
Energy per frameAccuracy (psnr)
7 uJ55dB
min70 dB
--max
min50dB
NI=8, NO=10,NC=6
NI=NO=8, NC=8
NI=8, NO=12, NC=10
NI=7, NO=7, NC=10
0 50 100 15045
50
55
60
65
70
75
80
85
Frame #
PS
NR
(dB
)
avg=59.43std=0.045
diff. w/lena=0.78%
frame #15
avg=70.89std=0.035
diff. w/ lena=0.42%
frame #55
avg=82.53std=0.030
diff. w/ lena=1.67%
frame #100
avg=49.73std=0.032
diff. w/ lena=1.39%
frame #130
304050607080900.5
1
1.5
2
2.5
psnr(dB)
Energ
y p
er
fram
e(u
J)
EPA Results -Function 5
Energy per frameAccuracy (psnr)
2 uJ55dB
min70 dB
--max
min50dB
NI=7, NO=8, NC=10
NI=NO=8, NC=8
NI=8, NO=12, NC=10
NI=8, NO=10,NC=6
2D Separable FIR Filter (1/7)
EX_in E E E
y
c[0] c[1] c[2 c[3
EX_in E E E
y
shifts and adds
LUT array
Multiply and add Filter Distributed Arithmethic Filter
TWO TYPES OF IMPLEMENTATIONS
FIR_DAX_in B
E
NO
N NH
[NO
NQ
]
OP
SY
MM
ET
RY
COEFFICIENTS
Y
L
sclr
BNumber of taps
Input bit-w idth
Coeff icient bit-w idth
Output f ixed point format
LUT input size
Output truncation scheme Efficient implementation of a 1D FIR Filter via DPR: Dynamic Partial Reconfiguration turns the fixed-coefficient DA filter into a variable-coefficient DA filter, at the expense of partial reconfiguration time overhead.
Parameterization of the VHDL-coded FIR filter core:
Distributed Arithmetic (DA) approach is more efficient since it is a LUT-based approach that turns the multiplications into shifts and adds. But it requires the coefficients to be constant.
1D FIR Filter implementation
* VHDL IP core available at:
dllamocca.org
2D Separable FIR Filter (2/7)
2D Separable Filter Implementation:
Separable FIR filters allow for efficient implementations by means of two 1D FIR Filters.
The reconfiguration rate is constant (twice per frame).
Cyclic Dynamic reconfiguration of two 1-D filters (usually full-filter reconfiguration):
- Implement row filter
- Replace by column filter
- Implement column filter
- Replace by row filter
…
DPR
tr1 tc1FPGA
ROW COL
COL
tr2 tc2
ROW COL
1 Frame is streamed
Processed row -by-row frame is streamed
2
3
4
Replacing row filter by column filter via DPR
Replacing column filter by row filter via DPR
ROW
DPRROWCOL
Processing
frame 1
Processing
frame 2
5 Go to Step 1 to process a new frame
ROW PRR
row filter
col f ilter
COL
* A comparison of this 2D FIR Filter and a GPU implementation for different number of coefficients was published 2011 IEEE Field Programmable Logic Conference (FPL’2011)
2D Separable FIR Filter (3/7) Embedded System: PLB Interface: The interface is inside the PRR (Partial Reconfigurable
Region), so that we can dynamically modify the I/O bitwidth. Each 2D filter realization is represented by 2 bitstreams (1 dynamic region)
Rowfilter
Colfilter
Row i
Col iDPR
DPR
DYNAMIC MANAGERGenerates Pareto Optimal point POi and
loads it into the 2D FIR filter
Row i Col i
PO i, 1 ≤ i ≤ n
B: User constraints
Consider new constraints based on image type, user input, or output
EPA constraintgenerated?
Look for PO point that satifies theEPA constraints
PO point exists?
Load <row i,col i> bitstreams
into the 2D FIR filter via DPR
no
no
yes
yes
ParametersN,NH,OB,NXr,NXc
Energy
Performance
Accuracy
Pointer tobitstream row i
Pointer tobitstream col i
Pareto Optimal PointPOi in memory:
2D FIR Filter
A: im
age type
C: outp
ut fr
am
es
mBlaze
PLB
System ACE
memoryEthernet
MAC
Real Filter peripheral
1D FIR IP core
PRR
ICA
Pport
EP
Aco
ns
train
ts
'2n' bitstreamsin memory
Filter 1 Filter 2 Filter n
Row
1
Col 1
Row
2
Col 2
Row
n
Col n
DPRDMAcore
M S
ICAPcore
frequency& PR ctrl
iFIFO
inte
rface
CFcard
oFIFO
clkfx
PR_done
...
Energ
y p
er
fram
e
-Accuracy
PO 1
Pareto front
PO 3
PO m
PO 4
PO 2
'm' Pareto points
-Performancen ≤ m
Row 1 Col 1
Row 2 Col 2
Row 1 Col 1
Row 3 Col 3
Row n Col n
bitstream pair'n' unique pairs fr
eq.
f1
f2
f3
f2
f4
...
...
Implemented on a ML605 Development Board
325
330
33550 60 70 80 90 100 110
0.2
0.25
0.3
0.35
fps
psnr(dB)
Results according to NH (coefficient bitw idth)
Energ
y p
er
fram
e(m
J)
325
330
335
50 60 70 80 90 100 110
0.2
0.25
0.3
0.35
fps
Frame size:352x288
psnr(dB)
Energ
y p
er
fram
e(m
J)
325
330
335
50
100
150
0.2
0.25
0.3
0.35
fps
Frame size:352x288
psnr(dB)
Epf(
mJ)
N = 24
N = 20
N = 16
N = 12
N = 8 325330
335
50
100
150
0
0.2
0.4
fps
Results according to NH (coefficient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
NH = 16
NH = 12
NH = 10
LE, HP1, LA
HA, HE, LP
OB=8
U1
HP2
HP3
HP4
HP5U2,U3U4
U5 U6
HA, HE, LP
LE, HP1, LA
2D Separable FIR Filter (4/7)Multi-objective optimization of the Energy-Performance-Accuracy space:
2D Filter Parameters: N (# of coefficients),NH (# of bits per coefficients), OB (# of bits per output pixel),
Results: Low-pass Gaussian Filter, x=y=1.5 (symmetric coefficients).Image: Lena (CIF: 352x288).HA: highest-accuracy. HP: highest-performance,LE: lowest energy -pi
0
pi
-pi
0
pi0
0.5
1
1.5
x
Gaussian f ilter - x=
y=1.5
y
Magnitu
de
2D Separable FIR Filter (5/7)Multi-objective optimization of the EPA space
320
325
330
335
20
40
60
80
100
0.2
0.25
0.3
0.35
0.4
fps
Results according to NH (coefficient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
320
325
330
335
20
40
60
80
100
0.2
0.25
0.3
0.35
0.4
fps
Frame size:352x288
psnr(dB)
Energ
y p
er
fram
e(m
J)
320
330
340
0
50
100
0.1
0.2
0.3
0.4
0.5
fps
Frame size:352x288
psnr(dB)
Energ
y p
er
fram
e(m
J)
N = 32
N = 24
N = 20
N = 16
N = 12
N = 8
OB=8
LE, HP, LA
HA, HE
(a) (b)
325330
335
50
100
150
0
0.2
0.4
fps
Results according to NH (coefficient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
NH = 16
NH = 12
NH = 10
HP
HP
LP
LP
332fps, 23.6dB, 0.155mJ
332fps, 23.6dB, 0.155mJ HA, HE
LE, HP, LA
324
326
328
330
332
334
2030
4050
6070
0.2
0.25
0.3
0.35
fps
Frame size:352x288
psnr(dB)
Energ
y p
er
fram
e(m
J)
324
326
328
330
332
334
2030
4050
6070
0.2
0.25
0.3
0.35
fps
Results according to NH (coefficient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
325
330
335
50
100
150
0.2
0.25
0.3
0.35
fps
Frame size:352x288
psnr(dB)
Epf(
mJ)
N = 24
N = 20
N = 16
N = 12
N = 8
OB=8332fps, 29.6dB, 0.156mJLE, HP, LA
326fps, 67dB, 0.323mJHA, HE, LP
325330
335
50
100
150
0
0.2
0.4
fps
Results according to NH (coefficient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
NH = 16
NH = 12
NH = 10
HP
HP
LP
LP
HA, HE, LP
LE, HP, LA
Low-pass Gaussian Filter: x=4,y=2
Diff. of Gaussians (DoG) filter: 1=2,2=4
-pi
0
pi
-pi
0
pi0
0.5
1
x
Gaussian f ilter - x=4,
y=2
y
Magnitu
de
-pi
0
pi
-pi
0
pi
0
0.5
x
DoG - 1=2,
2=4
y
Magnitu
de
2D Separable FIR Filter (6/7)Dynamic EPA Management (1st example): Applied on the Pareto front of the DoG filter .Video: foreman’(CIF: 352x288)
Dynamic management basedon user constraints:Time-Varying constraints are Provided at run-time by the user
Dynamic management basedon output frames:Metric: Percentage change (T) ofthe filter output (betweenconsecutive frames). Large scenesare associated with large values of T
Right side: State diagram.Low accuracy: no significantvariation in T.Medium accuracy: some variation in T.High accuracy: large variation in T.
Left side: Detection of Scene changes, especiallyFrames 185 to 191.
* to appear in ACM Transactions onReconfigurable Technologyand Systems
0 50 100 150 200 250 3000
0.01
0.02
0.03
0.04
0.05
0.06
Diff
ere
nce o
f sum
of absolu
te v
alu
es
Frame #
0 50 100 150 200 250 30040
50
60
70
PS
NR
(dB
)
0 50 100 150 200 250 30020
30
40
50
60
70
80
90
Frame #
PS
NR
(dB
)
2030405060708090
0.2
0.25
0.3
0.35
0.4
psnr(dB)
Energ
y p
er
fram
e(m
J)
N=20, NH=10, OB=8
N=32, NH=16, OB=16
N=24, NH=16, OB=16
N=8, NH=12, OB=8
N=24, NH=10OB=16
Energy per frameAccuracy (psnr)
0.3 mJ45dB
0.3mJmax
min--
min65dB
--max
avg=49.65
std=0.0977
avg=61.14
std=0.1425
avg=23.56std=0.18
avg=66.18std=0.5568
avg=88.45std=0.0475
frame # 30
frame # 90
frame # 150
frame # 215
frame # 270
frame # 31
frame # 191
frame # 260
frame # 245
Threshold
fram
e #
185
fra
me
# 1
91
30 framesbelow T=0.01
10 framesabove T=0.01
15 framesbelow T=0.01
MEDIUMACCURACY
62.33 dB
0.281 mJ
LOWACCURACY
49.58 dB
0.214 mJ
HIGH ACCURACY
65.82 dB
0.346 mJ
START15 frames
above T=0.01
If the conditions do not hold,
stay in the current state
2D Separable Complex FIR Filter (7/7)Here, the input image is complex and the filter coefficients are complex. These types of filters are very useful in AM-FM decompositions for image analysis applications.
Xr B
E
N NH
[NO
NQ
]
OP
SYMMETRYCOEFFICIENTS
(from text file)
L
sclr
B
imag
v
COEFFICIENTS (real)SYMMETRY (real)
E sclr v
1D FIR core
1D FIR core
1D FIR core
1D FIR core
YrNO
NO Yi
-
+
Xi BCOEFFICIENTS (imag)SYMMETRY (imag)
E sclr v
COEFFICIENTS (imag)SYMMETRY (imag)
E sclr v
COEFFICIENTS (real)SYMMETRY (real)
E sclr v
realreal imag
205
210
215
4060
80100
0.2
0.25
0.3
0.35
fps
Results according to "N" (coefficient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
205
210
215
0
50
100
0
0.2
0.4
fps
Results according to "N" (coeff icient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
N=31
N=29
N=23
N=19
N=15
N=11
N=9
N=7
205
210
215
0
50
100
0
0.2
0.4
fps
Results according to "N" (coeff icient bitw idth)
psnr(dB)
Energ
y p
er
fram
e(m
J)
N=31
N=29
N=23
N=19
N=15
N=11
N=9
N=7
OB=8209.6fps, 98dB, 0.289mJ
N=31,NH=16, OB=16 HA, HE
LE, HP
213.3fps, 47dB, 0.173mJN=9, NH=10, OB=8
Frequency: 100 MHz
205
210
21540
6080
100
0.2
0.25
0.3
0.35
fps
psnr(dB)
Energ
y p
er
fram
e(m
J)
Energy per frameAccuracy (psnr)
min50dB
0.25mJ60dB
0.25mJmax
--max
N=31, NH=16, OB=1698dB, 0.289mJ
N=29,NH=12,OB=1690.68dB, 0.2463mJ
N=15, NH=10,OB=1662.11dB, 0.2093mJ
N=7, NH=10, OB=1651.47dB, 0.1821mJ
0 50 100 150 200 250 30050
60
70
80
90
100
Frame #
PS
NR
(dB
)
avg=58.78
std=0.3348
avg=66.56
std=0.4168
avg=93.47
std=0.0631 avg=98.22std=0.0578
frame # 190
frame # 280frame # 120
frame # 30
Results: Gabor complex filter, video sequence: news (CIF: 352x288),.* Published in 2012 IEEE International Conference on Field Programmable Logic and Applications
2D Gabor at 45
Research Areas (1/5)RECONFIGURABLE COMPUTING:
Run-Time Reconfigurable Architectures under Time-Varying Constraints
Automatic generation of time-varying constraints. Self-aware Computing and Self-adaptive techniques. Design Space Exploration for large multi-objective spaces.
Advanced Topics on Computer Architecture Fully-pipelined architectures for: Signal, Image, and Video
Processing: Discrete Cosine Transforms, 1D/2D Filterbanks. Non-standard numerical representations: Trade-offs between double
floating point precision and fixed-point precision. Specialized architectures for CORDIC (trigonometric, linear, and
hyperbolic functions), square root, fast division, multiplication. Use of non-standard numerical representations.
LUT-based design
Research Areas (2/5)Example: Radon Transform:
• 7x7 Radon Transform core. Presented in SSIAI’2014 and ICIP’2014 conferences
RO(6,4)
RI(6,5)
RI(5,5)
RI(4,5)
RI(3,5)
RI(2,5)
RI(1,5)
RI(0,3)
RO(0,0)RO(0,0)
RO(0,0)
E
RI(0,0) RI(0,1) RI(0,2) RI(0,4) RI(0,5) RI(0,6)
X(0) X(1) X(2) X(3) X(4) X(5) X(6)
E E E E E E
RO(0,1)RO(1,1)
RO(0,2)RO(2,2)
RO(0,3)RO(3,3)
RO(0,4)RO(4,4)
RO(0,5)RO(5,5)
RO(0,6)RO(6,6)
RO(1,1)RO(0,1)
E
RI(1,0) RI(1,1) RI(1,2) RI(1,3) RI(1,4) RI(1,6)
E E E E E E
RO(1,2)RO(1,2)
RO(1,3)RO(2,3)
RO(1,4)RO(3,4)
RO(1,5)RO(4,5)
RO(1,6)RO(5,6)
RO(1,0)RO(6,0)
RO(2,2)RO(0,2)
E
RI(2,0) RI(2,1) RI(2,2) RI(2,3) RI(2,4) RI(2,6)
E E E E E E
RO(2,3)RO(1,3)
RO(2,4)RO(2,4)
RO(2,5)RO(3,5)
RO(2,6)RO(4,6)
RO(2,0)RO(5,0)
RO(2,1)RO(6,1)
RO(3,3)RO(0,3)
E
RI(3,0) RI(3,1) RI(3,2) RI(3,3) RI(3,4) RI(3,6)
E E E E E E
RO(3,4)RO(1,4)
RO(3,5)RO(2,5)
RO(3,6)RO(3,6)
RO(3,0)RO(4,0)
RO(3,1)RO(5,1)
RO(3,2)RO(6,2)
RO(4,4)RO(0,4)
E
RI(4,0) RI(4,1) RI(4,2) RI(4,3) RI(4,4) RI(4,6)
E E E E E E
RO(4,5)RO(1,5)
RO(4,6)RO(2,6)
RO(4,0)RO(3,0)
RO(4,1)RO(4,1)
RO(4,2)RO(5,2)
RO(4,3)RO(6,3)
RO(5,5)RO(0,5)
E
RI(5,0) RI(5,1) RI(5,2) RI(5,3) RI(5,4) RI(5,6)
E E E E E E
RO(5,6)RO(1,6)
RO(5,0)RO(2,0)
RO(5,1)RO(3,1)
RO(5,2)RO(4,2)
RO(5,3)RO(5,3)
RO(5,4)RO(6,4)
RO(6,6)RO(0,6)
E
RI(6,0) RI(6,1) RI(6,2) RI(6,3) RI(6,4) RI(6,6)
E E E E E E
RO(6,0)RO(1,0)
RO(6,1)RO(2,1)
RO(6,2)RO(3,2)
RO(6,3)RO(4,3)
RO(6,4)RO(5,4)
RO(6,5)RO(6,5)
+++
++
+
+
Addertree
Z(6
)
+
Addertree
Z(5
)
+
Addertree
Z(4
)
+
Addertree
Z(3
)
+
Addertree
Z(2
)
+
Addertree
Z(0
)
+
Addertree
Z(1
)
RO(0,1) RO(0,2) RO(0,3) RO(0,4) RO(0,5) RO(0,6)
RO(1,0) RO(1,1) RO(1,2) RO(1,3) RO(1,4) RO(1,5) RO(1,6)
RO(2,0) RO(2,1) RO(2,2) RO(2,3) RO(2,4) RO(2,5) RO(2,6)
RO(3,0) RO(3,1) RO(3,2) RO(3,3) RO(3,4) RO(3,5) RO(3,6)
RO(4,0) RO(4,1) RO(4,2) RO(4,3) RO(4,4) RO(4,5) RO(4,6)
RO(5,0) RO(5,1) RO(5,2) RO(5,3) RO(5,4) RO(5,5) RO(5,6)
RO(6,0) RO(6,1) RO(6,2) RO(6,3) RO(6,5) RO(6,6)
s
E
Z(0)
Z(1)
Z(N-1)
v
RADON
CORE
FSMEn
E s
2
...
...
...
...
(b)
(a)
(c)
B N
BOX(0)
X(1)
X(N-1)
BO = B + log2N
BO
BO
B
B
B
RO(0,i) RO(1,i) RO(2,i) RO(3,i) RO(4,i) RO(5,i) RO(6,i)
Z(i)
Research Areas (3/5) Example: HEVC (High Efficiency Video Coding Standard) HEVC Intra Prediction Core: Presented in the SSIAI’2014 conference. The HEVC Intra prediction core computes the following modes: Angular (33),
planar (1), and DC (1). Current work: Implementation of HEVC Transform, Scaling, and
Quantization. Future work: Self-reconfigurable hardware implementation of HEVC Encoder
Sullivan, G., et al, "Overview of the High Efficiency Video Coding (HEVC) Standard", IEEE TCSVT, V.22, No. 12, Dec. 2012
Research Areas (4/5)FPGA-based Embedded Systems Applications: automotive, networking, bioengineering, software-
defined radio, communications, video compression, video filtering. Software-Hardware Co-Design. Dealing with different processors: ARM, MicroBlaze (Xilinx), Nios
(Altera), OpenRISC (Open-source). Development of embedded interfaces for a variety of buses:
Advanced eXtensible Interface (AXI, Xilinx), Avalon switch fabric (ALTERA), Wishbone (open source), etc.
External communication interfaces (SpaceWire, CAN, Ethernet). Current work: Open-Source embedded system that supports DPR
via Wishbone.
Run-Time Reconfiguration on FPGAs Objective: Develop high-speed dynamic reconfiguration controllers: Support for standard buses: AXI, Wishbone. Hardware and software techniques the provide dramatic increases
in reconfiguration speed. Dynamic Frequency Control on FGPAs
Research Areas (5/5)Specialized techniques on FPGAs Low Power techniques Advanced coding mechanism for efficient parameterization using
VHDL and Verilog HDL. Test-bench generation Crossing clock domains
GPU PROGRAMMING:High performance implementation of Digital Signal, Image, and
Video Processing Algorithms.Integration with multi-threaded implementations on CPUsComparisons with FPGA implementations.
EMBEDDED SYSTEMSMicrocontroller-based SystemEfficient architectures for embedded interfacesEmbedded design using System-on-Chip that incorporates analog,
programmable logic, memory, and microcontroller (e.g. Cypress devices)
Teaching PlansECE378: Digital Logic and Microprocessor Design
(Winter 2015) Digital System Design Microprocessor Design in VHDL Digital Synthesis with VHDL Parameterized VHDL coding
ECE495/595: Reconfigurable Computing (Fall 2015) Hardware/Software co-design on FPGAs Self-Reconfigurable systems: Partial Reconfiguration Advanced topics in Computer Arithmetic Applications in: Digital Signal, Image, and Video Processing Communication interfaces (SpaceWire, CAN,
Ethernet)
Reconfigurable Systems
Static Dynamic
Embedded Systems
Applications: DSP,
automotive,
communicationsInte
rfaci
ng
Digital Logic Design
Conclusions A framework was presented for Dynamic Management of
Optimization of Run-Time Reconfigurable Architectures. Theexamples presented (Pixel Processor, 2D FIR Filter) weresuccessfully tested on several standard video sequences.
The results suggest that the general framework can beapplied to a variety of digital systems. This framework willlead to exciting new methods. As an example, consider theautomatic generation of time-varying constraints. Forexample: detection of a scene triggers a requirement forincreased accuracy, a scene remaining still triggers arequirement for a decrease in energy consumption.
The presented work opens up new excitinginterdisciplinary research opportunities (e.g.:automotive applications, design space exploration,automatic constraints generation, pervasive healthcareapplications, low power applications).