séminaire cosi ’01
Séminaire COSI-Roscoff’01 1
Séminaire COSI ’01
Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye

Content
- Context and motivations
  - Silicon compilation tools
  - Target architectures
  - Power consumption
  - Related work
- Partitioning
- Modeling power
- Experimental results
- Conclusion

Silicon compilation tools
- Parallel processor array architectures
  - Regular and scalable (well suited to FPGAs)
  - Specialized high-performance data-path
- Restricted class of loops
  - SUREs (systems of uniform recurrence equations: uniform dependencies)
  - Static polyhedral loop domain
- Compute-intensive nested loops
  - Image processing (motion estimation, stereo vision)
  - Signal processing (QR factorization, DLMS)
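
As an illustration of the restricted loop class, the sketch below writes a square matrix product as a system of uniform recurrences. It is only an example under assumptions: the function name `matmul_sure` and the boundary handling are hypothetical, and the dependence vectors are chosen for the illustration rather than taken from the talk.

```python
def matmul_sure(a, b):
    """Square matrix product written with uniform (constant-offset) dependencies.

    Illustrative sketch: A is propagated along j, B along i, and C is
    accumulated along k, so every dependence is a constant vector.
    """
    n = len(a)
    # Index 0 along each axis holds boundary values; computed points are 1..n.
    A = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    B = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    C = [[[0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            for k in range(1, n + 1):
                # A(i,j,k) = A(i,j-1,k): dependence vector (0, 1, 0)
                A[i][j][k] = a[i - 1][k - 1] if j == 1 else A[i][j - 1][k]
                # B(i,j,k) = B(i-1,j,k): dependence vector (1, 0, 0)
                B[i][j][k] = b[k - 1][j - 1] if i == 1 else B[i - 1][j][k]
                # C(i,j,k) = C(i,j,k-1) + A*B: dependence vector (0, 0, 1)
                C[i][j][k] = C[i][j][k - 1] + A[i][j][k] * B[i][j][k]
    # The result is read on the last plane k = n.
    return [[C[i][j][n] for j in range(1, n + 1)] for i in range(1, n + 1)]
```

The loop domain is the static cube 1..n in each dimension, which is the kind of polyhedral domain the slide refers to.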

Power consumption
- General model and motivations
  - P = Pstat + Vdd.Cd.Df (gate-level model)
  - Estimate at RTL level (entropy-based models)
- Mainly dictated by:
  - On-chip area cost and activity
  - Off-chip I/O volume
- System-level power model?
  - Estimate from specs and target architecture

Target architecture
[Figure: block diagram with FPGA, CPU, system memory and external world]
- Embedded CPU
  - PowerPC
  - Nios
- SoC bus
  - AMBA, CoreConnect
  - Plug 'n' play IP cores
- Shared memory
  - Low latency
  - High bandwidth

Related work
- Compiler transformations to reduce memory accesses [Kandemir]
  - Loop fusion
  - Loop tiling
  - Loop reordering
- Design space exploration for custom memory systems [Imec]
  - Systematic exploration
  - Multi-level memory hierarchy
  - The approach is brute force

Content
- Context and motivations
- Target architectures
- Partitioning
  - Clustering (LSGP)
  - Tiling (LPGS)
  - Co-partitioning
- Modeling power
- Experimental results
- Conclusion

Tiling (LPGS)
- Partition the PE array into tiles
  - Tiles are executed sequentially
  - Intermediate results are stored in off-chip memory
  - Requires unidirectional communications: [condition not preserved in the transcript]
- Tile shape is rectangular
  - Bounds parallel to the PE space basis vectors
  - Perfect "tiling" of the processor space
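
The tiling scheme above can be sketched as an index mapping. This is a minimal illustration assuming rectangular tiles whose shape equals the physical PE array; the names `lpgs_map` and `lpgs_schedule` and the lexicographic tile order are hypothetical choices, not taken from the talk.

```python
from itertools import product


def lpgs_map(pe, tile_shape):
    """Map a virtual PE index to (tile index, physical PE index).

    All PEs of one tile run in parallel on the physical array
    (locally parallel); tiles run one after another (globally sequential).
    """
    tile = tuple(p // s for p, s in zip(pe, tile_shape))
    local = tuple(p % s for p, s in zip(pe, tile_shape))
    return tile, local


def lpgs_schedule(array_shape, tile_shape):
    """List tile indices in one possible sequential execution order."""
    # Ceiling division gives the number of tiles along each axis.
    counts = [-(-n // s) for n, s in zip(array_shape, tile_shape)]
    return list(product(*(range(c) for c in counts)))
```

With a 2 x 3 tile, the physical array has 2 * 3 = 6 PEs, which is the determinant of the corresponding diagonal tile matrix.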

Tiling (LPGS)
[Figure: PE array split into tiles and mapped onto the physical PEs through Mux/DeMux stages and FIFOs; tile parameters equal to 2 and 3 (parameter symbols lost)]
- Matrix is diagonal, |det| = Npe (matrix symbol lost)
- H: domain height

Clustering (LSGP)
- Regroups PEs into clusters
  - Operations are executed sequentially
  - I/O accesses are reduced
- Cluster shape is rectangular
  - Bounds parallel to the PE space basis vectors
  - Perfect "tiling" of the processor space
- Scheduling is axes-major
  - Several possible schedulings
  - Sequence of clustering along each axis
  - Simplifies control logic
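
The clustering scheme above can be sketched in the same way. The example assumes rectangular clusters and uses a row-major order inside each cluster as one possible axes-major schedule; the name `lsgp_map` is hypothetical.

```python
def lsgp_map(pe, cluster_shape):
    """Map a virtual PE index to (physical PE index, local time slot).

    Each physical PE executes all virtual PEs of its cluster one after
    another (locally sequential); physical PEs run in parallel
    (globally parallel).
    """
    phys = tuple(p // c for p, c in zip(pe, cluster_shape))
    slot = 0
    for p, c in zip(pe, cluster_shape):
        # Row-major linearization of the position inside the cluster.
        slot = slot * c + (p % c)
    return phys, slot
```

Every cluster of shape 2 x 3 yields the six slots 0..5, so one physical PE needs six sequential steps per original step.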

Clustering (LSGP)
[Figure: PE array grouped into clusters, labelled y=3 and y=2 (parameter symbols lost), together with the original space-time mapping, the PE index vector and the iteration index vector]
- Matrix is diagonal, |det| = Npe (matrix symbol lost)

Clustering (LSGP)
[Figure: multiply-add PE datapath with operands A, B and C, shown for the original PE, for x=2, and for x=2, y=3; numeric labels not recoverable]
- Resource usage estimate: [formula not preserved in the transcript]

Hybrid partitioning
- Step 1: the array is tiled
  - Tune the I/O volume
- Step 2: each tile is clustered
  - Tune the resource usage
- Trade-off
  - Off-chip I/O volume
  - Local memory sizes

Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
  - IO power model
  - Core power model
  - Putting it all together
- Experimental results
- Conclusion

Dynamic IO energy model
- IO energy depends on
  - IO volume (RAM clock speed)
  - Operation (Rd, Wr)
  - Port toggle rate
- Eio = Krd.Vrd + Kwr.Vwr
  - Krd, Kwr: technological constants
  - Vrd, Vwr: number of read and write I/O operations
- Determine IO volume
  - For all loop variables
  - Given tiling parameters
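
The IO energy formula is a weighted sum of read and write volumes. A minimal sketch follows; the function name `io_energy` and the numbers in the usage note are illustrative assumptions, not measured constants.

```python
def io_energy(v_rd, v_wr, k_rd, k_wr):
    """Eio = Krd.Vrd + Kwr.Vwr.

    v_rd, v_wr: number of read and write I/O operations.
    k_rd, k_wr: technology-dependent energy cost per operation.
    """
    return k_rd * v_rd + k_wr * v_wr
```

For instance, with made-up costs of 2.0 per read and 3.0 per write, `io_energy(1000, 500, 2.0, 3.0)` evaluates 2000 + 1500.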

IO volume estimate (1/2)
- Tile IO volume is called the "footprint"
- Estimate for this footprint [Arg95]
- Spread vector of dependencies
- $V = \sum_{i=1}^{n} \det A_{i \to a}$
  - $A_{i \to a}$: $A$ with its ith row substituted with the spread vector $a$
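
The row-substitution estimate can be sketched numerically. The example assumes an integer tile matrix given as a list of rows and takes the absolute value of each determinant, which is an assumption made here so that the volume stays non-negative; `det` and `footprint_spread` are hypothetical helper names.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (small n only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        # Minor: drop the first row and the current column.
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total


def footprint_spread(tile, spread):
    """Sum over i of |det(tile with row i replaced by the spread vector)|."""
    total = 0
    for i in range(len(tile)):
        substituted = [list(row) for row in tile]
        substituted[i] = list(spread)
        total += abs(det(substituted))
    return total
```

For a diagonal tile matrix the ith term reduces to the ith spread component times the product of the other tile sizes, so a spread of [1 0 0] on a 4 x 5 x 6 tile contributes one 5 x 6 face.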
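
IO volume estimate (2/2)
- Total tile IO volume Vio: [formula garbled in the transcript; it sums, over the variables k = 0..v, the byte width l_k times a footprint term built from the tile size parameters and the spread vector a_k, with inner indices i and j running from 1 to n]
  - l_k: byte width of the kth variable
  - v: number of variables
  - Tile size parameter (symbol lost)
  - a: spread vector
- Example:
  - dA=[1 0 0] aA=[1 0 0] lA=2 VA= 2.H.1
  - dB=[0 1 0] aB=[1 0 0] lB=2 VB= 2.H.
  - dC=[0 0 1] aC=[1 0 0] lC=4 VC=
- [Figure: tile with its input data dependencies and output data dependencies for A, B and C along axes i, j and k]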

Core power model (1/4)
- FPGA power dissipation model: Pcore = Pstat + Kc.Dlc.nlc.f
  - Kc: technology constant
  - Dlc: average toggle rate
  - nlc: number of logic cells
  - f: design operating frequency
- Not suited to our target FPGA architecture
  - Distinction between LCs used as memory and LCs used as logic
  - Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f
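
The two-term core power model can be written directly as a function. This is a sketch with hypothetical parameter names; units and constants are left to the caller, and the memory term defaults to zero so that the same function also covers the single-term model.

```python
def core_power(p_stat, f, k_c, d_lc, n_lc, k_m=0.0, d_m=0.0, n_m=0):
    """Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f.

    p_stat: static power
    f: design operating frequency
    k_c, d_lc, n_lc: constant, average toggle rate and count for logic LCs
    k_m, d_m, n_m: same quantities for LCs used as memory
    """
    logic_term = k_c * d_lc * n_lc * f
    memory_term = k_m * d_m * n_m * f
    return p_stat + logic_term + memory_term
```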

Core power model (2/4)
- Control logic is not modeled
  - Too complex to estimate
  - No significant contribution to power
- Core power depends on
  - Number of PEs: depends on [symbols lost in the transcript]
  - Area usage for each PE: depends on [symbol lost in the transcript]
  - Average toggle rate for the PE datapath and local memory (application constant)

Core power model (3/4)
- Memory resource usage
  - LCs used as distributed memory (16x1 bits)
  - Datapath is a design constant (library based)
- Area cost for a PE array
  - Datapath area: $A_d = n_p A_f$
    - $n_p$: number of PEs, expressed with determinants [expression not preserved in the transcript]
    - $A_f$: datapath functional cost
  - Memory area $A_m$: [formula garbled in the transcript; it involves $n_p$, a division by 16, the clustering parameter along processor space j and the register width along processor space k]

Core power model (4/4)
- Energy cost for the whole loop nest
  - We have Ec = Pc.ncycle.Tcycle
  - We will consider ncycle = Vcalc/np
- Total core energy cost
  - Ecore: [formula garbled in the transcript; a datapath term in Kf, Vcalc, Df, Af plus a memory term in Km, Vcalc, Dm and the memory area]
  - Vcalc: total loop computation volume
  - D: average toggle rate
- Energy does not depend on np!
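
The independence from np can be checked by substituting the slide's own relations. The sketch below keeps a single dynamic term, ignores Pstat, and assumes Tcycle = 1/f and a total cell count equal to np times a per-PE area; the symbol A_pe for that per-PE area is introduced here for illustration only.

```latex
\begin{align*}
P_c &\approx K \, D \, n_p \, A_{pe} \, f \\
E_c &= P_c \cdot n_{cycle} \cdot T_{cycle}
     = K \, D \, n_p \, A_{pe} \, f \cdot \frac{V_{calc}}{n_p} \cdot \frac{1}{f} \\
    &= K \, D \, A_{pe} \, V_{calc}
\end{align*}
```

Under these assumptions both np and f cancel: adding PEs raises power but shortens execution by the same factor.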

Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
  - Model validation
  - Extrapolations
- Conclusion

IO power model results
[Figure: four surface plots over x and y (axis ticks 5 to 25): observed IO power dissipation (mW), predicted IO power dissipation (mW), absolute error (mW), relative error (%)]

Core power model results
[Figure: four surface plots over x and y (axis ticks 5 to 25): predicted core power (mW), observed core power (mW), absolute error (mW), relative error (%)]

System power model
[Figure: four surface plots over x and y (axis ticks 5 to 25): predicted total energy dissipation (J), observed total energy dissipation (J), energy dissipation absolute error (J), energy dissipation relative error (%)]

Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
- Conclusion
  - Solving the optimisation problem (Lagrange multipliers)
  - Custom cache for embedded CPUs
  - Extension to SAREs (affine dependencies)

Conclusion
- Model matches experiments
  - Cheap measurement setup
  - Many components contribute to current dissipation (LEDs, PCI, etc.)
- Observations
  - Trade-off evolves with technology
  - More sensitive for ASICs?

Future work (1/2)
- Formulation of the optimization problem
  - Minimize energy per iteration
  - Constraints on performance and area
- Analytical solution?
  - Lagrange multipliers
  - No closed form for n>3
  - BUT fast numerical methods
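
To make the shape of the optimization problem concrete, the sketch below runs an exhaustive search over integer parameters. This is a swapped-in brute-force technique, not the Lagrange-multiplier or numerical approach named on the slide, and the cost and constraint used in the usage note are toy assumptions rather than the talk's energy model; `best_parameters` is a hypothetical name.

```python
from itertools import product


def best_parameters(cost, feasible, ranges):
    """Return (cost, params) minimizing cost over all feasible combinations.

    cost: function mapping a parameter tuple to a number.
    feasible: function returning True when the constraints hold.
    ranges: one iterable of candidate values per parameter.
    """
    best = None
    for params in product(*ranges):
        if not feasible(params):
            continue
        value = cost(params)
        if best is None or value < best[0]:
            best = (value, params)
    return best
```

As a toy usage, take an I/O cost per iteration of 1/x + 1/y for an x by y tile and an area constraint x*y <= 36; searching x and y in 1..25 then selects the square tile (6, 6).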

Future work (2/2)
- Model for embedded CPUs
  - Trade-off between cache size and memory accesses
  - Determine the optimal cache size and associated tiling parameters
- Extension to SAREs?
  - Affine dependencies
  - More general loops