Séminaire COSI ’01

Séminaire COSI-Roscoff ’01: Power Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye



Page 1: Séminaire COSI ’01

Séminaire COSI-Roscoff’01 1

Séminaire COSI ’01

Power Driven Processor Array Partitioning for FPGA SoC

S. Derrien, S. Rajopadhye

Page 2: Séminaire COSI ’01

Content
- Context and motivations
  - Silicon compilation tools
  - Target architectures
  - Power consumption
  - Related work
- Partitioning
- Modeling power
- Experimental results
- Conclusion

Page 3: Séminaire COSI ’01

Silicon compilation tools
- Parallel processor array architectures
  - Regular and scalable (well suited to FPGAs)
  - Specialized high-performance datapath
- Restricted class of loops
  - SUREs (systems of uniform recurrence equations, i.e. uniform dependencies)
  - Static polyhedral loop domain
- Compute-intensive nested loops
  - Image processing (motion estimation, stereo vision)
  - Signal processing (QR factorization, DLMS)
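To make the "restricted class of loops" concrete, here is a minimal Python sketch of a matrix product written as uniform recurrences: every variable at point (i, j, k) depends on a variable at a constant offset. The choice of matrix product and the name `matmul_sure` are illustrative assumptions and do not come from the slides.

```python
def matmul_sure(a, b):
    """Matrix product written as uniform recurrences over (i, j, k).

    A[i][j][k] = A[i][j-1][k]   (a[i][k] is propagated along j)
    B[i][j][k] = B[i-1][j][k]   (b[k][j] is propagated along i)
    C[i][j][k] = C[i][j][k-1] + A[i][j][k] * B[i][j][k]
    Each dependence is a constant vector, which is what "uniform" means.
    """
    n, m, p = len(a), len(b[0]), len(b)
    A = [[[0] * p for _ in range(m)] for _ in range(n)]
    B = [[[0] * p for _ in range(m)] for _ in range(n)]
    C = [[[0] * p for _ in range(m)] for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for k in range(p):
                # Boundary points read the inputs; interior points copy a neighbour.
                A[i][j][k] = a[i][k] if j == 0 else A[i][j - 1][k]
                B[i][j][k] = b[k][j] if i == 0 else B[i - 1][j][k]
                prev = 0 if k == 0 else C[i][j][k - 1]
                C[i][j][k] = prev + A[i][j][k] * B[i][j][k]
    # The result is the last accumulation step along k.
    return [[C[i][j][p - 1] for j in range(m)] for i in range(n)]
```

The loop domain here is a static box, a simple case of a polyhedral domain whose bounds are known before execution.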

Page 4: Séminaire COSI ’01

Power consumption
- General model and motivations
  - $P = P_{stat} + V_{dd}^2 \cdot C_d \cdot D \cdot f$ (gate-level model)
  - Estimate at RTL level (entropy-based models)
- Mainly dictated by:
  - On-chip area cost and activity
  - Off-chip I/O volume
- System-level power model?
  - Estimate from specs and target architecture

Page 5: Séminaire COSI ’01

Target architecture
[Figure: block diagram connecting the FPGA, the CPU, the system memory and the external world.]
- Embedded CPU
  - PowerPC
  - NIOS
- SoC bus
  - AMBA, CoreConnect
  - Plug 'n play IP cores
- Shared memory
  - Low latency
  - High bandwidth

Page 6: Séminaire COSI ’01

Related work
- Compiler transformations to reduce memory accesses [Kandemir]
  - Loop fusion
  - Loop tiling
  - Loop reordering
- Design space exploration for custom memory systems [Imec]
  - Systematic exploration
  - Multi-level memory hierarchy
  - The approach is brute force

Page 7: Séminaire COSI ’01

Content
- Context and motivations
- Target architectures
- Partitioning
  - Clustering (LSGP)
  - Tiling (LPGS)
  - Co-partitioning
- Modeling power
- Experimental results
- Conclusion

Page 8: Séminaire COSI ’01

Tiling (LPGS: locally parallel, globally sequential)
- Partition the PE array into tiles
  - Tiles are executed sequentially
  - Intermediate results are stored in off-chip memory
  - Requires unidirectional communications
- Tile shape is rectangular
  - Boundaries parallel to the PE space basis vectors
  - Perfect "tiling" of the processor space
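The mapping described above can be sketched in Python as follows. This is a minimal illustration, assuming a rectangular tile given by one size per processor axis; the names `tile_map` and `lpgs_schedule` are hypothetical, and the off-chip storage of intermediate results between tiles is not modeled.

```python
from itertools import product

def tile_map(pe, tile_shape):
    """Map a virtual PE index to (tile index, physical PE index inside the tile)."""
    tile = tuple(p // t for p, t in zip(pe, tile_shape))
    local = tuple(p % t for p, t in zip(pe, tile_shape))
    return tile, local

def lpgs_schedule(array_shape, tile_shape):
    """Yield (tile, virtual PEs of that tile) in the order tiles are executed."""
    # Ceiling division: number of tiles needed along each axis.
    n_tiles = [-(-a // t) for a, t in zip(array_shape, tile_shape)]
    for tile in product(*(range(n) for n in n_tiles)):
        pes = [
            tuple(ti * t + l for ti, t, l in zip(tile, tile_shape, local))
            for local in product(*(range(t) for t in tile_shape))
        ]
        # Drop PEs outside the array when its shape is not a multiple of the tile.
        yield tile, [pe for pe in pes if all(x < a for x, a in zip(pe, array_shape))]
```

All virtual PEs of one tile run in parallel on the physical array, and the outer loop over tiles is the sequential part.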

Page 9: Séminaire COSI ’01

Tiling (LPGS)
[Figure: PE array partitioned into tiles, with tile sizes 2 and 3 along the two processor axes; the PEs of a tile are connected to FIFOs through Mux and DeMux blocks. Annotations: 3x3 diagonal matrix with entries 1, 1 and H, where H is the domain height; diagonal matrix with |det| = Npe.]

Page 10: Séminaire COSI ’01

Clustering (LSGP: locally sequential, globally parallel)
- Regroups PEs into clusters
  - Operations of a cluster are executed sequentially
  - I/O accesses are reduced
- Cluster shape is rectangular
  - Boundaries parallel to the PE space basis vectors
  - Perfect "tiling" of the processor space
- Scheduling is axes-major
  - Several possible schedulings
  - Sequence of clusterings along each axis
  - Simplifies control logic
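A minimal Python sketch of such a clustering follows. It assumes that "axes-major" means the local positions of a cluster are visited like nested loops, with the last axis varying fastest; that reading, and the name `lsgp_map`, are assumptions rather than definitions from the slides.

```python
def lsgp_map(pe, cluster_shape):
    """Map a virtual PE to (physical PE, local time slot) under LSGP clustering."""
    physical = tuple(p // c for p, c in zip(pe, cluster_shape))
    local = tuple(p % c for p, c in zip(pe, cluster_shape))
    # Axes-major order: the last axis varies fastest, like nested loops.
    slot = 0
    for l, c in zip(local, cluster_shape):
        slot = slot * c + l
    return physical, slot
```

With a 2x3 cluster, each physical PE emulates six virtual PEs over six consecutive time slots, so the array needs six times fewer PEs at the cost of local storage.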

Page 11: Séminaire COSI ’01

Clustering (LSGP)
[Figure: PE array regrouped into clusters, with clustering parameters 3 and 2 along the processor axes. Annotations: PE index vector; iteration index vector; original space-time mapping, shown as a 3x3 diagonal matrix with entries 1, 1 and H; diagonal matrix with |det| = Npe.]

Page 12: Séminaire COSI ’01

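Clustering (LSGP)
[Figure: PE datapath (multiply and add on variables A, B and C) shown three times: the original PE, the PE clustered with x=2, and the PE clustered with x=2, y=3. Numeric labels on the registers take the values 1, 2, 3 and 6.]

Resource usage estimate: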

Page 13: Séminaire COSI ’01

Hybrid partitioning
- Step 1: the array is tiled
  - Tunes the I/O volume
- Step 2: each tile is clustered
  - Tunes the resource usage
- Trade-off
  - Off-chip I/O volume
  - Local memory sizes
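The two steps compose as in the following Python sketch, which only counts resources. It assumes rectangular tiles and clusters and requires the cluster shape to divide the tile shape along every axis; the name `co_partition` and that divisibility requirement are assumptions made for the illustration.

```python
from math import prod

def co_partition(array_shape, tile_shape, cluster_shape):
    """Return (number of tiles, physical PEs per tile, virtual PEs per physical PE)."""
    if any(t % c for t, c in zip(tile_shape, cluster_shape)):
        raise ValueError("cluster shape must divide tile shape along every axis")
    # Step 1: tiles are run one after the other (ceiling division per axis).
    n_tiles = prod(-(-a // t) for a, t in zip(array_shape, tile_shape))
    # Step 2: inside a tile, each physical PE emulates one cluster.
    n_physical = prod(t // c for t, c in zip(tile_shape, cluster_shape))
    return n_tiles, n_physical, prod(cluster_shape)
```

For a 24x24 virtual array with 12x6 tiles and 2x3 clusters this gives 8 sequential tiles, 12 physical PEs, and 6 virtual PEs emulated per physical PE. Larger tiles reduce off-chip traffic, while larger clusters reduce the PE count but increase local memory.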

Page 14: Séminaire COSI ’01

Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
  - IO power model
  - Core power model
  - Putting it all together
- Experimental results
- Conclusion

Page 15: Séminaire COSI ’01

Dynamic IO energy model
- IO energy depends on
  - IO volume (RAM clock speed)
  - Operation (Rd, Wr)
  - Port toggle rate
- $E_{io} = K_{rd} V_{rd} + K_{wr} V_{wr}$
  - $K_{rd}$, $K_{wr}$: technological constants
  - $V_{rd}$, $V_{wr}$: number of read and write I/O operations
- Determine the IO volume
  - For all loop variables
  - Given the tiling parameters
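As a small worked form of the IO energy expression, the following Python sketch evaluates it for given volumes. The function name `io_energy` and any constants passed to it are placeholders; the slides give no numeric values for the technological constants.

```python
def io_energy(k_rd, v_rd, k_wr, v_wr):
    """Dynamic IO energy: E_io = K_rd * V_rd + K_wr * V_wr.

    k_rd, k_wr: energy per read / write operation (technological constants).
    v_rd, v_wr: number of read / write I/O operations.
    """
    return k_rd * v_rd + k_wr * v_wr
```

Since the constants are fixed by the technology, the only quantities the partitioning can change are the read and write volumes.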

Page 16: Séminaire COSI ’01

IO volume estimate (1/2)
- The tile IO volume is called the "footprint"
- Estimate $V$ for this footprint [Arg95]
- $a$: spread vector of the dependencies

$$V = \sum_{i=1}^{n} \left| \det A_{i \leftarrow a} \right|$$

where $A_{i \leftarrow a}$ is the tile matrix $A$ with its $i$th row substituted by the spread vector $a$.
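The row-substitution estimate can be sketched in plain Python as below. The sketch takes the absolute value of each determinant, which is an assumption about how the slide formula is meant to be read, and the helper names `det` and `footprint_io` are hypothetical.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (fine for small n)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * det(minor)
    return total

def footprint_io(tile_matrix, spread):
    """Sum over i of |det(tile matrix with row i replaced by the spread vector)|."""
    total = 0
    for i in range(len(tile_matrix)):
        substituted = [list(row) for row in tile_matrix]
        substituted[i] = list(spread)
        total += abs(det(substituted))
    return total
```

For a rectangular 4x5x6 tile and a spread vector [1 0 0], only the first substitution is non-zero and the estimate is 5 x 6 = 30, the area of the tile face crossed by the dependence.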

Page 17: Séminaire COSI ’01

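IO volume estimate (2/2)

Total tile IO volume:

$$V_{io} = \sum_{k} l_k \sum_{i=1}^{n} a_{k,i} \prod_{j=1,\, j \neq i}^{n} s_j$$

- $k$ ranges over the $v$ loop variables ($v$: number of variables)
- $l_k$: byte width of the $k$th variable
- $s_j$: tile size parameter along axis $j$
- $a_{k,i}$: $i$th component of the spread vector of variable $k$ (dependencies on tile input data and tile output data)

Example:
  dA=[1 0 0] aA=[1 0 0] lA=2 VA= 2.H.1
  dB=[0 1 0] aB=[1 0 0] lB=2 VB= 2.H.
  dC=[0 0 1] aC=[1 0 0] lC=4 VC=

[Figure: iteration space with axes i, j, k and the variables A, B, C.]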
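For rectangular tiles the per-variable footprint has a closed form: substituting row i of a diagonal tile matrix by the spread vector leaves the i-th spread component times the product of the other tile sizes. The Python sketch below sums that over all variables, weighted by byte width. It counts each dependence once (it does not separate tile input from tile output), and the name `tile_io_volume` and the sample values are assumptions.

```python
from math import prod

def tile_io_volume(tile_sizes, variables):
    """Total IO bytes for one rectangular tile.

    variables: list of (byte_width, spread_vector) pairs, one per loop variable.
    For a diagonal tile matrix, replacing row i by the spread vector a gives
    |a[i]| times the product of the other tile sizes.
    """
    total = 0
    for width, spread in variables:
        for i, a_i in enumerate(spread):
            others = prod(s for j, s in enumerate(tile_sizes) if j != i)
            total += width * abs(a_i) * others
    return total
```

With a 4x5x6 tile, two 2-byte variables spreading along the first and second axes, and one 4-byte variable spreading along the third, the total is 2x30 + 2x24 + 4x20 = 188 bytes.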

Page 18: Séminaire COSI ’01

Core power model (1/4)
- FPGA power dissipation model: $P_{core} = P_{stat} + K_c D_{lc} n_{lc} f$
  - $K_c$: technology constant
  - $D_{lc}$: average toggle rate
  - $n_{lc}$: number of logic cells
  - $f$: design operating frequency
- Not suited to our target FPGA architecture
  - Distinction between LCs used as memory and LCs used as logic
- $P_{core} = P_{stat} + K_c D_{lc} n_{lc} f + K_m D_m n_m f$
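The two-term core power model translates directly into a small Python function. The name `core_power`, the default arguments, and the numeric values in the usage note are illustrative assumptions; the slides do not give values for the constants.

```python
def core_power(p_stat, f, k_c, d_lc, n_lc, k_m=0.0, d_m=0.0, n_m=0):
    """P_core = P_stat + K_c*D_lc*n_lc*f + K_m*D_m*n_m*f.

    p_stat: static power; f: operating frequency.
    k_c, d_lc, n_lc: constant, toggle rate and count for logic cells used as logic.
    k_m, d_m, n_m: same quantities for logic cells used as memory.
    """
    logic = k_c * d_lc * n_lc * f
    memory = k_m * d_m * n_m * f
    return p_stat + logic + memory
```

Leaving the memory arguments at their defaults gives back the single-term model, which makes the contribution of the memory cells easy to isolate, for instance by comparing `core_power(0.1, 50e6, 1e-11, 0.2, 1000)` with the same call plus `k_m=2e-11, d_m=0.1, n_m=500`.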

Page 19: Séminaire COSI ’01

Core power model (2/4)
- Control logic is not modeled
  - Too complex to estimate
  - No significant contribution to power
- Core power depends on
  - Number of PEs: depends on the tiling and clustering parameters
  - Area usage for each PE: depends on the clustering parameters
  - Average toggle rate for the PE datapath and local memory (application constant)

Page 20: Séminaire COSI ’01

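Core power model (3/4)
- Memory resource usage
  - LCs used as distributed memory (16x1 bits)
  - Datapath is a design constant (library based)
- Area cost for a PE array
  - Datapath: $A_d = n_p A_f$, where $A_f$ is the datapath functional cost and $n_p$ the number of PEs
  - $n_p$ is a ratio of two determinants of the partitioning matrices
  - Memory: [Equation: $A_m$ is $n_p / 16$ times a sum, over the processor space axes $k$, of the register width along $k$ multiplied by a product of the clustering parameters along the processor space axes $j$.]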

Page 21: Séminaire COSI ’01

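Core power model (4/4)
- Energy cost for the whole loop nest
  - We have $E_c = P_c \cdot n_{cycle} \cdot T_{cycle}$
  - We will consider $n_{cycle} = V_{calc} / n_p$
- Total core energy cost

$$E_{core} = K_f D_f V_{calc} A_f + K_m D_m V_{calc} \frac{A_m}{n_p}$$

  - $V_{calc}$: total loop computation volume
  - $D_f$, $D_m$: average toggle rates of the datapath and of the memory
  - $A_m / n_p$: memory area per PE, which does not vary with $n_p$
- Energy does not depend on $n_p$!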
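The independence from $n_p$ can be checked by substitution. The derivation below is a sketch under stated assumptions: $T_{cycle} = 1/f$, the static term is left out, the logic cell count equals the datapath area $A_d = n_p A_f$, and the memory cell count equals $A_m$, which is itself proportional to $n_p$.

```latex
\begin{align*}
E_c &= P_c \, n_{cycle} \, T_{cycle}
     = \left( K_f D_f A_d f + K_m D_m A_m f \right) \frac{V_{calc}}{n_p} \cdot \frac{1}{f} \\
    &= K_f D_f V_{calc} \frac{A_d}{n_p} + K_m D_m V_{calc} \frac{A_m}{n_p} \\
    &= K_f D_f V_{calc} A_f + K_m D_m V_{calc} \frac{A_m}{n_p}
\end{align*}
```

Doubling the number of PEs doubles the power but halves the number of cycles, so the energy of the loop nest is set by the per-PE area terms only.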

Page 22: Séminaire COSI ’01

Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
  - Model validation
  - Extrapolations
- Conclusion

Page 23: Séminaire COSI ’01

IO power model results
[Figure: four surface plots over the parameters x and y (both axes from 5 to 25): observed IO power dissipation (mW), predicted IO power dissipation (mW), absolute error (mW) and relative error (%).]

Page 24: Séminaire COSI ’01

Core power model results
[Figure: four surface plots over the parameters x and y (both axes from 5 to 25): predicted core power (mW), observed core power (mW), absolute error (mW) and relative error (%).]

Page 25: Séminaire COSI ’01

System power model
[Figure: four surface plots over the parameters x and y (both axes from 5 to 25): predicted total energy dissipation (loop execution energy cost, J), observed total energy dissipation (loop execution energy cost, J), energy dissipation absolute error (J) and energy dissipation relative error (%).]

Page 26: Séminaire COSI ’01

Content
- Context and motivations
- Target architectures
- Partitioning
- Modeling power
- Experimental results
- Conclusion
  - Solving the optimisation problem (Lagrange multipliers)
  - Custom cache for embedded CPUs
  - Extension to SAREs (affine dependencies)

Page 27: Séminaire COSI ’01

Conclusion
- The model matches the experiments
  - Cheap measurement setup
  - Many components contribute to current dissipation (LEDs, PCI, etc.)
- Observations
  - The trade-off evolves with technology
  - More sensitive for ASICs?

Page 28: Séminaire COSI ’01

Future work (1/2)
- Formulation of the optimization problem
  - Minimize energy per iteration
  - Constraints on performance and area
- Analytical solution?
  - Lagrange multipliers
  - No closed form for n>3, BUT fast numerical methods exist
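As a stand-in for the numerical methods mentioned above, the following Python sketch solves a reduced version of the problem by exhaustive search. It is not the Lagrange-multiplier approach: it minimizes IO bytes per iteration (a proxy for IO energy per iteration) over integer tile sizes, under an area budget expressed as a maximum PE count. The function name `best_tile`, the cost proxy and the search bounds are all assumptions.

```python
from itertools import product
from math import prod

def best_tile(max_pes, height, variables, max_size=32):
    """Exhaustively pick (s1, s2) minimizing IO bytes per iteration.

    Constraint: s1 * s2 <= max_pes (area budget).
    Cost: sum over variables of width * |a_i| * prod(other tile sizes),
    divided by the tile volume s1 * s2 * height.
    variables: list of (byte_width, spread_vector) pairs.
    """
    best = None
    for s1, s2 in product(range(1, max_size + 1), repeat=2):
        if s1 * s2 > max_pes:
            continue
        sizes = (s1, s2, height)
        io = sum(
            width * abs(a_i) * prod(s for j, s in enumerate(sizes) if j != i)
            for width, spread in variables
            for i, a_i in enumerate(spread)
        )
        cost = io / prod(sizes)
        if best is None or cost < best[0]:
            best = (cost, (s1, s2))
    return best
```

With two 2-byte variables spreading along the two processor axes and a budget of 16 PEs, the cost reduces to 2/s1 + 2/s2, and the search returns the square 4x4 tile, as the symmetry of the problem suggests.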

Page 29: Séminaire COSI ’01

Future work (2/2)
- Model for embedded CPUs
  - Trade off cache size against memory accesses
  - Determine the optimal cache size and the associated tiling parameters
- Extension to SAREs?
  - Affine dependencies
  - More general loops