Design Methodologies for Dynamic Reconfigurable Multi-FPGA Systems BY Alessandro Panella [email protected] Thesis defense – May 5, 2008 THESIS COMMITTEE: John Lillis, Marco D. Santambrogio, Ajay Kshemkalyani

Page 1: UIC Panella Thesis

Design Methodologies for Dynamic Reconfigurable Multi-FPGA Systems

BY

Alessandro Panella

[email protected]

Thesis defense – May 5, 2008

THESIS COMMITTEE:

John Lillis, Marco D. Santambrogio, Ajay Kshemkalyani

Page 2: UIC Panella Thesis

About this thesis (1/2)

PROBLEM STATEMENT: Extend the range of application of dynamic reconfigurability techniques from the single-FPGA case to multi-FPGA systems

NOVELTY
– Methodology for the design of multi-FPGA systems
– Dynamic reconfigurability
• Seen as a solution for implementing applications that require more area than is available
• Only used “when needed”
– Regularity-driven partitioning for run-time reuse

Page 3: UIC Panella Thesis

About this thesis (2/2)

Major contribution:
– Development of a multi-FPGA system design flow which exploits dynamic reconfigurability for block reuse.

Useful contributions:
– Creation of an intermediate representation for structural and hierarchical circuits.
– Creation of a framework for the extraction of the design from VHDL.
– Design and implementation of static global layout algorithms.
– Exploitation of hierarchy information for the extraction of regular patterns.

Page 4: UIC Panella Thesis

Outline

Context definition
– FPGA
– Multi-FPGA Systems (MFS)
– Dynamic reconfigurability

Related work
– MFS design flows
– Dynamically reconfigurable MFSs

Proposed methodology
– Design extraction
– Global layout
– Reuse and dynamic reconfigurability

Experimental results
Conclusion and future work

Page 5: UIC Panella Thesis

Field Programmable Gate Array

Re-programmable semi-custom hardware
– Low Non-Recurring Engineering (NRE) costs
– Good performance
– High flexibility

Composed of Configurable Logic Blocks (CLB)
– Xilinx Virtex CLB: 2 slices, each containing two 4-input Look-Up Tables (LUT)
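
As a rough illustration of what a 4-input LUT does, it can be modelled as a 16-entry truth table indexed by the four input bits. The sketch below is only a software analogy, not a description of the Virtex hardware; the names `make_lut4` and `eval_lut4` are hypothetical.

```python
def make_lut4(func):
    """Build the 16-entry configuration of a 4-input LUT from a Boolean function."""
    return [bool(func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1))
            for i in range(16)]


def eval_lut4(table, a, b, c, d):
    """Look up the output for inputs a, b, c, d (each 0 or 1)."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


# Configure the LUT as a 4-input XOR (parity) function.
xor4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Re-programming the FPGA corresponds, in this analogy, to rewriting the 16 table entries.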

Page 6: UIC Panella Thesis

Multi-FPGA Systems (MFS)

Ensembles of multiple FPGAs (from 2 to thousands)

Motivations:
– Massively parallel computing
– Need to implement large applications
– General trend in VLSI towards multi-core computers

Applications:
– Supercomputing
– Logic emulation
– Neural networks, …

Terminology:
– Architecture: physical cluster of FPGAs
– Application: programmed functionality
– System: architecture + application

Page 7: UIC Panella Thesis

MFS topologies (1/2)

Connections:
– Hardwired vs. programmable
– Dedicated vs. shared (bus, point-to-point)

Complete graph (clique)
– PRO: direct connection between any two chips
– CON: planarity; pin requirements

Mesh: 4(8)-neighbor pattern
– PRO: expandability
– CON: no fixed-length path; communication logic in intermediate chips

Page 8: UIC Panella Thesis

MFS topologies (2/2)

Crossbar: logic-bearing chips and routing chips
– Total (one routing chip)
– Partial (several routing chips)
– PRO: equal communication delays
– CON: low scalability

Hybrid: combines the benefits of the two approaches
– Example: Hybrid Complete-Graph Partial-Crossbar (HCGP)
(from Khalid, M.: Routing Architecture and Layout Synthesis for Multi-FPGA Systems, Ph.D. Thesis, University of Toronto, 1999)

Page 9: UIC Panella Thesis

Reconfigurability

Reconfiguration: altering the location or functionality of a system element (G. Estrin, 1960)
– FPGA: suitable physical ground

Partial vs. total

(Partial) dynamic vs. static:
– Partial: only some parts of the system take part in each reconfiguration
– Dynamic: the execution of the system does not cease

Motivations and applications
– Provide a larger virtual area
– React to sudden and frequent changes in application needs
– Fault tolerance

Page 10: UIC Panella Thesis

Dynamically Reconfigurable MFSs

Rationale: expand the capabilities of static MFSs
– Go beyond the physical limitations of the MFS
– Provide a high level of flexibility
• E.g. in logic emulation: dynamic fault fixing

Partial vs. total reconfiguration in MFS

Two main scenarios (not mutually exclusive)
– Reconfiguration of logic chips
– Reconfiguration of routing chips

The interconnections are dynamically mutable
Components can be reused

Page 11: UIC Panella Thesis

Design hierarchy

Application composed of:
– Blocks
• Can have sub-blocks
– Nets
• Block-to-block
• Block-to-interface

Advantages:
– Handle the complexity of the design
– Reuse of modules
• IP-core libraries

[Figure: example design showing a block-to-block net and a block-to-interface net]

Page 12: UIC Panella Thesis

What’s next

Context definition
– FPGA
– Multi-FPGA Systems (MFS)
– Dynamic reconfigurability

Related work
– MFS design flows
– Dynamically reconfigurable MFSs

Proposed methodology
– Design extraction
– Global layout
– Reuse and dynamic reconfigurability

Experimental results
Conclusion and future work

Page 13: UIC Panella Thesis

Related work: MFS design flows

All MFS design flows have a similar structure
– Different algorithms are used in each phase

Examples: Hauck (a) and Khalid (b)

Global layout tasks: partitioning, placement and routing

a) Hauck, S.: Multi-FPGA Systems, Ph.D. Thesis, University of Washington, 1995

b) Khalid, M.: Routing Architecture and Layout Synthesis for Multi-FPGA Systems, Ph.D. Thesis, University of Toronto, 1999

Page 14: UIC Panella Thesis

Complete MFS design flows (a)

Integrated solution to partitioning, placement and routing
– Recursive bi-partitioning
• Multilevel approach
– Clustering and refinement phases
– Partition orderings for placement
• Identify the bottlenecks in the architecture
• Assign the two initial partitions to the least connected parts of the architecture, and so on recursively
– The connections are routed as the bisections are computed

PROS: the architecture is taken into account
CONS: no flexibility in routing once partitioning and placement are fixed

Page 15: UIC Panella Thesis

Complete MFS design flows (b)

Partitioning: recursive bisection using the Fiduccia-Mattheyses heuristic

Placement: dependent on the topology
– Mesh: force-directed
– Crossbar: trivial task, since all FPGAs are at the same distance

Routing: two approaches
– General (obtain a graph from the architecture)
– Specific (fitted to the particular MFS topology)

PROS: uses existing, effective and robust algorithms

CONS: emphasis on routing and topology evaluation

Page 16: UIC Panella Thesis

Partial MFS design flows

Address only some phases of the design
– Usually partitioning and placement

Iterative approaches
– Genetic algorithm [Hidalgo et al., DSD ’02]
– Simulated annealing [Roy et al., ICCAD ’93; Vicente et al., FPL ’99]

Hierarchical approaches
– Exploit the design hierarchy in partitioning
– Behrens et al., ICCAD ’96
• Hierarchy exploration heuristic
– Fang et al., TODAES ’00
• Hierarchy extraction from Verilog spec.
• Set-covering procedure

Page 17: UIC Panella Thesis

Dynamically Reconfigurable MFS

Extraction of a directed task graph from VHDL
Task graph divided into time segments
– Using a non-linear programming model
Each segment is spatially partitioned

[Ouaiss et al., An Integrated Partitioning and Synthesis System for Dynamically Reconfigurable Multi-FPGA Architectures, 1998]

Dynamic?

Page 18: UIC Panella Thesis

What’s next

Context definition
– FPGA
– Multi-FPGA Systems (MFS)
– Dynamic reconfigurability

Related work
– MFS design flows
– Dynamically reconfigurable MFSs

Proposed methodology
– Design extraction
– Global layout
– Reuse and dynamic reconfigurability

Experimental results
Conclusion and future work

Page 19: UIC Panella Thesis

Proposed methodology

Multi-FPGA design flow
Three main phases
1. Design extraction
2. Static global physical layout
• Partitioning
• Placement
• Routing
3. Reuse through dynamic reconfigurability

Reuse introduces extra delays
– Reconfiguration times, sequential execution…
– Only adopted when needed
– In such a case, the introduced delay has to be minimized

Page 20: UIC Panella Thesis

Design Extraction

Input: VHDL description
Output: intermediate representation
– Ad hoc data structure

Two sub-phases:
– VHDL preprocessing
– VHDL structural parsing

Page 21: UIC Panella Thesis

Intermediate representation

C++ data structure
Contains both structural and hierarchical information
Graphs implemented using the Boost Graph Library
Container class provides an API
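
To make the idea of a representation that holds both hierarchy and structure more concrete, here is a minimal Python sketch. It is not the thesis's C++/Boost implementation; the class names `Block` and `Design`, the `area` field and the example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    name: str        # instance name
    type: str        # component the block is an instance of
    area: int = 0    # hypothetical area estimate, e.g. in slices
    children: list = field(default_factory=list)  # sub-blocks (hierarchy)

    def leaves(self):
        """Return the flattened view: all blocks that have no sub-blocks."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

    def total_area(self):
        """Sum the area of all leaf blocks below this block."""
        return sum(leaf.area for leaf in self.leaves())


@dataclass
class Design:
    top: Block
    # Structural information: (source leaf, sink leaf, width) tuples.
    nets: list = field(default_factory=list)


# Made-up example loosely inspired by a DES core; all numbers are invented.
top = Block("des", "des_core", children=[
    Block("r0", "round", children=[
        Block("r0/sbox", "sbox", 40),
        Block("r0/perm", "perm", 10),
    ]),
    Block("key", "key_sched", 30),
])
design = Design(top, nets=[("key", "r0/sbox", 48), ("r0/sbox", "r0/perm", 32)])
flat_names = [b.name for b in design.top.leaves()]
```

The same object thus answers both hierarchical queries (children of a block) and flattened ones (leaf blocks and the nets between them).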

Page 22: UIC Panella Thesis

VHDL Parsing

VHDL preprocessing: obtain a purely structural VHDL description
– Features of each component are retrieved using vendors’ synthesis tools (e.g. Xilinx XST, Synplify Pro)

Create the intermediate representation from the purely structural VHDL description

Page 23: UIC Panella Thesis

Example

DES encryption core (part of the 3DES core circuit)

[Figures: hierarchy; flattened view]

Page 24: UIC Panella Thesis

Static Global Layout

This phase addresses partitioning and placement (P&P)

Two implemented approaches:
– Integrated P&P
– Sequential P&P

Page 25: UIC Panella Thesis

Integrated P&P

Simulated annealing algorithm
– Iterative randomized approach
• Suitable for high-dimensionality problems
• Partitioning + placement is such a problem
– Aim: minimize a cost function f
– The algorithm starts with a “high” temperature T
– At each iteration
• M random moves are performed
• Each move is accepted according to the Metropolis criterion
– Always, if the cost decreases or remains equal
– With probability $e^{-\Delta c / T}$, if the cost increases
• T is decreased by a cooling factor α
– Stop after S consecutive non-accepted moves
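
The annealing loop on this slide can be sketched as follows. This is a generic illustration under stated assumptions, not the thesis code: the function name `anneal`, the default parameter values, the tracking of the best solution seen, and the `t_min` guard (added so the sketch always terminates) are not taken from the slides.

```python
import math
import random


def anneal(start, cost, random_move, t_start=10.0, alpha=0.9,
           moves_per_temp=100, stop_after=500, t_min=1e-6, seed=0):
    """Simulated annealing with the Metropolis acceptance criterion."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost   # assumption: remember the best
    temperature = t_start                     # "high" starting temperature T
    rejected = 0                              # consecutive non-accepted moves
    while rejected < stop_after and temperature > t_min:
        for _ in range(moves_per_temp):       # M random moves per iteration
            candidate = random_move(current, rng)
            delta = cost(candidate) - current_cost
            # Accept always if the cost does not increase,
            # otherwise with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, current_cost + delta
                rejected = 0
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
            else:
                rejected += 1
                if rejected >= stop_after:    # S consecutive rejections
                    break
        temperature *= alpha                  # cooling factor
    return best, best_cost


# Toy usage: minimize (x - 3)^2 over the integers with +/-1 moves.
best, best_cost = anneal(20, lambda x: (x - 3) ** 2,
                         lambda x, rng: x + rng.choice((-1, 1)))
```

For partitioning and placement, `start` would be an assignment of nodes to FPGAs and `cost` a wire-length estimate with constraint penalties.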

Page 26: UIC Panella Thesis

Annealing implementation

Solution: array [c_i], where node i is placed in FPGA c_i
Cost: Weighted Estimated Wire Length (WEWL)

Random move: single-node or swap, with equal probability

Constraints:
– Area constraint
– I/O pin constraint
– Handled with penalties
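
The exact WEWL formula is not given in the transcript, so the sketch below assumes a common form: each net contributes its weight times the distance between the FPGAs holding its endpoints, and area overflow is added as a penalty. The function names, the `penalty` value and the omission of the I/O pin penalty (which would be handled analogously) are assumptions.

```python
import random


def mesh_distance(f1, f2, columns):
    """Manhattan distance between two FPGAs numbered row by row on a mesh."""
    r1, c1 = divmod(f1, columns)
    r2, c2 = divmod(f2, columns)
    return abs(r1 - r2) + abs(c1 - c2)


def wewl_cost(placement, nets, areas, capacity, distance, penalty=1000):
    """Assumed cost: weighted wire length plus a penalty for area overflow.

    placement[i] is the FPGA holding node i; nets are (u, v, weight) tuples.
    """
    wire = sum(w * distance(placement[u], placement[v]) for u, v, w in nets)
    used = {}
    for node, fpga in enumerate(placement):
        used[fpga] = used.get(fpga, 0) + areas[node]
    overflow = sum(max(0, a - capacity) for a in used.values())
    return wire + penalty * overflow


def random_move(placement, n_fpgas, rng):
    """Single-node reassignment or swap of two nodes, with equal probability."""
    new = list(placement)
    if rng.random() < 0.5:
        new[rng.randrange(len(new))] = rng.randrange(n_fpgas)
    else:
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    return new


# Toy usage on a 2x2 mesh; all numbers are invented.
nets = [(0, 1, 4), (1, 2, 2), (2, 3, 1)]
areas = [100, 100, 100, 100]
mesh = lambda a, b: mesh_distance(a, b, 2)
cost_ok = wewl_cost([0, 0, 3, 3], nets, areas, 250, mesh)
cost_overflow = wewl_cost([0, 0, 3, 3], nets, areas, 150, mesh)
```

Passing a distance function that returns 1 for any two distinct FPGAs makes the cost a pure cut-size measure, so the same code acts as a partitioner.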

Page 27: UIC Panella Thesis

Sequential P&P

Partitioning: bottom-up clustering
1-to-1 placement: annealing
– Simplified version of the integrated P&P algorithm

CLUSTERING:
Initialization: each node is considered as a cluster
At each iteration
– Choose two nodes on the basis of a metric
– Collapse them
Stop when
– Only one cluster is left
– No clusters can be formed due to
• Area constraint
• I/O pin constraint
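
A minimal sketch of this bottom-up clustering loop is given below. It makes several assumptions not stated on the slide: the metric is the total weight of the nets between two clusters, only connected pairs are candidates, and only the area constraint is checked (the I/O pin constraint is left out for brevity). The function name `cluster` is hypothetical.

```python
def cluster(areas, nets, capacity):
    """Greedy bottom-up clustering under an area constraint.

    areas[i] is the area of node i; nets are (u, v, weight) tuples.
    Returns the final clusters as sorted lists of node indices.
    """
    clusters = {i: {i} for i in range(len(areas))}   # each node is a cluster
    area = dict(enumerate(areas))
    conn = {}                                        # pair -> total net weight
    for u, v, w in nets:
        key = (min(u, v), max(u, v))
        conn[key] = conn.get(key, 0) + w

    while len(clusters) > 1:
        # Candidate pairs whose merged area still fits in one FPGA.
        feasible = [(w, pair) for pair, w in conn.items()
                    if area[pair[0]] + area[pair[1]] <= capacity]
        if not feasible:
            break                                    # no cluster can be formed
        _, (a, b) = max(feasible)                    # most connected pair
        clusters[a] |= clusters.pop(b)               # collapse b into a
        area[a] += area.pop(b)
        merged = {}
        for (u, v), w in conn.items():               # redirect b's nets to a
            u, v = (a if u == b else u), (a if v == b else v)
            if u != v:
                key = (min(u, v), max(u, v))
                merged[key] = merged.get(key, 0) + w
        conn = merged
    return sorted(sorted(c) for c in clusters.values())


# Toy usage: four nodes of area 10, FPGA capacity 20.
result = cluster([10, 10, 10, 10], [(0, 1, 5), (2, 3, 4), (1, 2, 1)], 20)
```

Because each step only looks at the current best pair, the procedure is greedy, which is the limitation the future-work slide refers to.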

Page 28: UIC Panella Thesis

Clustering metrics

1. Connection
2. Communication ratio
– Internal communication
– External communication
3. Communication density

Page 29: UIC Panella Thesis

Block reuse

Problem: the application does not fit onto the architecture
– Reuse similar parts of the circuit in order to save space
Def: dynamically-interconnected structure

Architectural scenarios
– Bus
– Crossbar

Page 30: UIC Panella Thesis

Isomorphic clusters

Which parts of the structure should be considered for reuse?
Def. Isomorphic clusters
– Substructures which contain the same blocks having the same connections
– Example

Two subproblems
– Find the isomorphic clusters
– Select the ones to reuse (and how many times)

Page 31: UIC Panella Thesis

Isomorphic clusters extraction (1/2)

Regularity-driven clustering

Def. type of a node: the component of which the node is an instance

If two nodes selected for collapsing have the same parent
– Look for nodes with the same type as the parent in the hierarchy
– Execute the same collapsing operation
– Assign the same type to the newly created clusters

Clustering itself benefits from this enhancement
– Problem of standard clustering: lack of a global metric
– Regularity provides global information
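
The replication step can be sketched as below. The data layout is an assumption made for illustration: each parent instance maps to its type and to a dictionary of local child names and child types, and instances of the same type are assumed to share the same local child names. The function name and the way the new cluster type is spelled are hypothetical.

```python
def collapse_with_regularity(instances, parent, a, b):
    """Collapse children a and b of `parent` and repeat the same collapse
    in every other instance that has the same type as `parent`.

    instances: {instance_name: (instance_type, {child_name: child_type})}
    Returns a list of (instance_name, cluster_name, cluster_type).
    """
    parent_type, children = instances[parent]
    # The new cluster gets a type derived from the types it contains, so
    # that all replicated clusters share it and can be recognized later.
    cluster_type = "{" + ",".join(sorted([children[a], children[b]])) + "}"
    cluster_name = a + "+" + b
    created = []
    for name, (instance_type, kids) in instances.items():
        if instance_type != parent_type:
            continue                      # different component: leave untouched
        kids.pop(a)
        kids.pop(b)
        kids[cluster_name] = cluster_type
        created.append((name, cluster_name, cluster_type))
    return created


# Toy usage: two instances of a hypothetical "round" component.
instances = {
    "r0": ("round", {"sbox": "sbox", "perm": "perm", "xor": "xor2"}),
    "r1": ("round", {"sbox": "sbox", "perm": "perm", "xor": "xor2"}),
    "ks": ("key_sched", {"shift": "shifter", "sel": "mux"}),
}
created = collapse_with_regularity(instances, "r0", "sbox", "perm")
```

Clusters that end up with the same type are candidates for being isomorphic, which is what makes them interesting for reuse.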

Page 32: UIC Panella Thesis

Isomorphic clusters extraction (2/2)

The key feature is the assignment of a “type” to clusters

[Figure: example of type assignment to clusters]

Page 33: UIC Panella Thesis

Block reuse choices

Choose which blocks to reuse
Difficulty: high complexity due to hierarchical clusters
– Some clusters contain others

Solution
– ILP model, fast even for a high number of nodes
– Run the ILP model on each “cut” of the dendrogram
– Each cut is a flattened structural view of the application
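
One way to picture the cuts of a dendrogram is as the successive flat partitions produced by the clustering merges. The sketch below assumes that cut k is the partition obtained after k merges; the slides do not state how cuts are indexed, and the function name and merge format are hypothetical.

```python
def dendrogram_cuts(leaves, merges):
    """Return the flat view after 0, 1, 2, ... merges.

    merges is a list of (cluster_a, cluster_b, merged_name) tuples in the
    order in which the clustering performed them.
    """
    clusters = {leaf: [leaf] for leaf in leaves}
    cuts = [sorted(clusters)]                 # cut 0: every leaf on its own
    for a, b, merged in merges:
        clusters[merged] = clusters.pop(a) + clusters.pop(b)
        cuts.append(sorted(clusters))         # one new cut per merge
    return cuts


# Toy usage with four leaf blocks.
cuts = dendrogram_cuts(
    ["a", "b", "c", "d"],
    [("a", "b", "ab"), ("c", "d", "cd"), ("ab", "cd", "abcd")],
)
```

A selection procedure such as the ILP model would then be run once per cut, and the cut with the best estimated result kept.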

Page 34: UIC Panella Thesis

ILP model for block reuse

x_i: number of times cluster type t_i is reused (= no. of needed reconfigurations)

Page 35: UIC Panella Thesis

What’s next

Context definition
– FPGA
– Multi-FPGA Systems (MFS)
– Dynamic reconfigurability

Related work
– MFS design flows
– Dynamically reconfigurable MFSs

Proposed methodology
– Design extraction
– Global layout
– Reuse and dynamic reconfigurability

Experimental results
Conclusion and future work

Page 36: UIC Panella Thesis

Experiments

Test circuit description (slide 37)

Integrated vs. sequential partitioning & placement
– Methodologically, both approaches are valid
– They are compared from a numerical point of view
• Partitioning evaluation (slide 38)
• Placement evaluation (slide 39)

Sequential P&P vs. Metis (slide 40)
– Provides a comparison with an external approach

Block reuse evaluation (slide 41)
– Execution time
– Example of application

Page 37: UIC Panella Thesis

Results: test circuits

– Triple-DES encryption+decryption core (3DES)
– Finite Impulse Response filter (FIR)
– Noekeon cipher (NOEK)
– Composed module FIR+3DES

Page 38: UIC Panella Thesis

Integrated vs. Sequential P&P (1/2)

Partitioning evaluation

NOTE: by setting the distance between any two FPGAs equal to 1, the integrated annealing approach is actually a partitioning algorithm

Page 39: UIC Panella Thesis

Integrated vs. Sequential P&P (2/2)

Placement evaluation (on mesh architectures)

[Figures: Integrated P&P; Sequential P&P]

Page 40: UIC Panella Thesis

Clustering vs. Metis

Page 41: UIC Panella Thesis

Results: ILP model solving

Timing results

ILP result, example:
• 3DES-FIR circuit
• Conn metric
• 4 FPGAs of 600 slices needed; only 3 are available
• Adopt reuse
• Dendrogram cuts 2-7 provide the lowest estimated reconfiguration time

Page 42: UIC Panella Thesis

What’s next

Context definition
– FPGA
– Multi-FPGA Systems (MFS)
– Dynamic reconfigurability

Related work
– MFS design flows
– Dynamically reconfigurable MFSs

Proposed methodology
– Design extraction
– Global layout
– Reuse and dynamic reconfigurability

Experimental results
Conclusion and future work

Page 43: UIC Panella Thesis

Conclusion: contributions

Major contribution:
– Development of a multi-FPGA system design flow which exploits dynamic reconfigurability for block reuse while minimizing the estimated execution time.

Useful contributions:
– Creation of an intermediate representation for structural and hierarchical circuits.
– Creation of a framework for the extraction of the design from VHDL.
– Design and implementation of static global layout algorithms.
– Exploitation of hierarchy information for the extraction of regular patterns.

The proposed approaches have been validated through experimental evaluations

Page 44: UIC Panella Thesis

Conclusion: future work

Improvements
– Go beyond the inherent greediness of clustering
– More powerful closeness metrics
– More accurate time estimation function for block reuse

Additions
– Development of a robust and effective routing algorithm for both static and dynamic implementations
– Partitioning and placement for dynamically-interconnected structures
– Binding and scheduling of application blocks on the instantiated clusters

Page 45: UIC Panella Thesis

The end.

Questions?

Page 46: UIC Panella Thesis

That’s all folks!

Thank you.

How ’bout a funny joke?