Channel Width Reduction Techniques for System-on-Chip Circuits in
Field-Programmable Gate Arrays
Marvin Tom
University of British Columbia
Department of Electrical and Computer Engineering
Vancouver, BC, Canada
Contributions
• Two new FPGA benchmark circuit “suites”
  – Meta Circuit: mimics “System-on-Chip” design by randomly “stitching” real designs
  – Stdev: synthetic clones of Meta Circuit, used to vary interconnect demand
• Two new FPGA CAD flows
  – DHPack: Design Hierarchy Packing
    • Identify congested IP blocks → depopulate → reduced interconnect demand
    • Conference paper: “Logic Block Clustering…”, published at DAC 2005
  – Un/DoPack: UnPack and DoPack
    • Find “local” interconnect congestion → depopulate → reduced interconnect demand
    • Conference paper, submitted to DAC 2006
• Discoveries…
  – “Non-uniform” depopulation limits area inflation
  – “BLE limiting” gives better interconnect controllability than “Input limiting”
  – “Interconnect variation” is important for area inflation and FPGA architecture design
  – “Routing closure” achieved by re-clustering and incremental place & route
• UNROUTABLE circuits made ROUTABLE: buy an FPGA with MORE LOGIC!!!
Mesh-Based FPGA Architecture
• 9 logic blocks
• 4 wires per channel
• 3*4=12 total horizontal tracks
[Figure: a 3 x 3 array of logic blocks (L) next to a 4 x 4 array]
• Larger FPGAs have more “aggregate” interconnect
• 16 logic blocks
• 4 wires per channel
• 4*4=16 total horizontal tracks
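The track arithmetic on this slide can be reproduced with a small sketch. It follows the slide's own counting (one horizontal channel per row of logic blocks); the function name is illustrative and not part of any tool mentioned in the talk.

```python
def total_horizontal_tracks(rows: int, wires_per_channel: int) -> int:
    """Aggregate horizontal tracks, counting one channel per row of logic blocks."""
    return rows * wires_per_channel


# The two arrays from the slide: 3 x 3 and 4 x 4, both with 4 wires per channel.
for side in (3, 4):
    blocks = side * side
    tracks = total_horizontal_tracks(side, wires_per_channel=4)
    print(f"{blocks} logic blocks -> {tracks} total horizontal tracks")
```

The channel width stays at 4 in both cases; only the array grows, which is why a larger FPGA offers more aggregate interconnect.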
Logic Utilization vs. Channel Width
[Figure: routed channel width (y-axis, 10 to 90) vs. CLB count (x-axis, 0 to 700) for the benchmark circuits alu4, apex2, apex4, bigkey, clma, des, diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417, s38584, seq, spla and tseng; the points for pdc, ex1010, frisc, spla, apex4 and elliptic are labeled]
• Trade-off logic utilization for channel width
  – User can always buy more logic… (not more wires)
[Figure: FPGA 1 and FPGA 2, a smaller and a larger array of logic blocks (L)]
Trade-off: CLB count for channel width
But… can we achieve lower Total Area? (= SIZE * CLB Count)
(No! But we can break even!)
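The break-even question can be made concrete with a minimal sketch of the slide's formula, Total Area = SIZE * CLB Count. The function names and all numbers below are illustrative assumptions, not measurements from the talk.

```python
def total_area(tile_size: float, clb_count: int) -> float:
    """Total area as defined on the slide: per-CLB tile size times CLB count."""
    return tile_size * clb_count


def break_even_clb_count(old_tile_size: float, old_clb_count: int,
                         new_tile_size: float) -> float:
    """CLB count at which a smaller tile (narrower channel) gives the same total area."""
    return total_area(old_tile_size, old_clb_count) / new_tile_size


# Hypothetical example: narrowing the channel shrinks each tile from 2.0 to 1.25
# area units, so the design may spread over more CLBs before total area grows.
baseline = total_area(2.0, 100)
limit = break_even_clb_count(2.0, 100, 1.25)
print(f"baseline area {baseline}, break-even at {limit} CLBs")
```

Depopulation pays off only while the CLB count stays at or below that break-even value.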
Logic Element: BLE and CLB
• Basic Logic Element (BLE)
  – ‘k’-input LUT + FF
• Configurable Logic Block (CLB)
  – ‘N’ BLEs, ‘N’ outputs
  – ‘I’ shared inputs
[Figure: a CLB containing BLE #1 to BLE #5, with ‘I’ inputs and ‘N’ outputs, shown next to a 4 x 4 array of logic blocks (L)]
Note: I < k*N
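A minimal sketch of the three CLB parameters and the I < k*N note. N = 16 and I = 51 are the values quoted on the results slide of this talk; k = 4 is an assumption, since the talk does not state the LUT size. The class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClbArchitecture:
    k: int  # inputs per LUT (assumed value below)
    n: int  # BLEs, and therefore outputs, per CLB
    i: int  # shared inputs per CLB

    @property
    def total_lut_pins(self) -> int:
        """Number of LUT input pins inside the CLB."""
        return self.k * self.n

    def inputs_are_shared(self) -> bool:
        """True when I < k*N, i.e. the CLB has fewer inputs than LUT pins."""
        return self.i < self.total_lut_pins


arch = ClbArchitecture(k=4, n=16, i=51)
print(arch.total_lut_pins, arch.inputs_are_shared())
```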
CLB Depopulation
• General Approach
  – Use existing clustering tools
  – Do not fill CLB while clustering
1. Input-Limited
  • E.g. maximum 67% input utilization per CLB
  • Might use all BLEs
2. BLE-Limited
  • E.g. maximum 60% BLE utilization per CLB
  • Might use all inputs
[Figure: a CLB containing BLE #1 to BLE #5, with ‘I’ inputs and ‘N’ outputs]
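The two depopulation styles amount to lowering one of the two capacity limits a clustering tool checks. The sketch below is a hedged illustration: the function names are hypothetical, rounding down is an assumption, and N = 16, I = 51 are taken from the results slide.

```python
def depopulation_caps(n_bles: int, n_inputs: int,
                      ble_pct: int = 100, input_pct: int = 100) -> tuple[int, int]:
    """Return (max BLEs, max inputs) a clusterer may use in one CLB.

    Percentages are integers; fractions are rounded down (an assumption).
    """
    return n_bles * ble_pct // 100, n_inputs * input_pct // 100


def fits(used_bles: int, used_inputs: int, caps: tuple[int, int]) -> bool:
    """Check a candidate cluster against the (possibly reduced) limits."""
    max_bles, max_inputs = caps
    return used_bles <= max_bles and used_inputs <= max_inputs


# Input-limited at 67%: every BLE may still be used, inputs are capped.
print(depopulation_caps(16, 51, input_pct=67))
# BLE-limited at 60%: every input may still be used, BLEs are capped.
print(depopulation_caps(16, 51, ble_pct=60))
```

An existing clustering tool can stay unchanged apart from being handed the reduced limits, which matches the "use existing clustering tools" approach on the slide.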
Reducing Channel Width Results
(max cluster size 16, max num inputs 51)
• Input-Limited
  • No channel width control
• BLE-Limited
  • (almost) monotonically increasing → good channel width control
[Figure: routed channel width (y-axis, 30 to 90) for clma vs. cluster size (BLE-limit, 1 to 17) and number of inputs (input-limit, 6 to 54); series: Input-Limited clma, BLE-Limited clma]
Meta Benchmark Circuit Creation
• Mimic process of creating large designs
  – “IP Blocks” <==> MCNC Circuits
  – SoC <==> Randomly integrate/stitch together “IP Blocks”
  – IP Blocks have varied interconnect needs
• Considered 3 stitching schemes…
  – Independent
    • IP Blocks are not connected to each other
  – Pipeline
    • Outputs of one IP block connected to inputs of next IP block
  – Clique
    • Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
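The three stitching schemes can be sketched as follows. This is an illustration under stated assumptions, not the tool used in the talk: the names `IpBlock` and `stitch` are hypothetical, outputs are assigned round-robin rather than randomly, each output drives exactly one input, and a wire is dropped when the destination has no free input left.

```python
from typing import NamedTuple


class IpBlock(NamedTuple):
    name: str
    n_inputs: int
    n_outputs: int


# A connection is (source block, output index, destination block, input index).
Connection = tuple[str, int, str, int]


def stitch(blocks: list[IpBlock], scheme: str) -> list[Connection]:
    free_input = {b.name: 0 for b in blocks}  # next unused input of each block
    connections: list[Connection] = []

    def connect(src: IpBlock, out_idx: int, dst: IpBlock) -> None:
        # Each input is driven at most once; drop the wire if dst is full.
        if free_input[dst.name] < dst.n_inputs:
            connections.append((src.name, out_idx, dst.name, free_input[dst.name]))
            free_input[dst.name] += 1

    if scheme == "independent":
        pass  # IP blocks are not connected to each other
    elif scheme == "pipeline":
        # Outputs of one block feed the inputs of the next block in the list.
        for src, dst in zip(blocks, blocks[1:]):
            for out_idx in range(src.n_outputs):
                connect(src, out_idx, dst)
    elif scheme == "clique":
        # Outputs of each block are spread evenly over all other blocks.
        for i, src in enumerate(blocks):
            others = blocks[:i] + blocks[i + 1:]
            if not others:
                continue
            for out_idx in range(src.n_outputs):
                connect(src, out_idx, others[out_idx % len(others)])
    else:
        raise ValueError(f"unknown stitching scheme: {scheme}")
    return connections


demo = [IpBlock("A", 2, 2), IpBlock("B", 3, 1), IpBlock("C", 1, 1)]
for scheme in ("independent", "pipeline", "clique"):
    print(scheme, stitch(demo, scheme))
```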
DHPack: Meta Circuit P&R
• Use VPR FPGA tools from University of Toronto
• Observation 1
  – VPR placer successfully groups IP blocks from random initial placement
• Observation 2
  – VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }
DHPack: Meta Circuit P&R Results
• Clique MetaCircuit
  – P&R channel width results closely match “constraints”
• Shrink Channel Width
  – by ~20% (from 95 to 75): NO AREA INCREASE
  – by ~50% (from 95 to 50): 1.7x area increase
[Figure: normalized area (y-axis, 0.8 to 2) vs. channel width constraint (x-axis, 40 to 100)]
[Figure: channel width (y-axis, 40 to 100) vs. channel width constraint (x-axis, 40 to 100); series: Constraint, Routed]
Meta Circuits vs. Stdev Circuits
• Meta Circuit Drawbacks
  – Design hierarchy boundaries not well-defined
  – Coarse-grained IP block boundary
  – Stitching unrealistic
    • Flip-flop placed at every output
    • Connections only have a fanout of 1 (FO1)
• Stdev Circuits (created using GNL)
  – Synthetic clone of Meta circuits
  – Hierarchically specify the Rent parameter of each partition
    • Root: # I/Os, # IP blocks
    • Second level: 20 IP blocks, # LEs, Rent parameter
Stdev Circuits: Rent Parameters
• 7 benchmark circuits
• 240/120 primary inputs/outputs, approx 52,000 CLBs
• Rent parameter: average 0.62, vary stdev 0.0 to 0.12
[Figure: Rent parameter (y-axis, 0.35 to 0.80) of each MCNC circuit IP block (x-axis labels include bigkey, s38584.1, elliptic, diffeq, s298, alu4, misex3, pdc, ex5p, ex1010); series: Stdev000, Stdev002, Stdev004, Stdev006, Stdev008 / meta clone, Stdev010, Stdev012]
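For readers unfamiliar with the Rent parameter: Rent's rule estimates the number of external terminals of a block as T = t * G^p, where G is the number of logic elements, t the average terminals per element, and p the Rent parameter. The sketch below shows the rule and one simple way to build 20 per-block parameters with a chosen mean and standard deviation. The alternating pattern is purely an assumption for illustration; the talk does not say how the per-block values were chosen.

```python
import statistics


def rent_terminals(t: float, g: int, p: float) -> float:
    """Rent's rule: expected external terminals of a block with g logic elements."""
    return t * g ** p


def spread_rent_parameters(mean: float, stdev: float, count: int = 20) -> list[float]:
    """Alternate mean+stdev and mean-stdev so the population stdev equals `stdev`.

    Hypothetical scheme, used only to show the effect of a wider spread.
    """
    if count % 2:
        raise ValueError("count must be even for the alternating pattern")
    return [mean + stdev if i % 2 == 0 else mean - stdev for i in range(count)]


for stdev in (0.0, 0.12):
    params = spread_rent_parameters(0.62, stdev)
    demand = [rent_terminals(1.0, 1000, p) for p in params]
    print(f"stdev {stdev}: mean p = {statistics.mean(params):.2f}, "
          f"terminals range {min(demand):.0f} to {max(demand):.0f}")
```

With the same average parameter, a wider spread produces some blocks with much higher interconnect demand than others, which is the variation the Stdev suite is meant to control.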
Un/DoPack Flow
• Iterative non-uniform cluster depopulation tool
• Step 1: Traditional SIS/VPR
• Step 2: UnPack:
  – Congestion Calculator
• Step 3: DoPack:
  – Incremental Re-Cluster
• Step 4,5: Fast Place/Route
[Figure: Un/DoPack flowchart. Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint. Boxes: Synthesize and Technology Map (SIS/Flowmap); Cluster (iRAC Replica); Placement (VPR); Routing (VPR); Congestion Calculator (UnPack); Incremental Cluster (DoPack); Fast Placement (Incremental or VPR); Fast Routing (VPR). Decisions: Channel Width Constraint Met? (Yes: Success!); Array Size Limits Reached? (Yes: Failure)]
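The control flow of the steps above can be sketched as a loop. This is a simplified stand-in, not the actual tool: `route_width` is a hypothetical callback that replaces the whole cluster/place/route chain, and each pass is assumed to grow the array by one row and column.

```python
from typing import Callable, Optional


def un_do_pack(route_width: Callable[[int], int], cw_constraint: int,
               array_size: int, max_array_size: int) -> Optional[tuple[int, int]]:
    """Return (array size, routed channel width) on success, None on failure."""
    # Step 1: traditional flow (synthesize, cluster, place, route).
    width = route_width(array_size)
    while width > cw_constraint:
        if array_size >= max_array_size:
            return None  # array size limit reached: failure
        # Steps 2 and 3: UnPack finds the congested region, DoPack re-clusters it
        # with fewer BLEs per CLB, which is modeled here as a larger array.
        array_size += 1
        # Steps 4 and 5: fast placement and routing of the depopulated design.
        width = route_width(array_size)
    return array_size, width


# Toy routing model (an assumption): channel width falls as the array grows.
def toy_route(size: int) -> int:
    return 1200 // size


print(un_do_pack(toy_route, cw_constraint=100, array_size=10, max_array_size=20))
```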
Un/DoPack Flow: SIS/VPR
• Step 1: Traditional SIS/VPR
  – Inputs: Circuit Description, Architecture Description, Channel Width Constraint, Array Size Constraint
  – Synthesize and Technology Map (SIS/Flowmap), Cluster (iRAC Replica), Placement (VPR), Routing (VPR)
  – Channel Width Constraint Met? Yes: Success! No: continue with UnPack
Un/DoPack Flow: UnPack
• Step 2: UnPack
  – Generate congestion map
  – CLB label = largest channel width occupancy in its 4 adjacent channels
  – Depopulation center = CLB with the largest label
  – Depopulation radius = M/4 (M x M array)
  – Depopulation amount: 1 new row/column in the array
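A hedged sketch of the UnPack labeling. The data layout is an assumption: `h_occ` holds the occupancy of the M+1 horizontal channel rows and `v_occ` that of the M+1 vertical channel columns, so CLB (r, c) touches `h_occ[r][c]`, `h_occ[r+1][c]`, `v_occ[r][c]` and `v_occ[r][c+1]`. Rounding M/4 down and using Manhattan distance for the region are also assumptions.

```python
def congestion_labels(h_occ: list[list[int]], v_occ: list[list[int]]) -> list[list[int]]:
    """Label each CLB with the largest occupancy among its 4 adjacent channels."""
    m = len(v_occ)
    return [[max(h_occ[r][c], h_occ[r + 1][c], v_occ[r][c], v_occ[r][c + 1])
             for c in range(m)]
            for r in range(m)]


def depopulation_region(labels: list[list[int]]):
    """Return (center, radius, CLBs within the radius of the center)."""
    m = len(labels)
    cells = [(r, c) for r in range(m) for c in range(m)]
    # Center: CLB with the largest label (first one found on ties).
    center = max(cells, key=lambda rc: labels[rc[0]][rc[1]])
    radius = m // 4
    region = [(r, c) for r, c in cells
              if abs(r - center[0]) + abs(c - center[1]) <= radius]
    return center, radius, region


# 4 x 4 array with a single hot horizontal channel segment.
h = [[0] * 4 for _ in range(5)]
v = [[0] * 5 for _ in range(4)]
h[2][1] = 9
print(depopulation_region(congestion_labels(h, v)))
```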
Un/DoPack Flow: DoPack
• Step 3: DoPack:
  – Incremental Re-Cluster
Un/DoPack Flow: Fast P&R
• Step 4,5: Fast Place/Route
• Fast Placement
  – UBC Incremental Placer (under development)
  – VPR “-fast” option
• Router
  – Use full routed solution
    • Slow but reliable
Before: 120 / 79 / 27 (Peak / Avg / Stddev)
After: 100 / 79 / 20 (Peak / Avg / Stddev)
Normalized Area of GNL Benchmarks
[Figure: normalized area (y-axis, 0.90 to 2.00) vs. % of maximum channel width (x-axis, 0.5 to 1.05); series: stdev000, stdev002, stdev004, stdev006, stdev008/clone, stdev010, stdev012]
Absolute Area of GNL Benchmarks
[Figure: normalized area (y-axis, 0.90 to 2.00) vs. absolute channel width (x-axis, 60 to 170); series: stdev000, stdev002, stdev004, stdev006, stdev008/clone, stdev010, stdev012]
Interconnect Variation: Impact on FPGA Architecture Design
[Figure: minimum routed channel width (y-axis, 70 to 140); series: Baseline, 10% Area Increase, 20% Area Increase, 25% Area Increase]
High Variation Circuits Require Wide Channel Width
Contributions
• Two new FPGA benchmark circuit “suites”
  – Meta Circuit: mimics “System-on-Chip” design by randomly “stitching” real designs
  – Stdev: synthetic clones of Meta Circuit, used to vary interconnect demand
• Two new FPGA CAD flows
  – DHPack: Design Hierarchy Packing
    • Identify congested IP blocks → depopulate → reduced interconnect demand
    • Conference paper: “Logic Block Clustering…”, published at DAC 2005
  – Un/DoPack: UnPack and DoPack
    • Find “local” interconnect congestion → depopulate → reduced interconnect demand
    • Conference paper, submitted to DAC 2006
• Discoveries…
  – “Non-uniform” depopulation limits area inflation
  – “BLE limiting” gives better interconnect controllability than “Input limiting”
  – “Interconnect variation” is important for area inflation and FPGA architecture design
  – “Routing closure” achieved by re-clustering and incremental place & route
• UNROUTABLE circuits made ROUTABLE: buy an FPGA with MORE LOGIC!!!
End of Talk