Compact Area and Performance Modelling for Coarse-Grained Reconfigurable Architectures
by
Kuang-Ping Niu
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2019 by Kuang-Ping Niu
Abstract
Compact Area and Performance Modelling for Coarse-Grained Reconfigurable
Architectures
Kuang-Ping Niu
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2019
We consider area and performance modelling for coarse-grained reconfigurable architec-
tures (CGRAs) and extend the open-source CGRA-ME (CGRA modelling and explo-
ration) framework to rapidly estimate these metrics. Area is modelled by synthesizing
commonly occurring CGRA primitives in isolation, and then aggregating the primitives’
component-wise areas. Performance is modelled by integrating a static-timing analysis
(STA) framework into CGRA-ME. The delays in the STA timing graph are based on
component-wise delays, as well as estimated interconnect delay. Experimental results us-
ing the estimation engine demonstrate reasonably accurate estimation for both area and
performance for different CGRA architectures, as well as different variations of the same
architecture. By offering fast and accurate estimation in an early phase of CGRA archi-
tecture exploration, the estimation engine allows the user to bypass the lengthy process
of a full VLSI implementation, and rapidly explore the area/performance architecture
space.
Acknowledgements
My great thanks go to my supervisor, Professor Jason Anderson. The advice and
guidance I have received from him extend well beyond the research we have done together.
Along with the exciting research project, he has inspired many interesting ideas for
tackling the many challenges I have faced. His mentorship and encouragement have helped me
reach important achievements in life. I am very grateful to have him as my supervisor.
I would like to thank my parents, my aunt, and my sister. For more than a decade
that I have been away from home, they have always been unconditionally supportive
from the other side of the phone. Big thanks to my family, for always encouraging me
when I was weak, and for sharing the happiness from my accomplishments.
Special credit goes to Cathy, for injecting non-technology-related elements into my
day to day life, for believing in me, and for constantly pushing me to challenge myself to
become more capable.
Lastly, many thanks to the bright minds in our research group: Xander, Brett, Jin
Hee, Julie, Joy, Ian, Matthew, Nick, and Austin. I greatly appreciate the friendly
advice, critical feedback, brain-picking conversations, camping, coffee, frisbee, and much
more, throughout these years. Thank you all for making the journey that much more
rewarding and lively.
Contents
List of Tables vi
List of Figures viii
List of Acronyms xi
1 Introduction 1
1.1 Introduction to Coarse-Grained Reconfigurable Architectures (CGRAs) . 3
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 8
2.1 Existing CGRA Architectures . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 ADRES Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 HyCUBE Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Existing CGRA CAD Tools . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 CGRA-ME Framework Overview . . . . . . . . . . . . . . . . . . . . . . 18
2.6 ASIC Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Prior Work on Hardware Performance and Area Estimation . . . . . . . . 21
iv
3 CGRA-ME – Estimation Engine 24
3.1 Architecture Modelling in CGRA-ME and the Primitive Modules . . . . 24
3.2 Characterization of Primitive Modules . . . . . . . . . . . . . . . . . . . 27
3.3 Area Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Performance Modelling and STA . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Interconnect Delay Estimation . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Experimental Studies 39
4.1 Target Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Target Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 ADRES-O Full VLSI Implementation Versus CGRA-ME Estimation . . . 43
4.3.1 CGRA-ME Estimation Results . . . . . . . . . . . . . . . . . . . 44
4.4 ADRES Architecture with Added Diagonal Connectivity . . . . . . . . . 53
4.4.1 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 ADRES Architecture versus HyCUBE Architecture . . . . . . . . . . . . 57
4.5.1 Architectural Differences: ADRES-O versus HyCUBE . . . . . . . 57
4.5.2 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusion and Future Work 62
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Appendix A Area Modelling 64
Appendix B Performance Modelling 67
Bibliography 69
List of Tables
3.1 Database of area and critical-path delay of the primitive modules as mapped
into the NanGate FreePDK45 45nm standard-cell library. . . . . . . . . . 30
4.1 ADRES with orthogonal interconnect (ADRES-O): Total core area of area-
optimized and delay-optimized designs, as well as estimation by CGRA-ME. 44
4.2 ADRES-O: Critical path delay of benchmarks for area-optimized and delay-
optimized targets from PrimeTime STA. . . . . . . . . . . . . . . . . . . 45
4.3 The critical path delay report of the mults1 benchmark, mapped onto
ADRES-O in the delay-optimized scenario, generated by PrimeTime and
CGRA Modelling and Exploration (CGRA-ME) Tatum (without account-
ing for interconnect delay). . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 The critical path delay report of the mults1 benchmark, mapped onto
ADRES-O in delay-optimized scenario, generated by PrimeTime and CGRA-
ME Tatum now with interconnect delays modelled based on Modulo Rout-
ing Resource Graph (MRRG) node fanouts. . . . . . . . . . . . . . . . . 50
4.5 The critical path delay report of the mults1 benchmark, mapped onto
ADRES-O in the delay-optimized scenario, generated by PrimeTime and
CGRA-ME Tatum now with interconnect delays modelled based on MRRG
node fanouts, and overridden for the multiplexer driving the Functional
Units (FUs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Correctness of reported critical paths: 1) ✓ – Tatum reported the same
path as PrimeTime, 2) ✗ – path from PrimeTime is not reported by
Tatum, 3) 1 – path from PrimeTime is reported as one of the top-four
critical paths from Tatum. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 ADRES with diagonal interconnect (ADRES-D): Total core area of area-
optimized and delay-optimized designs, as well as estimation by CGRA-ME. 55
4.8 Correctness of reported critical paths on ADRES-D architecture: 1) ✓ –
Tatum reported the same path as PrimeTime, 2) ✗ – path from Prime-
Time is not reported by Tatum, 3) 1 – path from PrimeTime is reported
as one of the top-four critical paths from Tatum. . . . . . . . . . . . . . . 56
4.9 HyCUBE: Total area of area-optimized and delay-optimized variants, as
well as estimation by CGRA-ME. . . . . . . . . . . . . . . . . . . . . . . 58
4.10 Correctness of reported critical paths on HyCUBE architecture: 1) ✓ –
Tatum reported the same path as PrimeTime, 2) ✗ – path from Prime-
Time is not reported by Tatum, 3) 1 – path from PrimeTime is reported
as one of the top-four critical paths from Tatum. . . . . . . . . . . . . . . 60
List of Figures
1.1 CPU trends from 1970s to 2010s: transistor count versus clock frequency. 2
1.2 Logic element comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 The Reconfigurable Pipelined Datapath (RaPiD) accelerator [excerpted
from [15,16]]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 The MorphoSys architecture [excerpted from [55]]. . . . . . . . . . . . . . 10
2.3 The PipeRench architecture [excerpted from [53]]. . . . . . . . . . . . . . 11
2.4 Mapping a (a) 5-virtual-stage application onto a (b) 3-physical-stage/stripe
PipeRench system [excerpted from [19]]. . . . . . . . . . . . . . . . . . . 12
2.5 The Dynamically Reconfigurable Processor (DRP) architecture [excerpted
from [58]]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 The ADRES CGRA system [excerpted from [43]]. . . . . . . . . . . . . . 15
2.7 The HyCUBE architecture. The Coarse-Grain Reconfigurable Array (CGRA)
consists of a 2D array of FUs connected by a multi-hop-capable crossbar switch
interconnect [excerpted from [30]]. . . . . . . . . . . . . . . . . . . . . . . 17
2.8 CGRA-ME framework overview . . . . . . . . . . . . . . . . . . . . . . . 19
2.9 Typical steps involved in Very-Large-Scale Integration (VLSI) design. . . 20
3.1 Illustrations of the two data structures in CGRA-ME representing a Pro-
cessing Element (PE) with 3 contexts. . . . . . . . . . . . . . . . . . . . 26
3.2 Characterization steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 An example CGRA architecture modelled as a tree of module objects in
CGRA-ME. The modules filled in blue are the primitive modules, while the
ones filled in red are non-primitive/composite modules. During area esti-
mation, the modules with red solid outlines would require either a database
lookup (primitive) or summation of all submodule areas (composite). In a
highly regular architecture, there can be multiple instances of each unique
composite module, hence in this example, only a few composite modules
require summation and have red solid outlines. . . . . . . . . . . . . . . . 32
3.4 MRRG of a 3-context PE, with the mapped resources highlighted in red. 34
3.5 Three categories of subgraph to compose a full timing graph in Tatum,
converted from their corresponding MRRG counterparts. . . . . . . . . . 35
3.6 Timing graph representing the mapped MRRG from Figure 3.4. . . . . . 36
3.7 Mapped MRRG and timing graph of the “sum” benchmark put side-by-
side, showing difference in graph complexity at a larger scale. . . . . . . . 36
3.8 Standard-cell layout for fanout delay scaling analysis on op_and with 16
fanout registers, with the upper, middle, and lower rats' nests representing
fanin registers, the primitive module, and fanout registers, respectively. . 37
3.9 Averaged fanout-delay of all primitive modules. . . . . . . . . . . . . . . 38
4.1 High-level view of the ADRES-like architectures and HyCUBE architec-
ture used in the experimental studies. . . . . . . . . . . . . . . . . . . . . 40
4.2 Illustration of the Dataflow Graphs (DFGs) of the 8 benchmarks used in
the experimental studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Standard-cell PnR architecture layout of ADRES – area-optimized (left)
vs. delay-optimized (right) side-by-side on the same scale. . . . . . . . . 44
4.4 Critical path delay comparison – CGRA-ME estimations without inter-
connect delays vs PrimeTime with interconnect delays. . . . . . . . . . . 46
4.5 After mapping benchmark conv2, we produced the partial MRRG rep-
resenting the used portion of the hardware. Without interconnect delay
taken into account, Synopsys PrimeTime and CGRA-ME report different
critical paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Critical path delay comparison – CGRA-ME estimates with MRRG fanout-
inferred interconnect delay versus PrimeTime. . . . . . . . . . . . . . . . 49
4.7 Critical path delay comparison – CGRA-ME estimations with selected
fanout count overridden vs PrimeTime. . . . . . . . . . . . . . . . . . . . 51
4.8 PEs from ADRES-O and ADRES-D put side-by-side for comparison. . . 54
4.9 PEs from ADRES-O and HyCUBE. . . . . . . . . . . . . . . . . . . . . . 58
A.1 ADRES-O Architecture: Area breakdown in µm2 . . . . . . . . . . . . . 64
A.2 ADRES-D Architecture: Area breakdown in µm2 . . . . . . . . . . . . . 65
A.3 HyCUBE Architecture: Area breakdown in µm2 . . . . . . . . . . . . . . 66
B.1 ADRES-D Architecture: Critical path delay comparison – CGRA-ME es-
timations with selected fanout count overridden versus PrimeTime. . . . 67
B.2 HyCUBE Architecture: Critical path delay comparison – CGRA-ME es-
timations with selected fanout count overridden versus PrimeTime. . . . 68
List of Acronyms
ADL Architecture Description Language. 16
ADRES Architecture for Dynamically Reconfigurable Embedded System. 14–16, 23,
25, 36, 57, 59
ADRES-D ADRES with diagonal interconnect. vii, x, 39, 53–57, 60, 65, 67
ADRES-O ADRES with orthogonal interconnect. vi, x, 39, 43–46, 48, 50, 52–60, 63,
64
ALU Arithmetic Logic Unit. 3, 9, 11, 12, 14, 15, 17, 29
API Application Programming Interface. 18
ASIC Application-Specific Integrated Circuit. 1, 2, 5–7, 28, 33, 62
CAD Computer-Aided Design. 5–8, 21, 53, 62
CBS Crossbar Switch. 41, 57, 58
CG-SMAC Coarse-Grained SMAC. 18
CGRA Coarse-Grain Reconfigurable Array. viii, ix, 2, 4–9, 12–17, 19, 22–25, 28, 29,
31–33, 36, 57, 60, 62, 63
CGRA-ME CGRA Modelling and Exploration. vi, viii–x, 4–8, 16–18, 24–29, 31–33,
37, 40, 43, 45–48, 50–52, 54, 55, 59–63
CPU Central Processing Unit. 4, 22
CSTC Central STC. 14
DFG Dataflow Graph. ix, 18, 33, 41–43
DMA Direct Memory Access. 10
DMU Data Management Unit. 14
DRESC Dynamically Reconfigurable Embedded System Compiler. 15, 16
DRF Data Register File. 39
DRP Dynamically Reconfigurable Processor. viii, 13, 14
DSE Design Space Exploration. 16, 18, 22, 23, 39, 53
DSP Digital Signal Processor. 1
FF flip-flop. 21
FPGA Field-Programmable Gate Array. 1, 2, 4, 5, 17, 21, 33, 36, 41, 63
FSM Finite State Machine. 14
FU Functional Unit. vi, viii, 5, 11, 12, 15, 17, 40, 41, 49–52, 58, 63
GPGPU General-Purpose Graphic Processing Unit. 22
GPU Graphic Processing Unit. 1, 4, 22
HDL Hardware Description Language. 4, 20, 27, 29, 31
I/O Input/Output. 5, 13, 25, 31
II Initiation Interval. 17
ILP Integer Linear Programming. 19
IMS Iterative Modulo Scheduling. 18
IP Intellectual Property. 29
IR Intermediate Representation. 16
LE Logic Element. 3
LUT Lookup Table. 3, 9, 12, 21
MP Memory Interface Port. 39–41, 44, 59
MRRG Modulo Routing Resource Graph. vi, ix, x, 16, 17, 19, 25–27, 33–36, 47, 48,
50, 52, 54
PE Processing Element. viii–x, 3, 11–17, 26, 27, 31, 33, 34, 39, 41, 53, 54, 57, 58, 63
PLD Programmable Logic Device. 1, 4
PnR Place-and-Route. 20, 21, 29, 33
RAA Reconfigurable Array Architecture. 23
RAM Random-Access Memory. 21
RaPiD Reconfigurable Pipelined Datapath. viii, 8–10
RC Reconfigurable Cell. 10, 11, 15
RF Register File. 39, 41, 57
RTL Register-Transfer Level. 4, 8, 25
SDC Synopsys Design Constraint. 33
SMAC Simultaneous Mapping and Clustering. 18
SPR Schedule, Place, and Route. 18
SRP Samsung Reconfigurable Processor. 22, 23
STA Static Timing Analysis. 6, 20, 21, 28, 33, 34, 37, 38, 41, 48, 55, 59, 61
STC State Transition Controller. 13, 14
VLIW Very-Long Instruction Word. 16, 57
VLSI Very-Large-Scale Integration. viii, 5, 8, 20, 21
VTR Verilog-to-Routing. 5, 6, 21
Chapter 1
Introduction
Computations, broadly speaking, can be implemented in software or in hardware. In
software, computations are typically specified in a high-level language and then executed
on a standard processor or an application-specific processor. Examples of the latter
include Graphic Processing Units (GPUs) or Digital Signal Processors (DSPs). When
computations are realized in hardware, frequently used options are custom Application-
Specific Integrated Circuits (ASICs), or Programmable Logic Devices (PLDs), such as
Field-Programmable Gate Arrays (FPGAs).
Until recently, Moore’s Law [44] and Dennard scaling [14] allowed improved logic
density, power and performance with each process generation. Unfortunately, perfor-
mance scaling plateaued in the mid-2000s. Figure 1.1 shows processor transistor counts
and clock frequency from 1970 until today. We observe, in the red points, that pro-
cessor clock frequencies have stalled in the single-digit GHz range. We can no longer
rely on process scaling to deliver higher computational throughput in standard proces-
sors. As such, other means must be used to meet insatiable consumer demand for higher
throughput and lower power. GPUs and DSPs offer such higher throughput for certain
applications: massively parallel floating point, and signal processing. On the other hand,
customized hardware, implemented as ASICs or using PLDs, can be precisely tailored to
Figure 1.1: CPU trends from 1970s to 2010s: transistor count versus clock frequency.
application needs, potentially providing orders-of-magnitude improvements in speed and
energy efficiency (e.g. [50]).
Custom ASICs deliver the highest density, speed and power efficiency for any given
application. However, their cost is prohibitively expensive for all but the highest-volume
applications, or those applications with aggressive speed and power constraints. FPGAs
can provide significant speed and energy advantages over processors (e.g. [38]). However,
owing to the overhead of FPGA programmability, they are 3-4× slower, 12× more power
hungry, and 3-35× less area efficient than custom ASICs for implementing a given ap-
plication [34]. Moreover, using FPGAs has traditionally required knowledge of hardware
design, and the compile times for large FPGA designs can run into hours or even days.
Coarse-Grain Reconfigurable Arrays (CGRAs) are an alternative style of PLD that offers
programmability, performance and power benefits over FPGAs in certain cases. Perfor-
mance and area modelling of CGRAs is the central topic of this thesis.
This chapter provides a general overview of what a CGRA is, and describes the
motivation for, and contributions of this thesis.
1.1 Introduction to Coarse-Grained Reconfigurable
Architectures (CGRAs)
A CGRA comprises a 2D grid of coarse-grained Arithmetic Logic Unit (ALU)-like Pro-
cessing Elements (PEs), interconnected by bus-based interconnect. This stands in con-
trast to FPGAs, which have a mix of coarse-grained and fine-grained logic elements,
and where individual logic signals are routed independently. Figure 1.2 compares a fine-
grained FPGA Logic Element (LE) (left) with a CGRA PE (right). As illustrated, the
FPGA LE contains a Lookup Table (LUT), which is a hardware implementation of a truth
table. The CGRA PE receives wide inputs and typically performs ALU-like operations on
such inputs, such as multiply, divide, etc. Because of their coarse-grained nature, CGRAs
have less area dedicated to programmability overhead, making them “less flexible” than
FPGAs. Despite this apparent weakness, CGRAs can excel in speed/power/area over
FPGAs in specific applications, where the CGRA processing element capabilities, and
the CGRA interconnect fabric align closely with application computational and commu-
nication needs.
(a) FPGA logic element. (b) CGRA processing element.
Figure 1.2: Logic element comparison.
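The distinction can be made concrete with a small behavioural sketch (hypothetical Python, not from the thesis): a k-input LUT is configured by 2^k truth-table bits and produces one output bit, whereas a CGRA PE is configured by a single opcode and operates on whole words.

```python
def lut_eval(truth_table, inputs):
    """Fine-grained FPGA LE: a k-LUT stores 2^k configuration bits and
    looks up the output bit addressed by its k input bits."""
    index = 0
    for bit in inputs:                    # inputs are single bits, MSB first
        index = (index << 1) | bit
    return truth_table[index]

def pe_eval(opcode, a, b, width=32):
    """Coarse-grained CGRA PE: one opcode selects an ALU-like operation
    on whole words; configuration is per-bus, not per-bit."""
    ops = {
        "add": lambda x, y: (x + y) & ((1 << width) - 1),
        "mul": lambda x, y: (x * y) & ((1 << width) - 1),
        "and": lambda x, y: x & y,
    }
    return ops[opcode](a, b)

# A 2-LUT configured as XOR: table entries for inputs 00, 01, 10, 11.
assert lut_eval([0, 1, 1, 0], [1, 1]) == 0
assert pe_eval("add", 7, 5) == 12
```

The configuration cost gap follows directly: the LUT needs 2^k bits per output bit, while the PE needs only an opcode (and multiplexer selects) for an entire bus-wide operation.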
As CGRAs are programmable at a higher level of abstraction than FPGAs, CAD tools
targeting CGRAs are responsible for fewer decisions. Putting it succinctly, CGRAs are
configurable at the bus level, not the bit level. CGRA CAD tools contain many of the
same steps as FPGA CAD tools, including technology mapping, placement, and routing;
however, the number of decisions in each of these steps is reduced considerably. This leads
to another advantage of CGRAs, which is CAD tool runtime. While mapping a large
application to an FPGA may take hours or days in the worst-case [21,29,52], CGRA CAD
tool run-times are expected to be closer to software compile times. Moreover, CGRA
applications are typically specified in high-level software languages, rather than in a
Register-Transfer Level (RTL) Hardware Description Language (HDL), as traditionally
used for FPGAs. Thus, CGRAs hold the promise of addressing the two key usability
challenges of FPGAs: 1) software programmability, and 2) fast compile times.
1.2 Motivation
While CGRAs offer certain advantages, their relatively late appearance (1990s) com-
pared to FPGAs (1980s), the reliable Moore’s Law scaling which reduced the incentive
to adopt CGRAs, and the lack of a killer CGRA application, are all reasons that CGRAs
have not taken PLD market share. Many CGRA architectures have appeared in literature
(e.g. [8,59]), and a few commercial CGRAs have been developed (e.g. [7,31,60]). However,
they are less studied than alternative computing platforms, such as Central Processing
Units (CPUs), GPUs, and FPGAs. For such platforms, architecture modelling and eval-
uation frameworks exist, allowing hypothetical architectures to be targeted, tested, and
compared with specific baseline architectures, or other hypothetical architectures. For
CGRAs, the only publicly accessible framework is CGRA Modelling and Exploration
(CGRA-ME) [11, 12], which is under active development at the University of Toronto. This
thesis contributes new modelling functionality to CGRA-ME.
Architecture modelling frameworks for CPUs, GPUs, and FPGAs allow hypothetical
architectures to be modelled and evaluated at an abstract level. The frameworks offer an
early preview of the cost, performance, and power of hypothetical architectures before
actual fabrication. For example, in Verilog-to-Routing (VTR) [39], area is estimated by
counting the number of minimum-width transistors required for the modelled FPGA.
The advantage of this high-level approach is to facilitate the rapid exploration of the
architectural space. Once good points in the architecture space have been identified, a
more detailed implementation of the desirable architectures can be performed to refine
early estimates. If, on the other hand, a full standard-cell or custom Very-Large-Scale
Integration (VLSI) implementation were performed for each architectural candidate, the
breadth of exploration would be severely hindered by lengthy ASIC Computer-Aided
Design (CAD) tool run-times, likely hours or days to produce each datapoint.
1.3 Contributions
While previous work [11] demonstrated that Verilog HDL, automatically produced by
CGRA-ME, could be pushed through commercial ASIC CAD tools (targeting standard
cells) to assess a CGRA’s area and performance, CGRA-ME offered no capability for
high-level area, performance, and power modelling. The work described in this the-
sis overcomes this limitation for the performance and area metrics. Our approach to
area and performance modelling is based on the notion that CGRAs are composed of
commonly occurring primitives, including multiplexers of various sizes, Functional Units
(FUs) with specific arithmetic capability, registers, register files, Input/Output (I/O)
ports, and so on. As such, we use standard-cell ASIC tools to create models of area and
delay for each such primitive, thereby producing a characterization library. Several such
libraries are constructed, representing, for example, area-optimized or delay-optimized
implementations of the primitives by the ASIC tools. The two optimization targets are
selected because die size and performance are common design objectives in varied appli-
cations. Overall CGRA area can then be estimated by aggregating primitive component
areas.
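The aggregation idea can be sketched in a few lines of Python (the module names and area figures below are hypothetical, not values from the actual characterization library): primitives resolve to a database lookup, composites to a sum over their submodules.

```python
# Hypothetical characterization database: primitive name -> area in um^2.
AREA_DB = {"mux4_32b": 150.0, "alu_32b": 2100.0, "reg_32b": 180.0}

class Module:
    """A node in the CGRA module hierarchy; no submodules => primitive."""
    def __init__(self, name, submodules=None):
        self.name = name
        self.submodules = submodules or []

    def area(self):
        if not self.submodules:                  # primitive: database lookup
            return AREA_DB[self.name]
        return sum(m.area() for m in self.submodules)  # composite: aggregate

pe = Module("pe", [Module("mux4_32b"), Module("mux4_32b"),
                   Module("alu_32b"), Module("reg_32b")])
cgra = Module("cgra_2x2", [pe] * 4)              # regular array reuses one PE model
print(cgra.area())                               # 4 * (150 + 150 + 2100 + 180) = 10320.0
```

Because regular architectures reuse the same composite module many times, the per-module result can also be memoized so each unique composite is summed only once.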
For performance, the individual primitive timing information is insufficient to gauge
the speed performance of an application as implemented on a modelled CGRA. We
therefore incorporated full Static Timing Analysis (STA) into CGRA-ME, leveraging an
open-source STA framework – Tatum, which is also used in VTR [46]. The primitive
delays are annotated onto the timing graph of the framework, allowing a critical path
delay report to be produced (in addition to a variety of other reports). In an experimental
study, we compare the area and performance of the primitive-based model to that of a full
ASIC implementation of the modelled CGRA, and demonstrate that the rapid estimation
model produces reasonably accurate estimates. The primary contribution of this work is
the capability to perform fast and accurate area/performance estimation for any given
architecture modelled within CGRA-ME. The main contributions of this thesis are:
• We present an extension to the CGRA-ME framework which accurately estimates
chip area for a standard-cell implementation based on a database of area charac-
teristics of components commonly used in CGRAs.
• We present an extension to the CGRA-ME framework which accurately estimates
per-benchmark critical path delay on a proposed CGRA based on both component-
wise timing characteristics, and interconnect delay inferred from the fanout count
of each submodule.
• We model and analyze three architectures in the CGRA-ME framework, and com-
pare the area/performance estimates with the area/performance of the same archi-
tectures implemented in standard cells using a full ASIC CAD flow. These experi-
ments confirm the viability of using the proposed estimators within CGRA-ME to
perform architecture studies.
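The STA component of the estimator boils down to a longest-path computation over a delay-annotated directed acyclic graph. A toy version (node names and delays invented for illustration; Tatum's actual interface is far richer) looks like this:

```python
from collections import defaultdict

def critical_path_delay(edges, topo_order):
    """Longest register-to-register path in a timing DAG.
    edges: dict mapping node -> list of (successor, delay_ns) pairs.
    topo_order: nodes in topological order."""
    arrival = defaultdict(float)                  # longest arrival time per node
    for u in topo_order:
        for v, d in edges.get(u, []):
            arrival[v] = max(arrival[v], arrival[u] + d)
    return max(arrival.values())

# Toy timing graph: register -> mux -> FU -> register.
edges = {
    "reg_a": [("mux", 0.10)],
    "mux":   [("fu", 0.05)],
    "fu":    [("reg_b", 0.80)],                   # FU delay dominates
}
delay = critical_path_delay(edges, ["reg_a", "mux", "fu", "reg_b"])
assert abs(delay - 0.95) < 1e-9                   # reg_a -> mux -> fu -> reg_b
```

In the real flow, the per-edge delays come from the primitive characterization library, plus interconnect delay inferred from fanout, as described in Chapter 3.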
1.4 Thesis Outline
The thesis is structured as follows:
• Chapter 2 - Background: This chapter provides a brief survey of existing CGRA
architectures, CAD tools, and studies. The chapter overviews the CGRA-ME
framework in detail.
• Chapter 3 - CGRA-ME Estimation Engine: This chapter details the method
to generically model area and performance for CGRAs.
• Chapter 4 - Experimental Results: This chapter presents results for experi-
ments where the estimation models are applied to gauge area/performance of hypo-
thetical CGRA architectures. Comparisons with full ASIC implementations of the
given architectures are made to validate the area/performance estimation engine.
• Chapter 5 - Conclusion and Future Work: This chapter summarizes the thesis
and discusses potential future work.
Chapter 2
Background
This chapter summarizes recent CGRA literature, including various architectures and soft-
ware frameworks, and reviews material relevant to the following chapters. We introduce
a wide set of CGRAs that have appeared in the literature, and present a more detailed
examination of two architectures, ADRES and HyCUBE, that we use as test vehicles in
this thesis. Then, we provide a detailed introduction to the CGRA-ME framework and its
core components, including kernel extraction, software architecture modelling, and RTL
generation. We overview conventional procedures to realize a full standard-cell VLSI
design. Lastly, we briefly describe various existing techniques for early rapid hardware
area/performance estimation, without involvement of the lengthy VLSI CAD flow.
2.1 Existing CGRA Architectures
An excellent overview of previously published CGRA architectures appears in [13]. Here,
to provide the reader with a sense of the range of CGRA architectures proposed, we
briefly review some of the most highly-cited architectures, highlighting their differences,
as well as observing important commonalities. The Reconfigurable Pipelined Datapath
(RaPiD) [15] was proposed in 1996, aiming to accelerate computation via pipelining.
The CGRA comprises both a datapath and control, as depicted in Figures 2.1a and 2.1b,
(a) Data path portion (b) Control path portion
(c) Bus connector component between bus segments
Figure 2.1: The RaPiD accelerator [excerpted from [15,16]].
respectively. The datapath circuit, also the main CGRA portion of the design, is dynami-
cally configured every cycle by the control circuit, which consists of statically programmed
LUTs, depicted in Figure 2.1b. Figure 2.1a represents the basic cell of the 1-D array,
scalable in the horizontal direction. Each basic cell consists of one multiplier, two ALUs,
three memory units, and six registers. These components are interconnected by lanes
of word-width (16 bit) bus segments of varying lengths, separated by bus connectors.
Depicted in Figure 2.1c, the bus connector is implemented using three multiplexers, two
tristate buffers, and one register. A bus connector allows directional signaling such as
left-to-right, right-to-left, and cutoff (connected to a driver on both ends). It also al-
lows latency control, with the output multiplexer selecting input from either the source
bus segment directly, or the registered data with one cycle latency. The RaPiD-I [15]
implementation replicates 16 basic cells. Effectively, the RaPiD architecture is a LUT-
controlled CGRA. While RaPiD offers good performance, it requires careful memory
partitioning in order to maximize parallelism. Data memory bandwidth is the main
limiting factor to the performance scalability [15].
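The bus connector's behaviour can be sketched as a small cycle-accurate model (an illustrative assumption of the semantics described above, not the RaPiD netlist): a direction setting picks which side drives, and a latency setting selects either the direct value or the one-cycle-delayed registered value.

```python
class BusConnector:
    """Behavioural model of a RaPiD-style bus connector: three directions
    (left-to-right, right-to-left, cutoff) and optional one-cycle latency."""
    def __init__(self, direction="l2r", registered=False):
        assert direction in ("l2r", "r2l", "cutoff")
        self.direction = direction
        self.registered = registered
        self._reg = 0                         # the connector's pipeline register

    def step(self, left, right):
        """One clock cycle: returns (value driven rightward, value driven leftward)."""
        if self.direction == "cutoff":        # segments isolated from each other
            return None, None
        src = left if self.direction == "l2r" else right
        out = self._reg if self.registered else src   # latency-control multiplexer
        self._reg = src                       # register captures the source bus
        return (out, None) if self.direction == "l2r" else (None, out)

bc = BusConnector("l2r", registered=True)
assert bc.step(7, 0) == (0, None)             # registered path: one-cycle latency
assert bc.step(9, 0) == (7, None)             # last cycle's 7 emerges now
```

The registered/unregistered choice is exactly the latency-control knob the text describes: the output multiplexer selects between the source segment and the register.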
(a) MorphoSys M1 chip architecture overview (b) MorphoSys Reconfigurable Cell (RC) array interconnect architecture
(c) Structure of a MorphoSys RC
Figure 2.2: The MorphoSys architecture [excerpted from [55]].
Subsequent to RaPiD, the MorphoSys architecture was introduced, specifically, the
M1 chip [56]. Figure 2.2a illustrates the architecture, consisting of main memory modules
external to the chip, a Direct Memory Access (DMA) controller, instruction/data cache,
a TinyRISC processor, a frame buffer, an RC array, and context memory for the RCs.
Equipped with a framebuffer fetching mechanism, MorphoSys specializes in multimedia
applications. The TinyRISC is a modified RISC processor, and it acts as the master to
the rest of the chip, sending control signals to the DMA controller, frame buffer, and
the RC array. The DMA controller is instructed by the TinyRISC processor to retrieve
frame and context data for the frame buffer and the RC array. Another external memory
module contains program instructions and data for the processor.
Figure 2.2b depicts the interconnect architecture for the 8×8 RC array. Within each
4×4 quadrant, there exists higher interconnectivity among the member RCs, with each
member RC accepting inputs from the nearest four neighbors, other RCs from the same
row and column, and crossbar output(s) if adjacent to another 4×4 quadrant. Each 4×4
quadrant is interconnected to its adjacent quadrants by the RCs on the corresponding
side, and these RCs are fully connected. Figure 2.2c depicts the implementation of the
RC, featuring two input multiplexers, an ALU with multiplier, shifter, output register,
feedback register file, as well as the context register which configures the components.
MorphoSys demonstrated superior performance on video compression, automatic target
recognition, and data encryption/decryption applications [55].
(a) PipeRench architecture overview (b) Structure of a PE in a PipeRench stripe
(c) Structure of a FU in a PipeRench PE
Figure 2.3: The PipeRench architecture [excerpted from [53]].
The PipeRench architecture [20], from Carnegie Mellon University, was introduced
around the same time as MorphoSys, and was later commercialized by Rapport [2].
Figure 2.3a depicts the general organization of the PipeRench architecture, featuring
multiple stripes, with each stripe being a 1D array of PEs. Each stripe is capable of
realizing a pipeline stage.
Figure 2.4: Mapping a (a) 5-virtual-stage application onto a (b) 3-physical-stage/stripe PipeRench system [excerpted from [19]].
When mapping an application to PipeRench, the application is first represented in
virtual stages, depicted in Figure 2.4a. Depending on the actual available physical re-
sources, the virtual stages can then be mapped onto a set of physical stages, depicted in
Figure 2.4b, where each physical stage corresponds to a stripe. As shown in Figure 2.3b,
each PE consists of an FU, shifters for each input to the FU, and a register file. However,
as depicted in Figure 2.3c, the PipeRench FU is unlike that of most other CGRAs, which
usually contain an ALU. The FU in PipeRench consists of 8 identically configured 3-LUTs, a
carry-chain, and a zero detector. Each 3-LUT is driven by: 1) one bit from input bus A,
2) one bit from input bus B, and 3) an X input from another FU, which drives the last
input signal of all 3-LUTs. The uniform 3-LUT array, along with the carry-chain, allows the
FU to perform 8-bit addition, subtraction, or arbitrary bit-wise manipulation. The X
input can be either carry-out from an adjacent PE or zero, making it possible to combine
more than one PE, forming wider arithmetic operations. The throughput of PipeRench
is dependent on the number of physical resources. Suitable applications, such as streaming
applications, exhibit high data locality.
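To make the FU organization concrete, here is a small behavioural sketch (our own simplification, not the actual PipeRench netlist; the function names and truth-table encoding are assumptions):

```python
# Behavioural sketch of a PipeRench FU: 8 identically configured 3-LUTs
# compute a bit-wise function of the A bit, B bit, and X input, while a
# dedicated carry chain extends this to 8-bit addition.
def make_lut3(truth):  # truth: 8-bit int; bit i = output for input pattern i
    return lambda a, b, x: (truth >> (a | (b << 1) | (x << 2))) & 1

def fu(a, b, x, truth, use_carry_chain=False):
    lut = make_lut3(truth)            # every LUT holds the same configuration
    out, carry = 0, x if use_carry_chain else 0  # X can seed the carry chain
    for i in range(8):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if use_carry_chain:           # sum bit = (a ^ b) ^ carry
            out |= (lut(ai, bi, 0) ^ carry) << i
            carry = (ai & bi) | (carry & (ai ^ bi))
        else:                         # pure bit-wise manipulation
            out |= lut(ai, bi, x) << i
    return out & 0xFF
```

For example, truth table `0x66` encodes `a ^ b`, so `fu(a, b, 0, 0x66, use_carry_chain=True)` performs an 8-bit add, while truth table `0x88` (`a AND b`) yields a bit-wise AND.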
(a) DRP-1 prototype architecture overview (b) Dynamically Reconfigurable Processor (DRP) tile, containing a State Transition Controller (STC) and 64 PEs (c) PE of a DRP tile
Figure 2.5: The DRP architecture [excerpted from [58]].
A commercial CGRA example is the DRP from NEC Corp. (now Renesas) [1, 45].
The implementation is described in [58]. Figure 2.5a gives a bird’s-eye view of the
architecture. As described in [23], the chip comprises the DRP core interfacing
via I/O ports, eight multipliers around the tiles, an external memory controller, and
a PCI controller. DRP-1 operates in a hierarchical fashion: 1) instructions for the 64
PEs within each tile are provisioned by a local Finite State Machine (FSM), called the
STC, located at the center of Figure 2.5b; 2) contexts for the 8 tiles within the DRP core
are provisioned by the Central STC (CSTC), located at the center of Figure 2.5a. The
structure of a tile is shown in Figure 2.5b, where the PEs and STC are surrounded by
memory units, with single-ported memory units at the top and bottom sides, labeled as
HMEM, as well as dual-ported memory units with their controllers at the left and right
sides, labeled as VMEM and Vmemctrl. As depicted in Figure 2.5c, each PE consists
of a Data Management Unit (DMU) capable of shifting and masking, an ALU without a
multiplier, an instruction table, an output register, and a register file. The details on inter-
PE and inter-tile interconnect architecture are not fully disclosed, but from Figure 2.5c,
we can see that a PE appears to possess flexible interconnect, with input and output bus
selectors from perhaps many PEs within the same tile.
The DRP architecture possesses many advantages over previously proposed CGRAs.
Much like MorphoSys and PipeRench, the DRP is scalable with a tile as the base unit.
Much like RaPiD, the DRP comes with control-flow capability, while equipped with
more complex organization and better scalability. A distinctive feature of the DRP is
the highly distributed memory units. While these impose data partitioning requirements
on the application, they are likely useful in many applications, particularly streaming.
Notably, we observe that all of these architectures consist of a similar set of basic modules.
2.2 ADRES Architecture
The Architecture for Dynamically Reconfigurable Embedded System (ADRES) CGRA
architecture [43] was proposed to overcome a perceived performance bottleneck of other
CGRA processor/accelerator systems arising from performance disparities between the
reconfigurable array and the connected processor. In ADRES, the processor is closely
coupled with the reconfigurable fabric. Figure 2.6a shows the ADRES system at a high
level, where instructions and data for the ADRES core are supplied by an external mem-
ory module. Figure 2.6b shows the ADRES core, where the FU and RC are merely labels
distinguishing PEs in the two operating modes.
(a) ADRES system overview: consisting of the ADRES core, instruction and data cache, and external memory. (b) ADRES core: CGRA body of the ADRES system, partitioned into the VLIW view and the reconfigurable matrix view. (c) RC: an array element in an ADRES core. (d) Dynamically Reconfigurable Embedded System Compiler (DRESC): compiler flow diagram.
Figure 2.6: The ADRES CGRA system [excerpted from [43]].
Dashed lines in the figure show the VLIW view and the reconfigurable matrix view
(CGRA). The top row of PEs, and a multi-ported register file are used in both views.
The ADRES PE is shown in Figure 2.6c. It consists of a local register file (RF) and an ALU
(labelled as FU in the figure). The inputs to the FU are two operands and a predicate.
Additionally, the ADRES architecture is a “template” instead of a fixed architecture,
allowing customization at various levels with an XML-based Architecture Description
Language (ADL).
In the VLIW mode, ADRES behaves like a VLIW processor, with instructions fetched,
decoded, dispatched, and so on. This mode is suitable for control-intensive program
segments. The CGRA mode is suitable for highly parallelized dataflow computation.
The two modes share data with one another through the shared register file. Effectively,
both the control and data path reside on the PE grid, efficiently reusing computational
resources in the core, and reducing the cost of data movement.
The DRESC compiler framework [42] is used to map applications onto ADRES. The
workflow is illustrated in Figure 2.6d, employing a C compiler frontend, producing the
Intermediate Representation (IR) of the program. The program is then partitioned into
a control-path and data-path. The control-path portion IR is compiled into Very-Long
Instruction Word (VLIW) instructions, to be executed in VLIW-mode. The ADL is
interpreted into an in-memory architecture model, called the Modulo Routing Resource
Graph (MRRG), which at a high level, is a graph-based representation of the CGRA
device. By MRRG scheduling, placement and routing, the data-path portion of the
IR is mapped and compiled into CGRA configurations, to be executed during CGRA-
mode. The work revolving around DRESC and ADRES has inspired many later works
on CGRAs, including the CGRA-ME project. In Chapter 4, Sections 4.3 and 4.4, we will
use ADRES as a target architecture to showcase Design Space Exploration (DSE) based
on CGRA-ME.
2.3 HyCUBE Architecture
Recently, the HyCUBE architecture [30] was proposed, featuring richer interconnect than
previous CGRAs. The name HyCUBE alludes to the concept of a high-dimensional
cube – a hypercube. The authors of HyCUBE argue that
Figure 2.7: The HyCUBE architecture. The CGRA consists of a 2D array of FUs connected by multi-hop-capable crossbar switch interconnect [excerpted from [30]].
the nearest-neighbor interconnect topology of existing architectures, and the requirement
to use FUs as route-throughs, introduce performance loss and mapping difficulty, leading
to higher Initiation Intervals (IIs) – the number of cycles between injections of new inputs into
the fabric. The authors argue that a flexible multi-hop interconnect would be a better
design choice.
The HyCUBE architecture is a 2D array of PEs, as shown in Figure 2.7. Each PE re-
ceives inputs from neighbouring PEs (North, South, East, West). The input registers to
the crossbar switch can be bypassed. The PE also contains a predicated ALU, a bypass-
able register on the ALU output, and most importantly, a crossbar switch, which drives
all neighbour PEs. The crossbar switches collectively realize an interconnection network
allowing arbitrary source/destination pairs to be routed with arbitrary cycle latency. This
style of interconnect closely resembles an island-style FPGA routing network, except that
routing is performed at the granularity of multi-bit buses.
The compiler framework of HyCUBE also models the architecture with an MRRG,
and uses heuristics to find a viable mapping with different cost functions, and by incre-
menting the allowed II. The HyCUBE authors claim that the crossbar switches contribute
a quarter of both total power consumption and chip area, while enabling much shorter
compilation runtime and higher throughput per watt. In Section 4.5, we will use CGRA-
ME to model and evaluate the area and performance of HyCUBE.
2.4 Existing CGRA CAD Tools
There are various architecture-specific CGRA DSE frameworks, but there are very few
generic CGRA modelling and exploration frameworks. In [9], a commercial framework
was extended to support high-level modelling of CGRAs. The commercial architecture
description language (ADL), LISA, was extended to support a CGRA coprocessor de-
scription, while the application mapping is handled by the Coarse-Grained SMAC (CG-
SMAC) algorithm inspired by Simultaneous Mapping and Clustering (SMAC) for FP-
GAs [37]. Schedule, Place, and Route (SPR) [18] is another generic CGRA mapping tool,
in which the mapping sub-problems – scheduling, placement, and routing – are solved using
Iterative Modulo Scheduling (IMS) [51], Simulated Annealing [33], and QuickRoute [36]
plus PathFinder [41], respectively.
2.5 CGRA-ME Framework Overview
CGRA Modelling and Exploration (CGRA-ME), as the name suggests, is a tool which
offers architecture exploration for CGRAs, as well as permitting research on CGRA CAD
algorithms. With CGRA-ME, both the architecture specification and an application
benchmark are inputs to the toolflow. CGRA-ME permits the scientific evaluation and
comparison of hypothetical CGRA architectures.
The CAD flow depicted in Figure 2.8 progresses from top to bottom. The main
user inputs are the architecture description and an application benchmark. Through
the LLVM kernel-extraction pass, key computations of the application benchmarks are
extracted and represented as Dataflow Graphs (DFGs). The architecture interpreter
accepts an architecture specification (either using XML or the C++ Application Program-
ming Interface (API)) as input and builds an in-memory device model of the architecture.
Figure 2.8: CGRA-ME framework overview
The in-memory model is an MRRG [42], which consists of nodes and edges. Nodes repre-
sent the CGRA's functional units, multiplexers, register files, I/Os, multi-bit buses, and
so on, and may be annotated with additional data, such as cycle latency. Edges repre-
sent electrical connectivity between nodes. As a whole, an MRRG models how a CGRA
functionally behaves. The in-memory architecture model can also be used to generate a
Verilog implementation of the architecture.
The mapping step will schedule, place, and route the DFG onto the MRRG. That
is, in mapping, each computation in the DFG must be associated with a functional unit
node in the MRRG, and each connection between nodes in the DFG (data dependency)
must be mapped to a series of routing nodes within the MRRG, thereby connecting the
relevant functional unit nodes, accordingly. The mapper offers two choices of algorithm
– a simulated-annealing-based approach [12], and an Integer Linear Programming (ILP)-
based approach [10]. If the mapping is feasible, this implies the modelled CGRA can be
configured to realize the computations of the DFG. The mapping result and architecture
model can then be used to generate a bitstream to configure the CGRA, and verify
functionality through RTL simulation.
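As a toy illustration of the routing requirement (assumed data layouts and names; this is not the CGRA-ME mapper), a mapping is feasible only if every DFG edge can be realized as a path of MRRG routing nodes between the mapped functional-unit nodes:

```python
from collections import deque

def route_exists(mrrg_adj, src, dst):
    """BFS over MRRG nodes; mrrg_adj maps a node to its fanout nodes."""
    seen, frontier = {src}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            return True
        for nxt in mrrg_adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def mapping_feasible(dfg_edges, placement, mrrg_adj):
    """placement: DFG operation -> MRRG functional-unit node."""
    return all(route_exists(mrrg_adj, placement[u], placement[v])
               for u, v in dfg_edges)
```

A real router must additionally prevent distinct signals from sharing routing nodes; this sketch checks reachability only.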
Figure 2.8 also highlights the contribution of this thesis: performance and area esti-
mation following the mapping step. There are two inputs to the estimation engine: 1) the
application benchmark as mapped onto the modelled CGRA; 2) profiles of the area and
delay of commonly used CGRA primitives. The estimation engine and characterization
data are elaborated upon in the next chapter.
2.6 ASIC Design Flow
We briefly review the steps of the standard-cell ASIC design flow, as this is relied upon in
subsequent chapters. The flow is shown in Figure 2.9. It consists of the following steps:
Figure 2.9: Typical steps involved in VLSI design.
1) Synthesis/Technology Mapping: An HDL file describing circuitry is input to the
synthesis tool. The logic functions in the circuit are optimized and its functionality
is mapped into standard cells drawn from a library. The synthesis tool accepts
area, timing and power objectives from the user and attempts to produce an im-
plementation that meets the user’s constraints.
2) Post-Synthesis Verification: The output circuit from step 1) undergoes verification,
involving early STA to verify timing validity, and simulation to verify functionality
and power/performance. If verification results meet all requirements, the netlist
files will be input to step 3). Otherwise the user goes back to step 1), iteratively
adjusting the implementation or synthesis objectives accordingly.
3) Place-and-Route (PnR)/Chip Layout: The standard-cell netlist is placed and routed
on the chip, again subject to user-provided constraints, such as floorplan, poros-
ity, and timing constraints. Following routing, interconnect parasitic capacitance,
resistance, and inductance can be extracted. These permit accurate post-layout
analysis of timing and power.
4) STA: With parasitic information extracted in step 3), a precise STA is per-
formed. Depending on the results, the user either proceeds to step 5), or revises
the floorplan/constraints in step 3) and re-performs PnR. In the worst case, the
user goes back to step 1), revises the synthesis objectives, and re-synthesizes the
design.
5) Post-Layout Verification: The final verification, which extensively verifies the design
layout in all aspects, such as timing, power, and functionality. After passing all
verification tests, the design would be ready for fabrication.
Although dependent on the size of the targeted design, the standard-cell design process is
expensive in many respects: engineering effort, CAD runtime at each step, and the cost of
proprietary library/tool licenses and fabrication. As the design process moves
toward the final verification stage, designers gradually gain confidence in how well
the design will perform post-fabrication. The length of the process, however, motivates
fast and accurate estimation of performance at an early stage. In Chapters 3 and 4,
we will employ the above VLSI design methodologies on various levels, from primitive
modules, to top level architecture.
2.7 Prior Work on Hardware Performance and Area
Estimation
For area/performance estimation of FPGAs, CPUs, and GPUs, there are pop-
ular, publicly available frameworks. VTR comes with a built-in performance and area
estimation engine for FPGAs. STA in VTR is carried out by Tatum, which models tim-
ing of FPGA primitive blocks, such as LUTs, flip-flops (FFs), Random-Access Memories
(RAMs), etc., and estimates critical path delay. GEM5 [5] is a CPU modelling and
simulation framework; given the operating frequency, cache sizes, and memory speed of
a targeted CPU system, it was shown to produce accurate estimates [6]. However, the
authors of [6] model an already-produced architecture and system. Hence, when designing
a new architecture from scratch, GEM5 offers limited capability to preview system per-
formance. McPAT [35], a multicore CPU power, area, and
timing modelling framework, was shown to be a viable extension to GEM5 [17], for core
area estimation. GPGPU-Sim [3], a General-Purpose Graphic Processing Unit (GPGPU)
simulation framework, models an architecture at a functional level, omitting many mi-
croarchitecture details. This makes GPGPU-Sim limited in ways similar to GEM5, and it
requires dedicated third-party tools for timing, area, and power estimation. In [47], the
authors pointed out some pitfalls and limitations of these existing CPU/GPU simulation
and modelling frameworks.
There is also prior work on modelling the area and performance of arbitrary systems.
CompEst and ChipEst [48] are accurate, fine-grained performance and area
modelling techniques for high-level design. Each ba-
sic building block is realized in various implementations/topologies with standard-cell
technology, and performance results of all variants are later used in a top-level estimator
to select implementations for all instances of basic building blocks used in a high-level
design, while satisfying all specified design requirements. The tool then reports an es-
timate of total area. In [40], the authors proposed a technique to accurately estimate
circuit interconnect delay, based on gate attributes such as pins, area, fanin and fanout
numbers. [49] presents wireload-aware synthesis and floorplanning, emphasizing the im-
portance of modelling interconnect. Another work [24] proposes an accurate early wire
characterization technique, employed in an alternative design flow.
For CGRAs, there is prior work on architecture DSE. Suh et al. [57] de-
scribed architectural DSE for the Samsung Reconfigurable Processor (SRP) (a variation
of ADRES [31]), which improved the performance and chip area of the SRP. Another
work [32] demonstrated DSE on a Reconfigurable Array Architecture (RAA), improving
performance, chip area, and power. Both DSE studies provide performance, area, and
power results.
However, there are fewer efforts providing architectural area and performance mod-
elling for CGRAs. In [4], the authors provided a detailed study on how different config-
urations of CGRA processing elements can affect performance measured in cycle-count
latency, but the paper does not elaborate on critical path delay since they did not specify
the implementation technology. In [61], we see performance and area estimation for a
CGRA implemented as an FPGA overlay. Although the above studies include area and
performance results, they are either tied to one specific CGRA architecture or implemen-
tation technology.
The CGRA-ME framework offers a generic mapper, targeting a user-specified archi-
tecture, and this thesis extends the framework to estimate performance and area by
accepting a user-specified area/performance profile based on a user-selected implemen-
tation technology.
Chapter 3
CGRA-ME – Estimation Engine
In this chapter, we discuss how the estimation engine in CGRA-ME is implemented. We
hypothesize that area and performance of a modelled CGRA architecture can be gauged
if area and performance of all sub-components can be characterized. For performance
estimation, we will describe why and how interconnect is taken into account. We will
also clarify several factors affecting how closely the results from the estimation engine
reflect an actual implementation. The estimation engine described in this chapter will
be used in the experiments detailed in Chapter 4.
3.1 Architecture Modelling in CGRA-ME and the
Primitive Modules
In Chapter 2, we described previous CGRA architectures. While these CGRA architec-
tures vary, they primarily consist of a similar set of primitive submodules. In CGRA-ME,
we provide a set of primitives for architecture designers to create the software architecture
model. The primitives are:
1) op_add: an adder unit
2) op_sub: a subtractor unit
3) op_mul: a multiplier unit
4) op_shl: a left shift unit
5) op_ashr: an arithmetic right shift unit
6) op_lshr: a logical right shift unit
7) op_and: a bitwise AND unit
8) op_or: a bitwise OR unit
9) op_xor: a bitwise XOR unit
10) mux_*to1: multiplexers of various sizes
11) register: a set of edge-triggered flip-flops
12) registerFile: an array of registers readable and writable through multiple ports
13) tristate: a tristate buffer, mainly used in external I/O ports
14) mem_unit: a memory unit, offering data load and store capabilities
With the above primitive modules, CGRA-ME can model many generic CGRA architec-
tures, such as ADRES and HyCUBE.
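As a rough behavioural reference for the arithmetic/logic primitives (our own sketch, not CGRA-ME code; masking emulates a 32-bit datapath, and inputs are assumed to already be 32-bit values):

```python
# Bit-accurate reference semantics for the 32-bit ALU primitives.
MASK = 0xFFFFFFFF  # emulate a fixed 32-bit datapath

PRIMS = {
    'op_add':  lambda a, b: (a + b) & MASK,
    'op_sub':  lambda a, b: (a - b) & MASK,
    'op_mul':  lambda a, b: (a * b) & MASK,
    'op_shl':  lambda a, b: (a << (b & 31)) & MASK,
    'op_lshr': lambda a, b: (a & MASK) >> (b & 31),
    # Arithmetic shift: reinterpret a as signed before shifting.
    'op_ashr': lambda a, b: ((a - ((a >> 31) << 32)) >> (b & 31)) & MASK,
    'op_and':  lambda a, b: a & b,
    'op_or':   lambda a, b: a | b,
    'op_xor':  lambda a, b: a ^ b,
}
```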
Modelling an architecture in CGRA-ME is very similar to the process of RTL de-
sign. CGRA-ME constructs a tree data structure to represent a hierarchy of modules,
with bottom-level nodes comprised of only primitive modules. Each node encapsulates
the connections among its submodules/child-nodes with a list. The module tree data
structure is used to construct the corresponding MRRG of the architecture, which is a graph
representing the architecture functionality (nodes) and connectivity (nodes and edges).
Each primitive module comes with a corresponding MRRG representation. When con-
structing a full architecture MRRG, the tool flattens the module tree data structure, and
connects all sub-MRRGs accordingly.
(a) Module object representing the PE.
(b) MRRG representing the routing and functionality of the PE
Figure 3.1: Illustrations of the two data structures in CGRA-ME representing a PE with 3 contexts.
Figure 3.1 depicts how an example PE with three contexts and three supported oper-
ations is represented in CGRA-ME. The module tree data structure closely resembles the
physical hierarchy of the architecture, while the MRRG closely resembles the operational
structure of the architecture. When creating the MRRG, a non-primitive module contains
all lower-level MRRGs, along with lists of ports and connections used to connect its
sub-MRRGs. We hypothesize that when the area and performance characteristics of the
primitive modules are known, realistic area and performance estimation can be performed
by leveraging the existing data structures.
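The flattening step can be sketched as follows (hypothetical class and field names, not the actual CGRA-ME C++ classes):

```python
class Module:
    """A node in the module tree: child modules plus a connection list."""
    def __init__(self, name, children=(), connections=()):
        self.name = name
        self.children = list(children)
        self.connections = list(connections)   # (src_port, dst_port) pairs

def flatten(module, edges=None):
    """Collect every connection in the hierarchy into one flat edge list,
    mirroring how sub-MRRGs are stitched into the full architecture MRRG."""
    edges = [] if edges is None else edges
    edges.extend(module.connections)           # this level's connectivity
    for child in module.children:
        flatten(child, edges)                  # recurse down the hierarchy
    return edges
```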
3.2 Characterization of Primitive Modules
Figure 3.2: Characterization steps.
Figure 3.2 shows the steps to characterize primitive area and performance. We first
retrieve the HDL files of the target architecture, which include all primitive modules
used in it. The primitive modules are synthesized in the technology of interest. In this
thesis, we target the 45nm NanGate FreePDK45 Generic Open Cell Library [54]. After
synthesis, we obtain area and performance characteristics of the primitives. The char-
acterization results are stored in a database. The database entries then serve as inputs
to the area/performance estimation engine. The approach is analogous to that used for
commercial FPGAs, wherein delay characterization data for key sub-circuits are stored
in a database, whose entries are then recalled during STA. In addition to the charac-
terization of primitive modules, we also modelled interconnect delay, elaborated upon
in Section 3.5. To obtain primitive module characterization, as well as an interconnect
delay model, we use the following CAD tools:
1) Technology mapping/synthesis: Synopsys Design Compiler
2) Place and route: Cadence Innovus
3) Timing analysis: Synopsys PrimeTime
In an ASIC standard-cell flow, constraints can be applied during technology map-
ping to optimize for area, delay, or a combination of the two. Individual cells (e.g. 2-
input NAND) are available in multiple drive strengths, and delay-driven mappings will
tend to use larger cells with higher drive strengths. CGRAs may be used in a high-
performance context or a low-power embedded context and, as such, we opted to gen-
erate two databases of area/performance values for the primitives: an area-optimized
database wherein the ASIC tools were executed with a minimum-area objective, and a
delay-optimized database wherein the ASIC tools were executed with a minimum-delay
objective. During synthesis, a minimum-area objective will guide the tool to select smaller
cells with weaker drive strengths (slower signal transitions), whereas a minimum-delay
objective will guide the tool to select larger cells with stronger drive strengths (faster sig-
nal transitions). The two databases permit a human architect user of CGRA-ME to select
between the databases according to the intended CGRA usage. However, should the user
have custom constraints for the primitive modules, the tool is capable of incorporating a
user-generated database, too.
Layout quality and cell variety of a standard-cell library are the primary factors
dictating how well the synthesis can align the design with the synthesis constraints;
however, there are other factors that require attention from designers. A synthesis tool
can often realize a design using only standard cells from a library, but there are
many occasions when the synthesis tool must infer the implementation of part(s) of a design. In a
Verilog design, arithmetic operations are often expressed with generic syntax: “assign c
= a + b;”, which requires the synthesis tool to infer implementation of the “+” operator
from a library of implementations. This is why proprietary synthesis tools are sometimes
equipped with an Intellectual Property (IP) core library. For instance, Synopsys supports
the Design Compiler with their DesignWare IP core library, which contains a wide range
of arithmetic operations.
Verilog implementations generated by CGRA-ME employ the generic syntax, allow-
ing users to select their preferred IP core library. When synthesizing ALU operation units
such as op_add and op_mul, we leverage the DesignWare library [28]. However, when an
HDL file is not sufficiently specific, the synthesis tool may select an IP core unintended
by the designer. For instance, we observed that when synthesizing op_mul without spec-
ifying an IP core, Design Compiler ended up selecting an unsigned and a signed multiplier
for the area-optimized and delay-optimized targets, respectively, which are functionally
different. For our experiments in Chapter 4, both area-optimized and delay-optimized
implementations must have the same functionality in order to draw fair comparisons. We
use the unsigned multiplier when synthesizing op_mul for both targets.
Table 3.1 shows a portion of the two databases used in Chapter 4. The left-most
column lists the primitive. The next two columns give the area, in square microns, for
each primitive in each of the two databases. These numbers reflect the total silicon real
estate required for the standard-cell transistors, plus the additional space required for
PnR. They are derived by dividing the total cell area by the layout utilization factor/density
(standard-cell area per core area). A higher utilization factor implies heavier congestion
and longer PnR runtime. Since the CGRAs discussed in Chapter 4 are relatively
small, we set the utilization factor to 0.8, while a typical value is around 0.7 [22, 27].

                         Area [µm²]                    Critical Path Delay [ns]
Target            Area Optimized  Delay Optimized    Area Optimized  Delay Optimized
op_add 32b               168.00          536.00               2.78            0.37
op_sub 32b               190.00          539.00               2.80            0.40
op_multiply 32b         2860.00         3008.00               1.12            1.10
op_and 32b                43.00           43.00               0.03            0.03
op_or 32b                 43.00           74.00               0.05            0.04
op_xor 32b                64.00           64.00               0.06            0.06
op_shl 32b               456.00          491.00               0.53            0.43
op_ashr 32b              456.00          470.00               0.53            0.45
op_lshr 32b              456.00          470.00               0.53            0.45
mux_2to1 32b              74.00           88.00               0.06            0.07
mux_4to1 32b             147.00          166.00               0.06            0.06
mux_5to1 32b             179.00          283.00               0.06            0.11
mux_6to1 32b             215.00          245.00               0.09            0.07
mux_7to1 32b             252.00          340.00               0.07            0.07
mux_8to1 32b             286.00          325.00               0.07            0.07
RF_1in_2out 32b         1123.00         1231.00               0.07            0.08
RF_4in_8out 32b         9307.00        11568.00               0.17            0.20
register 32b             214.00          222.00               0.01            0.01
tristate 32b              86.00          102.00               0.41            0.22
const 32b                214.00          222.00               0.01            0.01

Table 3.1: Database of area and critical-path delay of the primitive modules as mapped
into the NanGate FreePDK45 45nm standard-cell library.

The right-most two columns give the delay, in ns, for each of the primitives. The
delay values recorded for register, RF_1in_2out, and RF_4in_8out are the delay (wire,
combinational, or both) from input pin to the D-input of the registers. While not listed
in Table 3.1, clock-to-Q delays, setup and hold times of the registers are also entries in
the databases.
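The cell-area-to-silicon-area conversion described above amounts to a simple division (an illustrative helper; the 134.4 µm² figure is a made-up example input):

```python
def core_area(total_cell_area_um2, utilization=0.8):
    """Reported silicon area = standard-cell area / layout utilization."""
    return total_cell_area_um2 / utilization

# e.g. 134.4 um^2 of cells at 0.8 utilization yields ~168 um^2 of silicon
```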
From Table 3.1, we observe that in a few cases, the delay for the delay-optimized target
is actually slightly higher than for the area-optimized target, such as for mux_2to1,
mux_5to1, RF_1in_2out, and RF_4in_8out. While these delay differences are under
0.05ns, there are a few factors that can lead to these results. When a targeted design is
trivial, and only a small set of cells is available to the synthesis tool, there is a higher
chance that the synthesis results may not align well with the synthesis objectives. Most
of the primitive modules are trivial in design, and while there is open-source merit in
using the NanGate FreePDK45 library, it has reduced cell variety compared to propri-
etary standard-cell libraries. While it is possible to cherry-pick these cases and
modify their constraints or augment the HDL design towards the synthesis objective, we
decided to keep synthesis constraints and HDL files the same for consistency. Broadly
speaking, we observe that the delay-optimized primitives are larger and faster than the
area-optimized primitives.
Note that the estimation engine is independent of the specific standard-cell target
technology, because the ASIC design flow is the same regardless of technology node or
standard-cell library. This means that users can redefine entries in the database based on
the standard-cell and IP libraries available to them, and the results of our CGRA
estimation engine will properly reflect the target technology and IP. The databases are
human-readable INI files and easy to read/modify.
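For instance, such an INI database can be parsed with Python's standard configparser (the section and key names below are illustrative, not the exact keys used by CGRA-ME):

```python
import configparser

# Illustrative database fragment in INI form (values from Table 3.1).
ini_text = """
[op_add_32b]
area_um2 = 168.0
delay_ns = 2.78

[mux_4to1_32b]
area_um2 = 147.0
delay_ns = 0.06
"""

db = configparser.ConfigParser()
db.read_string(ini_text)
area = db['op_add_32b'].getfloat('area_um2')   # -> 168.0
```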
3.3 Area Modelling
As mentioned previously, the CGRA-ME framework has an architecture interpreter,
which constructs a tree data structure modelling the targeted CGRA. The architec-
ture model is represented hierarchically, with a top-level module representing the entire
CGRA, and second-level modules representing CGRA tiles in the two-dimensional array,
and so on. Within CGRA-ME, an architect may specify modules with arbitrary levels of
hierarchy. However, at the bottom of the module hierarchy lie primitive modules, which
align with those discussed in the previous section, for which area and delay data are
contained in the database. Figure 3.3 depicts the data structure modelling an example
architecture with a 3×3 PE-grid, 3 I/O ports, and 3 memory ports.
Area estimation therefore performs a depth-first traversal, aggregating area at every
module visited, with the top-level module as root. The area of each primitive module is
drawn from the database and aggregated upwards.

Figure 3.3: An example CGRA architecture modelled as a tree of module objects in
CGRA-ME. The modules filled in blue are the primitive modules, while the ones filled
in red are non-primitive/composite modules. During area estimation, the modules with
red solid outlines require either a database lookup (primitive) or summation of all
submodule areas (composite). In a highly regular architecture, there can be multiple
instances of each unique composite module; hence, in this example, only a few composite
modules require summation and have red solid outlines.

The grid structure of CGRAs implies
that many modules are repeatedly instantiated with the same parameters, and for these
cases, recomputation is not required for every instance. During the traversal, each
unique non-primitive/composite module, after its area is computed, is given a
new entry in the area characterization database for reuse. In Figure 3.3, the module
instances with a red border are the only ones requiring an entry in the characterization
database or accumulation from submodules. At the end of the traversal, an estimate of
the total CGRA area is available and reported to the user. Likewise, a report also shows
the estimated area at each lower level of the hierarchy, giving the architect visibility
into the breakdown of area for the modelled CGRA. In Chapter 4, we compare this
straightforward primitive-aggregating estimation approach to the actual area after a full
ASIC PnR of a CGRA.
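The traversal just described can be sketched in a few lines. This is a minimal illustration of the approach only; the class, module kinds, and area values below are invented, not CGRA-ME's actual API or data:

```python
# Depth-first area aggregation with memoization of unique composite modules.
# Primitive areas come from the characterization database; every unique
# composite module is costed once and cached, so repeated tiles cost a lookup.

class Module:
    def __init__(self, kind, submodules=()):
        self.kind = kind                    # unique type name, e.g. "pe"
        self.submodules = list(submodules)

# Per-primitive areas (um^2), as produced by isolated synthesis (invented).
area_db = {"op_add": 120.0, "mux4": 45.0, "reg32": 210.0}

def estimate_area(module, cache=None):
    if cache is None:
        cache = dict(area_db)               # primitives are pre-seeded
    if module.kind in cache:                # primitive, or composite seen before
        return cache[module.kind]
    total = sum(estimate_area(sub, cache) for sub in module.submodules)
    cache[module.kind] = total              # reuse for every identical instance
    return total

pe = Module("pe", [Module("op_add"), Module("mux4"), Module("reg32")])
cgra = Module("cgra", [pe] * 9)             # 3x3 grid of identical PEs
print(estimate_area(cgra))                  # 9 * (120 + 45 + 210) = 3375.0
```

With the cache, each of the nine PE instances after the first costs a single dictionary lookup, mirroring the reuse of characterization-database entries described above.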
3.4 Performance Modelling and STA
As with an FPGA, the critical path of an application benchmark implemented on a
CGRA depends on the mapping, placement and routing of the application within the
CGRA, as well as the circuit delays within the CGRA device (from the characterization
database discussed in Section 3.2).
We integrated an open-source STA engine into CGRA-ME. The STA engine, called
Tatum, is also used within the VTR project [46]. Tatum performs timing analysis using a
timing graph, wherein nodes represent pins on electrical components, and edges represent
connections between pins. Delays are then annotated onto the edges of the timing graph.
Tatum has an easy-to-use C++ API that allows one to create the timing graph and perform
the delay annotation. Following this, Tatum performs timing analysis on the graph, and
can generate a critical-path delay report. That is, Tatum includes the functionality for
the familiar STA tasks of forward delay propagation to find the worst-case timing paths
in a design, and backward propagation of slacks [25] to find the timing slack on each
edge of the graph. Tatum can also be extended to accept a Synopsys Design Constraints (SDC) file as input, to allow user control over the timing analysis (e.g. setting false paths, or selecting specific paths to analyze).
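The forward/backward propagation just described can be sketched compactly. This is a toy illustration of the algorithm on an invented five-edge graph, not Tatum's C++ API:

```python
# Forward propagation of arrival times finds the worst-case path; backward
# propagation of required times yields the slack on every edge.

edges = {  # (source pin, sink pin): edge delay in ns (invented values)
    ("reg.Q", "mux.in"): 0.07,
    ("mux.in", "mux.out"): 0.10,
    ("mux.out", "mult.in"): 0.82,
    ("mult.in", "mult.out"): 1.10,
    ("mult.out", "reg.D"): 0.06,
}
topo = ["reg.Q", "mux.in", "mux.out", "mult.in", "mult.out", "reg.D"]

# Forward pass: latest arrival time at each pin (0.0 at primary inputs).
arrival = {}
for v in topo:
    fanin = [arrival[u] + d for (u, w), d in edges.items() if w == v]
    arrival[v] = max(fanin, default=0.0)

# Backward pass: earliest required time at each pin, then per-edge slack.
required = {}
for v in reversed(topo):
    fanout = [required[w] - d for (u, w), d in edges.items() if u == v]
    required[v] = min(fanout, default=arrival[topo[-1]])

slack = {(u, w): required[w] - arrival[u] - d for (u, w), d in edges.items()}

print(round(arrival["reg.D"], 2))  # 2.15 -> critical-path delay
```

Because this example is a single chain, every edge lies on the critical path and all slacks are zero; in a real timing graph, off-critical edges carry positive slack.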
To integrate Tatum into CGRA-ME, we use the mapping results of the application
benchmark’s DFG onto the architecture model MRRG. In essence, we walk the used
part of the CGRA for the application benchmark, creating a partial MRRG, which will
serve as an input to create a timing graph in Tatum. Using the same example PE from
Figure 3.1, Figure 3.4 depicts an MRRG of a three-context PE, where the nodes and
edges highlighted in red are the used part of the CGRA. In order to create a timing
graph from the partial MRRG, all primitive modules involved in this MRRG will have their corresponding timing graph representations created and connected.

Figure 3.4: MRRG of a 3-context PE, with the mapped resources highlighted in red.

In Figure 3.5,
we see the three entities represented in Tatum’s timing graph:
1) Combinational entity: primitive modules with only combinational delay, such as
adder, multiplier, multiplexer, etc.
2) Register entity: primitive modules with clock inputs, such as registers and register files, which require annotation of setup time, hold time, clock-to-Q delay, and clock skew (reported by per-primitive post-synthesis STA).
3) Interconnect delay entity: interconnect wires among primitive modules.
For example, in Figure 3.5a, the used input and output pins on a CGRA multiplexer
become nodes in Tatum's timing graph, connected by an edge. The delays on the edges of the timing graph are then annotated based on the characterization database discussed above. How interconnect delay is inferred will be discussed in the following section. Figure 3.6 depicts the resulting timing graph generated from the MRRG shown in Figure 3.4.

(a) An edge pointing from an input pin (ipin) to an output pin (opin), representing an abstract combinational entity; the edge is annotated with the combinational delay.

(b) Sub-structure with input pin (ipin), clock pin (cpin), source pin (src), sink pin (snk), and output pin (opin), representing a register entity. The edge from cpin to snk is annotated with the hold and setup delay values, and the edge from src to opin is annotated with the clock-to-Q delay value.

(c) An edge pointing from an opin to an ipin, representing an abstract interconnect entity; the edge is annotated with the interconnect delay.

Figure 3.5: Three categories of subgraph composing a full timing graph in Tatum, converted from their corresponding MRRG counterparts.
Figure 3.6: Timing graph representing the mapped MRRG from Figure 3.4.
(a) Mapped MRRG (b) Timing graph
Figure 3.7: Mapped MRRG and timing graph of the "sum" benchmark put side-by-side, showing the difference in graph complexity at a larger scale.
Another example is Figure 3.7, which illustrates the mapped MRRG of the sum
benchmark on the ADRES architecture, and the corresponding timing graph in Tatum.
Observe that the timing graph is generally larger than the mapped MRRG, because nodes
in the timing graph represent pins, whereas nodes in the MRRG are more coarse-grained.
Note that Tatum does not enforce the granularity of the modelled entities, meaning that it is entirely up to the user to decide how detailed the modelled timing behaviour is. CGRAs, in general, can be modelled with coarse granularity, since they are configured at the bus level. Conversely, because FPGAs are configurable down to a single bit, much finer granularity is required, which greatly enlarges and complicates the timing graph. The timing
graph created by CGRA-ME takes advantage of the coarse granularity, making primitive
modules the most detailed unit to model. As a result, STA in CGRA-ME is very fast
and computationally inexpensive.
3.5 Interconnect Delay Estimation
The characterization database contains delays for each type of primitive. However, inter-
connect delays are a growing contributor to total delay in deep-submicron VLSI technol-
ogy. Such delays are not accounted for in the database, and in Chapter 4, we show that
ignoring interconnect delay leads to poor accuracy. As our aim is to provide a high-level
performance estimation capability, we must estimate interconnect delay prior to place-
ment and routing (i.e. without detailed knowledge of wirelength, capacitance, resistance).
Therefore, to improve the delay-modelling capabilities of CGRA-ME, we constructed a
simple fanout-based interconnect delay estimation model. Fanout is widely used as a
proxy for interconnect delay estimation [26,40].
Figure 3.8: Standard-cell layout for fanout delay scaling analysis on op_and with 16 fanout registers, with the upper, middle, and lower rats' nests representing fanin registers, the primitive module, and fanout registers, respectively.
Specifically, for each type of primitive, we attached the outputs of the primitive to
various numbers of fanout registers. We then performed a full synthesis, placement, and
routing into standard-cells, followed by STA with Synopsys PrimeTime. We extracted
the delay from the outputs of the primitive to the fanout registers. This allowed us to
create a model associating primitive fanout with delay. Figure 3.8 shows the primitive
module op_and at the center of the chip, while the fanout registers are at the bottom.
Figure 3.9: Averaged fanout-delay of all primitive modules.
Figure 3.9 plots the relationship between fanout register count and average fanout
delay of all primitives, showing a line of best fit. As shown, the interconnect delay is
roughly linear with the fanout. Using the MRRG fanout and the linear fanout-delay
model, we can optionally annotate estimated interconnect delays onto Tatum’s timing
graph. In Chapter 4, we will use 8 benchmarks and 3 architectures to verify the accuracy
of modelling performance incorporating the fanout-based delay model.
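A linear fanout-delay model of this kind can be fit with ordinary least squares. The (fanout, delay) samples below are invented stand-ins for the per-primitive characterization data, chosen to lie on an exact line so the fit is easy to verify:

```python
# Least-squares fit of delay = a * fanout + b, the linear fanout-delay model
# of Section 3.5. Sample points are illustrative, not measured data.

samples = [(1, 0.05), (2, 0.08), (4, 0.14), (8, 0.26), (16, 0.50)]

n = len(samples)
sx = sum(f for f, _ in samples)
sy = sum(d for _, d in samples)
sxx = sum(f * f for f, _ in samples)
sxy = sum(f * d for f, d in samples)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope: ns per unit fanout
b = (sy - a * sx) / n                            # intercept: ns

def fanout_delay(fanout):
    """Estimated interconnect delay (ns) for a driver with the given fanout."""
    return a * fanout + b

print(round(fanout_delay(3), 3))  # 0.11 (samples lie on delay = 0.03*f + 0.02)
```

Given such a model, an MRRG node with fanout 3 would be annotated with roughly `fanout_delay(3)` of interconnect delay on its outgoing timing-graph edge.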
Chapter 4
Experimental Studies
In this chapter, we experimentally assess the area and performance estimation models
described in the previous chapter. Section 4.1 overviews three architecture variants used
in the following sections. Section 4.2 overviews eight benchmarks used in this study.
Section 4.3 presents a comparison between the full VLSI CAD area/performance versus
the estimates. Section 4.4 presents estimation results for two variants of the same archi-
tecture, validating the ability to use the estimates for architecture-specific DSE studies.
Section 4.5 presents results for applying the estimators to two entirely different architec-
tures, thereby assessing the architecture-to-architecture comparison capability.
4.1 Target Architectures
Figures 4.1a and 4.1b show two variants of ADRES used in this study, referred to as
ADRES with orthogonal interconnect (ADRES-O) and ADRES with diagonal intercon-
nect (ADRES-D), respectively. From top to bottom, the architecture is equipped
with a row of I/O ports, a wider Data Register File (DRF) shared by the first row of
PEs, a 4×4 grid of PEs, a smaller Register File (RF) coupled with each PE (excluding
the first row), and Memory Interface Ports (MPs) each connecting to a row of PEs. On
a side note, the MPs were not used in the original proposal of the ADRES architecture;
Chapter 4. Experimental Studies 40
(a) ADRES-O – Orthogonal interconnect. (b) ADRES-D – Diagonal interconnect.
(c) High-level view of the HyCUBE-like architecture used in the experimental studies of Section 4.5.

Figure 4.1: High-level view of the ADRES-like architectures and HyCUBE architecture used in the experimental studies.
however, in CGRA-ME, load and store operations are serviced by memory units, and
mapped to MPs, so in order to map benchmarks with memory operations, we decided
to add MPs to the architecture1. Each PE consumes and provides data to the nearest
orthogonal or diagonal neighbor PEs, highlighted in black and green, respectively. PEs
on edges are also connected by toroidal buses, highlighted in red and blue, representing
vertical and horizontal toroidal connections, respectively. Each PE can also perform a
bypass through a multiplexer, allowing data routing at the cost of an FU.
1The area/performance of the MPs is not modelled.
Figure 4.1c depicts the HyCUBE architecture with 4×4 PEs. The composition of the PEs was detailed in Figure 2.7. Recall that the Crossbar Switch (CBS) in HyCUBE allows data to be routed combinationally throughout the array (without requiring that FUs be used as route-throughs). With the CBS, the architecture achieves multi-hop and multi-cast capability, at the cost of silicon area and delay overhead due to the extra
multiplexers in bypassable registers and the CBS. The leftmost column of the architecture
is also connected to MPs, for data load and store. In contrast with ADRES, HyCUBE
does not contain any RFs.
4.2 Target Benchmarks
When modelling architecture performance, it is not meaningful to perform STA on an
entire architecture, since the longest bypass/multi-hop chain will become the critical path.
Instead, much like timing analysis for FPGAs, we will map a set of benchmarks on all
target architectures. Timing analysis will be constrained and performed on mapped/used
resources only. The mapping results will be used to generate constraints for PrimeTime
STA to create baseline results, and will also be used to construct the Tatum timing graph
to produce estimation results. The 8 benchmarks selected are listed below:
1) conv2 : Computes the dot product of 2 elements of array "a" with 2 constants, and stores the result into array "b"; 16 DFG nodes

2) conv3 : Computes the dot product of 3 elements of array "a" with 3 constants, and stores the result into array "b"; 24 DFG nodes
3) mac : Computes the sum-of-product of 2 arrays; 11 DFG nodes
4) mults1 : Computes and accumulates the dot product of 4 elements of array “a”
with 4 constants; 31 DFG nodes
(a) conv2 benchmark. (b) conv3 benchmark.
(c) mac benchmark. (d) mults1 benchmark.
(e) nomem1 benchmark. (f) simple benchmark.
(g) simple2 benchmark. (h) sum benchmark.
Figure 4.2: Illustration of the DFGs of the 8 benchmarks used in the experimental studies.
5) nomem1 : Computes an arithmetic series without using any memory units; 6 DFG
nodes
6) simple : Computes and stores element-wise additions of array “a” and “b”; 12
DFG nodes
7) simple2: Computes and stores element-wise multiplications of array “a” and “b”;
12 DFG nodes
8) sum : Computes sum of all elements of array “a”; 7 DFG nodes
The benchmarks employ different computation resources and data-routing patterns.
Figure 4.2 depicts the DFGs of the 8 benchmarks.
There is an ongoing effort to improve the mapping runtime of CGRA-ME, but currently, mapping even the above mini benchmarks, averaging around 15 DFG nodes, is time consuming (around an hour for ADRES, and three hours for HyCUBE), since the mapper currently attempts to simultaneously satisfy constraints for scheduling, placement, and routing. Mapping real-world applications, which can easily contain more than 100 DFG nodes, is a direction being actively pursued within the group.
4.3 ADRES-O Full VLSI Implementation Versus CGRA-ME Estimation
To assess the area and performance estimation within CGRA-ME, the full architecture
Verilog RTL of ADRES-O generated by CGRA-ME was realized in standard cells using
the same set of technology/tools used to profile the primitive modules (c.f. Chapter 3).
Figure 4.3 shows standard-cell layouts of ADRES-O for both the area-optimized and
delay-optimized targets, side by side, in the same scale.
Figure 4.3: Standard-cell PnR architecture layout of ADRES – area-optimized (left) vs. delay-optimized (right), side-by-side on the same scale.

In these layouts, only the immediate submodules under the top module are floorplanned. Cells within these submodules are free to be placed anywhere within the floorplanned rectangle. This implies that different instances of the same top-level submodule will have different internal placements and routings, and therefore, we expect slightly different delays.
                   Area-optimized   Delay-optimized
Baseline [µm²]        125945.3         151203.0
CGRA-ME  [µm²]        116415.0         140854.0
Error                 -7.57%           -6.84%

Table 4.1: ADRES-O: Total core area of area-optimized and delay-optimized designs, as well as estimation by CGRA-ME.
Table 4.1 shows the total chip area of both targets in the row labelled Baseline.
Table 4.2 summarizes the critical path delays of the benchmarks. Note that the MPs are
not included in the performance/area results, as these are typically proprietary IPs and
our focus is on the CGRA aspects of area/performance.
4.3.1 CGRA-ME Estimation Results
We first examine area estimation, where the estimates are shown in the CGRA-ME row
of Table 4.1. Comparing the two rows of the table, we see close alignment between the estimates and the actual area values for the entire CGRA. The estimates are roughly 7% lower than the actual layout values. The results confirm that aggregating primitive module areas provides a good estimate of overall CGRA area. The gap between our tool and the actual areas is anticipated, since our tool does not account for the area contributed by configuration cells. We do not take configuration cells into account in our estimates because their detailed implementation may vary depending on the CGRA (e.g. SRAM cells or flip-flops). However, the number of configuration bits can be estimated to further improve the result. For the baseline results, configuration cells are implemented as flip-flops, connected in a scan chain.

             Area-optimized [ns]   Delay-optimized [ns]
conv2               4.37                  2.77
conv3               4.46                  3.20
mac                 4.51                  2.92
mults1              4.51                  2.77
nomem1              4.33                  2.72
simple              3.24                  2.71
simple2             3.24                  2.71
sum                 4.13                  2.96

Table 4.2: ADRES-O: Critical path delay of benchmarks for area-optimized and delay-optimized targets from PrimeTime STA.
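As noted above, configuration cells are excluded from the estimate, but the number of configuration bits can itself be estimated, for example from multiplexer select widths. The sketch below is a back-of-the-envelope illustration; the PE structure, opcode width, and context count are invented assumptions, not CGRA-ME's configuration model:

```python
# Rough configuration-bit count per tile: ceil(log2(inputs)) select bits per
# mux, plus opcode bits per FU, replicated once per context.

from math import ceil, log2

def mux_config_bits(num_inputs):
    """Select bits needed by a multiplexer with the given input count."""
    return ceil(log2(num_inputs)) if num_inputs > 1 else 0

def estimate_config_bits(muxes, fu_opcode_bits, contexts):
    """muxes: list of mux input counts for one tile (illustrative model)."""
    per_context = sum(mux_config_bits(n) for n in muxes) + fu_opcode_bits
    return per_context * contexts

# e.g. a hypothetical PE with two 4:1 input muxes, one 2:1 bypass mux, a
# 4-bit FU opcode, replicated over 3 contexts:
bits = estimate_config_bits([4, 4, 2], fu_opcode_bits=4, contexts=3)
print(bits)  # (2 + 2 + 1 + 4) * 3 = 27
```

Multiplying such a bit count by a per-cell area (flip-flop or SRAM, depending on the CGRA) would close part of the observed gap between estimated and actual area.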
Turning now to the performance results, Figures 4.4a and 4.4b show the critical
path delays of the 8 benchmarks, under area-optimized and delay-optimized objectives,
respectively. The blue bars show the actual post-routing critical path delays; the red
bars represent the estimates provided by our model. For the results from CGRA-ME, we
did not take interconnect delay into account – interconnect delays are zeroed. Observe
that the estimated critical path delays are almost always optimistic (i.e. the estimated
delays are smaller than the actual delays). The average error in the critical-path delay
estimation is 1.25ns, or 43%. Moreover, apart from this error, the critical path reported
by the estimator differed from the actual critical path in many cases. Out of the 16 cases (8 from each of the area-optimized and delay-optimized implementations), the critical path reported by the estimator was the same as the actual critical path in 7 of the benchmarks. That is, in 9 of the cases, the wrong critical path was reported by the estimator. As a concrete example, we depict the critical paths reported by PrimeTime and CGRA-ME in Figure 4.5.

(a) Area-optimized target.

(b) Delay-optimized target.

Figure 4.4: Critical path delay comparison – CGRA-ME estimations without interconnect delays vs. PrimeTime with interconnect delays.
When the mults1 benchmark was mapped onto the ADRES-O architecture in the delay-optimized scenario, we observed that the critical paths reported by the estimator and by PrimeTime are the same. However, the critical path delays reported are 2.78ns and 1.52ns, from PrimeTime and CGRA-ME, respectively. In Table 4.3, we show the detailed critical path delays for PrimeTime and CGRA-ME STA.

(a) Critical path reported by PrimeTime is highlighted in green.

(b) Critical path reported by CGRA-ME is highlighted in red.

Figure 4.5: After mapping benchmark conv2, we produced the partial MRRG representing the used portion of the hardware. Without interconnect delay taken into account, Synopsys PrimeTime and CGRA-ME report different critical paths.

                              PrimeTime              CGRA-ME
Node                       Incr [ns]  Path [ns]   Incr [ns]  Path [ns]
rf_c1_r2.Q                   0.100      0.100       0.100      0.100
pe_c1_r2.mux_out.in          0.100      0.200       0.000      0.100
pe_c1_r2.mux_out.out         0.050      0.240       0.070      0.170
pe_c0_r2.mux_bypass.in       0.120      0.360       0.000      0.170
pe_c0_r2.mux_bypass.out      0.070      0.430       0.110      0.280
pe_c0_r2.mux_out.in          0.040      0.470       0.000      0.280
pe_c0_r2.mux_out.out         0.040      0.510       0.070      0.350
pe_c3_r2.mux_b.in            0.120      0.630       0.000      0.350
pe_c3_r2.mux_b.out           0.130      0.760       0.070      0.420
pe_c3_r2.func.op_mult.in     0.820      1.580       0.000      0.420
pe_c3_r2.func.op_mult.out    1.130      2.710       1.100      1.520
rf_c3_r2.D                   0.060      2.780       0.000      1.520

Table 4.3: The critical path delay report of the mults1 benchmark, mapped onto ADRES-O in the delay-optimized scenario, generated by PrimeTime and CGRA-ME Tatum (without accounting for interconnect delay).

The "Node" column represents
the pins in the design. The “Incr” column represents the incremental delay from one node
above. The “Path” column represents the cumulative critical path delay from the starting
node. In this report, any incremental delay associated with a .out node represents the total delay (combinational cell + metal wire) within a primitive module. Any incremental delay associated with a .in node represents the total interconnect delay (metal wire) between two primitive modules. For PrimeTime, we observe that of the total critical path delay of 2.78ns, 1.26ns (around 45%) is contributed by interconnect delays. If we added this 1.26ns of interconnect delay to the 1.52ns critical path delay reported by CGRA-ME, our estimation would be quite accurate. Similar results were observed for the
other benchmarks – interconnect delay is the main source of estimator inaccuracy. Hence,
it is important that our estimation integrates an interconnect delay model.
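The mults1 figures above make the point concrete; the arithmetic can be checked directly from the numbers quoted in the text:

```python
# Interconnect contribution for the mults1 / delay-optimized example:
# PrimeTime's total critical path is 2.78 ns, while CGRA-ME's component-only
# estimate is 1.52 ns, so the unmodelled remainder is interconnect delay.

primetime_total = 2.78          # ns, PrimeTime critical-path delay
cgrame_no_interconnect = 1.52   # ns, CGRA-ME estimate with interconnect zeroed
interconnect = primetime_total - cgrame_no_interconnect

print(round(interconnect, 2))                     # 1.26 ns
print(round(interconnect / primetime_total, 2))   # 0.45 -> ~45% of the path
```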
Figures 4.6a and 4.6b show the revised results when fanout-based delay estimation (cf. Section 3.5) is incorporated. In this case, we use the fanouts in the CGRA device model, the architecture MRRG, as input to the estimation model described in Section 3.5.
Observe that the estimation error is improved relative to the results shown in Figure 4.4.
(a) Area-optimized target.
(b) Delay-optimized target.
Figure 4.6: Critical path delay comparison – CGRA-ME estimates with MRRG fanout-inferred interconnect delay versus PrimeTime.
On average, the error is now 0.73ns, or 21%. After this improvement, out of the 16 benchmarks, 8 reported critical paths matched those reported by PrimeTime. Of the remaining 8, in 6 cases Tatum did report the correct path among its top-four critical paths, and these top-four paths are all within a 0.5ns difference.
Further analysis of the results incorporating interconnect delay estimation showed
that a significant source of error was for signals on the outputs of multiplexers driving the
FUs. Table 4.4 shows the delays on the critical path of the mults1 benchmark, where the estimates incorporate fanout-based interconnect delay estimation (using MRRG fanouts).
The interconnect delays at node "pe_c3_r2.func.op_mult.in" are 0.82ns and 0.102ns, from PrimeTime and CGRA-ME, respectively. From the MRRG, the fanout count from "pe_c3_r2.mux_b.out" is 3; however, PrimeTime reported the maximum fanout across all wires in this bus as 149. The MRRG does not capture FU implementation details, such as the available operations in an FU and the implementation of each operation, resulting in this difference. That is, the fanout of gates in the standard-cell implementation may
not align with the apparent bus-fanout in the MRRG.
                              PrimeTime              CGRA-ME
Node                       Incr [ns]  Path [ns]   Incr [ns]  Path [ns]
rf_c1_r2.Q                   0.100      0.100       0.100      0.100
pe_c1_r2.mux_out.in          0.100      0.200       0.096      0.196
pe_c1_r2.mux_out.out         0.050      0.240       0.070      0.266
pe_c0_r2.mux_bypass.in       0.120      0.360       0.170      0.436
pe_c0_r2.mux_bypass.out      0.070      0.430       0.110      0.546
pe_c0_r2.mux_out.in          0.040      0.470       0.090      0.636
pe_c0_r2.mux_out.out         0.040      0.510       0.070      0.706
pe_c3_r2.mux_b.in            0.120      0.630       0.170      0.876
pe_c3_r2.mux_b.out           0.130      0.760       0.070      0.946
pe_c3_r2.func.op_mult.in     0.820      1.580       0.102      1.048
pe_c3_r2.func.op_mult.out    1.130      2.710       1.100      2.148
rf_c3_r2.D                   0.060      2.780       0.096      2.244

Table 4.4: The critical path delay report of the mults1 benchmark, mapped onto ADRES-O in the delay-optimized scenario, generated by PrimeTime and CGRA-ME Tatum, now with interconnect delays modelled based on MRRG node fanouts.
The architecture model, as represented by an MRRG, is not able to capture gate-level
details. The FUs are able to perform many different logical and arithmetic operations,
leading to high fanout (at the gate level) for signals entering these units. In light of this
fanout gap, we added the capability for a user to override the MRRG fanout, for specific
CGRA instances, if desired. Fortunately, such fanout numbers can still be acquired with
the synthesis of primitive modules only, and this does not conflict with the primitive
characterization flow as described in Figure 3.2 in Section 3.2.
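The override mechanism can be sketched as a per-node exception to the linear fanout model. The function name is illustrative, and the linear coefficients below are back-solved from the two delays quoted in the text (0.102ns at fanout 3, 0.934ns at fanout 149); they are not CGRA-ME's actual model parameters:

```python
# Interconnect delay is normally inferred from MRRG fanout via the linear
# model; a per-node override substitutes the synthesis-reported gate-level
# fanout for nodes where the MRRG bus-fanout is known to diverge (e.g.
# multiplexers driving FUs).

A, B = 0.0057, 0.0849  # ns per fanout and intercept (illustrative values)

def interconnect_delay(node, mrrg_fanout, fanout_overrides=None):
    """Estimated interconnect delay (ns) for the net driven by `node`."""
    fanout = (fanout_overrides or {}).get(node, mrrg_fanout)
    return A * fanout + B

node = "pe_c3_r2.mux_b.out"
print(round(interconnect_delay(node, 3), 3))               # 0.102
print(round(interconnect_delay(node, 3, {node: 149}), 3))  # 0.934
```

The override leaves the rest of the flow untouched: only the fanout fed into the linear model changes, so the same characterization data serves both cases.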
With the fanouts of the multiplexers driving FUs overridden, the performance results are shown in Figures 4.7a and 4.7b.

(a) Area-optimized target

(b) Delay-optimized target

Figure 4.7: Critical path delay comparison – CGRA-ME estimations with selected fanout counts overridden vs. PrimeTime.

The average error reduces to 0.33ns, or 9.6%, relative to the baseline. Moreover, after the override, the correctness of the critical paths reported
also improved. Out of the 16 benchmarks, 12 had identical critical paths reported by
PrimeTime and Tatum. In the remaining 4 cases, the critical path reported by PrimeTime
was one of the top-four paths reported by Tatum.
We also revisit the same test case from Tables 4.3 and 4.4, but this time with the fanout numbers of the multiplexers driving FUs overridden. In Table 4.5, the interconnect delays at node "pe_c3_r2.func.op_mult.in" are 0.82ns and 0.934ns, from PrimeTime and CGRA-ME, respectively, which are very close.
                              PrimeTime              CGRA-ME
Node                       Incr [ns]  Path [ns]   Incr [ns]  Path [ns]
rf_c1_r2.Q                   0.100      0.100       0.100      0.100
pe_c1_r2.mux_out.in          0.100      0.200       0.096      0.196
pe_c1_r2.mux_out.out         0.050      0.240       0.070      0.266
pe_c0_r2.mux_bypass.in       0.120      0.360       0.170      0.436
pe_c0_r2.mux_bypass.out      0.070      0.430       0.110      0.546
pe_c0_r2.mux_out.in          0.040      0.470       0.090      0.636
pe_c0_r2.mux_out.out         0.040      0.510       0.070      0.706
pe_c3_r2.mux_b.in            0.120      0.630       0.170      0.876
pe_c3_r2.mux_b.out           0.130      0.760       0.070      0.946
pe_c3_r2.func.op_mult.in     0.820      1.580       0.934      1.880
pe_c3_r2.func.op_mult.out    1.130      2.710       1.100      2.980
rf_c3_r2.D                   0.060      2.780       0.096      3.076

Table 4.5: The critical path delay report of the mults1 benchmark, mapped onto ADRES-O in the delay-optimized scenario, generated by PrimeTime and CGRA-ME Tatum, now with interconnect delays modelled based on MRRG node fanouts, and overridden for the multiplexers driving the FUs.
                 Without Interconnect Delay   MRRG-Based Fanout #     Overridden Fanout #
Target           Area-Opt.    Delay-Opt.      Area-Opt.  Delay-Opt.   Area-Opt.  Delay-Opt.
conv2               ✗            ◦               ✗           ✓            ✓           ✓
conv3               ✗            ◦               ✗           ◦            ✓           ◦
mac                 ◦            ✓               ◦           ✓            ✓           ✓
mults1              ◦            ✓               ◦           ✓            ✓           ✓
nomem1              ✓            ✓               ✓           ✓            ✓           ◦
simple              ✓            ◦               ✓           ◦            ✓           ◦
simple2             ✓            ◦               ✓           ◦            ✓           ◦
sum                 ◦            ✓               ◦           ✓            ✓           ✓

Table 4.6: Correctness of reported critical paths: 1) ✓ – Tatum reported the same path as PrimeTime; 2) ✗ – the path from PrimeTime is not reported by Tatum; 3) ◦ – the path from PrimeTime is reported as one of the top-four critical paths from Tatum.

Table 4.6 summarizes the correctness of the reported critical path for all three methods. ✗'s in the table indicate that the critical path reported by PrimeTime (actual) is not reported by Tatum (estimator), which is undesirable. ◦'s indicate the PrimeTime-reported path is one of the top four paths reported by Tatum, which is not ideal but acceptable. ✓'s represent the same path being reported by both PrimeTime and Tatum, which is desirable. Observe that, when fanouts are taken into account to estimate interconnect delay, there is greater alignment between the paths reported by PrimeTime and Tatum (more ✓'s appear in the centre and right-most columns of the table), and when the corner-case gate-level details are also supplied, the estimation performs even better.
In this section, we have shown that our estimation engine can accurately estimate
area and performance, both within approximately 10% error. However, we have com-
pared our estimation against standard-cell CAD tools for only one architecture under
two implementation objectives: area and performance. In the following sections, we explore whether the estimation engine is applicable in a wider scope, by applying it to an architectural variant, and also to an entirely different architecture.
4.4 ADRES Architecture with Added Diagonal Connectivity
It is desirable that the area/performance estimation can be used for architecture DSE. As a step towards assessing this ability, we use a modified version of ADRES-O called ADRES-D, which contains additional diagonal interconnect connectivity between PEs. In terms of implementation modifications compared with ADRES-O, the PEs of ADRES-D are equipped with larger input multiplexers, and have a larger set of fanouts from the output multiplexer. The larger multiplexers also imply that the architecture may require more configuration cells, due to the wider select inputs. The differences are illustrated
in Figure 4.8.
Here, we highlight some expected effects from these modifications, based on intuition:
1) Area: intuitively, the larger multiplexers and additional configuration cells should
result in increased area usage.
2) Benchmark performance: it is difficult to guess whether the change will result in
better or worse performance. While ADRES-D has more flexible interconnect, the tiles are also expected to be larger, leading to higher interconnect capacitances. Likewise, the multiplexers are larger, leading to higher combinational delay.

3) Architecture flexibility: the modification provides more flexible interconnect, effectively using fewer PEs solely for bypass.

4) Mapper runtime: since the architecture is more flexible, the MRRG representing the architecture is also larger, potentially requiring a longer runtime for the mapper. On the other hand, the additional flexibility may allow the ILP-based mapper to find solutions more easily.

(a) A PE in the ADRES-O architecture. (b) A PE in the ADRES-D architecture, with modifications highlighted in red.

Figure 4.8: PEs from ADRES-O and ADRES-D put side-by-side for comparison.
4.4.1 Area
Before comparing ADRES-O and ADRES-D, we need to first confirm that the estima-
tion engine can retain its accuracy when compared to a standard-cell implementation.
We realized the ADRES-D architecture with standard-cells, and re-ran CGRA-ME area
estimation using the same primitive module characterization databases. In Table 4.7, we
observe that the error in area estimation is still within 7%. As anticipated, our area-estimation technique is still reasonably accurate when applied to an architecture with moderate modifications.
                   Area-optimized   Delay-optimized
VLSI CAD [µm²]        128317.4         158140.2
CGRA-ME  [µm²]        120057.0         149088.0
Error                 -6.4%            -5.7%

Table 4.7: ADRES-D: Total core area of area-optimized and delay-optimized designs, as well as estimation by CGRA-ME.
To verify the area-estimation fidelity, we compare the difference in area between ADRES-O and ADRES-D for both the standard-cell implementations and the estimates. For
standard-cell implementations, with the additional interconnect, we observe a 3877.3µm2
(3.1%) and 7787.9µm2 (5.2%) increase in core area, for area-optimized and delay-optimized
variants, respectively. On the other hand, CGRA-ME estimation shows 3890.0µm2 (3.3%)
and 8150.0µm2 (5.8%) increases in core area for the area- and delay-optimized variants,
respectively. As we can see, the area differences arising from the additional diagonal interconnect connections are faithfully represented by the estimation engine, when compared with the results from the standard-cell flow.
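As a quick sanity check, the percentage increases quoted above can be recomputed from the ADRES-O core areas in Table 4.1 (all values in µm², taken directly from the text):

```python
# Each ADRES-D minus ADRES-O area increase is expressed relative to the
# corresponding ADRES-O baseline.

adres_o = {
    ("VLSI CAD", "area-opt"): 125945.3,
    ("VLSI CAD", "delay-opt"): 151203.0,
    ("CGRA-ME", "area-opt"): 116415.0,
    ("CGRA-ME", "delay-opt"): 140854.0,
}
increase = {
    ("VLSI CAD", "area-opt"): 3877.3,
    ("VLSI CAD", "delay-opt"): 7787.9,
    ("CGRA-ME", "area-opt"): 3890.0,
    ("CGRA-ME", "delay-opt"): 8150.0,
}

pct = {k: round(100.0 * increase[k] / adres_o[k], 1) for k in adres_o}
for k, v in pct.items():
    print(k, v)  # 3.1, 5.2, 3.3, 5.8 -- matching the percentages in the text
```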
4.4.2 Performance
For performance, we must perform mapping again for all benchmarks, to take advan-
tage of the new diagonal connectivity. From the new mapping results, we created the
corresponding STA constraints, and generated timing results for both area-optimized
and delay-optimized targets. Figure B.1 shows the critical path delay of the same 8
benchmarks on the ADRES-D architecture. Table 4.8 shows the correctness of the estimated critical paths, versus those reported by PrimeTime for the standard-cell implementation. We observe that the accuracy and correctness of the performance estimation still hold.
Target       Area-Optimized   Delay-Optimized
conv2              ✓                 ✓
conv3              ✓                 ✓
mac                ✓                 ✓
mults1             ✓                 ◦
nomem1             ✓                 ✓
simple             ✓                 ◦
simple2            ✓                 ◦
sum                ✓                 ✓

Table 4.8: Correctness of reported critical paths on the ADRES-D architecture: 1) ✓ – Tatum reported the same path as PrimeTime; 2) ✗ – the path from PrimeTime is not reported by Tatum; 3) ◦ – the path from PrimeTime is reported as one of the top-four critical paths from Tatum.

We now move on to compare the performance of ADRES-O with ADRES-D. We first compare the performance of the standard-cell implementations (STA with Synopsys PrimeTime). While having a different mapping per benchmark, we observe that,
on average, ADRES-O and ADRES-D have 4.10ns and 4.41ns critical path delays for
the area-optimized implementations, respectively. The analogous delays are 2.85ns and
3.14ns in the delay-optimized implementations. That is, in the area-optimized imple-
mentation, ADRES-D has a 0.31ns larger average critical path delay versus ADRES-O.
In delay-optimized implementation, ADRES-D has a 0.29ns larger average critical path
delay versus ADRES-O.
Turning now to the estimator, the estimator reports 0.4ns and 0.52ns larger average
critical path delays for area-optimized and delay-optimized variants, respectively. These
results match closely with the standard-cell results from PrimeTime. As mentioned ear-
lier, having diagonal interconnect could result in fewer interconnect “hops” per routed
connection. When we investigated the mapping results in detail, we observed that only
2 of the 8 mapped benchmarks used the diagonal interconnect. Additionally, even when
the diagonal interconnect was used, we observed that the number of primitive modules the
critical path spans was similar on average in ADRES-D and ADRES-O. This sug-
gests that the additional combinational delay from the larger multiplexers in ADRES-D
is the dominant contributor to the change in critical path delay. While these 8 benchmarks suggest that ADRES-O outperforms ADRES-D, it is important to note that these
benchmarks do not impose heavy routing congestion (refer to Figure 4.2), which implies
that a larger benchmark requiring most of the functional resources may yield a different
conclusion. It is worth mentioning that mapping the 8 benchmarks takes similar runtime
on ADRES-D (148.4s on average) and ADRES-O (152.4s on average); five benchmarks
map faster on ADRES-D. The added flexibility in ADRES-D
appears to be neutral to mapping runtime. With these results, we have demonstrated
that, at least for a modest architecture modification to ADRES-O, the estimation engine
is able to predict the area and performance consequences of the change with reasonable
accuracy.
4.5 ADRES Architecture versus HyCUBE Architecture
The previous section considered whether the estimation engine can assess an architec-
ture modification with good fidelity. We now consider the case of two entirely different
architectures: ADRES-O and HyCUBE.
4.5.1 Architectural Differences: ADRES-O versus HyCUBE
Both ADRES and HyCUBE are generic CGRA architectures – the key differences be-
tween them are the interconnect and the locations of registers. In Figure 4.9, we compare
the PEs from the two architectures.
ADRES employs many RFs: one wide VLIW-shared RF at the top row, and multiple
smaller RFs, each associated with one PE. This provides mapping flexibility by allowing
data to be stored and retrieved an arbitrary number of clock cycles later. HyCUBE,
on the other hand, has individual registers, but no register files. As well, the CBS in
HyCUBE has bypassable input registers allowing data routing to any PE on the array within the same cycle.

(a) A PE in the ADRES-O architecture, with parts unique from the HyCUBE PE highlighted in blue. (b) A PE in the HyCUBE architecture, with parts unique from the ADRES-O PE highlighted in red.
Figure 4.9: PEs from ADRES-O and HyCUBE.

Additionally, the CBS allows a signal to route through a PE
without using its FU. The FU can be used for computation and its output routed through
the same CBS (as long as the CBS is not fully congested).
4.5.2 Area
We implemented a 4×4 HyCUBE architecture using the standard-cell flow, and also
estimated its area using the same set of characterization databases. From Table 4.9, we
again observe that our approach remains fairly accurate: 10-11% estimation error.
                  Area-optimized    Delay-optimized
VLSI CAD [µm²]    153707.1          187666.5
CGRA-ME [µm²]     136880.0          169228.0
Error             -11.0%            -9.8%

Table 4.9: HyCUBE: Total area of area-optimized and delay-optimized variants, as well as estimation by CGRA-ME.
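The error rows in Table 4.9 follow the usual signed relative-error definition and can be checked directly against the table's totals (the small helper below is ours, not part of CGRA-ME):

```python
def relative_error(estimated, reference):
    """Signed percent error of the CGRA-ME estimate versus the standard-cell reference."""
    return 100.0 * (estimated - reference) / reference

# HyCUBE totals from Table 4.9 (um^2):
area_opt = relative_error(136880.0, 153707.1)
delay_opt = relative_error(169228.0, 187666.5)
print(f"{area_opt:.1f}%, {delay_opt:.1f}%")  # close to the reported -11.0% and -9.8%
```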
For ADRES-O, the standard-cell area was 128317.4µm2 and 158140.2µm2, for area-
optimized and delay-optimized variants, respectively. For HyCUBE, the analogous standard-
cell areas are 153707.1µm2 and 187666.5µm2. The HyCUBE architecture requires 19.8%
and 18.7% more area than ADRES-O, for area-optimized and delay-optimized standard-
cell variants, respectively. Using CGRA-ME, the HyCUBE architecture was estimated
to require 17.8% and 20.1% more area than ADRES-O, for area- and delay-optimized
variants, respectively. This closely matches the increases observed with the standard-cell
variants.
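The percentages above are simple ratios of the quoted standard-cell totals and can be reproduced directly (the helper name is ours):

```python
def extra_area_pct(hycube_area, adres_area):
    """Percent additional area HyCUBE requires over ADRES-O."""
    return 100.0 * (hycube_area / adres_area - 1.0)

# Standard-cell totals (um^2) quoted in the text:
print(round(extra_area_pct(153707.1, 128317.4), 1))  # area-optimized -> 19.8
print(round(extra_area_pct(187666.5, 158140.2), 1))  # delay-optimized -> 18.7
```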
4.5.3 Performance
For performance, the same 8 benchmarks were mapped onto the 4×4 HyCUBE, and we created the corresponding constraints for PrimeTime to perform per-benchmark standard-
cell STA for both optimization targets. We have also modelled the performance within
CGRA-ME. In Table 4.10 and Figure B.2, notice that results for benchmark conv3 are
left empty, since the benchmark is unmappable on HyCUBE due to insufficient MPs.
From Table 4.10, we see that the critical paths are still mostly correct. However, from
Figure B.2, we see that there is an average of 17.2% error in the estimation. With some
investigation, we observed that critical paths in HyCUBE spanned more primitive modules,
and that the multiplexer primitive modules are more widely used in HyCUBE when
compared against ADRES. However, for all sizes of multiplexers, we estimate their in-
terconnect delay using the average linear model presented in Figure 3.9. We believe that
error accumulates in chains of multiplexers on HyCUBE critical paths. We could remedy
this by characterizing the interconnect delay per primitive module instead of using an
overall average; this is left as future work.
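To see why such error compounds, consider a toy model in which one global average stands in for per-size multiplexer delays. All delay numbers here are hypothetical, chosen only to illustrate the trend, and are not measured characterization data:

```python
# Hypothetical per-primitive interconnect delays (ns) versus one global average.
true_delay = {"mux2": 0.05, "mux4": 0.08, "mux8": 0.12}
avg_delay = sum(true_delay.values()) / len(true_delay)  # single average model

def path_error(mux_chain):
    """Estimation error when the average model replaces per-primitive delays."""
    true = sum(true_delay[m] for m in mux_chain)
    est = avg_delay * len(mux_chain)
    return est - true

# Short ADRES-like path vs a longer HyCUBE-like chain of wide muxes:
print(round(path_error(["mux4"]), 3))      # -> 0.003
print(round(path_error(["mux8"] * 5), 3))  # -> -0.183, error grows with chain length
```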
We now compare the performance of HyCUBE to ADRES-O for standard-cell implementations. The average critical path delays of HyCUBE are 3.59ns and 2.49ns, for the area-optimized and delay-optimized variants, respectively; these are 0.51ns and 0.36ns shorter than those of ADRES-O.
However, estimation from CGRA-ME suggests 0.64ns and 0.11ns longer average critical
path delays for HyCUBE versus ADRES-O, for area- and delay-optimized variants, respectively. The difference again arises because of the weakness of the average interconnect delay model.

Target    Area-Optimized    Delay-Optimized
conv2     Top-4             Same
conv3     N/A               N/A
mac       Same              Same
mults1    Top-4             Top-4
nomem1    Same              Same
simple    Same              Missed
simple2   Same              Missed
sum       Same              Missed

Table 4.10: Correctness of reported critical paths on the HyCUBE architecture: 1) Same – Tatum reported the same path as PrimeTime; 2) Missed – the path from PrimeTime is not reported by Tatum; 3) Top-4 – the path from PrimeTime is reported as one of the top-four critical paths from Tatum.
While it is not the focus of this work, it is important to keep in mind that with the
built-in mapping algorithm of CGRA-ME, the HyCUBE architecture takes noticeably
longer to map compared to either ADRES-O or ADRES-D. For the 7 benchmarks that
mapped, HyCUBE took more than 10× longer to map when compared against ADRES-O and ADRES-D.
4.6 Summary
In this chapter, we compared the area and performance estimates with the results from
a full standard-cell implementation. We showed that area and performance of a CGRA
can be estimated with reasonable accuracy, and that benchmark-by-benchmark perfor-
mance estimates also reflect the actual standard-cell performance. Additionally, with
the comparison of ADRES-O and ADRES-D, we showcased that our estimation engine
offers fidelity for an architecture modification to the interconnect fabric. Finally, with the
comparison of ADRES-O and HyCUBE, we demonstrated that for completely different
CGRAs, our framework can quantify some of their specific advantages and disadvantages,
although the interconnect delay estimation lacks granularity. The integration of the estimation
engine into CGRA-ME provides a human architect with rapid area and performance
estimation, without requiring a complete standard-cell flow, layout extraction, and STA.
Chapter 5
Conclusion and Future Work
In this thesis, we extended the CGRA-ME framework with the capability to rapidly es-
timate the area and performance of the modelled CGRA, without undergoing a lengthy
ASIC design CAD flow. Our estimation approach is to model architecture-level charac-
teristics based on a characterization database for primitive modules. This is somewhat
analogous to the characterization data associated with standard cells in a fab-provided
library. Performance characterization data is used in conjunction with an open-source
timing analysis tool, Tatum, to estimate critical paths and slacks for an application bench-
mark mapped onto a modelled CGRA. In Chapter 4, we demonstrated that we can esti-
mate CGRA area and performance with reasonable accuracy, provided that interconnect
delays are also included in the estimates. We also demonstrated that the estimation en-
gine can be applied to gauge the area/delay impact of architecture modifications, and that
comparing entirely different architectures requires a more accurate interconnect delay model. We
expect the estimation engine will be useful for early assessment of hypothetical CGRA
architectures.
5.1 Future Work
A first direction for future work is to further improve the interconnect delay estimation
by using per-primitive estimation models instead of a single average model for all
primitives. Another area requiring refinement on this front is for the interconnect model
to take into account the location of the CGRA tiles, for example, to realize that the
toroidal connections in ADRES-O are longer wires than many of the nearest-neighbour
connections. More diverse PE and FU implementations are also important to verify
the accuracy of interconnect delay estimation.
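As an illustration of what such a location-aware model would capture, the physical length of a link can be derived from tile coordinates, so that a single-hop toroidal link is costed as a wire spanning the whole row. The tile pitch and function below are hypothetical, not CGRA-ME code:

```python
TILE_PITCH_UM = 150.0  # hypothetical tile pitch (um); not a characterized value

def wire_length_um(src_col, dst_col):
    """Physical horizontal wire length between two tiles in the same row.

    A toroidal link from the last column back to column 0 is one logical hop,
    but the wire itself must cross the whole row.
    """
    return abs(dst_col - src_col) * TILE_PITCH_UM

# On a 4-column row: nearest-neighbour link vs toroidal wraparound link.
print(wire_length_um(1, 2))  # -> 150.0
print(wire_length_um(3, 0))  # -> 450.0, despite being a single logical hop
```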
Another future direction is to build characterization databases for non-standard-cell
CGRA implementations. CGRAs may, for example, be implemented as FPGA overlays,
where they would undoubtedly have entirely different area and delay characteristics on a
primitive-by-primitive basis. It may also be necessary to re-think what exactly constitutes
a primitive in the FPGA-overlay context. Likewise, an entirely different interconnect
delay model may be required.
Further avenues for exploration include using the performance estimates provided
by this research within the CGRA mapping algorithm to realize performance-driven
CGRA mapping, as opposed to the strictly feasibility-driven mapping used in the current
CGRA-ME. It would also be interesting to extend the current work to provide early
power estimation, in addition to area and performance. This would allow CGRAs to be
compared against other computing platforms from the power angle.
Finally, while the work here considered estimation for area- and performance-optimized
CGRA standard-cell implementations, it would be useful to consider alternatives to this.
Some options here include: 1) balanced area/performance implementations, or 2) imple-
mentations wherein the standard-cell tools make use of optimized IP cores for arithmetic,
memory, or other CGRA structures.
Appendix A
Area Modelling
(a) Area-optimized target (b) Delay-optimized target
Figure A.1: ADRES-O Architecture: Area breakdown in µm²
(a) Area-optimized target (b) Delay-optimized target
Figure A.2: ADRES-D Architecture: Area breakdown in µm²
(a) Area-optimized target (b) Delay-optimized target
Figure A.3: HyCUBE Architecture: Area breakdown in µm²
Appendix B
Performance Modelling
(a) Area-optimized target
(b) Delay-optimized target
Figure B.1: ADRES-D Architecture: Critical path delay comparison – CGRA-ME estimations with selected fanout count overridden versus PrimeTime.
(a) Area-optimized target.
(b) Delay-optimized target.
Figure B.2: HyCUBE Architecture: Critical path delay comparison – CGRA-ME estimations with selected fanout count overridden versus PrimeTime.