Compact Area and Performance Modelling for Coarse-Grained Reconfigurable Architectures
by
Kuang-Ping Niu
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2019 by Kuang-Ping Niu
Abstract
Compact Area and Performance Modelling for Coarse-Grained Reconfigurable
Architectures
Kuang-Ping Niu
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2019
We consider area and performance modelling for coarse-grained reconfigurable architec-
tures (CGRAs) and extend the open-source CGRA-ME (CGRA modelling and explo-
ration) framework to rapidly estimate these metrics. Area is modelled by synthesizing
commonly occurring CGRA primitives in isolation, and then aggregating the primitives’
component-wise areas. Performance is modelled by integrating a static-timing analysis
(STA) framework into CGRA-ME. The delays in the STA timing graph are based on
component-wise delays, as well as estimated interconnect delay. Experimental results us-
ing the estimation engine demonstrate reasonably accurate estimation for both area and
performance for different CGRA architectures, as well as different variations of the same
architecture. By offering fast and accurate estimation in an early phase of CGRA archi-
tecture exploration, the estimation engine allows the user to bypass the lengthy process
of a full VLSI implementation, and rapidly explore the area/performance architecture
space.
Acknowledgements
My great thanks go to my supervisor, Professor Jason Anderson. The advice and
guidance I have received from him extend well beyond the research we have done together.
Along with the exciting research project, he has inspired many interesting ideas for
tackling the many challenges I have faced. His mentorship and encouragement have helped me
reach important achievements in life. I am very grateful to have him as my supervisor.
I would like to thank my parents, my aunt, and my sister. For more than a decade
that I have been away from home, they have always been unconditionally supportive
from the other side of the phone. Big thanks to my family, for always encouraging me
when I was weak, and for sharing the happiness from my accomplishments.
Special credit goes to Cathy, for injecting non-technology-related elements into my
day to day life, for believing in me, and for constantly pushing me to challenge myself to
become more capable.
Lastly, many thanks to the bright minds in our research group: Xander, Brett, Jin
Hee, Julie, Joy, Ian, Matthew, Nick, and Austin. I greatly appreciate the friendly
advice, critical feedback, brain-picking conversations, camping, coffee, frisbee, and much
more, throughout these years. Thank you all for making the journey that much more
rewarding and lively.
Contents
List of Tables vi
List of Figures viii
List of Acronyms xi
1 Introduction 1
1.1 Introduction to Coarse-Grained Reconfigurable Architectures (CGRAs) . 3
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 8
2.1 Existing CGRA Architectures . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 ADRES Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 HyCUBE Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Existing CGRA CAD Tools . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 CGRA-ME Framework Overview . . . . . . . . . . . . . . . . . . . . . . 18
2.6 ASIC Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Prior Work on Hardware Performance and Area Estimation . . . . . . . . 21
iv
3 CGRA-ME – Estimation Engine 24
3.1 Architecture Modelling in CGRA-ME and the Primitive Modules . . . . 24
3.2 Characterization of Primitive Modules . . . . . . . . . . . . . . . . . . . 27
3.3 Area Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Performance Modelling and STA . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Interconnect Delay Estimation . . . . . . . . . . . . . . . . . . . . . . . . 37
4 Experimental Studies 39
4.1 Target Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.2 Target Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 ADRES-O Full VLSI Implementation Versus CGRA-ME Estimation . . . 43
4.3.1 CGRA-ME Estimation Results . . . . . . . . . . . . . . . . . . . 44
4.4 ADRES Architecture with Added Diagonal Connectivity . . . . . . . . . 53
4.4.1 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 ADRES Architecture versus HyCUBE Architecture . . . . . . . . . . . . 57
4.5.1 Architectural Differences: ADRES-O versus HyCUBE . . . . . . . 57
4.5.2 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.3 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5 Conclusion and Future Work 62
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Appendix A Area Modelling 64
Appendix B Performance Modelling 67
Bibliography 69
List of Tables
3.1 Database of area and critical-path delay of the primitive modules as mapped
into the NanGate FreePDK45 45nm standard-cell library. . . . . . . . . . 30
4.1 ADRES with orthogonal interconnect (ADRES-O): Total core area of area-
optimized and delay-optimized designs, as well as estimation by CGRA-ME. 44
4.2 ADRES-O: Critical path delay of benchmarks for area-optimized and delay-
optimized targets from PrimeTime STA. . . . . . . . . . . . . . . . . . . 45
4.3 The critical path delay report of the mults1 benchmark, mapped onto
ADRES-O in the delay-optimized scenario, generated by PrimeTime and
CGRA Modelling and Exploration (CGRA-ME) Tatum (without account-
ing for interconnect delay). . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.4 The critical path delay report of the mults1 benchmark, mapped onto
ADRES-O in delay-optimized scenario, generated by PrimeTime and CGRA-
ME Tatum now with interconnect delays modelled based on Modulo Rout-
ing Resource Graph (MRRG) node fanouts. . . . . . . . . . . . . . . . . 50
4.5 The critical path delay report of the mults1 benchmark, mapped onto
ADRES-O in the delay-optimized scenario, generated by PrimeTime and
CGRA-ME Tatum now with interconnect delays modelled based on MRRG
node fanouts, and overridden for the multiplexer driving the Functional
Units (FUs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Correctness of reported critical paths: 1) ✓ – Tatum reported the same
path as PrimeTime, 2) ✗ – path from PrimeTime is not reported by
Tatum, 3) 1 – path from PrimeTime is reported as one of the top-four
critical paths from Tatum. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 ADRES with diagonal interconnect (ADRES-D): Total core area of area-
optimized and delay-optimized designs, as well as estimation by CGRA-ME. 55
4.8 Correctness of reported critical paths on ADRES-D architecture: 1) ✓ –
Tatum reported the same path as PrimeTime, 2) ✗ – path from Prime-
Time is not reported by Tatum, 3) 1 – path from PrimeTime is reported
as one of the top-four critical paths from Tatum. . . . . . . . . . . . . . . 56
4.9 HyCUBE: Total area of area-optimized and delay-optimized variants, as
well as estimation by CGRA-ME. . . . . . . . . . . . . . . . . . . . . . . 58
4.10 Correctness of reported critical paths on HyCUBE architecture: 1) ✓ –
Tatum reported the same path as PrimeTime, 2) ✗ – path from Prime-
Time is not reported by Tatum, 3) 1 – path from PrimeTime is reported
as one of the top-four critical paths from Tatum. . . . . . . . . . . . . . . 60
List of Figures
1.1 CPU trends from 1970s to 2010s: transistor count versus clock frequency. 2
1.2 Logic element comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 The Reconfigurable Pipelined Datapath (RaPiD) accelerator [excerpted
from [15,16]]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 The MorphoSys architecture [excerpted from [55]]. . . . . . . . . . . . . . 10
2.3 The PipeRench architecture [excerpted from [53]]. . . . . . . . . . . . . . 11
2.4 Mapping a (a) 5-virtual-stage application onto a (b) 3-physical-stage/stripe
PipeRench system [excerpted from [19]]. . . . . . . . . . . . . . . . . . . 12
2.5 The Dynamically Reconfigurable Processor (DRP) architecture [excerpted
from [58]]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 The ADRES CGRA system [excerpted from [43]]. . . . . . . . . . . . . . 15
2.7 The HyCUBE architecture. The Coarse-Grain Reconfigurable Array (CGRA)
consists of a 2D array of FUs connected by a multi-hop-capable crossbar switch
interconnect [excerpted from [30]]. . . . . . . . . . . . . . . . . . . . . . . 17
2.8 CGRA-ME framework overview . . . . . . . . . . . . . . . . . . . . . . . 19
2.9 Typical steps involved in Very-Large-Scale Integration (VLSI) design. . . 20
3.1 Illustrations of the two data structures in CGRA-ME representing a Pro-
cessing Element (PE) with 3 contexts. . . . . . . . . . . . . . . . . . . . 26
3.2 Characterization steps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 An example CGRA architecture modelled as a tree of module objects in
CGRA-ME. The modules filled in blue are the primitive modules, while the
ones filled in red are non-primitive/composite modules. During area esti-
mation, the modules with red solid outlines would require either a database
lookup (primitive) or summation of all submodule areas (composite). In a
highly regular architecture, there can be multiple instances of each unique
composite module, hence in this example, only a few composite modules
require summation and have red solid outlines. . . . . . . . . . . . . . . . 32
3.4 MRRG of a 3-context PE, with the mapped resources highlighted in red. 34
3.5 Three categories of subgraph to compose a full timing graph in Tatum,
converted from their corresponding MRRG counterparts. . . . . . . . . . 35
3.6 Timing graph representing the mapped MRRG from Figure 3.4. . . . . . 36
3.7 Mapped MRRG and timing graph of the “sum” benchmark put side-by-
side, showing difference in graph complexity at a larger scale. . . . . . . . 36
3.8 Standard-cell layout for fanout delay scaling analysis on op_and with 16
fanout registers, with the upper, middle, and lower rats' nests representing
fanin registers, the primitive module, and fanout registers, respectively. . 37
3.9 Averaged fanout-delay of all primitive modules. . . . . . . . . . . . . . . 38
4.1 High-level view of the ADRES-like architectures and HyCUBE architec-
ture used in the experimental studies. . . . . . . . . . . . . . . . . . . . . 40
4.2 Illustration of the Dataflow Graphs (DFGs) of the 8 benchmarks used in
the experimental studies. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Standard-cell PnR architecture layout of ADRES – area-optimized (left)
vs. delay-optimized (right) side-by-side on the same scale. . . . . . . . . 44
4.4 Critical path delay comparison – CGRA-ME estimations without inter-
connect delays vs PrimeTime with interconnect delays. . . . . . . . . . . 46
4.5 After mapping benchmark conv2, we produced the partial MRRG rep-
resenting the used portion of the hardware. Without interconnect delay
taken into account, Synopsys PrimeTime and CGRA-ME report different
critical paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Critical path delay comparison – CGRA-ME estimates with MRRG fanout-
inferred interconnect delay versus PrimeTime. . . . . . . . . . . . . . . . 49
4.7 Critical path delay comparison – CGRA-ME estimations with selected
fanout count overridden vs PrimeTime. . . . . . . . . . . . . . . . . . . . 51
4.8 PEs from ADRES-O and ADRES-D put side-by-side for comparison. . . 54
4.9 PEs from ADRES-O and HyCUBE. . . . . . . . . . . . . . . . . . . . . . 58
A.1 ADRES-O Architecture: Area breakdown in µm2 . . . . . . . . . . . . . 64
A.2 ADRES-D Architecture: Area breakdown in µm2 . . . . . . . . . . . . . 65
A.3 HyCUBE Architecture: Area breakdown in µm2 . . . . . . . . . . . . . . 66
B.1 ADRES-D Architecture: Critical path delay comparison – CGRA-ME es-
timations with selected fanout count overridden versus PrimeTime. . . . 67
B.2 HyCUBE Architecture: Critical path delay comparison – CGRA-ME es-
timations with selected fanout count overridden versus PrimeTime. . . . 68
List of Acronyms
ADL Architecture Description Language. 16
ADRES Architecture for Dynamically Reconfigurable Embedded System. 14–16, 23,
25, 36, 57, 59
ADRES-D ADRES with diagonal interconnect. vii, x, 39, 53–57, 60, 65, 67
ADRES-O ADRES with orthogonal interconnect. vi, x, 39, 43–46, 48, 50, 52–60, 63,
64
ALU Arithmetic Logic Unit. 3, 9, 11, 12, 14, 15, 17, 29
API Application Programming Interface. 18
ASIC Application-Specific Integrated Circuit. 1, 2, 5–7, 28, 33, 62
CAD Computer-Aided Design. 5–8, 21, 53, 62
CBS Crossbar Switch. 41, 57, 58
CG-SMAC Coarse-Grained SMAC. 18
CGRA Coarse-Grain Reconfigurable Array. viii, ix, 2, 4–9, 12–17, 19, 22–25, 28, 29,
31–33, 36, 57, 60, 62, 63
CGRA-ME CGRA Modelling and Exploration. vi, viii–x, 4–8, 16–18, 24–29, 31–33,
37, 40, 43, 45–48, 50–52, 54, 55, 59–63
CPU Central Processing Unit. 4, 22
CSTC Central STC. 14
DFG Dataflow Graph. ix, 18, 33, 41–43
DMA Direct Memory Access. 10
DMU Data Management Unit. 14
DRESC Dynamically Reconfigurable Embedded System Compiler. 15, 16
DRF Data Register File. 39
DRP Dynamically Reconfigurable Processor. viii, 13, 14
DSE Design Space Exploration. 16, 18, 22, 23, 39, 53
DSP Digital Signal Processor. 1
FF flip-flop. 21
FPGA Field-Programmable Gate Array. 1, 2, 4, 5, 17, 21, 33, 36, 41, 63
FSM Finite State Machine. 14
FU Functional Unit. vi, viii, 5, 11, 12, 15, 17, 40, 41, 49–52, 58, 63
GPGPU General-Purpose Graphic Processing Unit. 22
GPU Graphic Processing Unit. 1, 4, 22
HDL Hardware Description Language. 4, 20, 27, 29, 31
I/O Input/Output. 5, 13, 25, 31
II Initiation Interval. 17
ILP Integer Linear Programming. 19
IMS Iterative Modulo Scheduling. 18
IP Intellectual Property. 29
IR Intermediate Representation. 16
LE Logic Element. 3
LUT Lookup Table. 3, 9, 12, 21
MP Memory Interface Port. 39–41, 44, 59
MRRG Modulo Routing Resource Graph. vi, ix, x, 16, 17, 19, 25–27, 33–36, 47, 48,
50, 52, 54
PE Processing Element. viii–x, 3, 11–17, 26, 27, 31, 33, 34, 39, 41, 53, 54, 57, 58, 63
PLD Programmable Logic Device. 1, 4
PnR Place-and-Route. 20, 21, 29, 33
RAA Reconfigurable Array Architecture. 23
RAM Random-Access Memory. 21
RaPiD Reconfigurable Pipelined Datapath. viii, 8–10
RC Reconfigurable Cell. 10, 11, 15
RF Register File. 39, 41, 57
RTL Register-Transfer Level. 4, 8, 25
SDC Synopsys Design Constraint. 33
SMAC Simultaneous Mapping and Clustering. 18
SPR Schedule, Place, and Route. 18
SRP Samsung Reconfigurable Processor. 22, 23
STA Static Timing Analysis. 6, 20, 21, 28, 33, 34, 37, 38, 41, 48, 55, 59, 61
STC State Transition Controller. 13, 14
VLIW Very-Long Instruction Word. 16, 57
VLSI Very-Large-Scale Integration. viii, 5, 8, 20, 21
VTR Verilog-to-Routing. 5, 6, 21
Chapter 1
Introduction
Computations, broadly speaking, can be implemented in software or in hardware. In
software, computations are typically specified in a high-level language and then executed
on a standard processor or an application-specific processor. Examples of the latter
include Graphic Processing Units (GPUs) or Digital Signal Processors (DSPs). When
computations are realized in hardware, frequently used options are custom Application-
Specific Integrated Circuits (ASICs), or Programmable Logic Devices (PLDs), such as
Field-Programmable Gate Arrays (FPGAs).
Until recently, Moore’s Law [44] and Dennard scaling [14] allowed improved logic
density, power and performance with each process generation. Unfortunately, perfor-
mance scaling plateaued in the mid-2000s. Figure 1.1 shows processor transistor counts
and clock frequency from 1970 until today. We observe, in the red points, that pro-
cessor clock frequencies have stalled in the single-digit GHz range. We can no longer
rely on process scaling to deliver higher computational throughput in standard proces-
sors. As such, other means must be used to meet insatiable consumer demand for higher
throughput and lower power. GPUs and DSPs offer such higher throughput for certain
applications: massively parallel floating point, and signal processing. On the other hand,
customized hardware, implemented as ASICs or using PLDs, can be precisely tailored to
Figure 1.1: CPU trends from 1970s to 2010s: transistor count versus clock frequency.
application needs, potentially providing orders-of-magnitude improvements in speed and
energy efficiency (e.g. [50]).
Custom ASICs deliver the highest density, speed and power efficiency for any given
application. However, their cost is prohibitively expensive for all but the highest-volume
applications, or those applications with aggressive speed and power constraints. FPGAs
can provide significant speed and energy advantages over processors (e.g. [38]). However,
owing to the overhead of FPGA programmability, they are 3-4× slower, 12× more power
hungry, and 3-35× less area efficient than custom ASICs for implementing a given ap-
plication [34]. Moreover, using FPGAs has traditionally required knowledge of hardware
design, and the compile times for large FPGA designs can run into hours or even days.
Coarse-Grain Reconfigurable Arrays (CGRAs) are an alternative style of PLD that offers
programmability, performance and power benefits over FPGAs in certain cases. Perfor-
mance and area modelling of CGRAs is the central topic of this thesis.
This chapter provides a general overview of what a CGRA is, and describes the
motivation for, and contributions of this thesis.
1.1 Introduction to Coarse-Grained Reconfigurable
Architectures (CGRAs)
A CGRA comprises a 2D grid of coarse-grained Arithmetic Logic Unit (ALU)-like Pro-
cessing Elements (PEs), interconnected by bus-based interconnect. This stands in con-
trast to FPGAs, which have a mix of coarse-grained and fine-grained logic elements,
and where individual logic signals are routed independently. Figure 1.2 compares a fine-
grained FPGA Logic Element (LE) (left) with a CGRA PE (right). As illustrated, the
FPGA LE contains a Lookup Table (LUT), which is a hardware implementation of a truth
table. The CGRA PE receives wide inputs and typically performs ALU-like operations on
such inputs, such as multiply, divide, etc. Because of their coarse-grained nature, CGRAs
have less area dedicated to programmability overhead, making them “less flexible” than
FPGAs. Despite this apparent weakness, CGRAs can excel in speed/power/area over
FPGAs in specific applications, where the CGRA processing element capabilities, and
the CGRA interconnect fabric align closely with application computational and commu-
nication needs.
(a) FPGA logic element. (b) CGRA processing element.
Figure 1.2: Logic element comparison.
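The distinction can be made concrete with a small behavioural sketch (hypothetical Python, not from the thesis): a k-input LUT is configured by 2^k truth-table bits and produces one output bit, whereas a CGRA PE is configured by a single opcode and operates on whole words.

```python
def lut_eval(truth_table, inputs):
    """Fine-grained FPGA LE: a k-LUT stores 2^k configuration bits and
    looks up the output bit addressed by its k input bits."""
    index = 0
    for bit in inputs:                    # inputs are single bits, MSB first
        index = (index << 1) | bit
    return truth_table[index]

def pe_eval(opcode, a, b, width=32):
    """Coarse-grained CGRA PE: one opcode selects an ALU-like operation
    on whole words; configuration is per-bus, not per-bit."""
    ops = {
        "add": lambda x, y: (x + y) & ((1 << width) - 1),
        "mul": lambda x, y: (x * y) & ((1 << width) - 1),
        "and": lambda x, y: x & y,
    }
    return ops[opcode](a, b)

# A 2-LUT configured as XOR: table entries for inputs 00, 01, 10, 11.
assert lut_eval([0, 1, 1, 0], [1, 1]) == 0
assert pe_eval("add", 7, 5) == 12
```

The configuration cost gap follows directly: the LUT needs 2^k bits per output bit, while the PE needs only an opcode (and multiplexer selects) for an entire bus-wide operation.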
As CGRAs are programmable at a higher level of abstraction than FPGAs, CAD tools
targeting CGRAs are responsible for fewer decisions. Putting it succinctly, CGRAs are
configurable at the bus level, not the bit level. CGRA CAD tools contain many of the
same steps as FPGA CAD tools, including technology mapping, placement, and routing;
however, the number of decisions in each of these steps is reduced considerably. This leads
to another advantage of CGRAs, which is CAD tool runtime. While mapping a large
application to an FPGA may take hours or days in the worst-case [21,29,52], CGRA CAD
tool run-times are expected to be closer to software compile times. Moreover, CGRA
applications are typically specified in high-level software languages, rather than in a
Register-Transfer Level (RTL) Hardware Description Language (HDL), as traditionally
used for FPGAs. Thus, CGRAs hold the promise of addressing the two key usability
challenges of FPGAs: 1) software programmability, and 2) fast compile times.
1.2 Motivation
While CGRAs offer certain advantages, their relatively late appearance (1990s) com-
pared to FPGAs (1980s), the reliable Moore’s Law scaling which reduced the incentive
to adopt CGRAs, and the lack of a killer CGRA application, are all reasons that CGRAs
have not taken PLD market share. Many CGRA architectures have appeared in literature
(e.g. [8,59]), and a few commercial CGRAs have been developed (e.g. [7,31,60]). However,
they are less studied than alternative computing platforms, such as Central Processing
Units (CPUs), GPUs, and FPGAs. For such platforms, architecture modelling and eval-
uation frameworks exist, allowing hypothetical architectures to be targeted, tested, and
compared with specific baseline architectures, or other hypothetical architectures. For
CGRAs, the only publicly accessible framework is CGRA Modelling and Exploration
(CGRA-ME) [11, 12], which is under active development at the University of Toronto. This
thesis contributes new modelling functionality to CGRA-ME.
Architecture modelling frameworks for CPUs, GPUs, and FPGAs allow hypothetical
architectures to be modelled and evaluated at an abstract level. The frameworks offer an
early preview of the cost, performance, and power of hypothetical architectures before
actual fabrication. For example, in Verilog-to-Routing (VTR) [39], area is estimated by
counting the number of minimum-width transistors required for the modelled FPGA.
The advantage of this high-level approach is to facilitate the rapid exploration of the
architectural space. Once good points in the architecture space have been identified, a
more detailed implementation of the desirable architectures can be performed to refine
early estimates. If, on the other hand, a full standard-cell or custom Very-Large-Scale
Integration (VLSI) implementation were performed for each architectural candidate, the
breadth of exploration would be severely hindered by lengthy ASIC Computer-Aided
Design (CAD) tool run-times, likely hours or days to produce each datapoint.
1.3 Contributions
While previous work [11] demonstrated that Verilog HDL, automatically produced by
CGRA-ME, could be pushed through commercial ASIC CAD tools (targeting standard
cells) to assess a CGRA’s area and performance, CGRA-ME offered no capability for
high-level area, performance, and power modelling. The work described in this the-
sis overcomes this limitation for the performance and area metrics. Our approach to
area and performance modelling is based on the notion that CGRAs are composed of
commonly occurring primitives, including multiplexers of various sizes, Functional Units
(FUs) with specific arithmetic capability, registers, register files, Input/Output (I/O)
ports, and so on. As such, we use standard-cell ASIC tools to create models of area and
delay for each such primitive, thereby producing a characterization library. Several such
libraries are constructed, representing, for example, area-optimized or delay-optimized
implementations of the primitives by the ASIC tools. The two optimization targets are
selected because die size and performance are common design objectives in varied appli-
cations. Overall CGRA area can then be estimated by aggregating primitive component
areas.
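The aggregation idea can be sketched in a few lines of Python (the module names and area figures below are hypothetical, not values from the actual characterization library): primitives resolve to a database lookup, composites to a sum over their submodules.

```python
# Hypothetical characterization database: primitive name -> area in um^2.
AREA_DB = {"mux4_32b": 150.0, "alu_32b": 2100.0, "reg_32b": 180.0}

class Module:
    """A node in the CGRA module hierarchy; no submodules => primitive."""
    def __init__(self, name, submodules=None):
        self.name = name
        self.submodules = submodules or []

    def area(self):
        if not self.submodules:                  # primitive: database lookup
            return AREA_DB[self.name]
        return sum(m.area() for m in self.submodules)  # composite: aggregate

pe = Module("pe", [Module("mux4_32b"), Module("mux4_32b"),
                   Module("alu_32b"), Module("reg_32b")])
cgra = Module("cgra_2x2", [pe] * 4)              # regular array reuses one PE model
print(cgra.area())                               # 4 * (150 + 150 + 2100 + 180) = 10320.0
```

Because regular architectures reuse the same composite module many times, the per-module result can also be memoized so each unique composite is summed only once.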
For performance, the individual primitive timing information is insufficient to gauge
the speed performance of an application as implemented on a modelled CGRA. We
therefore incorporated full Static Timing Analysis (STA) into CGRA-ME, leveraging an
open-source STA framework – Tatum, which is also used in VTR [46]. The primitive
delays are annotated onto the timing graph of the framework, allowing a critical path
delay report to be produced (in addition to a variety of other reports). In an experimental
study, we compare the area and performance of the primitive-based model to that of a full
ASIC implementation of the modelled CGRA, and demonstrate that the rapid estimation
model produces reasonably accurate estimates. The primary contribution of this work is
the capability to perform fast and accurate area/performance estimation for any given
architecture modelled within CGRA-ME. The main contributions of this thesis are:
• We present an extension to the CGRA-ME framework which accurately estimates
chip area for a standard-cell implementation based on a database of area charac-
teristics of components commonly used in CGRAs.
• We present an extension to the CGRA-ME framework which accurately estimates
per-benchmark critical path delay on a proposed CGRA based on both component-
wise timing characteristics, and interconnect delay inferred from the fanout count
of each submodule.
• We model and analyze three architectures in the CGRA-ME framework, and com-
pare the area/performance estimates with the area/performance of the same archi-
tectures implemented in standard cells using a full ASIC CAD flow. These experi-
ments confirm the viability of using the proposed estimators within CGRA-ME to
perform architecture studies.
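The STA component of the estimator boils down to a longest-path computation over a delay-annotated directed acyclic graph. A toy version (node names and delays invented for illustration; Tatum's actual interface is far richer) looks like this:

```python
from collections import defaultdict

def critical_path_delay(edges, topo_order):
    """Longest register-to-register path in a timing DAG.
    edges: dict mapping node -> list of (successor, delay_ns) pairs.
    topo_order: nodes in topological order."""
    arrival = defaultdict(float)                  # longest arrival time per node
    for u in topo_order:
        for v, d in edges.get(u, []):
            arrival[v] = max(arrival[v], arrival[u] + d)
    return max(arrival.values())

# Toy timing graph: register -> mux -> FU -> register.
edges = {
    "reg_a": [("mux", 0.10)],
    "mux":   [("fu", 0.05)],
    "fu":    [("reg_b", 0.80)],                   # FU delay dominates
}
delay = critical_path_delay(edges, ["reg_a", "mux", "fu", "reg_b"])
assert abs(delay - 0.95) < 1e-9                   # reg_a -> mux -> fu -> reg_b
```

In the real flow, the per-edge delays come from the primitive characterization library, plus interconnect delay inferred from fanout, as described in Chapter 3.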
1.4 Thesis Outline
The thesis is structured as follows:
• Chapter 2 - Background: This chapter provides a brief survey of existing CGRA
architectures, CAD tools, and studies. The chapter overviews the CGRA-ME
framework in detail.
• Chapter 3 - CGRA-ME Estimation Engine: This chapter details the method
to generically model area and performance for CGRAs.
• Chapter 4 - Experimental Results: This chapter presents results for experi-
ments where the estimation models are applied to gauge area/performance of hypo-
thetical CGRA architectures. Comparisons with full ASIC implementations of the
given architectures are made to validate the area/performance estimation engine.
• Chapter 5 - Conclusion and Future Work: This chapter summarizes the thesis
and discusses potential future work.
Chapter 2
Background
This chapter summarizes recent CGRA literature, including various architectures and soft-
ware frameworks, and reviews material relevant to the following chapters. We introduce
a wide set of CGRAs that have appeared in the literature, and present a more detailed
examination of two architectures, ADRES and HyCUBE, that we use as test vehicles in
this thesis. Then, we provide a detailed introduction to the CGRA-ME framework and its
core components, including kernel extraction, software architecture modelling, and RTL
generation. We overview conventional procedures to realize a full standard-cell VLSI
design. Lastly, we briefly describe various existing techniques for early rapid hardware
area/performance estimation, without involvement of the lengthy VLSI CAD flow.
2.1 Existing CGRA Architectures
An excellent overview of previously published CGRA architectures appears in [13]. Here,
to provide the reader with a sense of the range of CGRA architectures proposed, we
briefly review some of the most highly-cited architectures, highlighting their differences,
as well as observing important commonalities. The Reconfigurable Pipelined Datapath
(RaPiD) [15] was proposed in 1996, aiming to accelerate computation via pipelining.
The CGRA comprises both a datapath and control, as depicted in Figures 2.1a and 2.1b,
(a) Data path portion (b) Control path portion
(c) Bus connector component between bus segments
Figure 2.1: The RaPiD accelerator [excerpted from [15,16]].
respectively. The datapath circuit, also the main CGRA portion of the design, is dynami-
cally configured every cycle by the control circuit, which consists of statically programmed
LUTs, depicted in Figure 2.1b. Figure 2.1a represents the basic cell of the 1-D array,
scalable in the horizontal direction. Each basic cell consists of one multiplier, two ALUs,
three memory units, and six registers. These components are interconnected by lanes
of word-width (16 bit) bus segments of varying lengths, separated by bus connectors.
Depicted in Figure 2.1c, the bus connector is implemented using three multiplexers, two
tristate buffers, and one register. A bus connector allows directional signaling such as
left-to-right, right-to-left, and cutoff (connected to a driver on both ends). It also al-
lows latency control, with the output multiplexer selecting input from either the source
bus segment directly, or the registered data with one cycle latency. The RaPiD-I [15]
implementation replicates 16 basic cells. Effectively, the RaPiD architecture is a LUT-
controlled CGRA. While RaPiD offers good performance, it requires careful memory
partitioning in order to maximize parallelism. Data memory bandwidth is the main
limiting factor to the performance scalability [15].
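The bus connector's behaviour can be sketched as a small cycle-accurate model (an illustrative assumption of the semantics described above, not the RaPiD netlist): a direction setting picks which side drives, and a latency setting selects either the direct value or the one-cycle-delayed registered value.

```python
class BusConnector:
    """Behavioural model of a RaPiD-style bus connector: three directions
    (left-to-right, right-to-left, cutoff) and optional one-cycle latency."""
    def __init__(self, direction="l2r", registered=False):
        assert direction in ("l2r", "r2l", "cutoff")
        self.direction = direction
        self.registered = registered
        self._reg = 0                         # the connector's pipeline register

    def step(self, left, right):
        """One clock cycle: returns (value driven rightward, value driven leftward)."""
        if self.direction == "cutoff":        # segments isolated from each other
            return None, None
        src = left if self.direction == "l2r" else right
        out = self._reg if self.registered else src   # latency-control multiplexer
        self._reg = src                       # register captures the source bus
        return (out, None) if self.direction == "l2r" else (None, out)

bc = BusConnector("l2r", registered=True)
assert bc.step(7, 0) == (0, None)             # registered path: one-cycle latency
assert bc.step(9, 0) == (7, None)             # last cycle's 7 emerges now
```

The registered/unregistered choice is exactly the latency-control knob the text describes: the output multiplexer selects between the source segment and the register.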
(a) MorphoSys M1 chip architecture overview (b) MorphoSys Reconfigurable Cell (RC) array interconnect architecture
(c) Structure of a MorphoSys RC
Figure 2.2: The MorphoSys architecture [excerpted from [55]].
Subsequent to RaPiD, the MorphoSys architecture was introduced, specifically, the
M1 chip [56]. Figure 2.2a illustrates the architecture, consisting of main memory modules
external to the chip, a Direct Memory Access (DMA) controller, instruction/data cache,
a TinyRISC processor, a frame buffer, an RC array, and context memory for the RCs.
Equipped with a framebuffer fetching mechanism, MorphoSys specializes in multimedia
applications. The TinyRISC is a modified RISC processor, and it acts as the master to
the rest of the chip, sending control signals to the DMA controller, frame buffer, and
the RC array. The DMA controller is instructed by the TinyRISC processor to retrieve
frame and context data for the frame buffer and the RC array. Another external memory
module contains program instructions and data for the processor.
Figure 2.2b depicts the interconnect architecture for the 8×8 RC array. Within each
4×4 quadrant, there exists higher interconnectivity among the member RCs, with each
member RC accepting inputs from the nearest four neighbors, other RCs from the same
row and column, and crossbar output(s) if adjacent to another 4×4 quadrant. Each 4×4
quadrant is interconnected to its adjacent quadrants by the RCs on the corresponding
side, and these RCs are fully connected. Figure 2.2c depicts the implementation of the
RC, featuring two input multiplexers, an ALU with multiplier, shifter, output register,
feedback register file, as well as the context register which configures the components.
MorphoSys demonstrated superior performance on video compression, automatic target
recognition, and data encryption/decryption applications [55].
(a) PipeRench architecture overview (b) Structure of a PE in a PipeRench stripe
(c) Structure of a FU in a PipeRench PE
Figure 2.3: The PipeRench architecture [excerpted from [53]].
The PipeRench architecture [20], from Carnegie Mellon University, was introduced
around the same time as MorphoSys, and was later commercialized by Rapport [2].
Figure 2.3a depicts the general organization of the PipeRench architecture, featuring
multiple stripes, with each stripe being a 1D array of PEs. Each stripe is capable of
realizing a pipeline stage.
Figure 2.4: Mapping a (a) 5-virtual-stage application onto a (b) 3-physical-stage/stripe PipeRench system [excerpted from [19]].
When mapping an application to PipeRench, the application is first represented in
virtual stages, depicted in Figure 2.4a. Depending on the actual available physical re-
sources, the virtual stages can then be mapped onto a set of physical stages, depicted in
Figure 2.4b, where each physical stage corresponds to a stripe. As shown in Figure 2.3b,
each PE consists of an FU, shifters for each input to the FU, and a register file. However,
as depicted in Figure 2.3c, the PipeRench FU is unlike that of most other CGRAs, which
usually contain an ALU. The FU in PipeRench consists of 8 identically configured 3-LUTs, a
carry-chain, and a zero detector. Each 3-LUT is driven by: 1) one bit from input bus A,
2) one bit from input bus B, and 3) an X input from another FU, which drives the last
input signal of all 3-LUTs. The uniform 3-LUT array, along with the carry-chain, allows the
FU to perform 8-bit addition, subtraction, or arbitrary bit-wise manipulation. The X
input can be either carry-out from an adjacent PE or zero, making it possible to combine
more than one PE, forming wider arithmetic operations. The throughput of PipeRench
is dependent on the number of physical resources. Suitable applications, such as streaming
applications, exhibit high data locality.
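To make the FU organization concrete, here is a small behavioural sketch (our own simplification, not the actual PipeRench netlist; the function names and truth-table encoding are assumptions):

```python
# Behavioural sketch of a PipeRench FU: 8 identically configured 3-LUTs
# compute a bit-wise function of the A bit, B bit, and X input, while a
# dedicated carry chain extends this to 8-bit addition.
def make_lut3(truth):  # truth: 8-bit int; bit i = output for input pattern i
    return lambda a, b, x: (truth >> (a | (b << 1) | (x << 2))) & 1

def fu(a, b, x, truth, use_carry_chain=False):
    lut = make_lut3(truth)            # every LUT holds the same configuration
    out, carry = 0, x if use_carry_chain else 0  # X can seed the carry chain
    for i in range(8):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if use_carry_chain:           # sum bit = (a ^ b) ^ carry
            out |= (lut(ai, bi, 0) ^ carry) << i
            carry = (ai & bi) | (carry & (ai ^ bi))
        else:                         # pure bit-wise manipulation
            out |= lut(ai, bi, x) << i
    return out & 0xFF
```

For example, truth table `0x66` encodes `a ^ b`, so `fu(a, b, 0, 0x66, use_carry_chain=True)` performs an 8-bit add, while truth table `0x88` (`a AND b`) yields a bit-wise AND.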
(a) DRP-1 prototype architecture overview (b) Dynamically Reconfigurable Processor (DRP) tile, containing a State Transition Controller (STC) and 64 PEs (c) PE of a DRP tile
Figure 2.5: The DRP architecture [excerpted from [58]].
A commercial CGRA example is the DRP from NEC Corp. (now Renesas) [1, 45].
The implementation is described in [58]. Figure 2.5a gives a bird’s-eye view of the
architecture. As described in [23], the chip comprises the DRP core interfacing
via I/O ports, eight multipliers around the tiles, an external memory controller, and
a PCI controller. DRP-1 operates in a hierarchical fashion: 1) instructions for the 64
PEs within each tile are provisioned by a local Finite State Machine (FSM), called the
STC, located at the center of Figure 2.5b; 2) contexts for the 8 tiles within the DRP core
are provisioned by the Central STC (CSTC), located at the center of Figure 2.5a. The
structure of a tile is shown in Figure 2.5b, where the PEs and STC are surrounded by
memory units, with single-ported memory units at the top and bottom sides, labeled as
HMEM, as well as dual-ported memory units with their controllers at the left and right
sides, labeled as VMEM and Vmemctrl. As depicted in Figure 2.5c, each PE consists
of a Data Management Unit (DMU) capable of shifting and masking, an ALU without a
multiplier, an instruction table, an output register, and a register file. The details on inter-
PE and inter-tile interconnect architecture are not fully disclosed, but from Figure 2.5c,
we can see that a PE appears to possess flexible interconnect, with input and output bus
selectors from perhaps many PEs within the same tile.
The DRP architecture possesses many advantages over previously proposed CGRAs.
Much like MorphoSys and PipeRench, the DRP is scalable with a tile as the base unit.
Much like RaPiD, the DRP comes with control-flow capability, while equipped with
more complex organization and better scalability. A distinctive feature of the DRP is
the highly distributed memory units. While these impose data partitioning requirements
on the application, they are likely useful in many applications, particularly streaming.
Notably, we observe that all of these architectures consist of a similar set of basic modules.
2.2 ADRES Architecture
The Architecture for Dynamically Reconfigurable Embedded System (ADRES) CGRA
architecture [43] was proposed to overcome a perceived performance bottleneck of other
CGRA processor/accelerator systems arising from performance disparities between the
reconfigurable array and the connected processor. In ADRES, the processor is closely
coupled with the reconfigurable fabric. Figure 2.6a shows the ADRES system at a high
level, where instructions and data for the ADRES core are supplied by an external mem-
ory module. Figure 2.6b shows the ADRES core, where the FU and RC are merely labels
distinguishing PEs in the two operating modes.
(a) ADRES system overview: consisting of the ADRES core, instruction and data cache, and external memory. (b) ADRES core: CGRA body of the ADRES system, partitioned into the VLIW view and the reconfigurable matrix view. (c) RC: an array element in an ADRES core. (d) Dynamically Reconfigurable Embedded System Compiler (DRESC): compiler flow diagram.
Figure 2.6: The ADRES CGRA system [excerpted from [43]].
Dashed lines in the figure show the VLIW view and the reconfigurable matrix view
(CGRA). The top row of PEs, and a multi-ported register file are used in both views.
The ADRES PE is shown in Figure 2.6c. It consists of a local register file (RF) and an ALU
(labelled as FU in the figure). The inputs to the FU are two operands and a predicate.
Additionally, the ADRES architecture is a “template” instead of a fixed architecture,
allowing customization at various levels with an XML-based Architecture Description
Language (ADL).
In the VLIW mode, ADRES behaves like a VLIW processor, with instructions fetched,
decoded, dispatched, and so on. This mode is suitable for control-intensive program
segments. The CGRA mode is suitable for highly parallelized dataflow computation.
The two modes share data with one another through the shared register file. Effectively,
both the control and data path reside on the PE grid, efficiently reusing computational
resources in the core, and reducing the cost of data movement.
The DRESC compiler framework [42] is used to map applications onto ADRES. The
workflow is illustrated in Figure 2.6d, employing a C compiler frontend, producing the
Intermediate Representation (IR) of the program. The program is then partitioned into
a control-path and data-path. The control-path portion IR is compiled into Very-Long
Instruction Word (VLIW) instructions, to be executed in VLIW-mode. The ADL is
interpreted into an in-memory architecture model, called the Modulo Routing Resource
Graph (MRRG), which at a high level, is a graph-based representation of the CGRA
device. By MRRG scheduling, placement and routing, the data-path portion of the
IR is mapped and compiled into CGRA configurations, to be executed during CGRA-
mode. The work revolving around DRESC and ADRES has inspired many later works
on CGRAs, including the CGRA-ME project. In Chapter 4, Sections 4.3 and 4.4, we will
use ADRES as a target architecture to showcase Design Space Exploration (DSE) based
on CGRA-ME.
2.3 HyCUBE Architecture
Recently, the HyCUBE architecture [30] was proposed, featuring richer interconnect than
previous CGRAs. The name HyCUBE alludes to the concept of a high-dimensional
cube – a hypercube. The authors of HyCUBE argue that
Figure 2.7: The HyCUBE architecture. The CGRA consists of a 2D array of FUs connected by multi-hop-capable crossbar switch interconnect [excerpted from [30]].
the nearest-neighbor interconnect topology of existing architectures, and the requirement
to use FUs as route-throughs, introduce performance loss and mapping difficulty, leading
to higher Initiation Intervals (IIs) – the number of cycles between injections of new inputs into
the fabric. The authors argue that a flexible multi-hop interconnect would be a better
design choice.
The HyCUBE architecture is a 2D array of PEs, as shown in Figure 2.7. Each PE re-
ceives inputs from neighbouring PEs (North, South, East, West). The input registers to
the crossbar switch can be bypassed. The PE also contains a predicated ALU, a bypass-
able register on the ALU output, and most importantly, a crossbar switch, which drives
all neighbour PEs. The crossbar switches collectively realize an interconnection network
allowing arbitrary source/destination pairs to be routed with arbitrary cycle latency. This
style of interconnect closely resembles an island-style FPGA routing network, except that
routing is performed at the granularity of multi-bit buses.
The compiler framework of HyCUBE also models the architecture with an MRRG,
and uses heuristics to find a viable mapping with different cost functions, and by incre-
menting the allowed II. The HyCUBE authors claim that the crossbar switches contribute
a quarter of both total power consumption and chip area, while enabling much shorter
compilation runtime and higher throughput per watt. In Section 4.5, we will use CGRA-
ME to model and evaluate the area and performance of HyCUBE.
2.4 Existing CGRA CAD Tools
There are various architecture-specific CGRA DSE frameworks, but there are very few
generic CGRA modelling and exploration frameworks. In [9], a commercial framework
was extended to support high-level modelling of CGRAs. The commercial architecture
description language (ADL), LISA, was extended to support a CGRA coprocessor de-
scription, while the application mapping is handled by the Coarse-Grained SMAC (CG-
SMAC) algorithm inspired by Simultaneous Mapping and Clustering (SMAC) for FP-
GAs [37]. Schedule, Place, and Route (SPR) [18] is another generic CGRA mapping tool,
in which the mapping sub-problems – scheduling, placement, and routing – are solved using
Iterative Modulo Scheduling (IMS) [51], Simulated Annealing [33], and QuickRoute [36]
plus PathFinder [41], respectively.
2.5 CGRA-ME Framework Overview
CGRA Modelling and Exploration (CGRA-ME), as the name suggests, is a tool which
offers architecture exploration for CGRAs, as well as permitting research on CGRA CAD
algorithms. With CGRA-ME, both the architecture specification and an application
benchmark are inputs to the toolflow. CGRA-ME permits the scientific evaluation and
comparison of hypothetical CGRA architectures.
The CAD flow depicted in Figure 2.8 progresses from top to bottom. The main
user inputs are the architecture description and an application benchmark. Through
the LLVM kernel-extraction pass, key computations of the application benchmarks are
extracted and represented as Dataflow Graphs (DFGs). The architecture interpreter
accepts an architecture specification (either using XML or the C++ Application Program-
ming Interface (API)) as input and builds an in-memory device model of the architecture.
Figure 2.8: CGRA-ME framework overview
The in-memory model is an MRRG [42], which consists of nodes and edges. Nodes repre-
sent the CGRA's functional units, multiplexers, register files, I/Os, multi-bit buses, and
so on, and may be annotated with additional data, such as cycle latency. Edges repre-
sent electrical connectivity between nodes. As a whole, an MRRG models how a CGRA
functionally behaves. The in-memory architecture model can also be used to generate a
Verilog implementation of the architecture.
The mapping step will schedule, place, and route the DFG onto the MRRG. That
is, in mapping, each computation in the DFG must be associated with a functional unit
node in the MRRG, and each connection between nodes in the DFG (data dependency)
must be mapped to a series of routing nodes within the MRRG, thereby connecting the
relevant functional unit nodes, accordingly. The mapper offers two choices of algorithm
– a simulated-annealing-based approach [12], and an Integer Linear Programming (ILP)-
based approach [10]. If the mapping is feasible, this implies the modelled CGRA can be
configured to realize the computations of the DFG. The mapping result and architecture
model can then be used to generate a bitstream to configure the CGRA, and verify
functionality through RTL simulation.
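As a toy illustration of the routing requirement (assumed data layouts and names; this is not the CGRA-ME mapper), a mapping is feasible only if every DFG edge can be realized as a path of MRRG routing nodes between the mapped functional-unit nodes:

```python
from collections import deque

def route_exists(mrrg_adj, src, dst):
    """BFS over MRRG nodes; mrrg_adj maps a node to its fanout nodes."""
    seen, frontier = {src}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            return True
        for nxt in mrrg_adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

def mapping_feasible(dfg_edges, placement, mrrg_adj):
    """placement: DFG operation -> MRRG functional-unit node."""
    return all(route_exists(mrrg_adj, placement[u], placement[v])
               for u, v in dfg_edges)
```

A real router must additionally prevent distinct signals from sharing routing nodes; this sketch checks reachability only.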
Figure 2.8 also highlights the contribution of this thesis: performance and area esti-
mation following the mapping step. There are two inputs to the estimation engine: 1) the
application benchmark as mapped onto the modelled CGRA; 2) profiles of the area and
delay of commonly used CGRA primitives. The estimation engine and characterization
data are elaborated upon in the next chapter.
2.6 ASIC Design Flow
We briefly review the steps of the standard-cell ASIC design flow, as this is relied upon in
subsequent chapters. The flow is shown in Figure 2.9. It consists of the following steps:
Figure 2.9: Typical steps involved in VLSI design.
1) Synthesis/Technology Mapping: An HDL file describing circuitry is input to the
synthesis tool. The logic functions in the circuit are optimized and its functionality
is mapped into standard cells drawn from a library. The synthesis tool accepts
area, timing and power objectives from the user and attempts to produce an im-
plementation that meets the user’s constraints.
2) Post-Synthesis Verification: The output circuit from step 1) undergoes verification,
involving early STA to verify timing validity, and simulation to verify functionality
and power/performance. If verification results meet all requirements, the netlist
files will be input to step 3). Otherwise the user goes back to step 1), iteratively
adjusting the implementation or synthesis objectives accordingly.
3) Place-and-Route (PnR)/Chip Layout: The standard-cell netlist is placed and routed
on the chip, again subject to user-provided constraints, such as floorplan, poros-
ity, and timing constraints. Following routing, interconnect parasitic capacitance,
resistance, and inductance can be extracted. These permit accurate post-layout
analysis of timing and power.
4) STA: With parasitic information extracted in step 3), a precise STA is per-
formed. Depending on the results, the user either proceeds to step 5), or revises
the floorplan/constraints in step 3) and re-performs PnR. In the worst case, the
user goes back to step 1), revises the synthesis objectives, and re-synthesizes the
design.
5) Post-Layout Verification: The final verification, which extensively verifies the design
layout in all aspects, such as timing, power, and functionality. After passing all
verification tests, the design would be ready for fabrication.
Although dependent on the size of the targeted design, the standard-cell design process is
expensive in many respects: engineering effort, CAD runtime at each step, and the cost of
proprietary library/tool licenses and fabrication. As the design process moves
toward the final verification stage, designers gradually gain confidence in how well
the design will perform post-fabrication. The length of the process, however, motivates
fast and accurate estimation of performance at an early stage. In Chapters 3 and 4,
we will employ the above VLSI design methodologies on various levels, from primitive
modules, to top level architecture.
2.7 Prior Work on Hardware Performance and Area
Estimation
For area/performance estimation of FPGAs, CPUs, and GPUs, there are pop-
ular, publicly available frameworks. VTR comes with a built-in performance and area
estimation engine for FPGAs. STA in VTR is carried out by Tatum, which models tim-
ing of FPGA primitive blocks, such as LUTs, flip-flops (FFs), Random-Access Memories
(RAMs), etc., and estimates critical path delay. GEM5 [5] is a CPU modelling and
simulation framework; given the operating frequency, cache sizes, and memory speed of
a targeted CPU system, it was shown to produce accurate estimates [6]. However, the
authors of [6] model an already-produced architecture and system. Hence, when designing
a new architecture from scratch, GEM5 offers limited capability to preview system per-
formance. McPAT [35], a multicore CPU power, area, and
timing modelling framework, was shown to be a viable extension to GEM5 [17], for core
area estimation. GPGPU-Sim [3], a General-Purpose Graphic Processing Unit (GPGPU)
simulation framework, models an architecture at a functional level, omitting many mi-
croarchitecture details. This makes GPGPU-Sim limited in ways similar to GEM5, and it
requires dedicated third-party tools for timing, area, and power estimation. In [47], the
authors pointed out some pitfalls and limitations of these existing CPU/GPU simulation
and modelling frameworks.
There is also prior work on modelling the area and performance of arbitrary systems.
CompEst and ChipEst [48] are accurate, fine-grained performance and area
modelling techniques for high-level design. Each ba-
sic building block is realized in various implementations/topologies with standard-cell
technology, and performance results of all variants are later used in a top-level estimator
to select implementations for all instances of basic building blocks used in a high-level
design, while satisfying all specified design requirements. The tool then reports an es-
timate of total area. In [40], the authors proposed a technique to accurately estimate
circuit interconnect delay, based on gate attributes such as pins, area, fanin and fanout
numbers. [49] presents wireload-aware synthesis and floorplanning, emphasizing the im-
portance of modelling interconnect. Another work [24] proposes an accurate early wire
characterization technique, employed in an alternative design flow.
For CGRAs, there is prior work on architecture DSE. Suh et al. [57] de-
scribed architectural DSE for the Samsung Reconfigurable Processor (SRP) (a variation
of ADRES [31]), which improved the performance and chip area of the SRP. Another
work [32] demonstrated DSE on a Reconfigurable Array Architecture (RAA), improving
performance, chip area, and power. Both DSE studies provide performance, area, and
power results.
However, there are fewer efforts providing architectural area and performance mod-
elling for CGRAs. In [4], the authors provided a detailed study on how different config-
urations of CGRA processing elements can affect performance measured in cycle-count
latency, but the paper does not elaborate on critical path delay since they did not specify
the implementation technology. In [61], we see performance and area estimation for a
CGRA implemented as an FPGA overlay. Although the above studies include area and
performance results, they are either tied to one specific CGRA architecture or implemen-
tation technology.
The CGRA-ME framework offers a generic mapper, targeting a user-specified archi-
tecture, and this thesis extends the framework to estimate performance and area by
accepting a user-specified area/performance profile based on a user-selected implemen-
tation technology.
Chapter 3
CGRA-ME – Estimation Engine
In this chapter, we discuss how the estimation engine in CGRA-ME is implemented. We
hypothesize that area and performance of a modelled CGRA architecture can be gauged
if area and performance of all sub-components can be characterized. For performance
estimation, we will describe why and how interconnect is taken into account. We will
also clarify several factors affecting how closely the results from the estimation engine
reflect an actual implementation. The estimation engine described in this chapter will
be used in the experiments detailed in Chapter 4.
3.1 Architecture Modelling in CGRA-ME and the
Primitive Modules
In Chapter 2, we described previous CGRA architectures. While these CGRA architec-
tures vary, they primarily consist of a similar set of primitive submodules. In CGRA-ME,
we provide a set of primitives for architecture designers to create the software architecture
model. The primitives are:
1) op_add: an adder unit
2) op_sub: a subtractor unit
3) op_mul: a multiplier unit
4) op_shl: a left shift unit
5) op_ashr: an arithmetic right shift unit
6) op_lshr: a logical right shift unit
7) op_and: a bitwise AND unit
8) op_or: a bitwise OR unit
9) op_xor: a bitwise XOR unit
10) mux_*to1: multiplexers of various sizes
11) register: a set of edge-triggered flip-flops
12) registerFile: an array of registers readable and writable through multiple ports
13) tristate: a tristate buffer, mainly used in external I/O ports
14) mem_unit: a memory unit, offering data load and store capabilities
With the above primitive modules, CGRA-ME can model many generic CGRA architec-
tures, such as ADRES and HyCUBE.
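As a rough behavioural reference for the arithmetic/logic primitives (our own sketch, not CGRA-ME code; masking emulates a 32-bit datapath, and inputs are assumed to already be 32-bit values):

```python
# Bit-accurate reference semantics for the 32-bit ALU primitives.
MASK = 0xFFFFFFFF  # emulate a fixed 32-bit datapath

PRIMS = {
    'op_add':  lambda a, b: (a + b) & MASK,
    'op_sub':  lambda a, b: (a - b) & MASK,
    'op_mul':  lambda a, b: (a * b) & MASK,
    'op_shl':  lambda a, b: (a << (b & 31)) & MASK,
    'op_lshr': lambda a, b: (a & MASK) >> (b & 31),
    # Arithmetic shift: reinterpret a as signed before shifting.
    'op_ashr': lambda a, b: ((a - ((a >> 31) << 32)) >> (b & 31)) & MASK,
    'op_and':  lambda a, b: a & b,
    'op_or':   lambda a, b: a | b,
    'op_xor':  lambda a, b: a ^ b,
}
```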
Modelling an architecture in CGRA-ME is very similar to the process of RTL de-
sign. CGRA-ME constructs a tree data structure to represent a hierarchy of modules,
with bottom-level nodes comprised of only primitive modules. Each node encapsulates
the connections among its submodules/child-nodes with a list. The module tree data
structure is used to construct the corresponding MRRG of the architecture, which is a graph
representing the architecture functionality (nodes) and connectivity (nodes and edges).
Each primitive module comes with a corresponding MRRG representation. When con-
structing a full architecture MRRG, the tool flattens the module tree data structure, and
connects all sub-MRRGs accordingly.
(a) Module object representing the PE.
(b) MRRG representing the routing and functionality of the PE
Figure 3.1: Illustrations of the two data structures in CGRA-ME representing a PE with 3 contexts.
Figure 3.1 depicts how an example PE with three contexts and three supported oper-
ations is represented in CGRA-ME. The module tree data structure closely resembles the
physical hierarchy of the architecture, while the MRRG closely resembles the operational
structure of the architecture. When creating the MRRG, a non-primitive module contains
all lower-level MRRGs, along with lists of ports and connections used to connect its
sub-MRRGs. We hypothesize that when the area and performance characteristics of the
primitive modules are known, realistic area and performance estimation can be performed
by leveraging the existing data structures.
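The flattening step can be sketched as follows (hypothetical class and field names, not the actual CGRA-ME C++ classes):

```python
class Module:
    """A node in the module tree: child modules plus a connection list."""
    def __init__(self, name, children=(), connections=()):
        self.name = name
        self.children = list(children)
        self.connections = list(connections)   # (src_port, dst_port) pairs

def flatten(module, edges=None):
    """Collect every connection in the hierarchy into one flat edge list,
    mirroring how sub-MRRGs are stitched into the full architecture MRRG."""
    edges = [] if edges is None else edges
    edges.extend(module.connections)           # this level's connectivity
    for child in module.children:
        flatten(child, edges)                  # recurse down the hierarchy
    return edges
```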
3.2 Characterization of Primitive Modules
Figure 3.2: Characterization steps.
Figure 3.2 shows the steps to characterize primitive area and performance. We first
retrieve the HDL files of the target architecture, which include all primitive modules
used in it. The primitive modules are synthesized in the technology of interest. In this
thesis, we target the 45nm NanGate FreePDK45 Generic Open Cell Library [54]. After
synthesis, we obtain area and performance characteristics of the primitives. The char-
acterization results are stored in a database. The database entries then serve as inputs
to the area/performance estimation engine. The approach is analogous to that used for
commercial FPGAs, wherein delay characterization data for key sub-circuits are stored
in a database, whose entries are then recalled during STA. In addition to the charac-
terization of primitive modules, we also modelled interconnect delay, elaborated upon
in Section 3.5. To obtain primitive module characterization, as well as an interconnect
delay model, we use the following CAD tools:
1) Technology mapping/synthesis: Synopsys Design Compiler
2) Place and route: Cadence Innovus
3) Timing analysis: Synopsys PrimeTime
In an ASIC standard-cell flow, constraints can be applied during technology map-
ping to optimize for area, delay, or a combination of the two. Individual cells (e.g. 2-
input NAND) are available in multiple drive strengths, and delay-driven mappings will
tend to use larger cells with higher drive strengths. CGRAs may be used in a high-
performance context or a low-power embedded context and, as such, we opted to gen-
erate two databases of area/performance values for the primitives: an area-optimized
database wherein the ASIC tools were executed with a minimum-area objective, and a
delay-optimized database wherein the ASIC tools were executed with a minimum-delay
objective. During synthesis, a minimum-area objective will guide the tool to select smaller
cells with weaker drive strengths (slower signal transitions), whereas a minimum-delay
objective will guide the tool to select larger cells with stronger drive strengths (faster sig-
nal transitions). The two databases permit a human architect user of CGRA-ME to select
between the databases according to the intended CGRA usage. However, should the user
have custom constraints for the primitive modules, the tool is capable of incorporating a
user-generated database, too.
Layout quality and cell variety of a standard-cell library are the primary factors
dictating how well the synthesis can align the design with the synthesis constraints;
however, there are other factors that require attention from designers. A synthesis tool
can often realize a design using only standard cells from a library, but there are
many occasions when the synthesis tool must infer the implementation of part(s) of a design. In a
Verilog design, arithmetic operations are often expressed with generic syntax: “assign c
= a + b;”, which requires the synthesis tool to infer implementation of the “+” operator
from a library of implementations. This is why proprietary synthesis tools are sometimes
equipped with an Intellectual Property (IP) core library. For instance, Synopsys supports
the Design Compiler with their DesignWare IP core library, which contains a wide range
of arithmetic operations.
Verilog implementations generated by CGRA-ME employ the generic syntax, allow-
ing users to select their preferred IP core library. When synthesizing ALU operation units
such as op_add and op_mul, we leverage the DesignWare library [28]. However, when an
HDL file is not sufficiently specific, the synthesis tool may select an IP core unintended
by the designer. For instance, we observed that when synthesizing op_mul without spec-
ifying an IP core, Design Compiler ended up selecting an unsigned and a signed multiplier
for the area-optimized and delay-optimized targets, respectively, which are functionally
different. For our experiments in Chapter 4, both area-optimized and delay-optimized
implementations must have the same functionality in order to draw fair comparisons. We
use the unsigned multiplier when synthesizing op_mul for both targets.
Table 3.1 shows a portion of the two databases used in Chapter 4. The left-most
column lists the primitive. The next two columns give the area, in square microns, for
each primitive in each of the two databases. These numbers reflect the total silicon real
estate required for the standard-cell transistors, plus the additional space required for
PnR. They are derived by dividing the total cell area by the layout utilization factor/density
(standard-cell area per core area). A higher utilization factor implies heavier congestion
and longer PnR runtime. Since the CGRAs discussed in Chapter 4 are relatively
small, we set the utilization factor to 0.8, while a typical value is around 0.7 [22, 27].

                         Area [µm²]                    Critical Path Delay [ns]
Target            Area Optimized  Delay Optimized    Area Optimized  Delay Optimized
op_add 32b               168.00          536.00               2.78            0.37
op_sub 32b               190.00          539.00               2.80            0.40
op_multiply 32b         2860.00         3008.00               1.12            1.10
op_and 32b                43.00           43.00               0.03            0.03
op_or 32b                 43.00           74.00               0.05            0.04
op_xor 32b                64.00           64.00               0.06            0.06
op_shl 32b               456.00          491.00               0.53            0.43
op_ashr 32b              456.00          470.00               0.53            0.45
op_lshr 32b              456.00          470.00               0.53            0.45
mux_2to1 32b              74.00           88.00               0.06            0.07
mux_4to1 32b             147.00          166.00               0.06            0.06
mux_5to1 32b             179.00          283.00               0.06            0.11
mux_6to1 32b             215.00          245.00               0.09            0.07
mux_7to1 32b             252.00          340.00               0.07            0.07
mux_8to1 32b             286.00          325.00               0.07            0.07
RF_1in_2out 32b         1123.00         1231.00               0.07            0.08
RF_4in_8out 32b         9307.00        11568.00               0.17            0.20
register 32b             214.00          222.00               0.01            0.01
tristate 32b              86.00          102.00               0.41            0.22
const 32b                214.00          222.00               0.01            0.01

Table 3.1: Database of area and critical-path delay of the primitive modules as mapped
into the NanGate FreePDK45 45nm standard-cell library.

The right-most two columns give the delay, in ns, for each of the primitives. The
delay values recorded for register, RF_1in_2out, and RF_4in_8out are the delay (wire,
combinational, or both) from input pin to the D-input of the registers. While not listed
in Table 3.1, clock-to-Q delays, setup and hold times of the registers are also entries in
the databases.
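The cell-area-to-silicon-area conversion described above amounts to a simple division (an illustrative helper; the 134.4 µm² figure is a made-up example input):

```python
def core_area(total_cell_area_um2, utilization=0.8):
    """Reported silicon area = standard-cell area / layout utilization."""
    return total_cell_area_um2 / utilization

# e.g. 134.4 um^2 of cells at 0.8 utilization yields ~168 um^2 of silicon
```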
From Table 3.1, we observe that in a few cases, the delay for the delay-optimized target
is actually slightly higher than for the area-optimized target, such as for mux_2to1,
mux_5to1, RF_1in_2out, and RF_4in_8out. While these delay differences are under
0.05ns, there are a few factors that can lead to these results. When a targeted design is
trivial, and only a small set of cells is available to the synthesis tool, there is a higher
chance that the synthesis results may not align well with the synthesis objectives. Most
of the primitive modules are trivial in design, and while there is open-source merit in
using the NanGate FreePDK45 library, it has reduced cell variety compared to propri-
etary standard-cell libraries. While it is possible to cherry-pick these cases and
modify their constraints or augment the HDL design towards the synthesis objective, we
decided to keep synthesis constraints and HDL files the same for consistency. Broadly
speaking, we observe that the delay-optimized primitives are larger and faster than the
area-optimized primitives.
Note that the estimation engine is independent of the specific standard-cell target
technology, because the ASIC design flow is the same regardless of technology node or
standard-cell library. This means that users can redefine entries in the database based on
the standard-cell and IP libraries available to them, and the results of our CGRA
estimation engine will properly reflect the target technology and IP. The databases are
human-readable INI files and easy to read/modify.
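For instance, such an INI database can be parsed with Python's standard configparser (the section and key names below are illustrative, not the exact keys used by CGRA-ME):

```python
import configparser

# Illustrative database fragment in INI form (values from Table 3.1).
ini_text = """
[op_add_32b]
area_um2 = 168.0
delay_ns = 2.78

[mux_4to1_32b]
area_um2 = 147.0
delay_ns = 0.06
"""

db = configparser.ConfigParser()
db.read_string(ini_text)
area = db['op_add_32b'].getfloat('area_um2')   # -> 168.0
```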
3.3 Area Modelling
As mentioned previously, the CGRA-ME framework has an architecture interpreter,
which constructs a tree data structure modelling the targeted CGRA. The architec-
ture model is represented hierarchically, with a top-level module representing the entire
CGRA, and second-level modules representing CGRA tiles in the two-dimensional array,
and so on. Within CGRA-ME, an architect may specify modules with arbitrary levels of
hierarchy. However, at the bottom of the module hierarchy lie primitive modules, which
align with those discussed in the previous section, for which area and delay data are
contained in the database. Figure 3.3 depicts the data structure modelling an example
architecture with a 3×3 PE-grid, 3 I/O ports, and 3 memory ports.
Area estimation therefore performs a depth-first traversal, aggregating area at every
module visited, with the top-level module as root. The area of each primitive module is
drawn from the database and aggregated upwards.

Figure 3.3: An example CGRA architecture modelled as a tree of module objects in
CGRA-ME. The modules filled in blue are the primitive modules, while the ones filled
in red are non-primitive/composite modules. During area estimation, the modules with
red solid outlines require either a database lookup (primitive) or summation of all
submodule areas (composite). In a highly regular architecture, there can be multiple
instances of each unique composite module; hence, in this example, only a few composite
modules require summation and have red solid outlines.

The grid structure of CGRAs implies
that many modules are repeatedly instantiated with the same parameters, and for these
cases, recomputation is not required for every instance. During the traversal, each
unique non-primitive/composite module, after its area is computed, is given a
new entry in the area characterization database for reuse. In Figure 3.3, the module
instances with a red border are the only ones requiring an entry in the characterization
database or accumulation from submodules. At the end of the traversal, an estimate of
the total CGRA area is available and reported to the user. Likewise, a report also shows
the estimated area at each lower level of the hierarchy, giving the architect visibility
into the breakdown of area for the modelled CGRA. In Chapter 4, we compare this
straightforward primitive-aggregating estimation approach to the actual area after a full
ASIC PnR of a CGRA.
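The traversal just described can be sketched in a few lines. This is a minimal illustration of the approach only; the class, module kinds, and area values below are invented, not CGRA-ME's actual API or data:

```python
# Depth-first area aggregation with memoization of unique composite modules.
# Primitive areas come from the characterization database; every unique
# composite module is costed once and cached, so repeated tiles cost a lookup.

class Module:
    def __init__(self, kind, submodules=()):
        self.kind = kind                    # unique type name, e.g. "pe"
        self.submodules = list(submodules)

# Per-primitive areas (um^2), as produced by isolated synthesis (invented).
area_db = {"op_add": 120.0, "mux4": 45.0, "reg32": 210.0}

def estimate_area(module, cache=None):
    if cache is None:
        cache = dict(area_db)               # primitives are pre-seeded
    if module.kind in cache:                # primitive, or composite seen before
        return cache[module.kind]
    total = sum(estimate_area(sub, cache) for sub in module.submodules)
    cache[module.kind] = total              # reuse for every identical instance
    return total

pe = Module("pe", [Module("op_add"), Module("mux4"), Module("reg32")])
cgra = Module("cgra", [pe] * 9)             # 3x3 grid of identical PEs
print(estimate_area(cgra))                  # 9 * (120 + 45 + 210) = 3375.0
```

With the cache, each of the nine PE instances after the first costs a single dictionary lookup, mirroring the reuse of characterization-database entries described above.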
3.4 Performance Modelling and STA
As with an FPGA, the critical path of an application benchmark implemented on a
CGRA depends on the mapping, placement and routing of the application within the
CGRA, as well as the circuit delays within the CGRA device (from the characterization
database discussed in Section 3.2).
We integrated an open-source STA engine into CGRA-ME. The STA engine, called
Tatum, is also used within the VTR project [46]. Tatum performs timing analysis using a
timing graph, wherein nodes represent pins on electrical components, and edges represent
connections between pins. Delays are then annotated onto the edges of the timing graph.
Tatum has an easy-to-use C++ API that allows one to create the timing graph and perform
the delay annotation. Following this, Tatum performs timing analysis on the graph, and
can generate a critical-path delay report. That is, Tatum includes the functionality for
the familiar STA tasks of forward delay propagation to find the worst-case timing paths
in a design, and backward propagation of slacks [25] to find the timing slack on each
edge of the graph. Tatum can also be extended to accept a Synopsys Design Constraints (SDC) file as input, to allow user control over the timing analysis (e.g. setting false paths, or selecting specific paths to analyze).
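The forward/backward propagation just described can be sketched compactly. This is a toy illustration of the algorithm on an invented five-edge graph, not Tatum's C++ API:

```python
# Forward propagation of arrival times finds the worst-case path; backward
# propagation of required times yields the slack on every edge.

edges = {  # (source pin, sink pin): edge delay in ns (invented values)
    ("reg.Q", "mux.in"): 0.07,
    ("mux.in", "mux.out"): 0.10,
    ("mux.out", "mult.in"): 0.82,
    ("mult.in", "mult.out"): 1.10,
    ("mult.out", "reg.D"): 0.06,
}
topo = ["reg.Q", "mux.in", "mux.out", "mult.in", "mult.out", "reg.D"]

# Forward pass: latest arrival time at each pin (0.0 at primary inputs).
arrival = {}
for v in topo:
    fanin = [arrival[u] + d for (u, w), d in edges.items() if w == v]
    arrival[v] = max(fanin, default=0.0)

# Backward pass: earliest required time at each pin, then per-edge slack.
required = {}
for v in reversed(topo):
    fanout = [required[w] - d for (u, w), d in edges.items() if u == v]
    required[v] = min(fanout, default=arrival[topo[-1]])

slack = {(u, w): required[w] - arrival[u] - d for (u, w), d in edges.items()}

print(round(arrival["reg.D"], 2))  # 2.15 -> critical-path delay
```

Because this example is a single chain, every edge lies on the critical path and all slacks are zero; in a real timing graph, off-critical edges carry positive slack.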
To integrate Tatum into CGRA-ME, we use the mapping results of the application
benchmark’s DFG onto the architecture model MRRG. In essence, we walk the used
part of the CGRA for the application benchmark, creating a partial MRRG, which will
serve as an input to create a timing graph in Tatum. Using the same example PE from
Figure 3.1, Figure 3.4 depicts an MRRG of a three-context PE, where the nodes and
edges highlighted in red are the used part of the CGRA. In order to create a timing
graph from the partial MRRG, all primitive modules involved in this MRRG will have their corresponding timing graph representations created and connected.

Figure 3.4: MRRG of a 3-context PE, with the mapped resources highlighted in red.

In Figure 3.5,
we see the three entities represented in Tatum’s timing graph:
1) Combinational entity: primitive modules with only combinational delay, such as
adder, multiplier, multiplexer, etc.
2) Register entity: primitive modules with clock inputs, such as registers and register files, which require annotation of setup time, hold time, clock-to-Q delay, and clock skew (reported by per-primitive post-synthesis STA).
3) Interconnect delay entity: interconnect wires among primitive modules.
For example, in Figure 3.5a, the used input and output pins on a CGRA multiplexer
become nodes in Tatum's timing graph, connected by an edge. The delays on the edges of the timing graph are then annotated based on the characterization database discussed above. How interconnect delay is inferred will be discussed in the following section. Figure 3.6 depicts the resulting timing graph generated from the MRRG shown in Figure 3.4.

(a) An edge pointing from an input pin (ipin) to an output pin (opin), representing an abstract combinational entity; the edge is annotated with the combinational delay.

(b) Sub-structure with input pin (ipin), clock pin (cpin), source pin (src), sink pin (snk), and output pin (opin), representing a register entity. The edge from cpin to snk is annotated with the hold and setup delay values, and the edge from src to opin is annotated with the clock-to-Q delay value.

(c) An edge pointing from an opin to an ipin, representing an abstract interconnect entity; the edge is annotated with the interconnect delay.

Figure 3.5: Three categories of subgraph composing a full timing graph in Tatum, converted from their corresponding MRRG counterparts.
Figure 3.6: Timing graph representing the mapped MRRG from Figure 3.4.
(a) Mapped MRRG (b) Timing graph
Figure 3.7: Mapped MRRG and timing graph of the "sum" benchmark put side-by-side, showing the difference in graph complexity at a larger scale.
Another example is Figure 3.7, which illustrates the mapped MRRG of the sum
benchmark on the ADRES architecture, and the corresponding timing graph in Tatum.
Observe that the timing graph is generally larger than the mapped MRRG, because nodes
in the timing graph represent pins, whereas nodes in the MRRG are more coarse-grained.
Note that Tatum does not enforce the granularity of the modelled entities, meaning that it is entirely up to the user to decide how detailed the modelled timing behaviour is. CGRAs, in general, can be modelled with coarse granularity, since they are configured at the bus level. Conversely, because FPGAs are configurable down to a single bit, much finer granularity is required, which greatly enlarges and complicates the timing graph. The timing
graph created by CGRA-ME takes advantage of the coarse granularity, making primitive
modules the most detailed unit to model. As a result, STA in CGRA-ME is very fast
and computationally inexpensive.
3.5 Interconnect Delay Estimation
The characterization database contains delays for each type of primitive. However, inter-
connect delays are a growing contributor to total delay in deep-submicron VLSI technol-
ogy. Such delays are not accounted for in the database, and in Chapter 4, we show that
ignoring interconnect delay leads to poor accuracy. As our aim is to provide a high-level
performance estimation capability, we must estimate interconnect delay prior to place-
ment and routing (i.e. without detailed knowledge of wirelength, capacitance, resistance).
Therefore, to improve the delay-modelling capabilities of CGRA-ME, we constructed a
simple fanout-based interconnect delay estimation model. Fanout is widely used as a
proxy for interconnect delay estimation [26,40].
Figure 3.8: Standard-cell layout for fanout delay scaling analysis on op_and with 16 fanout registers, with the upper, middle, and lower rats' nests representing fanin registers, the primitive module, and fanout registers, respectively.
Specifically, for each type of primitive, we attached the outputs of the primitive to
various numbers of fanout registers. We then performed a full synthesis, placement, and
routing into standard-cells, followed by STA with Synopsys PrimeTime. We extracted
the delay from the outputs of the primitive to the fanout registers. This allowed us to
create a model associating primitive fanout with delay. Figure 3.8 shows the primitive
module op_and at the center of the chip, while the fanout registers are at the bottom.
Figure 3.9: Averaged fanout-delay of all primitive modules.
Figure 3.9 plots the relationship between fanout register count and average fanout
delay of all primitives, showing a line of best fit. As shown, the interconnect delay is
roughly linear with the fanout. Using the MRRG fanout and the linear fanout-delay
model, we can optionally annotate estimated interconnect delays onto Tatum’s timing
graph. In Chapter 4, we will use 8 benchmarks and 3 architectures to verify the accuracy
of modelling performance incorporating the fanout-based delay model.
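A linear fanout-delay model of this kind can be fit with ordinary least squares. The (fanout, delay) samples below are invented stand-ins for the per-primitive characterization data, chosen to lie on an exact line so the fit is easy to verify:

```python
# Least-squares fit of delay = a * fanout + b, the linear fanout-delay model
# of Section 3.5. Sample points are illustrative, not measured data.

samples = [(1, 0.05), (2, 0.08), (4, 0.14), (8, 0.26), (16, 0.50)]

n = len(samples)
sx = sum(f for f, _ in samples)
sy = sum(d for _, d in samples)
sxx = sum(f * f for f, _ in samples)
sxy = sum(f * d for f, d in samples)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope: ns per unit fanout
b = (sy - a * sx) / n                            # intercept: ns

def fanout_delay(fanout):
    """Estimated interconnect delay (ns) for a driver with the given fanout."""
    return a * fanout + b

print(round(fanout_delay(3), 3))  # 0.11 (samples lie on delay = 0.03*f + 0.02)
```

Given such a model, an MRRG node with fanout 3 would be annotated with roughly `fanout_delay(3)` of interconnect delay on its outgoing timing-graph edge.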
Chapter 4
Experimental Studies
In this chapter, we experimentally assess the area and performance estimation models
described in the previous chapter. Section 4.1 overviews three architecture variants used
in the following sections. Section 4.2 overviews eight benchmarks used in this study.
Section 4.3 presents a comparison between the full VLSI CAD area/performance versus
the estimates. Section 4.4 presents estimation results for two variants of the same archi-
tecture, validating the ability to use the estimates for architecture-specific DSE studies.
Section 4.5 presents results for applying the estimators to two entirely different architec-
tures, thereby assessing the architecture-to-architecture comparison capability.
4.1 Target Architectures
Figures 4.1a and 4.1b show two variants of ADRES used in this study, referred to as
ADRES with orthogonal interconnect (ADRES-O) and ADRES with diagonal intercon-
nect (ADRES-D), respectively. From top to bottom, the architecture is equipped
with a row of I/O ports, a wider Data Register File (DRF) shared by the first row of
PEs, a 4×4 grid of PEs, a smaller Register File (RF) coupled with each PE (excluding
the first row), and Memory Interface Ports (MPs) each connecting to a row of PEs. On
a side note, the MPs were not used in the original proposal of the ADRES architecture;
Chapter 4. Experimental Studies 40
(a) ADRES-O – Orthogonal interconnect. (b) ADRES-D – Diagonal interconnect.
(c) High-level view of the HyCUBE-like architecture used in the experimental studies of Section 4.5.

Figure 4.1: High-level view of the ADRES-like architectures and HyCUBE architecture used in the experimental studies.
however, in CGRA-ME, load and store operations are serviced by memory units, and
mapped to MPs, so in order to map benchmarks with memory operations, we decided
to add MPs to the architecture1. Each PE consumes and provides data to the nearest
orthogonal or diagonal neighbor PEs, highlighted in black and green, respectively. PEs
on edges are also connected by toroidal buses, highlighted in red and blue, representing
vertical and horizontal toroidal connections, respectively. Each PE can also perform a
bypass through a multiplexer, allowing data routing at the cost of an FU.
1The area/performance of the MPs is not modelled.
Figure 4.1c depicts the HyCUBE architecture with 4×4 PEs. The composition of the PEs was detailed in Figure 2.7. Recall that the Crossbar Switch (CBS) in HyCUBE allows data to be routed combinationally throughout the array (without requiring that FUs be used as route-throughs). With the CBS, the architecture achieves multi-hop and multi-cast capability, at the cost of silicon area and delay overhead due to the extra
multiplexers in bypassable registers and the CBS. The leftmost column of the architecture
is also connected to MPs, for data load and store. In contrast with ADRES, HyCUBE
does not contain any RFs.
4.2 Target Benchmarks
When modelling architecture performance, it is not meaningful to perform STA on an
entire architecture, since the longest bypass/multi-hop chain will become the critical path.
Instead, much like timing analysis for FPGAs, we will map a set of benchmarks on all
target architectures. Timing analysis will be constrained and performed on mapped/used
resources only. The mapping results will be used to generate constraints for PrimeTime
STA to create baseline results, and will also be used to construct the Tatum timing graph
to produce estimation results. The 8 benchmarks selected are listed below:
1) conv2 : Computes the dot product of 2 elements of array "a" with 2 constants, and stores the result into array "b"; 16 DFG nodes

2) conv3 : Computes the dot product of 3 elements of array "a" with 3 constants, and stores the result into array "b"; 24 DFG nodes
3) mac : Computes the sum-of-product of 2 arrays; 11 DFG nodes
4) mults1 : Computes and accumulates the dot product of 4 elements of array “a”
with 4 constants; 31 DFG nodes
(a) conv2 benchmark. (b) conv3 benchmark.
(c) mac benchmark. (d) mults1 benchmark.
(e) nomem1 benchmark. (f) simple benchmark.
(g) simple2 benchmark. (h) sum benchmark.
Figure 4.2: Illustration of the DFGs of the 8 benchmarks used in the experimental studies.
5) nomem1 : Computes an arithmetic series without using any memory units; 6 DFG
nodes
6) simple : Computes and stores element-wise additions of array “a” and “b”; 12
DFG nodes
7) simple2: Computes and stores element-wise multiplications of array “a” and “b”;
12 DFG nodes
8) sum : Computes sum of all elements of array “a”; 7 DFG nodes
The benchmarks employ different computation resources and data-routing patterns.
Figure 4.2 depicts the DFGs of the 8 benchmarks.
There is an ongoing effort to improve the mapping runtime of CGRA-ME, but currently, mapping even the above mini benchmarks, averaging around 15 DFG nodes, is time consuming (around an hour for ADRES, and three hours for HyCUBE), since the mapper currently attempts to simultaneously satisfy constraints for scheduling, placement, and routing. Mapping real-world applications, which can easily contain more than 100 DFG nodes, is a direction being actively pursued within the group.
4.3 ADRES-O Full VLSI Implementation Versus CGRA-ME Estimation
To assess the area and performance estimation within CGRA-ME, the full architecture
Verilog RTL of ADRES-O generated by CGRA-ME was realized in standard cells using
the same set of technology/tools used to profile the primitive modules (c.f. Chapter 3).
Figure 4.3 shows standard-cell layouts of ADRES-O for both the area-optimized and
delay-optimized targets, side by side, in the same scale.
Figure 4.3: Standard-cell PnR architecture layout of ADRES – area-optimized (left) vs. delay-optimized (right), side-by-side on the same scale.

In these layouts, only the immediate submodules under the top module are floorplanned. Cells within these submodules are free to be placed anywhere within the floorplanned rectangle. This implies that different instances of the same top-level submodule will have different internal placements and routings, and therefore, we expect slightly different delays.
                   Area-optimized   Delay-optimized
Baseline [µm²]        125945.3         151203.0
CGRA-ME  [µm²]        116415.0         140854.0
Error                 -7.57%           -6.84%

Table 4.1: ADRES-O: Total core area of area-optimized and delay-optimized designs, as well as estimation by CGRA-ME.
Table 4.1 shows the total chip area of both targets in the row labelled Baseline.
Table 4.2 summarizes the critical path delays of the benchmarks. Note that the MPs are
not included in the performance/area results, as these are typically proprietary IPs and
our focus is on the CGRA aspects of area/performance.
4.3.1 CGRA-ME Estimation Results
We first examine area estimation, where the estimates are shown in the CGRA-ME row
of Table 4.1. Comparing the two rows of the table, we see close alignment between the estimates and the actual area values for the entire CGRA. The estimates are roughly 7% lower than the actual layout values. The results confirm that aggregating primitive module areas provides a good estimate of overall CGRA area. The gap between our tool and the actual areas is anticipated, since our tool does not account for the area contributed by configuration cells. We do not take configuration cells into account in our estimates because their detailed implementation may vary depending on the CGRA (e.g. SRAM cells or flip-flops). However, the number of configuration bits can be estimated to further improve the result. For the baseline results, configuration cells are implemented as flip-flops, connected in a scan chain.

             Area-optimized [ns]   Delay-optimized [ns]
conv2               4.37                  2.77
conv3               4.46                  3.20
mac                 4.51                  2.92
mults1              4.51                  2.77
nomem1              4.33                  2.72
simple              3.24                  2.71
simple2             3.24                  2.71
sum                 4.13                  2.96

Table 4.2: ADRES-O: Critical path delay of benchmarks for area-optimized and delay-optimized targets from PrimeTime STA.
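As noted above, configuration cells are excluded from the estimate, but the number of configuration bits can itself be estimated, for example from multiplexer select widths. The sketch below is a back-of-the-envelope illustration; the PE structure, opcode width, and context count are invented assumptions, not CGRA-ME's configuration model:

```python
# Rough configuration-bit count per tile: ceil(log2(inputs)) select bits per
# mux, plus opcode bits per FU, replicated once per context.

from math import ceil, log2

def mux_config_bits(num_inputs):
    """Select bits needed by a multiplexer with the given input count."""
    return ceil(log2(num_inputs)) if num_inputs > 1 else 0

def estimate_config_bits(muxes, fu_opcode_bits, contexts):
    """muxes: list of mux input counts for one tile (illustrative model)."""
    per_context = sum(mux_config_bits(n) for n in muxes) + fu_opcode_bits
    return per_context * contexts

# e.g. a hypothetical PE with two 4:1 input muxes, one 2:1 bypass mux, a
# 4-bit FU opcode, replicated over 3 contexts:
bits = estimate_config_bits([4, 4, 2], fu_opcode_bits=4, contexts=3)
print(bits)  # (2 + 2 + 1 + 4) * 3 = 27
```

Multiplying such a bit count by a per-cell area (flip-flop or SRAM, depending on the CGRA) would close part of the observed gap between estimated and actual area.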
Turning now to the performance results, Figures 4.4a and 4.4b show the critical
path delays of the 8 benchmarks, under area-optimized and delay-optimized objectives,
respectively. The blue bars show the actual post-routing critical path delays; the red
bars represent the estimates provided by our model. For the results from CGRA-ME, we
did not take interconnect delay into account – interconnect delays are zeroed. Observe
that the estimated critical path delays are almost always optimistic (i.e. the estimated
delays are smaller than the actual delays). The average error in the critical-path delay
estimation is 1.25ns, or 43%. Moreover, apart from this error, the critical path reported
by the estimator differed from the actual critical path in many cases. Out of the 16 cases (8 from each of the area-optimized and delay-optimized implementations), the critical path reported by the estimator was the same as the actual critical path in 7 of the benchmarks. That is, in 9 of the cases, the wrong critical path was reported by the estimator. As a concrete example, we depict the critical paths reported by PrimeTime and CGRA-ME in Figure 4.5.

(a) Area-optimized target.

(b) Delay-optimized target.

Figure 4.4: Critical path delay comparison – CGRA-ME estimations without interconnect delays vs. PrimeTime with interconnect delays.
When the mults1 benchmark was mapped onto the ADRES-O architecture in the delay-optimized scenario, we observed that the critical paths reported by the estimator and by PrimeTime are the same. However, the critical path delays reported are 2.78ns and 1.52ns, from PrimeTime and CGRA-ME, respectively. In Table 4.3, we show the detailed critical path delays for PrimeTime and CGRA-ME STA.

(a) Critical path reported by PrimeTime is highlighted in green.

(b) Critical path reported by CGRA-ME is highlighted in red.

Figure 4.5: After mapping benchmark conv2, we produced the partial MRRG representing the used portion of the hardware. Without interconnect delay taken into account, Synopsys PrimeTime and CGRA-ME report different critical paths.

                              PrimeTime              CGRA-ME
Node                       Incr [ns]  Path [ns]   Incr [ns]  Path [ns]
rf_c1_r2.Q                   0.100      0.100       0.100      0.100
pe_c1_r2.mux_out.in          0.100      0.200       0.000      0.100
pe_c1_r2.mux_out.out         0.050      0.240       0.070      0.170
pe_c0_r2.mux_bypass.in       0.120      0.360       0.000      0.170
pe_c0_r2.mux_bypass.out      0.070      0.430       0.110      0.280
pe_c0_r2.mux_out.in          0.040      0.470       0.000      0.280
pe_c0_r2.mux_out.out         0.040      0.510       0.070      0.350
pe_c3_r2.mux_b.in            0.120      0.630       0.000      0.350
pe_c3_r2.mux_b.out           0.130      0.760       0.070      0.420
pe_c3_r2.func.op_mult.in     0.820      1.580       0.000      0.420
pe_c3_r2.func.op_mult.out    1.130      2.710       1.100      1.520
rf_c3_r2.D                   0.060      2.780       0.000      1.520

Table 4.3: The critical path delay report of the mults1 benchmark, mapped onto ADRES-O in the delay-optimized scenario, generated by PrimeTime and CGRA-ME Tatum (without accounting for interconnect delay).

The "Node" column represents
the pins in the design. The “Incr” column represents the incremental delay from one node
above. The “Path” column represents the cumulative critical path delay from the starting
node. In this report, any incremental delay associated with a .out node represents the total delay (combinational cell + metal wire) within a primitive module. Any incremental delay associated with a .in node represents the total interconnect delay (metal wire) between two primitive modules. For PrimeTime, we observe that of the total critical path delay of 2.78ns, 1.26ns (around 45%) is contributed by interconnect delays. If we added this 1.26ns of interconnect delay to the 1.52ns critical path delay reported by CGRA-ME, our estimation would be quite accurate. Similar results were observed for the
other benchmarks – interconnect delay is the main source of estimator inaccuracy. Hence,
it is important that our estimation integrates an interconnect delay model.
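The mults1 figures above make the point concrete; the arithmetic can be checked directly from the numbers quoted in the text:

```python
# Interconnect contribution for the mults1 / delay-optimized example:
# PrimeTime's total critical path is 2.78 ns, while CGRA-ME's component-only
# estimate is 1.52 ns, so the unmodelled remainder is interconnect delay.

primetime_total = 2.78          # ns, PrimeTime critical-path delay
cgrame_no_interconnect = 1.52   # ns, CGRA-ME estimate with interconnect zeroed
interconnect = primetime_total - cgrame_no_interconnect

print(round(interconnect, 2))                     # 1.26 ns
print(round(interconnect / primetime_total, 2))   # 0.45 -> ~45% of the path
```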
Figures 4.6a and 4.6b show the revised results when fanout-based delay estimation (cf. Section 3.5) is incorporated. In this case, we use the fanouts in the CGRA device model, the architecture MRRG, as input to the estimation model described in Section 3.5.
Observe that the estimation error is improved relative to the results shown in Figure 4.4.
(a) Area-optimized target.
(b) Delay-optimized target.
Figure 4.6: Critical path delay comparison – CGRA-ME estimates with MRRG fanout-inferred interconnect delay versus PrimeTime.
On average, the error is now 0.73ns, or 21%. After this improvement, out of the 16 benchmarks, 8 reported critical paths matched those reported by PrimeTime. Of the remaining 8, in 6 cases Tatum did report the correct path among its top-four critical paths, and these top-four paths are all within a 0.5ns difference.
Further analysis of the results incorporating interconnect delay estimation showed
that a significant source of error was for signals on the outputs of multiplexers driving the
FUs. Table 4.4 shows the delays on the critical path of the mults1 benchmark, where the estimates incorporate fanout-based interconnect delay estimation (using MRRG fanouts).
The interconnect delays at node "pe_c3_r2.func.op_mult.in" are 0.82ns and 0.102ns, from PrimeTime and CGRA-ME, respectively. From the MRRG, the fanout count from "pe_c3_r2.mux_b.out" is 3; however, PrimeTime reported the maximum fanout across all wires in this bus as 149. The MRRG does not capture FU implementation details, such as the available operations in an FU and the implementation of each operation, resulting in this difference. That is, the fanout of gates in the standard-cell implementation may
not align with the apparent bus-fanout in the MRRG.
                              PrimeTime              CGRA-ME
Node                       Incr [ns]  Path [ns]   Incr [ns]  Path [ns]
rf_c1_r2.Q                   0.100      0.100       0.100      0.100
pe_c1_r2.mux_out.in          0.100      0.200       0.096      0.196
pe_c1_r2.mux_out.out         0.050      0.240       0.070      0.266
pe_c0_r2.mux_bypass.in       0.120      0.360       0.170      0.436
pe_c0_r2.mux_bypass.out      0.070      0.430       0.110      0.546
pe_c0_r2.mux_out.in          0.040      0.470       0.090      0.636
pe_c0_r2.mux_out.out         0.040      0.510       0.070      0.706
pe_c3_r2.mux_b.in            0.120      0.630       0.170      0.876
pe_c3_r2.mux_b.out           0.130      0.760       0.070      0.946
pe_c3_r2.func.op_mult.in     0.820      1.580       0.102      1.048
pe_c3_r2.func.op_mult.out    1.130      2.710       1.100      2.148
rf_c3_r2.D                   0.060      2.780       0.096      2.244

Table 4.4: The critical path delay report of the mults1 benchmark, mapped onto ADRES-O in the delay-optimized scenario, generated by PrimeTime and CGRA-ME Tatum, now with interconnect delays modelled based on MRRG node fanouts.
The architecture model, as represented by an MRRG, is not able to capture gate-level
details. The FUs are able to perform many different logical and arithmetic operations,
leading to high fanout (at the gate level) for signals entering these units. In light of this
fanout gap, we added the capability for a user to override the MRRG fanout, for specific
CGRA instances, if desired. Fortunately, such fanout numbers can still be acquired with
the synthesis of primitive modules only, and this does not conflict with the primitive
characterization flow as described in Figure 3.2 in Section 3.2.
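The override mechanism can be sketched as a per-node exception to the linear fanout model. The function name is illustrative, and the linear coefficients below are back-solved from the two delays quoted in the text (0.102ns at fanout 3, 0.934ns at fanout 149); they are not CGRA-ME's actual model parameters:

```python
# Interconnect delay is normally inferred from MRRG fanout via the linear
# model; a per-node override substitutes the synthesis-reported gate-level
# fanout for nodes where the MRRG bus-fanout is known to diverge (e.g.
# multiplexers driving FUs).

A, B = 0.0057, 0.0849  # ns per fanout and intercept (illustrative values)

def interconnect_delay(node, mrrg_fanout, fanout_overrides=None):
    """Estimated interconnect delay (ns) for the net driven by `node`."""
    fanout = (fanout_overrides or {}).get(node, mrrg_fanout)
    return A * fanout + B

node = "pe_c3_r2.mux_b.out"
print(round(interconnect_delay(node, 3), 3))               # 0.102
print(round(interconnect_delay(node, 3, {node: 149}), 3))  # 0.934
```

The override leaves the rest of the flow untouched: only the fanout fed into the linear model changes, so the same characterization data serves both cases.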
With the fanouts of the multiplexers driving FUs overridden, the performance results are shown in Figures 4.7a and 4.7b.

(a) Area-optimized target

(b) Delay-optimized target

Figure 4.7: Critical path delay comparison – CGRA-ME estimations with selected fanout counts overridden vs. PrimeTime.

The average error reduces to 0.33ns, or 9.6%, relative to the baseline. Moreover, after the override, the correctness of the critical paths reported
also improved. Out of the 16 benchmarks, 12 had identical critical paths reported by
PrimeTime and Tatum. In the remaining 4 cases, the critical path reported by PrimeTime
was one of the top-four paths reported by Tatum.
We also revisit the same test case from Tables 4.3 and 4.4, but this time with the fanout numbers of the multiplexers driving FUs overridden. In Table 4.5, the interconnect delays at node "pe_c3_r2.func.op_mult.in" are 0.82ns and 0.934ns, from PrimeTime and CGRA-ME, respectively, which are very close.
                              PrimeTime              CGRA-ME
Node                       Incr [ns]  Path [ns]   Incr [ns]  Path [ns]
rf_c1_r2.Q                   0.100      0.100       0.100      0.100
pe_c1_r2.mux_out.in          0.100      0.200       0.096      0.196
pe_c1_r2.mux_out.out         0.050      0.240       0.070      0.266
pe_c0_r2.mux_bypass.in       0.120      0.360       0.170      0.436
pe_c0_r2.mux_bypass.out      0.070      0.430       0.110      0.546
pe_c0_r2.mux_out.in          0.040      0.470       0.090      0.636
pe_c0_r2.mux_out.out         0.040      0.510       0.070      0.706
pe_c3_r2.mux_b.in            0.120      0.630       0.170      0.876
pe_c3_r2.mux_b.out           0.130      0.760       0.070      0.946
pe_c3_r2.func.op_mult.in     0.820      1.580       0.934      1.880
pe_c3_r2.func.op_mult.out    1.130      2.710       1.100      2.980
rf_c3_r2.D                   0.060      2.780       0.096      3.076

Table 4.5: The critical path delay report of the mults1 benchmark, mapped onto ADRES-O in the delay-optimized scenario, generated by PrimeTime and CGRA-ME Tatum, now with interconnect delays modelled based on MRRG node fanouts, and overridden for the multiplexers driving the FUs.
                 Without Interconnect Delay   MRRG-Based Fanout #     Overridden Fanout #
Target           Area-Opt.    Delay-Opt.      Area-Opt.  Delay-Opt.   Area-Opt.  Delay-Opt.
conv2               ✗            ◦               ✗           ✓            ✓           ✓
conv3               ✗            ◦               ✗           ◦            ✓           ◦
mac                 ◦            ✓               ◦           ✓            ✓           ✓
mults1              ◦            ✓               ◦           ✓            ✓           ✓
nomem1              ✓            ✓               ✓           ✓            ✓           ◦
simple              ✓            ◦               ✓           ◦            ✓           ◦
simple2             ✓            ◦               ✓           ◦            ✓           ◦
sum                 ◦            ✓               ◦           ✓            ✓           ✓

Table 4.6: Correctness of reported critical paths: 1) ✓ – Tatum reported the same path as PrimeTime; 2) ✗ – the path from PrimeTime is not reported by Tatum; 3) ◦ – the path from PrimeTime is reported as one of the top-four critical paths from Tatum.

Table 4.6 summarizes the correctness of the reported critical path for all three methods. ✗'s in the table indicate that the critical path reported by PrimeTime (actual) is not reported by Tatum (estimator), which is undesirable. ◦'s indicate the PrimeTime-reported path is one of the top four paths reported by Tatum, which is not ideal but acceptable. ✓'s represent the same path being reported by both PrimeTime and Tatum, which is desirable. Observe that, when fanouts are taken into account to estimate interconnect delay, there is greater alignment between the paths reported by PrimeTime and Tatum (more ✓'s appear in the centre and right-most columns of the table), and when the corner-case gate-level details are also supplied, the estimation performs even better.
In this section, we have shown that our estimation engine can accurately estimate
area and performance, both within approximately 10% error. However, we have com-
pared our estimation against standard-cell CAD tools for only one architecture under
two implementation objectives: area and performance. In the following sections, we explore whether the estimation engine is applicable in a wider scope, by applying it to an architectural variant, and also to an entirely different architecture.
4.4 ADRES Architecture with Added Diagonal Connectivity
It is desirable that the area/performance estimation can be used for architecture DSE. As a step towards assessing this ability, we use a modified version of ADRES-O called ADRES-D, which contains additional diagonal interconnect connectivity between PEs. In terms of implementation modifications compared with ADRES-O, the PEs of ADRES-D are equipped with larger input multiplexers, and have a larger set of fanouts from the output multiplexer. The larger multiplexers also imply that the architecture may require more configuration cells, due to the wider select inputs. The differences are illustrated
in Figure 4.8.
Here, we highlight some expected effects from these modifications, based on intuition:
1) Area: intuitively, the larger multiplexers and additional configuration cells should
result in increased area usage.
2) Benchmark performance: it is difficult to guess whether the change will result in
better or worse performance. While ADRES-D has more flexible interconnect, the tiles are also expected to be larger, leading to higher interconnect capacitances. Likewise, the multiplexers are larger, leading to higher combinational delay.

3) Architecture flexibility: the modification provides more flexible interconnect, effectively using fewer PEs solely for bypass.

4) Mapper runtime: since the architecture is more flexible, the MRRG representing the architecture is also larger, potentially requiring a longer runtime for the mapper. On the other hand, the additional flexibility may allow the ILP-based mapper to find solutions more easily.

(a) A PE in the ADRES-O architecture. (b) A PE in the ADRES-D architecture, with modifications highlighted in red.

Figure 4.8: PEs from ADRES-O and ADRES-D put side-by-side for comparison.
4.4.1 Area
Before comparing ADRES-O and ADRES-D, we need to first confirm that the estima-
tion engine can retain its accuracy when compared to a standard-cell implementation.
We realized the ADRES-D architecture with standard-cells, and re-ran CGRA-ME area
estimation using the same primitive module characterization databases. In Table 4.7, we
observe that the error in area estimation is still within 7%. As anticipated, our area-estimation technique is still reasonably accurate when applied to an architecture with moderate modifications.
                   Area-optimized   Delay-optimized
VLSI CAD [µm²]        128317.4         158140.2
CGRA-ME  [µm²]        120057.0         149088.0
Error                 -6.4%            -5.7%

Table 4.7: ADRES-D: Total core area of area-optimized and delay-optimized designs, as well as estimation by CGRA-ME.
To verify the area-estimation fidelity, we compare the difference in area between ADRES-O and ADRES-D for both the standard-cell implementations and the estimates. For
standard-cell implementations, with the additional interconnect, we observe a 3877.3µm2
(3.1%) and 7787.9µm2 (5.2%) increase in core area, for area-optimized and delay-optimized
variants, respectively. On the other hand, CGRA-ME estimation shows 3890.0µm2 (3.3%)
and 8150.0µm2 (5.8%) increases in core area for the area- and delay-optimized variants,
respectively. As we can see, the area differences arising from the additional diagonal interconnect connections are faithfully represented by the estimation engine, when compared with the results from the standard-cell flow.
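As a quick sanity check, the percentage increases quoted above can be recomputed from the ADRES-O core areas in Table 4.1 (all values in µm², taken directly from the text):

```python
# Each ADRES-D minus ADRES-O area increase is expressed relative to the
# corresponding ADRES-O baseline.

adres_o = {
    ("VLSI CAD", "area-opt"): 125945.3,
    ("VLSI CAD", "delay-opt"): 151203.0,
    ("CGRA-ME", "area-opt"): 116415.0,
    ("CGRA-ME", "delay-opt"): 140854.0,
}
increase = {
    ("VLSI CAD", "area-opt"): 3877.3,
    ("VLSI CAD", "delay-opt"): 7787.9,
    ("CGRA-ME", "area-opt"): 3890.0,
    ("CGRA-ME", "delay-opt"): 8150.0,
}

pct = {k: round(100.0 * increase[k] / adres_o[k], 1) for k in adres_o}
for k, v in pct.items():
    print(k, v)  # 3.1, 5.2, 3.3, 5.8 -- matching the percentages in the text
```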
4.4.2 Performance
For performance, we must perform mapping again for all benchmarks, to take advan-
tage of the new diagonal connectivity. From the new mapping results, we created the
corresponding STA constraints, and generated timing results for both area-optimized
and delay-optimized targets. Figure B.1 shows the critical path delay of the same 8
benchmarks on the ADRES-D architecture. Table 4.8 shows the correctness of the estimated critical paths, versus those reported by PrimeTime for the standard-cell implementation. We observe that the accuracy and correctness of the performance estimation still hold.
Target       Area-Optimized   Delay-Optimized
conv2              ✓                 ✓
conv3              ✓                 ✓
mac                ✓                 ✓
mults1             ✓                 ◦
nomem1             ✓                 ✓
simple             ✓                 ◦
simple2            ✓                 ◦
sum                ✓                 ✓

Table 4.8: Correctness of reported critical paths on the ADRES-D architecture: 1) ✓ – Tatum reported the same path as PrimeTime; 2) ✗ – the path from PrimeTime is not reported by Tatum; 3) ◦ – the path from PrimeTime is reported as one of the top-four critical paths from Tatum.

We now move on to compare the performance of ADRES-O with ADRES-D. We first compare the performance of the standard-cell implementations (STA with Synopsys PrimeTime). While having a different mapping per benchmark, we observe that,
on average, ADRES-O and ADRES-D have 4.10ns and 4.41ns critical path delays for
the area-optimized implementations, respectively. The analogous delays are 2.85ns and
3.14ns in the delay-optimized implementations. That is, in the area-optimized imple-
mentation, ADRES-D has a 0.31ns larger average critical path delay versus ADRES-O.
In delay-optimized implementation, ADRES-D has a 0.29ns larger average critical path
delay versus ADRES-O.
Turning now to the estimator, the estimator reports 0.4ns and 0.52ns larger average
critical path delays for area-optimized and delay-optimized variants, respectively. These
results match closely with the standard-cell results from PrimeTime. As mentioned ear-
lier, having diagonal interconnect could result in fewer interconnect “hops” per routed
connection. When we investigated the mapping results in detail, we observed that only
2 of the 8 mapped benchmarks used the diagonal interconnect. Additionally, even when
the diagonal interconnect was used, we observed that the number of primitive modules the
critical path spans was similar on average in ADRES-D and ADRES-O. This sug-
gests that the additional combinational delay from the larger multiplexers in ADRES-D
is the dominant contributor to the change in critical path delay. While these 8 benchmarks suggest that ADRES-O outperforms ADRES-D, it is important to note that these
benchmarks do not impose heavy routing congestion (refer to Figure 4.2), which implies
that a larger benchmark requiring most of the functional resources may yield a different
conclusion. It is worth mentioning that mapping the 8 benchmarks takes similar runtime
on ADRES-D (148.4s on average) and ADRES-O (152.4s on average); five benchmarks
map faster on ADRES-D. The added flexibility in ADRES-D
appears to be neutral to mapping runtime. With these results, we have demonstrated
that, at least for a modest architecture modification to ADRES-O, the estimation engine
is able to predict the area and performance consequences of the change with reasonable
accuracy.
4.5 ADRES Architecture versus HyCUBE Architecture
The previous section considered whether the estimation engine can assess an architec-
ture modification with good fidelity. We now consider the case of two entirely different
architectures: ADRES-O and HyCUBE.
4.5.1 Architectural Differences: ADRES-O versus HyCUBE
Both ADRES and HyCUBE are generic CGRA architectures – the key differences be-
tween them are the interconnect and the locations of registers. In Figure 4.9, we compare
the PEs from the two architectures.
ADRES employs many RFs: one wide VLIW-shared RF at the top row, and multiple
smaller RFs, each associated with one PE. This provides mapping flexibility by allowing
data to be stored and retrieved an arbitrary number of clock cycles later. HyCUBE,
on the other hand, has individual registers, but no register files. As well, the CBS in
HyCUBE has bypassable input registers allowing data routing to any PE on the array within the same cycle.

(a) A PE in the ADRES-O architecture, with parts unique from the HyCUBE PE highlighted in blue. (b) A PE in the HyCUBE architecture, with parts unique from the ADRES-O PE highlighted in red.
Figure 4.9: PEs from ADRES-O and HyCUBE.

Additionally, the CBS allows a signal to route through a PE
without using its FU. The FU can be used for computation and its output routed through
the same CBS (as long as the CBS is not fully congested).
4.5.2 Area
We implemented a 4×4 HyCUBE architecture using the standard-cell flow, and also
estimated its area using the same set of characterization databases. From Table 4.9, we
again observe that our approach remains fairly accurate: 10-11% estimation error.
                  Area-optimized    Delay-optimized
VLSI CAD [µm²]    153707.1          187666.5
CGRA-ME [µm²]     136880.0          169228.0
Error             -11.0%            -9.8%

Table 4.9: HyCUBE: Total area of area-optimized and delay-optimized variants, as well as estimation by CGRA-ME.
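The error rows in Table 4.9 follow the usual signed relative-error definition and can be checked directly against the table's totals (the small helper below is ours, not part of CGRA-ME):

```python
def relative_error(estimated, reference):
    """Signed percent error of the CGRA-ME estimate versus the standard-cell reference."""
    return 100.0 * (estimated - reference) / reference

# HyCUBE totals from Table 4.9 (um^2):
area_opt = relative_error(136880.0, 153707.1)
delay_opt = relative_error(169228.0, 187666.5)
print(f"{area_opt:.1f}%, {delay_opt:.1f}%")  # close to the reported -11.0% and -9.8%
```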
For ADRES-O, the standard-cell area was 128317.4µm2 and 158140.2µm2, for area-
optimized and delay-optimized variants, respectively. For HyCUBE, the analogous standard-
cell areas are 153707.1µm2 and 187666.5µm2. The HyCUBE architecture requires 19.8%
and 18.7% more area than ADRES-O, for area-optimized and delay-optimized standard-
cell variants, respectively. Using CGRA-ME, the HyCUBE architecture was estimated
to require 17.8% and 20.1% more area than ADRES-O, for area- and delay-optimized
variants, respectively. This closely matches the increases observed with the standard-cell
variants.
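The percentages above are simple ratios of the quoted standard-cell totals and can be reproduced directly (the helper name is ours):

```python
def extra_area_pct(hycube_area, adres_area):
    """Percent additional area HyCUBE requires over ADRES-O."""
    return 100.0 * (hycube_area / adres_area - 1.0)

# Standard-cell totals (um^2) quoted in the text:
print(round(extra_area_pct(153707.1, 128317.4), 1))  # area-optimized -> 19.8
print(round(extra_area_pct(187666.5, 158140.2), 1))  # delay-optimized -> 18.7
```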
4.5.3 Performance
For performance, the same 8 benchmarks were mapped onto the 4×4 HyCUBE, and we created the corresponding constraints for PrimeTime to perform per-benchmark standard-
cell STA for both optimization targets. We have also modelled the performance within
CGRA-ME. In Table 4.10 and Figure B.2, notice that results for benchmark conv3 are
left empty, since the benchmark is unmappable on HyCUBE due to insufficient MPs.
From Table 4.10, we see that the critical paths are still mostly correct. However, from
Figure B.2, we see that there is an average of 17.2% error in the estimation. With some
investigation, we observed that critical paths in HyCUBE spanned more primitive modules,
and that the multiplexer primitive modules are more widely used in HyCUBE when
compared against ADRES. However, for all sizes of multiplexers, we estimate their in-
terconnect delay using the average linear model presented in Figure 3.9. We believe that
error accumulates in chains of multiplexers on HyCUBE critical paths. We could remedy
this by characterizing the interconnect delay per primitive module instead of using an
overall average; this is left as future work.
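To see why such error compounds, consider a toy model in which one global average stands in for per-size multiplexer delays. All delay numbers here are hypothetical, chosen only to illustrate the trend, and are not measured characterization data:

```python
# Hypothetical per-primitive interconnect delays (ns) versus one global average.
true_delay = {"mux2": 0.05, "mux4": 0.08, "mux8": 0.12}
avg_delay = sum(true_delay.values()) / len(true_delay)  # single average model

def path_error(mux_chain):
    """Estimation error when the average model replaces per-primitive delays."""
    true = sum(true_delay[m] for m in mux_chain)
    est = avg_delay * len(mux_chain)
    return est - true

# Short ADRES-like path vs a longer HyCUBE-like chain of wide muxes:
print(round(path_error(["mux4"]), 3))      # -> 0.003
print(round(path_error(["mux8"] * 5), 3))  # -> -0.183, error grows with chain length
```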
We now compare the performance of HyCUBE to ADRES-O for standard-cell implementations. The average critical path delays of HyCUBE are 3.59ns and 2.49ns, for the area-optimized and delay-optimized variants, respectively; these are 0.51ns and 0.36ns shorter than those of ADRES-O.
However, estimation from CGRA-ME suggests 0.64ns and 0.11ns longer average critical
path delays for HyCUBE versus ADRES-O, for area- and delay-optimized variants, respectively. The difference again arises because of the weakness of the average interconnect delay model.

Target    Area-Optimized    Delay-Optimized
conv2     Top-4             Same
conv3     N/A               N/A
mac       Same              Same
mults1    Top-4             Top-4
nomem1    Same              Same
simple    Same              Missed
simple2   Same              Missed
sum       Same              Missed

Table 4.10: Correctness of reported critical paths on the HyCUBE architecture: 1) Same – Tatum reported the same path as PrimeTime; 2) Missed – the path from PrimeTime is not reported by Tatum; 3) Top-4 – the path from PrimeTime is reported as one of the top-four critical paths from Tatum.
While it is not the focus of this work, it is important to keep in mind that with the
built-in mapping algorithm of CGRA-ME, the HyCUBE architecture takes noticeably
longer to map compared to either ADRES-O or ADRES-D. For the 7 benchmarks that
mapped, HyCUBE took more than 10× longer to map when compared against ADRES-O and ADRES-D.
4.6 Summary
In this chapter, we compared the area and performance estimates with the results from
a full standard-cell implementation. We showed that area and performance of a CGRA
can be estimated with reasonable accuracy, and that benchmark-by-benchmark perfor-
mance estimates also reflect the actual standard-cell performance. Additionally, with
the comparison of ADRES-O and ADRES-D, we showcased that our estimation engine
offers fidelity for an architecture modification to the interconnect fabric. Finally, with the
comparison of ADRES-O and HyCUBE, we demonstrated that for completely different
CGRAs, our framework can quantify some of their specific advantages and disadvantages,
although the interconnect delay estimation lacks granularity. The integration of the estimation
engine into CGRA-ME provides a human architect with rapid area and performance
estimation, without requiring a complete standard-cell flow, layout extraction, and STA.
Chapter 5
Conclusion and Future Work
In this thesis, we extended the CGRA-ME framework with the capability to rapidly es-
timate the area and performance of the modelled CGRA, without undergoing a lengthy
ASIC design CAD flow. Our estimation approach is to model architecture-level charac-
teristics based on a characterization database for primitive modules. This is somewhat
analogous to the characterization data associated with standard cells in a fab-provided
library. Performance characterization data is used in conjunction with an open-source
timing analysis tool, Tatum, to estimate critical paths and slacks for an application bench-
mark mapped onto a modelled CGRA. In Chapter 4, we demonstrated that we can esti-
mate CGRA area and performance with reasonable accuracy, provided that interconnect
delays are also included in the estimates. We also demonstrated that the estimation en-
gine can be applied to gauge the area/delay impact of architecture modifications, and that
comparing entirely different architectures requires a more accurate interconnect delay model. We
expect the estimation engine will be useful for early assessment of hypothetical CGRA
architectures.
5.1 Future Work
A first direction for future work is to further improve the interconnect delay estimation
by using per-primitive estimation models instead of a single average model for all
primitives. Another area requiring refinement on this front is for the interconnect model
to take into account the location of the CGRA tiles, for example, to realize that the
toroidal connections in ADRES-O are longer wires than many of the nearest-neighbour
connections. More diverse PE and FU implementations are also important to verify
the accuracy of interconnect delay estimation.
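As an illustration of what such a location-aware model would capture, the physical length of a link can be derived from tile coordinates, so that a single-hop toroidal link is costed as a wire spanning the whole row. The tile pitch and function below are hypothetical, not CGRA-ME code:

```python
TILE_PITCH_UM = 150.0  # hypothetical tile pitch (um); not a characterized value

def wire_length_um(src_col, dst_col):
    """Physical horizontal wire length between two tiles in the same row.

    A toroidal link from the last column back to column 0 is one logical hop,
    but the wire itself must cross the whole row.
    """
    return abs(dst_col - src_col) * TILE_PITCH_UM

# On a 4-column row: nearest-neighbour link vs toroidal wraparound link.
print(wire_length_um(1, 2))  # -> 150.0
print(wire_length_um(3, 0))  # -> 450.0, despite being a single logical hop
```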
Another future direction is to build characterization databases for non-standard-cell
CGRA implementations. CGRAs may, for example, be implemented as FPGA overlays,
where they would undoubtedly have entirely different area and delay characteristics on a
primitive-by-primitive basis. It may also be necessary to re-think what exactly constitutes
a primitive in the FPGA-overlay context. Likewise, an entirely different interconnect
delay model may be required.
Further avenues for exploration include using the performance estimates provided
by this research within the CGRA mapping algorithm to realize performance-driven
CGRA mapping, as opposed to the strictly feasibility-driven mapping used in the current
CGRA-ME. It would also be interesting to extend the current work to provide early
power estimation, in addition to area and performance. This would allow CGRAs to be
compared against other computing platforms from the power angle.
Finally, while the work here considered estimation for area- and performance-optimized
CGRA standard-cell implementations, it would be useful to consider alternatives to this.
Some options here include: 1) balanced area/performance implementations, or 2) imple-
mentations wherein the standard-cell tools make use of optimized IP cores for arithmetic,
memory, or other CGRA structures.
Appendix A
Area Modelling
(a) Area-optimized target (b) Delay-optimized target
Figure A.1: ADRES-O Architecture: Area breakdown in µm²
(a) Area-optimized target (b) Delay-optimized target
Figure A.2: ADRES-D Architecture: Area breakdown in µm²
(a) Area-optimized target (b) Delay-optimized target
Figure A.3: HyCUBE Architecture: Area breakdown in µm²
Appendix B
Performance Modelling
(a) Area-optimized target
(b) Delay-optimized target
Figure B.1: ADRES-D Architecture: Critical path delay comparison – CGRA-ME estimations with selected fanout count overridden versus PrimeTime.
(a) Area-optimized target.
(b) Delay-optimized target.
Figure B.2: HyCUBE Architecture: Critical path delay comparison – CGRA-ME estimations with selected fanout count overridden versus PrimeTime.