Glitch Reduction and CAD Algorithm Noise in FPGAs
by
Warren Shum
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2011 by Warren Shum
Abstract
Glitch Reduction and CAD Algorithm Noise in FPGAs
Warren Shum
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2011
This thesis presents two contributions to the FPGA CAD domain. First, a study of
glitch power in a commercial FPGA is presented, showing that glitch power in FPGAs is
significant. A CAD algorithm is presented that reduces glitch power at the post-routing
stage by taking advantage of don’t-cares in the logic functions of the circuit. This method
comes at no cost to area or performance.
The second contribution of this thesis is a study of FPGA CAD algorithm noise –
random choices which can have an unpredictable effect on the circuit as a whole. An
analysis of noise in the logic synthesis, technology mapping, and placement stages is
presented. A series of early performance and power metrics is proposed, in an effort to
find the best circuit implementation in the noise space.
Acknowledgements
First and foremost, I would like to thank Professor Jason Anderson for supervising my
thesis research, and for guiding me along with good ideas and encouragement. I would
also like to thank Professors Jonathan Rose, Vaughn Betz, and Olivier Trescases, for
reviewing this work and serving on my defence committee.
I am also grateful to my parents, for supporting me in all my academic endeavors.
Thanks to my fellow research group members: Marcel, Bill, Jason L., James, Andrew,
Mark, Steven, Ahmed, Victor, Stefan, Alex, Kevin, and my office mates in PT477. I
appreciate the feedback on my work, the sporting activities, as well as just sharing
conversation.
Thanks to the staff at SciNet for their technical support.
I thank NSERC and OGS for financial support throughout my degree.
Contents
1 Introduction
1.1 Field-Programmable Gate Arrays
1.2 Glitch Power
1.3 CAD Algorithm Noise
2 Glitch Power and Don’t-Cares in FPGAs
2.1 Introduction
2.2 FPGA Architecture
2.3 Glitch Power in FPGAs
2.4 Previous Work on Glitch Reduction in FPGAs
2.5 Don’t-Cares in Logic Circuits
2.6 Glitch Power Analysis
2.7 Analysis of Don’t-Cares
2.8 Conclusion
3 Glitch Reduction Using Don’t-Cares
3.1 Introduction
3.2 Glitch Reduction Algorithm
3.2.1 Computing the Don’t-Cares for a LUT
3.2.2 Scanning the Input Vectors
3.2.3 Setting the Don’t-Cares
3.2.4 Iterative Flow
3.3 Methodology
3.4 Results
3.4.1 Fanout Splitting
3.5 Conclusion
4 FPGA CAD Algorithm Noise
4.1 Introduction
4.2 CAD Flow Stages
4.2.1 Logic Synthesis
4.2.2 Technology Mapping
4.2.3 Placement and Routing
4.3 Methodology
4.4 Noise Measurement: Before Place and Route
4.5 Noise Measurement: After Place and Route
4.6 Conclusion
5 Early Timing and Power Prediction With Noise
5.1 Introduction
5.2 Previous Work on Early Delay/Power Prediction
5.3 Delay Prediction
5.3.1 Varying pin delays
5.3.2 Logic, routing and constant factors
5.3.3 Maximum/scaled metrics
5.4 Power Prediction
5.5 Packing
5.6 Methodology
5.7 Results
5.8 Conclusion
6 Conclusions
6.1 Summary
6.1.1 Glitch Reduction
6.1.2 CAD Algorithm Noise
6.2 Future Work
6.2.1 Glitch Reduction
6.2.2 CAD Algorithm Noise
A Circuit Delay/Power Statistics With Noise
A.1 Individual Design Noise Results
Bibliography
List of Tables
2.1 Glitch example truth table for a logic function with inputs abc and output f. A possible example of cares is given (care = Y, don’t-care = N)
2.2 Percentage of dynamic power from glitches
2.3 Percentage of simulated local LUT input states corresponding to don’t-cares
4.1 Standard deviation of noise (before place-and-route)
4.2 Standard deviation of noise (after place-and-route)
5.1 Average percentile of top circuits with isolated model parameters
5.2 Average benefit of prediction models
A.1 Critical path delay statistics by circuit
A.2 Dynamic power statistics by circuit
List of Figures
2.1 (a) Logic blocks and routing in an island-style FPGA architecture. (b) Example of a 3-input LUT (look-up-table) with truth table in Table 2.1.
2.2 Example waveform showing a glitch on the output of a LUT f with truth table given in Table 2.1.
2.3 (a) Example of SDCs (left) and ODCs (right). (b) Miter circuit used in don’t-care analysis [Mish 05].
3.1 Example: before glitch reduction. (a) LUT with don’t-care SRAM bit shaded. (b) Simulation waveform.
3.2 Example: after glitch reduction. (a) LUT with altered don’t-care SRAM bit shaded. (b) Simulation waveform with glitches removed.
3.3 A cluster of don’t-cares.
3.4 Experimental flow.
3.5 (a) Dynamic power reduction vs. baseline (default) don’t-care settings and worst-case settings. (b) Glitch power reduction vs. baseline (default) don’t-care settings and worst-case settings.
3.6 Average vote bias.
3.7 (a) Power per signal vs. fanout. (b) Normalized don’t-cares per node vs. fanout.
3.8 Fanout splitting.
3.9 Stratix III adaptive logic module (ALM) [Alteb].
3.10 Dynamic power reduction from fanout splitting.
4.1 FPGA CAD flow.
4.2 Example of an And-Inverter Graph (AIG).
4.3 Example of an AIG before balancing (logic levels shown in parentheses).
4.4 Examples of balanced AIGs.
4.5 Example of AIG rewriting.
4.6 Number of circuits vs. normalized nodes/level (balancing noise).
4.7 Number of circuits vs. normalized nodes/level (rewriting noise).
4.8 Number of circuits vs. normalized nodes/level (refactoring noise).
4.9 Number of circuits vs. normalized nodes/level (depth-oriented mapping noise).
4.10 Number of circuits vs. normalized nodes/level (area-oriented mapping noise).
4.11 Number of circuits vs. normalized nodes/level (all noise).
4.12 Number of circuits vs. normalized delay.
4.13 Number of circuits vs. normalized dynamic power.
4.14 Synthesis noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
4.15 Delay rank of circuits under synthesis noise averaged across 4 and 5 placement seeds.
4.16 Technology mapping noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
5.1 Slow and fast inputs on lookup tables.
5.2 Example for the pin utilization timing model.
5.3 Example for the pin order timing model.
5.4 Probability of finding the top circuit vs. percentage of top modeled circuits considered (delay).
5.5 Percentile of predicted top circuits (delay).
5.6 Probability of finding the top circuit vs. percentage of top modeled circuits considered (power).
5.7 Percentile of predicted top circuits (power).
Chapter 1
Introduction
1.1 Field-Programmable Gate Arrays
Field-programmable gate arrays (FPGAs) are user-configurable logic devices capable of
implementing digital circuits. These devices are used in a wide variety of areas including
communications, automotive, industrial and consumer markets. The appeal of FPGAs
versus application-specific integrated circuits (ASICs) is that they allow the user to avoid
the high cost of chip fabrication and to reduce time-to-market. FPGAs allow a hardware
designer to prototype a design quickly, whereas an ASIC design would take more time
and money to repair, should an error be found. A mask set at 45nm can cost as much
as $2M [Fran 10], a cost high enough to drive away all but the highest-volume
applications.
To create an FPGA implementation of a design, a hardware engineer will typically use
a hardware description language (HDL), such as Verilog or VHDL. A series of computer-aided
design (CAD) tools transforms the HDL into a digital circuit that can be programmed
onto the FPGA. A typical sequence of steps in the CAD flow is as follows:
• Logic Synthesis: The logic functions needed to implement the circuit are derived
and optimized.
• Technology Mapping: The logic functions are mapped into the logic elements
specific to the target device architecture.
• Packing: The logic elements are grouped into larger units corresponding to the
target device architecture.
• Placement: The mapped logic elements are placed into physical locations on the
target device.
• Routing: The proper connections are made between the logic elements using the
programmable routing network.
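The flow above can be sketched as a simple pipeline of stage functions. This is an illustrative skeleton only: the function names and the dict standing in for a netlist are invented placeholders, not any real tool's interface.

```python
# Illustrative skeleton of the CAD flow stages listed above. The function
# names and the dict standing in for a netlist are invented placeholders,
# not any real tool's interface.

def synthesize(hdl):
    """Logic Synthesis: derive and optimize the logic functions."""
    return {"source": hdl, "stage": "synthesized"}

def tech_map(netlist):
    """Technology Mapping: map logic onto k-input LUTs."""
    netlist["stage"] = "mapped"
    return netlist

def pack(netlist):
    """Packing: group LUTs and flip-flops into logic blocks."""
    netlist["stage"] = "packed"
    return netlist

def place(netlist):
    """Placement: assign logic blocks to physical locations."""
    netlist["stage"] = "placed"
    return netlist

def route(netlist):
    """Routing: connect blocks through the programmable routing network."""
    netlist["stage"] = "routed"
    return netlist

def cad_flow(hdl):
    netlist = synthesize(hdl)
    for stage in (tech_map, pack, place, route):
        netlist = stage(netlist)
    return netlist

print(cad_flow("design.v")["stage"])  # routed
```

Each stage consumes the previous stage's result, which is why a choice made early (e.g. in synthesis) can ripple through every later stage.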
The quality of the resulting circuit depends on the quality of the tools used to generate
it. Quality can be measured in terms of area, performance and power. It is here that
FPGAs fall short of ASICs – the area, performance and dynamic power gaps between
them have been estimated at 40x, 4x and 12x, respectively [Kuon 07]. By studying
existing CAD algorithms and exploring new ones, FPGAs can close the gap with ASICs
and attract a larger portion of the digital logic market.
1.2 Glitch Power
As mentioned previously, one area for improvement in FPGAs is power consumption.
Power can be reduced through efforts at various stages: the architectural level, the
circuit level, or the CAD level (which will be the focus here). In particular, glitch
power (the power dissipated by unnecessary signal transitions) is an attractive target for
reduction since it comprises from 4% to 73% of total dynamic power, with an average of
22.6% [Lamo 08]. We present two contributions in this area, the results of which have
been published [Shum 11]:
1. An analysis of glitch power in commercial FPGAs.
2. A CAD approach for reducing glitch power at no cost to area or performance.
Chapter 2 provides background on FPGA glitch power. It begins with a description
of how glitches occur in FPGAs, and some previous works on how to reduce glitch
power. To motivate our research, we present our own analysis on glitch power
in commercial FPGAs. Our results show an average of 26% of dynamic power
from glitches. This chapter also describes don’t-cares in logic functions, which will
be used in the glitch reduction algorithm. We show that the average occurrence
of don’t-cares under simulation is sufficient to supply ample opportunities for our
algorithm.
Chapter 3 presents an algorithm for glitch power reduction which can be performed
post-routing, incurring zero area and performance cost. The algorithm takes ad-
vantage of don’t-care bits in the truth tables of functions in a circuit, setting them
to values which minimize the amount of glitch power dissipated. The algorithm is
tested with a commercial FPGA CAD tool suite and architecture, and shows an
average glitch power reduction of 13.7%, and an average dynamic power reduction
of 4.0%.
1.3 CAD Algorithm Noise
Given the tremendous challenge of solving modern-day CAD problems, the algorithms
used for these problems generally use heuristics to seek a reasonable solution in an ac-
ceptable amount of time. In the course of exploring the vast solution space of these
problems, there is often a need to choose between two or more alternatives that appear
to have the same quality. Such choices, although seemingly innocuous at the time of
selection, can have ripple effects on future choices, causing the final quality of the circuit
to vary if different choices are made. We label these variations as noise. We present the
following contributions in this area:
1. An analysis of a series of logic synthesis and technology mapping algorithms, ex-
posing potential sources of noise that have not been studied before.
2. Experimental results on the amount of noise present in several CAD algorithms, in
terms of critical path delay and dynamic power. The concept of power noise is also
a new contribution which has not been previously studied.
3. A method for predicting the best circuits in terms of performance and power in the
presence of noise.
Chapter 4 introduces the concept of CAD algorithm noise. We expose hidden sources
of noise in the logic synthesis and technology mapping algorithms of the academic
CAD tool ABC [Berk 06]. We present the results of our noise analysis, showing
the effects of random choices in thousands of circuit compilations. The results of
the noise injection show a standard deviation of as much as 3.3% in critical path
delay, and 3.7% in dynamic power.
Chapter 5 presents a solution to the variance in circuit quality produced by CAD algo-
rithm noise. The idea is to perform several synthesis and mapping runs of a circuit
(using different seeds) and use early timing and power metrics to predict the best
one(s) to advance to the placement and routing stages. This would save the time
that would be spent on a large number of place-and-route runs. In this chapter, a
wide array of early timing prediction models are evaluated, including several ap-
proaches to estimating logic and routing delays. For power prediction, two fast
simulation models are used, as well as information from the packing stage of the
CAD flow. The application of these prediction models in a commercial FPGA leads
to an average benefit of up to 1.8% in delay and 1.8% in power compared to the
average noise-injected circuit.
Chapter 6 concludes the work. We summarize the contributions of the previous chapters
and present possible extensions and related research topics for future work.
Chapter 2
Glitch Power and Don’t-Cares in
FPGAs
2.1 Introduction
Power in FPGAs can be divided into two categories: static power and dynamic power.
Static power is due to current leakage in transistors. Dynamic power is a result of signal
transitions between logic-0 and logic-1. These transitions can be split into two types:
functional transitions and glitches. Functional transitions are those which are necessary
for the correct operation of the circuit. Glitches, on the other hand, are transitions that
arise from unbalanced delays to the inputs of a logic gate, causing the gate’s output to
transition briefly to an intermediate state. Although glitches do not adversely affect the
functionality of a synchronous circuit (as they settle before the next clock edge), they
have a significant effect on power consumption. Using an academic FPGA model, glitch
power has been estimated to comprise from 4% to 73% of total dynamic power, with an
average of 22.6% [Lamo 08]. This is a significant motivator for the reduction of glitch
power.
As a means of reducing glitch power, we seek to take advantage of don’t-cares in
a circuit. Don’t-cares are an important concept in logic synthesis and are frequently
used for the optimization of logic circuits. A don’t-care of a logic function within a
larger circuit is an input state for which the function’s output can be either logic-0 or
logic-1, without affecting the circuit’s correctness. Don’t-cares can come from external
constraints or from within the circuit itself. An external constraint may be specified by
the designer (e.g. asserting that a certain input combination will never be applied). A
logic function within a circuit may also have don’t-cares due to its surrounding logic,
for example, if the logic feeding the function’s fanins can never satisfy a certain input
combination, or if the function’s output does not affect the circuit’s primary outputs
under certain circumstances.
This chapter is organized as follows. Section 2.2 gives a brief overview of basic FPGA
architecture. Section 2.3 describes how glitches occur in FPGAs. Section 2.4 summarizes
some previous works on FPGA glitch reduction. Section 2.5 describes don’t-cares and
how they can be found. Section 2.6 gives our analysis of glitch power, while Section 2.7
gives our analysis of don’t-cares. Section 2.8 summarizes the chapter.
2.2 FPGA Architecture
Before presenting our glitch analysis and glitch reduction method, it is important to recap
some basic FPGA architecture and terminology. Fig. 2.1(a) shows a section of a typical
island-style FPGA architecture. It is composed of logic blocks connected to one another
through a programmable routing network. Programmable routing switches (shown as x’s
in Fig. 2.1(a)) allow pins on logic blocks to be programmably connected to pre-fabricated
metal wire segments, and also allow wire segments to be programmably connected with
one another to form routing paths.
Inside the logic blocks, logic functions are implemented using look-up-tables (LUTs).
An example is shown in Fig. 2.1(b). A k-input LUT can implement any logic function of
up to k variables. In essence, a LUT is a hardware implementation of a truth table, where
Figure 2.1: (a) Logic blocks and routing in an island-style FPGA architecture. (b) Example of a 3-input LUT (look-up-table) with truth table in Table 2.1.
the output value for each minterm is held in an SRAM configuration cell (bit). A k-input
LUT requires 2k configuration bits. For this work, we target an FPGA that contains 6-
input LUTs, which are typical of modern commercial FPGA architectures [Altec, Xili].
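As a sketch of this, a LUT can be modeled as a list of 2^k configuration bits indexed by the input minterm. The function below and its MSB-first indexing are illustrative choices, not a description of vendor hardware.

```python
# Sketch of a LUT as a truth table: a k-input LUT holds 2**k configuration
# bits, and the input values select one of them. The MSB-first indexing and
# the function below are illustrative choices, not vendor hardware.

def lut_eval(config_bits, inputs):
    """config_bits: 2**k output bits; inputs: tuple of k bits (MSB first)."""
    assert len(config_bits) == 2 ** len(inputs)
    index = 0
    for bit in inputs:
        index = (index << 1) | bit   # build the minterm index
    return config_bits[index]

# The 3-input function of Table 2.1: f = 1 only for minterms 100 and 101.
f_bits = [0, 0, 0, 0, 1, 1, 0, 0]
print(lut_eval(f_bits, (1, 0, 1)))   # 1
print(len(f_bits))                   # 2**3 = 8 configuration bits
```

For a 6-input LUT, the table grows to 2^6 = 64 configuration bits per LUT.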
2.3 Glitch Power in FPGAs
The dynamic power consumed by an FPGA can be modeled by the formula
$P_{dyn} = \frac{1}{2} \sum_{i=1}^{n} S_i C_i f V_{dd}^2$    (2.1)
where n is the number of nets in the circuit, Si is the switching activity of net i, Ci is
the capacitance of net i, f is the frequency of the circuit, and Vdd is the supply voltage.
The glitch reduction algorithm presented in this work aims to lower the switching activity
as a means of reducing dynamic power.
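A toy numeric sketch of Equation 2.1 (all values invented) shows why this works: with the capacitances, frequency and supply voltage fixed, lowering switching activity lowers power proportionally.

```python
# Toy numeric sketch of Equation 2.1 with invented values: S_i in transitions
# per clock cycle, C_i in farads, f in Hz, Vdd in volts. Lowering switching
# activity (e.g. by removing glitches) lowers power with everything else fixed.

def dynamic_power(S, C, f, vdd):
    return 0.5 * sum(s * c for s, c in zip(S, C)) * f * vdd ** 2

C = [10e-15, 20e-15]                       # per-net capacitances (invented)
p_glitchy = dynamic_power([0.4, 0.6], C, 200e6, 1.1)
p_clean   = dynamic_power([0.2, 0.5], C, 200e6, 1.1)
print(p_glitchy > p_clean)                 # True
```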
As a result of the differences in delays through the routing network and LUTs them-
selves, signals arriving at LUT inputs may transition at different times, leading to glitches.
Figure 2.2: Example waveform showing a glitch on the output of a LUT f with truth table given in Table 2.1.
abc   f   Care
000   0   Y
001   0   Y
010   0   Y
011   0   Y
100   1   N
101   1   Y
110   0   N
111   0   Y

Table 2.1: Glitch example truth table for a logic function with inputs abc and output f. A possible example of cares is given (care = Y, don’t-care = N).
An example is shown in Fig. 2.2. This LUT implements the 3-input function given in
Table 2.1. Consider the case where the inputs transition from 000 → 111. Ideally, the
output f would remain constant at 0. However, varying arrival times on the inputs may
cause an input transition sequence such as 000 → 100 → 110 → 111, causing f to make
a 0 → 1 → 0 → 0 transition rather than remaining at 0. This leads to extra power
consumed by the LUT and any of its fanouts that propagate the glitch. Furthermore,
the glitch is propagated through the FPGA interconnect which presents a high capacitive
load due to its long metal wire segments and programmable (buffered) routing switches.
Prior work has shown, in fact, that interconnect accounts for 60% of total FPGA dynamic
power [Shan 02].
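The scenario above can be replayed directly against the truth table of Table 2.1; the dictionary below encodes that table, and the staggered input sequence produces the spurious output pulse.

```python
# Replaying the transition scenario from the text on the Table 2.1 function:
# the staggered sequence 000 -> 100 -> 110 -> 111 makes f pulse 0 -> 1 -> 0,
# whereas the ideal simultaneous transition 000 -> 111 would leave f at 0.

F = {"000": 0, "001": 0, "010": 0, "011": 0,   # truth table from Table 2.1
     "100": 1, "101": 1, "110": 0, "111": 0}

sequence = ["000", "100", "110", "111"]
outputs = [F[v] for v in sequence]
print(outputs)   # [0, 1, 0, 0] -- the intermediate 1 is the glitch

# Extra toggles beyond the functional (zero-delay) behavior cost energy.
toggles = sum(a != b for a, b in zip(outputs, outputs[1:]))
print(toggles)   # 2 spurious transitions instead of 0
```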
2.4 Previous Work on Glitch Reduction in FPGAs
Glitch reduction techniques can be applied at various stages in the CAD flow. Since
glitches are caused by unbalanced path delays to LUT inputs, it is natural to design
algorithms that attempt to balance the delays. This can be done at the technology
mapping stage [Chen 07b], in which the mapping is chosen based on glitch-aware switch-
ing activities. Another approach operates at the routing stage [Dinh 09], in which the
faster-arriving inputs to a LUT are delayed by extending their path through the rout-
ing network. Delay balancing can also be done at the architectural level. The work
in [Lamo 08] inserts programmable delay elements to balance the arrival times of signals
at LUT inputs. However, these approaches all incur an area or performance cost.
Some works use flip-flop insertion or pipelining to break up deep combinational logic
paths which are the root of high glitch power. Circuits with higher degrees of pipelining
tend to have lower glitch power because they have fewer logic levels, thus reducing the
opportunity for delay imbalance [Wilt 04]. Flip-flops with shifted-phase clocks can be
inserted to block the propagation of glitches [Lim 05]. Another work in [Czaj 07] uses
negative edge-triggered flip-flops in a similar fashion, but without the extra cost of gen-
erating additional clock signals. It is also possible to apply retiming to the circuit by
moving flip-flops to block glitches [Fisc 05].
Our work draws inspiration from hazard-free logic synthesis techniques for asyn-
chronous circuits, such as [Lin 95]. In asynchronous circuits, glitches (hazards) cannot
be tolerated because they may produce incorrect behavior (consider, for example, the
disastrous effect of a glitch on a handshaking signal). Our work is different in that while
hazards are tolerable from a functionality standpoint, it is beneficial to remove them to
reduce power consumption.
A key feature of the work presented here is that it has no impact on the rest of
the design flow. It is applied after placement and routing, and as a consequence, the
algorithm has no cost in terms of performance or area. Other methods incur additional
area/delay from the inclusion of delay elements, registers and extra routing resources, as
well as disrupting the synthesis and layout of the circuit in an unpredictable way. Our
approach maintains the results of the existing compilation while only making changes to
the don’t-cares within LUT truth table configuration bits. This zero-overhead property is
highly desirable and is not shared by previous glitch reduction approaches.
2.5 Don’t-Cares in Logic Circuits
To prevent glitches, we take advantage of don’t-cares. These are entries in the truth
table where a LUT’s output can be set as either logic-0 or logic-1 without affecting the
correctness of the circuit. Don’t-cares fall into two categories: satisfiability don’t-cares
(SDCs) and observability don’t-cares (ODCs) [Mish 09]. SDCs occur when a particular
input pattern can never occur on the inputs to a LUT. In the example shown in Fig. 2.3(a),
the inputs a = 0, b = 1 will never occur. ODCs occur when the output of a LUT cannot
propagate to the circuit’s primary outputs. In the example, the output of f2 has no
effect when c = 0.
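A minimal sketch of the two kinds of don't-cares follows. The upstream and downstream logic here is invented so as to be consistent with the patterns just described; it is not necessarily the circuit of Fig. 2.3.

```python
# Minimal sketch of the two kinds of don't-cares. The upstream and downstream
# logic here is invented so as to be consistent with the patterns described in
# the text; it is not necessarily the circuit of Fig. 2.3.
from itertools import product

# SDC: suppose upstream logic computes b = a AND g, so b can be 1 only
# when a is 1. Then the local pattern (a, b) = (0, 1) can never occur.
reachable = {(a, a & g) for a in (0, 1) for g in (0, 1)}
sdc = set(product((0, 1), repeat=2)) - reachable
print(sorted(sdc))    # [(0, 1)]

# ODC: suppose f2 feeds an AND gate with c; when c = 0 the gate output is 0
# regardless of f2, so f2's value is unobservable.
def observable(c):
    return c == 1

print(observable(0))  # False -- f2 has no effect when c = 0
```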
In this work, we leverage the don’t-care analysis capabilities of the ABC logic synthesis
system developed at UC Berkeley [Berk 06]. ABC incorporates Boolean satisfiability
(SAT)-based complete don’t-care analysis that can be used to determine the don’t-care
minterms for a given LUT in a technology mapped FPGA circuit [Mish 05]. To find the
don’t-cares for a given LUT, f , ABC uses a miter circuit, as illustrated in Fig. 2.3(b).
As shown, two instances of LUT f and (some of) its surrounding circuitry are created –
the surrounding circuitry is shown as a shaded region in the figure. In one instance, f ’s
output is in true form; in the other instance, f ’s output is inverted. The outputs of the
two instances are exclusive-OR’ed with one another, with the XOR gate outputs being fed
into a wide OR gate. The final OR gate produces an output logic signal C(x) for a given
input vector x.
Figure 2.3: (a) Example of SDCs (left) and ODCs (right). (b) Miter circuit used in don’t-care analysis [Mish 05].
For an input vector x to the miter in Fig. 2.3(b), one can compute a local input
vector y to LUT f . For any such x where C(x) is logic-1, y is a care minterm of LUT f ;
that is, LUT f affects the circuit outputs for input vector x. The basic approach taken
in [Mish 05] is to use a fast vector-based simulation as well as SAT to find all vectors,
x, where C(x) evaluates to logic-1, yielding the complete care set for LUT f . This
provides a general picture of the don’t-care analysis approach and the reader is referred
to [Mish 05] for full details. Don’t-cares have recently been used for area reduction in
FPGA circuits [Mish 09].
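The miter idea can be sketched by brute force on a tiny invented circuit: flip f's output and check whether any primary output changes. Real ABC uses vector simulation plus SAT rather than enumeration, so this is only a conceptual sketch.

```python
# Brute-force sketch of the miter idea on a tiny invented circuit: flip LUT
# f's output and check whether the primary output changes. Input vectors x
# where nothing changes mark f's local vector y as a don't-care. Real ABC
# uses vector simulation plus SAT [Mish 05] rather than enumeration.
from itertools import product

def circuit_out(x, flip_f=False):
    a, c = x
    b = 1 - a              # upstream inverter: b = NOT a
    f = a & b              # the LUT under analysis
    if flip_f:
        f ^= 1             # the complemented copy in the miter
    return f & c           # downstream logic masking f when c = 0

cares = set()
for x in product((0, 1), repeat=2):
    y = (x[0], 1 - x[0])   # local input vector seen by f
    if circuit_out(x) != circuit_out(x, flip_f=True):
        cares.add(y)       # miter output C(x) = 1: y is a care minterm

dont_cares = set(product((0, 1), repeat=2)) - cares
print(sorted(cares))       # [(0, 1), (1, 0)]
print(sorted(dont_cares))  # [(0, 0), (1, 1)] -- SDCs, since b = NOT a
```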
2.6 Glitch Power Analysis
To motivate the need for glitch reduction, we examine the amount of glitch power dissi-
pated by 20 MCNC benchmark designs. These designs were fully compiled using Altera
Quartus 10.1, targeting 65nm Stratix III devices [Alteb]. ModelSim 6.3e was then used
to perform a functional (zero-delay) and timing simulation of each circuit using 5000 ran-
dom input vectors, producing two switching activity (VCD) files. The VCD files contain
a record of every transition of every net in the circuit. The dynamic power was then
computed using Quartus PowerPlay – Altera’s power analysis tool. The glitch filtering
setting was enabled, as it only filters glitches that are too short to occur in an actual
FPGA.

Circuit     % glitch     Circuit     % glitch
alu4        25.7         ex5p        41.6
apex2       29.2         frisc       10.7
apex4       30.3         misex3      25.4
bigkey      29.6         pdc         36.7
clma        24.2         s298        24.2
des         45.4         s38417      26.8
diffeq       5.8         s38584.1    11.4
dsip        29.9         seq         26.2
elliptic    12.2         spla        33.2
ex1010      35.0         tseng       17.5
Average     26.0

Table 2.2: Percentage of dynamic power from glitches.

We only consider the core dynamic power – that is, no static power and no I/O
power. This was done in order to avoid skewing the results with power components un-
related to glitching. The glitch power was computed as the difference in dynamic power
between the functional and timing simulations.
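The measurement itself reduces to a subtraction of the two simulations' dynamic power; the numbers below are invented for illustration only.

```python
# Sketch of the measurement: glitch power is the dynamic power of the timing
# simulation minus that of the functional (zero-delay) simulation. The power
# numbers below are invented for illustration.

def glitch_fraction(p_timing, p_functional):
    """Fraction of timing-simulation dynamic power attributable to glitches."""
    return (p_timing - p_functional) / p_timing

print(round(100 * glitch_fraction(13.5, 10.0), 1))  # 25.9 (% glitch power)
```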
The results are shown in Table 2.2. The percentage of dynamic power due to glitches
ranges from 5.8% to 45.4%, with an average of 26.0%, which is similar to that reported
in the academic FPGA context [Lamo 08]. This makes glitches an attractive target for
power reduction in commercial FPGAs. We do not believe any prior published work has
analyzed glitch power in a commercial FPGA.
2.7 Analysis of Don’t-Cares
In order to evaluate the potential for a don’t-care-based glitch reduction algorithm, we
analyzed every local input vector seen by each LUT in each circuit across its timing
simulation. This was done by taking the simulation output VCD generated by ModelSim
and inputting it to ABC. In ABC, we traverse the simulation vectors for each LUT, and
count the number of local input vectors to that LUT which correspond to its don’t-cares.
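That count can be sketched as follows; the care set and simulated trace are invented stand-ins for what ABC derives from the circuit and the VCD.

```python
# Sketch of the don't-care occupancy count for one LUT: given its care set
# and the local input vectors it saw during timing simulation, report the
# fraction of visited states that are don't-cares. The care set and trace
# below are invented stand-ins for what ABC derives from the VCD.

def dc_visit_percentage(care_set, simulated_vectors):
    dc_hits = sum(1 for v in simulated_vectors if v not in care_set)
    return 100.0 * dc_hits / len(simulated_vectors)

cares = {(0, 0), (0, 1), (1, 1)}                  # (1, 0) is a don't-care
trace = [(0, 0), (1, 0), (1, 1), (1, 0), (0, 1)]  # simulated local states
print(dc_visit_percentage(cares, trace))          # 40.0
```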
The percentage of such LUT input states which were don’t-cares is shown in Table 2.3.
Circuit     % inputs DC     Circuit     % inputs DC
alu4        18.4            ex5p        36.2
apex2        7.4            frisc        5.2
apex4       17.8            misex3      17.0
bigkey       3.7            pdc         37.2
clma        32.4            s298        15.3
des          0.8            s38417      10.3
diffeq       3.9            s38584.1     3.1
dsip         4.6            seq          7.6
elliptic     0.7            spla        33.8
ex1010      34.6            tseng       12.4
Average     15.1

Table 2.3: Percentage of simulated local LUT input states corresponding to don’t-cares.
The percentages vary from 0.8% to 37.2%, with an average of 15.1%. This tells us that
not only do circuits contain an abundance of don’t-cares, but also that, surprisingly, these
don’t-cares are often traversed in circuit operation. In other words, a LUT’s don’t-care
minterms are frequently “visited” under vector stimulus. The visits to such don’t-care
minterms may potentially lead to additional unnecessary toggles on LUT outputs. We
can thus potentially reduce glitches through don’t-care settings, which is the core idea of
our approach (which will be described in the next chapter).
2.8 Conclusion
In this chapter, we introduced basic FPGA architecture and gave an introduction to
power consumption in FPGAs. We summarized some previous works in the area of glitch
reduction. We described how glitches are generated, and presented our own analysis of
glitch power consumption in commercial FPGAs. Glitch power was found to comprise
an average of 26.0% of total dynamic power. We also explained logical don’t-cares and
how they can be found, as well as analyzing how often they occur in circuits. It was
found that an average of 15.1% of visited LUT input states are don’t-cares. Together,
these results indicate that glitch power is a good target for power reduction, and that
don’t-cares are prevalent enough to enable a don’t-care based glitch reduction algorithm.
This algorithm will be presented in the next chapter.
Chapter 3
Glitch Reduction Using Don’t-Cares
3.1 Introduction
In this chapter, we present a glitch reduction optimization algorithm based on don’t-
cares. It sets the output values for the don’t-cares of logic functions in such a way that
reduces the amount of glitching. This process is performed after placement and routing,
using timing simulation data to guide the algorithm. Relative to prior published FPGA
glitch reduction techniques, our approach is entirely new, and leverages the ability to
re-program FPGA logic functions without altering the placement and routing. Since the
placement and routing are maintained, this optimization has zero cost in terms of area
and delay, and can be executed after timing closure is completed.
Section 3.2 describes the new algorithm for glitch reduction. Section 3.3 describes the
methodology for testing the algorithm. Section 3.4 shows the power reduction results,
and Section 3.5 summarizes the chapter.
3.2 Glitch Reduction Algorithm
We begin with an example to illustrate how don’t-cares can be used to prevent glitches.
The general idea is to simulate the circuit, then traverse the simulation vectors for each
Figure 3.1: Example: before glitch reduction. (a) LUT with don’t-care SRAM bit shaded. (b) Simulation waveform.
LUT, focusing on vectors corresponding to don’t-cares. We keep a count of the number of
instances for each don’t-care when we would prefer setting it to logic-0 or logic-1 (based
on the care outputs surrounding it). We will refer to these counts as “votes”. When the
end of the simulation vectors is reached, we set the don’t-cares to the value (logic-0 or
logic-1) corresponding to the more popular vote.
Figs. 3.1(a) and 3.1(b) show an example of a LUT and its simulation waveform. Let
us assume that the truth table row for abc = 100 corresponds to a don’t-care, found using
the method described in Section 2.5. We illustrate the don’t-care by shading its SRAM
configuration bit in Fig. 3.1(a). We also assume that the don’t-care bit is currently set
to logic-1 – an arbitrary choice. We initialize the vote counts to 0 (vote0 = 0, vote1 = 0).
Now, we traverse the waveform of Fig. 3.1(b) from left to right, stopping when we
encounter an input corresponding to a don’t-care (DC). In this case, we encounter the
don’t-care input abc = 100 in the second time step. We then consider the previous LUT
output and the next LUT output. In this case, we see that they are both logic-0. If
we were to change the output for abc = 100 to logic-0 instead of logic-1, we would be
able to prevent two glitch transitions on f . Therefore, we increment the vote counter
for logic-0 (vote0 = 1, vote1 = 0). In the fourth time step, we encounter another don’t-
care flanked by two logic-0 outputs. We increment the vote counter for logic-0 again
(vote0 = 2, vote1 = 0).
At the sixth time step, we see that the neighboring outputs of this don’t-care instance
are logic-0 and logic-1. In this case, there would be one transition on f whether the
don’t-care is set to logic-0 or logic-1. Therefore, no change is made to the vote counts
(vote0 = 2, vote1 = 0). At this point, we have exhausted the simulation waveform. We
set the don’t-care bit to logic-0, since vote0 is greater than vote1. The resulting LUT and
waveform are shown in Figs. 3.2(a) and 3.2(b). We can see that four glitch transitions
have been eliminated on output f .
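The vote-counting in this example can be sketched in a few lines of Python. This is an illustrative reconstruction, not the ABC implementation: the waveform is a time-ordered list of (input vector, output) samples, the don't-care set is given separately, and the outputs recorded at don't-care visits are unused (shown as None).

```python
def count_votes(waveform, dc_set):
    """For each visit to a don't-care minterm, compare the nearest care
    outputs before and after, and cast a vote if both agree."""
    votes0, votes1 = {}, {}
    for i, (vec, _out) in enumerate(waveform):
        if vec not in dc_set:
            continue  # care state: no vote to cast
        # nearest care outputs in the past and the future of this visit
        prev = next((o for v, o in reversed(waveform[:i]) if v not in dc_set), None)
        nxt = next((o for v, o in waveform[i + 1:] if v not in dc_set), None)
        if prev == 0 and nxt == 0:
            votes0[vec] = votes0.get(vec, 0) + 1
        elif prev == 1 and nxt == 1:
            votes1[vec] = votes1.get(vec, 0) + 1
        # disagreeing neighbors: one transition either way, no preference
    return votes0, votes1

# A waveform mirroring the example: don't-care abc=100 is flanked by
# care outputs 0/0 twice, then by 0/1 once.
dc = {(1, 0, 0)}
wf = [((0, 0, 0), 0), ((1, 0, 0), None), ((1, 1, 0), 0),
      ((1, 0, 0), None), ((0, 1, 0), 0), ((1, 0, 0), None), ((1, 1, 1), 1)]
```

Running `count_votes(wf, dc)` reproduces the final tally of the example: two votes for logic-0 and none for logic-1 on minterm 100.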
A more formal expression of the glitch reduction algorithm is shown in Algorithm 1.
Figure 3.2: Example: after glitch reduction. (a) LUT with altered don’t-care SRAM bit shaded. (b) Simulation waveform with glitches removed.
It takes a placed and routed netlist as its input. We represent the netlist as a graph
G(V,E), where V is the set of vertices (LUTs) and E is the set of edges (routing wires).
The algorithm also takes a value change dump (VCD) file containing the results of a
timing simulation of the circuit. The simulation vectors are denoted as S, where the ith
local input vector to LUT n is denoted as Sn[i]. A timing simulation is needed rather than
a functional one because glitches arise from delay mismatches, which will only appear
under timing simulation.
The algorithm iterates through each LUT in the netlist, progressing from shallower
levels to deeper ones. This order is used because glitches prevented on shallower LUTs
will be prevented from propagating to deeper LUTs, thus saving more power. Within
each level, the LUTs are examined in descending order of power consumption. This
prioritizes the LUTs with the greatest potential savings. For each LUT, the following
steps are performed:
1. Compute the don’t-cares of the LUT.
2. Scan the input vectors.
3. Set the values of the don’t-cares.
3.2.1 Computing the Don’t-Cares for a LUT
As described previously in Section 2.5, we use ABC’s SAT-based don’t-care analysis
to compute the input states (minterms) of the particular LUT which are don’t-cares
(Algorithm 1, line 3). DC is the set of don’t-care input states.
3.2.2 Scanning the Input Vectors
The sequence of local input vectors to the LUT (denoted Sn) is extracted from the timing
simulation VCD file. These input vectors are examined in order (line 5). When an input
Algorithm 1 Glitch reduction algorithm.
Input: a netlist G(V, E) with simulation vectors S
Output: a netlist with modified LUT functions
 1: for each LUT n ∈ V in order of priority do
 2:   {1. Compute the don’t-cares of the LUT}
 3:   DC = compute_dont_cares(n)
 4:   {2. Scan the input vectors}
 5:   for i = 0 to size(Sn) do
 6:     if Sn[i] ∈ DC then
 7:       prev ← previous care output
 8:       next ← next care output
 9:       if prev = 0 and next = 0 then
10:         Votes0(Sn[i]) ← Votes0(Sn[i]) + 1
11:       else if prev = 1 and next = 1 then
12:         Votes1(Sn[i]) ← Votes1(Sn[i]) + 1
13:       end if
14:     end if
15:   end for
16:   {3. Set the values of the don’t-cares and update netlist}
17:   for each don’t-care d ∈ DC do
18:     if Votes0(d) > Votes1(d) then
19:       assign 0 as the output of d
20:     else if Votes1(d) > Votes0(d) then
21:       assign 1 as the output of d
22:     end if
23:   end for
24: end for
vector Sn[i] corresponding to a don’t-care is reached (line 6), we look at the closest
states in the past and future that correspond to care input vectors (lines 7-8). We use
this information to decide whether this don’t-care should be set to a logic-0, logic-1, or
whether there is no preference. If the closest past and future cares are identical (both
logic-0 or both logic-1) then the don’t-care should be set to the same value. Otherwise,
there is no preference. For each don’t-care minterm, a count of “votes” is kept, indicating
how many times in the simulation it would be beneficial to set it to a logic-0 or logic-1
(lines 9-12). This process is repeated for each input vector Sn[i] in the full simulation
time (lines 5-15).
Consider again the example shown in Fig. 2.2 and Table 2.1. Suppose that for input
Sn[i] = 100, the LUT output is a don’t-care. This means that even though it is assigned
to logic-1 in the truth table, we can assign it to logic-0 or logic-1 without affecting the
Figure 3.3: A cluster of don’t-cares.
functionality of the circuit. In this case, we see a glitch on f making a 0 → 1 → 0 → 0
transition as the inputs transition 000→ 100→ 110→ 111. Looking at the closest care
states before and after input 100, we see that they both output a logic-0. Therefore, the
algorithm votes for the output of 100 to be logic-0.
It is possible that the simulation data may include a long contiguous cluster of don’t-
cares. In these cases, the more desirable state could be the opposite of the one that
would be chosen by this algorithm. For example, it may be beneficial to set a particular
don’t-care to logic-0 within a cluster of logic-0’s (don’t-cares) in between two logic-1’s
(cares) rather than attempting to set the entire cluster to logic-1. This situation is
illustrated in Figure 3.3. The fourth time step shows a don’t-care surrounded by other
don’t-cares which have high vote0 (i.e. they will be set to 0). Therefore, we can see that
setting this DC to 0 would be preferable. However, the algorithm would set it to 1, as
the nearest cares are both 1. This would cause a glitch. Fortunately, experimental data
shows that such long clusters are uncommon. The average length of don’t-care clusters
in the benchmark set is 3.5. This justifies our use of the closest care input vectors.
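The cluster-length statistic can be computed directly from the trace. A minimal sketch, assuming a hypothetical representation with one boolean per time step marking whether the LUT's local input landed on a don't-care:

```python
def dc_cluster_lengths(is_dc):
    """Lengths of maximal runs of consecutive don't-care time steps."""
    lengths, run = [], 0
    for d in is_dc:
        if d:
            run += 1          # extend the current don't-care cluster
        elif run:
            lengths.append(run)  # cluster ended at a care state
            run = 0
    if run:
        lengths.append(run)   # trailing cluster at end of trace
    return lengths

# Hypothetical trace with clusters of length 3, 1, and 2.
trace = [False, True, True, True, False, True, False, False, True, True]
```

Averaging the returned lengths over all LUTs and circuits yields the cluster-length figure quoted above.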
3.2.3 Setting the Don’t-Cares
When the end of the input vectors is reached, each don’t-care is set to the value with more
votes (unless the votes are tied, in which case nothing is done – the choice is arbitrary).
The loop at lines 17-23 walks through each don’t-care d ∈ DC (the set of don’t-care
minterms) and checks whether logic-0 or 1 has a majority of votes. The netlist is updated
accordingly before proceeding to the next LUT. This is critical because changing the logic
function of one LUT can affect the don’t-cares of other LUTs, due to incompatibility
between don’t-cares [Mish 09]. By ensuring that the don’t-cares are computed using the
most recent information, the circuit is guaranteed to remain functionally-equivalent to
the original.
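The resolution step is a simple majority comparison, with a tie leaving the configuration bit at its current (arbitrary) value. A sketch, with a hypothetical function name not taken from the implementation:

```python
def resolve_dc(votes0, votes1, current):
    """Pick the output value for one don't-care minterm."""
    if votes0 > votes1:
        return 0
    if votes1 > votes0:
        return 1
    return current  # tie: the choice is arbitrary, so keep the existing bit
```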
3.2.4 Iterative Flow
Following the modification of the circuit, the simulation results become outdated, due
to the changes to the LUT functions. Therefore, we repeat the simulation using the
modified circuit after performing glitch reduction on the full circuit. The algorithm is
then repeated. In practice, the majority of the glitch reduction occurs within the first
three iterations.
It is important to note that the loop of the iterative flow does not involve re-running
placement and routing. This is vital for two main reasons. First, the results of the
existing compilation will be preserved, so there is no interference with timing closure.
Second, the delays within the circuit will be kept the same, thus minimizing the amount
of change to the simulation vectors. This allows the algorithm to converge quickly.
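The iterative flow can be summarized as the loop below. This is a schematic sketch only; `timing_simulate` and `reduce_glitches` are hypothetical stand-ins for the ModelSim timing simulation and the ABC-based don't-care assignment, and the placement and routing are never touched inside the loop.

```python
MAX_ITERS = 3  # most of the reduction is observed within the first few passes

def iterative_glitch_reduction(netlist, input_vectors,
                               timing_simulate, reduce_glitches):
    """Alternate simulation and don't-care setting on a fixed place-and-route."""
    for _ in range(MAX_ITERS):
        vectors = timing_simulate(netlist, input_vectors)   # VCD-style trace
        netlist, changed = reduce_glitches(netlist, vectors)  # LUT bits only
        if not changed:  # converged: no LUT function was modified this pass
            break
    return netlist
```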
The algorithm runtime is on the order of minutes for the benchmarks used. Al-
though the iterative process employs a timing simulation, the fact that this algorithm is
performed after place-and-route mitigates the issue of runtime. We envision a usage sce-
nario in which the designer runs this algorithm as part of a final pass after timing closure
has been achieved. Since no modifications are made to the circuit’s timing characteristics,
timing closure is preserved.
Figure 3.4: Experimental flow.
3.3 Methodology
We perform our glitch reduction algorithm on 20 MCNC benchmark circuits. The exper-
imental methodology was chosen to include commercial CAD tools wherever possible, to
evaluate the efficacy of the algorithm on real-world FPGAs. The flow is shown in Fig. 3.4.
We perform a full compilation using Quartus II 10.1 (synthesis, placement and routing)
targeting the Altera Stratix III 65nm FPGA family [Alteb]. This is followed by a timing
simulation using ModelSim SE 6.3e. For each circuit, 5000 random input vectors are
applied. We use a set of custom scripts to transform the simulation netlist generated by
Quartus into BLIF format, which can then be read into ABC, where the glitch reduction
is performed. Combinational equivalence checking (command cec in ABC [Mish 06c]) is
used after the glitch reduction step to ensure that the functionality of the circuit remains
the same. The output from ABC is used to modify the configuration bits in the simula-
tion netlist, thus ensuring that the placement and routing remain identical. Three passes
of the optimization loop are performed. Experiments show that very few changes, if any,
are made after this point (i.e. further iterations have virtually no effect). The power
measurements are performed using Quartus PowerPlay.
Figure 3.5: (a) Dynamic power reduction vs. baseline (default) don’t-care settings and worst-case settings. (b) Glitch power reduction vs. baseline (default) don’t-care settings and worst-case settings.
3.4 Results
The leftmost bars in Fig. 3.5(a) (vs. baseline) represent the percentage reduction in total
core dynamic power after performing the glitch reduction algorithm. Immediately, we
can see that about half of the circuits benefit from the algorithm. The average reduction
is 4.0%, with a peak of 12.5%. Fig. 3.5(b) shows the corresponding reduction in glitch
power. The average reduction is 13.7%, with a peak of 49.0%. Naturally, the amount of
power reduction possible is based on the amount of glitching present and the number of
don’t-cares available. While the overall average power reductions are relatively modest,
we believe they will interest FPGA vendors and power-sensitive FPGA customers, as they
come at no cost to performance or area. For some circuits, over 10% power reduction
can be achieved essentially for “free”.
It is also interesting to look at the optimized power vs. the worst case don’t-care
settings possible, as illustrated by the rightmost bars in Fig. 3.5 (vs. worst-case). In this
experiment, we set the don’t-cares to the opposite of how they would normally be set
by our optimization algorithm, to examine the potential worst-case glitch power arising
from don’t-cares. Here, we see an average total dynamic power savings of 9.8% and
a peak savings of 30.8% (Fig. 3.5(a)). These results show that don’t-care settings can
potentially have a large impact on power if set to sub-optimal values.
The varied results in Fig. 3.5 can be correlated with the glitch power and don’t-care
data in Tables 2.2 and 2.3. For instance, des had a high glitch power in Table 2.2, yet
we did not observe a significant power reduction for this circuit. However, in Table 2.3,
we see that it had only 0.8% of LUT inputs as don’t-cares, thus reducing the number of
opportunities for optimization. On the other hand, pdc had a high amount of glitching
as well as ample don’t-cares, thus allowing it to be greatly improved by the algorithm –
12.5% dynamic power reduction.
We also examined the bias of votes cast on each don’t-care minterm in each LUT
in each circuit. The average results are shown in Fig. 3.6. The bias is defined as the
Figure 3.6: Average vote bias.
percentage of votes that were cast for the more popular setting, whether logic-0 or logic-
1. Bias is calculated for each don’t-care individually and averaged across the circuit. As
shown in the figure, the bias value tends to be in the 80-100% range, indicating that there
usually exists a highly preferable setting for a particular don’t-care minterm in a LUT.
This is an important observation because it indicates that our don’t-care settings are
providing a benefit most of the time (as opposed to the case of a bias around 50%, which
would imply that selecting either logic-0 or logic-1 for the don’t-care minterm is equally
good). These observations suggest that there usually exists a value for each don’t-care
(either 0 or 1) that is much better than the other, meaning that one can pick don’t-care
logic values with a high degree of confidence.
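The bias statistic can be expressed compactly. The sketch below is illustrative, with hypothetical names; it computes the per-minterm bias and averages it, skipping minterms that received no votes.

```python
def vote_bias(votes0, votes1):
    """Percentage of votes cast for the more popular value (50-100),
    or None if this minterm was never voted on."""
    total = votes0 + votes1
    if total == 0:
        return None
    return 100.0 * max(votes0, votes1) / total

def average_bias(vote_pairs):
    """Average bias over all don't-care minterms with at least one vote."""
    biases = [b for b in (vote_bias(v0, v1) for v0, v1 in vote_pairs)
              if b is not None]
    return sum(biases) / len(biases) if biases else None
```

A bias near 100% means one setting dominated; a bias near 50% would mean the two settings were equally attractive.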
The relationship between don’t-cares, power and fanout presents a challenge to the
glitch reduction algorithm. Fanout is closely related to interconnect capacitance, and
interconnect can represent 60% of total FPGA dynamic power, on average [Shan 02].
Fig. 3.7(a) shows logic signal power consumption versus fanout, averaged across all signals
in all circuits. Observe that, as expected, average signal power increases with fanout,
due to the increase in capacitance. We also examined, for each signal, the fraction of
minterms in its driving LUT that were don’t-cares, and averaged this across all signals
Figure 3.7: (a) Power per signal vs. fanout. (b) Normalized don’t-cares per node vs. fanout.
Figure 3.8: Fanout splitting.
of a given fanout in all circuits. The results are shown in Fig. 3.7(b). While the results
are “noisy” for high fanout (due to a small sample size for such fanouts), we see that, in
general, high fanout signals have fewer don’t-cares in their driving LUTs than low fanout
signals. The rationale for this is that high fanout signals are more likely to be used by
at least one of their fanouts, decreasing ODCs for such signals. Essentially, we have two
competing trends in that it is desirable to reduce the power of high fanout signals (as
they consume significant power), yet such signals exhibit fewer don’t-care opportunities.
Figure 3.9: Stratix III adaptive logic module (ALM) [Alteb].
3.4.1 Fanout Splitting
Based on the trend of high-fanout signals having fewer don’t-cares, it seemed reasonable
to examine this as a potential area for improvement. Consider a LUT f1 with fanout
LUTs FO1...FOn. Suppose that LUTs FO1...FOn−1 do not care about the value of f1
when its input is x, but FOn does care about it. Then x is a care for f1, thereby reducing
the amount of don’t-care optimization opportunities, even though only one of its fanouts
uses it.
A possible solution to this problem is to duplicate LUT f1, creating f2, and trans-
ferring fanout FOn from f1 to f2. This would increase the amount of don’t-cares on f1,
since x would now be a don’t-care. In general, f1 can be split into two LUTs, f ′1 and
f2 (i.e. we redistribute the fanout of f1, moving some of its fanout to f2). Each LUT
now has more don’t-care opportunities, since the cares “generated” by fanouts of f ′1 are
no longer present in f2, and vice versa. An example is given in Fig. 3.8. The LUT f1
has four fanouts which have care set 1 (illustrated by the hatch marks as a subset of the
truth table). In other words, if no other fanouts existed besides those four, the overall
care set of f1 would be care set 1. The fifth fanout has care set 2. The overall care set
of f1 is the union of these care sets.
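In set terms, the overall care set is the union of the care sets induced by the individual fanouts, which is exactly what fanout splitting exploits: partitioning the fanouts partitions the unions. A small sketch with hypothetical minterm sets:

```python
def overall_care_set(fanout_care_sets):
    """Union of the care sets contributed by each fanout of a LUT."""
    cares = set()
    for cs in fanout_care_sets:
        cares |= cs
    return cares

# Hypothetical minterms: four fanouts share care set 1; a fifth
# contributes care set 2 (cf. Fig. 3.8).
care1, care2 = {0b000, 0b011, 0b111}, {0b100}
before = overall_care_set([care1] * 4 + [care2])  # f1 must respect both sets
after_f1 = overall_care_set([care1] * 4)          # f1' keeps the four fanouts
after_f2 = overall_care_set([care2])              # f2 takes the fifth fanout
```

After the split, each duplicate's care set is strictly smaller than the original union, leaving more minterms free for don't-care optimization.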
By splitting the fanout of f1 among two new LUTs, f ′1 and f2, we can create two
Figure 3.10: Dynamic power reduction from fanout splitting.
LUTs with smaller care sets and therefore more don’t-care optimization opportunities.
However, this incurs a power cost in duplicating the LUT and some routing resources.
The fanin routing would have to be duplicated for the new LUT. This would add to
the capacitance of these fanin signals. Fortunately, the Stratix III architecture [Alteb]
provides us with a way to mitigate this cost. The Adaptive Logic Module (ALM) shown
in Fig. 3.9 is essentially a pairing of two LUTs. By co-locating f1 and f2 in the same
ALM, we can virtually eliminate the cost of routing to an entirely new LUT. This is
because the routing to one LUT is shared with the routing to the other. This is a special
opportunity offered by the Stratix III architecture.
Figure 3.10 shows the dynamic power reduction resulting from fanout splitting. Some
circuits could not be placed and routed after fanout splitting due to illegal placement
constraints. This is because pairing certain LUTs into a single ALM may cause issues
with the compatibility between the LUTs. Unfortunately, the possible power reduction
is quite low, aside from a 5% reduction on alu4. Several circuits even show an increase
in power. This is due to the extra LUT that must be used, as well as its associated
routing resources. Considering the tradeoff of saving the occasional glitch transition
versus the overhead of adding more logic and routing resources, the fanout splitting is
rarely beneficial. Therefore, we decided not to further pursue fanout splitting.
3.5 Conclusion
In this chapter, we presented an analysis of glitch power in FPGAs and a method for glitch
reduction using don’t-cares in logic synthesis. We showed that glitch power is a significant
portion of total power, and that there exist ample opportunities for don’t-care-based
optimizations. A novel glitch reduction technique was presented that sets don’t-cares in
FPGA configuration bits in order to avoid glitch transitions. This method is performed
after placement and routing, and has no effect on circuit area or performance. The
algorithm was evaluated with a commercial 65nm FPGA architecture using a commercial
tool flow. The algorithm achieved an average total dynamic power reduction of 4.0%,
with a peak reduction of 12.5%; glitch power was reduced by up to 49.0%, and 13.7% on
average.
Chapter 4
FPGA CAD Algorithm Noise
4.1 Introduction
The process of designing a circuit for an FPGA platform generally involves writing code
in a hardware description language such as Verilog or VHDL, then compiling the code
to a bitstream that will be programmed onto the FPGA. This compilation process is
broken into a series of CAD stages. Due to the complex nature of these problems, the
CAD algorithms make use of heuristics to handle them.
CAD algorithms commonly encounter situations where a choice must be made be-
tween two or more alternatives that appear to have the same quality. For example, a
logic function might be implemented in multiple ways, each having the same local cost
in terms of area, delay, power, or some other metric. However, the choice of how that
function is implemented may have an unknown global effect on the quality of the circuit.
In practice, the choice may be arbitrarily made (e.g. always select the first alternative)
or it may be controlled with the use of a random number generator. By running the
algorithm multiple times using different seeds for the random number generator, we can
obtain a set of circuits with different characteristics. The variation in the quality of these
circuits (area, performance, power) through seemingly neutral changes is what we will
call noise. It is interesting to note that noise places a limit on the prediction accuracy of
any timing/power estimation tools that are used prior to a noise-containing CAD algorithm. One of the goals of this work is to quantify the amount of noise present in several
CAD algorithms.
The practice of trying multiple seeds, or “seed sweeping,” is well-established for placement and routing [Altea]. However, noise can also be found in the logic
synthesis and technology mapping stages of the CAD flow. By exposing the noise in these
earlier stages, we hope to allow seed sweeping to take place earlier, in less time-consuming
stages.
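Conceptually, seed sweeping is just a minimize-over-seeds loop. The sketch below is a schematic illustration with hypothetical `compile_with_seed` and `quality` callbacks (a lower quality value is taken as better, e.g. critical-path delay or dynamic power):

```python
def seed_sweep(compile_with_seed, quality, num_seeds=10):
    """Compile with several RNG seeds and keep the best-quality result."""
    best, best_q = None, None
    for seed in range(num_seeds):
        circuit = compile_with_seed(seed)  # one full (or partial) CAD run
        q = quality(circuit)               # metric to minimize
        if best_q is None or q < best_q:
            best, best_q = circuit, q
    return best, best_q
```

The cost of the sweep is proportional to the stage at which it is applied, which is why exposing noise in the cheaper, earlier stages is attractive.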
The following questions are addressed in the remaining chapters:
• Where in the CAD flow does noise come from?
• How much noise exists in the various stages of the compilation flow?
• Is there a way to predict the best circuits from a group of candidates, in the presence
of noise?
In this chapter, we examine several CAD algorithms in the logic synthesis and tech-
nology mapping stages and expose noise in those algorithms. To our knowledge, there
is no prior work studying noise in these algorithms, nor is there existing work on power
noise in FPGAs (variations in dynamic power consumption due to CAD algorithm noise).
Section 4.2 presents background on the particular CAD algorithms to be studied, and
describes our noise injection method. Section 4.3 describes our methodology for eval-
uating the amount of noise present in a set of benchmark circuits. Section 4.4 shows
the performance and power results before place-and-route, while Section 4.5 shows the
results after place-and-route. Section 4.6 summarizes the chapter.
4.2 CAD Flow Stages
A typical FPGA CAD flow is shown in Fig. 4.1. It consists of the following steps:
Figure 4.1: FPGA CAD flow.
• Logic Synthesis: The logic functions needed to implement the circuit are derived
and optimized. We will be exploring new ways to inject noise into this stage.
• Technology Mapping: The logic functions are mapped into the logic elements
specific to the target device architecture. We will investigate noise in this stage as
well.
• Packing: The logic elements are grouped into larger units corresponding to the
target device architecture. We do not introduce noise in this stage, because we
are using commercial tools to perform packing and have no way to modify the
algorithm.
• Placement and routing: The mapped logic elements are placed into physical
locations on the target device, and the proper connections are made between the
logic elements using the programmable routing network. The presence of noise in
this stage has already been established, but it will still be considered in this work.
One of the few works to consider noise in FPGA CAD [Rubi 11] examines the
amount of delay noise in the routing stage of VPR [Betz 97]. The authors in-
voke randomness in the PathFinder routing algorithm by changing the order of
nets routed and making small perturbations in circuits. One experiment involves
changing the routing architecture to include some slightly faster wires such that the
maximum impact to critical path delay should be 0.5%. However, this modifica-
tion was experimentally shown to cause changes of -34% to +15%. This work also
proposes a technique to reduce this noise through delay-targeted routing, which
calculates the criticality of a route using a fixed delay target rather than a floating
one.
The work presented here focuses on the logic synthesis and technology mapping stages.
In particular, we use the algorithms implemented in the academic tool ABC [Berk 06].
These algorithms are explained in further detail below, as well as our new methods for
injecting noise into each of them.
4.2.1 Logic Synthesis
The algorithms studied in this stage act on an And-Inverter Graph (AIG) which is a
representation of a logic circuit using only two-input AND gates and inverters. This
is the primary data structure used by ABC. An example is shown in Fig. 4.2. The
large circles (nodes) represent AND gates, while the small dots on the edges represent
inversion. This example shows the function ¬x1 ∧ ((x2 ∨ x3) ∧ (x4 ∧ x5)).
The general goal of algorithms in this stage is to reduce the number of nodes in the
AIG and the number of logic levels, which is the maximum number of nodes from a
combinational input to a combinational output. In the example of Fig. 4.2, there are
Figure 4.2: Example of an And-Inverter Graph (AIG).
four nodes and three levels.
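Node and level counts can be computed by a topological traversal of the AIG. The sketch below uses a hypothetical dict-based representation in which each AND node maps to its two fanins and each primary input maps to None; inverters are omitted, since in ABC they are edge attributes and contribute neither nodes nor levels.

```python
def aig_stats(aig, outputs):
    """aig: dict node -> (fanin0, fanin1) for AND nodes, None for inputs.
    Returns (number of AND nodes, number of logic levels)."""
    levels = {}

    def level(n):
        # level of an input is 0; an AND node sits one above its deepest fanin
        if n not in levels:
            fi = aig[n]
            levels[n] = 0 if fi is None else 1 + max(level(fi[0]), level(fi[1]))
        return levels[n]

    num_nodes = sum(1 for fi in aig.values() if fi is not None)
    return num_nodes, max(level(o) for o in outputs)

# The structure of the Fig. 4.2 example: four AND nodes, three levels.
aig = {'x1': None, 'x2': None, 'x3': None, 'x4': None, 'x5': None,
       'n1': ('x2', 'x3'), 'n2': ('x4', 'x5'),
       'n3': ('n1', 'n2'), 'n4': ('x1', 'n3')}
```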
Three potential noise sources were identified in the logic synthesis stage:
• AIG balancing: And-Inverter Graph balancing is a technique that aims to reduce
the number of levels in an AIG [Mish 11]. An example of this is shown in Figs. 4.3
and 4.4. In Fig. 4.3, we see in the ellipse an AIG representing a 5-input AND. This
subgraph of the AIG has a depth of 4. In Fig. 4.4, we see two examples of AIGs
that could be generated by AIG balancing. In both cases, the number of levels is
reduced to 3. However, the balancing can be done in multiple ways, by placing
different signals on the shallower inputs. Balancing is done in two main steps:
– Tree covering: This step identifies multi-input AND gates in the AIG by
grouping together nodes which are not inverted and have no external fanout.
Figure 4.3: Example of an AIG before balancing (logic levels shown in parentheses).
An example is shown by the ellipse in Fig. 4.3. The tree cannot be expanded
to include x4 as it is inverted, and it cannot include x5 as it has another
fanout.
– Tree balancing: For each multi-input AND gate identified by the tree cov-
ering stage, the tree balancing stage decomposes it into a balanced tree of
two-input AND gates. The balancing is done considering the logic levels of
the nodes feeding the multi-input AND. The process is shown in Algorithm 2.
The algorithm essentially pairs nodes together until the tree is formed. It
begins by taking the lowest level node as the first one to be paired (line 3).
It then finds the nodes with the next lowest level, between the indices of
leftBound and rightBound (lines 5-9).
Figure 4.4: Examples of balanced AIGs.
At this point, the ABC code makes an arbitrary selection between the nodes
(selecting in the same way every time). However, we change the algorithm to
select one of these nodes randomly (line 11). The rand(m,n) function gives a
random integer between m and n (inclusive). Finally, this node is paired with
the first one into a two-input AND gate, replacing the original nodes. This
process continues until the last two nodes are paired. Consider the example in
Fig. 4.3 where the logic levels are in parentheses. Two nodes with the lowest
level (3) are randomly chosen and paired into another node with level 4. The
pairing might proceed as follows (Fig. 4.4, left):
∗ Begin by sorting input nodes in descending order by level
x1(4), x2(4), x3(3), x4(3), x5(3)
∗ Randomly choose two nodes with the lowest level (x4(3) and x5(3)) and
combine them into a new node, x45(4)
x1(4), x2(4), x45(4), x3(3)
Figure 4.5: Example of AIG rewriting.
∗ Combine x3(3) with a random level 4 node, x45(4), to form x345(5)
x345(5), x1(4), x2(4)
∗ Combine two level 4 nodes x1(4) and x2(4) into x12(5)
x12(5), x345(5)
∗ Combine final two nodes to complete balancing
x12345(6)
Alternatively, the random pairing may be done this way (Fig. 4.4, right):
∗ x1(4), x2(4), x3(3), x4(3), x5(3)
∗ Select x3(3) and x5(3) randomly (instead of x4(3) and x5(3) as before)
x1(4), x2(4), x35(4), x4(3)
∗ Combine x4(3) with a random level 4 node, x1(4), to form x14(5)
x14(5), x2(4), x35(4)
∗ Combine two level 4 nodes x2(4) and x35(4) into x235(5)
x14(5), x235(5)
∗ Combine final two nodes to complete balancing
x12345(6)
Algorithm 2 Tree balancing algorithm.
Input: a vector V of input nodes to the multi-input AND gate, sorted by decreasing level
Output: a balanced AIG
 1: while size(V) > 1 do
 2:   {1. Get the node with minimum level}
 3:   node1 ← V[size(V) − 1]
 4:   {2. Identify the nodes with the next lowest level (between leftBound and rightBound)}
 5:   rightBound ← size(V) − 2
 6:   leftBound ← rightBound
 7:   while leftBound ≥ 0 and level(V[leftBound − 1]) = level(V[rightBound]) do
 8:     leftBound ← leftBound − 1
 9:   end while
10:   {3. Select a node randomly (NEW)}
11:   node2 ← V[rand(leftBound, rightBound)]
12:   {4. Pair the nodes}
13:   newNode ← AND(node1, node2)
14:   remove(V, node1, node2)
15:   insert(V, newNode)
16: end while
17: return newNode
This shows that the tree balancing stage does not provide a unique solution,
and is therefore a source of noise.
• AIG rewriting: AIG rewriting is an algorithm that reduces the number of nodes/logic
levels in an AIG by examining subgraphs of nodes and replacing them with lower-
cost substitutes [Mish 06b]. An example is shown in Fig. 4.5. In this case, the
AIG subgraph represents a 3-input AND. Using rewriting, it can be reduced from
3 nodes to 2. Algorithm 3 loops through each node n in the AIG (line 1) and enu-
merates all 4-input cuts of n (line 2). A cut of a node n is a set of nodes (leaves)
such that each path from a primary input to n passes through at least one node
of the cut. Each cut is replaced with equivalent AIG subgraphs from a hash table
of precomputed subgraphs (line 5). If the subgraph leads to a reduction in AIG
nodes, it is kept.
To add randomness at this stage, we modify the algorithm to allow changes even
when the replacement leads to no change in the AIG node count. It is kept with
a 50% probability (line 8). These are known as “zero-cost” replacements. If a
Algorithm 3 AIG rewriting algorithm.
Input: an AIG and a hash table of precomputed subgraphs S
Output: a rewritten AIG
1:  for each node n of the AIG in topological order do
2:    for each 4-input cut c of n do
3:      bestGain ← −1
4:      bestS ← NULL
5:      for each possible rewriting option s from HashLookup(S, c) do
6:        gain ← SavedNodes(s, c) − AddedNodes(s, c)
7:        {If zero-cost, keep the change with 50% probability (NEW).}
8:        if gain > 0 or (gain = 0 and rand(0, 1)) then
9:          if bestS = NULL or gain ≥ bestGain then
10:           bestGain ← gain
11:           bestS ← s
12:         end if
13:       end if
14:     end for
15:     if bestS ≠ NULL then
16:       Update(AIG, bestS)
17:     end if
18:   end for
19: end for
new subgraph is found (leading to a cost reduction or zero-cost change), the AIG is
updated (line 16). The zero-cost replacements are a source of noise for the rewriting
algorithm.
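The noisy option selection of Algorithm 3 can be sketched as follows. This is a hedged Python illustration; `choose_rewrite`, the option names, and the gain values are hypothetical and do not reflect ABC's API.

```python
import random

def choose_rewrite(options, rng=random.Random(1)):
    """Sketch of the option selection in Algorithm 3.

    `options` is a list of (name, gain) pairs, where
    gain = saved_nodes - added_nodes for a candidate subgraph.
    Zero-cost options are kept with 50% probability (the injected noise)."""
    best_gain, best = -1, None
    for name, gain in options:
        # Positive gains are always considered; zero gains only half the time.
        if gain > 0 or (gain == 0 and rng.random() < 0.5):
            if best is None or gain >= best_gain:
                best_gain, best = gain, name
    return best  # None means the cut is left unchanged
```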
• AIG refactoring: This technique involves computing one large cut for each AIG
node, then replacing it with a factored form with fewer nodes [Mish 06a]. The cuts
are chosen based on how much reconvergence they contain, which is an indicator of
redundancy that can be exploited by refactoring. Refactoring differs from rewriting
in that it acts on a larger scale (by default, rewriting is done on 4-input cuts, while
refactoring can go as high as 16). The noise injection in this stage is similar to
the method used in AIG rewriting. New AIG subgraphs are generated, and the
replacements are made with a 50% probability if the new subgraphs result in a
zero-cost change.
The above algorithms are repeated in sequence several times as part of the ABC script
resyn2. This lets the algorithms create new optimization opportunities for each other.
Algorithm 4 Cut comparison algorithm.
Input: two cuts, c1 and c2
Output: 1 if c1 is better, −1 if c2 is better
1:  if metric1(c1) > metric1(c2) then
2:    return 1
3:  end if
4:  if metric1(c2) > metric1(c1) then
5:    return −1
6:  end if
7:  {...repeat for all metrics...}
8:  if metricn(c1) > metricn(c2) then
9:    return 1
10: end if
11: if metricn(c2) > metricn(c1) then
12:   return −1
13: end if
14: {If still tied, decide order randomly (NEW).}
15: if rand(0, 1) then
16:   return 1
17: else
18:   return −1
19: end if
4.2.2 Technology Mapping
We use the priority cut-based technology mapping algorithm in ABC [Mish 07]. The
goal of this stage is to map the logic of the AIG to K-input functions which can be
implemented by LUTs on the FPGA (K depends on the FPGA architecture). The
mapper does this by first evaluating a set of priority cuts for each node in an AIG. These
cuts represent potential LUT implementations of that node. The cuts are selected and
sorted in terms of delay, number of inputs, and area. At this point in the CAD flow,
logic depth is used as a proxy for delay, and the number of LUTs is used as a proxy for
area.
The priority cuts for each node are sorted by several criteria, depending on the map-
ping parameters. These criteria include depth and cut size. The random noise in this
stage comes from deciding between cuts with the same values for each of these metrics.
Algorithm 4 shows the cut comparison function used when sorting priority cuts. It begins
by comparing the two cuts for each metric in order. If the cuts are tied in all cases, our
modification to the algorithm makes a random selection between them.
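The comparison of Algorithm 4 can be sketched in Python as follows. The function and metric names are our own, metrics are assumed to be arranged so that a larger value is better (per the return convention of Algorithm 4), and the seeded generator stands in for ABC's random source.

```python
import random

def compare_cuts(c1, c2, metrics, rng=random.Random(2)):
    """Sketch of Algorithm 4: lexicographic cut comparison with a
    random tiebreak. `metrics` is a list of scoring functions;
    returns 1 if c1 is better, -1 if c2 is better."""
    for m in metrics:
        if m(c1) > m(c2):
            return 1
        if m(c2) > m(c1):
            return -1
    # All metrics tied: decide the order randomly (the injected noise).
    return 1 if rng.random() < 0.5 else -1
```

For example, with metrics that prefer lower depth, then smaller cut size (negated so larger is better), ties in every metric fall through to the random tiebreak.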
The mapping algorithm makes several passes over the netlist, stitching together the
best results (using depth-optimal mappings on critical paths, and area-oriented mappings
elsewhere). We introduce noise in two stages:
• Depth-oriented mapping: Here, the logic depth metric is prioritized over area.
• Area-oriented mapping: Area is prioritized over depth.
4.2.3 Placement and Routing
The placement problem deals with assigning physical locations to each of the logic
blocks in a circuit. A common technique for placement is simulated annealing [Kirk 83].
This algorithm mimics the annealing of metals, a process in which a material is heated,
then cooled in order for its atoms to settle into a low-energy configuration. In placement,
the “atoms” are logic blocks, which are moved around randomly. The random moves are
controlled by the current “temperature” of the anneal, which determines the likelihood
of accepting moves even when they reduce the current quality of the placement. This
“hill-climbing” quality allows the algorithm to avoid being stuck in a local minimum of
the solution space.
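The acceptance rule described above can be illustrated with a generic Metropolis criterion. This is a textbook sketch of simulated-annealing acceptance, not Quartus' proprietary implementation; the function name and parameters are our own.

```python
import math
import random

def accept_move(delta_cost, temperature, rng=random.Random(3)):
    """Generic Metropolis acceptance rule for annealing-based placement.

    Improving moves are always taken; worsening moves are taken with
    probability exp(-delta_cost / temperature), which shrinks as the
    anneal cools (this is the hill-climbing behaviour)."""
    if delta_cost <= 0:
        return True   # the move improves (or keeps) placement quality
    if temperature <= 0:
        return False  # fully cooled: no more hill-climbing
    return rng.random() < math.exp(-delta_cost / temperature)
```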
The placement and routing stages of the CAD flow are done using Quartus II 10.1, a
commercial CAD tool from Altera. Since this is a commercial tool, we cannot implement
our own noise injection method. Instead, the noise injection in this stage is simply a
matter of changing the seed option to Quartus’ place-and-route tool. Although place-
and-route noise is not the focus of this work, it is still important to consider noise in this
stage because it can mask the effects of noise in the previous stages. For example, a good
placement (due to noise) might hide the negative effects of a mapping with inherently
bad quality, or vice versa. Therefore, it is necessary to evaluate the noise in this stage
and try to separate it from the noise in the previous stages.
4.3 Methodology
In order to evaluate noise over a large number of random seeds, many circuit compilations
were performed (over 10,000). To do so, experiments were conducted on the SciNet
high-performance computing system [Loke 10], which allowed us to run compilations in
parallel and complete thousands of them within a few days.
The tools used are the same as the ones used in the glitch power analysis section. We
use Altera’s Quartus 10.1 to pack, place and route the circuits and perform timing and
power analysis. The target FPGA family is Altera’s Stratix III. Modelsim 6.3e is used
for simulation to get toggle rates for the power estimation. 5000 random input vectors
are applied to each circuit.
The first benchmark set consists of 20 MCNC circuits. A second set of 7 benchmarks
was taken from the VPR 5.0 benchmark set, in order to have data for some larger circuits.
The circuits were selected by removing ones that were too similar to others (e.g. two
FIR filters with different parameters), and removing those which did not show significant
toggling on nets when subjected to random vector simulation (some circuits may require
specific input patterns to become active). This was done in order to get meaningful
dynamic power data.
4.4 Noise Measurement: Before Place and Route
In this section, the word “design” will be used to refer to all circuits having the same
original source file (e.g. “alu4” is a design). A “circuit” will refer to a particular
compilation of the design using certain seeds (e.g. “alu4” compiled with synthesis seed 1
and mapping seed 2 is a circuit). Six noise injection experiments are presented: one for
each of five noise injection stages tested individually, as well as one experiment containing
all noise injection stages. For each experiment, each of the 27 designs was processed by
ABC using 25 different seeds (making 25 ∗ 27 = 675 circuits). The results of the noise
Figure 4.6: Number of circuits vs. normalized nodes/level (balancing noise).
Figure 4.7: Number of circuits vs. normalized nodes/level (rewriting noise).
Figure 4.8: Number of circuits vs. normalized nodes/level (refactoring noise).
Figure 4.9: Number of circuits vs. normalized nodes/level (depth-oriented mapping noise).
Figure 4.10: Number of circuits vs. normalized nodes/level (area-oriented mapping noise).
Figure 4.11: Number of circuits vs. normalized nodes/level (all noise).
injection experiments are as follows:
1. AIG balancing (Fig. 4.6)
2. AIG rewriting (Fig. 4.7)
3. AIG refactoring (Fig. 4.8)
4. Technology mapping - depth-oriented (Fig. 4.9)
5. Technology mapping - area-oriented (Fig. 4.10)
6. All of the above (Fig. 4.11)
Each graph shows a histogram of the noise distribution for that stage. The x-axis
shows the normalized number of nodes (AIG nodes for unmapped circuits, LUTs for
mapped circuits) and the normalized number of logic levels. The results are normalized
to the average for each design. The y-axis shows the number of circuits in each bin (a
bin contains circuits falling to the left of its label). All graphs are shown with the same
scale to facilitate comparison.
From inspection, the circuits tend to fall in a normal probability distribution. The
synthesis stages (Figs. 4.6, 4.7, 4.8: balancing, rewriting, refactoring) tend to show wider
distributions. Note that the outliers in level count are generally due to low numbers
of logic levels (relative to the number of nodes in the circuit). It was observed that
any deviations in logic depth are limited to a single level. The node count distributions
are generally smoother. In the balancing stage, the majority of circuits appear to be
contained within +/- 1.0% of the mean (i.e. between 0.99 and 1.01), while the rewriting
and refactoring stages are tighter – around +/- 0.6% of the mean.
In contrast, noise in technology mapping (Figs. 4.9, 4.10) is much less than in syn-
thesis. The noise distributions are far narrower, showing that most circuits are within
0.2% of the mean in terms of node count and levels. When all noise injection stages are
Table 4.1: Standard deviation of noise (before place-and-route)

Noise injection    Node stdev.   Level stdev.
Balancing          0.004         0.013
Rewriting          0.003         0.010
Refactoring        0.002         0.010
Mapping - depth    0.001         0
Mapping - area     0.001         0
All                0.013         0.025
combined (Fig. 4.11), an accumulating effect can be seen, with the distribution stretching
as far as +/- 2.0%.
Table 4.1 shows the standard deviation of the normalized number of nodes and logic
levels. This gives a more quantitative view of the results in the previous graphs. Again, it
appears that the noise in earlier stages is greater, due to having more degrees of freedom.
Balancing, rewriting and refactoring show standard deviations as high as 0.4% (0.004) in
node count and 1.3% in levels.
In contrast, noise in technology mapping is limited to about 0.1% in node count,
while the noise in levels is virtually zero. This indicates that the cost functions used in
technology mapping seem to be more fine-grained, causing fewer ties between options
and thus fewer opportunities for randomness. This can be credited to the tiebreakers
used in the mapper described in Section 4.2.2, which use secondary and tertiary metrics to
distinguish between otherwise equivalent options. Furthermore, the mapping algorithm
is meant to produce depth-optimal mappings, so the lack of change in levels is expected.
Naturally, the noise of all stages combined is the greatest (1.3% in nodes and 2.5% in
levels), indicating that there is an accumulating effect from all stages together.
4.5 Noise Measurement: After Place and Route
In this section, we examine the amount of noise after the placement and routing stages.
For these experiments, circuits with logic depth greater than minimum depth for that
Figure 4.12: Number of circuits vs. normalized delay.
design were removed. The reasoning for doing so is as follows. Traditionally, logic depth
is the primary metric used for estimating performance before placement and routing. It is
clear that deeper circuits are likely to have greater delay than shallower ones. However,
we wish to explore new metrics other than logic depth (as will be seen in the next
chapter). In other words, we would like to find ways of differentiating between minimal-
depth mappings. This is why the circuits for each design are all of minimum depth, to
remove that factor from consideration.
At the post-routing stage, we can evaluate the circuits in terms of critical path delay
and dynamic power. Placement and routing were performed with Quartus 10.1, using
the “standard fit” setting (i.e. maximum effort). The critical path delay was obtained
using Quartus’ TimeQuest timing analyzer.
We begin by showing the amount of noise present when all noise sources are acti-
vated (balancing, rewriting, refactoring, delay and area-oriented technology mapping,
and placement). Five seeds are used in each of synthesis, mapping and placement for 27
designs, making a total of 5 ∗ 5 ∗ 5 ∗ 27 = 3375 circuits. Figs. 4.12 and 4.13 show the
number of compiled circuits vs. their critical path delay and dynamic power normalized
Figure 4.13: Number of circuits vs. normalized dynamic power.
Table 4.2: Standard deviation of noise (after place-and-route)

Noise injection    Delay stdev.   Power stdev.
Synthesis          0.018          0.027
Mapping            0.009          0.014
to the average for that design. As before, the results appear to show a normal distri-
bution, with the power noise being slightly greater than delay noise. The majority of
the circuits fall between 0.9 and 1.1 (+/- 10% from the mean). The standard deviation
of critical path delay is 3.3% and dynamic power is 3.7%. This is a fairly significant
amount, enough to drive the use of seed sweeping to find the circuit compilations with
the best results in this distribution.
Next, we attempt to isolate the effect of synthesis noise. Figs. 4.14(a) and 4.14(b)
show the number of circuits vs. their normalized delay and core dynamic power under
the influence of logic synthesis noise only. The x-axis scaling is kept the same to compare
the width of the noise distributions. 25 synthesis seeds were used for each design. All
synthesis noise sources were activated (balancing, rewriting, and refactoring). For each
circuit, the delay and power were averaged across five Quartus compilations using dif-
ferent place-and-route seeds. This was done in order to reduce the impact of placement
Figure 4.14: Synthesis noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
Figure 4.15: Delay rank of circuits under synthesis noise averaged across 4 and 5 placement seeds.
noise, in an attempt to isolate the synthesis noise only. To justify the use of five seeds,
we compared the delay ranking of circuits between experiments using four and five seeds.
That is, for each design, we average the critical path delay of each of its 25 candidate
circuits over four place-and-route seeds, and repeat for five seeds. We then rank those
25 candidates from 1 (best) to 25 (worst) for the four-seed average, then repeat for the
five-seed average. As shown in Fig. 4.15, the ranks were fairly well-correlated between
four and five seeds, so five was deemed a sufficient number of seeds to preserve accuracy
without increasing runtime to an unacceptable degree. However, averaging was not per-
formed over mapping seeds (i.e. some mapping noise is still present, although the mapper
seed is kept constant). This was done because averaging over mapping seeds would have
increased runtime by an additional factor of five. Therefore, the results should not be
interpreted as a pure isolation of synthesis noise, but simply the noise arising from seed
changes in synthesis.
The delay and dynamic power distributions again appear to follow normal distribu-
tions. They are narrower than the ones without averaging across placement seeds, which
shows the effect of the placement noise. Again, the power noise is greater than the delay
Figure 4.16: Technology mapping noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
noise. A few circuits have normalized power as low as 0.86 or as high as 1.16. Power can
be affected by all changes to a circuit, while delay is only affected by changes that touch
the critical path. The standard deviations of delay and power are shown in Table 4.2.
The standard deviations in delay and power are 1.8% and 2.7%, respectively. This in-
dicates the degree to which random, zero-cost changes in the logic synthesis stage affect
the overall quality of the circuit. Although these changes may appear “zero-cost” at the
time they are made, it is clear that they can have a significant effect on the overall circuit
quality.
Figs. 4.16(a) and 4.16(b) show results obtained in similar fashion for technology map-
ping noise. Again, all technology mapping noise sources were activated (depth and area-
oriented). The delay is contained in the 0.96-1.04 range while the dynamic power is in
the 0.92-1.08 range. As before, the power distribution is wider. For a more quantitative
view, the standard deviations are shown in Table 4.2. The variance arising from mapping
is less than the variance from synthesis, echoing the results seen before place-and-route
(Section 4.4). Again, this is likely due to good tiebreaking mechanisms in the mapper,
as well as the fact that there are fewer downstream CAD stages to be affected by noise
introduced in mapping.
4.6 Conclusion
In this chapter, we introduced the concept of CAD algorithm noise – variations in circuit
quality under the influence of cost-neutral changes in the compilation flow. We proposed
the new concepts of noise in logic synthesis and technology mapping, as well as power
noise. We also identified noise sources in various logic synthesis and technology mapping
algorithms, and we described our method of noise injection in those algorithms. Finally,
we presented experimental results to show the degree to which noise can affect circuit
quality. Under the influence of synthesis noise, standard deviations of critical path delay
and dynamic power were 1.8% and 2.7%, while the results for technology mapping were
0.9% and 1.4%, respectively. Under the influence of noise in all CAD stages, the standard
deviations were 3.3% in delay and 3.7% in power. This shows the significance of noise
in FPGA CAD algorithms, and motivates further research to better understand and
mitigate its effects. In the next chapter, we will do this by exploring ways to find the
best circuits from a group of candidates produced by a noise-injected algorithm.
Chapter 5
Early Timing and Power Prediction With Noise
5.1 Introduction
The previous chapter showed that CAD algorithm noise can have a significant impact
on the overall quality of a circuit. Given multiple compilations of the same design using
different seeds in the synthesis and mapping stages of the CAD flow, the performance
and power consumption of the resulting circuits can vary.
It is common practice for hardware engineers to perform multiple placement and
routing compilations using different seeds in order to find the best one. However, this is
a long process, taking hours or even days for the largest designs [Gort 10]. On the other
hand, synthesis and mapping are relatively quick. This leads to the question of whether
we can sweep seeds in the synthesis and mapping stages. Seed sweeping would still be
performed in placement and routing, but seed sweeping in synthesis and mapping could
generate better circuits as inputs to place-and-route. This process would require early
timing and power prediction metrics that could be computed at the post-mapping stage, in
order to find the best candidate circuits for placement and routing. By doing so, we
would be able to minimize the number of lengthy place-and-route runs.
This chapter deals with predicting the performance and dynamic power of circuits
after the technology mapping stage. Section 5.2 describes previous works on early delay
and power prediction. Section 5.3 describes our delay prediction method, while Sec-
tion 5.4 describes our power prediction method. Section 5.5 explains our consideration
of the packing stage of the CAD flow. Section 5.6 details our experimental methodology.
Section 5.7 shows the results of the delay/power prediction, and Section 5.8 concludes
the chapter.
5.2 Previous Work on Early Delay/Power Prediction
Early power and performance estimation has been done at various stages of the CAD
flow. At the high-level synthesis stage, power estimation can be done to drive low-power
resource allocation and binding techniques [Chen 07a]. Basic operations (e.g. addition,
multiplication) are characterized in terms of their area, delay and power consumption. By
building a library of these common operations, early estimates of the quality of a circuit
can be made. However, this technique is too coarse-grained for our purposes, since it
cannot detect small changes that would be made in the logic synthesis and mapping
stages.
At the pre-placement stage, work has been done to predict interconnect wirelength
and delay [Mano 07, Pand 07]. Doing so can help predict the critical path of a circuit, as
well as capacitances for power estimation. The work by Manohararajah et al. [Mano 07]
shows that interconnect delays can vary greatly when the placer seed is changed. It
proposes a simple timing model based on assigning a single value for each connection
depending on its source and destination node type and port (e.g. logic, I/O, memory).
However, this does not provide sufficient granularity for designs with few different port
types.
The work by Pandit and Akoglu [Pand 07] attempts to estimate wirelengths using
various structural metrics at the pre-placement stage. These metrics are taken from
various works in the ASIC domain and applied to FPGAs. The metrics are as follows:
• Intrinsic Shortest Path Length (ISPL) The ISPL [Kahn 05] between two nodes
is equal to the sum of weights of nets on the shortest path between the two nodes,
where the path does not include the net under consideration. The weight of a
net is related to the number of nodes connected to it. ISPL is supposed to show
a positive correlation with wirelength, since a longer path containing larger nets
would intuitively consume more space.
• Mutual Contraction (MC) MC [Liu 04] for a two-terminal net is calculated
based on the ratio of the weight of the net connecting the two nodes, and the
weights of nets connecting the nodes to outside nodes. Weights are based on net
fanout. MC can be interpreted as the ratio of forces pulling the nodes together
versus the forces pulling them apart. A strong MC indicates that the nodes are
tightly pulled together, implying a shorter wirelength.
• Logic Contraction (LC) For a net connecting a set of nodes N , LC [Liu 05] is
the sum of weights of edges connecting nodes from N to nodes outside N , divided
by the sum of weights of edges connecting nodes from N to other nodes inside N .
The numerator can be interpreted as the external forces pulling the nodes apart,
while the denominator is the sum of internal forces pulling the nodes together.
Therefore, a high LC indicates that the nodes adjacent to the net are being pulled
apart more than they are pulled together, suggesting a larger wirelength.
The metrics above show reasonable correlations with the average post-layout wire-
length of nets in a circuit, but fail to provide sufficient granularity for predicting wire-
lengths of individual nets. These works have shown that there is a great deal of variation
in the placement and routing stages, meaning that early delay and power prediction are
very difficult.
It should be noted that in general, early timing/power estimation works deal with
estimating the absolute values of critical path delay and power consumption. In contrast,
our work deals with applying early estimation to different compilations of the same design
with noise injection. This means that we are concerned with relative values instead.
5.3 Delay Prediction
In this section, we examine possible ways to predict the circuit with the lowest critical
path delay at the post-mapping stage, given a set of circuits compiled with different syn-
thesis/mapping seeds. Our delay prediction model begins by assigning each node (LUT)
a certain delay, then traversing the circuit graph in topological order and computing
the arrival time at each node. We examine numerous timing models, sweeping several
parameters:
• Varying pin delays
• Logic, routing and constant factors
• Maximum/scaled metrics
These parameters will be explained in the following sections.
5.3.1 Varying pin delays
Due to the tree-like structure of the multiplexer in a LUT, the delay from each of its input
pins varies (see Fig. 5.1). The inputs nearest to the output will have shorter delays than
the ones further away. This means that the slowest-arriving inputs should be assigned
to the pins with shorter delays and vice versa. The pin-dependent LUT delay for a LUT
n is modeled as:
logic delay(n) = pin index / total pins    (5.1)
Figure 5.1: Slow and fast inputs on lookup tables.
where pin index is the index of the pin (fastest = 1, slowest = total pins). In our case,
we use total pins = 6 for Stratix III. Based on this, two timing models can be applied:
• Pin utilization model: In this model, we consider the number of LUT input pins
used where the fanin LUT’s logic level is one less than the level of the current LUT.
This gives an idea of how many fanins are competing for the fastest input pin. An
example is shown in Fig. 5.2. This LUT has six fanins, two with a logic level of
5, and four with a logic level of 6. The fastest pin is A, and the slowest is F. It is
assumed that the fanins with level 6 will be assigned the faster input pins, leaving
the slowest pins to the level 5 fanins. It is also assumed that the level 5 fanins are
fast enough that they will not become the critical inputs. Therefore, the critical
input will be the slowest pin coming from a fanin of level 6. In this case, it is pin
D. Since it is the fourth fastest input pin out of six, the LUT is assigned a delay of
4/6.
• Pin order model: This is a more refined version of the pin utilization model.
Figure 5.2: Example for the pin utilization timing model.
Figure 5.3: Example for the pin order timing model.
It attempts to predict the order that fanins will be assigned to pins. It sorts the
fanins by their maximum arrival time, then assigns the slowest ones to the fastest
pins, and vice versa. An example is shown in Fig. 5.3. The maximum delay in this
case is coming from input pin A, meaning that the maximum delay to the LUT
output is 4 + 1/6. In other words, the maximum delay to the LUT output is
max delay = max_i (arrival time_i + i / total pins)    (5.2)
where i is the input pin index. The logic delay is the i/total pins portion of the
above equation.
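The pin order model of Eqn. 5.2 can be sketched as follows. This is an illustrative Python implementation under our own naming, assuming a 6-input LUT as in Stratix III; it is not the thesis code itself.

```python
def pin_order_delay(fanin_arrivals, total_pins=6):
    """Sketch of the pin order timing model (Eqn. 5.2).

    Slower fanins are assigned the faster pins, so we sort arrival
    times in descending order; pin i then adds a delay of
    i / total_pins, and the LUT output arrival time is
    max_i(arrival_i + i / total_pins)."""
    ordered = sorted(fanin_arrivals, reverse=True)
    return max(arr + (i + 1) / total_pins for i, arr in enumerate(ordered))
```

For instance, fanins arriving at times 4, 3, and 3 yield an output arrival of 4 + 1/6, since the slowest fanin is placed on the fastest pin, as in the Fig. 5.3 example.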
5.3.2 Logic, routing and constant factors
The overall delay model for a LUT n can be expressed as the following:
Delay(n) = const + logic factor ∗ logic delay + fanout factor ∗ fanout    (5.3)
The parameters are as follows:
• const: This is a constant value for each LUT, which can be set to 0 or 1. If set
to 1, it represents a unit delay for the LUT. This is the standard metric used for
delay at the technology mapping stage.
• logic factor: This is a scaling factor for the LUT delay (logic delay, calculated
using one of the pin-based timing models above). We examined factors ranging
from 0 to 5, in order to see the effect of the scaling factor on the overall model.
• fanout factor: This is a scaling factor for the fanout of the node, which represents
the routing delay. It ranges from 0 to 5. The fanout was capped to 10 to prevent
very high-fanout nodes from dominating the delay.
By evaluating delay models using all combinations of these parameters, we are able to
test models with high interconnect delay and low logic delay, and vice versa. The ranges
chosen allow us to explore 2 ∗ 6 ∗ 6 = 72 models using these factors.
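The per-LUT delay model of Eqn. 5.3 can be sketched as one function over the swept parameters. The defaults below are one arbitrary point in the swept space (const in {0, 1}, factors in 0..5), and the fanout cap of 10 follows the text; the function name is our own.

```python
def node_delay(logic_delay, fanout, const=1, logic_factor=1,
               fanout_factor=1, fanout_cap=10):
    """Sketch of the per-LUT delay model of Eqn. 5.3.

    Fanout stands in for routing delay and is capped to keep very
    high-fanout nodes from dominating the estimate."""
    return (const + logic_factor * logic_delay
            + fanout_factor * min(fanout, fanout_cap))
```

For example, a LUT with logic delay 0.5 and fanout 20 under const = 1, logic factor = 2, fanout factor = 1 scores 1 + 2 ∗ 0.5 + 1 ∗ 10 = 12.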
5.3.3 Maximum/scaled metrics
The critical path in a circuit is the maximum delay path, which suggests that our model
should simply take the maximum delay in the circuit. However, errors in the model could
mean that the critical path could differ from the one predicted. Therefore, we propose a
scaled model in which we take the sum of the top N max arrival times, each scaled by
an exponentially decaying factor. The scaled delay model for a circuit can be expressed
as
scaled delay model = sum_{i=1..N} max arrival time(node_i) ∗ factor^i    (5.4)
where max arrival time(node_i) is the maximum total delay to reach the node with
the i-th longest arrival time, and 0 < factor < 1. For our experiments, a factor of 0.95
was used, since it decays quickly enough to ignore path endpoints that are highly unlikely
to be critical, yet not so quickly that it considers only a few. N is capped at 100, as the
factor is very small by that point (0.95^100 ≈ 0.006).
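The scaled metric of Eqn. 5.4 can be sketched as follows; the function name is our own, and the defaults mirror the factor of 0.95 and cap of N = 100 stated above.

```python
def scaled_delay(arrival_times, factor=0.95, n_cap=100):
    """Sketch of the scaled delay metric (Eqn. 5.4).

    Sums the top-N maximum arrival times, each weighted by an
    exponentially decaying factor^i, so near-critical endpoints
    still influence the estimate."""
    top = sorted(arrival_times, reverse=True)[:n_cap]
    return sum(t * factor ** (i + 1) for i, t in enumerate(top))
```

For instance, with factor = 0.5, endpoints arriving at 10 and 8 score 10 ∗ 0.5 + 8 ∗ 0.25 = 7.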
5.4 Power Prediction
The dynamic power consumption of a circuit can be calculated by the following equation:
P_dyn = (1/2) ∗ sum_{i=1..n} S_i ∗ C_i ∗ f ∗ V_dd^2    (5.5)
where n is the number of nets in the circuit, S_i is the switching activity of net i,
C_i is the capacitance of net i, f is the frequency of the circuit, and V_dd is the supply
voltage. The switching activity of a net is estimated using a fast gate-level simulation
implemented in ABC, run over 1000 cycles. The simulator can be run in two modes:
• Zero delay, in which the LUTs/wires are assumed to have zero delay (i.e. on each
clock cycle, all signals immediately settle into their final state).
• Unit delay, in which each LUT is assumed to have a delay of one. This allows for
some modeling of glitches.
The overall power model is a product of switching activity and fanout over all nets:
P_est = sum_{i ∈ nets} S_i ∗ FO_i    (5.6)
Fanout is used as a substitute for capacitance (capped as in Section 5.3.2). V_dd and
f from Eqn. 5.5 are ignored since they are constant for each circuit.
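The relative power metric of Eqn. 5.6 reduces to a short sketch; the function name and the (activity, fanout) pair representation are our own, and the fanout cap follows Section 5.3.2.

```python
def estimated_power(nets, fanout_cap=10):
    """Sketch of the relative power metric (Eqn. 5.6).

    Sums switching activity times fanout over all nets; fanout
    substitutes for capacitance and is capped as for delay.
    Vdd and f are dropped since they are constant per circuit."""
    return sum(activity * min(fanout, fanout_cap)
               for activity, fanout in nets)
```

For example, nets with (activity, fanout) of (0.5, 4) and (0.1, 20) score 0.5 ∗ 4 + 0.1 ∗ 10 = 3.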
5.5 Packing
Packing is a stage in the CAD flow in which logic blocks are grouped together. These
groups can then be placed into a larger logic unit (in the Stratix III architecture, these
units are known as Logic Array Blocks or LABs). Grouping related logic blocks together
allows them to be connected by shorter routes (intra-LAB) rather than longer, slower
routes (inter-LAB).
We considered performing prediction at the post-packing stage. This would allow
us to use different timing/power models for intra-LAB and inter-LAB connections. By
doing so, we could potentially form more accurate estimates of delays and capacitances.
However, there is no easy way to enforce a particular packing using the Altera CAD
flow. We can extract packing information after the entire placement and routing stages
have completed, but we cannot guarantee that this packing information would have been
available prior to place and route. This is due to optimizations occurring after technology
mapping, such as physical synthesis [Sing 05]. Therefore, we cannot be certain about how
the circuit has changed after it has left the technology mapping stage.
Unless otherwise stated, all prediction methods used here do not use any packing
information. If packing information is used, it should be interpreted with the caveat that
the packing data was extracted post-routing. This means that any benefits demonstrated
by using packing results should be considered as an upper bound (i.e. optimistic).
5.6 Methodology
We ran our prediction methods on the noise-injected circuits from Section 4.3. Each of
the 27 benchmark designs was synthesized and mapped with different seeds to create 25
mapped candidates. Each candidate was placed and routed with Quartus using 5 different
seeds. The timing and power results were obtained in the same way as in Section 4.3.
The results were averaged across the 5 placement seeds to obtain a representative result
for each mapped candidate circuit.
The prediction algorithms were all implemented in ABC [Berk 06]. For each mapped
candidate circuit, all prediction algorithms were run sweeping all parameter combina-
tions. These include (for delay):
• 2 LUT delay models (pin utilization model, pin order model)
• 2 constant factors (Section 5.3.2)
• 6 logic factors (Section 5.3.2)
• 6 fanout factors (Section 5.3.2)
• 2 metrics (Section 5.3.3: maximum/scaled)
Therefore, a total of 2 × 2 × 6 × 6 × 2 = 288 models can be used for delay. However,
due to redundancies or invalid models (e.g. all factors set to 0), 241 models were tested.
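The sweep can be sketched with a Cartesian product. The concrete factor values below are placeholders for the six logic and fanout factors of Section 5.3.2, and only the clearly invalid all-zero case is filtered; the further redundancy checks that reduce the tested set to 241 are not reproduced here:

```python
import itertools

# Placeholder parameter values standing in for those of Section 5.3.2.
lut_models     = ["pin utilization", "pin order"]
const_factors  = [0, 1]
logic_factors  = [0, 1, 2, 3, 4, 5]
fanout_factors = [0, 1, 2, 3, 4, 5]
metrics        = ["maximum", "scaled"]

combos = list(itertools.product(lut_models, const_factors,
                                logic_factors, fanout_factors, metrics))
# 2 * 2 * 6 * 6 * 2 = 288 candidate models.

# Drop models with all factors set to zero (every path would have zero
# delay); 284 combinations remain before the redundancy checks.
valid = [c for c in combos
         if not (c[1] == 0 and c[2] == 0 and c[3] == 0)]
```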
For power, six models were used, three each for the zero and unit delay simulations:
• No packing data used. All nets are considered equally.
• Only inter-LAB routes considered (using packing data). This model only uses the
inter-LAB nets for the summation in Eqn. 5.6. The intra-LAB power is considered
to be zero.
• Only intra-LAB routes considered (using packing data). This model only uses the
intra-LAB nets for the summation in Eqn. 5.6. The inter-LAB power is considered
to be zero.
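The three variants amount to evaluating Eqn. 5.6 over different subsets of nets. A sketch, assuming each net is tagged with an inter-LAB flag derived from the (post-routing) packing data:

```python
def power_models(nets):
    """nets: list of (switching_activity, fanout, is_inter_lab)
    tuples, the inter-LAB flag coming from post-routing packing data.
    Returns the three Eqn. 5.6 variants: all nets, inter-LAB only
    (intra-LAB power taken as zero), and intra-LAB only."""
    def pest(selected):
        return sum(s * fo for s, fo, _ in selected)
    return (pest(nets),
            pest([n for n in nets if n[2]]),
            pest([n for n in nets if not n[2]]))
```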
5.7 Results
Fig. 5.4 shows the probability of finding the best circuit implementation in terms of
delay (or one within a 0.1% margin of it). The top three models (based on the ranking
of the top predicted circuit for each design) are shown, along with two simple models:
“Routing only” (fanout factor = 1 in Eqn. 5.3, all others 0, scaled) and “Logic depth
only” (const = 1, all others 0, scaled). The top models were chosen by taking the best
predicted circuit for each design and summing their actual ranking (best=1, worst=25).
The models with the lowest sums were chosen as the best ones. Designs with low swing
(no circuit with more than 1.5% deviation from the average) were ignored. This was done
because there is little benefit to be obtained from predicting those designs anyway. The
problem of distinguishing between circuits with such small differences presents a large
challenge for little gain. 20 of 27 designs were considered for delay, and 23 of 27 were
considered for power. For details, the interested reader may consult Appendix A. For
the delay portion of this study, the designs were split into two halves: one as a training
Figure 5.4: Probability of finding the top circuit vs. percentage of top modeled circuits considered (delay).
set to find the best timing models, and the other as a test set to evaluate them. This
was not done for power due to the small number of models considered. The x-axis shows
the percentage of the top circuits (by model score) considered, while the y-axis shows
the probability of finding the actual top circuit within this group. The legend shows the
timing model used.
For example, the model “Pin order, const=0, logic=1, routing=3, scaled” refers to
the pin order model (Section 5.3.1) with (const = 0, logic factor = 1, fanout factor =
3) as the factor settings (Section 5.3.2), and the exponentially decaying scaling factor
(Section 5.3.3). Using this model, if we take the top 10% of circuits (according to the
model) we have approximately a 40% chance of selecting the best one.
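The probability plotted in Fig. 5.4 can be sketched as follows, assuming per-design lists of model scores and post-routing delays (lower is better for both) and the 0.1% margin noted above:

```python
def capture_probability(designs, top_frac=0.10):
    """designs: one (model_scores, actual_delays) pair per design,
    lower is better in both lists.  Returns the fraction of designs
    whose actual best circuit (within a 0.1% margin) appears among the
    top `top_frac` of circuits as ranked by the model."""
    hits = 0
    for predicted, actual in designs:
        k = max(1, round(top_frac * len(predicted)))
        top = sorted(range(len(predicted)), key=lambda i: predicted[i])[:k]
        best = min(actual)
        hits += any(actual[i] <= best * 1.001 for i in top)
    return hits / len(designs)
```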
The simplest models (Routing only, Logic depth only) both used the “scaled” metric
(Section 5.3.3). The results show that these models were not as good as the others. Using
the top 10% of predicted circuits, we have only a 10-20% probability of selecting the best
Figure 5.5: Percentile of predicted top circuits (delay).
ones. However, the “Routing only, scaled” model improves to 90% probability when the
top 40% of predicted circuits are used. This shows the importance of considering routing
delay.
Another viewpoint of the results can be taken by considering a scenario where a
hardware engineer performs synthesis and mapping using several seeds, and selects the
best 10% of circuits based on the prediction metric (in our case, we select two candidate
circuits per design). The intent is to allow the hardware engineer to pass only those two
circuits through placement and routing, then pick the better one. The question we would
like to answer is: if we pass all the circuits through placement and routing and rank them
according to delay, into what percentile would the predicted best circuit fall (best=100,
worst=0)? Fig. 5.5 shows the distribution of the actual percentile of the top circuits
predicted by the model. Each bin extends to the right (e.g. the 90 bin covers 90-99).
The 100 bin covers 100 only. For example, using the same model as above (Pin order,
const = 0, logic = 1, routing = 3, scaled), if we pick the top circuit (according to the
Table 5.1: Average percentile of top circuits with isolated model parameters.

Model                         Average percentile of top predicted circuit
Baseline (max logic depth)    50
+ scaled                      52
+ scaled, pin util.           52
+ scaled, pin order           67
+ scaled, routing             54
model), it is in the highest percentile bin (100) of actual circuits for 4 out of 10 designs.
If the prediction was completely random, we would expect a flatter curve. However, since
the results tend towards the higher end of the scale (to the right), it appears that these
prediction metrics have value.
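The percentile statistic of Fig. 5.5 (best = 100, worst = 0) can be sketched as:

```python
def percentile_of_predicted_best(predicted, actual):
    """Percentile of the model's top pick among all post-routing
    results (best = 100, worst = 0); lower values are better in both
    input lists."""
    pick = min(range(len(predicted)), key=lambda i: predicted[i])
    worse = sum(1 for d in actual if d > actual[pick])
    return 100.0 * worse / (len(actual) - 1)
```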
The variety of parameters among the top three models suggests that there is no single
best setting for the pin models, scaling factors, and max/scaled metrics. However,
these models cannot be expected to be accurate at a very fine-grained level of detail,
distinguishing between circuits that differ by only a few percent.
Furthermore, different circuits and FPGA architectures may have different preferred
models. Despite this, it seems that a crude early estimation technique is still enough to
extract some information. For all models, if we take the best predicted circuit for each
design, the average normalized critical path delay (after place-and-route) is less than 1.
This means that the models have some ability to predict the better circuits in a group.
To examine the effect of each model parameter, we attempted to evaluate them in
isolation. Table 5.1 shows the average percentile of the single top circuit as predicted
by the baseline model (i.e. a max logic depth model – since the circuits for one design
all have the same logic depth, this model is essentially random). A value of 100 would
correspond to the best circuit. As expected, this model’s average is the 50th percentile
– no better than a random prediction. Adding the exponentially decaying scaling factor
improves this to 52, while adding the pin utilization, pin order, and routing models
increases the average to 52, 67, and 54, respectively. Although these numbers are not overwhelmingly positive, the
improvement is still encouraging as they tend towards 100 (better circuits) rather than
Figure 5.6: Probability of finding the top circuit vs. percentage of top modeled circuits considered (power).
0 (worse circuits).
Figs. 5.6 and 5.7 show analogous results for power. “Zero delay” and “Unit delay”
represent the zero delay and unit delay simulation models. Results are also shown con-
sidering the use of intra-LAB and inter-LAB connections (using packing information).
In some cases, the use of inter-LAB connection information is valuable (as these connec-
tions have higher capacitance/power), as is the use of unit delay instead of zero delay. In
particular, using the top 10% of predicted circuits using the unit delay, inter-LAB model
(considering only inter-LAB connections), the probability of capturing the top circuit is
over 40%. In contrast, the zero delay model using only intra-LAB connections is the
weakest, as it has no consideration of glitches or the inter-LAB connections which have
higher capacitance. However, the results are still satisfactory without packing informa-
tion. Aside from one circuit, the models always predicted above the 40th percentile in
Fig. 5.7. In general, power prediction was more successful than delay prediction since
the critical delay is based on a single path, while power is based on the entire circuit,
thus reducing the impact of any errors.
Figure 5.7: Percentile of predicted top circuits (power).
Table 5.2: Average benefit of prediction models

Delay
Model              % improv.   % improv. (full)   % des. improv.
Best prediction    1.3         1.8                75
Logic depth only   0.6         0.3                50
Fanout only        0.9         0.9                60

Power
Model              % improv.   % des. improv.
Best prediction    1.8         83
Unit delay         1.4         91
Zero delay         1.1         87
Table 5.2 shows the average delay/power savings from our prediction models if the
top 10% predicted best circuits are carried forward to placement and routing. Looking at
the “% improv.” column, we see that the best prediction models (“Pin order, const=0,
logic=1, routing=3, scaled” for delay, “Unit delay, inter-LAB” for power) give an average
benefit of 1.3% in delay and 1.8% in power, more than the simpler models of logic depth
and fanout. These numbers were computed separately – the best circuit for delay is
not necessarily the best circuit for power. It should be noted that these results were
generated using relatively small training and test sets (for delay). If we instead use
the entire benchmark set for both training and test, we can obtain improvements of up
to 1.8% in delay (column “% improv. (full)”). The last column shows the percentage
of designs that showed any amount of improvement. The delay predictions achieve an
improvement on up to 75% of designs, while the power predictions improve up to 91% of
designs. In general, power prediction is more accurate, as we saw that power noise has
greater swing and so has more obvious differences between designs. On the other hand,
delay has less swing and depends on minute differences on a single critical path, making
prediction more difficult.
5.8 Conclusion
In this chapter, we presented the problem of timing and power prediction at the post-
mapping stage in the presence of noise. For delay prediction, a model was proposed
using varying LUT input pin delays, logic and routing delay scaling factors, and a scaled
delay metric. For power prediction, a model was proposed using a fast zero delay or unit
delay simulation accounting for glitches, as well as considering LUT packing information.
The results showed that the prediction models were successful in distinguishing between
different noise-injected circuits. When applied to designs with over 1.5% variation in
delay and power, the best prediction models averaged 1.3-1.8% lower critical path delay
and 1.8% lower dynamic power, as computed post-routing.
Chapter 6
Conclusions
6.1 Summary
This thesis has dealt with two topics in the area of FPGA CAD. The contributions to
these topics are as follows:
6.1.1 Glitch Reduction
1. We investigated glitch power in FPGAs, power which is consumed by unnecessary
signal transitions. An analysis of glitches in commercial FPGAs was presented,
showing that an average of 26.0% of dynamic power is due to glitches. We also
showed that don’t-cares make up 15.1% of LUT input states, motivating a don’t-
care based approach to glitch reduction.
2. We demonstrated a new glitch reduction algorithm making use of don’t-cares in
logic functions. It operates at the post-routing stage and has no area or performance
overhead. The algorithm sets the configuration bits of LUTs which correspond to
don’t-cares, in a manner that reduces unnecessary transitions on the LUT’s output.
The algorithm reduced glitch power by an average of 13.7%, and dynamic power
by 4.0%.
6.1.2 CAD Algorithm Noise
1. We investigated the concept of random noise in CAD algorithms: the variation in
circuit quality caused by arbitrary selections among two or more options with the
same cost. Random noise was injected into the logic synthesis and technology
mapping stages of ABC, exposing the arbitrary choices made by the algorithms.
We presented data on the effects of this noise on overall circuit performance and
power. Under the influence of synthesis noise, standard deviations of critical path
delay and dynamic power were 1.8% and 2.7%, respectively, while the results for
technology mapping were 0.9% and 1.4%, respectively. Under the influence of noise
in all CAD stages, the standard deviations were 3.3% in delay and 3.7% in power.
2. Early timing and power analysis techniques were presented to find the best cir-
cuits from a pool of mapped candidate circuits generated using different random
seeds. For delay prediction, a model was proposed using varying LUT input pin
delays, logic and routing delay scaling factors, and a scaled delay metric. For power
prediction, a model was proposed using a fast zero delay or unit delay simulation
accounting for glitches, as well as considering LUT packing information. When
applied to circuits with over 1.5% swing in quality, the early estimation techniques
found circuits that were 1.3-1.8% better in speed and 1.8% better in power, on
average.
6.2 Future Work
6.2.1 Glitch Reduction
1. Although the glitch reduction algorithm runs fairly quickly for the MCNC circuits
tested (a few seconds), runtime is a potential concern for larger circuits. This is
due to two main factors: the runtime of the SAT instances to find don’t-cares, and
the iterative flow involving timing simulation. A potential solution is to integrate
the simulator and the glitch reduction algorithm more tightly, such that changes to
LUTs can be incorporated into the simulation on the fly.
2. Another similar approach to glitch reduction could be taken before the placement
and routing stages, using a simpler simulation model (using unit delays per LUT,
for example). We actually tested this approach and found that it worked poorly
due to the lack of accurate timing information. However, such an approach would
have the benefit of having more degrees of freedom from being at an earlier stage of
the flow – instead of only being able to change LUT configuration bits, the mapping
of the circuit could be changed (for example). Further work would be necessary to
allow the algorithm to work well despite the lack of accurate timing data.
3. Don’t-care-based optimizations could also be applied to static power. Previous
work has demonstrated that SRAM bit polarities can affect static power dissipa-
tion [Ande 06]. The settings of the SRAM bits to logic-0 or 1 can affect the number
of leakage paths in the circuit. By using the freedom given by don’t-cares, bits may
be flipped in order to eliminate these leakage paths. An additional challenge would
be to integrate this technique with the dynamic power optimization in this work.
6.2.2 CAD Algorithm Noise
1. The work presented has focused primarily on noise sources in synthesis and map-
ping. However, there are many more CAD algorithms to explore, different imple-
mentations, and more noise sources. Packing, placement, and routing are several
other CAD stages that could be investigated.
2. The early timing and power estimation methods can still be further refined. One
possible approach is to try to take advantage of the fact that we are comparing
the quality of circuits relative to one another, not in absolute terms. This might
be done by locating similarities and differences between the circuits, rather than
computing the quality of each circuit individually.
3. The early prediction models could be applied to the upstream CAD algorithms in
order to improve their quality of results. For example, the pin-based delay models
could be implemented in technology mapping as a refinement of the logic-depth-
based delay model which is typically used.
Appendix A
Circuit Delay/Power Statistics With
Noise
A.1 Individual Design Noise Results
Table A.1 shows the mean and standard deviation of each design’s critical path delay.
The “Min. norm. delay” column indicates the minimum circuit delay, normalized to that
design’s average. All numbers are absolute values taken from the Quartus static timing
analyzer, using the methodology described in Section 4.3. The shaded rows indicate the
designs with minimum values more than 1.5% less than the average (i.e. 0.985 or less).
These designs were used as candidates for early delay estimation.
Table A.2 shows a similar table for dynamic power. The circuits were simulated at
10 MHz to meet timing constraints by a wide margin. A few designs have notably low
minimum normalized values, which can be attributed to their low power consumption (a
small change accounts for a larger percentage).
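The candidate-selection rule can be sketched as follows, assuming a per-design list of post-routing delay or power values across the noise seeds:

```python
def candidate_designs(results, threshold=0.985):
    """results: {design_name: list of post-routing values across seeds}.
    A design is kept only if its minimum value, normalized to the
    design's mean, is at least 1.5% below average (i.e. <= 0.985)."""
    keep = []
    for name, vals in results.items():
        mean = sum(vals) / len(vals)
        if min(vals) / mean <= threshold:
            keep.append(name)
    return keep
```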
Table A.1: Critical path delay statistics by circuit

Circuit                          Mean crit. delay (ns)   Stdev. delay   Min. norm. delay
alu4                             7.767                   0.002          0.990
apex2                            8.705                   0.008          0.971
apex4                            8.348                   0.006          0.977
bigkey                           1.260                   0.003          0.960
cf_cordic_v_18_18_18             2.787                   0.001          0.976
cf_fir_24_16_16                  5.832                   0.004          0.988
clma                             5.963                   0.003          0.984
des                              8.864                   0.013          0.976
des_perf                         1.960                   0.001          0.988
diffeq                           3.407                   0.005          0.953
dsip                             1.364                   0.001          0.979
elliptic                         4.544                   0.014          0.972
ex1010                           9.744                   0.003          0.990
ex5p                             8.261                   0.006          0.980
frisc                            5.463                   0.011          0.963
mac2                             8.541                   0.004          0.987
misex3                           7.373                   0.002          0.982
oc54_cpu                         8.395                   0.024          0.967
paj_raygentop_hierarchy_no_mem   6.565                   0.004          0.982
pdc                              9.495                   0.003          0.989
s298                             4.990                   0.019          0.960
s38417                           3.660                   0.006          0.951
s38584_1                         3.055                   0.003          0.959
seq                              8.236                   0.007          0.979
spla                             9.371                   0.005          0.989
sv_chip1_hierarchy_no_mem        3.962                   0.002          0.981
tseng                            3.038                   0.004          0.966
Table A.2: Dynamic power statistics by circuit

Circuit                          Mean dynamic power (mW)   Stdev. power   Min. norm. power
alu4                             2.912                     0.002          0.978
apex2                            2.746                     0.003          0.973
apex4                            2.012                     0.001          0.970
bigkey                           2.465                     0.000          0.993
cf_cordic_v_18_18_18             11.030                    0.009          0.985
cf_fir_24_16_16                  143.091                   0.671          0.992
clma                             4.391                     0.007          0.952
des                              7.552                     0.026          0.960
des_perf                         15.765                    0.020          0.989
diffeq                           0.066                     0.000          0.936
dsip                             2.672                     0.001          0.979
elliptic                         1.900                     0.002          0.947
ex1010                           4.061                     0.003          0.976
ex5p                             2.531                     0.003          0.964
frisc                            0.897                     0.003          0.831
mac2                             39.726                    0.168          0.990
misex3                           2.490                     0.001          0.971
oc54_cpu                         9.880                     0.018          0.972
paj_raygentop_hierarchy_no_mem   28.555                    0.328          0.973
pdc                              4.524                     0.004          0.973
s298                             1.637                     0.002          0.941
s38417                           2.606                     0.001          0.981
s38584_1                         4.227                     0.001          0.984
seq                              2.753                     0.003          0.961
spla                             4.523                     0.004          0.977
sv_chip1_hierarchy_no_mem        8.708                     0.011          0.982
tseng                            0.043                     0.000          0.930
Bibliography
[Altea] Altera. “AN 584: Timing closure methodology for advanced FPGA designs”. http://www.altera.com/literature/an/an584.pdf.
[Alteb] Altera. “Stratix III device handbook”. http://www.altera.com/literature/lit-stx3.jsp.
[Altec] Altera. “Stratix V device handbook”. http://www.altera.com/literature/lit-stratix-v.jsp.
[Ande 06] J. Anderson and F. Najm. “Active leakage power optimization for FPGAs”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 25, No. 3, pp. 423–437, March 2006.
[Berk 06] Berkeley Logic Synthesis and Verification Group. “ABC: A system for sequential synthesis and verification”. Release 00406. http://www.eecs.berkeley.edu/~alanmi/abc/.
[Betz 97] V. Betz and J. Rose. “VPR: A new packing, placement and routing toolfor FPGA research”. In: Proceedings of the 7th International Workshop onField-Programmable Logic and Applications, pp. 213–222, Springer-Verlag,London, UK, 1997.
[Chen 07a] D. Chen, J. Cong, Y. Fan, and Z. Zhang. “High-level power estimation and low-power design space exploration for FPGAs”. In: Design Automation Conference, 2007. ASP-DAC ’07. Asia and South Pacific, pp. 529–534, Jan. 2007.
[Chen 07b] L. Cheng, D. Chen, and M. Wong. “GlitchMap: An FPGA technology mapper for low power considering glitches”. In: Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, pp. 318–323, 2007.
[Czaj 07] T. S. Czajkowski and S. D. Brown. “Using negative edge triggered FFsto reduce glitching power in FPGA circuits”. In: Proceedings of the 44thannual Design Automation Conference, pp. 324–329, ACM, New York, NY,USA, 2007.
[Dinh 09] Q. Dinh, D. Chen, and M. D. Wong. “A routing approach to reduce glitchesin low power FPGAs”. In: Proceedings of the 2009 international symposiumon Physical design, pp. 99–106, ACM, New York, NY, USA, 2009.
[Fisc 05] R. Fischer, K. Buchenrieder, and U. Nageldinger. “Reducing the power consumption of FPGAs through retiming”. In: Engineering of Computer-Based Systems, 2005. ECBS ’05. 12th IEEE International Conference and Workshops on the, pp. 89–94, Apr. 2005.
[Fran 10] S. Franssila. Introduction to Microfabrication. John Wiley & Sons, 2010.
[Gort 10] M. Gort and J. Anderson. “Deterministic multi-core parallel routing for FPGAs”. In: Field-Programmable Technology (FPT), 2010 International Conference on, pp. 78–86, Dec. 2010.
[Kahn 05] A. Kahng and S. Reda. “Intrinsic shortest path length: a new, accurate a priori wirelength estimator”. In: Computer-Aided Design, 2005. ICCAD-2005. IEEE/ACM International Conference on, pp. 173–180, Nov. 2005.
[Kirk 83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. “Optimization by simulatedannealing”. Science, Vol. 220, No. 4598, pp. 671–680, 1983.
[Kuon 07] I. Kuon and J. Rose. “Measuring the gap between FPGAs and ASICs”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 26, No. 2, pp. 203–215, Feb. 2007.
[Lamo 08] J. Lamoureux, G. Lemieux, and S. Wilton. “GlitchLess: Dynamic power minimization in FPGAs through edge alignment and glitch filtering”. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, Vol. 16, No. 11, pp. 1521–1534, Nov. 2008.
[Lim 05] H. Lim, K. Lee, Y. Cho, and N. Chang. “Flip-flop insertion with shifted-phase clocks for FPGA power reduction”. In: ICCAD ’05: Proceedings ofthe 2005 IEEE/ACM International conference on Computer-aided design,pp. 335–342, IEEE Computer Society, Washington, DC, USA, 2005.
[Lin 95] B. Lin and S. Devadas. “Synthesis of hazard-free multilevel logic under multiple-input changes from binary decision diagrams”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 14, No. 8, pp. 974–985, Aug. 1995.
[Liu 04] Q. Liu and M. Marek-Sadowska. “Pre-layout wire length and congestion estimation”. In: Design Automation Conference, 2004. Proceedings. 41st, pp. 582–587, Jul. 2004.
[Liu 05] Q. Liu and M. Marek-Sadowska. “Pre-layout physical connectivity prediction with application in clustering-based placement”. In: Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on, pp. 31–37, Oct. 2005.
[Loke 10] C. Loken, D. Gruner, L. Groer, R. Peltier, N. Bunn, M. Craig, T. Henriques,J. Dempsey, C.-H. Yu, J. Chen, L. J. Dursi, J. Chong, S. Northrup, J. Pinto,
N. Knecht, and R. V. Zon. “SciNet: Lessons learned from building a power-efficient Top-20 system and data centre”. Journal of Physics: Conference Series, Vol. 256, No. 1, p. 012026, 2010.
[Mano 07] V. Manohararajah, G. Chiu, D. Singh, and S. Brown. “Predicting intercon-nect delay for physical synthesis in a FPGA CAD flow”. Very Large ScaleIntegration (VLSI) Systems, IEEE Transactions on, Vol. 15, No. 8, pp. 895–903, Aug. 2007.
[Mish 05] A. Mishchenko and R. Brayton. “SAT-based complete don’t-care computation for network optimization”. In: ACM/IEEE Design Automation and Test Conference, pp. 412–417, 2005.
[Mish 06a] A. Mishchenko and R. Brayton. “Scalable logic synthesis using a simplecircuit structure”. In: Proc. International Workshop on Logic and Synthesis,pp. 15–22, 2006.
[Mish 06b] A. Mishchenko, S. Chatterjee, and R. Brayton. “DAG-aware AIG rewriting: a fresh look at combinational logic synthesis”. In: Design Automation Conference, 2006 43rd ACM/IEEE, pp. 532–535, 2006.
[Mish 06c] A. Mishchenko, S. Chatterjee, R. Brayton, and N. Een. “Improvementsto combinational equivalence checking”. In: Proceedings of the 2006IEEE/ACM international conference on Computer-aided design, pp. 836–843, ACM, New York, NY, USA, 2006.
[Mish 07] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton. “Combinational and sequential mapping with priority cuts”. In: Computer-Aided Design, 2007. ICCAD 2007. IEEE/ACM International Conference on, pp. 354–361, Nov. 2007.
[Mish 09] A. Mishchenko, R. Brayton, J.-H. R. Jiang, and S. Jang. “Scalable don’t-care-based logic optimization and resynthesis”. In: Proceedings of theACM/SIGDA international symposium on Field programmable gate arrays,pp. 151–160, ACM, New York, NY, USA, 2009.
[Mish 11] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets. “Delay optimizationusing SOP balancing”. In: Proc. International Workshop on Logic and Syn-thesis, pp. 75–82, 2011.
[Pand 07] A. Pandit and A. Akoglu. “Wirelength prediction for FPGAs”. In: Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pp. 749–752, Aug. 2007.
[Rubi 11] R. Y. Rubin and A. M. DeHon. “Timing-driven pathfinder pathology andremediation: quantifying and reducing delay noise in VPR-pathfinder”. In:Proceedings of the 19th ACM/SIGDA international symposium on Field pro-grammable gate arrays, pp. 173–176, ACM, New York, NY, USA, 2011.
[Shan 02] L. Shang, A. S. Kaviani, and K. Bathala. “Dynamic power consumption inVirtex-II FPGA family”. In: Proceedings of the 2002 ACM/SIGDA tenthinternational symposium on Field-programmable gate arrays, pp. 157–164,ACM, New York, NY, USA, 2002.
[Shum 11] W. Shum and J. H. Anderson. “FPGA glitch power analysis and reduction”.In: Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design, pp. 27–32, IEEE Press, Piscataway, NJ, USA,2011.
[Sing 05] D. Singh, V. Manohararajah, and S. Brown. “Two-stage physical synthesis for FPGAs”. In: Custom Integrated Circuits Conference, 2005. Proceedings of the IEEE 2005, pp. 171–178, Sept. 2005.
[Wilt 04] S. J. Wilton, S.-S. Ang, and W. Luk. “The impact of pipelining on energyper operation in field-programmable gate arrays”. In: Proc. Intl. Conf. onField-Programmable Logic and its Applications, pp. 719–728, 2004.
[Xili] Xilinx. “7 Series FPGAs overview”. http://www.xilinx.com/support/documentation/7_series.htm.