Glitch Reduction and CAD Algorithm Noise in FPGAs
by
Warren Shum
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright © 2011 by Warren Shum
Abstract
Glitch Reduction and CAD Algorithm Noise in FPGAs
Warren Shum
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2011
This thesis presents two contributions to the FPGA CAD domain. First, a study of
glitch power in a commercial FPGA is presented, showing that glitch power in FPGAs is
significant. A CAD algorithm is presented that reduces glitch power at the post-routing
stage by taking advantage of don’t-cares in the logic functions of the circuit. This method
comes at no cost to area or performance.
The second contribution of this thesis is a study of FPGA CAD algorithm noise –
random choices which can have an unpredictable effect on the circuit as a whole. An
analysis of noise in the logic synthesis, technology mapping, and placement stages is
presented. A series of early performance and power metrics is proposed, in an effort to
find the best circuit implementation in the noise space.
Acknowledgements
First and foremost, I would like to thank Professor Jason Anderson for supervising my
thesis research, and for guiding me along with good ideas and encouragement. I would
also like to thank Professors Jonathan Rose, Vaughn Betz, and Olivier Trescases, for
reviewing this work and serving on my defence committee.
I am also grateful to my parents, for supporting me in all my academic endeavors.
Thanks to my fellow research group members: Marcel, Bill, Jason L., James, Andrew,
Mark, Steven, Ahmed, Victor, Stefan, Alex, Kevin, and my office mates in PT477. I
appreciate the feedback on my work, the sporting activities, as well as just sharing
conversation.
Thanks to the staff at SciNet for their technical support.
I thank NSERC and OGS for financial support throughout my degree.
Contents
1 Introduction
1.1 Field-Programmable Gate Arrays
1.2 Glitch Power
1.3 CAD Algorithm Noise
2 Glitch Power and Don’t-Cares in FPGAs
2.1 Introduction
2.2 FPGA Architecture
2.3 Glitch Power in FPGAs
2.4 Previous Work on Glitch Reduction in FPGAs
2.5 Don’t-Cares in Logic Circuits
2.6 Glitch Power Analysis
2.7 Analysis of Don’t-Cares
2.8 Conclusion
3 Glitch Reduction Using Don’t-Cares
3.1 Introduction
3.2 Glitch Reduction Algorithm
3.2.1 Computing the Don’t-Cares for a LUT
3.2.2 Scanning the Input Vectors
3.2.3 Setting the Don’t-Cares
3.2.4 Iterative Flow
3.3 Methodology
3.4 Results
3.4.1 Fanout Splitting
3.5 Conclusion
4 FPGA CAD Algorithm Noise
4.1 Introduction
4.2 CAD Flow Stages
4.2.1 Logic Synthesis
4.2.2 Technology Mapping
4.2.3 Placement and Routing
4.3 Methodology
4.4 Noise Measurement: Before Place and Route
4.5 Noise Measurement: After Place and Route
4.6 Conclusion
5 Early Timing and Power Prediction With Noise
5.1 Introduction
5.2 Previous Work on Early Delay/Power Prediction
5.3 Delay Prediction
5.3.1 Varying pin delays
5.3.2 Logic, routing and constant factors
5.3.3 Maximum/scaled metrics
5.4 Power Prediction
5.5 Packing
5.6 Methodology
5.7 Results
5.8 Conclusion
6 Conclusions
6.1 Summary
6.1.1 Glitch Reduction
6.1.2 CAD Algorithm Noise
6.2 Future Work
6.2.1 Glitch Reduction
6.2.2 CAD Algorithm Noise
A Circuit Delay/Power Statistics With Noise
A.1 Individual Design Noise Results
Bibliography
List of Tables
2.1 Glitch example truth table for a logic function with inputs abc and output f. A possible example of cares is given (care = Y, don’t-care = N)
2.2 Percentage of dynamic power from glitches
2.3 Percentage of simulated local LUT input states corresponding to don’t-cares
4.1 Standard deviation of noise (before place-and-route)
4.2 Standard deviation of noise (after place-and-route)
5.1 Average percentile of top circuits with isolated model parameters
5.2 Average benefit of prediction models
A.1 Critical path delay statistics by circuit
A.2 Dynamic power statistics by circuit
List of Figures
2.1 (a) Logic blocks and routing in an island-style FPGA architecture. (b) Example of a 3-input LUT (look-up-table) with truth table in Table 2.1.
2.2 Example waveform showing a glitch on the output of a LUT f with truth table given in Table 2.1.
2.3 (a) Example of SDCs (left) and ODCs (right). (b) Miter circuit used in don’t-care analysis [Mish 05].
3.1 Example: before glitch reduction. (a) LUT with don’t-care SRAM bit shaded. (b) Simulation waveform.
3.2 Example: after glitch reduction. (a) LUT with altered don’t-care SRAM bit shaded. (b) Simulation waveform with glitches removed.
3.3 A cluster of don’t-cares.
3.4 Experimental flow.
3.5 (a) Dynamic power reduction vs. baseline (default) don’t-care settings and worst-case settings. (b) Glitch power reduction vs. baseline (default) don’t-care settings and worst-case settings.
3.6 Average vote bias.
3.7 (a) Power per signal vs. fanout. (b) Normalized don’t-cares per node vs. fanout.
3.8 Fanout splitting.
3.9 Stratix III adaptive logic module (ALM) [Alteb].
3.10 Dynamic power reduction from fanout splitting.
4.1 FPGA CAD flow.
4.2 Example of an And-Inverter Graph (AIG).
4.3 Example of an AIG before balancing (logic levels shown in parentheses).
4.4 Examples of balanced AIGs.
4.5 Example of AIG rewriting.
4.6 Number of circuits vs. normalized nodes/level (balancing noise).
4.7 Number of circuits vs. normalized nodes/level (rewriting noise).
4.8 Number of circuits vs. normalized nodes/level (refactoring noise).
4.9 Number of circuits vs. normalized nodes/level (depth-oriented mapping noise).
4.10 Number of circuits vs. normalized nodes/level (area-oriented mapping noise).
4.11 Number of circuits vs. normalized nodes/level (all noise).
4.12 Number of circuits vs. normalized delay.
4.13 Number of circuits vs. normalized dynamic power.
4.14 Synthesis noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
4.15 Delay rank of circuits under synthesis noise averaged across 4 and 5 placement seeds.
4.16 Technology mapping noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
5.1 Slow and fast inputs on lookup tables.
5.2 Example for the pin utilization timing model.
5.3 Example for the pin order timing model.
5.4 Probability of finding the top circuit vs. percentage of top modeled circuits considered (delay).
5.5 Percentile of predicted top circuits (delay).
5.6 Probability of finding the top circuit vs. percentage of top modeled circuits considered (power).
5.7 Percentile of predicted top circuits (power).
Chapter 1
Introduction
1.1 Field-Programmable Gate Arrays
Field-programmable gate arrays (FPGAs) are user-configurable logic devices capable of
implementing digital circuits. These devices are used in a wide variety of areas including
communications, automotive, industrial and consumer markets. The appeal of FPGAs
versus application-specific integrated circuits (ASICs) is that they allow the user to avoid
the high cost of chip fabrication and to reduce time-to-market. FPGAs allow a hardware
designer to prototype a design quickly, whereas an ASIC design would take more time
and money to repair, should an error be found. A mask set at 45nm can cost as much
as $2M [Fran 10], a cost high enough to drive away all but the highest-volume
applications.
To create an FPGA implementation of a design, a hardware engineer will typically use
a hardware description language (HDL), such as Verilog or VHDL. A series of computer-aided
design (CAD) tools transforms the HDL into a digital circuit that can be programmed
onto the FPGA. A typical sequence of steps in the CAD flow is as follows:
• Logic Synthesis: The logic functions needed to implement the circuit are derived
and optimized.
• Technology Mapping: The logic functions are mapped into the logic elements
specific to the target device architecture.
• Packing: The logic elements are grouped into larger units corresponding to the
target device architecture.
• Placement: The mapped logic elements are placed into physical locations on the
target device.
• Routing: The proper connections are made between the logic elements using the
programmable routing network.
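The flow above can be sketched as a simple pipeline of stage functions. This is an illustrative skeleton only: the function names and the dict standing in for a netlist are invented placeholders, not any real tool's interface.

```python
# Illustrative skeleton of the CAD flow stages listed above. The function
# names and the dict standing in for a netlist are invented placeholders,
# not any real tool's interface.

def synthesize(hdl):
    """Logic Synthesis: derive and optimize the logic functions."""
    return {"source": hdl, "stage": "synthesized"}

def tech_map(netlist):
    """Technology Mapping: map logic onto k-input LUTs."""
    netlist["stage"] = "mapped"
    return netlist

def pack(netlist):
    """Packing: group LUTs and flip-flops into logic blocks."""
    netlist["stage"] = "packed"
    return netlist

def place(netlist):
    """Placement: assign logic blocks to physical locations."""
    netlist["stage"] = "placed"
    return netlist

def route(netlist):
    """Routing: connect blocks through the programmable routing network."""
    netlist["stage"] = "routed"
    return netlist

def cad_flow(hdl):
    netlist = synthesize(hdl)
    for stage in (tech_map, pack, place, route):
        netlist = stage(netlist)
    return netlist

print(cad_flow("design.v")["stage"])  # routed
```

Each stage consumes the previous stage's result, which is why a choice made early (e.g. in synthesis) can ripple through every later stage.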
The quality of the resulting circuit depends on the quality of the tools used to generate
it. Quality can be measured in terms of area, performance and power. It is here that
FPGAs fall short of ASICs – the area, performance and dynamic power gaps between
them have been estimated at 40x, 4x and 12x, respectively [Kuon 07]. By studying
existing CAD algorithms and exploring new ones, FPGAs can close the gap with ASICs
and attract a larger portion of the digital logic market.
1.2 Glitch Power
As mentioned previously, one area for improvement in FPGAs is power consumption.
Power can be reduced through efforts at various stages: the architectural level, the
circuit level, or the CAD level (which will be the focus here). In particular, glitch
power (the power dissipated by unnecessary signal transitions) is an attractive target for
reduction since it comprises from 4% to 73% of total dynamic power, with an average of
22.6% [Lamo 08]. We present two contributions in this area, the results of which have
been published [Shum 11]:
1. An analysis of glitch power in commercial FPGAs.
2. A CAD approach for reducing glitch power at no cost to area or performance.
Chapter 2 provides background on FPGA glitch power. It begins with a description
of how glitches occur in FPGAs, and some previous works on how to reduce glitch
power. To motivate our research, we present our own analysis on glitch power
in commercial FPGAs. Our results show an average of 26% of dynamic power
from glitches. This chapter also describes don’t-cares in logic functions, which will
be used in the glitch reduction algorithm. We show that the average occurrence
of don’t-cares under simulation is sufficient to supply ample opportunities for our
algorithm.
Chapter 3 presents an algorithm for glitch power reduction which can be performed
post-routing, incurring zero area and performance cost. The algorithm takes ad-
vantage of don’t-care bits in the truth tables of functions in a circuit, setting them
to values which minimize the amount of glitch power dissipated. The algorithm is
tested with a commercial FPGA CAD tool suite and architecture, and shows an
average glitch power reduction of 13.7%, and an average dynamic power reduction
of 4.0%.
1.3 CAD Algorithm Noise
Given the tremendous challenge of solving modern-day CAD problems, the algorithms
used for these problems generally use heuristics to seek a reasonable solution in an ac-
ceptable amount of time. In the course of exploring the vast solution space of these
problems, there is often a need to choose between two or more alternatives that appear
to have the same quality. Such choices, although seemingly innocuous at the time of
selection, can have ripple effects on future choices, causing the final quality of the circuit
to vary if different choices are made. We label these variations as noise. We present the
following contributions in this area:
1. An analysis of a series of logic synthesis and technology mapping algorithms, ex-
posing potential sources of noise that have not been studied before.
2. Experimental results on the amount of noise present in several CAD algorithms, in
terms of critical path delay and dynamic power. The concept of power noise is also
a new contribution which has not been previously studied.
3. A method for predicting the best circuits in terms of performance and power in the
presence of noise.
Chapter 4 introduces the concept of CAD algorithm noise. We expose hidden sources
of noise in the logic synthesis and technology mapping algorithms of the academic
CAD tool ABC [Berk 06]. We present the results of our noise analysis, showing
the effects of random choices in thousands of circuit compilations. The results of
the noise injection show a standard deviation of as much as 3.3% in critical path
delay, and 3.7% in dynamic power.
Chapter 5 presents a solution to the variance in circuit quality produced by CAD algo-
rithm noise. The idea is to perform several synthesis and mapping runs of a circuit
(using different seeds) and use early timing and power metrics to predict the best
one(s) to advance to the placement and routing stages. This would save the time
that would be spent on a large number of place-and-route runs. In this chapter, a
wide array of early timing prediction models are evaluated, including several ap-
proaches to estimating logic and routing delays. For power prediction, two fast
simulation models are used, as well as information from the packing stage of the
CAD flow. The application of these prediction models in a commercial FPGA leads
to an average benefit of up to 1.8% in delay and 1.8% in power compared to the
average noise-injected circuit.
Chapter 6 concludes the work. We summarize the contributions of the previous chapters
and present possible extensions and related research topics for future work.
Chapter 2
Glitch Power and Don’t-Cares in
FPGAs
2.1 Introduction
Power in FPGAs can be divided into two categories: static power and dynamic power.
Static power is due to current leakage in transistors. Dynamic power is a result of signal
transitions between logic-0 and logic-1. These transitions can be split into two types:
functional transitions and glitches. Functional transitions are those which are necessary
for the correct operation of the circuit. Glitches, on the other hand, are transitions that
arise from unbalanced delays to the inputs of a logic gate, causing the gate’s output to
transition briefly to an intermediate state. Although glitches do not adversely affect the
functionality of a synchronous circuit (as they settle before the next clock edge), they
have a significant effect on power consumption. Using an academic FPGA model, glitch
power has been estimated to comprise from 4% to 73% of total dynamic power, with an
average of 22.6% [Lamo 08]. This is a significant motivator for the reduction of glitch
power.
As a means of reducing glitch power, we seek to take advantage of don’t-cares in
a circuit. Don’t-cares are an important concept in logic synthesis and are frequently
used for the optimization of logic circuits. A don’t-care of a logic function within a
larger circuit is an input state for which the function’s output can be either logic-0 or
logic-1, without affecting the circuit’s correctness. Don’t-cares can come from external
constraints or from within the circuit itself. An external constraint may be specified by
the designer (e.g. asserting that a certain input combination will never be applied). A
logic function within a circuit may also have don’t-cares due to its surrounding logic,
for example, if the logic feeding the function’s fanins can never satisfy a certain input
combination, or if the function’s output does not affect the circuit’s primary outputs
under certain circumstances.
This chapter is organized as follows. Section 2.2 gives a brief overview of basic FPGA
architecture. Section 2.3 describes how glitches occur in FPGAs. Section 2.4 summarizes
some previous works on FPGA glitch reduction. Section 2.5 describes don’t-cares and
how they can be found. Section 2.6 gives our analysis of glitch power, while Section 2.7
gives our analysis of don’t-cares. Section 2.8 summarizes the chapter.
2.2 FPGA Architecture
Before presenting our glitch analysis and glitch reduction method, it is important to recap
some basic FPGA architecture and terminology. Fig. 2.1(a) shows a section of a typical
island-style FPGA architecture. It is composed of logic blocks connected to one another
through a programmable routing network. Programmable routing switches (shown as x’s
in Fig. 2.1(a)) allow pins on logic blocks to be programmably connected to pre-fabricated
metal wire segments, and also allow wire segments to be programmably connected with
one another to form routing paths.
Inside the logic blocks, logic functions are implemented using look-up-tables (LUTs).
An example is shown in Fig. 2.1(b). A k-input LUT can implement any logic function of
up to k variables. In essence, a LUT is a hardware implementation of a truth table, where
Figure 2.1: (a) Logic blocks and routing in an island-style FPGA architecture. (b) Example of a 3-input LUT (look-up-table) with truth table in Table 2.1.
the output value for each minterm is held in an SRAM configuration cell (bit). A k-input
LUT requires 2k configuration bits. For this work, we target an FPGA that contains 6-
input LUTs, which are typical of modern commercial FPGA architectures [Altec, Xili].
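As a sketch of this, a LUT can be modeled as a list of 2^k configuration bits indexed by the input minterm. The function below and its MSB-first indexing are illustrative choices, not a description of vendor hardware.

```python
# Sketch of a LUT as a truth table: a k-input LUT holds 2**k configuration
# bits, and the input values select one of them. The MSB-first indexing and
# the function below are illustrative choices, not vendor hardware.

def lut_eval(config_bits, inputs):
    """config_bits: 2**k output bits; inputs: tuple of k bits (MSB first)."""
    assert len(config_bits) == 2 ** len(inputs)
    index = 0
    for bit in inputs:
        index = (index << 1) | bit   # build the minterm index
    return config_bits[index]

# The 3-input function of Table 2.1: f = 1 only for minterms 100 and 101.
f_bits = [0, 0, 0, 0, 1, 1, 0, 0]
print(lut_eval(f_bits, (1, 0, 1)))   # 1
print(len(f_bits))                   # 2**3 = 8 configuration bits
```

For a 6-input LUT, the table grows to 2^6 = 64 configuration bits per LUT.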
2.3 Glitch Power in FPGAs
The dynamic power consumed by an FPGA can be modeled by the formula
$P_{dyn} = \frac{1}{2} \sum_{i=1}^{n} S_i C_i f V_{dd}^2$    (2.1)
where n is the number of nets in the circuit, Si is the switching activity of net i, Ci is
the capacitance of net i, f is the frequency of the circuit, and Vdd is the supply voltage.
The glitch reduction algorithm presented in this work aims to lower the switching activity
as a means of reducing dynamic power.
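A toy numeric sketch of Equation 2.1 (all values invented) shows why this works: with the capacitances, frequency and supply voltage fixed, lowering switching activity lowers power proportionally.

```python
# Toy numeric sketch of Equation 2.1 with invented values: S_i in transitions
# per clock cycle, C_i in farads, f in Hz, Vdd in volts. Lowering switching
# activity (e.g. by removing glitches) lowers power with everything else fixed.

def dynamic_power(S, C, f, vdd):
    return 0.5 * sum(s * c for s, c in zip(S, C)) * f * vdd ** 2

C = [10e-15, 20e-15]                       # per-net capacitances (invented)
p_glitchy = dynamic_power([0.4, 0.6], C, 200e6, 1.1)
p_clean   = dynamic_power([0.2, 0.5], C, 200e6, 1.1)
print(p_glitchy > p_clean)                 # True
```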
As a result of the differences in delays through the routing network and LUTs them-
selves, signals arriving at LUT inputs may transition at different times, leading to glitches.
Figure 2.2: Example waveform showing a glitch on the output of a LUT f with truth table given in Table 2.1.
abc   f   Care
000   0   Y
001   0   Y
010   0   Y
011   0   Y
100   1   N
101   1   Y
110   0   N
111   0   Y

Table 2.1: Glitch example truth table for a logic function with inputs abc and output f. A possible example of cares is given (care = Y, don’t-care = N).
An example is shown in Fig. 2.2. This LUT implements the 3-input function given in
Table 2.1. Consider the case where the inputs transition from 000 → 111. Ideally, the
output f would remain constant at 0. However, varying arrival times on the inputs may
cause an input transition sequence such as 000 → 100 → 110 → 111, causing f to make
a 0 → 1 → 0 → 0 transition rather than remaining at 0. This leads to extra power
consumed by the LUT and any of its fanouts that propagate the glitch. Furthermore,
the glitch is propagated through the FPGA interconnect which presents a high capacitive
load due to its long metal wire segments and programmable (buffered) routing switches.
Prior work has shown, in fact, that interconnect accounts for 60% of total FPGA dynamic
power [Shan 02].
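The scenario above can be replayed directly against the truth table of Table 2.1; the dictionary below encodes that table, and the staggered input sequence produces the spurious output pulse.

```python
# Replaying the transition scenario from the text on the Table 2.1 function:
# the staggered sequence 000 -> 100 -> 110 -> 111 makes f pulse 0 -> 1 -> 0,
# whereas the ideal simultaneous transition 000 -> 111 would leave f at 0.

F = {"000": 0, "001": 0, "010": 0, "011": 0,   # truth table from Table 2.1
     "100": 1, "101": 1, "110": 0, "111": 0}

sequence = ["000", "100", "110", "111"]
outputs = [F[v] for v in sequence]
print(outputs)   # [0, 1, 0, 0] -- the intermediate 1 is the glitch

# Extra toggles beyond the functional (zero-delay) behavior cost energy.
toggles = sum(a != b for a, b in zip(outputs, outputs[1:]))
print(toggles)   # 2 spurious transitions instead of 0
```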
2.4 Previous Work on Glitch Reduction in FPGAs
Glitch reduction techniques can be applied at various stages in the CAD flow. Since
glitches are caused by unbalanced path delays to LUT inputs, it is natural to design
algorithms that attempt to balance the delays. This can be done at the technology
mapping stage [Chen 07b], in which the mapping is chosen based on glitch-aware switch-
ing activities. Another approach operates at the routing stage [Dinh 09], in which the
faster-arriving inputs to a LUT are delayed by extending their path through the rout-
ing network. Delay balancing can also be done at the architectural level. The work
in [Lamo 08] inserts programmable delay elements to balance the arrival times of signals
at LUT inputs. However, these approaches all incur an area or performance cost.
Some works use flip-flop insertion or pipelining to break up deep combinational logic
paths which are the root of high glitch power. Circuits with higher degrees of pipelining
tend to have lower glitch power because they have fewer logic levels, thus reducing the
opportunity for delay imbalance [Wilt 04]. Flip-flops with shifted-phase clocks can be
inserted to block the propagation of glitches [Lim 05]. Another work in [Czaj 07] uses
negative edge-triggered flip-flops in a similar fashion, but without the extra cost of gen-
erating additional clock signals. It is also possible to apply retiming to the circuit by
moving flip-flops to block glitches [Fisc 05].
Our work draws inspiration from hazard-free logic synthesis techniques for asyn-
chronous circuits, such as [Lin 95]. In asynchronous circuits, glitches (hazards) cannot
be tolerated because they may produce incorrect behavior (consider, for example, the
disastrous effect of a glitch on a handshaking signal). Our work is different in that while
hazards are tolerable from a functionality standpoint, it is beneficial to remove them to
reduce power consumption.
A key feature of the work presented here is that it has no impact on the rest of
the design flow. It is applied after placement and routing, and as a consequence, the
algorithm has no cost in terms of performance or area. Other methods incur additional
area/delay from the inclusion of delay elements, registers and extra routing resources, as
well as disrupting the synthesis and layout of the circuit in an unpredictable way. Our
approach maintains the results of the existing compilation while only making changes to
the don’t-cares within LUT truth table configuration bits. This zero-overhead property is
highly desirable and is not shared by previous glitch reduction approaches.
2.5 Don’t-Cares in Logic Circuits
To prevent glitches, we take advantage of don’t-cares. These are entries in the truth
table where a LUT’s output can be set as either logic-0 or logic-1 without affecting the
correctness of the circuit. Don’t-cares fall into two categories: satisfiability don’t-cares
(SDCs) and observability don’t-cares (ODCs) [Mish 09]. SDCs occur when a particular
input pattern can never occur on the inputs to a LUT. In the example shown in Fig. 2.3(a),
the inputs a = 0, b = 1 will never occur. ODCs occur when the output of a LUT cannot
propagate to the circuit’s primary outputs. In the example, the output of f2 has no
effect when c = 0.
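A minimal sketch of the two kinds of don't-cares follows. The upstream and downstream logic here is invented so as to be consistent with the patterns just described; it is not necessarily the circuit of Fig. 2.3.

```python
# Minimal sketch of the two kinds of don't-cares. The upstream and downstream
# logic here is invented so as to be consistent with the patterns described in
# the text; it is not necessarily the circuit of Fig. 2.3.
from itertools import product

# SDC: suppose upstream logic computes b = a AND g, so b can be 1 only
# when a is 1. Then the local pattern (a, b) = (0, 1) can never occur.
reachable = {(a, a & g) for a in (0, 1) for g in (0, 1)}
sdc = set(product((0, 1), repeat=2)) - reachable
print(sorted(sdc))    # [(0, 1)]

# ODC: suppose f2 feeds an AND gate with c; when c = 0 the gate output is 0
# regardless of f2, so f2's value is unobservable.
def observable(c):
    return c == 1

print(observable(0))  # False -- f2 has no effect when c = 0
```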
In this work, we leverage the don’t-care analysis capabilities of the ABC logic synthesis
system developed at UC Berkeley [Berk 06]. ABC incorporates Boolean satisfiability
(SAT)-based complete don’t-care analysis that can be used to determine the don’t-care
minterms for a given LUT in a technology mapped FPGA circuit [Mish 05]. To find the
don’t-cares for a given LUT, f , ABC uses a miter circuit, as illustrated in Fig. 2.3(b).
As shown, two instances of LUT f and (some of) its surrounding circuitry are created –
the surrounding circuitry is shown as a shaded region in the figure. In one instance, f ’s
output is in true form; in the other instance, f ’s output is inverted. The outputs of the
two instances are exclusive-OR’ed with one another, with the XOR gate outputs being fed
into a wide OR gate. The final OR gate produces an output logic signal C(x) for a given
input vector x.
Figure 2.3: (a) Example of SDCs (left) and ODCs (right). (b) Miter circuit used in don’t-care analysis [Mish 05].
For an input vector x to the miter in Fig. 2.3(b), one can compute a local input
vector y to LUT f . For any such x where C(x) is logic-1, y is a care minterm of LUT f ;
that is, LUT f affects the circuit outputs for input vector x. The basic approach taken
in [Mish 05] is to use a fast vector-based simulation as well as SAT to find all vectors,
x, where C(x) evaluates to logic-1, yielding the complete care set for LUT f . This
provides a general picture of the don’t-care analysis approach and the reader is referred
to [Mish 05] for full details. Don’t-cares have recently been used for area reduction in
FPGA circuits [Mish 09].
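The miter idea can be sketched by brute force on a tiny invented circuit: flip f's output and check whether any primary output changes. Real ABC uses vector simulation plus SAT rather than enumeration, so this is only a conceptual sketch.

```python
# Brute-force sketch of the miter idea on a tiny invented circuit: flip LUT
# f's output and check whether the primary output changes. Input vectors x
# where nothing changes mark f's local vector y as a don't-care. Real ABC
# uses vector simulation plus SAT [Mish 05] rather than enumeration.
from itertools import product

def circuit_out(x, flip_f=False):
    a, c = x
    b = 1 - a              # upstream inverter: b = NOT a
    f = a & b              # the LUT under analysis
    if flip_f:
        f ^= 1             # the complemented copy in the miter
    return f & c           # downstream logic masking f when c = 0

cares = set()
for x in product((0, 1), repeat=2):
    y = (x[0], 1 - x[0])   # local input vector seen by f
    if circuit_out(x) != circuit_out(x, flip_f=True):
        cares.add(y)       # miter output C(x) = 1: y is a care minterm

dont_cares = set(product((0, 1), repeat=2)) - cares
print(sorted(cares))       # [(0, 1), (1, 0)]
print(sorted(dont_cares))  # [(0, 0), (1, 1)] -- SDCs, since b = NOT a
```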
2.6 Glitch Power Analysis
To motivate the need for glitch reduction, we examine the amount of glitch power dissi-
pated by 20 MCNC benchmark designs. These designs were fully compiled using Altera
Quartus 10.1, targeting 65nm Stratix III devices [Alteb]. ModelSim 6.3e was then used
to perform a functional (zero-delay) and timing simulation of each circuit using 5000 ran-
dom input vectors, producing two switching activity (VCD) files. The VCD files contain
a record of every transition of every net in the circuit. The dynamic power was then
computed using Quartus PowerPlay – Altera’s power analysis tool. The glitch filtering
setting was enabled, as it only filters glitches that are too short to occur in an actual
FPGA.

Circuit     % glitch     Circuit     % glitch
alu4        25.7         ex5p        41.6
apex2       29.2         frisc       10.7
apex4       30.3         misex3      25.4
bigkey      29.6         pdc         36.7
clma        24.2         s298        24.2
des         45.4         s38417      26.8
diffeq       5.8         s38584.1    11.4
dsip        29.9         seq         26.2
elliptic    12.2         spla        33.2
ex1010      35.0         tseng       17.5
Average     26.0

Table 2.2: Percentage of dynamic power from glitches.

We only consider the core dynamic power – that is, no static power and no I/O
power. This was done in order to avoid skewing the results with power components un-
related to glitching. The glitch power was computed as the difference in dynamic power
between the functional and timing simulations.
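The measurement itself reduces to a subtraction of the two simulations' dynamic power; the numbers below are invented for illustration only.

```python
# Sketch of the measurement: glitch power is the dynamic power of the timing
# simulation minus that of the functional (zero-delay) simulation. The power
# numbers below are invented for illustration.

def glitch_fraction(p_timing, p_functional):
    """Fraction of timing-simulation dynamic power attributable to glitches."""
    return (p_timing - p_functional) / p_timing

print(round(100 * glitch_fraction(13.5, 10.0), 1))  # 25.9 (% glitch power)
```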
The results are shown in Table 2.2. The percentage of dynamic power due to glitches
ranges from 5.8% to 45.4%, with an average of 26.0%, which is similar to that reported
in the academic FPGA context [Lamo 08]. This makes glitches an attractive target for
power reduction in commercial FPGAs. We do not believe any prior published work has
analyzed glitch power in a commercial FPGA.
2.7 Analysis of Don’t-Cares
In order to evaluate the potential for a don’t-care-based glitch reduction algorithm, we
analyzed every local input vector seen by each LUT in each circuit across its timing
simulation. This was done by taking the simulation output VCD generated by ModelSim
and inputting it to ABC. In ABC, we traverse the simulation vectors for each LUT, and
count the number of local input vectors to that LUT which correspond to its don’t-cares.
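That count can be sketched as follows; the care set and simulated trace are invented stand-ins for what ABC derives from the circuit and the VCD.

```python
# Sketch of the don't-care occupancy count for one LUT: given its care set
# and the local input vectors it saw during timing simulation, report the
# fraction of visited states that are don't-cares. The care set and trace
# below are invented stand-ins for what ABC derives from the VCD.

def dc_visit_percentage(care_set, simulated_vectors):
    dc_hits = sum(1 for v in simulated_vectors if v not in care_set)
    return 100.0 * dc_hits / len(simulated_vectors)

cares = {(0, 0), (0, 1), (1, 1)}                  # (1, 0) is a don't-care
trace = [(0, 0), (1, 0), (1, 1), (1, 0), (0, 1)]  # simulated local states
print(dc_visit_percentage(cares, trace))          # 40.0
```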
The percentage of such LUT input states which were don’t-cares is shown in Table 2.3.
Circuit     % inputs DC     Circuit     % inputs DC
alu4        18.4            ex5p        36.2
apex2        7.4            frisc        5.2
apex4       17.8            misex3      17.0
bigkey       3.7            pdc         37.2
clma        32.4            s298        15.3
des          0.8            s38417      10.3
diffeq       3.9            s38584.1     3.1
dsip         4.6            seq          7.6
elliptic     0.7            spla        33.8
ex1010      34.6            tseng       12.4
Average     15.1

Table 2.3: Percentage of simulated local LUT input states corresponding to don’t-cares.
The percentages vary from 0.8% to 37.2%, with an average of 15.1%. This tells us that
not only do circuits contain an abundance of don’t-cares, but also that, surprisingly, these
don’t-cares are often traversed in circuit operation. In other words, a LUT’s don’t-care
minterms are frequently “visited” under vector stimulus. The visits to such don’t-care
minterms may potentially lead to additional unnecessary toggles on LUT outputs. We
can thus potentially reduce glitches through don’t-care settings, which is the core idea of
our approach (which will be described in the next chapter).
2.8 Conclusion
In this chapter, we introduced basic FPGA architecture and gave an introduction to
power consumption in FPGAs. We summarized some previous works in the area of glitch
reduction. We described how glitches are generated, and presented our own analysis of
glitch power consumption in commercial FPGAs. Glitch power was found to comprise
an average of 26.0% of total dynamic power. We also explained logical don’t-cares and
how they can be found, as well as analyzing how often they occur in circuits. It was
found that an average of 15.1% of visited LUT input states are don’t-cares. Together,
these results indicate that glitch power is a good target for power reduction, and that
don’t-cares are prevalent enough to enable a don’t-care based glitch reduction algorithm.
This algorithm will be presented in the next chapter.
Chapter 3
Glitch Reduction Using Don’t-Cares
3.1 Introduction
In this chapter, we present a glitch reduction optimization algorithm based on don’t-
cares. It sets the output values for the don’t-cares of logic functions in such a way that
reduces the amount of glitching. This process is performed after placement and routing,
using timing simulation data to guide the algorithm. Relative to prior published FPGA
glitch reduction techniques, our approach is entirely new, and leverages the ability to
re-program FPGA logic functions without altering the placement and routing. Since the
placement and routing are maintained, this optimization has zero cost in terms of area
and delay, and can be executed after timing closure is completed.
Section 3.2 describes the new algorithm for glitch reduction. Section 3.3 describes the
methodology for testing the algorithm. Section 3.4 shows the power reduction results,
and Section 3.5 summarizes the chapter.
3.2 Glitch Reduction Algorithm
We begin with an example to illustrate how don’t-cares can be used to prevent glitches.
The general idea is to simulate the circuit, then traverse the simulation vectors for each
Figure 3.1: Example: before glitch reduction. (a) LUT with don’t-care SRAM bit shaded. (b) Simulation waveform.
LUT, focusing on vectors corresponding to don’t-cares. We keep a count of the number of
instances for each don’t-care when we would prefer setting it to logic-0 or logic-1 (based
on the care outputs surrounding it). We will refer to these counts as “votes”. When the
end of the simulation vectors is reached, we set the don’t-cares to the value (logic-0 or
logic-1) corresponding to the more popular vote.
Figs. 3.1(a) and 3.1(b) show an example of a LUT and its simulation waveform. Let
us assume that the truth table row for abc = 100 corresponds to a don’t-care, found using
the method described in Section 2.5. We illustrate the don’t-care by shading its SRAM
configuration bit in Fig. 3.1(a). We also assume that the don’t-care bit is currently set
to logic-1 – an arbitrary choice. We initialize the vote counts to 0 (vote0 = 0, vote1 = 0).
Now, we traverse the waveform of Fig. 3.1(b) from left to right, stopping when we
encounter an input corresponding to a don’t-care (DC). In this case, we encounter the
don’t-care input abc = 100 in the second time step. We then consider the previous LUT
output and the next LUT output. In this case, we see that they are both logic-0. If
we were to change the output for abc = 100 to logic-0 instead of logic-1, we would be
able to prevent two glitch transitions on f . Therefore, we increment the vote counter
for logic-0 (vote0 = 1, vote1 = 0). In the fourth time step, we encounter another don’t-
care flanked by two logic-0 outputs. We increment the vote counter for logic-0 again
(vote0 = 2, vote1 = 0).
At the sixth time step, we see that the neighboring outputs of this don’t-care instance
are logic-0 and logic-1. In this case, there would be one transition on f whether the
don’t-care is set to logic-0 or logic-1. Therefore, no change is made to the vote counts
(vote0 = 2, vote1 = 0). At this point, we have exhausted the simulation waveform. We
set the don’t-care bit to logic-0, since vote0 is greater than vote1. The resulting LUT and
waveform are shown in Figs. 3.2(a) and 3.2(b). We can see that four glitch transitions
have been eliminated on output f .
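The vote-counting in this example can be sketched in a few lines of Python. This is an illustrative reconstruction, not the ABC implementation: the waveform is a time-ordered list of (input vector, output) samples, the don't-care set is given separately, and the outputs recorded at don't-care visits are unused (shown as None).

```python
def count_votes(waveform, dc_set):
    """For each visit to a don't-care minterm, compare the nearest care
    outputs before and after, and cast a vote if both agree."""
    votes0, votes1 = {}, {}
    for i, (vec, _out) in enumerate(waveform):
        if vec not in dc_set:
            continue  # care state: no vote to cast
        # nearest care outputs in the past and the future of this visit
        prev = next((o for v, o in reversed(waveform[:i]) if v not in dc_set), None)
        nxt = next((o for v, o in waveform[i + 1:] if v not in dc_set), None)
        if prev == 0 and nxt == 0:
            votes0[vec] = votes0.get(vec, 0) + 1
        elif prev == 1 and nxt == 1:
            votes1[vec] = votes1.get(vec, 0) + 1
        # disagreeing neighbors: one transition either way, no preference
    return votes0, votes1

# A waveform mirroring the example: don't-care abc=100 is flanked by
# care outputs 0/0 twice, then by 0/1 once.
dc = {(1, 0, 0)}
wf = [((0, 0, 0), 0), ((1, 0, 0), None), ((1, 1, 0), 0),
      ((1, 0, 0), None), ((0, 1, 0), 0), ((1, 0, 0), None), ((1, 1, 1), 1)]
```

Running `count_votes(wf, dc)` reproduces the final tally of the example: two votes for logic-0 and none for logic-1 on minterm 100.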
A more formal expression of the glitch reduction algorithm is shown in Algorithm 1.
Figure 3.2: Example: after glitch reduction. (a) LUT with altered don’t-care SRAM bit shaded. (b) Simulation waveform with glitches removed.
It takes a placed and routed netlist as its input. We represent the netlist as a graph
G(V,E), where V is the set of vertices (LUTs) and E is the set of edges (routing wires).
The algorithm also takes a value change dump (VCD) file containing the results of a
timing simulation of the circuit. The simulation vectors are denoted as S, where the ith
local input vector to LUT n is denoted as Sn[i]. A timing simulation is needed rather than
a functional one because glitches arise from delay mismatches, which will only appear
under timing simulation.
The algorithm iterates through each LUT in the netlist, progressing from shallower
levels to deeper ones. This order is used because glitches prevented on shallower LUTs
will be prevented from propagating to deeper LUTs, thus saving more power. Within
each level, the LUTs are examined in descending order of power consumption. This
prioritizes the LUTs with the greatest potential savings. For each LUT, the following
steps are performed:
1. Compute the don’t-cares of the LUT.
2. Scan the input vectors.
3. Set the values of the don’t-cares.
3.2.1 Computing the Don’t-Cares for a LUT
As described previously in Section 2.5, we use ABC’s SAT-based don’t-care analysis
to compute the input states (minterms) of the particular LUT which are don’t-cares
(Algorithm 1, line 3). DC is the set of don’t-care input states.
3.2.2 Scanning the Input Vectors
The sequence of local input vectors to the LUT (denoted Sn) is extracted from the timing
simulation VCD file. These input vectors are examined in order (line 5). When an input
Algorithm 1 Glitch reduction algorithm.
Input: a netlist G(V, E) with simulation vectors S
Output: a netlist with modified LUT functions
 1: for each LUT n ∈ V in order of priority do
 2:   {1. Compute the don’t-cares of the LUT}
 3:   DC = compute_dont_cares(n)
 4:   {2. Scan the input vectors}
 5:   for i = 0 to size(Sn) do
 6:     if Sn[i] ∈ DC then
 7:       prev ← previous care output
 8:       next ← next care output
 9:       if prev = 0 and next = 0 then
10:         Votes0(Sn[i]) ← Votes0(Sn[i]) + 1
11:       else if prev = 1 and next = 1 then
12:         Votes1(Sn[i]) ← Votes1(Sn[i]) + 1
13:       end if
14:     end if
15:   end for
16:   {3. Set the values of the don’t-cares and update netlist}
17:   for each don’t-care d ∈ DC do
18:     if Votes0(d) > Votes1(d) then
19:       assign 0 as the output of d
20:     else if Votes1(d) > Votes0(d) then
21:       assign 1 as the output of d
22:     end if
23:   end for
24: end for
vector Sn[i] corresponding to a don’t-care is reached (line 6), we look at the closest
states in the past and future that correspond to care input vectors (lines 7-8). We use
this information to decide whether this don’t-care should be set to a logic-0, logic-1, or
whether there is no preference. If the closest past and future cares are identical (both
logic-0 or both logic-1) then the don’t-care should be set to the same value. Otherwise,
there is no preference. For each don’t-care minterm, a count of “votes” is kept, indicating
how many times in the simulation it would be beneficial to set it to a logic-0 or logic-1
(lines 9-12). This process is repeated for each input vector Sn[i] in the full simulation
time (lines 5-15).
Consider again the example shown in Fig. 2.2 and Table 2.1. Suppose that for input
Sn[i] = 100, the LUT output is a don’t-care. This means that even though it is assigned
to logic-1 in the truth table, we can assign it to logic-0 or logic-1 without affecting the
Figure 3.3: A cluster of don’t-cares.
functionality of the circuit. In this case, we see a glitch on f making a 0 → 1 → 0 → 0
transition as the inputs transition 000→ 100→ 110→ 111. Looking at the closest care
states before and after input 100, we see that they both output a logic-0. Therefore, the
algorithm votes for the output of 100 to be logic-0.
It is possible that the simulation data may include a long contiguous cluster of don’t-
cares. In these cases, the more desirable state could be the opposite of the one that
would be chosen by this algorithm. For example, it may be beneficial to set a particular
don’t-care to logic-0 within a cluster of logic-0’s (don’t-cares) in between two logic-1’s
(cares) rather than attempting to set the entire cluster to logic-1. This situation is
illustrated in Figure 3.3. The fourth time step shows a don’t-care surrounded by other
don’t-cares which have high vote0 (i.e. they will be set to 0). Therefore, we can see that
setting this DC to 0 would be preferable. However, the algorithm would set it to 1, as
the nearest cares are both 1. This would cause a glitch. Fortunately, experimental data
shows that such long clusters are uncommon. The average length of don’t-care clusters
in the benchmark set is 3.5. This justifies our use of the closest care input vectors.
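The cluster-length statistic can be computed directly from the trace. A minimal sketch, assuming a hypothetical representation with one boolean per time step marking whether the LUT's local input landed on a don't-care:

```python
def dc_cluster_lengths(is_dc):
    """Lengths of maximal runs of consecutive don't-care time steps."""
    lengths, run = [], 0
    for d in is_dc:
        if d:
            run += 1          # extend the current don't-care cluster
        elif run:
            lengths.append(run)  # cluster ended at a care state
            run = 0
    if run:
        lengths.append(run)   # trailing cluster at end of trace
    return lengths

# Hypothetical trace with clusters of length 3, 1, and 2.
trace = [False, True, True, True, False, True, False, False, True, True]
```

Averaging the returned lengths over all LUTs and circuits yields the cluster-length figure quoted above.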
3.2.3 Setting the Don’t-Cares
When the end of the input vectors is reached, each don’t-care is set to the value with more
votes (unless the votes are tied, in which case nothing is done – the choice is arbitrary).
The loop at lines 17-23 walks through each don’t-care d ∈ DC (the set of don’t-care
minterms) and checks whether logic-0 or 1 has a majority of votes. The netlist is updated
accordingly before proceeding to the next LUT. This is critical because changing the logic
function of one LUT can affect the don’t-cares of other LUTs, due to incompatibility
between don’t-cares [Mish 09]. By ensuring that the don’t-cares are computed using the
most recent information, the circuit is guaranteed to remain functionally-equivalent to
the original.
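The resolution step is a simple majority comparison, with a tie leaving the configuration bit at its current (arbitrary) value. A sketch, with a hypothetical function name not taken from the implementation:

```python
def resolve_dc(votes0, votes1, current):
    """Pick the output value for one don't-care minterm."""
    if votes0 > votes1:
        return 0
    if votes1 > votes0:
        return 1
    return current  # tie: the choice is arbitrary, so keep the existing bit
```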
3.2.4 Iterative Flow
Following the modification of the circuit, the simulation results become outdated, due
to the changes to the LUT functions. Therefore, we repeat the simulation using the
modified circuit after performing glitch reduction on the full circuit. The algorithm is
then repeated. In practice, the majority of the glitch reduction occurs within the first
three iterations.
It is important to note that the loop of the iterative flow does not involve re-running
placement and routing. This is vital for two main reasons. First, the results of the
existing compilation will be preserved, so there is no interference with timing closure.
Second, the delays within the circuit will be kept the same, thus minimizing the amount
of change to the simulation vectors. This allows the algorithm to converge quickly.
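The iterative flow can be summarized as the loop below. This is a schematic sketch only; `timing_simulate` and `reduce_glitches` are hypothetical stand-ins for the ModelSim timing simulation and the ABC-based don't-care assignment, and the placement and routing are never touched inside the loop.

```python
MAX_ITERS = 3  # most of the reduction is observed within the first few passes

def iterative_glitch_reduction(netlist, input_vectors,
                               timing_simulate, reduce_glitches):
    """Alternate simulation and don't-care setting on a fixed place-and-route."""
    for _ in range(MAX_ITERS):
        vectors = timing_simulate(netlist, input_vectors)   # VCD-style trace
        netlist, changed = reduce_glitches(netlist, vectors)  # LUT bits only
        if not changed:  # converged: no LUT function was modified this pass
            break
    return netlist
```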
The algorithm runtime is on the order of minutes for the benchmarks used. Al-
though the iterative process employs a timing simulation, the fact that this algorithm is
performed after place-and-route mitigates the issue of runtime. We envision a usage sce-
nario in which the designer runs this algorithm as part of a final pass after timing closure
has been achieved. Since no modifications are made to the circuit’s timing characteristics,
timing closure is preserved.
Figure 3.4: Experimental flow.
3.3 Methodology
We perform our glitch reduction algorithm on 20 MCNC benchmark circuits. The exper-
imental methodology was chosen to include commercial CAD tools wherever possible, to
evaluate the efficacy of the algorithm on real-world FPGAs. The flow is shown in Fig. 3.4.
We perform a full compilation using Quartus II 10.1 (synthesis, placement and routing)
targeting the Altera Stratix III 65nm FPGA family [Alteb]. This is followed by a timing
simulation using ModelSim SE 6.3e. For each circuit, 5000 random input vectors are
applied. We use a set of custom scripts to transform the simulation netlist generated by
Quartus into BLIF format, which can then be read into ABC, where the glitch reduction
is performed. Combinational equivalence checking (command cec in ABC [Mish 06c]) is
used after the glitch reduction step to ensure that the functionality of the circuit remains
the same. The output from ABC is used to modify the configuration bits in the simula-
tion netlist, thus ensuring that the placement and routing remain identical. Three passes
of the optimization loop are performed. Experiments show that very few changes, if any,
are made after this point (i.e. further iterations have virtually no effect). The power
measurements are performed using Quartus PowerPlay.
Figure 3.5: (a) Dynamic power reduction vs. baseline (default) don’t-care settings and worst-case settings. (b) Glitch power reduction vs. baseline (default) don’t-care settings and worst-case settings.
3.4 Results
The leftmost bars in Fig. 3.5(a) (vs. baseline) represent the percentage reduction in total
core dynamic power after performing the glitch reduction algorithm. Immediately, we
can see that about half of the circuits benefit from the algorithm. The average reduction
is 4.0%, with a peak of 12.5%. Fig. 3.5(b) shows the corresponding reduction in glitch
power. The average reduction is 13.7%, with a peak of 49.0%. Naturally, the amount of
power reduction possible is based on the amount of glitching present and the number of
don’t-cares available. While the overall average power reductions are relatively modest,
we believe they will interest FPGA vendors and power-sensitive FPGA customers, as they
come at no cost to performance or area. For some circuits, over 10% power reduction
can be achieved essentially for “free”.
It is also interesting to look at the optimized power vs. the worst case don’t-care
settings possible, as illustrated by the rightmost bars in Fig. 3.5 (vs. worst-case). In this
experiment, we set the don’t-cares to the opposite of how they would normally be set
by our optimization algorithm, to examine the potential worst-case glitch power arising
from don’t-cares. Here, we see an average total dynamic power savings of 9.8% and
a peak savings of 30.8% (Fig. 3.5(a)). These results show that don’t-care settings can
potentially have a large impact on power if set to sub-optimal values.
The varied results in Fig. 3.5 can be correlated with the glitch power and don’t-care
data in Tables 2.2 and 2.3. For instance, des had a high glitch power in Table 2.2, yet
we did not observe a significant power reduction for this circuit. However, in Table 2.3,
we see that it had only 0.8% of LUT inputs as don’t-cares, thus reducing the number of
opportunities for optimization. On the other hand, pdc had a high amount of glitching
as well as ample don’t-cares, thus allowing it to be greatly improved by the algorithm –
12.5% dynamic power reduction.
We also examined the bias of votes cast on each don’t-care minterm in each LUT
in each circuit. The average results are shown in Fig. 3.6. The bias is defined as the
Figure 3.6: Average vote bias.
percentage of votes that were cast for the more popular setting, whether logic-0 or logic-
1. Bias is calculated for each don’t-care individually and averaged across the circuit. As
shown in the figure, the bias value tends to be in the 80-100% range, indicating that there
usually exists a highly preferable setting for a particular don’t-care minterm in a LUT.
This is an important observation because it indicates that our don’t-care settings are
providing a benefit most of the time (as opposed to the case of a bias around 50%, which
would imply that selecting either logic-0 or logic-1 for the don’t-care minterm is equally
good). These observations suggest that there usually exists a value for each don’t-care
(either 0 or 1) that is much better than the other, meaning that one can pick don’t-care
logic values with a high degree of confidence.
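The bias statistic can be expressed compactly. The sketch below is illustrative, with hypothetical names; it computes the per-minterm bias and averages it, skipping minterms that received no votes.

```python
def vote_bias(votes0, votes1):
    """Percentage of votes cast for the more popular value (50-100),
    or None if this minterm was never voted on."""
    total = votes0 + votes1
    if total == 0:
        return None
    return 100.0 * max(votes0, votes1) / total

def average_bias(vote_pairs):
    """Average bias over all don't-care minterms with at least one vote."""
    biases = [b for b in (vote_bias(v0, v1) for v0, v1 in vote_pairs)
              if b is not None]
    return sum(biases) / len(biases) if biases else None
```

A bias near 100% means one setting dominated; a bias near 50% would mean the two settings were equally attractive.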
The relationship between don’t-cares, power and fanout presents a challenge to the
glitch reduction algorithm. Fanout is closely related to interconnect capacitance, and
interconnect can represent 60% of total FPGA dynamic power, on average [Shan 02].
Fig. 3.7(a) shows logic signal power consumption versus fanout, averaged across all signals
in all circuits. Observe that, as expected, average signal power increases with fanout,
due to the increase in capacitance. We also examined, for each signal, the fraction of
minterms in its driving LUT that were don’t-cares, and averaged this across all signals
Figure 3.7: (a) Power per signal vs. fanout. (b) Normalized don’t-cares per node vs. fanout.
Figure 3.8: Fanout splitting.
of a given fanout in all circuits. The results are shown in Fig. 3.7(b). While the results
are “noisy” for high fanout (due to a small sample size for such fanouts), we see that, in
general, high fanout signals have fewer don’t-cares in their driving LUTs than low fanout
signals. The rationale for this is that high fanout signals are more likely to be used by
at least one of their fanouts, decreasing ODCs for such signals. Essentially, we have two
competing trends in that it is desirable to reduce the power of high fanout signals (as
they consume significant power), yet such signals exhibit fewer don’t-care opportunities.
Figure 3.9: Stratix III adaptive logic module (ALM) [Alteb].
3.4.1 Fanout Splitting
Based on the trend of high-fanout signals having fewer don’t-cares, it seemed reasonable
to examine this as a potential area for improvement. Consider a LUT f1 with fanout
LUTs FO1...FOn. Suppose that LUTs FO1...FOn−1 do not care about the value of f1
when its input is x, but FOn does care about it. Then x is a care for f1, thereby reducing
the amount of don’t-care optimization opportunities, even though only one of its fanouts
uses it.
A possible solution to this problem is to duplicate LUT f1, creating f2, and trans-
ferring fanout FOn from f1 to f2. This would increase the amount of don’t-cares on f1,
since x would now be a don’t-care. In general, f1 can be split into two LUTs, f ′1 and
f2 (i.e. we redistribute the fanout of f1, moving some of its fanout to f2). Each LUT
now has more don’t-care opportunities, since the cares “generated” by fanouts of f ′1 are
no longer present in f2, and vice versa. An example is given in Fig. 3.8. The LUT f1
has four fanouts which have care set 1 (illustrated by the hatch marks as a subset of the
truth table). In other words, if no other fanouts existed besides those four, the overall
care set of f1 would be care set 1. The fifth fanout has care set 2. The overall care set
of f1 is the union of these care sets.
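In set terms, the overall care set is the union of the care sets induced by the individual fanouts, which is exactly what fanout splitting exploits: partitioning the fanouts partitions the unions. A small sketch with hypothetical minterm sets:

```python
def overall_care_set(fanout_care_sets):
    """Union of the care sets contributed by each fanout of a LUT."""
    cares = set()
    for cs in fanout_care_sets:
        cares |= cs
    return cares

# Hypothetical minterms: four fanouts share care set 1; a fifth
# contributes care set 2 (cf. Fig. 3.8).
care1, care2 = {0b000, 0b011, 0b111}, {0b100}
before = overall_care_set([care1] * 4 + [care2])  # f1 must respect both sets
after_f1 = overall_care_set([care1] * 4)          # f1' keeps the four fanouts
after_f2 = overall_care_set([care2])              # f2 takes the fifth fanout
```

After the split, each duplicate's care set is strictly smaller than the original union, leaving more minterms free for don't-care optimization.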
By splitting the fanout of f1 among two new LUTs, f ′1 and f2, we can create two
Figure 3.10: Dynamic power reduction from fanout splitting.
LUTs with smaller care sets and therefore more don’t-care optimization opportunities.
However, this incurs a power cost in duplicating the LUT and some routing resources.
The fanin routing would have to be duplicated for the new LUT. This would add to
the capacitance of these fanin signals. Fortunately, the Stratix III architecture [Alteb]
provides us with a way to mitigate this cost. The Adaptive Logic Module (ALM) shown
in Fig. 3.9 is essentially a pairing of two LUTs. By co-locating f1 and f2 in the same
ALM, we can virtually eliminate the cost of routing to an entirely new LUT. This is
because the routing to one LUT is shared with the routing to the other. This is a special
opportunity offered by the Stratix III architecture.
Figure 3.10 shows the dynamic power reduction resulting from fanout splitting. Some
circuits could not be placed and routed after fanout splitting due to illegal placement
constraints. This is because pairing certain LUTs into a single ALM may cause issues
with the compatibility between the LUTs. Unfortunately, the possible power reduction
is quite low, aside from a 5% reduction on alu4. Several circuits even show an increase
in power. This is due to the extra LUT that must be used, as well as its associated
routing resources. Considering the tradeoff of saving the occasional glitch transition
versus the overhead of adding more logic and routing resources, the fanout splitting is
rarely beneficial. Therefore, we decided not to further pursue fanout splitting.
3.5 Conclusion
In this chapter, we presented an analysis of glitch power in FPGAs and a method for glitch
reduction using don’t-cares in logic synthesis. We showed that glitch power is a significant
portion of total power, and that there exist ample opportunities for don’t-care-based
optimizations. A novel glitch reduction technique was presented that sets don’t-cares in
FPGA configuration bits in order to avoid glitch transitions. This method is performed
after placement and routing, and has no effect on circuit area or performance. The
algorithm was evaluated with a commercial 65nm FPGA architecture using a commercial
tool flow. The algorithm achieved an average total dynamic power reduction of 4.0%,
with a peak reduction of 12.5%; glitch power was reduced by up to 49.0%, and 13.7% on
average.
Chapter 4
FPGA CAD Algorithm Noise
4.1 Introduction
The process of designing a circuit for an FPGA platform generally involves writing code
in a hardware description language such as Verilog or VHDL, then compiling the code
to a bitstream that will be programmed onto the FPGA. This compilation process is
broken into a series of CAD stages. Due to the complex nature of these problems, the
CAD algorithms make use of heuristics to handle them.
CAD algorithms commonly encounter situations where a choice must be made be-
tween two or more alternatives that appear to have the same quality. For example, a
logic function might be implemented in multiple ways, each having the same local cost
in terms of area, delay, power, or some other metric. However, the choice of how that
function is implemented may have an unknown global effect on the quality of the circuit.
In practice, the choice may be arbitrarily made (e.g. always select the first alternative)
or it may be controlled with the use of a random number generator. By running the
algorithm multiple times using different seeds for the random number generator, we can
obtain a set of circuits with different characteristics. The variation in the quality of these
circuits (area, performance, power) through seemingly neutral changes is what we will
call noise. It is interesting to note that noise places a limit on the prediction accuracy of
any timing/power estimation tools that are used prior to a noise-containing CAD algorithm. One of the goals of this work is to quantify the amount of noise present in several
CAD algorithms.
The practice of trying multiple seeds, or “seed sweeping,” is well-established for placement and routing [Altea]. However, noise can also be found in the logic
synthesis and technology mapping stages of the CAD flow. By exposing the noise in these
earlier stages, we hope to allow seed sweeping to take place earlier, in less time-consuming
stages.
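Conceptually, seed sweeping is just a minimize-over-seeds loop. The sketch below is a schematic illustration with hypothetical `compile_with_seed` and `quality` callbacks (a lower quality value is taken as better, e.g. critical-path delay or dynamic power):

```python
def seed_sweep(compile_with_seed, quality, num_seeds=10):
    """Compile with several RNG seeds and keep the best-quality result."""
    best, best_q = None, None
    for seed in range(num_seeds):
        circuit = compile_with_seed(seed)  # one full (or partial) CAD run
        q = quality(circuit)               # metric to minimize
        if best_q is None or q < best_q:
            best, best_q = circuit, q
    return best, best_q
```

The cost of the sweep is proportional to the stage at which it is applied, which is why exposing noise in the cheaper, earlier stages is attractive.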
The following questions are addressed in the remaining chapters:
• Where in the CAD flow does noise come from?
• How much noise exists in the various stages of the compilation flow?
• Is there a way to predict the best circuits from a group of candidates, in the presence
of noise?
In this chapter, we examine several CAD algorithms in the logic synthesis and tech-
nology mapping stages and expose noise in those algorithms. To our knowledge, there
is no prior work studying noise in these algorithms, nor is there existing work on power
noise in FPGAs (variations in dynamic power consumption due to CAD algorithm noise).
Section 4.2 presents background on the particular CAD algorithms to be studied, and
describes our noise injection method. Section 4.3 describes our methodology for eval-
uating the amount of noise present in a set of benchmark circuits. Section 4.4 shows
the performance and power results before place-and-route, while Section 4.5 shows the
results after place-and-route. Section 4.6 summarizes the chapter.
4.2 CAD Flow Stages
A typical FPGA CAD flow is shown in Fig. 4.1. It consists of the following steps:
Figure 4.1: FPGA CAD flow.
• Logic Synthesis: The logic functions needed to implement the circuit are derived
and optimized. We will be exploring new ways to inject noise into this stage.
• Technology Mapping: The logic functions are mapped into the logic elements
specific to the target device architecture. We will investigate noise in this stage as
well.
• Packing: The logic elements are grouped into larger units corresponding to the
target device architecture. We do not introduce noise in this stage, because we
are using commercial tools to perform packing and have no way to modify the
algorithm.
• Placement and routing: The mapped logic elements are placed into physical
locations on the target device, and the proper connections are made between the
logic elements using the programmable routing network. The presence of noise in
this stage has already been established, but it will still be considered in this work.
One of the few works to consider noise in FPGA CAD [Rubi 11] examines the
amount of delay noise in the routing stage of VPR [Betz 97]. The authors in-
voke randomness in the PathFinder routing algorithm by changing the order of
nets routed and making small perturbations in circuits. One experiment involves
changing the routing architecture to include some slightly faster wires such that the
maximum impact to critical path delay should be 0.5%. However, this modifica-
tion was experimentally shown to cause changes of -34% to +15%. This work also
proposes a technique to reduce this noise through delay-targeted routing, which
calculates the criticality of a route using a fixed delay target rather than a floating
one.
The work presented here focuses on the logic synthesis and technology mapping stages.
In particular, we use the algorithms implemented in the academic tool ABC [Berk 06].
These algorithms are explained in further detail below, as well as our new methods for
injecting noise into each of them.
4.2.1 Logic Synthesis
The algorithms studied in this stage act on an And-Inverter Graph (AIG) which is a
representation of a logic circuit using only two-input AND gates and inverters. This
is the primary data structure used by ABC. An example is shown in Fig. 4.2. The
large circles (nodes) represent AND gates, while the small dots on the edges represent
inversion. This example shows the function ¬x1 ∧ ((x2 ∨ x3) ∧ (x4 ∧ x5)).
The general goal of algorithms in this stage is to reduce the number of nodes in the
AIG and the number of logic levels, which is the maximum number of nodes from a
combinational input to a combinational output. In the example of Fig. 4.2, there are
Figure 4.2: Example of an And-Inverter Graph (AIG).
four nodes and three levels.
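Node and level counts can be computed by a topological traversal of the AIG. The sketch below uses a hypothetical dict-based representation in which each AND node maps to its two fanins and each primary input maps to None; inverters are omitted, since in ABC they are edge attributes and contribute neither nodes nor levels.

```python
def aig_stats(aig, outputs):
    """aig: dict node -> (fanin0, fanin1) for AND nodes, None for inputs.
    Returns (number of AND nodes, number of logic levels)."""
    levels = {}

    def level(n):
        # level of an input is 0; an AND node sits one above its deepest fanin
        if n not in levels:
            fi = aig[n]
            levels[n] = 0 if fi is None else 1 + max(level(fi[0]), level(fi[1]))
        return levels[n]

    num_nodes = sum(1 for fi in aig.values() if fi is not None)
    return num_nodes, max(level(o) for o in outputs)

# The structure of the Fig. 4.2 example: four AND nodes, three levels.
aig = {'x1': None, 'x2': None, 'x3': None, 'x4': None, 'x5': None,
       'n1': ('x2', 'x3'), 'n2': ('x4', 'x5'),
       'n3': ('n1', 'n2'), 'n4': ('x1', 'n3')}
```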
Three potential noise sources were identified in the logic synthesis stage:
• AIG balancing: And-Inverter Graph balancing is a technique that aims to reduce
the number of levels in an AIG [Mish 11]. An example of this is shown in Figs. 4.3
and 4.4. In Fig. 4.3, we see in the ellipse an AIG representing a 5-input AND. This
subgraph of the AIG has a depth of 4. In Fig. 4.4, we see two examples of AIGs
that could be generated by AIG balancing. In both cases, the number of levels is
reduced to 3. However, the balancing can be done in multiple ways, by placing
different signals on the shallower inputs. Balancing is done in two main steps:
– Tree covering: This step identifies multi-input AND gates in the AIG by
grouping together nodes which are not inverted and have no external fanout.
Figure 4.3: Example of an AIG before balancing (logic levels shown in parentheses).
An example is shown by the ellipse in Fig. 4.3. The tree cannot be expanded
to include x4 as it is inverted, and it cannot include x5 as it has another
fanout.
– Tree balancing: For each multi-input AND gate identified by the tree cov-
ering stage, the tree balancing stage decomposes it into a balanced tree of
two-input AND gates. The balancing is done considering the logic levels of
the nodes feeding the multi-input AND. The process is shown in Algorithm 2.
The algorithm essentially pairs nodes together until the tree is formed. It
begins by taking the lowest level node as the first one to be paired (line 3).
It then finds the nodes with the next lowest level, between the indices of
leftBound and rightBound (lines 5-9).
Figure 4.4: Examples of balanced AIGs.
At this point, the ABC code makes an arbitrary selection between the nodes
(selecting in the same way every time). However, we change the algorithm to
select one of these nodes randomly (line 11). The rand(m,n) function gives a
random integer between m and n (inclusive). Finally, this node is paired with
the first one into a two-input AND gate, replacing the original nodes. This
process continues until the last two nodes are paired. Consider the example in
Fig. 4.3 where the logic levels are in parentheses. Two nodes with the lowest
level (3) are randomly chosen and paired into another node with level 4. The
pairing might proceed as follows (Fig. 4.4, left):
∗ Begin by sorting input nodes in descending order by level
x1(4), x2(4), x3(3), x4(3), x5(3)
∗ Randomly choose two nodes with the lowest level (x4(3) and x5(3)) and
combine them into a new node, x45(4)
x1(4), x2(4), x45(4), x3(3)
Figure 4.5: Example of AIG rewriting.
∗ Combine x3(3) with a random level 4 node, x45(4), to form x345(5)
x345(5), x1(4), x2(4)
∗ Combine two level 4 nodes x1(4) and x2(4) into x12(5)
x12(5), x345(5)
∗ Combine final two nodes to complete balancing
x12345(6)
Alternatively, the random pairing may be done this way (Fig. 4.4, right):
∗ x1(4), x2(4), x3(3), x4(3), x5(3)
∗ Select x3(3) and x5(3) randomly (instead of x4(3) and x5(3) as before)
x1(4), x2(4), x35(4), x4(3)
∗ Combine x4(3) with a random level 4 node, x1(4), to form x14(5)
x14(5), x2(4), x35(4)
∗ Combine two level 4 nodes x2(4) and x35(4) into x235(5)
x14(5), x235(5)
∗ Combine final two nodes to complete balancing
x12345(6)
Algorithm 2 Tree balancing algorithm.
Input: a vector V of input nodes to the multi-input AND gate, sorted by decreasing level
Output: a balanced AIG
 1: while size(V) > 1 do
 2:   {1. Get the node with minimum level}
 3:   node1 ← V[size(V) − 1]
 4:   {2. Identify the nodes with the next lowest level (between leftBound and rightBound)}
 5:   rightBound ← size(V) − 2
 6:   leftBound ← rightBound
 7:   while leftBound ≥ 0 and level(V[leftBound − 1]) = level(V[rightBound]) do
 8:     leftBound ← leftBound − 1
 9:   end while
10:   {3. Select a node randomly (NEW)}
11:   node2 ← V[rand(leftBound, rightBound)]
12:   {4. Pair the nodes}
13:   newNode ← AND(node1, node2)
14:   remove(V, node1, node2)
15:   insert(V, newNode)
16: end while
17: return newNode
This shows that the tree balancing stage does not provide a unique solution,
and is therefore a source of noise.
• AIG rewriting: AIG rewriting is an algorithm that reduces the number of nodes/logic
levels in an AIG by examining subgraphs of nodes and replacing them with lower-
cost substitutes [Mish 06b]. An example is shown in Fig. 4.5. In this case, the
AIG subgraph represents a 3-input AND. Using rewriting, it can be reduced from
3 nodes to 2. Algorithm 3 loops through each node n in the AIG (line 1) and enu-
merates all 4-input cuts of n (line 2). A cut of a node n is a set of nodes (leaves)
such that each path from a primary input to n passes through at least one node
of the cut. Each cut is replaced with equivalent AIG subgraphs from a hash table
of precomputed subgraphs (line 5). If the subgraph leads to a reduction in AIG
nodes, it is kept.
To add randomness at this stage, we modify the algorithm to allow changes even
when the replacement leads to no change in the AIG node count. It is kept with
a 50% probability (line 8). These are known as “zero-cost” replacements. If a
Algorithm 3 AIG rewriting algorithm.
Input: an AIG and a hash table of precomputed subgraphs S
Output: a rewritten AIG
1:  for each node n of the AIG in topological order do
2:    for each 4-input cut c of n do
3:      bestGain ← −1
4:      bestS ← NULL
5:      for each possible rewriting option s from HashLookup(S, c) do
6:        gain ← SavedNodes(s, c) − AddedNodes(s, c)
7:        {If zero-cost, keep the change with 50% probability (NEW).}
8:        if gain > 0 or (gain = 0 and rand(0, 1)) then
9:          if bestS = NULL or gain ≥ bestGain then
10:           bestGain ← gain
11:           bestS ← s
12:         end if
13:       end if
14:     end for
15:     if bestS ≠ NULL then
16:       Update(AIG, bestS)
17:     end if
18:   end for
19: end for
new subgraph is found (leading to a cost reduction or zero-cost change), the AIG is
updated (line 16). The zero-cost replacements are a source of noise for the rewriting
algorithm.
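The noisy option selection of Algorithm 3 can be sketched as follows. This is a hedged Python illustration; `choose_rewrite`, the option names, and the gain values are hypothetical and do not reflect ABC's API.

```python
import random

def choose_rewrite(options, rng=random.Random(1)):
    """Sketch of the option selection in Algorithm 3.

    `options` is a list of (name, gain) pairs, where
    gain = saved_nodes - added_nodes for a candidate subgraph.
    Zero-cost options are kept with 50% probability (the injected noise)."""
    best_gain, best = -1, None
    for name, gain in options:
        # Positive gains are always considered; zero gains only half the time.
        if gain > 0 or (gain == 0 and rng.random() < 0.5):
            if best is None or gain >= best_gain:
                best_gain, best = gain, name
    return best  # None means the cut is left unchanged
```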
• AIG refactoring: This technique involves computing one large cut for each AIG
node, then replacing it with a factored form with fewer nodes [Mish 06a]. The cuts
are chosen based on how much reconvergence they contain, which is an indicator of
redundancy that can be exploited by refactoring. Refactoring differs from rewriting
in that it acts on a larger scale (by default, rewriting is done on 4-input cuts, while
refactoring can go as high as 16). The noise injection in this stage is similar to
the method used in AIG rewriting. New AIG subgraphs are generated, and the
replacements are made with a 50% probability if the new subgraphs result in a
zero-cost change.
The above algorithms are repeated in sequence several times as part of the ABC script
resyn2. This lets the algorithms create new optimization opportunities for each other.
Algorithm 4 Cut comparison algorithm.
Input: two cuts, c1 and c2
Output: 1 if c1 is better, −1 if c2 is better
1:  if metric1(c1) > metric1(c2) then
2:    return 1
3:  end if
4:  if metric1(c2) > metric1(c1) then
5:    return −1
6:  end if
7:  {...repeat for all metrics...}
8:  if metricn(c1) > metricn(c2) then
9:    return 1
10: end if
11: if metricn(c2) > metricn(c1) then
12:   return −1
13: end if
14: {If still tied, decide order randomly (NEW).}
15: if rand(0, 1) then
16:   return 1
17: else
18:   return −1
19: end if
4.2.2 Technology Mapping
We use the priority cut-based technology mapping algorithm in ABC [Mish 07]. The
goal of this stage is to map the logic of the AIG to K-input functions which can be
implemented by LUTs on the FPGA (K depends on the FPGA architecture). The
mapper does this by first evaluating a set of priority cuts for each node in an AIG. These
cuts represent potential LUT implementations of that node. The cuts are selected and
sorted in terms of delay, number of inputs, and area. At this point in the CAD flow,
logic depth is used as a proxy for delay, and the number of LUTs is used as a proxy for
area.
The priority cuts for each node are sorted by several criteria, depending on the map-
ping parameters. These criteria include depth and cut size. The random noise in this
stage comes from deciding between cuts with the same values for each of these metrics.
Algorithm 4 shows the cut comparison function used when sorting priority cuts. It begins
by comparing the two cuts for each metric in order. If the cuts are tied in all cases, our
modification to the algorithm makes a random selection between them.
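The comparison of Algorithm 4 can be sketched in Python as follows. The function and metric names are our own, metrics are assumed to be arranged so that a larger value is better (per the return convention of Algorithm 4), and the seeded generator stands in for ABC's random source.

```python
import random

def compare_cuts(c1, c2, metrics, rng=random.Random(2)):
    """Sketch of Algorithm 4: lexicographic cut comparison with a
    random tiebreak. `metrics` is a list of scoring functions;
    returns 1 if c1 is better, -1 if c2 is better."""
    for m in metrics:
        if m(c1) > m(c2):
            return 1
        if m(c2) > m(c1):
            return -1
    # All metrics tied: decide the order randomly (the injected noise).
    return 1 if rng.random() < 0.5 else -1
```

For example, with metrics that prefer lower depth, then smaller cut size (negated so larger is better), ties in every metric fall through to the random tiebreak.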
The mapping algorithm makes several passes over the netlist, stitching together the
best results (using depth-optimal mappings on critical paths, and area-oriented mappings
elsewhere). We introduce noise in two stages:
• Depth-oriented mapping: Here, the logic depth metric is prioritized over area.
• Area-oriented mapping: Area is prioritized over depth.
4.2.3 Placement and Routing
The placement problem deals with assigning physical locations to each of the logic
blocks in a circuit. A common technique for placement is simulated annealing [Kirk 83].
This algorithm mimics the annealing of metals, a process in which a material is heated,
then cooled in order for its atoms to settle into a low-energy configuration. In placement,
the “atoms” are logic blocks, which are moved around randomly. The random moves are
controlled by the current “temperature” of the anneal, which determines the likelihood
of accepting moves even when they reduce the current quality of the placement. This
“hill-climbing” quality allows the algorithm to avoid being stuck in a local minimum of
the solution space.
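The acceptance rule described above can be illustrated with a generic Metropolis criterion. This is a textbook sketch of simulated-annealing acceptance, not Quartus' proprietary implementation; the function name and parameters are our own.

```python
import math
import random

def accept_move(delta_cost, temperature, rng=random.Random(3)):
    """Generic Metropolis acceptance rule for annealing-based placement.

    Improving moves are always taken; worsening moves are taken with
    probability exp(-delta_cost / temperature), which shrinks as the
    anneal cools (this is the hill-climbing behaviour)."""
    if delta_cost <= 0:
        return True   # the move improves (or keeps) placement quality
    if temperature <= 0:
        return False  # fully cooled: no more hill-climbing
    return rng.random() < math.exp(-delta_cost / temperature)
```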
The placement and routing stages of the CAD flow are done using Quartus II 10.1, a
commercial CAD tool from Altera. Since this is a commercial tool, we cannot implement
our own noise injection method. Instead, the noise injection in this stage is simply a
matter of changing the seed option to Quartus’ place-and-route tool. Although place-
and-route noise is not the focus of this work, it is still important to consider noise in this
stage because it can mask the effects of noise in the previous stages. For example, a good
placement (due to noise) might hide the negative effects of a mapping with inherently
bad quality, or vice versa. Therefore, it is necessary to evaluate the noise in this stage
and try to separate it from the noise in the previous stages.
4.3 Methodology
In order to evaluate noise over a large number of random seeds, many circuit compilations
were performed (over 10,000). To do so, experiments were conducted on the SciNet
high-performance computing system [Loke 10], which allowed us to run compilations in
parallel and complete thousands of them within a few days.
The tools used are the same as the ones used in the glitch power analysis section. We
use Altera’s Quartus 10.1 to pack, place and route the circuits and perform timing and
power analysis. The target FPGA family is Altera’s Stratix III. Modelsim 6.3e is used
for simulation to get toggle rates for the power estimation. 5000 random input vectors
are applied to each circuit.
The first benchmark set consists of 20 MCNC circuits. A second set of 7 benchmarks
was taken from the VPR 5.0 benchmark set, in order to have data for some larger circuits.
The circuits were selected by removing ones that were too similar to others (e.g. two
FIR filters with different parameters), and removing those which did not show significant
toggling on nets when subjected to random vector simulation (some circuits may require
specific input patterns to become active). This was done in order to get meaningful
dynamic power data.
4.4 Noise Measurement: Before Place and Route
In this section, the word “design” will be used to refer to all circuits having the same
original source file (e.g. “alu4” is a design). A “circuit” will refer to a particular
compilation of the design using certain seeds (e.g. “alu4” compiled with synthesis seed 1
and mapping seed 2 is a circuit). Six noise injection experiments are presented: one for
each of five noise injection stages tested individually, as well as one experiment containing
all noise injection stages. For each experiment, each of the 27 designs was processed by
ABC using 25 different seeds (making 25 ∗ 27 = 675 circuits). The results of the noise
Figure 4.6: Number of circuits vs. normalized nodes/level (balancing noise).
Figure 4.7: Number of circuits vs. normalized nodes/level (rewriting noise).
Figure 4.8: Number of circuits vs. normalized nodes/level (refactoring noise).
Figure 4.9: Number of circuits vs. normalized nodes/level (depth-oriented mapping noise).
Figure 4.10: Number of circuits vs. normalized nodes/level (area-oriented mapping noise).
Figure 4.11: Number of circuits vs. normalized nodes/level (all noise).
injection experiments are as follows:
1. AIG balancing (Fig. 4.6)
2. AIG rewriting (Fig. 4.7)
3. AIG refactoring (Fig. 4.8)
4. Technology mapping - depth-oriented (Fig. 4.9)
5. Technology mapping - area-oriented (Fig. 4.10)
6. All of the above (Fig. 4.11)
Each graph shows a histogram of the noise distribution for that stage. The x-axis
shows the normalized number of nodes (AIG nodes for unmapped circuits, LUTs for
mapped circuits) and the normalized number of logic levels. The results are normalized
to the average for each design. The y-axis shows the number of circuits in each bin (a
bin contains circuits falling to the left of its label). All graphs are shown with the same
scale to facilitate comparison.
From inspection, the circuits tend to fall in a normal probability distribution. The
synthesis stages (Figs. 4.6, 4.7, 4.8: balancing, rewriting, refactoring) tend to show wider
distributions. Note that the outliers in level count are generally due to low numbers
of logic levels (relative to the number of nodes in the circuit). It was observed that
any deviations in logic depth are limited to a single level. The node count distributions
are generally smoother. In the balancing stage, the majority of circuits appear to be
contained within +/- 1.0% of the mean (i.e. between 0.99 and 1.01), while the rewriting
and refactoring stages are tighter – around +/- 0.6% of the mean.
In contrast, noise in technology mapping (Figs. 4.9, 4.10) is much less than in syn-
thesis. The noise distributions are far narrower, showing that most circuits are within
0.2% of the mean in terms of node count and levels. When all noise injection stages are
Table 4.1: Standard deviation of noise (before place-and-route)

Noise injection    Node stdev.   Level stdev.
Balancing          0.004         0.013
Rewriting          0.003         0.010
Refactoring        0.002         0.010
Mapping - depth    0.001         0
Mapping - area     0.001         0
All                0.013         0.025
combined (Fig. 4.11), an accumulating effect can be seen, with the distribution stretching
as far as +/- 2.0%.
Table 4.1 shows the standard deviation of the normalized number of nodes and logic
levels. This gives a more quantitative view of the results in the previous graphs. Again, it
appears that the noise in earlier stages is greater, due to having more degrees of freedom.
Balancing, rewriting and refactoring show standard deviations as high as 0.4% (0.004) in
node count and 1.3% in levels.
In contrast, noise in technology mapping is limited to about 0.1% in node count,
while the noise in levels is virtually zero. This indicates that the cost functions used in
technology mapping seem to be more fine-grained, causing fewer ties between options
and thus fewer opportunities for randomness. This can be credited to the tiebreakers
used in the mapper described in Section 4.2.2, which use secondary and tertiary metrics to
distinguish between otherwise equivalent options. Furthermore, the mapping algorithm
is meant to produce depth-optimal mappings, so the lack of change in levels is expected.
Naturally, the noise of all stages combined is the greatest (1.3% in nodes and 2.5% in
levels), indicating that there is an accumulating effect from all stages together.
4.5 Noise Measurement: After Place and Route
In this section, we examine the amount of noise after the placement and routing stages.
For these experiments, circuits with logic depth greater than minimum depth for that
Figure 4.12: Number of circuits vs. normalized delay.
design were removed. The reasoning for doing so is as follows. Traditionally, logic depth
is the primary metric used for estimating performance before placement and routing. It is
clear that deeper circuits are likely to have greater delay than shallower ones. However,
we wish to explore new metrics other than logic depth (as will be seen in the next
chapter). In other words, we would like to find ways of differentiating between minimal-
depth mappings. This is why the circuits for each design are all of minimum depth, to
remove that factor from consideration.
At the post-routing stage, we can evaluate the circuits in terms of critical path delay
and dynamic power. Placement and routing were performed with Quartus 10.1, using
the “standard fit” setting (i.e. maximum effort). The critical path delay was obtained
using Quartus’ TimeQuest timing analyzer.
We begin by showing the amount of noise present when all noise sources are acti-
vated (balancing, rewriting, refactoring, delay and area-oriented technology mapping,
and placement). Five seeds are used in each of synthesis, mapping and placement for 27
designs, making a total of 5 ∗ 5 ∗ 5 ∗ 27 = 3375 circuits. Figs. 4.12 and 4.13 show the
number of compiled circuits vs. their critical path delay and dynamic power normalized
Figure 4.13: Number of circuits vs. normalized dynamic power.
Table 4.2: Standard deviation of noise (after place-and-route)

Noise injection    Delay stdev.   Power stdev.
Synthesis          0.018          0.027
Mapping            0.009          0.014
to the average for that design. As before, the results appear to show a normal distri-
bution, with the power noise being slightly greater than delay noise. The majority of
the circuits fall between 0.9 and 1.1 (+/- 10% from the mean). The standard deviation
of critical path delay is 3.3% and dynamic power is 3.7%. This is a fairly significant
amount, enough to drive the use of seed sweeping to find the circuit compilations with
the best results in this distribution.
Next, we attempt to isolate the effect of synthesis noise. Figs. 4.14(a) and 4.14(b)
show the number of circuits vs. their normalized delay and core dynamic power under
the influence of logic synthesis noise only. The x-axis scaling is kept the same to compare
the width of the noise distributions. 25 synthesis seeds were used for each design. All
synthesis noise sources were activated (balancing, rewriting, and refactoring). For each
circuit, the delay and power were averaged across five Quartus compilations using dif-
ferent place-and-route seeds. This was done in order to reduce the impact of placement
Figure 4.14: Synthesis noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
Figure 4.15: Delay rank of circuits under synthesis noise averaged across 4 and 5 placement seeds.
noise, in an attempt to isolate the synthesis noise only. To justify the use of five seeds,
we compared the delay ranking of circuits between experiments using four and five seeds.
That is, for each design, we average the critical path delay of each of its 25 candidate
circuits over four place-and-route seeds, and repeat for five seeds. We then rank those
25 candidates from 1 (best) to 25 (worst) for the four-seed average, then repeat for the
five-seed average. As shown in Fig. 4.15, the ranks were fairly well-correlated between
four and five seeds, so five was deemed a sufficient number of seeds to preserve accuracy
without increasing runtime to an unacceptable degree. However, averaging was not per-
formed over mapping seeds (i.e. some mapping noise is still present, although the mapper
seed is kept constant). This was done because averaging over mapping seeds would have
increased runtime by an additional factor of five. Therefore, the results should not be
interpreted as a pure isolation of synthesis noise, but simply the noise arising from seed
changes in synthesis.
The delay and dynamic power distributions again appear to follow normal distribu-
tions. They are narrower than the ones without averaging across placement seeds, which
shows the effect of the placement noise. Again, the power noise is greater than the delay
Figure 4.16: Technology mapping noise: (a) Number of circuits vs. normalized delay. (b) Number of circuits vs. normalized power.
noise. A few circuits have normalized power as low as 0.86 or as high as 1.16. Power can
be affected by all changes to a circuit, while delay is only affected by changes that touch
the critical path. The standard deviations of delay and power are shown in Table 4.2.
The standard deviations in delay and power are 1.8% and 2.7%, respectively. This in-
dicates the degree to which random, zero-cost changes in the logic synthesis stage affect
the overall quality of the circuit. Although these changes may appear “zero-cost” at the
time they are made, it is clear that they can have a significant effect on the overall circuit
quality.
Figs. 4.16(a) and 4.16(b) show results obtained in similar fashion for technology map-
ping noise. Again, all technology mapping noise sources were activated (depth and area-
oriented). The delay is contained in the 0.96-1.04 range while the dynamic power is in
the 0.92-1.08 range. As before, the power distribution is wider. For a more quantitative
view, the standard deviations are shown in Table 4.2. The variance arising from mapping
is less than the variance from synthesis, echoing the results seen before place-and-route
(Section 4.4). Again, this is likely due to good tiebreaking mechanisms in the mapper,
as well as the fact that there are fewer downstream CAD stages to be affected by noise
introduced in mapping.
4.6 Conclusion
In this chapter, we introduced the concept of CAD algorithm noise – variations in circuit
quality under the influence of cost-neutral changes in the compilation flow. We proposed
the new concepts of noise in logic synthesis and technology mapping, as well as power
noise. We also identified noise sources in various logic synthesis and technology mapping
algorithms, and we described our method of noise injection in those algorithms. Finally,
we presented experimental results to show the degree to which noise can affect circuit
quality. Under the influence of synthesis noise, standard deviations of critical path delay
and dynamic power were 1.8% and 2.7%, while the results for technology mapping were
0.9% and 1.4%, respectively. Under the influence of noise in all CAD stages, the standard
deviations were 3.3% in delay and 3.7% in power. This shows the significance of noise
in FPGA CAD algorithms, and motivates further research to better understand and
mitigate its effects. In the next chapter, we will do this by exploring ways to find the
best circuits from a group of candidates produced by a noise-injected algorithm.
Chapter 5
Early Timing and Power Prediction With Noise
5.1 Introduction
The previous chapter showed that CAD algorithm noise can have a significant impact
on the overall quality of a circuit. Given multiple compilations of the same design using
different seeds in the synthesis and mapping stages of the CAD flow, the performance
and power consumption of the resulting circuits can vary.
It is common practice for hardware engineers to perform multiple placement and
routing compilations using different seeds in order to find the best one. However, this is
a long process, taking hours or even days for the largest designs [Gort 10]. On the other
hand, synthesis and mapping are relatively quick. This leads to the question of whether
we can sweep seeds in the synthesis and mapping stages. Seed sweeping would still be
performed in placement and routing, but seed sweeping in synthesis and mapping could
generate better circuits as inputs to place-and-route. This process would require early
timing and power prediction metrics that could be computed at the post-mapping stage, in
order to find the best candidate circuits for placement and routing. By doing so, we
would be able to minimize the number of lengthy place-and-route runs.
This chapter deals with predicting the performance and dynamic power of circuits
after the technology mapping stage. Section 5.2 describes previous works on early delay
and power prediction. Section 5.3 describes our delay prediction method, while Sec-
tion 5.4 describes our power prediction method. Section 5.5 explains our consideration
of the packing stage of the CAD flow. Section 5.6 details our experimental methodology.
Section 5.7 shows the results of the delay/power prediction, and Section 5.8 concludes
the chapter.
5.2 Previous Work on Early Delay/Power Prediction
Early power and performance estimation has been done at various stages of the CAD
flow. At the high-level synthesis stage, power estimation can be done to drive low-power
resource allocation and binding techniques [Chen 07a]. Basic operations (e.g. addition,
multiplication) are characterized in terms of their area, delay and power consumption. By
building a library of these common operations, early estimates of the quality of a circuit
can be made. However, this technique is too coarse-grained for our purposes, since it
cannot detect small changes that would be made in the logic synthesis and mapping
stages.
At the pre-placement stage, work has been done to predict interconnect wirelength
and delay [Mano 07, Pand 07]. Doing so can help predict the critical path of a circuit, as
well as capacitances for power estimation. The work by Manohararajah et al. [Mano 07]
shows that interconnect delays can vary greatly when the placer seed is changed. It
proposes a simple timing model based on assigning a single value for each connection
depending on its source and destination node type and port (e.g. logic, I/O, memory).
However, this does not provide sufficient granularity for designs with few different port
types.
The work by Pandit and Akoglu [Pand 07] attempts to estimate wirelengths using
various structural metrics at the pre-placement stage. These metrics are taken from
various works in the ASIC domain and applied to FPGAs. The metrics are as follows:
• Intrinsic Shortest Path Length (ISPL) The ISPL [Kahn 05] between two nodes
is equal to the sum of weights of nets on the shortest path between the two nodes,
where the path does not include the net under consideration. The weight of a
net is related to the number of nodes connected to it. ISPL is supposed to show
a positive correlation with wirelength, since a longer path containing larger nets
would intuitively consume more space.
• Mutual Contraction (MC) MC [Liu 04] for a two-terminal net is calculated
based on the ratio of the weight of the net connecting the two nodes, and the
weights of nets connecting the nodes to outside nodes. Weights are based on net
fanout. MC can be interpreted as the ratio of forces pulling the nodes together
versus the forces pulling them apart. A strong MC indicates that the nodes are
tightly pulled together, implying a shorter wirelength.
• Logic Contraction (LC) For a net connecting a set of nodes N , LC [Liu 05] is
the sum of weights of edges connecting nodes from N to nodes outside N , divided
by the sum of weights of edges connecting nodes from N to other nodes inside N .
The numerator can be interpreted as the external forces pulling the nodes apart,
while the denominator is the sum of internal forces pulling the nodes together.
Therefore, a high LC indicates that the nodes adjacent to the net are being pulled
apart more than they are pulled together, suggesting a larger wirelength.
The metrics above show reasonable correlations with the average post-layout wire-
length of nets in a circuit, but fail to provide sufficient granularity for predicting wire-
lengths of individual nets. These works have shown that there is a great deal of variation
in the placement and routing stages, meaning that early delay and power prediction are
very difficult.
It should be noted that in general, early timing/power estimation works deal with
estimating the absolute values of critical path delay and power consumption. In contrast,
our work deals with applying early estimation to different compilations of the same design
with noise injection. This means that we are concerned with relative values instead.
5.3 Delay Prediction
In this section, we examine possible ways to predict the circuit with the lowest critical
path delay at the post-mapping stage, given a set of circuits compiled with different syn-
thesis/mapping seeds. Our delay prediction model begins by assigning each node (LUT)
a certain delay, then traversing the circuit graph in topological order and computing
the arrival time at each node. We examine numerous timing models, sweeping several
parameters:
• Varying pin delays
• Logic, routing and constant factors
• Maximum/scaled metrics
These parameters will be explained in the following sections.
5.3.1 Varying pin delays
Due to the tree-like structure of the multiplexer in a LUT, the delay from each of its input
pins varies (see Fig. 5.1). The inputs nearest to the output will have shorter delays than
the ones further away. This means that the slowest-arriving inputs should be assigned
to the pins with shorter delays and vice versa. The pin-dependent LUT delay for a LUT
n is modeled as:
logic delay(n) = pin index / total pins    (5.1)
Figure 5.1: Slow and fast inputs on lookup tables.
where pin index is the index of the pin (fastest = 1, slowest = total pins). In our case,
we use total pins = 6 for Stratix III. Based on this, two timing models can be applied:
• Pin utilization model: In this model, we consider the number of LUT input pins
used where the fanin LUT’s logic level is one less than the level of the current LUT.
This gives an idea of how many fanins are competing for the fastest input pin. An
example is shown in Fig. 5.2. This LUT has six fanins, two with a logic level of
5, and four with a logic level of 6. The fastest pin is A, and the slowest is F. It is
assumed that the fanins with level 6 will be assigned the faster input pins, leaving
the slowest pins to the level 5 fanins. It is also assumed that the level 5 fanins are
fast enough that they will not become the critical inputs. Therefore, the critical
input will be the slowest pin coming from a fanin of level 6. In this case, it is pin
D. Since it is the fourth fastest input pin out of six, the LUT is assigned a delay of
4/6.
• Pin order model: This is a more refined version of the pin utilization model.
Figure 5.2: Example for the pin utilization timing model.
Figure 5.3: Example for the pin order timing model.
It attempts to predict the order that fanins will be assigned to pins. It sorts the
fanins by their maximum arrival time, then assigns the slowest ones to the fastest
pins, and vice versa. An example is shown in Fig. 5.3. The maximum delay in this
case is coming from input pin A, meaning that the maximum delay to the LUT
output is 4 + 1/6. In other words, the maximum delay to the LUT output is
max delay = max_i (arrival time_i + i / total pins)    (5.2)
where i is the input pin index. The logic delay is the i/total pins portion of the
above equation.
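The pin order model of Eqn. 5.2 can be sketched as follows. This is an illustrative Python implementation under our own naming, assuming a 6-input LUT as in Stratix III; it is not the thesis code itself.

```python
def pin_order_delay(fanin_arrivals, total_pins=6):
    """Sketch of the pin order timing model (Eqn. 5.2).

    Slower fanins are assigned the faster pins, so we sort arrival
    times in descending order; pin i then adds a delay of
    i / total_pins, and the LUT output arrival time is
    max_i(arrival_i + i / total_pins)."""
    ordered = sorted(fanin_arrivals, reverse=True)
    return max(arr + (i + 1) / total_pins for i, arr in enumerate(ordered))
```

For instance, fanins arriving at times 4, 3, and 3 yield an output arrival of 4 + 1/6, since the slowest fanin is placed on the fastest pin, as in the Fig. 5.3 example.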
5.3.2 Logic, routing and constant factors
The overall delay model for a LUT n can be expressed as the following:
Delay(n) = const + logic factor ∗ logic delay + fanout factor ∗ fanout    (5.3)
The parameters are as follows:
• const: This is a constant value for each LUT, which can be set to 0 or 1. If set
to 1, it represents a unit delay for the LUT. This is the standard metric used for
delay at the technology mapping stage.
• logic factor: This is a scaling factor for the LUT delay (logic delay, calculated
using one of the pin-based timing models above). We examined factors ranging
from 0 to 5, in order to see the effect of the scaling factor on the overall model.
• fanout factor: This is a scaling factor for the fanout of the node, which represents
the routing delay. It ranges from 0 to 5. The fanout was capped to 10 to prevent
very high-fanout nodes from dominating the delay.
By evaluating delay models using all combinations of these parameters, we are able to
test models with high interconnect delay and low logic delay, and vice versa. The ranges
chosen allow us to explore 2 ∗ 6 ∗ 6 = 72 models using these factors.
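The per-LUT delay model of Eqn. 5.3 can be sketched as one function over the swept parameters. The defaults below are one arbitrary point in the swept space (const in {0, 1}, factors in 0..5), and the fanout cap of 10 follows the text; the function name is our own.

```python
def node_delay(logic_delay, fanout, const=1, logic_factor=1,
               fanout_factor=1, fanout_cap=10):
    """Sketch of the per-LUT delay model of Eqn. 5.3.

    Fanout stands in for routing delay and is capped to keep very
    high-fanout nodes from dominating the estimate."""
    return (const + logic_factor * logic_delay
            + fanout_factor * min(fanout, fanout_cap))
```

For example, a LUT with logic delay 0.5 and fanout 20 under const = 1, logic factor = 2, fanout factor = 1 scores 1 + 2 ∗ 0.5 + 1 ∗ 10 = 12.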
5.3.3 Maximum/scaled metrics
The critical path in a circuit is the maximum delay path, which suggests that our model
should simply take the maximum delay in the circuit. However, errors in the model could
mean that the critical path could differ from the one predicted. Therefore, we propose a
scaled model in which we take the sum of the top N max arrival times, each scaled by
an exponentially decaying factor. The scaled delay model for a circuit can be expressed
as
scaled delay model = sum_{i=1..N} max arrival time(node_i) ∗ factor^i    (5.4)
where max arrival time(node_i) is the maximum total delay to reach the node with
the i-th longest arrival time, and 0 < factor < 1. For our experiments, a factor of 0.95
was used, since it decays quickly enough to ignore path endpoints that are highly unlikely
to be critical, yet not so quickly that it considers only a few. N is capped at 100, as the
factor is very small by that point (0.95^100 ≈ 0.006).
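The scaled metric of Eqn. 5.4 can be sketched as follows; the function name is our own, and the defaults mirror the factor of 0.95 and cap of N = 100 stated above.

```python
def scaled_delay(arrival_times, factor=0.95, n_cap=100):
    """Sketch of the scaled delay metric (Eqn. 5.4).

    Sums the top-N maximum arrival times, each weighted by an
    exponentially decaying factor^i, so near-critical endpoints
    still influence the estimate."""
    top = sorted(arrival_times, reverse=True)[:n_cap]
    return sum(t * factor ** (i + 1) for i, t in enumerate(top))
```

For instance, with factor = 0.5, endpoints arriving at 10 and 8 score 10 ∗ 0.5 + 8 ∗ 0.25 = 7.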
5.4 Power Prediction
The dynamic power consumption of a circuit can be calculated by the following equation:
P_dyn = (1/2) ∗ sum_{i=1..n} S_i ∗ C_i ∗ f ∗ V_dd^2    (5.5)
where n is the number of nets in the circuit, S_i is the switching activity of net i,
C_i is the capacitance of net i, f is the frequency of the circuit, and V_dd is the supply
voltage. The switching activity of a net is estimated using a fast gate-level simulation
implemented in ABC, run over 1000 cycles. The simulator can be run in two modes:
• Zero delay, in which the LUTs/wires are assumed to have zero delay (i.e. on each
clock cycle, all signals immediately settle into their final state).
• Unit delay, in which each LUT is assumed to have a delay of one. This allows for
some modeling of glitches.
The overall power model is a product of switching activity and fanout over all nets:
P_est = sum_{i ∈ nets} S_i ∗ FO_i    (5.6)
Fanout is used as a substitute for capacitance (capped as in Section 5.3.2). V_dd and
f from Eqn. 5.5 are ignored since they are constant for each circuit.
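The relative power metric of Eqn. 5.6 reduces to a short sketch; the function name and the (activity, fanout) pair representation are our own, and the fanout cap follows Section 5.3.2.

```python
def estimated_power(nets, fanout_cap=10):
    """Sketch of the relative power metric (Eqn. 5.6).

    Sums switching activity times fanout over all nets; fanout
    substitutes for capacitance and is capped as for delay.
    Vdd and f are dropped since they are constant per circuit."""
    return sum(activity * min(fanout, fanout_cap)
               for activity, fanout in nets)
```

For example, nets with (activity, fanout) of (0.5, 4) and (0.1, 20) score 0.5 ∗ 4 + 0.1 ∗ 10 = 3.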
5.5 Packing
Packing is a stage in the CAD flow in which logic blocks are grouped together. These
groups can then be placed into a larger logic unit (in the Stratix III architecture, these
units are known as Logic Array Blocks or LABs). Grouping related logic blocks together
allows them to be connected by shorter routes (intra-LAB) rather than longer, slower
routes (inter-LAB).
We considered performing prediction at the post-packing stage. This would allow
us to use different timing/power models for intra-LAB and inter-LAB connections. By
doing so, we could potentially form more accurate estimates of delays and capacitances.
However, there is no easy way to enforce a particular packing using the Altera CAD
flow. We can extract packing information after the entire placement and routing stages
have completed, but we cannot guarantee that this packing information would have been
available prior to place and route. This is due to optimizations occurring after technology
mapping, such as physical synthesis [Sing 05]. Therefore, we cannot be certain about how
the circuit has changed after it has left the technology mapping stage.
Unless otherwise stated, all prediction methods used here do not use any packing
information. If packing information is used, it should be interpreted with the caveat that
the packing data was extracted post-routing. This means that any benefits demonstrated
by using packing results should be considered as an upper bound (i.e. optimistic).
5.6 Methodology
We ran our prediction methods on the noise-injected circuits from Section 4.3. Each of
the 27 benchmark designs was synthesized and mapped with different seeds to create 25
mapped candidates. Each candidate was placed and routed with Quartus using 5 different
seeds. The timing and power results were obtained in the same way as in Section 4.3.
The results were averaged across the 5 placement seeds to obtain a representative result
for each mapped candidate circuit.
The prediction algorithms were all implemented in ABC [Berk 06]. For each mapped
candidate circuit, all prediction algorithms were run sweeping all parameter combina-
tions. These include (for delay):
• 2 LUT delay models (pin utilization model, pin order model)
• 2 constant factors (Section 5.3.2)
• 6 logic factors (Section 5.3.2)
• 6 fanout factors (Section 5.3.2)
• 2 metrics (Section 5.3.3: maximum/scaled)
Therefore, a total of 2 × 2 × 6 × 6 × 2 = 288 models can be used for delay. However,
due to redundancies or invalid models (e.g. all factors set to 0), 241 models were tested.
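The sweep can be sketched with a Cartesian product. The concrete factor values below are placeholders for the six logic and fanout factors of Section 5.3.2, and only the clearly invalid all-zero case is filtered; the further redundancy checks that reduce the tested set to 241 are not reproduced here:

```python
import itertools

# Placeholder parameter values standing in for those of Section 5.3.2.
lut_models     = ["pin utilization", "pin order"]
const_factors  = [0, 1]
logic_factors  = [0, 1, 2, 3, 4, 5]
fanout_factors = [0, 1, 2, 3, 4, 5]
metrics        = ["maximum", "scaled"]

combos = list(itertools.product(lut_models, const_factors,
                                logic_factors, fanout_factors, metrics))
# 2 * 2 * 6 * 6 * 2 = 288 candidate models.

# Drop models with all factors set to zero (every path would have zero
# delay); 284 combinations remain before the redundancy checks.
valid = [c for c in combos
         if not (c[1] == 0 and c[2] == 0 and c[3] == 0)]
```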
For power, six models were used, three each for the zero and unit delay simulations:
• No packing data used. All nets are considered equally.
• Only inter-LAB routes considered (using packing data). This model only uses the
inter-LAB nets for the summation in Eqn. 5.6. The intra-LAB power is considered
to be zero.
• Only intra-LAB routes considered (using packing data). This model only uses the
intra-LAB nets for the summation in Eqn. 5.6. The inter-LAB power is considered
to be zero.
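The three variants amount to evaluating Eqn. 5.6 over different subsets of nets. A sketch, assuming each net is tagged with an inter-LAB flag derived from the (post-routing) packing data:

```python
def power_models(nets):
    """nets: list of (switching_activity, fanout, is_inter_lab)
    tuples, the inter-LAB flag coming from post-routing packing data.
    Returns the three Eqn. 5.6 variants: all nets, inter-LAB only
    (intra-LAB power taken as zero), and intra-LAB only."""
    def pest(selected):
        return sum(s * fo for s, fo, _ in selected)
    return (pest(nets),
            pest([n for n in nets if n[2]]),
            pest([n for n in nets if not n[2]]))
```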
5.7 Results
Fig. 5.4 shows the probability of finding the best circuit implementation in terms of
delay (or one within a 0.1% margin of it). The top three models (based on the ranking
of the top predicted circuit for each design) are shown, along with two simple models:
“Routing only” (fanout factor = 1 in Eqn. 5.3, all others 0, scaled) and “Logic depth
only” (const = 1, all others 0, scaled). The top models were chosen by taking the best
predicted circuit for each design and summing their actual ranking (best=1, worst=25).
The models with the lowest sums were chosen as the best ones. Designs with low swing
(no circuit with more than 1.5% deviation from the average) were ignored. This was done
because there is little benefit to be obtained from predicting those designs anyway. The
problem of distinguishing between circuits with such small differences presents a large
challenge for little gain. 20 of 27 designs were considered for delay, and 23 of 27 were
considered for power. For details, the interested reader may consult Appendix A. For
the delay portion of this study, the designs were split into two halves: one as a training
Figure 5.4: Probability of finding the top circuit vs. percentage of top modeled circuits considered (delay).
set to find the best timing models, and the other as a test set to evaluate them. This
was not done for power due to the small number of models considered. The x-axis shows
the percentage of the top circuits (by model score) considered, while the y-axis shows
the probability of finding the actual top circuit within this group. The legend shows the
timing model used.
For example, the model “Pin order, const=0, logic=1, routing=3, scaled” refers to
the pin order model (Section 5.3.1) with (const = 0, logic factor = 1, fanout factor =
3) as the factor settings (Section 5.3.2), and the exponentially decaying scaling factor
(Section 5.3.3). Using this model, if we take the top 10% of circuits (according to the
model) we have approximately a 40% chance of selecting the best one.
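The probability plotted in Fig. 5.4 can be sketched as follows, assuming per-design lists of model scores and post-routing delays (lower is better for both) and the 0.1% margin noted above:

```python
def capture_probability(designs, top_frac=0.10):
    """designs: one (model_scores, actual_delays) pair per design,
    lower is better in both lists.  Returns the fraction of designs
    whose actual best circuit (within a 0.1% margin) appears among the
    top `top_frac` of circuits as ranked by the model."""
    hits = 0
    for predicted, actual in designs:
        k = max(1, round(top_frac * len(predicted)))
        top = sorted(range(len(predicted)), key=lambda i: predicted[i])[:k]
        best = min(actual)
        hits += any(actual[i] <= best * 1.001 for i in top)
    return hits / len(designs)
```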
The simplest models (Routing only, Logic depth only) both used the “scaled” metric
(Section 5.3.3). The results show that these models were not as good as the others. Using
the top 10% of predicted circuits, we have only a 10-20% probability of selecting the best
Figure 5.5: Percentile of predicted top circuits (delay).
ones. However, the “Routing only, scaled” model improves to 90% probability when the
top 40% of predicted circuits are used. This shows the importance of considering routing
delay.
Another viewpoint of the results can be taken by considering a scenario where a
hardware engineer performs synthesis and mapping using several seeds, and selects the
best 10% of circuits based on the prediction metric (in our case, we select two candidate
circuits per design). The intent is to allow the hardware engineer to pass only those two
circuits through placement and routing, then pick the better one. The question we would
like to answer is: if we pass all the circuits through placement and routing and rank them
according to delay, into what percentile would the predicted best circuit fall (best=100,
worst=0)? Fig. 5.5 shows the distribution of the actual percentile of the top circuits
predicted by the model. Each bin extends to the right (e.g. the 90 bin covers 90-99).
The 100 bin covers 100 only. For example, using the same model as above (Pin order,
const = 0, logic = 1, routing = 3, scaled), if we pick the top circuit (according to the
Table 5.1: Average percentile of top circuits with isolated model parameters.

Model                         Average percentile of top predicted circuit
Baseline (max logic depth)    50
+ scaled                      52
+ scaled, pin util.           52
+ scaled, pin order           67
+ scaled, routing             54
model), it is in the highest percentile bin (100) of actual circuits for 4 out of 10 designs.
If the prediction was completely random, we would expect a flatter curve. However, since
the results tend towards the higher end of the scale (to the right), it appears that these
prediction metrics have value.
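The percentile statistic of Fig. 5.5 (best = 100, worst = 0) can be sketched as:

```python
def percentile_of_predicted_best(predicted, actual):
    """Percentile of the model's top pick among all post-routing
    results (best = 100, worst = 0); lower values are better in both
    input lists."""
    pick = min(range(len(predicted)), key=lambda i: predicted[i])
    worse = sum(1 for d in actual if d > actual[pick])
    return 100.0 * worse / (len(actual) - 1)
```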
The variety of parameters among the top three models suggests that there is no single
best setting for the pin models, scaling factors, and max/scaled metrics. However,
these models cannot be expected to be accurate at a very fine-grained level of detail,
distinguishing between circuits that differ by only a few percent.
Furthermore, different circuits and FPGA architectures may have different preferred
models. Despite this, it seems that a crude early estimation technique is still enough to
extract some information. For all models, if we take the best predicted circuit for each
design, the average normalized critical path delay (after place-and-route) is less than 1.
This means that the models have some ability to predict the better circuits in a group.
To examine the effect of each model parameter, we attempted to evaluate them in
isolation. Table 5.1 shows the average percentile of the single top circuit as predicted
by the baseline model (i.e. a max logic depth model – since the circuits for one design
all have the same logic depth, this model is essentially random). A value of 100 would
correspond to the best circuit. As expected, this model’s average is the 50th percentile
– no better than a random prediction. Adding the exponentially decaying scaling factor
improves this to 52, while adding the pin utilization, pin order, and routing models
increases the average to 52, 67, and 54, respectively. Although these numbers are not overwhelmingly positive, the
improvement is still encouraging as they tend towards 100 (better circuits) rather than
Figure 5.6: Probability of finding the top circuit vs. percentage of top modeled circuits considered (power).
0 (worse circuits).
Figs. 5.6 and 5.7 show analogous results for power. “Zero delay” and “Unit delay”
represent the zero delay and unit delay simulation models. Results are also shown con-
sidering the use of intra-LAB and inter-LAB connections (using packing information).
In some cases, the use of inter-LAB connection information is valuable (as these connec-
tions have higher capacitance/power), as is the use of unit delay instead of zero delay. In
particular, using the top 10% of predicted circuits using the unit delay, inter-LAB model
(considering only inter-LAB connections), the probability of capturing the top circuit is
over 40%. In contrast, the zero delay model using only intra-LAB connections is the
weakest, as it has no consideration of glitches or the inter-LAB connections which have
higher capacitance. However, the results are still satisfactory without packing informa-
tion. Aside from one circuit, the models always predicted above the 40th percentile in
Fig. 5.7. In general, power prediction was more successful than delay prediction since
the critical delay is based on a single path, while power is based on the entire circuit,
thus reducing the impact of any errors.
Figure 5.7: Percentile of predicted top circuits (power).
Table 5.2: Average benefit of prediction models

Delay
Model              % improv.   % improv. (full)   % des. improv.
Best prediction    1.3         1.8                75
Logic depth only   0.6         0.3                50
Fanout only        0.9         0.9                60

Power
Model              % improv.   % des. improv.
Best prediction    1.8         83
Unit delay         1.4         91
Zero delay         1.1         87
Table 5.2 shows the average delay/power savings from our prediction models if the
top 10% predicted best circuits are carried forward to placement and routing. Looking at
the “% improv.” column, we see that the best prediction models (“Pin order, const=0,
logic=1, routing=3, scaled” for delay, “Unit delay, inter-LAB” for power) give an average
benefit of 1.3% in delay and 1.8% in power, more than the simpler models of logic depth
and fanout. These numbers were computed separately – the best circuit for delay is
not necessarily the best circuit for power. It should be noted that these results were
generated using relatively small training and test sets (for delay). If we instead use
the entire benchmark set for both training and test, we can obtain improvements of up
to 1.8% in delay (column “% improv. (full)”). The last column shows the percentage
of designs that showed any amount of improvement. The delay predictions achieve an
improvement on up to 75% of designs, while the power predictions improve up to 91% of
designs. In general, power prediction is more accurate, as we saw that power noise has
greater swing and so has more obvious differences between designs. On the other hand,
delay has less swing and depends on minute differences on a single critical path, making
prediction more difficult.
5.8 Conclusion
In this chapter, we presented the problem of timing and power prediction at the post-
mapping stage in the presence of noise. For delay prediction, a model was proposed
using varying LUT input pin delays, logic and routing delay scaling factors, and a scaled
delay metric. For power prediction, a model was proposed using a fast zero delay or unit
delay simulation accounting for glitches, as well as considering LUT packing information.
The results showed that the prediction models were successful in distinguishing between
different noise-injected circuits. When applied to designs with over 1.5% variation in
delay and power, the best prediction models averaged 1.3-1.8% lower critical path delay
and 1.8% lower dynamic power, as computed post-routing.
Chapter 6
Conclusions
6.1 Summary
This thesis has dealt with two topics in the area of FPGA CAD. The contributions to
these topics are as follows:
6.1.1 Glitch Reduction
1. We investigated glitch power in FPGAs, power which is consumed by unnecessary
signal transitions. An analysis of glitches in commercial FPGAs was presented,
showing that an average of 26.0% of dynamic power is due to glitches. We also
showed that don’t-cares make up 15.1% of LUT input states, motivating a don’t-
care based approach to glitch reduction.
2. We demonstrated a new glitch reduction algorithm making use of don’t-cares in
logic functions. It operates at the post-routing stage and has no area or performance
overhead. The algorithm sets the configuration bits of LUTs which correspond to
don’t-cares, in a manner that reduces unnecessary transitions on the LUT’s output.
The algorithm reduced glitch power by an average of 13.7%, and dynamic power
by 4.0%.
6.1.2 CAD Algorithm Noise
1. We investigated the concept of random noise in CAD algorithms: the variation in
circuit quality caused by arbitrary selections among two or more options with the
same cost. Random noise was injected into the logic synthesis and technology
mapping stages of ABC, exposing the arbitrary choices made by the algorithms.
We presented data on the effects of this noise on overall circuit performance and
power. Under the influence of synthesis noise, standard deviations of critical path
delay and dynamic power were 1.8% and 2.7%, respectively, while the results for
technology mapping were 0.9% and 1.4%, respectively. Under the influence of noise
in all CAD stages, the standard deviations were 3.3% in delay and 3.7% in power.
2. Early timing and power analysis techniques were presented to find the best cir-
cuits from a pool of mapped candidate circuits generated using different random
seeds. For delay prediction, a model was proposed using varying LUT input pin
delays, logic and routing delay scaling factors, and a scaled delay metric. For power
prediction, a model was proposed using a fast zero delay or unit delay simulation
accounting for glitches, as well as considering LUT packing information. When
applied to circuits with over 1.5% swing in quality, the early estimation techniques
found circuits that were 1.3-1.8% better in speed and 1.8% better in power, on
average.
6.2 Future Work
6.2.1 Glitch Reduction
1. Although the glitch reduction algorithm runs fairly quickly for the MCNC circuits
tested (a few seconds), runtime is a potential concern for larger circuits. This is
due to two main factors: the runtime of the SAT instances to find don’t-cares, and
the iterative flow involving timing simulation. A potential solution is to integrate
the simulator and the glitch reduction algorithm more tightly, such that changes to
LUTs can be incorporated into the simulation on the fly.
2. Another similar approach to glitch reduction could be taken before the placement
and routing stages, using a simpler simulation model (using unit delays per LUT,
for example). We actually tested this approach and found that it worked poorly
due to the lack of accurate timing information. However, such an approach would
have the benefit of having more degrees of freedom from being at an earlier stage of
the flow – instead of only being able to change LUT configuration bits, the mapping
of the circuit could be changed (for example). Further work would be necessary to
allow the algorithm to work well despite the lack of accurate timing data.
3. Don’t-care-based optimizations could also be applied to static power. Previous
work has demonstrated that SRAM bit polarities can affect static power dissipa-
tion [Ande 06]. The settings of the SRAM bits to logic-0 or 1 can affect the number
of leakage paths in the circuit. By using the freedom given by don’t-cares, bits may
be flipped in order to eliminate these leakage paths. An additional challenge would
be to integrate this technique with the dynamic power optimization in this work.
6.2.2 CAD Algorithm Noise
1. The work presented has focused primarily on noise sources in synthesis and map-
ping. However, there are many more CAD algorithms to explore, different imple-
mentations, and more noise sources. Packing, placement, and routing are several
other CAD stages that could be investigated.
2. The early timing and power estimation methods can still be further refined. One
possible approach is to try to take advantage of the fact that we are comparing
the quality of circuits relative to one another, not in absolute terms. This might
be done by locating similarities and differences between the circuits, rather than
computing the quality of each circuit individually.
3. The early prediction models could be applied to the upstream CAD algorithms in
order to improve their quality of results. For example, the pin-based delay models
could be implemented in technology mapping as a refinement of the logic-depth-
based delay model which is typically used.
Appendix A
Circuit Delay/Power Statistics With
Noise
A.1 Individual Design Noise Results
Table A.1 shows the mean and standard deviation of each design’s critical path delay.
The “Min. norm. delay” column indicates the minimum circuit delay, normalized to that
design’s average. All numbers are absolute values taken from the Quartus static timing
analyzer, using the methodology described in Section 4.3. The shaded rows indicate the
designs with minimum values more than 1.5% less than the average (i.e. 0.985 or less).
These designs were used as candidates for early delay estimation.
Table A.2 shows a similar table for dynamic power. The circuits were simulated at
10 MHz to meet timing constraints by a wide margin. A few designs have notably low
minimum normalized values, which can be attributed to their low power consumption (a
small change accounts for a larger percentage).
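The candidate-selection rule can be sketched as follows, assuming a per-design list of post-routing delay or power values across the noise seeds:

```python
def candidate_designs(results, threshold=0.985):
    """results: {design_name: list of post-routing values across seeds}.
    A design is kept only if its minimum value, normalized to the
    design's mean, is at least 1.5% below average (i.e. <= 0.985)."""
    keep = []
    for name, vals in results.items():
        mean = sum(vals) / len(vals)
        if min(vals) / mean <= threshold:
            keep.append(name)
    return keep
```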
Table A.1: Critical path delay statistics by circuit

Circuit                          Mean crit. delay (ns)   Stdev. delay   Min. norm. delay
alu4                             7.767                   0.002          0.990
apex2                            8.705                   0.008          0.971
apex4                            8.348                   0.006          0.977
bigkey                           1.260                   0.003          0.960
cf_cordic_v_18_18_18             2.787                   0.001          0.976
cf_fir_24_16_16                  5.832                   0.004          0.988
clma                             5.963                   0.003          0.984
des                              8.864                   0.013          0.976
des_perf                         1.960                   0.001          0.988
diffeq                           3.407                   0.005          0.953
dsip                             1.364                   0.001          0.979
elliptic                         4.544                   0.014          0.972
ex1010                           9.744                   0.003          0.990
ex5p                             8.261                   0.006          0.980
frisc                            5.463                   0.011          0.963
mac2                             8.541                   0.004          0.987
misex3                           7.373                   0.002          0.982
oc54_cpu                         8.395                   0.024          0.967
paj_raygentop_hierarchy_no_mem   6.565                   0.004          0.982
pdc                              9.495                   0.003          0.989
s298                             4.990                   0.019          0.960
s38417                           3.660                   0.006          0.951
s38584_1                         3.055                   0.003          0.959
seq                              8.236                   0.007          0.979
spla                             9.371                   0.005          0.989
sv_chip1_hierarchy_no_mem        3.962                   0.002          0.981
tseng                            3.038                   0.004          0.966
Table A.2: Dynamic power statistics by circuit

Circuit                          Mean dynamic power (mW)   Stdev. power   Min. norm. power
alu4                             2.912                     0.002          0.978
apex2                            2.746                     0.003          0.973
apex4                            2.012                     0.001          0.970
bigkey                           2.465                     0.000          0.993
cf_cordic_v_18_18_18             11.030                    0.009          0.985
cf_fir_24_16_16                  143.091                   0.671          0.992
clma                             4.391                     0.007          0.952
des                              7.552                     0.026          0.960
des_perf                         15.765                    0.020          0.989
diffeq                           0.066                     0.000          0.936
dsip                             2.672                     0.001          0.979
elliptic                         1.900                     0.002          0.947
ex1010                           4.061                     0.003          0.976
ex5p                             2.531                     0.003          0.964
frisc                            0.897                     0.003          0.831
mac2                             39.726                    0.168          0.990
misex3                           2.490                     0.001          0.971
oc54_cpu                         9.880                     0.018          0.972
paj_raygentop_hierarchy_no_mem   28.555                    0.328          0.973
pdc                              4.524                     0.004          0.973
s298                             1.637                     0.002          0.941
s38417                           2.606                     0.001          0.981
s38584_1                         4.227                     0.001          0.984
seq                              2.753                     0.003          0.961
spla                             4.523                     0.004          0.977
sv_chip1_hierarchy_no_mem        8.708                     0.011          0.982
tseng                            0.043                     0.000          0.930
Bibliography
[Altea] Altera. “AN 584: Timing closure methodology for advanced FPGA designs”. http://www.altera.com/literature/an/an584.pdf.
[Alteb] Altera. “Stratix III device handbook”. http://www.altera.com/literature/lit-stx3.jsp.
[Altec] Altera. “Stratix V device handbook”. http://www.altera.com/literature/lit-stratix-v.jsp.
[Ande 06] J. Anderson and F. Najm. “Active leakage power optimization for FPGAs”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 25, No. 3, pp. 423–437, March 2006.
[Berk 06] Berkeley Logic Synthesis and Verification Group. “ABC: A system for sequential synthesis and verification”. Release 00406. http://www.eecs.berkeley.edu/~alanmi/abc/.
[Betz 97] V. Betz and J. Rose. “VPR: A new packing, placement and routing toolfor FPGA research”. In: Proceedings of the 7th International Workshop onField-Programmable Logic and Applications, pp. 213–222, Springer-Verlag,London, UK, 1997.
[Chen 07a] D. Chen, J. Cong, Y. Fan, and Z. Zhang. “High-level power estimation and low-power design space exploration for FPGAs”. In: Design Automation Conference, 2007. ASP-DAC ’07. Asia and South Pacific, pp. 529–534, Jan. 2007.
[Chen 07b] L. Cheng, D. Chen, and M. Wong. “GlitchMap: An FPGA technology mapper for low power considering glitches”. In: Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE, pp. 318–323, 2007.
[Czaj 07] T. S. Czajkowski and S. D. Brown. “Using negative edge triggered FFsto reduce glitching power in FPGA circuits”. In: Proceedings of the 44thannual Design Automation Conference, pp. 324–329, ACM, New York, NY,USA, 2007.
[Dinh 09] Q. Dinh, D. Chen, and M. D. Wong. “A routing approach to reduce glitchesin low power FPGAs”. In: Proceedings of the 2009 international symposiumon Physical design, pp. 99–106, ACM, New York, NY, USA, 2009.
[Fisc 05] R. Fischer, K. Buchenrieder, and U. Nageldinger. “Reducing the power consumption of FPGAs through retiming”. In: Engineering of Computer-Based Systems, 2005. ECBS ’05. 12th IEEE International Conference and Workshops on the, pp. 89–94, Apr. 2005.
[Fran 10] S. Franssila. Introduction to Microfabrication. John Wiley & Sons, 2010.
[Gort 10] M. Gort and J. Anderson. “Deterministic multi-core parallel routing for FPGAs”. In: Field-Programmable Technology (FPT), 2010 International Conference on, pp. 78–86, Dec. 2010.
[Kahn 05] A. Kahng and S. Reda. “Intrinsic shortest path length: a new, accurate a priori wirelength estimator”. In: Computer-Aided Design, 2005. ICCAD-2005. IEEE/ACM International Conference on, pp. 173–180, Nov. 2005.
[Kirk 83] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. “Optimization by simulatedannealing”. Science, Vol. 220, No. 4598, pp. 671–680, 1983.
[Kuon 07] I. Kuon and J. Rose. “Measuring the gap between FPGAs and ASICs”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 26, No. 2, pp. 203–215, Feb. 2007.
[Lamo 08] J. Lamoureux, G. Lemieux, and S. Wilton. “GlitchLess: Dynamic power minimization in FPGAs through edge alignment and glitch filtering”. Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, Vol. 16, No. 11, pp. 1521–1534, Nov. 2008.
[Lim 05] H. Lim, K. Lee, Y. Cho, and N. Chang. “Flip-flop insertion with shifted-phase clocks for FPGA power reduction”. In: ICCAD ’05: Proceedings ofthe 2005 IEEE/ACM International conference on Computer-aided design,pp. 335–342, IEEE Computer Society, Washington, DC, USA, 2005.
[Lin 95] B. Lin and S. Devadas. “Synthesis of hazard-free multilevel logic under multiple-input changes from binary decision diagrams”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 14, No. 8, pp. 974–985, Aug. 1995.
[Liu 04] Q. Liu and M. Marek-Sadowska. “Pre-layout wire length and congestion estimation”. In: Design Automation Conference, 2004. Proceedings. 41st, pp. 582–587, Jul. 2004.
[Liu 05] Q. Liu and M. Marek-Sadowska. “Pre-layout physical connectivity prediction with application in clustering-based placement”. In: Computer Design: VLSI in Computers and Processors, 2005. ICCD 2005. Proceedings. 2005 IEEE International Conference on, pp. 31–37, Oct. 2005.
[Loke 10] C. Loken, D. Gruner, L. Groer, R. Peltier, N. Bunn, M. Craig, T. Henriques,J. Dempsey, C.-H. Yu, J. Chen, L. J. Dursi, J. Chong, S. Northrup, J. Pinto,
N. Knecht, and R. V. Zon. “SciNet: Lessons learned from building a power-efficient Top-20 system and data centre”. Journal of Physics: Conference Series, Vol. 256, No. 1, p. 012026, 2010.
[Mano 07] V. Manohararajah, G. Chiu, D. Singh, and S. Brown. “Predicting intercon-nect delay for physical synthesis in a FPGA CAD flow”. Very Large ScaleIntegration (VLSI) Systems, IEEE Transactions on, Vol. 15, No. 8, pp. 895–903, Aug. 2007.
[Mish 05] A. Mishchenko and R. Brayton. “SAT-based complete don’t-care computation for network optimization”. In: ACM/IEEE Design Automation and Test Conference, pp. 412–417, 2005.
[Mish 06a] A. Mishchenko and R. Brayton. “Scalable logic synthesis using a simplecircuit structure”. In: Proc. International Workshop on Logic and Synthesis,pp. 15–22, 2006.
[Mish 06b] A. Mishchenko, S. Chatterjee, and R. Brayton. “DAG-aware AIG rewriting: a fresh look at combinational logic synthesis”. In: Design Automation Conference, 2006 43rd ACM/IEEE, pp. 532–535, 2006.
[Mish 06c] A. Mishchenko, S. Chatterjee, R. Brayton, and N. Een. “Improvementsto combinational equivalence checking”. In: Proceedings of the 2006IEEE/ACM international conference on Computer-aided design, pp. 836–843, ACM, New York, NY, USA, 2006.
[Mish 07] A. Mishchenko, S. Cho, S. Chatterjee, and R. Brayton. “Combinational and sequential mapping with priority cuts”. In: Computer-Aided Design, 2007. ICCAD 2007. IEEE/ACM International Conference on, pp. 354–361, Nov. 2007.
[Mish 09] A. Mishchenko, R. Brayton, J.-H. R. Jiang, and S. Jang. “Scalable don’t-care-based logic optimization and resynthesis”. In: Proceedings of theACM/SIGDA international symposium on Field programmable gate arrays,pp. 151–160, ACM, New York, NY, USA, 2009.
[Mish 11] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets. “Delay optimizationusing SOP balancing”. In: Proc. International Workshop on Logic and Syn-thesis, pp. 75–82, 2011.
[Pand 07] A. Pandit and A. Akoglu. “Wirelength prediction for FPGAs”. In: Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pp. 749–752, Aug. 2007.
[Rubi 11] R. Y. Rubin and A. M. DeHon. “Timing-driven pathfinder pathology andremediation: quantifying and reducing delay noise in VPR-pathfinder”. In:Proceedings of the 19th ACM/SIGDA international symposium on Field pro-grammable gate arrays, pp. 173–176, ACM, New York, NY, USA, 2011.
[Shan 02] L. Shang, A. S. Kaviani, and K. Bathala. “Dynamic power consumption inVirtex-II FPGA family”. In: Proceedings of the 2002 ACM/SIGDA tenthinternational symposium on Field-programmable gate arrays, pp. 157–164,ACM, New York, NY, USA, 2002.
[Shum 11] W. Shum and J. H. Anderson. “FPGA glitch power analysis and reduction”.In: Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design, pp. 27–32, IEEE Press, Piscataway, NJ, USA,2011.
[Sing 05] D. Singh, V. Manohararajah, and S. Brown. “Two-stage physical synthesis for FPGAs”. In: Custom Integrated Circuits Conference, 2005. Proceedings of the IEEE 2005, pp. 171–178, Sept. 2005.
[Wilt 04] S. J. Wilton, S.-S. Ang, and W. Luk. “The impact of pipelining on energyper operation in field-programmable gate arrays”. In: Proc. Intl. Conf. onField-Programmable Logic and its Applications, pp. 719–728, 2004.
[Xili] Xilinx. “7 Series FPGAs overview”. http://www.xilinx.com/support/documentation/7_series.htm.