INVITED PAPER
Techniques for Fast Physical Synthesis
Fast, efficient buffer design, logic transformations, and clustering components
for placement are some of the techniques being used to reduce design
turnaround for large, complex chips.
By Charles J. Alpert, Fellow IEEE, Shrirang K. Karandikar, Zhuo Li, Member IEEE,
Gi-Joon Nam, Member IEEE, Stephen T. Quay, Haoxing Ren, Member IEEE,
C. N. Sze, Member IEEE, Paul G. Villarrubia, and Mehmet C. Yildiz, Member IEEE
ABSTRACT | The traditional purpose of physical synthesis is to
perform timing closure, i.e., to create a placed design that
meets its timing specifications while also satisfying electrical,
routability, and signal integrity constraints. In modern design
flows, physical synthesis tools hardly ever achieve this goal in
their first iteration. The design team must iterate by studying
the output of the physical synthesis run, then potentially
massage the input, e.g., by changing the floorplan, timing
assertions, pin locations, logic structures, etc., in order to
hopefully achieve a better solution for the next iteration. The
complexity of physical synthesis means that systems can take
days to run on designs with millions of placeable objects,
which severely hurts design productivity. This paper discusses
some newer techniques that have been deployed within IBM’s
physical synthesis tool, PDS [1], that significantly improve
throughput. In particular, we focus on some of the biggest
contributors to runtime: placement, legalization, buffering, and
electrical correction, and present techniques that generate
significant turnaround time improvements.
KEYWORDS | Circuit optimization; circuit synthesis; CMOS
integrated circuits; design automation
I. INTRODUCTION
Physical synthesis has emerged as a critical component of
modern design methodologies. The primary purpose of
physical synthesis is to perform timing closure. Several
technology generations ago, back when wire delay was
insignificant, synthesis provided an accurate picture of the
timing of the design. However, technology scaling has
caused wire delay to continue to increase relative to gate
delay. Consequently, a design that meets timing require-
ments in synthesis likely will not close once its physical
footprint is realized, due to the wire delays. The purpose of
physical synthesis is to place the design, recognize the delays
and signal integrity issues introduced by the wiring, and fix
the problems. It may also need to locally resynthesize
pieces of the design that no longer meet timing con-
straints. That new logic then needs to be placed, which causes
iterations between synthesis and placement, until hope-
fully the design closes on timing.
Unfortunately, more often than not, the design will not
close on timing without manual designer intervention.
Perhaps the designer needs to modify the floorplan or re-
structure certain sets of paths. This causes the designer to
iterate between manual design work and automatic
physical synthesis. The turnaround time of the physical
design stage critically depends on the efficiency (and
quality) of the physical synthesis system. On large multi-
million ASIC parts, physical synthesis can take several days
to complete, even on the best hardware available. This
trend is only getting worse as designs seem to scale faster
than the hardware improves to optimize them. While
hierarchical or system on a chip (SoC) methodologies can
be used to handle the large complexities, performing
timing closure on a flat part is always preferable if at all
possible [2], since it avoids all the complexities of
hierarchical design.
Of course, there are many newer challenges that the
physical system needs to handle besides traditional timing
closure. Some examples include lowering power using a
Manuscript received March 8, 2006; revised October 20, 2006.
C. J. Alpert, S. K. Karandikar, Z. Li, G.-J. Nam, and C. N. Sze are with the IBM Austin
Research Laboratory, Austin, TX 78758 USA (e-mail: [email protected];
[email protected]; [email protected]; [email protected]; [email protected]).
S. T. Quay, H. Ren, P. G. Villarrubia, and M. C. Yildiz are with the IBM Corporation,
Austin, TX 78758 USA (e-mail: [email protected]; [email protected];
[email protected]; [email protected]).
Digital Object Identifier: 10.1109/JPROC.2006.890096
Vol. 95, No. 3, March 2007 | Proceedings of the IEEE 573
0018-9219/$25.00 © 2007 IEEE
technology library with multiple threshold voltages, fixing
noise violations that show up after performing routing,
and handling the timing variability and uncertainty
introduced by modern design processes. Inserting techni-
ques for analysis and optimization of these more complex
problems only adds to the runtime of the entire system.
Thus, the turnaround time for the core system needs to be
as fast as possible. This work surveys some of the recent
techniques introduced into PDS [1] to improve turnaround time.
A. Buffering Trends
Much of the paper focuses on innovation in buffering
techniques since buffering is perhaps the most important
challenge for physical synthesis as it moves beyond 90 nm
technologies. As technology scales, wires become thinner
which causes their resistance to increase. The result is that
wire delays increasingly dominate gate delays, and the
problem only becomes worse with each advance from the
65 to the 45 to the 32 nm node. Saxena et al. [3] predict a
"buffering explosion" whereby over half of all the logic
will consist of buffers, which are essentially performing no
useful computation: they are merely helping move sig-
nals from one part of the chip to another. Even in today’s
90 nm designs, we commonly see 20%–25% of the logic
consisting of buffers and/or inverters; some of the larger
designs that PDS optimizes end up with over a million
buffers.
Given these trends, there are several challenges to
achieve a fast and effective physical synthesis result.
1) One has to be able to perform buffer insertion
incredibly quickly. If one is going to insert over a
million buffers and then may have to rip them up
and redo trees to improve timing and routability,
the underlying algorithm must be efficient.
2) Area and power are big concerns. Smart floor-
planning and logic coding from the designer can
help mitigate the buffering effects, but still, one
should insert buffers to minimize both total area
(so that they can be easily incorporated into the
design) and power.
3) Buffering algorithms need to understand where
the free space is in the layout to be effective and
not overfill areas that cannot handle the buffers.
Some methodologies invoke buffer block plan-
ning to drive buffer locations to preallocated
areas (e.g., [4], [5]).
4) Buffering constricts or seeds global routing.
Because the distance between buffers continues
to decrease, a long net may have perhaps ten
stages of buffering to get from point A to point B.
The locations of those ten buffers force the global
router to route from A to the first buffer, then
from the first buffer to the second buffer, etc.,
instead of finding the best direct route from A
to B. Essentially, the routing problem is pushed
up to be handled by buffering. A good buffering
solution can make the global router's job easy,
while a bad one makes it more difficult.
B. Major Phases of Physical Synthesis
The authors of [1] present seven primary stages to PDS:
1) initial placement and optimization;
2) timing-driven placement and optimization;
3) timing-driven detailed placement;
4) optimization techniques;
5) clock insertion and optimization;
6) routing and post-routing optimization;
7) early-mode timing optimization.
Before running physical synthesis, at the very least one
should achieve timing closure with a zero wireload (ZWL)
timing model. If one cannot close on the design with ZWL,
then one certainly will not be able to once the design is
realized physically. In fact, since wire delays are increasingly
significant, a ZWL model may be hopelessly optimistic, and
a designer may want to achieve timing closure
with a more pessimistic model. As examples, one could
multiply each gate delay by a constant factor and/or use a
linear optimally buffered delay model for logic that is re-
stricted by the designer’s floorplan. Thus, before proceed-
ing with physical synthesis, the designer should iterate on
the architecture, synthesis, and floorplan to achieve a
closed design from some type of physically ignorant timing
model so that the design is in a reasonably good state.
Similarly, if one cannot close on the timing before
clock insertion and routing, then it is unlikely one will be
able to close after these steps. Thus, the first four stages of
the above flow can be considered the core physical
synthesis operations. The designer will typically iterate
with physical synthesis runs in this part of the flow before
proceeding to steps 5, 6, and 7. Hence, the focus for this
paper will be on the first four stages.
The purpose of initial placement (e.g., [6]–[8], mFar
[9], [10]) is to place the cells such that they do not overlap
with each other or existing fixed objects from the
floorplan. At this point, the timing of the design will
have degraded completely from the ZWL timing due to the
introduction of long wires. The optimization steps then
buffer and repower the design so that the timing looks
quite reasonable. From the timing analysis, one can then
draw conclusions as to which nets must be shortened by
placement and which ones need not be, or, for that matter, could
even afford to be longer.
The purpose of timing-driven placement (e.g., [11]–[13])
is to use timing analysis to drive the placement to achieve
a good timing result at the possible expense of wire-
length. Probably the easiest (and certainly fastest) way to
achieve this is to perform net weighting [14]–[16] whereby
the nets which need to be shorter are assigned a high
weight, and the nets that can afford to be longer are
assigned a low weight. Any placement algorithm can be
modified to handle net weights. For example, a net with
integer weight n can be replaced with n identical nets of
weight one. The problem of coming up with a good map-
ping of nets to weights is a difficult problem. The
approach of Pan et al. [15] is advantageous in that it
figures out which nets can influence the most possible crit-
ical paths and gives these nets higher weight; since nets
which are both influential and negative are emphasized,
the wirelength degradation from timing is minimized.
The mechanism with which the given placer handles net
weights certainly affects the quality; a particular net
weighting algorithm may work splendidly with Placer A
but not with Placer B.
In general, net weighting actually causes
wirelength in the design to increase, though it will cause
the timing to be significantly better. After net weighting,
the entire placement is performed again from scratch,
though repowering levels and buffering structures may
remain from the previous phase. Once again, the timing
picture will look quite grim immediately after placement
due to new long wires. Another round of buffering and
repowering optimizations can then be applied to get the
timing into reasonable shape for the next phase.
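The weight-expansion trick described above (a net of integer weight n is treated as n identical weight-one nets, so any placer that understands plain nets becomes net-weight aware) can be sketched as follows. The net representation here is illustrative, not the PDS data model:

```python
# Sketch of net-weight expansion: a net with integer weight n is replaced
# by n identical weight-one nets, so a weight-oblivious placer effectively
# pulls high-weight nets shorter. Nets are modeled as (pins, weight) pairs.

def expand_weighted_nets(nets):
    """nets: list of (pins, weight) pairs with integer weight >= 1.
    Returns a flat list of pin tuples, one entry per unit of weight."""
    expanded = []
    for pins, weight in nets:
        expanded.extend([pins] * weight)  # n copies of a weight-n net
    return expanded

# A critical net (weight 3) and a relaxed net (weight 1):
nets = [(("u1", "u7"), 3), (("u2", "u3"), 1)]
flat = expand_weighted_nets(nets)
# flat now holds four weight-one nets; ("u1", "u7") appears three times,
# so its wirelength counts three times in the placer's objective.
```

In practice a placer would scale the net's term in its objective rather than physically duplicate nets, but the two are equivalent for linear wirelength objectives.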
After timing-driven placement and optimization, many
cells may be placed in locations that are locally suboptimal.
Timing-driven detailed placement makes local moves and
swaps to try to improve both wirelength and the global
timing of the design. The detailed placement is timing
driven in that it can also use net weights to guide its
solution. Constraints to limit cell movement may be used
to prevent global moves that may undo the placement
achieved by the previous phase.
The final phase of core timing closure is pure
optimization. At this point, the timing is hopefully reasonably
close, but buffering and repowering alone do not suffice
to fix the critical paths. Direct logic transforms can be
applied at this point [1], [17]–[19]. Examples include the
following.
1) Cloning takes a cell that may be driving a large
number of pins and duplicates it so that the load
can be divided between the duplicate and the
original. This may or may not reduce delay, de-
pending on the increased load caused by the new
cell. It certainly will increase area.
2) Inverter manipulation takes inverters that are
driving or driven by a cell and absorbs them into
the cell. For example, an and gate driving an
inverter can become a nand gate. The reverse can
happen as well, whereby inverters are pulled out
of the logic of a cell.
3) Logic decomposition breaks apart a single logic cell
into several cells. For example, Fig. 1 shows how a
4-input nand gate (a) can be decomposed into (b) two
2-input and gates that together drive a 2-input
nand gate.
4) Connection reordering rewires commutative con-
nections in fan-in trees. Fig. 1(c) shows an exam-
ple reordering of the inputs to derive a different
physical solution.
5) Cell movement picks a cell along a critical path and
tries to find a new location for the cell that
improves timing.
These optimizations can be deployed on critical paths
along with incremental timing analysis to push the design
closer to timing closure.
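The logic decomposition in item 3 is equivalence-preserving, which can be checked exhaustively over all 16 input combinations; a quick sketch:

```python
# Verify the decomposition of Fig. 1: a 4-input NAND is logically
# equivalent to two 2-input ANDs feeding a 2-input NAND.
from itertools import product

def nand4(a, b, c, d):
    return not (a and b and c and d)

def decomposed(a, b, c, d):
    # AND2(a, b) and AND2(c, d) drive a NAND2.
    return not ((a and b) and (c and d))

# Exhaustive equivalence check over all 16 input vectors:
assert all(nand4(*bits) == decomposed(*bits)
           for bits in product([False, True], repeat=4))
```

The decomposed form trades one large gate for three smaller ones, giving the placer freedom to spread the logic along the critical path.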
C. A Closer Look at Optimization
While placement is relatively straightforward, the
pieces that constitute optimization may not be so clear.
Optimization can be broken down into the following
phases:
1) electrical correction;
2) critical path optimization;
3) histogram compression;
4) legalization.
The purpose of electrical correction [20] is to fix, usually
through buffering and repowering, the capacitance and slew
violations that have been introduced, mostly during
the placement stage. In general, one
wants to first perform electrical correction in order to get
the design in a reasonable state for the subsequent
Fig. 1. Examples of direct logic transforms. (a) Initial gate. (b) Logic decomposition. (c) Connection reordering.
optimizations. Electrical correction is potentially a big
runtime hog: designs need more buffers than ever to fix
slew violations, due to the ever-decreasing ratio of gate
delay to wire delay. Some older designs
[21] may require 250 000 buffers and some newer designs
today require over a million buffers. The trend toward
large and more complex designs has turned a relatively
simple and fast phase into a complex and slow one.
During critical path optimization one examines a small
subset of the most critical paths and performs optimization
specifically to improve timing on those paths. This phase
needs to be intertwined with incremental timing so that
one can see the impact of logic changes right away and to
then find the next set of critical paths to work on. Here one
can afford to throw "the kitchen sink" at the problem; any
optimization such as direct logic transforms described
above that may potentially improve the timing is fair game
and can be attempted. A continuing challenge in the field
is to derive more complex transforms that involve the
interaction of multiple gates and potential cell movements.
For example, one may wish to "straighten" all the gates in
a path, simultaneously repower them, and perform buffer
insertion on the fly. Unlike electrical correction, the
runtime for this phase does not scale nearly as much with
increasing design size, since the number of paths that are
worked on in this phase is a user parameter that is in-
dependent of the design size. The bottlenecks for runtime
here are how far the critical paths are from closure and the
time it takes to update the timing.
Critical path optimization certainly can fail to close on
timing. There could be a path (or paths) in the design that
is completely incapable of meeting timing requirements,
e.g., due to floorplanning of fixed blocks. At this point
physical synthesis could return with its best solution found
so far, though there might still be thousands of paths which
do not meet their timing targets. The purpose of the
histogram compression phase is to perform optimization on
these less critical, but still negative paths. This is analogous
to pushing down on the timing histogram returned from
timing analysis. The size of the histogram after this phase
gives the designer an indication of how much work
remains to close on timing. This phase helps the designer
distinguish between a few really poor paths versus
thousands of systemic problems.
Throughout all of the above phases, every optimization
will disrupt the placement. One can choose to always find
a legal location for every buffer or piece of logic during
optimization; however, this will be very expensive. In
PDS, optimizations are allowed to make changes and place
cells that may cause the placement to have overlaps,
potentially in the thousands. Periodically a phase of place-
ment legalization needs to be called to resolve these
overlaps to once again make the placement viable. The
frequency that this step needs to be called (along with the
size of its task) can be a major contributor to the total
runtime of the system.
Fig. 2 gives an example of how the four major phases of
core physical synthesis may be broken down further. For
example, observe how no legalization occurs at the end of
the first phase. Since the entire design will be replaced in
phase 2, legalization at this point can be considered
unnecessary. Also observe that in phase 4, critical path
optimization and legalization are run after each other
three times. In practice, this loop can be made even tighter
so that any timing disturbances caused by legalization are
quickly reflected in the list of most critical paths. The flow
shown in this figure is just an example of how the different
phases may operate together. Many different combinations
can be employed (such as more or fewer placements,
optimizations before initial placement, etc.) that may
achieve better results. It remains a challenge of physical
synthesis to find flows that achieve excellent results across
a wide range of design styles.
D. Achieving Fast Physical Synthesis
In order to make the physical synthesis as fast as
possible, we have focused on a variety of techniques that
can be deployed throughout the flows. A key philosophy
for achieving both a fast and high quality result is to do the
optimizations as fast as possible even if some optimality is
sacrificed. As long as the design is in a reasonably good
Fig. 2. Major phases of physical synthesis.
state after applying fast optimization, one can always apply
slower, but more accurate, optimization to further polish
the design. In other words, one can break a few eggs while
making the cake, as long as there is a way to clean them up
(but if one is careless and breaks too many eggs, the cake
will never be completed). This paper presents some of the
major algorithmic techniques that have been discovered or
utilized. They include the following.
• Clustering for multilevel placement. While it is
well established that multilevel partitioning [22]
gives superior runtime and quality of results for the
circuit partitioning problem, achieving a similar
result for placement has been much more elusive.
For placement, this requires clustering with a bit
more care; we have been able to cluster to achieve
speedups of a factor of 3–5 versus flat placement
while obtaining similar placement quality. This
result can be applied to both the initial and timing-
driven placement phases. Details of this technique
can be found in [23].
• Fast timing-driven buffering. It is well known
that Van Ginneken’s algorithm [24] can achieve an
optimal buffering result for a given tree topology.
When one extends it to handle a large buffer li-
brary and to control the total buffer area, the
runtime increases significantly. This work shows how
one can add new pruning and estimation tech-
niques to improve runtime without any measurable
degradation in solution quality. This result can be
applied to any of the buffering phases. Details of
this work can be found in [25].
• Integrated electrical correction. As mentioned
above, electrical correction consumes an increasingly
large percentage of the runtime of physical
synthesis. This work proposes integrating buffer-
ing and repowering into a single engine that rec-
ognizes which optimization is best to perform
for a given net. Details of the scheme can be
found in [20].
• Timerless buffering. For electrical correction,
one does not require the best solution in terms of
timing: any suboptimal timing solution that later
proves critical can always be rebuffered. When
potentially inserting a million buffers for electrical
correction, it is essential to fix slew and capaci-
tance violations as fast as possible while using the
minimum buffer area. This work describes a
new algorithm for solving this problem that is an
order of magnitude faster than timing-driven
buffering. Details of the algorithms can be found
in [26].
• Layout aware buffer trees. When performing
buffer insertion, ignoring the density of placed
objects and the design's routability is risky, because
buffers placed in dense regions may have to be
moved later by legalization. Often one can find
locations in sparser regions of the chip that are no
worse than the locations in dense regions. This
work presents a generalized fast technique for
constructing a Steiner tree for buffering, via either
timing-driven or timerless buffering. Details of the
work can be found in [27].
• Diffusion-based legalization. The danger of
legalization is that it can potentially degrade timing
by moving a timing critical cell to a legal location
that is far away from its optimal location. To avoid
this, one can run legalization very frequently to
keep it from doing too much work for any iteration.
As an alternative, diffusion-based legalization
spreads cells more smoothly, using the paradigm of the
physical process of diffusion. Consequently, timing
degradations are less frequent, and legalization can
be run less often in between optimizations. Fur-
ther, this technique can be used to alleviate local
routing congestion hot spots. Details of the algo-
rithm can be found in [28].
The remainder of the paper discusses each of these
technical contributions in more detail.
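Several of the buffering items above build on van Ginneken's dynamic program [24]. As orientation, here is a minimal sketch for a single driver-to-sink path under the Elmore delay model: each candidate is a (downstream capacitance, required arrival time q) pair, and dominated candidates are pruned at every step, which is what keeps the candidate list, and hence the runtime, small. The two-pin topology and all electrical constants are illustrative, not taken from PDS or any real library:

```python
# Van Ginneken-style buffering on a single driver-to-sink path, sweeping
# from the sink back toward the driver. Illustrative constants:
RW, CW = 1.0, 1.0            # wire resistance / capacitance per unit length
RB, CB, DB = 1.0, 0.5, 1.0   # buffer output resistance, input cap, delay

def add_wire(cands, length):
    """Propagate candidates across a wire segment (Elmore delay)."""
    r, c = RW * length, CW * length
    return [(cap + c, q - r * (c / 2.0 + cap)) for cap, q in cands]

def add_buffer_option(cands):
    """Add the candidate obtained by inserting a buffer at this point."""
    best_q = max(q - DB - RB * cap for cap, q in cands)
    return cands + [(CB, best_q)]

def prune(cands):
    """Keep only nondominated candidates (smaller cap or larger q)."""
    kept, best_q = [], float("-inf")
    for cap, q in sorted(cands):
        if q > best_q:
            kept.append((cap, q))
            best_q = q
    return kept

def buffer_path(sink_cap, sink_q, segments):
    """segments: wire lengths, listed from the sink back toward the
    driver; a buffer may be inserted between consecutive segments."""
    cands = [(sink_cap, sink_q)]
    for length in segments:
        cands = add_wire(cands, length)
        cands = prune(add_buffer_option(cands))
    return cands  # the driver then picks the candidate maximizing slack

cands = buffer_path(sink_cap=1.0, sink_q=100.0, segments=[2.0, 2.0, 2.0])
```

The speedups reported in [25] come from sharper pruning and estimation when this recurrence is extended to full Steiner trees and large buffer libraries.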
II. CLUSTERING FOR FAST GLOBAL PLACEMENT
Global placement is perhaps the most independent and
well defined component of physical synthesis that is a
major contributor to the total runtime of the system.
Global placement algorithms can generally be categorized
as simulated annealing, top-down cut-based partitioning,
analytic placement, or some combination thereof. Simulated
annealing [29] is an iterative optimization method
which refines a placement solution using a stochastic
algorithm. Although this is an effective method to integrate
non-conventional multidimensional objective functions
for global placement, it is known to be slow and not
scalable compared to other global placement algorithms.
Recent years have seen the emergence of several new
academic placement tools, especially in the top-down
partitioning and analytic domains. With the advent of
multilevel partitioning [22], [30] as a fast and effective
algorithm for min-cut partitioning, new generations of top-
down cut-based placers such as Capo [31], Feng Shui [32],
and Dragon2000 [33] have appeared. A placer
in this class partitions the cells into two (bisection) or four
(quadrisection) regions of the chip, then recursively
partitions each region until a coarse global placement is
achieved. In general, recursive cut-based placement ap-
proaches perform quite well when designs are dense, but
rather poorly when they are sparse.
Analytic placers typically solve a relaxed placement
formulation (such as minimum total squared wirelength)
optimally, allowing cells to temporarily overlap. Legali-
zation is achieved by removing overlaps via either
partitioning or by introducing additional forces/constraints
to generate a new optimization problem. The recent
placement contest [6] shows that analytic placement
algorithms can produce high quality placement solutions
on modern real circuits. This has helped spur a new
renaissance of analytic placers, e.g., APlace [7], mPL [8],
mFar [9], and FastPlace [10]. The genesis of this analytic
placement movement began with [34] and has very
recently been significantly improved upon in [35]. Analytic
placers tend to find better global placement solutions,
particularly when designs have nontrivial white space.
In other words, analytic placers seem to have an
advantage in managing white space during the global
placement process.
For any placer, clustering can be used to make it faster.
Clustering groups cells into fewer clusters, then placement
can be run directly on the clusters. However, for any placer
of reasonable quality, the challenge lies in using clustering
to maintain and perhaps even enhance solution quality.
The particular clustering technique needs to be adapted for
the placer to which it is being applied.
The remainder of this section discusses how hierarchi-
cal clustering and unclustering techniques are integrated
into a top-down analytic placer that exists in PDS, though
the approach is general enough to be applied to any placer. This placer
was chosen since it has been proven effective in the design
of several hundred real ASIC parts and is flexible enough
to handle a wide variety of special user constraints, like
bounds on cell movements. Further, clustering can also
help improve timing-driven placement under a net-
weighting paradigm. By grouping cells with high weights
into clusters, these cell groups will likely be placed close
together in the final placement.
The hierarchical analytic placement is the integration
of three key components: analytic top-down placement,
best-choice clustering, and area-based unclustering. First,
we briefly review the flat global placement algorithm used
for this particular speedup technique. Multilevel place-
ment can be applied to just about any flat global placer,
though the techniques for clustering and unclustering
must be customized to obtain good results.
A. Analytic Top-Down Placement Overview
The analytic top-down global placement algorithm pre-
sented here is based on quadratic placement with geo-
metric partitioning [36]. A quadratic wirelength objective
is often used for analytic placement since it can be easily
optimized, e.g., with a conjugate gradient (CG) solver:

\[ \text{minimize } \Phi(\vec{x}, \vec{y}) = \sum_{i > j} w_{ij}\left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right] \tag{1} \]

where \(\vec{x} = [x_1, x_2, \ldots, x_n]\) and \(\vec{y} = [y_1, y_2, \ldots, y_n]\) are the
coordinates of the movable cells \(\vec{v} = [v_1, v_2, \ldots, v_n]\), and \(w_{ij}\)
is the weight of the net connecting \(v_i\) and \(v_j\). The optimal
solution is found by solving one linear system for \(\vec{x}\) and one
for \(\vec{y}\). Quadratic placement only works on a placement with
fixed objects (anchor points). Otherwise, it produces a
degenerate solution where all cells sit on top of each other
at a single point.
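Both the need for fixed anchors and the linear-system solve can be seen on a tiny one-dimensional instance of (1): two movable cells chained between fixed pads, all net weights 1. Setting the gradient to zero gives a 2 x 2 system, solved here by Cramer's rule (a real placer would use CG on a large sparse system; the pad locations are illustrative):

```python
# Minimal 1-D instance of objective (1): pad(0) -- v1 -- v2 -- pad(3),
# unit net weights. The fixed pads are the anchor points; without them
# the system would be singular and all cells would collapse to a point.

def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# d/dx1: (x1 - 0) + (x1 - x2) = 0  ->   2*x1 -   x2 = 0
# d/dx2: (x2 - x1) + (x2 - 3) = 0  ->  -  x1 + 2*x2 = 3
x1, x2 = solve_2x2(2.0, -1.0, -1.0, 2.0, 0.0, 3.0)
# The optimal squared-wirelength solution spaces the cells evenly.
```

The same structure scales up: one sparse positive-definite system per axis, with fixed objects contributing the right-hand side.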
Although the solution of (1) provides a placement
solution with optimal squared wirelength, the solution will
have lots of overlapping cells. To remove overlaps, we
adopt geometric four-way partitioning [36]. Four-way
partitioning, or quadrisection, is a function \(f: V \to \{0, 1, 2, 3\}\),
where each index corresponds to one of the sub-
regions or bins \(B_0, B_1, B_2, B_3\). The assignment of cells to
bins needs to satisfy the capacity constraint for each bin.
With the given cell locations determined by quadratic
optimization, four-way geometric partitioning minimizes
the sum of weighted cell movements (using a linear-time
algorithm), defined as

\[ \sum_{v \in V} \mathrm{size}(v) \cdot d\big( (x_v, y_v), B_{f(v)} \big) \tag{2} \]

where \(v\) is a cell, \((x_v, y_v)\) is the location of cell \(v\) from the
quadratic solution, and \(B_{f(v)}\) is the bin to which cell \(v\) is
assigned. The distance term \(d((x, y), B_i)\), with
\(i \in \{0, 1, 2, 3\}\), is the Manhattan distance from coordinate
\((x, y)\) to the nearest point of bin \(B_i\), weighted by the
size of the cell, \(\mathrm{size}(v)\). The intuition of this
objective function is to obtain the partitioning solution
with the minimum perturbation to the previous quadratic
optimization solution.
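Objective (2) is cheap to evaluate for a given assignment; a minimal sketch (the bin rectangles and cell data are illustrative):

```python
# Evaluate the quadrisection objective (2): total size-weighted Manhattan
# movement from each cell's quadratic-placement location to the nearest
# point of its assigned bin.

def dist_to_bin(x, y, bin_rect):
    """Manhattan distance from (x, y) to the nearest point of a rectangle."""
    x0, y0, x1, y1 = bin_rect
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return dx + dy

def movement_cost(cells, assignment, bins):
    """cells: {name: (x, y, size)}; assignment: {name: bin index 0..3}."""
    return sum(size * dist_to_bin(x, y, bins[assignment[n]])
               for n, (x, y, size) in cells.items())

# Four unit bins tiling a 2x2 region:
bins = [(0, 0, 1, 1), (1, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
cells = {"a": (0.2, 0.3, 1.0), "b": (1.5, 1.5, 2.0)}
cost = movement_cost(cells, {"a": 0, "b": 3}, bins)
# Both cells already lie inside their bins, so this assignment costs 0.
```

The actual quadrisection step optimizes this cost over all capacity-feasible assignments in linear time; the sketch only evaluates a given assignment.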
Quadrisection is recursively applied, so that at level \(k\)
there are \(4^k\) placement sub-regions or bins. For each bin,
the process of quadratic optimization and subsequent
geometric partition are repeated until each sub-placement
region contains a trivial number of objects. At each
placement level, one can also apply local refinement tech-
niques such as repartitioning [37]. Repartitioning consists
of applying a quadrisection algorithm on each \(2 \times 2\) sub-
problem instance in a sequential manner. The fundamental
reason that repartitioning can improve the wirelength of
placement is that it can fix any poor assignments from the
minimum-movement quadrisection step. Since it can see
the assignments made from the prior level, it is able to
locally reverse any poor assignments based on the
repartitioning objective function.
B. Clustering for Multilevel Placement
As placement instances climb into the millions of
cells, clustering becomes a powerful tool for speeding up
the global placer. Clustering effectively reduces the
problem size fed to the placer by viewing each cluster of
cells as a single cell. The quality of a clustering-based or
Alpert et al.: Techniques for Fast Physical Synthesis
578 Proceedings of the IEEE | Vol. 95, No. 3, March 2007
multilevel placer is critically dependent on the ability to
perform intelligent clustering.
In terms of the interactions between clustering and
placement, the prior work can be classified into two cat-
egories: transient and persistent. Transient clustering
usually involves clustering and unclustering as part of
the internal placement algorithm interactions. For exam-
ple, multilevel min-cut partitioning [22], [31] falls into this
category. The clustering is used for partitioning for a given
level, but then an entirely new clustering is generated for
the subsequent level. Hence, the clustering is transient
since it is constantly recomputed based on the current
state of the placer. In contrast, persistent clustering gener-
ates a cluster hierarchy at the beginning of a placement in
order to reduce the size of a problem for the entire
placement [9]. The clustered objects can be dissolved at or
near the end of placement, with a final "clean-up" operation.
For persistent clustering, the clustering algorithm
itself is actually independent of the core placer. Rather, it is a
preprocessing step which imposes a more compact netlist
structure for the placer, e.g., [29].
To embed clustering within our placer, we propose a
semipersistent clustering strategy. One problem with per-
sistent clustering is that clustered objects may be too large
relative to the decision granularity (for example, the size
of the bin to which a cluster is assigned during partitioning),
which results in degradation of the final placement
solution quality. The goal of semipersistent clustering is
to address this deficiency. Semipersistent clustering takes
advantage of the hierarchical nature of clustering so that
clustered objects are dissolved slowly during the place-
ment flow. At the early stages of the placement algorithm,
a global optimization process is performed on a highly
clustered netlist, while local optimization/refinement
can be executed on the almost-flattened netlist at later
stages.
There are many algorithms and objectives for cluster-
ing (see [38] for a survey). For example, a common
technique is to match pairs of similar objects and apply
matching passes recursively [22]. While extremely fast, the
pairs that get merged towards the end of a pass may cluster to a less desirable neighbor. To avoid this behavior, one
could perform partial passes where one only merges some
small fraction of the cells before updating the list of
potential matches. In its most extreme, one can use a
partial list of one match. In other words, at each pass only
perform the single best clustering over all possible clusters
according to the given objective function. This is what we
call best-choice clustering [23], as shown in Fig. 3. By using a priority queue to identify the best cluster, one obtains a
sub-quadratic implementation. Priority queue manage-
ment naturally provides an ideal clustering sequence and it
is always guaranteed that two objects with the best
clustering score will be clustered.
The degree of clustering may be controlled by com-
puting a target number of objects. Best-choice clustering is
simply repeated until the overall number of objects becomes the target number. Fewer target objects imply more extensive clustering (and a larger speedup).
During the clustering score calculation, the weight w_e of a hyperedge e is defined as 1/|e|, i.e., inversely proportional to the number of objects incident to the hyperedge. Given two objects u and v, the clustering score d(u, v) between u and v is defined as

d(u, v) = \sum_{e \in E,\; u, v \in e} \frac{w_e}{a(u) + a(v)}    (3)
where e is a hyperedge connecting objects u and v, w_e is the corresponding edge weight, and a(u) and a(v) are the areas of u and v, respectively. The clustering score of two objects is directly proportional to the total sum of the edge weights connecting them, and inversely proportional to the sum of their areas. This clustering score function can handle hyperedges directly without transforming them into a clique model. Also, the area-based denominator of the score function helps to produce more balanced clustering results.
Suppose N_u is the set of objects neighboring a given object u. We define the closest object to u, denoted c(u), as the neighbor with the highest clustering score to u, i.e., c(u) = v such that d(u, v) = max{d(u, z) | z ∈ N_u}.
The best-choice algorithm is composed of two phases.
In Phase I, for each object u in the netlist, the closest object v and its associated clustering score d are calculated. Then, the tuple (u, v, d) is inserted into the priority queue with d as the comparison key. For each object u, only one tuple with the closest object v is inserted. This vertex-oriented priority queue allows for more efficient data structure
Fig. 3. Best-choice clustering algorithm.
Alpert et al. : Techniques for Fast Physical Synthesis
Vol. 95, No. 3, March 2007 | Proceedings of the IEEE 579
management than edge-based methods. Phase I is simply a priority queue (PQ) initialization step.
In the second phase, the top tuple (u, v, d) in PQ is picked up (Step 2), and the pair of objects (u, v) is clustered, creating a new object u′ (Step 3). The netlist is updated (Step 4), the closest object v′ to the new object u′ and its associated clustering score d′ are calculated, and a new tuple (u′, v′, d′) is inserted into PQ (Steps 5–6). Since clustering changes the netlist connectivity, some of the previously calculated clustering scores might become invalid. Thus, the clustering scores of the neighbors of the new object u′ (equivalently, all neighbors of u and v) need to be recalculated (Step 7), and PQ is adjusted accordingly. The following example illustrates clustering score calculation and updating.
For example, assume an input netlist with six objects {A, B, C, D, E, F} and eight hyperedges {A, B}, {A, C}, {A, D}, {A, E}, {A, F}, a second {A, C}, {B, C}, and {A, C, F}, as in Fig. 4(a). Let the size of each object be 1. By calculating the clustering score of A to its neighbors, we find that d(A, B) = 1/4, d(A, C) = 2/3, d(A, D) = 1/4, d(A, E) = 1/4, and d(A, F) = 5/12. d(A, C) is the highest score, so C is declared the closest object to A. Since d(A, C) is also the highest score in the priority queue, A will be clustered with C, and the circuit netlist will be updated as shown in Fig. 4(b). With the new object AC introduced, the corresponding cluster scores become d(AC, F) = 1/3, d(AC, E) = 1/6, d(AC, D) = 1/6, and d(AC, B) = 1/3.
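To make the two phases concrete, the following is a minimal Python sketch of best-choice clustering under the scoring of (3). It is our own simplified illustration, not the PDS implementation: hyperedge weights are recomputed as 1/|e| on the current netlist, string concatenation stands in for cluster naming, and stale priority-queue tuples are discarded lazily when popped (Step 7's neighbor rescoring is simplified to rescoring all objects, for clarity rather than speed).

```python
import heapq

def score(u, v, edges, area):
    """Clustering score d(u, v): summed weights 1/|e| of the hyperedges
    containing both u and v, divided by their total area, per (3)."""
    w = sum(1.0 / len(e) for e in edges if u in e and v in e)
    return w / (area[u] + area[v])

def closest(u, edges, area):
    """Return (best score, closest neighbor) of u, or None if isolated."""
    nbrs = {v for e in edges if u in e for v in e if v != u}
    if not nbrs:
        return None
    v = max(nbrs, key=lambda z: score(u, z, edges, area))
    return score(u, v, edges, area), v

def best_choice(objs, edges, area, target):
    """Cluster the globally best pair repeatedly until only `target`
    objects remain.  Objects are strings; hyperedges are frozensets."""
    objs, edges, area = set(objs), list(edges), dict(area)
    pq = []
    for u in objs:                        # Phase I: one tuple per object
        best = closest(u, edges, area)
        if best:
            heapq.heappush(pq, (-best[0], u, best[1]))
    while len(objs) > target and pq:
        negd, u, v = heapq.heappop(pq)
        if u not in objs or v not in objs:
            continue                      # stale: an endpoint was clustered away
        best = closest(u, edges, area)
        if best is None or (-negd, v) != best:
            continue                      # stale: score or closest changed
        uv = u + v                        # Phase II: merge u and v
        objs -= {u, v}
        objs.add(uv)
        area[uv] = area.pop(u) + area.pop(v)
        edges = [frozenset(uv if x in (u, v) else x for x in e) for e in edges]
        edges = [e for e in edges if len(e) > 1]   # drop collapsed self-edges
        for z in objs:                    # re-seed tuples (simplified: all
            b = closest(z, edges, area)   # objects, not just the neighbors)
            if b:
                heapq.heappush(pq, (-b[0], z, b[1]))
    return objs, edges, area
```

Running this on the six-object example reproduces the scores above: A and C merge first, and the new object AC has d(AC, F) = 1/3 on the updated netlist.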
C. Area-Based Selective Unclustering
In this semipersistent clustering scenario, the clustering hierarchy is preserved during most of the global placement. However, if the size of a clustered object is large relative to the decision granularity, the geometric partitioning result on that object can affect not only the quality of the
global placement solution, but also the subsequent
legalization due to the limited amount of available free
space. To address this issue, we employ an adaptive area-based unclustering strategy. For each bin, the size of each clustered object is compared to the available free space. If the size is bigger than a predetermined percentage of the
available free space, the clustered object is dissolved. Our
empirical analysis shows that with the appropriate
threshold value (5%), most clusters can be preserved
during the global placement flow with insignificant loss of
wirelength. The area-based selective unclustering is
another knob to provide a tradeoff between runtime and
quality of the placement solution. More aggressive unclustering (a lower threshold value) produces better wirelengths at
the cost of higher CPU time.
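The per-bin test can be sketched as follows (a hypothetical data layout of ours; the 5% default follows the threshold quoted above):

```python
def selective_uncluster(bins, threshold=0.05):
    """Dissolve any clustered object larger than `threshold` times the
    free space of its bin.  Each bin is a dict with 'free_space' and
    'clusters', where a cluster is an (area, members) pair; dissolved
    clusters contribute their members as individual placeable objects,
    surviving clusters stay intact."""
    placed = []
    for b in bins:
        for area, members in b['clusters']:
            if area > threshold * b['free_space']:
                placed.extend(members)   # dissolve: place members individually
            else:
                placed.append(members)   # keep the cluster intact
    return placed
```

Raising the threshold keeps more clusters intact (faster, slightly worse wirelength); lowering it dissolves more aggressively.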
D. Putting the Placer Together
Finally, clustering can be integrated with analytic top-down placement to derive a new hierarchical global placement algorithm, summarized in Fig. 5. From the given initial netlist, a coarsened netlist is generated via best-choice clustering, which serves as the seed for the
subsequent global placement. Steps 2–5 are the basic
analytic top-down global placement algorithm described in
Section II-A. After quadratic optimization and quadrisec-
tion are performed for each bin, an area-based selective
unclustering is performed to dissolve large clustered
objects (Step 6). At the end of each placement level, a
repartitioning refinement is executed for local improvement (Step 8). Steps 2–9 constitute the main global
placement algorithm. If there exist clustered objects after
the global placement, those are dissolved unconditionally
(Step 10) before the final legalization and detailed
placement are executed (Step 11). The proposed algorithm
relies on three strategic components: best-choice clustering, analytic top-down global placement, and area-based
selective unclustering.
E. Results and Summary
Table 1 shows the performance of hierarchical placement over flat placement on real industrial circuits. The
table shows the size of circuits in terms of the number of
Fig. 4. Clustering a pair of objects A and C.
Fig. 5. Hierarchical analytic top-down placement algorithm.
objects and nets, the utilization of the designs, and the wirelength improvement and speedup over flat placement. Let α be the ratio of the number of cells to the target number of clusters. With clustering ratio α = 2, hierarchical placement is on average twice as fast as flat placement while obtaining a slight 0.92% improvement in wirelength. With a more aggressive clustering ratio of α = 10, hierarchical placement is about five times faster than flat placement, with a slight 3% degradation in wirelength. Different values of α can be used to trade off speed and quality.
Overall, we demonstrate that careful clustering and
unclustering strategies can yield a hierarchical placement that is significantly faster than flat placement, with comparable
solution quality.
III. TECHNIQUES FOR FAST TIMING-DRIVEN BUFFER INSERTION
For timing-critical nets, buffer insertion must be deployed frequently to improve delay, whether to handle nets with large fanout or long wires, or to isolate noncritical sinks from critical ones. For example, Fig. 6(a) shows a 3-pin net
with poor timing in which the small squares are potential
buffer insertion locations. Proper buffer insertion, as
shown in Fig. 6(b), improves the timing to the most critical
sink by 200 ps. The bottom sink is not critical so only a
decoupling buffer is required for that subpath.
The buffering algorithms in PDS are based on the classic dynamic programming paradigm [24]. The reason is that the algorithm is provably optimal for a
given tree topology (such as [39], [40]), though this result
will frequently insert many additional buffers to obtain a
negligible improvement in performance. Thus, the algorithm must also manage the tradeoff between buffering resources and delay [41]. Doing so changes the algorithm's complexity from polynomial to pseudopolynomial and in
practice adds an order of magnitude to the runtime. The
result is an extremely effective algorithm for timing-driven
buffer trees, though the algorithm’s inefficiency is
problematic.
Thus, it is essential to make this core optimization as fast as possible. This section explores tricks for tweaking the classic algorithm to obtain significant
performance improvement without losing solution quality.
These techniques can be easily integrated with the classic
buffer insertion framework while also considering slew,
noise, and capacitance constraints [42], [43]. Used in
conjunction, these techniques can lead to more than a
factor-of-ten performance improvement versus traditional dynamic programming.
A. Overview of the Classic Buffering Algorithm
For a given Steiner tree with a set of buffer locations
(namely the internal nodes), buffer insertion inserts
buffers at some subset of legal locations such that the
required arrival time (RAT) at the source is maximized. In
the dynamic programming framework, candidate solutions are generated and propagated from the sinks toward the
Table 1 Comparisons of Hierarchical Analytic Top-Down Placement Against Flat Placement in Wirelengths and Runtimes
Fig. 6. An example of how buffer insertion can improve timing to critical sinks. (a) A net without buffers inserted. (b) Proper buffer
insertion improves timing.
source. Each candidate solution γ is associated with an internal node in the tree and is characterized by a 3-tuple (q, c, w). The value q represents the required arrival time; c is the downstream load capacitance; and w is the cost summation for the buffer insertion decisions.
Initially, a single candidate (q, c, w) is assigned to each sink, where q is the sink RAT, c is the load capacitance, and w = 0. When candidate solutions are propagated from a node to its parent, all three terms are updated accordingly. At an internal node, a new candidate is generated by inserting a buffer. At each Steiner node, the two sets of solutions from the children are merged. Finally, at the source, the solutions with maximum q are selected.
The candidate solutions at each node are organized as an array of linked lists; the solutions in each list of the array share the same buffer cost value w = 0, 1, 2, .... During the algorithm, inferior solutions are pruned. A solution is defined as inferior (or redundant) if there exists another solution that is at least as good in slack, capacitance, and buffer cost. More precisely, for two candidate solutions γ1 = (q1, c1, w1) and γ2 = (q2, c2, w2), γ2 dominates γ1 if q2 ≥ q1, c2 ≤ c1, and w2 ≤ w1. In that case, γ1 is redundant and may be pruned. After pruning, every list with the same cost is sorted in terms of q and c.
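As an illustration of the dominance rule within one cost bucket, a minimal Python sketch (our own, not the PDS linked-list data structure; solutions are (q, c) pairs sharing the same cost w):

```python
def prune_dominated(cands):
    """Prune redundant (q, c) candidates of equal buffer cost: a
    solution is dominated if another has higher-or-equal RAT q and
    lower-or-equal load c.  Returns the nondominated set sorted by c,
    along which q is strictly increasing."""
    frontier = []
    # Sort by c ascending, breaking ties with q descending, so the best
    # q for each capacitance is seen first.
    for q, c in sorted(cands, key=lambda s: (s[1], -s[0])):
        if not frontier or q > frontier[-1][0]:
            frontier.append((q, c))
    return frontier
```

The surviving list is exactly the sorted-in-q-and-c form the text describes.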
A buffer library is a set of buffers and inverters, each of which is associated with a driving resistance, input capacitance, intrinsic delay, and buffer cost. During
optimization, we wish to control the total buffer resources
so that the design is not over-buffered for marginal timing
improvement. While total buffer area can be used, to the
first order, the number of buffers provides a reasonably
good approximation for the buffer resource utilization.
Indeed, we use the number of buffers since it allows a much more efficient baseline van Ginneken implementation. Note that our techniques presented in this paper can be applied to any buffer resource model, such as total buffer area or power.
At the end of the algorithm, a set of solutions with different cost–RAT tradeoffs is obtained. Each solution gives the maximum RAT achievable under the corresponding cost bound. In practice, we choose neither the solution with maximum RAT at the source nor the one with minimum total
buffer cost. Usually, we would like to pick a solution in the middle, such that the solution with one more buffer brings only marginal timing gain. In PDS, we use the "10 ps rule" (though the value can of course be modified depending on the frequency target). With the final solutions sorted by the source's RAT value, we start from the solution with maximum RAT and compare it with the second solution (which usually has one fewer buffer). If the difference in RAT is more than 10 ps, we pick the first solution. Otherwise, we drop it (since a timing improvement of less than 10 ps is not worth an extra buffer) and continue by comparing the second and third solutions. Of course, instead of 10 ps, any time threshold can be used when applying the rule to different nets.
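The selection walk can be sketched as follows (our illustration; solutions are (RAT, buffer count) pairs assumed sorted by decreasing RAT, and the threshold parameter stands in for the 10 ps value):

```python
def pick_solution(solutions, threshold=10.0):
    """Apply the '10 ps rule': walk the solutions in order of
    decreasing source RAT (each successive one typically uses one
    fewer buffer) and drop a solution whenever the next one is within
    `threshold` ps of it."""
    best = solutions[0]
    for nxt in solutions[1:]:
        if best[0] - nxt[0] > threshold:
            break          # the extra buffer buys more than 10 ps: keep it
        best = nxt         # gain too small to justify the buffer: move on
    return best
```

For example, with RATs 500, 495, 480 ps for 5, 4, 3 buffers, the 5-buffer solution gains only 5 ps over the 4-buffer one and is dropped, while the 4-buffer solution gains 15 ps over the 3-buffer one and is kept.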
B. Preslack Pruning
During the algorithm, a candidate solution is pruned
out only if there is another solution that is superior in
terms of capacitance, slack and cost. This pruning is based
on the information at the current node being processed.
However, all solutions at this node must be propagated
further upstream toward the source. This means the load
seen at this node must be driven by some minimal amount
of upstream wire or gate resistance. By anticipating the upstream resistance ahead of time, one can prune out more
potentially inferior solutions earlier rather than later,
which reduces the total number of candidates generated.
More specifically, assume that each candidate must be
driven by an upstream resistance of at least Rmin. The
pruning based on anticipated upstream resistance is called
prebuffer slack pruning.
Prebuffer Slack Pruning (PSP): For two nonredundant solutions (q1, c1, w) and (q2, c2, w), where q1 < q2 and c1 < c2, if (q2 − q1)/(c2 − c1) ≤ R_min, then (q2, c2, w) is pruned.
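The PSP test can be sketched over a sorted same-cost candidate list (a simplified illustration of ours; candidates are (q, c) pairs with q and c both increasing, as produced by the dominance pruning described earlier):

```python
def prebuffer_slack_prune(cands, r_min):
    """Prebuffer slack pruning over same-cost candidates (q, c),
    assumed nondominated and sorted by increasing q and c.  If two
    neighbors satisfy (q2 - q1)/(c2 - c1) <= r_min, the
    higher-capacitance one can never win once driven through at least
    r_min of upstream resistance, so it is dropped."""
    kept = [cands[0]]
    for q2, c2 in cands[1:]:
        q1, c1 = kept[-1]
        if (q2 - q1) / (c2 - c1) <= r_min:
            continue       # (q2, c2) pruned: slope too shallow
        kept.append((q2, c2))
    return kept
```

Using an R larger than R_min simply tightens this filter, trading a little slack for fewer candidates, as the experiments below show.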
The PSP technique was first proposed in [44]. Using an
appropriate value of Rmin guarantees optimality is not lost
[44], [45]. However, what if we are willing to sacrifice optimality for a faster solution by using a resistance R that is larger than R_min? In practice, we observe that a value somewhat larger than R_min does not hurt solution quality.
We performed buffer insertion experiments on 1000 high-capacitance industrial nets by varying the value of R used for preslack pruning. The percent slack and CPU time compared to no preslack pruning are shown in Fig. 7.
Observe that the slack degrades slowly as a function of resistance, while the CPU time decrease is fairly sharp. For example, R = 120 Ω is the minimum resistance value for which preslack pruning remains optimal. However, one can get a 50% speedup for less than 5% slack degradation with a larger value of R = 600 Ω. These results
indicate that using PSP can bring a huge speed-up in
classic buffering for a fairly small degradation in solution
quality.
C. Squeeze Pruning
The basic data structure of van Ginneken style
algorithms is a sorted list of non-dominated candidate
solutions. Both the pruning in van Ginneken style
algorithm and the prebuffer slack pruning are performed
by comparing two neighboring candidate solutions at a
time. However, more potentially inferior solutions can be pruned out by comparing three neighboring candidate
solutions simultaneously. For three solutions in the sorted
list, the middle one may be pruned according to the
squeeze pruning defined as follows.
Squeeze Pruning: For every three candidate solutions (q1, c1, w), (q2, c2, w), (q3, c3, w), where q1 < q2 < q3 and c1 < c2 < c3, if (q2 − q1)/(c2 − c1) < (q3 − q2)/(c3 − c2), then (q2, c2, w) is pruned.
For a two-pin net, consider the case that the algorithm
proceeds to a buffer location and there are three sorted
candidate solutions with the same cost that correspond to
the first three candidate solutions in Fig. 8(a). According
to the rationale in prebuffer slack pruning, the q–c slope for two neighboring candidate solutions indicates the potential that
the candidate solution with smaller c can prune out the
other one. A small slope implies a high potential. For
example, (q1, c1, w) has a high potential to prune out (q2, c2, w) if (q2 − q1)/(c2 − c1) is small. If the slope value
between the first and the second candidate solutions is
smaller than the slope value between the second and the
third candidate solutions, then the middle candidate solution is always dominated by either the first candidate solution or the third candidate solution. Squeeze pruning preserves optimality for a two-pin net. After squeeze pruning, the solution curve in the (q, c) plane is concave, as shown in Fig. 8(b).
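The concave-envelope view suggests a direct scan (our sketch, for the two-pin case where optimality holds; candidates are same-cost (q, c) pairs sorted with q and c strictly increasing):

```python
def squeeze_prune(cands):
    """Squeeze pruning: among same-cost candidates (q, c) sorted with
    strictly increasing q and c, drop any middle point whose left
    slope is smaller than its right slope, leaving the concave
    solution curve in the (q, c) plane."""
    hull = []
    for q, c in cands:
        while len(hull) >= 2:
            (q1, c1), (q2, c2) = hull[-2], hull[-1]
            # middle point (q2, c2) is squeezed out when its left
            # slope is smaller than its right slope
            if (q2 - q1) / (c2 - c1) < (q - q2) / (c - c2):
                hull.pop()
            else:
                break
        hull.append((q, c))
    return hull
```

Each candidate is pushed and popped at most once, so the scan is linear in the list length.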
For a multisink net, squeeze pruning does not
guarantee optimality since each candidate solution may
merge with different candidate solutions from the other branch, and the middle candidate solution in Fig. 8(a) may
offer smaller capacitance to other candidate solutions in
the other branch. Squeeze pruning may thus remove a candidate that would have yielded a merged solution with less total capacitance. However, despite the loss of guaranteed
optimality, most of the time squeeze pruning causes no
degradation in solution quality and overall is a fairly safe
pruning technique.
D. Library Lookup
The size of the buffer library is an important factor in
determining runtime. Modern designs may have hundreds
of buffers and inverters to choose from. The theoretical
complexity of van Ginneken style buffer insertion is
quadratic in terms of the library size, though in practice it appears to be linear. To avoid the slowdown from large
libraries, we take advantage of buffer library pruning [46]
to select a small yet effective set of buffers from all those
that may be used. We now discuss a more effective technique, library lookup.
During van Ginneken style buffer insertion, every buffer in the library is examined at each insertion point. If there are n candidate solutions at an internal node before buffer insertion and the library consists of m buffers, then mn tentative solutions are evaluated. For example, in Fig. 9(a), all eight buffers are considered for all n candidate solutions. However, many of these candidate solutions are clearly not worth considering. We seek to avoid generating poor candidate solutions in the first place and not even consider adding m buffered candidate solutions for each
Fig. 8. Squeeze pruning example. (a) The solution curve in the (q, c) plane before squeeze pruning. (b) The solution curve after squeeze pruning.
Fig. 7. The speedup and solution-quality sacrifice of aggressive preslack pruning for 1000 nets as a function of R.
unbuffered candidate solution. We propose to consider each candidate solution in turn. For each candidate solution with capacitance c_i, we look up the best noninverting buffer and the best inverting buffer, i.e., those that yield the best delay, from two tables precomputed before optimization. In Fig. 9(b), the capacitance c_i results in selecting buffer B3 and inverter I2 from the noninverting and inverting buffer tables.
The two tables may be precomputed before buffer
insertion begins.
All 2n tentative new buffered candidate solutions can be divided into two groups: one group includes the n candidate solutions with an inverting buffer just inserted, and the other includes the n candidate solutions with a noninverting buffer just inserted. We choose only the candidate solution that yields the maximum slack from each group, so finally only two candidate solutions are inserted into the original candidate solution lists. Since the number of tentative new buffered solutions is reduced from mn to 2n, a speedup is achieved. Also, since only two new candidate solutions instead of m are inserted, the total number of candidate solutions is reduced. This is similar to the case where the buffer library size is only two, except that the buffer type may change depending on the downstream load.
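A minimal sketch of the table-based lookup (our own simplification: buffer delay is modeled linearly as r·c + p with drive resistance r and intrinsic delay p, and a single table is shown rather than the separate inverting and noninverting pair; a real characterization would also fold in input capacitance and slew):

```python
import bisect

def build_lookup(buffers, caps):
    """Precompute, for each capacitance sample in the sorted list
    `caps`, the buffer minimizing the linear delay r*c + p."""
    return [min(buffers, key=lambda b: b['r'] * c + b['p']) for c in caps]

def lookup(table, caps, c):
    """Pick the precomputed best buffer for load c (nearest sample at
    or above c) instead of trying all m library buffers per
    candidate."""
    i = min(bisect.bisect_left(caps, c), len(caps) - 1)
    return table[i]
```

A small, fast driver (low p, high r) wins at light loads, while a strong driver (high p, low r) wins at heavy loads; the table captures the crossover once, before optimization begins.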
E. Results and Summary
Table 2 shows the impact of the three speedup techniques, preslack pruning (PSP), squeeze pruning (SqP), and library lookup (LL), versus the classic algorithm (baseline). The results are averaged over 5000 high-capacitance nets from an ASIC chip. The second column shows the total slack improvement (for all 5000 nets) after buffer insertion, and the third column gives the total CPU time. Overall, the three techniques resulted in a 20X speedup, with just 3% degradation in solution quality.
Buffer insertion is a core optimization for fixing timing-critical paths. When optimizing tens of thousands of nets, some optimality can be sacrificed in order to achieve acceptable runtime. Note that at the end of physical synthesis, one could reapply buffer insertion without these speedups (while also using more accurate delay models) to the handful of remaining critical nets. This is still much more efficient than applying full-blown, high-accuracy buffer insertion to the entire design.
This work in essence summarizes our philosophy of fast physical synthesis: do the optimization well and as fast as possible, even if a little optimality is sacrificed. At the end, if the design is close to timing closure, slower and more accurate techniques can always be employed to further refine the design.
IV. FAST ELECTRICAL CORRECTION
The previous section discussed fast buffering for critical
path optimization. Our focus now turns toward using
buffers and gate sizing for electrical correction. As dis-
cussed in the first section, electrical correction is be-
coming an increasingly costly phase of physical synthesis.
High wire resistance and sharp required slew rates (for either noise or performance) mean that potentially millions of buffers must be inserted and millions of gates
must be repowered simply to have an electrically correct
design. Critical path optimization techniques rely on the
correct operation of the timing analyzer; however, any
timer, even a sophisticated one, only works correctly if the
design it is given is in a reasonable electrical state. For
example, if capacitive loads are outside the range that a
gate model has been characterized for, the timer will give results that do not reflect the true performance of the gate.
Further, if one can quickly make the timing result look
decent, this will leave much less work for the subsequent
slower critical path optimizations.
This section focuses on how to quickly perform elec-
trical correction, i.e., fix capacitance and slew violations
[20]. Further, it is crucial that this phase requires minimal
Fig. 9. Library lookup example. B1 to B4 are noninverting buffers. I1 to I4 are inverting buffers. (a) van Ginneken style buffer insertion. (b) Library lookup.
Table 2. Simulation Results for a Full Library Consisting of 24 Buffers. Baseline Gives the Results of the Algorithm of Lillis et al. [47]. PSP Shows the Results of the Aggressive Prebuffer Slack Pruning Technique. SqP Stands for Our Squeeze Pruning Technique. LL Is the Library Lookup Technique
area overhead, thereby reducing unnecessary power consumption and silicon real estate. The need for reducing
area usage is obvious for area-constrained designs. How-
ever, even in designs where the total area may not be at a
premium, local regions may be congested. Further, in
delay-constrained designs, the area saving can be used by
subsequent optimizations to improve the performance of
critical regions.
A. Types of Electrical Violations
Timing analyzers utilize models for gate delays and
slews, which are precharacterized. Each gate is character-
ized with a maximum capacitive load that it can drive and a
maximum input slew rate, and the operation of the timer is
valid within these ranges. If these conditions are violated,
timers usually extrapolate to obtain "best guess" values. However, values calculated in this manner may be inaccurate. This leads to the limits that define electrical violations. There are two "rules" that a design has to pass for it to be electrically clean, as follows.
• Slew Limits: These rules define the maximum
slews permissible on all nets of the design. If the
slew (defined here as the 10%–90% rise or fall
time of a signal; other definitions can be used as well) at the input of a logic gate is too large, a gate may not switch at the target speed, or may not
switch at all, leading to incorrect operation.
• Capacitance Limits: These define the maximum
effective capacitance that a gate or an input pin can
drive. A large capacitance on the output of a gate
directly affects its switching speed and power
dissipation. Additionally, gates are typically char-
acterized for a limited range of output capacitance, and delay calculation during design can be incorrect if the output capacitance is greater than the
maximum value.
Violations of these rules (referred to as slew violations
and capacitance violations) taken together are called elec-
trical violations. These limits are principally determined
during gate characterization, but designers may choose to tighten these constraints further. High-performance designs, such as microprocessors, typically have much tighter slew limits than ASICs.
B. Causes of Violations
Fig. 10 shows the main causes of slew violations, and
how these may be fixed. Consider a net having source
gate A and sink gate B. The capacitive load seen by gate A is the sum of the interconnect capacitance of the net and
the input capacitance of gate B. Assume that a signal with
slew s1 is applied at the input of gate A. Due to the load
that it has to drive, the slew s2 at the output of gate A may
be more than s1. Thus, one cause of degradation is the
source gate not being capable of driving the load at its
output. Next, even if the slew at the output of A, s2, is
within the specified limits, it could degrade as the signal traverses the net to the sink. Thus, at the sink, the signal
could have an even larger slew of s3. This is the second
contribution to slew degradation.
There are two main methods of fixing slew violations,
as shown in Fig. 10. First, the source gate of the net can be
sized up, so that the new gate can drive the load present.
While this may fix violations on the net in question, the
obvious disadvantage is that the problem has been moved to the input of the source gate, where the input nets now
have larger sink capacitances. However, this may or may
not create violations on the input nets.
Second, keeping the source at its original size, buffers
can be inserted on the net in question. These isolate the
load capacitance of the sink, and repower the signal on the
net, so that slews are within the specified limits. Unlike
resizing, this method does not affect the electrical state of any other nets, but the area overhead can be much higher.
Additionally, the time required to determine where to best
insert buffers is much greater than the time required to
resize a gate.
The causes of capacitance violations are similar to those
of slew violations: sink and interconnect capacitance both
Fig. 10. Causes of slew violations, and different methods of fixing them. (a) Slew degradation due to gate and interconnect.
(b) Fixing slew violation by sizing source. (c) Fixing slew violation by buffering.
contribute to the existence of a violation. The fixes, too, are similar, using resizing and buffering. However, it is
important to note that it is possible to have capacitance
violations on a net that does not have slew violations, and
vice versa. Therefore, both capacitance and slew violations
have to be taken into consideration individually.
The simplest way to perform electrical correction is via
a sequential approach. First try resizing gates to fix vio-
lations while being careful not to oversize them. For those nets that cannot be solved with resizing, invoke a buffer
insertion algorithm. This may require a second pass of
resizing in order to properly size the newly inserted buffers
and inverters.
The most important drawback of this approach is that
sizing and buffering used to fix violations are applied
sequentially, with no communication or, indeed, knowledge of each other's capabilities. Thus, a pass of resizing or buffering tries to fix the violations that it sees, and assumes that the other will be able to handle the violations that it cannot fix. For instance, when resizing is applied to a net to fix a slew violation on a sink, it may decide that buffering is the best solution, for a variety of reasons.
However, in the next pass, when the net is passed to the
buffer insertion routine, there may be conditions that
prohibit the insertion of buffers, such as blockages. Subsequent passes of resizing and buffering are then needed
with different settings, to overcome this situation, and
there is no guarantee that any of these passes will fix the
existing violation.
C. An Integrated Approach
Alternatively, we propose a framework that tightly integrates the selection of the two optimizations, allowing for the use of the correct optimization in a single pass over
the design. This integrated approach seeks to selectively
apply the resizing and buffering optimizations on a net-by-
net basis. Nets are selected in topological order, from outputs to inputs, and on each net the following operations
are carried out.
• If there are no violations on the net, then the
source (driving) gate is sized down as much as possible, without introducing new violations.
• If slew violations exist on the net, the source gate is
sized up as necessary, to fix the violations.
• If the previous step (resizing to fix violations) does
not succeed, the net is buffered.
The rationale of this approach is as follows. First, nets
are processed in output-to-input order; any side effect of
resizing gates impacts only the input nets, which are yet to be processed. Sizing a gate up to remove a violation on its output has a detrimental effect on its input nets. This is
handled by processing nets in the correct order.
Second, sizing gates down when possible has two
benefits. First, area is recovered when gates are larger than
necessary, and second, reducing the load on input nets
potentially removes violations that may exist, or reduces
their severity. The area salvaged in this step is better used for improving delay on critical paths of the circuit. Of
course, this step can be skipped if the design has already
been optimized for delay.
Finally, if resizing cannot fix a violation, buffering is
used to fix the net. Since buffering is the last resort, this
optimization can be as aggressive as required, which is
used to our advantage as shown later. This order (resizing
followed by buffering) is also advantageous from a runtime standpoint, since buffering a net is much slower than
simply sizing the source gate.
The approach to gate sizing is straightforward. Given an input slew rate and output load, we iterate through all available sizes and select the smallest gate size that can deliver the required output slew. Buffering is based on the algorithm described in the next section. The algorithm selects the minimum area solution such that electrical constraints are satisfied. For runtime considerations, a coarse buffer library is often used for buffer insertion. The lack of granularity in the buffer library leaves room to resize the inserted buffers afterwards. Of course, a more fine-grained library can be used, but at the cost of extra runtime.
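The sizing rule can be sketched as a scan over the library: evaluate each size with a slew model and keep the cheapest feasible one. The tuple layout and the linear output-slew model (slew resistance times load plus intrinsic slew, as used later in this section) are assumptions for illustration, not the PDS library format.

```python
def pick_smallest_size(library, load, slew_target):
    """Return the smallest gate size meeting the slew target, or None.

    `library` is a list of (area, slew_resistance, intrinsic_slew) tuples;
    output slew is modeled as R * load + K. Values are illustrative.
    """
    feasible = [(area, R, K) for (area, R, K) in library
                if R * load + K <= slew_target]
    if not feasible:
        return None
    return min(feasible, key=lambda g: g[0])  # minimum-area feasible size
```

With a three-entry toy library, a light load lets the mid-size gate win on area, while a heavy load can make every size infeasible, signaling that buffering is needed.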
To decide whether a gate meets its required slew target, we adopt the model of Kashyap et al. [48] because of its simplicity. It is essentially the slew equivalent of the Elmore delay model, but does not suffer as severely from the inaccuracies caused by resistive shielding.
The slew model can be explained using a generic example: a path p from node v_i (upstream) to v_j (downstream) in a buffered tree. There is a buffer (or the driver) b_u at v_i, and there is no buffer between v_i and v_j. The slew rate s(v_j) at v_j depends on both the output slew s_{b_u,out}(v_i) at buffer b_u and the slew degradation s_w(p) along path p (or wire slew), and is given by [48]

s(v_j) = √( s_{b_u,out}(v_i)² + s_w(p)² ).    (4)
The slew degradation s_w(p) can be computed with Bakoglu’s metric [49] as

s_w(p) = ln 9 · D(p)    (5)

where D(p) is the Elmore delay from v_i to v_j.
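A minimal sketch of (4) and (5), assuming the Elmore delay of the path is already available:

```python
import math

def sink_slew(buffer_out_slew, elmore_delay):
    """Slew at a sink per (4)-(5): sqrt(s_out^2 + s_w^2), s_w = ln(9) * D."""
    wire_slew = math.log(9) * elmore_delay       # Bakoglu's metric, (5)
    return math.hypot(buffer_out_slew, wire_slew)  # root-sum-square, (4)
```

Note that with zero wire delay the sink simply sees the driver's output slew, and the root-sum-square form means the larger of the two components dominates.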
The basic framework presented above is flexible, and lends itself to multiple refinements as follows. Once a net is buffered, the integrated framework allows for a quick sizing of the newly added buffers. The buffering algorithm can therefore be used with a small library of buffers. Existing inverter trees can be ripped up and reinserted as required, keeping in mind signal polarity constraints on the sinks. If buffering does not fix a net, the cause of the failure can be analyzed on the fly, and different algorithms, e.g., for blockage avoidance, can be used. Finally, if area is
Alpert et al.: Techniques for Fast Physical Synthesis
586 Proceedings of the IEEE | Vol. 95, No. 3, March 2007
at a premium, both resizing and buffering can be applied to every net, and the solution with the lowest cost can be selected.
D. Electrical Correction Summary

The integrated framework allows PDS to efficiently
perform electrical correction. However, in our initial
implementation, we found that 80%–90% of the runtime
takes place in the van Ginneken style buffer insertion algorithm, even with the speedups discussed above. For
electrical correction, using a buffer insertion algorithm
which optimizes for delay is wasteful, since the purpose of
this stage is to simply produce an electrically correct
design. This motivates a new buffer insertion formulation
specifically for electrical correction that is discussed in the
next section.
V. FAST TIMERLESS BUFFERING
The efficiency of electrical correction directly depends on
the efficiency of the buffering algorithm. While Section III
shows how one can speed up performance driven
buffering, it still suffers from the fact that three constraints
must be handled at once: area, slew, and delay. In electrical correction, one can afford to ignore the last objective, delay. The assumption is that if a tree buffered by
electrical correction subsequently becomes part of a
critical path, it can always be ripped up and rebuffered
by the critical path optimizations while taking into account
the most up to date timing analysis. In general, we find
that only a relatively small percentage of nets (e.g., 5%)
need to be rebuffered. Thus, this section proposes a simpler buffer formulation that ignores delay constraints in order to achieve a more runtime- and area-efficient result.
The key observation that motivates this approach is that
traditional buffer insertion requires pruning based on
three components: capacitance, slack (or delay), and area
(or power). Because a candidate has to be inferior in all
three categories to be pruned, the list of possible candidates can grow quite large. However, to perform electrical correction, the optimal delay solution is not required
and instead one wishes to fix electrical violations with
minimum area. By using only two instead of three
categories for pruning, one can obtain a much more efficient solution (that is actually linear time in the case of a single buffer type).
A. Problem Formulation

For electrical correction, we seek the minimum area
(or cost) buffering solution such that slew constraints are
satisfied. Since one does not need to know required arrival
time at sinks, it can be performed independently of timing
analysis, hence the term, timerless buffering. While this
new formulation is actually NP-complete, some highly
efficient and practical algorithms can be utilized.
The input to the timerless buffering problem includes a routing tree T = (V, E), where V = {s_0} ∪ V_s ∪ V_n and E ⊆ V × V. Vertex s_0 is the source vertex, V_s is the set of sink vertices, and V_n is the set of internal vertices. Each sink vertex s ∈ V_s is associated with a sink capacitance c_s. Each edge e ∈ E is associated with a lumped resistance R_e and capacitance c_e. A buffer library B contains different types of buffers. Each type of buffer b has a cost w_b, which can be measured by area or any other metric, depending on the optimization objective. Without loss of generality, we assume that the driver at source s_0 is also in B. A function f : V_n → 2^B specifies the types of buffers allowed at each internal vertex.
The output slew of a buffer, such as b_u at v_i, depends on the input slew at this buffer and the load capacitance seen from the output of the buffer. For a fixed input slew, the output slew of buffer b at vertex v is then given by

s_{b,out}(v) = R_b · c(v) + K_b    (6)

where c(v) is the downstream capacitance at v, and R_b and K_b are empirical fitting parameters. This is similar to empirically derived K-factor equations [50]. We call R_b the slew resistance and K_b the intrinsic slew of buffer b.
A buffer assignment γ is a mapping γ : V_n → B ∪ {b̄}, where b̄ denotes that no buffer is inserted. The cost of a solution γ is w(γ) = Σ_{b∈γ} w_b. With the above notation, the basic timerless buffering problem can be formulated as follows.

Timerless Buffering Problem: Given a Steiner tree T = (V, E) and a buffer library B, compute a buffer assignment γ such that the total cost w(γ) is minimized, subject to the input slew at each buffer or sink being no greater than a constant τ.
B. A Timerless Buffering Algorithm

In the dynamic programming framework, a set of candidate solutions is propagated from the sinks toward the source along the given tree. Each solution γ is characterized by a three-tuple (c, w, s), where c denotes the downstream capacitance at the current node, w denotes the cost of the solution, and s is the accumulated slew degradation s_w defined in (5). At a sink node, the corresponding solution has c equal to the sink capacitance, w = 0, and s = 0. The solution propagation is accomplished by the following operations.

Consider propagating solutions from a node v to its parent node u through edge e = (u, v). A solution γ_v at v becomes solution γ_u at u, which can be computed as c(γ_u) = c(γ_v) + c_e, w(γ_u) = w(γ_v), and s(γ_u) = s(γ_v) + ln 9 · D_e, where D_e = R_e(c_e/2 + c(γ_v)).
In addition to keeping the unbuffered solution γ_u, a buffer b_i can be inserted at u to generate a buffered
solution γ_{u,buf}, which can then be computed as c(γ_{u,buf}) = c_{b_i}, w(γ_{u,buf}) = w(γ_v) + w_{b_i}, and s(γ_{u,buf}) = 0.
When two sets of solutions are propagated through the left child branch and the right child branch to reach a branching node, they are merged. Denote the left-branch solution set and the right-branch solution set by Γ_l and Γ_r, respectively. For each solution γ_l ∈ Γ_l and each solution γ_r ∈ Γ_r, the corresponding merged solution γ′ can be obtained according to c(γ′) = c(γ_l) + c(γ_r), w(γ′) = w(γ_l) + w(γ_r), and s(γ′) = max{s(γ_l), s(γ_r)}. To ensure that the worst case in the two branches still satisfies the slew constraint, we take the maximum slew degradation for the merged solution.
For any two solutions γ_1, γ_2 at the same node, γ_1 dominates γ_2 if c(γ_1) ≤ c(γ_2), w(γ_1) ≤ w(γ_2), and s(γ_1) ≤ s(γ_2). Whenever a solution becomes dominated, it is pruned from the solution set without further propagation. A solution γ can also be pruned when it is infeasible, i.e., either its accumulated slew degradation s(γ) or the slew rate at any downstream buffer in γ is greater than the slew constraint τ.
When a buffer b_i is inserted into a solution γ, s(γ) is set to zero and c(γ) is set to c(b_i). This means that inserting one buffer may bring only one new solution, namely, the one with the smallest w. However, in minimum cost timing buffering, a buffer insertion may result in many non-dominated (q, c, w) tuples with the same c value, where q denotes the required arrival time (RAT).
Consequently, in timerless buffering, at each buffer position along a single branch, at most |B| new solutions can be generated through buffer insertion, since c and s are the same after inserting each buffer. In contrast, buffer insertion in the same situation may introduce many new solutions in timing buffering. This sheds light on why timerless buffering can be computed much more efficiently.
Another important fact is that the slew constraint is, in some sense, close to a length constraint. In timerless buffering, solutions can soon become infeasible if we do not add a buffer, and thus many solutions that are propagated only through wire insertion are soon removed. An extreme case demonstrating this point is that in standard timing buffering, the solutions with no buffer inserted can always survive until being pruned at the driver, given a loose timing constraint. This may not happen in timerless buffering: such solutions soon become infeasible as long as the slew constraint is not too loose.
Due to these special characteristics of the timerless buffering problem, a linear time optimal algorithm for buffering with a single buffer type is possible. In timing buffering, it is not known how to design a polynomial time algorithm for this case. From these facts, the basic differences between these two somewhat related buffering problems are clear.
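The operations above (wire propagation, buffer insertion, and infeasibility and domination pruning) can be sketched for the special case of a single source-to-sink chain; branch merging is omitted for brevity. The library tuple layout and the feasibility check at the source are illustrative assumptions, not the published algorithm's interfaces.

```python
import math

LN9 = math.log(9.0)  # slew degradation constant from Bakoglu's metric

def prune(cands):
    """Drop dominated candidates: worse or equal in all of (c, w, s)."""
    kept = []
    for a in sorted(set(cands)):  # a dominator always sorts before its victim
        if not any(b[0] <= a[0] and b[1] <= a[1] and b[2] <= a[2]
                   for b in kept):
            kept.append(a)
    return kept

def timerless_buffer_chain(edges, sink_cap, lib, tau):
    """Minimum-cost timerless buffering of a sink-to-source chain (sketch).

    edges : list of (Re, ce) wire segments ordered from sink to source.
    lib   : buffers as (cost, in_cap, Rb, Kb); output slew = Rb*load + Kb.
    tau   : slew constraint at every buffer/sink input.
    Returns the minimum feasible cost, or None if no assignment exists.
    """
    cands = [(sink_cap, 0.0, 0.0)]  # (c, w, s) at the sink
    for (Re, ce) in edges:
        nxt = []
        for (c, w, s) in cands:
            # propagate through the wire segment (Elmore delay of the edge)
            s2 = s + LN9 * Re * (ce / 2.0 + c)
            if s2 <= tau:                        # otherwise infeasible
                nxt.append((c + ce, w, s2))
        for (c, w, s) in list(nxt):
            # optionally insert each buffer type; the slew seen downstream
            # combines the buffer output slew and wire degradation, per (4)
            for (wb, cb, Rb, Kb) in lib:
                if math.hypot(Rb * c + Kb, s) <= tau:
                    nxt.append((cb, w + wb, 0.0))  # c and s reset
        cands = prune(nxt)
    # the driver at the source (assumed in lib) must also meet the constraint
    best = [w for (c, w, s) in cands
            if any(math.hypot(Rb * c + Kb, s) <= tau
                   for (wb, cb, Rb, Kb) in lib)]
    return min(best) if best else None
```

On a two-segment chain whose unbuffered slew degradation exceeds the constraint, a single mid-chain buffer is the cheapest fix, which is what the sketch finds.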
C. Results and Summary

Table 3 compares timerless buffering to timing-driven
buffering for 1000 high capacitance nets from an ASIC
design for slew constraints ranging from 0.4 to 2.0 ns. A
library of 48 buffers was used. The experiment shows that timerless buffering does result in a consistent degradation in slack, which is not surprising since it does not utilize timing information. Because timerless buffering minimizes area in its objective function, it is more efficient in buffering area and the number of buffers used. The area savings tend to increase as the slew constraint is relaxed. Finally, the CPU time advantage is clear, as speedups of 25× to over 100× are observed. The timing-driven buffering used here does utilize preslack pruning and squeeze pruning, but not library lookup. Obviously, the latter technique would reduce the advantage somewhat.
Since electrical correction can result in millions of buffers being inserted, one needs to do this as fast as possible. Even with the speedups in Section III, a delay-driven technique is not suitable for this task. Instead, using a timerless formulation that seeks to minimize area proves significantly faster and actually uses less area.
Ultimately, one needs a large bag of buffering solutions, depending on where one is in the physical synthesis flow. For early electrical correction, a faster timerless algorithm is appropriate. For critical path optimization, a van Ginneken style algorithm is needed. However, one often may need to pay attention to the blockages or the placement and routing congestion that may exist in the design. The next section shows a framework for dealing with any of these layout characteristics.
Table 3 Comparison of Timerless Buffering With Timing-Driven Buffering
VI. LAYOUT AWARE FAST AND FLEXIBLE BUFFER TREES

Given a Steiner tree, we can insert buffers for critical path optimization using timing-driven buffering or electrical
correction using timerless buffering. The quality of the
results strongly depends on the Steiner tree used, and
so we use a buffer-aware tree construction as described
in [39]. However, this construction ignores the blockages
and congestion present in the layout. Ignoring this can
potentially cause several design headaches.
A. Types of Layout Issues

For example, Fig. 11(a) illustrates the ‘‘alley’’ problem, in which space is limited between two large fixed blocks. The space between blocks is highly desirable since routes that cross the blockages have their only potential insertion space in the alley. Fig. 11(b) shows the buffer ‘‘pile-up’’ phe-
nomenon. Several nets may desire buffers to be inserted in
the black congested region, yet since there is no space for
buffers there, the buffers are inserted as close to the
boundary as possible. As more nets are optimized, these buffers pile up and spiral out further from their ideal
locations. This could be alleviated by only allowing buffers
from critical path optimization (not electrical correction)
to use these scarce resources.
As technology continues to scale, the optimum distance
between consecutive buffers continues to decrease. In
hierarchical design, this means allocating spaces within
macro blocks for buffering of global nets. An example is shown in Fig. 12(a). The space for buffers is potentially
limited, so non-critical nets should be routed around the
blocks while critical ones can use the holes. Long non-
critical nets still require buffers to fix slew and/or ca-
pacitance violations. In addition, these nets could be
critical, but have a wide range of possible buffering
solutions that may bring them into the non-critical group.
In the figure, the top net is non-critical and requires three buffers, while the bottom net is critical and needs only two
by exploiting holes punched in the block.
Even without holes in blocks, designs may have pockets
of low density for which inserting buffers is preferred, as
shown in Fig. 12(b). In the figure, the Steiner route is
located in the low density part of the chip, which makes
the buffers inserted along the route also use low density regions. Fig. 12(c) shows an example where one may be
willing to insert buffers in high density regions if a net is
critical. The 2-buffer route above the block yields faster
delays than the 4-buffer route below the block that is better
suited for noncritical nets. Finally, Fig. 12(d) shows
routing congestion between two blocks; the preferred
buffered route avoids this congestion without sacrificing
timing.

There are some buffering approaches that attack a subset of these types of problems by simultaneously integrating the layout environment, building a Steiner tree, and buffering (e.g., [51], [52]), but doing too much work at once inherently makes these algorithms too inefficient for this application. Instead, we propose the following flow:
• Step 1: construct a fast timing-driven Steiner tree (e.g., [39]) that is ignorant of the environment.
• Step 2: reroute the Steiner tree to preserve its topology while navigating environmental constraints.
• Step 3: insert buffers via the algorithms in Section III or V.
This section focuses on solving the problem in Step 2.
B. Rerouting Algorithm Overview

To reroute the tree, the design area is divided into tiles, as in global routing, and the placement and routing density characteristics are stored for each tile. The algorithm takes the existing Steiner tree and breaks it into disjoint 2-paths, i.e., paths which start and end with either the source, a sink, or a Steiner point, such that every internal node has degree two. For example, the nets shown in Fig. 13(a) and (b) both decompose into three 2-paths.
Finally, each 2-path is rerouted in turn to minimize cost, starting from the sinks and ending at the source. The new Steiner tree is assembled from the new 2-path routes.

Fig. 11. Buffer insertion can potentially: (a) fill up constrained ‘‘alleys’’ and (b) cause buffer ‘‘pile-ups.’’

Fig. 12. Some environmental based constraints include: (a) holes in large blocks; (b) navigating large blocks and dense regions; (c) distinguishing between critical and noncritical preferred routes; and (d) avoiding routing congestion.
Essentially, the algorithm is performing maze routing for
each subsection of the tree. The two key components of achieving a good result are plate expansion, which allows the Steiner points to migrate, and deriving the right maze routing cost function.
If a Steiner point is in a congested region, it needs to
migrate from its original location. One could consider
allowing it to move anywhere in the layout, but since the
original Steiner layout was presumably ‘‘good,’’ we restrict it to move only within a specified ‘‘plate’’ region. This is one key to enabling the algorithm to be efficient. The plate needs to be large enough to enable the Steiner point to migrate to a less congested tile.
During maze rerouting, one considers routing to any
tile in the plate instead of just the original tile.
Fig. 13(a) shows a routing tree after Step 1. The striped
tile is the Steiner point, and the shaded region shows a
5 × 5 plate centered at the original Steiner point.
Fig. 13(b) shows a Steiner tree that might result after re-
routing. The Steiner point has moved to a different location within the plate; where it ends up depends on the
cost function that is optimized. The dotted region shows
the potential search space for the rerouting of the 2-path
from the Steiner point to the source. In this case, the
bounding box containing the two endpoints was expanded
by one tile.
C. Maze Routing Cost Function for Electrical Correction
Each tile is assigned a cost that should reflect potentially inserting a buffer in and/or routing through the tile. Let 0 ≤ e(t) ≤ 1 be the environmental cost of using tile t, where e(t) = 0 if the tile is totally void of any resource utilization, while e(t) = 1 represents a fully utilized tile. As an example, for placement congestion, let d(t) be the placement density (cell area divided by total area available) of tile t, and let r(t) be its routability (used tracks divided by total tracks available). Then one could use
e(t) = α · d(t)² + (1 − α) · r(t)²    (7)

where 0 ≤ α ≤ 1 trades off between routing and placement cost.
For fixing electrical violations, one wants the net to
avoid high cost tiles, while still making an attempt to
minimize wirelength. For this case, consider
cost(t) = 1 + e(t).    (8)
This cost function implies that a fully utilized tile has cost twice that of a tile that uses no resources. The constant of one can be viewed as a ‘‘delay component.’’ Let the cost of a path be equal to the sum of the costs of all tiles in the path, and initially assign all sinks a zero initial cost. We wish to minimize the cost of the entire tree being constructed. For a tile t that corresponds to a Steiner point, with subtree children L and R, the cost of the tree rooted at t is cost(t) = cost(L) + cost(R).
D. Maze Routing Cost Function for Critical Path Optimization
For critical nets, the cost impact of the environment is relatively immaterial. We seek the absolute best possible slack, but still need the route to avoid regions where buffers cannot be inserted at all. When a net is optimally buffered (assuming no obstacles), its delay is a linear function of its length [53]. Of course, this solution must be realizable. To minimize delay, we simply minimize the number of tiles to the most critical sink. Thus, the cost for a tile is just cost(t) = 1 (there is no e(t) term). When merging branches, one wants to choose the branch with worst slack, so the merged cost cost(t) is max(cost(L), cost(R)). To initialize the slack, a notion of which sink is critical is needed. Since our cost function basically counts tiles as delay, the required arrival time (RAT) must be converted to tiles. Let DpT be the minimum delay per tile achievable on an optimally buffered line. For a sink s, cost(s) is initialized to −RAT(s)/DpT. The more critical a sink, the higher its initial cost. The objective is to minimize cost at the source.

Fig. 14(a) shows one of several possible solutions for
rerouting the net in Fig. 13 using this cost function, where
s2 is considered two tiles more critical than s1. Note that it
achieves a shortest path to s2. Contrast that with the
electrical correction cost function shown in Fig. 14(b), in
which the ‘‘blob’’ represents an area of high cost. In this
case, the route avoids the congested area even though it
means the route to the critical sink is much longer.
Fig. 13. Example of a three-pin net: (a) before and (b) after rerouting.
The shaded square region is the ‘‘plate’’ and the dotted region
is the solution search space for the final 2-path.
E. General Cost Function

The previous cost functions can generate extreme behavior; however, one can trade off between the two cost functions. Let 0 ≤ K ≤ 1 be the tradeoff parameter, where K = 1 corresponds to electrical correction and K = 0 corresponds to a critical net. The cost function for tile t is then
cost(t) = 1 + K · e(t).    (9)
For critical nets, merging branches is a maximization function, while it is an additive function for non-critical nets. These ideas can be combined to yield

cost(t) = max(cost(L), cost(R)) + K · min(cost(L), cost(R)).    (10)
Finally, the sink initialization formula becomes

cost(s) = (K − 1) · RAT(s)/DpT.    (11)
Thus, K trades off the cost function, the merging operation, and the sink initialization. In practice, we use K = 1 for electrical correction and subsequently smaller values down to K = 0.1 for critical path optimization.
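Equations (9)–(11) amount to three one-liners; a sketch, with K = 1 recovering the additive electrical-correction behavior of (8) and K = 0 recovering the critical-net behavior:

```python
def tile_cost(e_t, K):
    """Per-tile cost (9): 1 plus the scaled environmental cost e(t)."""
    return 1.0 + K * e_t

def merge_cost(cost_l, cost_r, K):
    """Branch merge (10): max plus K times min of the subtree costs."""
    return max(cost_l, cost_r) + K * min(cost_l, cost_r)

def sink_init(rat, dpt, K):
    """Sink initialization (11): (K - 1) * RAT(s) / DpT."""
    return (K - 1.0) * rat / dpt
```

Since max(a, b) + min(a, b) = a + b, setting K = 1 makes the merge additive, while K = 0 leaves only the max, matching the two special cases in the text.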
F. Slew Threshold Constraint

As described, the maze routing cost functions do not guarantee that slew constraints will be satisfied. Let T be the maximum number of tiles that can be driven by a buffer before the slew constraint is violated. If the route goes over more than T consecutive blocked tiles, there will be an unavoidable slew violation when buffering. Hence, during maze routing we track the number of consecutive blocked tiles and forbid it from exceeding T by not performing node expansion once this threshold is reached. This guarantees that the resulting Steiner tree will have sufficient area for buffers so that slew violations can be fixed by subsequent dynamic programming.
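A minimal sketch of the threshold test, assuming the router carries the current run of consecutive blocked tiles in each search node (the function name and return shape are illustrative):

```python
def can_expand(run_so_far, tile_blocked, T):
    """Slew-threshold gate for maze expansion (sketch).

    run_so_far   : consecutive blocked tiles on the path so far.
    tile_blocked : whether the tile being expanded into is blocked.
    T            : max tiles a buffer can drive before violating slew.
    Returns (expansion_allowed, new_run_length).
    """
    run = run_so_far + 1 if tile_blocked else 0  # a free tile resets the run
    return run <= T, run
```

A free tile resets the counter because a buffer can be placed there, so only uninterrupted blocked stretches longer than T are forbidden.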
G. Example and Summary

The effect of rerouting can be shown by the example in Fig. 15, which displays a placement density map for a given 7-pin net of an industrial design. The source is marked with a white x, while sinks are marked with dark squares. The white dots are potential buffer insertion locations, and the diamonds are the inserted buffers. The route on the left is the solution with K = 1.0, while the one on the right is the solution for K = 0.1. Observe that the left route totally avoids the large blockage, which ultimately leads to a 4134 ps slack improvement over the unbuffered solution. However, when K = 0.1, the route successfully finds the prime real estate (the holes inside the block) and places buffers in them where it deems appropriate. This improves the slack by 4646 ps. A simple parameter setting of the cost function thus yields a different Steiner route that can recognize layout constraints, depending on the particular phase of physical synthesis.
Optimizations that ignore the layout can cause severe headaches for timing closure and routability. The maze rerouting technique proposed in this section is general enough to handle any kind of layout configuration, whether blockages, regions packed with dense cells, or routing congestion. One does not need to deploy this throughout physical synthesis, though. Instead, one could wait for the ‘‘mess’’ and then clean it up. For example, PDS has a phase to identify all buffers in routing congested regions, rip up those buffers, and then reroute them using this maze routing strategy. This clean-up-the-mess strategy enables more efficient overall optimization than trying to always preemptively avoid the mess. The next section explains how a different kind of legalization algorithm is
Fig. 15. Illustration of the different routes obtained with the general maze routing cost function for a layout containing a large block with punched out holes. (a) A routed net with K = 1.0. (b) The same net with K = 0.1.

Fig. 14. Examples of the (a) critical and (b) non-critical net cost functions. The shaded area represents a region of high cost.
more effective at cleaning up messes made by synthesis operations.
VII. DIFFUSION-BASED PLACEMENT TECHNIQUES FOR LEGALIZATION
During electrical correction and critical path optimization, some gates may be resized while new ones are inserted into the design. PDS does not assign a location right away, but rather assigns a preferred location that may overlap existing cells. Periodically, legalization needs to run to snap these cells from overlapping to legal locations. If one waits too long between legalization invocations, cells may end up quite far from their preferred locations, which may severely hurt timing. This section discusses a new legalization paradigm called diffusion that was first described in [28]. Diffusion tries to avoid this behavior by keeping the relative ordering of the cells intact.
Of course, there are other methods that can also achieve legalization without moving any one cell too far away. Brenner et al. [54] describe a network flow algorithm that superimposes a flow network on top of grid bins and then flows cells from overly dense bins to bins that are under capacity. More recently, Luo et al. superimpose a Delaunay triangulation on top of the cells and use this structure to enforce relative order while achieving local density targets. Techniques for local cell movement, swapping, and shifting to improve placement quality after legalization can be found in [55], [56].
During optimization, local regions can become overfull, at which point synthesis, buffering, and repowering optimizations may become handcuffed if they are forbidden to add to the area in an already full bin. The main advantage of diffusion is that it can allow the optimizations to proceed anyway, knowing that cells will not be moved too far away from their intended locations. Further, diffusion can be implemented to run in just a few minutes, even on designs with millions of gates.
Diffusion is a well-understood physical process that
moves elements from a state with non-zero potential energy to a state of equilibrium. The process can be modeled by breaking down the movements into several small finite
time steps, then moving each element the distance it would
be expected to move during that time step. Our legalization
approach follows this model; it moves each cell a small
amount in a given time step according to its local density
gradient. The more time steps the process is run, the closer
the placement gets toward achieving equilibrium.
Assume that a placement is close to legal if all that is required to legalize the placement is to snap cells to rows or perhaps perform minor cell sliding in order to fit the cells. Also, assume the chip layout is divided into small, equally sized bins which can each fit around 5–15 cells. Let d_max be the maximum allowed density of a bin, where commonly d_max = 1. The placement is considered close to legal if the area density of every bin is less than or equal to d_max. For all bins with density greater than d_max, cells must be migrated out of those bins into less dense ones. The goal of legalization is to reduce the density of each bin to no more than d_max, while avoiding moving cells far from their original locations and while preserving the ordering induced by the original placement. Once each bin satisfies its density requirement d_{j,k} ≤ d_max, a legal placement solution can generally be achieved easily (since each bin is guaranteed sufficient space), e.g., through local slide and spiral optimization.
A. The Diffusion Process

Diffusion is driven by the concentration gradient, which is the slope and steepness of the concentration difference at a given point. The increase in concentration in a cross section of unit area with time is simply the difference between the material flow into the cross section and the material flow out of it. Diffusion reaches equilibrium when the material concentration is evenly distributed.

Mathematically, the relationship of material concentration with time and space can be described using the following partial differential equation:
∂d_{x,y}(t)/∂t = ∇²d_{x,y}(t)    (12)

where d_{x,y}(t) is the material concentration at position (x, y) at time t. Equation (12) states that the speed of density change is linear with respect to its second-order gradient over the density space. In the context of placement, cells will move more quickly when their local density neighborhood has a steeper gradient.
When the region for diffusion is fixed (as in placement), the boundary conditions are defined as ∇d_{x_b,y_b}(t) = 0 for coordinates (x_b, y_b) on the chip boundary. We also define coordinates over fixed blocks in the same way, in order to prevent cells from diffusing on top of fixed blocks. This forces cells to diffuse around the blocks.
In diffusion, a cell migrates from an initial location to its final equilibrium location via a non-direct route. This route can be captured by a velocity function that gives the velocity of a cell at every location in the circuit for a given time t. The velocity at a certain position and time is determined by the local density gradient and the density itself. Intuitively, a sharp density gradient causes cells to move faster. For every potential (x, y) location, define a 2-D velocity field v_{x,y} = (v^H_{x,y}, v^V_{x,y}) of diffusion at time t as follows:

v^H_{x,y}(t) = −(∂d_{x,y}(t)/∂x) / d_{x,y}(t)
v^V_{x,y}(t) = −(∂d_{x,y}(t)/∂y) / d_{x,y}(t).    (13)
Alpert et al.: Techniques for Fast Physical Synthesis
592 Proceedings of the IEEE | Vol. 95, No. 3, March 2007
Given this equation and a starting location (x(0), y(0)) for a particular element, one can find the new location (x(t), y(t)) for the element at time t by integrating the velocity field

x(t) = x(0) + ∫_0^t v^H_{x(t′),y(t′)}(t′) dt′
y(t) = y(0) + ∫_0^t v^V_{x(t′),y(t′)}(t′) dt′.    (14)
Equations (12)–(14) are sufficient to simulate the
diffusion process. Given any particular element, one can
now find the new location of the molecule at any point
in time t. To apply this paradigm to placement, one
needs to migrate from this continuous space to a discrete one, since cells have various rectangular sizes and the
placement image itself is discrete. The next section
presents a technique to simulate diffusion specifically for
placement.
B. Diffusion Based Placement

One can discretize the continuous coordinates by dividing the placement area into equal sized bins indexed by (j, k). Assume the coordinate system is scaled so that the width and height of each bin is one. Then location (x, y) lies inside bin (j, k) = (⌊x⌋, ⌊y⌋). We can also discretize continuous time t as n·Δt, where Δt is the size of the discrete time step.

Instead of the continuous density d_{x,y}, we can now describe diffusion in the context of the density d_{j,k} of bin (j, k). The initial density d_{j,k}(0) of each bin (j, k) can be defined as d_{j,k}(0) = Σ_i A_i, where A_i is the overlapping area of cell i and bin (j, k).

For simplicity, assume that if a fixed block overlaps a
bin, it overlaps the bin completely. In these cases, the
bin density is defined to be one, though boundary con-
ditions prevent cells from diffusing on top of fixed
blocks.
Assume that the density d_{j,k}(n) has already been computed for time step n. One now needs to find how the density and the cell locations change for the next time step n + 1. We use the Forward Time Centered Space (FTCS) [57] scheme to discretize (12). The new bin density is given by

$$d_{j,k}(n+1) = d_{j,k}(n) + \frac{\Delta t}{2}\bigl[d_{j+1,k}(n) + d_{j-1,k}(n) - 2d_{j,k}(n)\bigr] + \frac{\Delta t}{2}\bigl[d_{j,k+1}(n) + d_{j,k-1}(n) - 2d_{j,k}(n)\bigr]. \tag{15}$$

The new density of a bin at time n + 1 depends only on its own density and the densities of its four neighboring bins. Note that one does not actually use the cell locations at time n + 1 to compute the density.
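As an illustration, the FTCS update of (15) takes only a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the function name and the zero-flux boundary handling via edge padding are assumptions made here for the example.

```python
import numpy as np

def ftcs_step(d, dt):
    """One Forward Time Centered Space (FTCS) update of the bin-density
    grid d, following Eq. (15).  dt is the time step.  The grid is
    padded with edge values so each boundary bin sees its own density
    as the 'missing' neighbor (an assumed zero-flux boundary)."""
    p = np.pad(d, 1, mode="edge")                  # replicate boundary densities
    lap_x = p[2:, 1:-1] + p[:-2, 1:-1] - 2.0 * d   # d(j+1,k) + d(j-1,k) - 2d(j,k)
    lap_y = p[1:-1, 2:] + p[1:-1, :-2] - 2.0 * d   # d(j,k+1) + d(j,k-1) - 2d(j,k)
    return d + (dt / 2.0) * (lap_x + lap_y)
```

A uniform grid is a fixed point of this update, and an isolated density peak spreads into its four neighbors, exactly as the equilibrium argument in the text predicts.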
Just as (12) can be discretized to compute placement bin density, (13) can be discretized to compute the velocity of the cells inside the bins. For now, assume that each cell in a bin is assigned the same velocity, the velocity of the bin, given by

$$v^H_{j,k}(n) = -\frac{d_{j+1,k}(n) - d_{j-1,k}(n)}{2\,d_{j,k}(n)}, \qquad v^V_{j,k}(n) = -\frac{d_{j,k+1}(n) - d_{j,k-1}(n)}{2\,d_{j,k}(n)}. \tag{16}$$

The horizontal (vertical) velocity is proportional to the difference in density of the two neighboring horizontal (vertical) bins. To make sure that fixed cells and bins outside the boundary do not move, we enforce v^V = 0 at a horizontal boundary and v^H = 0 at a vertical boundary.
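The per-bin velocities of (16) and the boundary conditions above can be sketched as follows. This is an illustrative sketch under assumed conventions (axis 0 is the horizontal bin index j, a small eps guards empty bins); the function name is hypothetical.

```python
import numpy as np

def bin_velocities(d, eps=1e-12):
    """Per-bin velocities from Eq. (16): central density differences
    divided by 2*d(j,k).  Velocity components normal to the chip
    boundary are forced to zero so cells never diffuse off the
    placement area."""
    p = np.pad(d, 1, mode="edge")
    vh = -(p[2:, 1:-1] - p[:-2, 1:-1]) / (2.0 * d + eps)
    vv = -(p[1:-1, 2:] - p[1:-1, :-2]) / (2.0 * d + eps)
    vh[0, :] = vh[-1, :] = 0.0   # vH = 0 at the vertical (left/right) boundaries
    vv[:, 0] = vv[:, -1] = 0.0   # vV = 0 at the horizontal (bottom/top) boundaries
    return vh, vv
```

On a grid whose density grows in one direction, cells in interior bins are pushed toward the sparser side, matching the sign convention of (16).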
Assigning every cell in a bin the same velocity fails to distinguish the relative locations of cells within the bin. Further, two cells that are right next to each other but in different bins can be assigned very different velocities, which could change their relative ordering. Since the goal of placement migration is to preserve the integrity of the original placement, this behavior cannot be permitted. To remedy it, we apply velocity interpolation: for a given location (x, y), the four closest bins are examined and the velocities assigned to those bins are interpolated, generating a unique velocity vector (v^H_{x,y}, v^V_{x,y}) for a cell at (x, y).

Finally, since the velocity of each cell can be determined at time n = t/Δt, one can compute its new placement via a discretized form of (14). Suppose at time step n a cell has location (x(n), y(n)); its location at the next time step is given by

$$x(n+1) = x(n) + v^H_{x(n),y(n)} \cdot \Delta t, \qquad y(n+1) = y(n) + v^V_{x(n),y(n)} \cdot \Delta t. \tag{17}$$
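The velocity interpolation and the position update of (17) can be sketched together for a single cell. This is a hedged sketch: treating bin velocities as samples at bin centers (j + 0.5, k + 0.5) is an assumption made for the example, and the function name is hypothetical.

```python
import numpy as np

def move_cell(x, y, vh, vv, dt):
    """Advance one cell by Eq. (17), bilinearly interpolating the four
    nearest bin velocities so that neighboring cells in adjacent bins
    receive similar velocities and keep their relative order.
    Coordinates are in the unit-bin system used in the text."""
    J, K = vh.shape
    # fractional position relative to bin centers, clamped to the grid
    fx = min(max(x - 0.5, 0.0), J - 1.0)
    fy = min(max(y - 0.5, 0.0), K - 1.0)
    j0, k0 = int(fx), int(fy)
    j1, k1 = min(j0 + 1, J - 1), min(k0 + 1, K - 1)
    wx, wy = fx - j0, fy - k0
    def bilerp(v):
        return ((1 - wx) * (1 - wy) * v[j0, k0] + wx * (1 - wy) * v[j1, k0]
                + (1 - wx) * wy * v[j0, k1] + wx * wy * v[j1, k1])
    return x + bilerp(vh) * dt, y + bilerp(vv) * dt
```

Iterating this update over all cells and all time steps yields trajectories like the nine-step path of Fig. 16.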
An example is shown in Fig. 16 in which a cell takes
nine discrete time steps. Observe how the cell never
overlaps a blockage and also how the magnitude of its
movements becomes smaller toward the tail end of its path.
C. Making It Work

Since the diffusion process reaches equilibrium when each bin has the same density, we can expect the final density after diffusion to equal the average density $\sum_{j,k} d_{j,k}/N$, where N is the number of bins. This can cause unnecessary spreading even if every bin's density is well below d_max, and this additional spreading degrades the placement quality of results.
Essentially, what we would like is to run diffusion only in the regions that require it, perhaps for legalization or to remove routing congestion, while leaving the rest of the design (which may be in very good shape) alone. The idea of local diffusion is to run diffusion only on cells in a window around bins that violate the target density constraint. Local diffusion also has the advantages of less work per iteration and faster convergence.
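The window construction behind local diffusion can be sketched as follows; the function name, the boolean-mask representation, and the fixed window margin are assumptions made for this example, not details from the paper.

```python
import numpy as np

def local_diffusion_windows(d, d_max, margin=2):
    """Sketch of local diffusion: find bins whose density exceeds the
    target d_max and return a boolean mask covering a window of
    `margin` bins around each violator.  Diffusion is then run only on
    masked bins, leaving well-placed regions of the design untouched."""
    mask = np.zeros_like(d, dtype=bool)
    for j, k in zip(*np.nonzero(d > d_max)):
        j0, j1 = max(j - margin, 0), min(j + margin + 1, d.shape[0])
        k0, k1 = max(k - margin, 0), min(k + margin + 1, d.shape[1])
        mask[j0:j1, k0:k1] = True
    return mask
```

A single overfull bin in the middle of an otherwise legal region thus triggers work only in a small neighborhood around it.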
Although we use (15) to compute bin densities during diffusion, the computed densities are not exactly the same as the real placement densities. The mathematics of the diffusion process [(15)–(17)] assume a continuous distribution of equal-sized particles, but a real standard-cell distribution does not satisfy this condition: cells are not evenly distributed inside a bin, and cells have different sizes. One should therefore periodically update the density from the real cell placement when the error exceeds a certain threshold, then restart the diffusion algorithm from the new placement map.
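Recomputing bin densities from the actual cell rectangles, as this periodic refresh requires, can be sketched as below. This is an illustrative implementation of the overlap-area definition $d_{j,k} = \sum_i A_i$ from the text; the function name and the rectangle representation are assumptions.

```python
import numpy as np

def bin_densities(cells, shape):
    """Recompute bin densities from the real cell placement.  Each cell
    is a rectangle (x_lo, y_lo, x_hi, y_hi) in the unit-bin coordinate
    system; d[j, k] accumulates the cell area overlapping bin (j, k),
    matching the definition d_{j,k} = sum_i A_i in the text."""
    d = np.zeros(shape)
    for (xl, yl, xh, yh) in cells:
        for j in range(int(xl), min(int(np.ceil(xh)), shape[0])):
            for k in range(int(yl), min(int(np.ceil(yh)), shape[1])):
                ov = max(0.0, min(xh, j + 1) - max(xl, j)) * \
                     max(0.0, min(yh, k + 1) - max(yl, k))
                d[j, k] += ov
        # (bins fully covered by fixed blocks would instead be pinned at density 1)
    return d
```

A unit-area cell straddling four bins contributes a quarter of its area to each, so the refreshed map reflects both unequal cell sizes and uneven positions within bins.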
D. Diffusion Summary

Fig. 17 shows an example of diffusion-based legalization in a region surrounded by other placed cells and fixed blocks. The top-left figure shows an initial illegal placement in which the colored regions represent areas of cell overlap. The top-right figure shows what happens when traditional legalization is invoked: the integrity of the regions is no longer preserved as the colored cells mix, and some cells move quite far from their neighbors in the original illegal placement. Finally, the bottom figure shows the result of diffusion-based legalization, in which the continuity of the colored regions is relatively well preserved. This example illustrates that diffusion performs a smooth spreading that is less disruptive to the state of the design.
To see how effective diffusion-based legalization can be in a physical synthesis engine, we ran PDS physical synthesis optimization on seven ASIC testcases without legalizing at all during the run, which results in a large number of overlaps. For comparison, we ran a greedy and a flow-based legalizer and measured the best results obtained by those approaches [28]. Compared to the traditional approaches, diffusion averages about 4% improvement in the total wirelength of the design. Further, the timing of the worst slack path is 48% better on average, and the overall number of negative paths is 36% better. The improvement holds for all seven designs.
The ability of diffusion to minimize timing degrada-
tion, to smoothly spread out the placement, and to attack
local hotspots of either placement or routing congestion
makes it a powerful technique for physical synthesis. For
starters, one can afford to run legalization less often since
diffusion is less likely to significantly disrupt the state of the design.
VIII . CONCLUSION
A. Impact of the Stages of Physical Synthesis

This paper discussed various techniques for achieving fast physical synthesis, which may be applied in all the phases of physical synthesis. Recall the four main phases considered in this paper:
1) initial placement and optimization;
2) timing-driven placement and optimization;
3) timing-driven detailed placement;
4) optimization techniques.
One need not apply all the techniques in performing design closure; designers frequently mix and match the pieces depending upon their needs. For example, the first phase is especially useful during the floorplanning process: the designer may wish to find locations for large blocks and also restrict the movement of key logic. Through placement and optimization, the designer can reasonably evaluate the quality of the floorplan. If the designer is happy with this result, he or she may skip all the way to the last phase to push down the timing on any remaining critical paths.
In general, the timing after performing the first step will be far from achieving closure, e.g., the cycle time may be double what is required by the design specifications. Performing timing-driven placement and optimization generally helps significantly and results in many fewer negative paths. The third stage generally does not help timing but may improve wiring by anywhere from 2% to 5%, and this can make a huge difference in achieving a routable design.

Fig. 16. An example cell movement from diffusion.
Finally, unless the design is for some reason "easy," the last stage of optimization is critical for actually achieving timing closure. Designers exploit this stage the most during their iterations as they tweak the design. If only minor changes are required, going back to global placement would be far too disruptive and could put the design in a completely different state. The ability to iterate and perform in-place synthesis is critical in garnering the last bit of performance out of the design. However, if the timing of the design is in really bad shape, optimization alone will not be able to close timing; the designer must go back and iterate on the floorplan and global placement steps.
B. Future Directions

Physical synthesis is a runtime-intensive, complex system that requires the integration and cooperation of several types of algorithms and functions. Exacerbating the turnaround-time problem is that design sizes will likely soon move from millions to tens of millions of placeable objects. There are numerous research directions in the timing-closure space that we believe are worth pursuing to achieve both faster runtime and higher quality of results. In general, achieving better quality can also be a great way to achieve a faster system, as the back-end optimization would have far fewer negative paths to work on. Some promising research directions include the following.
1) Better net weighting for timing-driven placement. For example, consider two critical paths A and B, both of which are equally critical, but A spends 80% of its delay traversing fixed blocks and 20% through moveable logic, while B spends 20% and 80% in fixed and moveable logic, respectively. In this case, A does not have much room for error, as placement can only fix the 20% of its logic that is moveable, while B gives placement considerably more opportunity to straighten out the 80% of logic that it can affect. Thus, net weighting should give more priority to nets in path A than in path B. There are numerous other scenarios that can be studied and modeled to improve net weighting.

Fig. 17. Diffusion-based legalization example.
2) Removing a global placement. In the flow described, placement is run twice. With clever net weighting and crude placement estimation, it may be possible to significantly improve runtime by skipping a placement step altogether while still retaining solution quality.
3) Latch pipeline placement. As designs require multiple cycles to get from one side of the chip to the other, placement needs to recognize that latches must be placed so as to guarantee that one can get from one latch to the next within the given cycle time. For example, assume latch A drives latch B, which drives latch C, with A fixed on the left side of the chip and C fixed on the right. If B is too close to A, then the path from B to C becomes critical; if one applies a higher net weight to the connection from B to C, then B may be moved too close to C, and the A-to-B path becomes critical instead. One has to teach placement to find an appropriate balance, and it is unlikely that net weighting alone can achieve this kind of result.
4) "Do no harm" detailed placement. Detailed placement is a powerful technique for improving wirelength but typically does not improve timing. In fact, it is risky to run it late in the fourth stage of the flow because it may worsen paths that were already carefully optimized. The idea of "do no harm" detailed placement [58] is to recognize moves that degrade the timing and forbid them, while accepting only moves that improve wirelength and timing.
5) Force-directed placement. As discussed earlier, force-directed placement is emerging as a promising technique both in terms of quality, e.g., [7], mPL [8], and mFAR [9], and in terms of speed [10]. This technique also has the advantage of stability, in that small changes to net weights are unlikely to create entirely different global placements. Its spreading ability (like that of diffusion) makes it appealing for handling incremental netlist changes.
6) Parallelism. As designs truly become large, they can potentially be partitioned into smaller physical pieces that do not require an inordinate amount of cross-partition communication; one can then apply physical synthesis to each piece relatively independently. While this approach seems simple enough, it is fraught with choices, any of which could lead to a significantly degraded solution. One must be careful with the partition pin assignment, the buffering strategy, and the timing contracts between partitions.
7) Complex transforms. Transforms that perform multiple operations simultaneously could potentially have a big impact on timing. For example, consider a cell B on the left side of the chip connected to cells A and C on the right side. Clearly B wants to be near A and C, but if the nets connected to B have already been buffered, those buffers act as anchors that keep B from moving to the right. One needs to rip up the buffer trees, consider moving B, then put the buffer trees back in to evaluate whether the move was worthwhile. Another example is simultaneous buffering and cloning.
This list is just a sampling of possible research directions. As design technology scales to 65 nm and below, the problem of timing closure will continue to evolve into the even more complex problem of design closure. Design closure requires that accurate modeling of the clock tree network and routing be incorporated earlier and earlier in the physical synthesis pipeline to account for their effects on timing and signal integrity. The need to meet a global power constraint, e.g., by incorporating multithreshold logic gates and voltage islands, also becomes more critical, and one must pay attention to how physical design choices impact manufacturability. Requiring physical synthesis to meet these additional constraints only further exacerbates the runtime issue. Therefore, research that discovers more efficient techniques for core physical synthesis optimizations, such as placement, buffering, legalization, repowering, incremental timing, routing, and clock tree synthesis, will continue to be of high value.
Acknowledgment

The PDS physical synthesis system has had many contributors over the years. The authors sincerely thank everyone who has helped, both in driving the work presented here and through overall contributions to IBM's PDS tool. These contributors include Lakshmi Reddy, Ruchir Puri, David Kung, Leon Stok, Charles Bivona, Louise Trevillian, Michael Kazda, Pooja Kotecha, Nate Heiter, Erik Kusko, Mike Dotson, Carl Hagen, Zahi Kurzum, Gopal Gandham, Stephen Quay, Tuhin Mahmud, Jiang Hu, Milos Hrkic, Kristian Zoerhoff, William Dougherty, Brian Wilson, Bryon Wirtz, Tony Drumm, Elaine D'Souza, Shyam Ramji, Alex Suess, Jose Neves, Veena Puresan, Arjen Mets, Andrew Sullivan, Jim Curtain, David Geiger, Tsz-mei Ko, and Pete Osler.
REFERENCES
[1] L. Trevillyan, D. Kung, R. Puri, L. N. Reddy, and M. A. Kazda, "An integrated environment for technology closure of deep-submicron IC designs," IEEE Des. Test Comput., vol. 21, no. 1, pp. 14–22, Jan.–Feb. 2004.
[2] P. G. Villarrubia, "Physical design tools for hierarchy," in Proc. ACM Int. Symp. Physical Design, 2005.
[3] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, "Repeater scaling and its impact on CAD," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 451–463, Apr. 2004.
[4] J. Cong, Z. D. Kong, and T. Pan, "Buffer block planning for interconnect planning and prediction," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 9, no. 6, pp. 929–937, Dec. 2001.
[5] C. J. Alpert, J. Hu, S. S. Sapatnekar, and P. G. Villarrubia, "A practical methodology for early buffer and wire resource allocation," in Proc. Design Automation Conf., 2001.
[6] G.-J. Nam, C. J. Alpert, P. G. Villarrubia, B. Winter, and M. Yildiz, "The ISPD2005 placement contest and benchmark suite," in Proc. ACM Int. Symp. Physical Design, 2005, pp. 216–220.
[7] A. B. Kahng and Q. Wang, "Implementation and extensibility of an analytic placer," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 5, pp. 734–747, May 2005.
[8] T. Chan, J. Cong, T. Kong, J. Shinnerl, and K. Sze, "An enhanced multilevel algorithm for circuit placement," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2003, pp. 299–305.
[9] B. Hu and M. M. Sadowska, "Fine granularity clustering-based placement," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 527–536, Apr. 2004.
[10] N. Viswanathan and C.-N. Chu, "Fastplace: Efficient analytical placement using cell shifting, iterative local refinement and a hybrid net model," in Proc. ACM Int. Symp. Physical Design, 2004, pp. 26–33.
[11] B. Halpin, C. Y. R. Chen, and N. Sehgal, "Timing driven placement using physical net constraints," in Proc. IEEE/ACM Design Automation Conf., 2001, pp. 780–783.
[12] R.-S. Tsay and J. Koehl, "An analytic net weighting approach for performance optimization in circuit placement," in Proc. IEEE/ACM Design Automation Conf., 1991, pp. 620–625.
[13] X. Yang, B.-K. Choi, and M. Sarrafzadeh, "Timing-driven placement using design hierarchy guided constraint generation," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2002, pp. 177–180.
[14] K. Rajagopal, T. Shaked, Y. Parasuram, T. Cao, A. Chowdhary, and B. Halpin, "Timing driven force directed placement with physical net constraints," in Proc. Int. Symp. Physical Design, Apr. 2003, pp. 60–66.
[15] H. Ren, D. Z. Pan, and D. Kung, "Sensitivity guided net weighting for placement driven synthesis," in Proc. Int. Symp. Physical Design, Apr. 2004, pp. 10–17.
[16] T. Kong, "A novel net weighting algorithm for timing-driven placement," in Proc. Int. Conf. Computer-Aided Design, 2002, pp. 172–176.
[17] D. Brand, R. F. Damiano, L. P. P. P. van Ginneken, and A. D. Drumm, "In the driver's seat of BooleDozer," in Proc. Int. Conf. Computer Design, 1994, pp. 518–521.
[18] L. Stok, D. S. Kung, D. Brand, A. D. Drumm, L. N. Reddy, N. Hieter, D. J. Geiger, H. H. Chao, P. J. Osler, and A. J. Sullivan, "BooleDozer: Logic synthesis for ASICs," IBM J. Res. Dev., vol. 40, no. 4, pp. 407–430, 1996.
[19] W. Donath, P. Kudva, L. Stok, P. Villarrubia, L. Reddy, A. Sullivan, and K. Chakraborty, "Transformational placement and synthesis," in Proc. Design, Automation and Test in Europe, Mar. 2000.
[20] S. K. Karandikar, C. J. Alpert, M. C. Yildiz, P. G. Villarrubia, S. T. Quay, and T. Mahmud, "Fast electrical correction using resizing and buffering," in Proc. Asia and South Pacific Design Automation Conf., 2007.
[21] P. J. Osler, "Placement driven synthesis case studies on two sets of two chips: Hierarchical and flat," in Proc. ACM Int. Symp. Physical Design, 2004, pp. 190–197.
[22] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," in Proc. ACM/IEEE Design Automation Conf., 1997, pp. 526–529.
[23] G.-J. Nam, S. Reda, C. Alpert, P. Villarrubia, and A. Kahng, "A fast hierarchical quadratic placement algorithm," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 4, Apr. 2006.
[24] L. P. P. P. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay," in Proc. IEEE Int. Symp. Circuits and Systems, May 1990, pp. 865–868.
[25] Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi, "Making fast buffer insertion even faster via approximation techniques," in Proc. Asia and South Pacific Design Automation Conf., 2005, pp. 13–18.
[26] S. Hu, C. J. Alpert, J. Hu, S. K. Karandikar, Z. Li, W. Shi, and C. N. Sze, "Fast algorithms for slew constrained minimum cost buffering," in Proc. ACM/IEEE Design Automation Conf., 2006, pp. 308–313.
[27] C. J. Alpert, M. Hrkic, J. Hu, and S. T. Quay, "Fast and flexible buffer trees that navigate the physical layout environment," in Proc. ACM/IEEE Design Automation Conf., 2004, pp. 24–29.
[28] H. Ren, D. Z. Pan, C. J. Alpert, and P. Villarrubia, "Diffusion-based placement migration," in Proc. Design Automation Conf., 2005, pp. 515–520.
[29] W.-J. Sun and C. Sechen, "Efficient and effective placement for very large circuits," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 14, no. 5, pp. 349–359, May 1995.
[30] C. J. Alpert, J.-H. Huang, and A. B. Kahng, "Multilevel circuit partitioning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 8, pp. 655–667, Aug. 1998.
[31] A. E. Caldwell, A. B. Kahng, and I. L. Markov, "Can recursive bisection alone produce routable placements?" in Proc. Design Automation Conf., 2000, pp. 477–482.
[32] A. Agnihotri, M. C. Yildiz, A. Khatkhate, A. Mathur, S. Ono, and P. H. Madden, "Fractional cut: Improved recursive bisection placement," in Proc. Int. Conf. Computer-Aided Design, 2003, pp. 307–310.
[33] M. Wang, X. Yang, and M. Sarrafzadeh, "Dragon2000: Standard-cell placement tool for large industry circuits," in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 260–263.
[34] H. Eisenmann and F. M. Johannes, "Generic global placement and floorplanning," in Proc. ACM/IEEE Design Automation Conf., 1998, pp. 269–274.
[35] P. Spindler and F. M. Johannes, "Fast and robust quadratic placement combined with an exact linear net model," presented at the IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, CA, 2006.
[36] J. Vygen, "Algorithms for large-scale flat placement," in Proc. ACM/IEEE Design Automation Conf., 1997, pp. 746–751.
[37] D.-H. Huang and A. B. Kahng, "Partitioning based standard cell global placement with an exact objective," in Proc. ACM Int. Symp. Physical Design, 1997, pp. 18–25.
[38] C. J. Alpert and A. B. Kahng, "Recent developments in netlist partitioning: A survey," Integr. VLSI J., vol. 19, pp. 1–81, 1995.
[39] C. J. Alpert, G. Gandham, M. Hrkic, J. Hu, A. B. Kahng, J. Lillis, B. Liu, S. T. Quay, S. S. Sapatnekar, and A. J. Sullivan, "Buffered Steiner trees for difficult instances," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 1, pp. 3–14, Jan. 2002.
[40] J. Cong, A. Kahng, and K. Leung, "Efficient algorithm for the minimum shortest path Steiner arborescence problem with application to VLSI physical design," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 17, no. 1, pp. 24–38, Jan. 1998.
[41] J. Lillis, C. K. Cheng, and T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, pp. 437–447, Mar. 1996.
[42] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion for noise and delay optimization," in Proc. ACM/IEEE Design Automation Conf., 1998, pp. 362–367.
[43] C. J. Alpert, A. Devgan, and S. T. Quay, "Buffer insertion with accurate gate and interconnect delay computation," in Proc. ACM/IEEE Design Automation Conf., 1999, pp. 479–484.
[44] W. Shi and Z. Li, "An O(n log n) time algorithm for optimal buffer insertion," in Proc. IEEE/ACM Design Automation Conf., 2003, pp. 580–585.
[45] W. Shi, Z. Li, and C. J. Alpert, "Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost," in Proc. Asia and South Pacific Design Automation Conf., 2004, pp. 609–614.
[46] C. J. Alpert, R. G. Gandham, J. L. Neves, and S. T. Quay, "Buffer library selection," in Proc. Int. Conf. Computer Design, 2000, pp. 221–226.
[47] J. Lillis, C. K. Cheng, and T.-T. Y. Lin, "Optimal wire sizing and buffer insertion for low power and a generalized delay model," IEEE J. Solid-State Circuits, vol. 31, no. 3, pp. 437–447, Mar. 1996.
[48] C. Kashyap, C. Alpert, F. Liu, and A. Devgan, "Closed form expressions for extending step delay and slew metrics to ramp inputs," in Proc. Int. Symp. Physical Design (ISPD), 2003, pp. 24–31.
[49] H. Bakoglu, Circuits, Interconnects, and Packaging for VLSI. Reading, MA: Addison-Wesley, 1990.
[50] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. Reading, MA: Addison-Wesley, 1993, pp. 221–223.
[51] M. Hrkic and J. Lillis, "S-tree: A technique for buffered routing tree synthesis," in Proc. ACM/IEEE Design Automation Conf., 2002, pp. 578–583.
[52] X. Tang, R. Tian, H. Xiang, and D. F. Wong, "A new algorithm for routing tree construction with buffer insertion and wire sizing under obstacle constraints," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2001, pp. 49–56.
[53] C. J. Alpert, J. Hu, S. S. Sapatnekar, and C. N. Sze, "Accurate estimation of global buffer delay within a floorplan," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design, 2004, pp. 706–711.
[54] U. Brenner, A. Pauli, and J. Vygen, "Almost optimum placement legalization by minimum cost flow and dynamic programming," in Proc. Int. Symp. Physical Design, 2004, pp. 2–9.
[55] S. W. Hur and J. Lillis, "Mongrel: Hybrid techniques for standard cell placement," in Proc. Int. Conf. Computer-Aided Design, 2000, pp. 165–170.
[56] A. B. Kahng, P. Tucker, and A. Zelikovsky, "Optimization of linear placements for wirelength minimization with free sites," in Proc. Asia and South Pacific Design Automation Conf., 1999, pp. 18–21.
[57] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C++. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[58] H. Ren, D. Pan, C. Alpert, G.-J. Nam, and P. G. Villarrubia, "Hippocrates: First-do-no-harm detailed placement," presented at the Asia and South Pacific Design Automation Conf., Yokohama, Japan, 2007.
ABOUT THE AUTHORS
Charles J. Alpert (Fellow, IEEE) received the B.S.
degree in math and computational sciences and
the B.A. degree in history from Stanford Univer-
sity, Stanford, CA, in 1991 and the Ph.D. degree
in computer science from the University of
California, Los Angeles (UCLA), in 1996.
He currently works as a Research Staff Member
at the IBM Austin Research Laboratory, Austin, TX,
where he serves as the technical lead for the
design tools group. He has over 80 conference and
journal publications. His research centers upon innovation in physical
synthesis optimization.
Dr. Alpert has thrice received the Best Paper Award from the ACM/
IEEE Design Automation Conference. He has served as the general chair
and the technical program chair for the Tau Workshop on Timing Issues
in the Specification and Synthesis of Digital Systems and the Interna-
tional Symposium on Physical Design. He also serves as an Associate
Editor of IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN. For his work in
mentoring SRC funded research, he received the Mahboob Khan Mentor
Award in 2001.
Shrirang K. Karandikar received the B.E. degree
from the University of Pune, Pune, India, in 1994,
the M.S. degree from Clarkson University,
Potsdam, NY, in 1996, and the Ph.D. degree from
the University of Minnesota, Minneapolis, in 2004.
He worked with Intel’s Logic and Validation
Technology group from 1997 to 1999, and is
currently a Research Staff Member at the IBM Austin
Research Laboratory. His current interests are in
the areas of logic synthesis and physical design of
VLSI systems.
Zhuo Li (Member, IEEE) received the B.S. and M.S.
degrees in electrical engineering from Xi’an
Jiaotong University, Xi’an, China, and the Ph.D.
degree in computer engineering from Texas A&M
University, College Station, in 1998, 2001, and
2005, respectively.
From 2005 to 2006, he was with Pextra
Corporation, College Station as a Cofounder and
Senior Technical Staff working on VLSI extraction
tools development. He is currently with IBM
Austin Research Laboratory, Austin, TX. His research interests include
physical synthesis optimization, parasitic extraction, circuit modeling
and simulation, timing analysis and delay testing.
Dr. Li was a recipient of the Applied Materials Fellowship in 2002. He
received a Best Paper Award at the Asia and South Pacific Design
Automation Conference in 2007.
Gi-Joon Nam (Member, IEEE) received the B.S.
degree in computer engineering from Seoul Na-
tional University, Seoul, Korea, and the M.S. and
Ph.D. degrees in computer science and engineer-
ing from the University of Michigan, Ann Arbor.
Since 2001, he has been with the International
Business Machines Corporation Austin Research
Laboratory, Austin, TX, where he is primarily
working on the physical design space, particularly
placement and timing closure flow. His general interests include computer-aided design algorithms, combinatorial
optimizations, very large scale integration system designs, and computer
architecture.
Dr. Nam has been serving on the technical program committee for the
International Symposium on Physical Design (ISPD), the International
Conference on Computer Design (ICCD), the Asia and South Pacific Design
Automation Conference (ASPDAC) and the International System-on-Chip
Conference (SOCC). He was also the organizer of the ISPD 2005/2006 placement contests.
Stephen T. Quay received two B.S. degrees in
electrical engineering and computer science from
Washington University, St. Louis, MO, in 1983.
He is currently a Senior Engineer with the IBM
Systems and Technology Group, Austin, TX. Since
1983, he has worked in many areas of chip layout
and analysis for IBM. He currently develops design
automation applications for interconnect perfor-
mance optimization.
Haoxing Ren (Member, IEEE) received the B.S.
and M.Eng. degrees in electrical engineering from
Shanghai Jiao Tong University, China, in 1996 and
1999, respectively, the M.S. degree in computer
engineering from Rensselaer Polytechnic Insti-
tute, Troy, NY, in 2000, and the Ph.D. degree in
computer engineering from the University of Texas,
Austin, in 2006.
He worked in the IBM Systems and Technology Group from 2000 to 2006. Currently he is a
Research Staff Member at IBM T.J. Watson Research Center. His research
interests include logic synthesis and physical design of VLSI systems.
C. N. Sze (Member, IEEE) received the B.Eng. and
M.Phil. degrees from the Department of Computer
Science and Engineering, the Chinese University of
Hong Kong, in 1999 and 2001, respectively, and
the Ph.D. degree in computer engineering from the Department of Electrical Engineering, Texas A&M University, College Station, in 2005.
Since then, he has been with the IBM Austin
Research Laboratory, Austin, TX, where he focuses
on integrated placement and timing optimization
for ASIC and microprocessor designs. He was a recipient of the DAC
Graduate Scholarships. His research interests include design and analysis
of algorithms, computer-aided design technique for very large scale
integration, physical design, and performance-driven interconnect
synthesis. He is known by the names of Chin-Ngai Sze and Cliff Sze.
Paul G. Villarrubia received the B.S. degree in
electrical engineering from Louisiana State Uni-
versity, Baton Rouge, in 1981 and the M.S. degree
from the University of Texas, Austin, in 1988.
He is currently a Senior Technical Staff Member
at IBM, Austin, where he leads the development of
placement and timing closure tools. He has
worked at IBM in the areas of physical design of
microprocessors, physical design tools develop-
ment, and tools development for ASIC timing
closure. Interests include placement, synthesis, buffering, signal integrity
and extraction. He has 21 patents and 20 publications.
Mr. Villarrubia has received one DAC Best Paper Award. He was a member of the 2005 ICCAD technical program committee and an invited speaker at the 2002 and 2004 ISPD conferences.
Mehmet C. Yildiz (Member, IEEE) received the
B.S. degree in computer engineering from
Marmara University, Turkey, in 1995, the M.S.
degree in computer science from Yeditepe
University, Turkey, in 1998, and the Ph.D. degree
from Binghamton University, Binghamton, NY,
in 2003.
He worked at the IBM Austin Research Laboratory as a Postdoctoral Researcher for two years. He currently works as an
Advisory Software Engineer in IBM EDA, Austin,
TX. He has more than ten conference and journal papers. His work is
currently focused on clock tree routing.
Dr. Yildiz serves as an Advisory Board Member of SIGDA, responsible
for the Web server.