Circuit partitioning algorithm for low-power design under area constraints using simulated annealing
I.-S. Choi and S.-Y. Hwang
Abstract: A synthesis algorithm is proposed for the design of low-power combinational circuits under area constraints. The algorithm partitions a given circuit into several subcircuits such that only a selected subcircuit is activated at a time, hence reducing unnecessary signal transitions. Partitioning is performed through an adaptive simulated annealing algorithm, employing a cost function modelled for low-power consumption under given area constraints. Experiments have been performed for the MCNC benchmark circuits using the power analysis package provided in the Synopsys Design Analyzer. Results show that the proposed algorithm generates circuits which consume less power than those produced by the area-optimisation package in the Synopsys Design Analyzer and by the precomputation algorithm.
1 Introduction
Demand for portable communication and personal computing applications has increased over recent years. Since the lifetime of portable devices is directly related to the capacity of the batteries, low-power design has become an indispensable factor for extending battery life. Likewise, low-power design is also crucial in high-performance systems. High temperatures in these systems can affect circuit reliability and reduce the lifetime of the system. The cost associated with packaging, cooling and fans to remove heat is increasing significantly. Thus, power consumption has emerged as an important parameter in the design of low-power systems [1].
Several techniques have been proposed to reduce power consumption at various levels of the design hierarchy: system level, architectural level, logic level and device level [1-5]. Most research into power optimisation has focused on reducing switching activity, since the power due to switching activities accounts for over 90% of the total power dissipated by CMOS circuits [3]. Significant progress has been made in the study of high-level low-power synthesis, and a power-conscious synthesis algorithm at behavioural level has been proposed [4, 5]. At the logic level, don’t-care optimisation, path balancing, factorisation, precomputation and circuit-partitioning algorithms have been proposed [6-13].
© IEE, 1999. IEE Proceedings online no. 19990276. DOI: 10.1049/ip-cds:19990276. Paper first received 18th December 1997 and in revised form 22nd July 1998. The authors are with the Department of Electronic Engineering, Sogang University, CPO Box 1142, Seoul, 100-611, Korea.
The don’t-care optimisation method reduces the probability of signal transitions by utilising a set of ODCs (observability don’t-care sets) [6]. The transition probability is reduced by restructuring the circuit topology utilising the don’t-care sets, leading to power reductions. The path balancing technique eliminates power consumed by spurious transitions, which account for 10% to 30% of the dynamic power in combinational logic circuits [7]. To reduce spurious transitions, delays of paths converging at each gate are made roughly equal by selectively adding unit-delay buffers to the inputs of the gates. The addition of buffers does not increase the critical delay of the circuit and eliminates spurious transitions effectively. However, the addition of buffers increases capacitance, which may offset reductions in switching activity power.
Factorisation is a technology-independent multi-level optimisation technique, which reduces the transistor count by factoring a given logic function into multi-level form using a kernel. The modified kernel selection algorithm, utilising a cost function to minimise switching activity, is described elsewhere [8, 9]. Precomputation is a combinational logic optimisation technique, which reduces switching activity by selectively disabling the input of a combinational logic circuit [10]. If the output value is precomputed one clock cycle ahead, the combinational circuit can be turned off in the following clock cycle, reducing the overall switching activity. Alidina et al. proposed an algorithm that generates precomputation logic using universal quantification and ODCs [10]. However, finding an appropriate ODC for the precomputation logic is not always possible, and area and delay penalties for the precomputation logic may also be expected.
The circuit-partitioning algorithm transforms a circuit into multiple subcircuits, and only one of the subcircuits is computed; the others are disabled by setting the load-enable signals of their input registers. We have proposed two partitioning algorithms [11, 12]: the Shannon expansion-based scheme and the kernel-based scheme. In the Shannon expansion-based scheme, an input variable of a logic function is selected and the function is transformed into two co-factor subcircuits by applying Shannon expansion [11, 13]. Power reductions are achieved by disabling whichever of the two co-factor subcircuits does not contribute to the output evaluation. The kernel-based scheme selects a kernel of a multiple output function as the precomputation logic, and performs algebraic division using the selected kernel and its complement, thereby generating co-factor subcircuits [12]. These two circuit-partitioning algorithms can be applied to all logic functions and do not require the inputs to be in ODC.
IEE Proc.-Circuits Devices Syst., Vol. 146, No. 1, February 1999
In this paper, we propose a synthesis algorithm for the design of low-power combinational circuits under given area constraints. Adaptive simulated annealing is employed to search for a globally optimal solution, and a binary tree is used to represent a partitioning solution.
2 Background
2.1 Power dissipation model of CMOS circuits
The average power consumption of a CMOS gate is given by

$$P_{average} = P_{switching} + P_{short\text{-}circuit} + P_{leakage} = \frac{1}{2T_c} C V_{dd}^2 N + \frac{1}{T_c} Q_{sc} V_{dd} N + I_{leak} V_{dd} \qquad (1)$$

where $T_c$ is the clock period, $C$ is the node capacitance, $V_{dd}$ is the supply voltage, and $N$ is the average number of switchings per clock cycle at the gate output. $P_{switching}$ is the switching activity power required to charge and discharge circuit nodes [14]. $P_{short\text{-}circuit}$ is the power dissipation due to the current flowing from the supply to ground during transitions at the input. This current is often called the short-circuit current. $Q_{sc}$ denotes the quantity of charge carried by the short-circuit current during a transition at the output. $P_{leakage}$ is the static power drawn by the leakage current $I_{leak}$.

Signal probability $p_s(x)$ of a node $x$ in a logic gate is defined as the duty cycle of the signal or the probability of the signal being at logic ‘1’ in a vector event. Transition probability $p_t(x)$ of a node $x$ is the probability of the signal switching from one state to another [14]. To obtain the internal transition probabilities, one must consider whether the values of the same signal in two consecutive clock cycles are independent. If they are assumed to be independent, the transition probability can be obtained from the signal probability:

$$p_t(x) = 2 p_s(x) p_s(\bar{x}) = 2 p_s(x) (1 - p_s(x)) \qquad (2)$$

The average power dissipation in a CMOS circuit is proportional to the switching activity $p_t(x_i)$ of node $x_i$ in a logic network. The most significant part of power consumption occurs only during transitions at the gate output when the output capacitance is charged and discharged. Switching power consumed by the CMOS gate can be modelled as

$$P_{switching} = \frac{1}{2T_c} V_{dd}^2 \sum_{i=1}^{n} C_i \, p_t(x_i) \qquad (3)$$

where $C_i$ is the load capacitance of the $i$th node, $V_{dd}$ is the supply voltage, $T_c$ is the clock period and $p_t(x_i)$ is the signal transition probability at node $x_i$. Since the power dissipation due to switching activities accounts for over 90% of the average power consumption in CMOS circuits, only the power drawn by switching transitions is taken into consideration in modelling CMOS power dissipation.

2.2 Precomputation-based design algorithm
Precomputation is a combinational logic optimisation technique, which reduces switching activities by selectively disabling the input latches of a combinational logic circuit. This technique uses universal quantification and ODCs to determine the precomputation logic [10]. Fig. 1a shows a general combinational circuit, in which combinational logic block C has input latch L1. Fig. 1b is a structure based on the precomputation scheme, with a predictor circuit consisting of functions $g_1$ and $g_2$. If $g_1$ or $g_2$ evaluates to a 1 during clock cycle $t$, the load-enable (LE) signal to latch L2 is turned off. This implies that the output does not change during clock cycle $t + 1$; hence no switching power is dissipated. A heuristic algorithm that finds the precomputation function for efficient power reductions has been proposed previously [10].

Fig. 1 Circuit structures for low-power consumption
a Combinational circuit
b Structure based on precomputation scheme
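Returning to the switching-power model of Section 2.1, eqns. 2 and 3 can be evaluated numerically with a few lines of Python. The following is only a minimal sketch: the function names and the capacitance values in the usage note are illustrative assumptions, not taken from the paper, and the independence assumption behind eqn. 2 is retained.

```python
def transition_probability(ps):
    """Eqn. 2: p_t = 2 * p_s * (1 - p_s), assuming that the signal values
    in two consecutive clock cycles are independent."""
    if not 0.0 <= ps <= 1.0:
        raise ValueError("signal probability must lie in [0, 1]")
    return 2.0 * ps * (1.0 - ps)


def switching_power(load_caps, signal_probs, vdd, tc):
    """Eqn. 3: P_switching = Vdd^2 / (2 * Tc) * sum_i C_i * p_t(x_i).

    load_caps    -- load capacitance C_i of each node, in farads
    signal_probs -- signal probability p_s(x_i) of each node
    vdd          -- supply voltage, in volts
    tc           -- clock period, in seconds
    """
    if len(load_caps) != len(signal_probs):
        raise ValueError("one signal probability is needed per node")
    weighted = sum(c * transition_probability(p)
                   for c, p in zip(load_caps, signal_probs))
    return vdd ** 2 * weighted / (2.0 * tc)
```

For example, with a 5 V supply, a 20 ns clock period (50 MHz) and two hypothetical nodes of 10 fF and 20 fF with signal probabilities 0.5 and 0.25, `switching_power([10e-15, 20e-15], [0.5, 0.25], 5.0, 20e-9)` evaluates eqn. 3 to about 7.8 µW.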
2.3 Partitioning algorithm based on Shannon expansion
A function $f(i_1, i_2, \ldots, i_n)$ can be Shannon expanded with respect to an input variable $i_k$ as in eqn. 4, where $f_{i_k}$ and $f_{\bar{i}_k}$ are the cofactors of $f$ with respect to $i_k$ and $\bar{i}_k$, respectively:

$$f = i_k \cdot f_{i_k} + \bar{i}_k \cdot f_{\bar{i}_k} \qquad (4)$$

A combinational circuit can be transformed into a Shannon expansion form using an output multiplexor, as shown in Fig. 2a [11]. Depending on the value of $i_1$, only one co-factor subcircuit is computed while the other is disabled, hence keeping the last values of its input latch unchanged. If $i_1$ evaluates to 1, circuit $f_{i_1}$ is computed while $f_{\bar{i}_1}$ is disabled; when $i_1$ is at 0, $f_{\bar{i}_1}$ is computed and $f_{i_1}$ is disabled. To maximally reduce the power dissipation of the circuit transformed by Shannon expansion, the control variable must be properly selected from the input variables for a given logic function. An optimally configured circuit considers all input variables as candidates in the prediction process and selects the one that leads to the least power consumption. The precomputation scheme is only applicable to the type of logic functions that have ODCs. On the other hand, the Shannon expansion-based scheme is applicable to all types of logic, and incurs small area and delay overheads. However, its limitations are similar to those of the precomputation scheme, where duplicated inputs increase the area by the number of latches at the input.
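The Shannon expansion of eqn. 4 can be checked exhaustively on a small function. The sketch below is illustrative only: it models logic functions as Python callables over 0/1 arguments, and the helper names and the 3-input majority example are assumptions rather than part of the paper.

```python
from itertools import product


def cofactor(f, index, value):
    """Return the cofactor of f with input `index` fixed to `value`."""
    def g(*bits):
        fixed = list(bits)
        fixed[index] = value  # the selected input is ignored and forced
        return f(*fixed)
    return g


def shannon_expand(f, index):
    """Build i*f_i + i'*f_i' as a multiplexor between the two cofactors.

    Only one cofactor is evaluated per call, mirroring the way only one
    co-factor subcircuit is active at a time."""
    f_pos = cofactor(f, index, 1)
    f_neg = cofactor(f, index, 0)

    def expanded(*bits):
        return f_pos(*bits) if bits[index] else f_neg(*bits)

    return expanded, f_pos, f_neg


def equivalent(f, g, n):
    """Compare two n-input functions over all 2**n input patterns."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=n))


def majority(a, b, c):
    """Hypothetical example function: 3-input majority."""
    return int(a + b + c >= 2)
```

Expanding `majority` with respect to its first input gives the cofactors `b OR c` and `b AND c`, and `equivalent(majority, shannon_expand(majority, 0)[0], 3)` confirms that the multiplexed form computes the same function.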
2.4 Kernel-based partitioning algorithm
A kernel-based partitioning algorithm reduces power consumption by partitioning a circuit by selecting an appropriate kernel [12]. A kernel is a common sub-expression of a function, and covers both single and multiple output functions. The algorithm selects a kernel as a common logic part and implements it as the selection logic. Fig. 2b shows the kernel-based circuit structure. The kernel-based scheme consists of three subcircuits: a selection logic synthesised for kernel $k$ and two subcircuits generated from the original circuit divided by kernel $k$ and its complement $\bar{k}$.

Fig. 2 Circuit structures for low-power consumption
a Based on Shannon expansion
b Based on kernel selection

Only one subcircuit is activated, while the other is disabled by properly setting the load-enable signal of its input latch. The output of the selection logic is connected to the select line of a multiplexor, which chooses the correct output. The advantage of this circuit structure is that it can be applied to most kinds of logic function, whereas the precomputation-based structure requires an ODC. However, the selection logic and duplicated logic parts in each co-factor subcircuit may incur area overheads.
3 Proposed low-power synthesis algorithm
In combinatorial optimisation problems, both practical and theoretical, the objective is to choose the best solution in terms of the cost function from many possible solutions. A solution for such problems is an arrangement of a set of discrete objects according to a given set of constraints, and the solution space is a set of all possible solutions of the problem. NP-complete problems require computing effort that increases exponentially with the problem size. Efficient approximation algorithms that do not produce a global optimum, but rather a local optimum, have been proposed to improve the execution time.
Simulated annealing is a general-purpose optimisation technique for combinatorial optimisation problems [15]. The algorithm randomly generates a new state or configuration, and the new state is accepted or rejected according to an acceptance rule governed by the parameter analogous to temperature in the physical annealing process. Simulated annealing is a very attractive optimisation algorithm because it produces high-quality solutions and is, in general, easy to implement. For the best optimisation result, a careful design of the basic ingredients is required: formulating the problem so as to obtain an optimal partitioning solution; defining the neighbouring solutions of each solution; choosing a suitable cost function; and defining an annealing schedule. A set of moves is used to search the solution space, and a neighbouring solution is obtained from a solution via one of the moves. As the execution time of the simulated annealing algorithm largely depends on the annealing schedule, an efficient cooling schedule is needed. Many temperature scheduling algorithms have been proposed [16], such as fixed schedule, logarithmic schedule, Boltzmann annealing, simulated quenching, fast annealing, very fast simulated reannealing and adaptive simulated annealing.
In our proposed algorithm, adaptive simulated annealing is used in partitioning the logic circuit for low-power design [16]. The partitioning algorithm generates circuits consuming less power, but it tends to generate circuits with a huge area overhead. In the proposed system, the user can specify the upper limits of area increase such that the circuits consuming less power can be generated without violating constraints.
Fig. 3 Benchmark circuit ‘b1’
3.1 Solution space
A partitioning solution can be obtained by bipartitioning a circuit recursively. Fig. 3 shows an MCNC benchmark circuit ‘b1’ optimised by the Synopsys Design Analyzer. Fig. 4 shows the solution space as a form of search tree for the circuit ‘b1’. Each internal node in the search tree represents a partitioning solution consisting of a set of subcircuits. The edge represents the neighbouring relation between two connected nodes, whose label $(f, v)$ represents a move operation that bipartitions a subcircuit $f$ using $v$ as a selection variable. The number of nodes at level $k$ is proportional to ${}_nC_k$, where $n$ is the number of inputs in the given circuit. The number of possible solutions is proportional to $\sum_{k=1}^{n} {}_nC_k$, since the depth of the tree is equal to $n$.

To manage partitioned subcircuits systematically, a binary partition tree is used to represent a partitioning solution. An internal node of the partition tree has two children, representing the co-factors with respect to a selected variable. Each set of leaf nodes represents a partitioned subcircuit. Fig. 5 shows a partition tree for the shaded node of Fig. 4. The leaf nodes in the boxes correspond to the three subcircuits co-factored with respect to the cubes $bc$, $b\bar{c}$ and $\bar{b}$. Fig. 6 shows the resultant circuit representing a partition solution in Fig. 5.

Fig. 4 Search tree for circuit ‘b1’

Fig. 5 Partition tree representing the shaded node in Fig. 4

Fig. 6 Resultant circuit corresponding to partitioning solution of Fig. 5

3.2 Neighbouring solutions
Neighbouring solutions can be obtained by modifying the existing solution through a set of moves. In the proposed algorithm, two types of moves are used to locally modify a partitioning solution for the randomly selected leaf nodes: partition and merge. There are two possible ways of selecting a leaf node in the partition tree: root-to-leaf path enumeration and pre-labelled leaf selection. In the path enumeration method, a leaf node is selected by enumerating the path from the root node to the leaf node, by using a sequence of binary numbers generated randomly. In the partition tree of Fig. 5, the leaf nodes corresponding to the circuits divided by the variable sets $bc$ and $b\bar{c}$ are selected by random number sequences ‘11’ and ‘10’, respectively. Each leaf node has a selection probability of $1/2^{depth}$. Since deeper leaf nodes have lower selection probabilities, the partition tree tends to be balanced. The advantage of this algorithm is that the power consumption of the resultant circuit is less input pattern dependent, because the differences in power consumption of each subcircuit tend to be small. However, since a leaf node can be selected by more than one sequence of random numbers, the search time increases due to unnecessary retrials.
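The path enumeration method can be illustrated with a short Python sketch. It assumes a minimal `Node` class (hypothetical, not from the paper) in which a random bit of 1 follows the `one` child and a bit of 0 follows the `zero` child, and it rebuilds the three-leaf partition tree described for Fig. 5.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    one: Optional["Node"] = None   # child followed on random bit 1
    zero: Optional["Node"] = None  # child followed on random bit 0

    @property
    def is_leaf(self):
        return self.one is None and self.zero is None


def select_leaf(root, rng):
    """Walk from the root to a leaf, drawing one random bit per level."""
    node, bits = root, ""
    while not node.is_leaf:
        bit = rng.randint(0, 1)
        bits += str(bit)
        node = node.one if bit else node.zero
    return node, bits


def leaf_probabilities(node, depth=0):
    """Exact selection probability of each leaf: 1 / 2**depth."""
    if node.is_leaf:
        return {node.label: 0.5 ** depth}
    probs = leaf_probabilities(node.one, depth + 1)
    probs.update(leaf_probabilities(node.zero, depth + 1))
    return probs


# Partition tree with leaves bc, bc' and b' (a prime denotes a complement).
fig5_tree = Node("root",
                 one=Node("b", one=Node("bc"), zero=Node("bc'")),
                 zero=Node("b'"))
```

Here `leaf_probabilities(fig5_tree)` gives 1/4 for each of the two deeper leaves and 1/2 for the shallow leaf, which is the bias towards shallow leaves that tends to keep the partition tree balanced.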
In the pre-labelled leaf selection method, each leaf has a unique integer index and is selected by using a random number. This method has the advantage of having fewer repetitions in the leaf selection process than the previous method, but has the disadvantage of generating partitioned blocks with uneven power consumption. To traverse a large partitioning solution space efficiently, the time complexity of the moves must be small. To prevent a leaf node in the partition tree from being visited repeatedly, each internal node keeps a set of variables that records past bipartitioning trials, and this variable set is bypassed at future bipartitioning attempts at the node. Memory overheads can be reduced by bitwise packing of the variable set. Fig. 7 shows the proposed move procedure.
move (T)
/* T : Partition tree whose leaf nodes represent partitioning results. */
{
    /* Select a subcircuit randomly. */
    leaf = get_a_random_leaf(T);
    /* Select a Partition or Merge operation. */
    mode = select_a_move_randomly;
    if (mode == Partition) {
        /* Select an input variable which has not been tried yet. */
        variable = select_random_input(leaf);
        Partition_circuit(leaf, variable);
    /* if mode == Merge */
    } else {
        /* The root node has no sibling to merge with. */
        if (leaf == root_node)
            return FAIL;
        Merge_circuits(leaf->parent->left, leaf->parent->right);
    }
    return SUCCESS;
}

Fig. 7 Move procedure in proposed algorithm
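The move procedure of Fig. 7 can be prototyped on a bare partition tree. The sketch below is not the authors' implementation: the `PNode` class, the uniform leaf choice (in the spirit of pre-labelled leaf selection), the optional `mode` argument and the decision to collapse the whole parent node on a merge are all assumptions made for illustration. Circuit synthesis itself is omitted; each node only records the cube that defines its subcircuit and the variables already tried.

```python
import random


class PNode:
    """Node of a binary partition tree; a node without children is a leaf."""

    def __init__(self, cube=(), parent=None):
        self.cube = cube      # literals fixed so far, e.g. (("b", 1), ("c", 0))
        self.parent = parent
        self.left = None      # co-factor with the selected variable at 1
        self.right = None     # co-factor with the selected variable at 0
        self.tried = set()    # record of past bipartitioning trials

    def leaves(self):
        if self.left is None:
            return [self]
        return self.left.leaves() + self.right.leaves()


def move(root, inputs, rng, mode=None):
    """Apply one random Partition or Merge move; return True on success."""
    leaf = rng.choice(root.leaves())
    if mode is None:
        mode = rng.choice(("partition", "merge"))
    if mode == "partition":
        fixed = {var for var, _ in leaf.cube}
        untried = [v for v in inputs if v not in fixed and v not in leaf.tried]
        if not untried:
            return False      # every variable has already been tried here
        var = rng.choice(untried)
        leaf.tried.add(var)
        leaf.left = PNode(leaf.cube + ((var, 1),), leaf)
        leaf.right = PNode(leaf.cube + ((var, 0),), leaf)
    else:
        if leaf is root:
            return False      # the root has no sibling to merge with
        # Assumption: merging collapses the parent back into a single leaf.
        leaf.parent.left = None
        leaf.parent.right = None
    return True
```

Whatever sequence of moves is applied, the cubes of the leaves keep partitioning the input space, so exactly one subcircuit corresponds to any input pattern.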
The proposed algorithm searches for a partitioning solution with minimum cost while satisfying area constraints. The cost function is the sum of power consumption of all the subcircuits in a partitioning solution. If the cost of a neighbouring solution decreases, the generated partitioning solution is accepted. If the cost increases, the move is accepted with a probability varying according to the temperature. The simulated annealing algorithm terminates when the quality of a solution does not improve for a constant number of moves. The adaptive simulated annealing algorithm employed in the proposed algorithm is efficient at exploring a wide multi-dimensional solution space [16].
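The acceptance rule and the termination criterion described above can be sketched as a generic annealing loop. Several points are assumptions rather than the paper's method: the Metropolis rule `exp(-delta / T)` stands in for the unspecified acceptance probability, a simple geometric cooling schedule replaces the adaptive simulated annealing schedule of [16], and the `neighbour`, `power` and `area` callables as well as all parameter defaults are hypothetical.

```python
import math
import random


def accept(delta, temperature, rng):
    """Always accept a cost decrease; accept an increase with a
    temperature-dependent probability (Metropolis rule, an assumption)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def anneal(initial, neighbour, power, area, area_limit, rng,
           t0=1.0, alpha=0.95, patience=200):
    """Search for a low-power solution whose area stays within area_limit.

    `initial` is assumed to satisfy the area constraint. The loop stops
    when the best solution has not improved for `patience` moves."""
    current = best = initial
    temperature = t0
    stale = 0
    while stale < patience:
        candidate = neighbour(current, rng)
        if area(candidate) <= area_limit and accept(
                power(candidate) - power(current), temperature, rng):
            current = candidate
        if power(current) < power(best):
            best, stale = current, 0
        else:
            stale += 1
        # Geometric cooling, used here in place of the ASA schedule.
        temperature = max(temperature * alpha, 1e-9)
    return best
```

In this sketch, candidates that violate the area limit are simply rejected, so every accepted solution satisfies the constraint; a penalty term in the cost function would be an alternative design.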
The circuit structure obtained by the multiple partitioning algorithm is shown in Fig. 8, where $I = \{i_1, i_2, \ldots, i_n\}$ is a set of primary input variables and $I_s$ is the subset of the primary inputs $I$ used by the selection logic. The selection logic is constructed in the form of a decoder. The function of the selection logic is to activate a subcircuit by enabling its input latch according to an input pattern. Depending on the value of $I_s$, only one of the subcircuits is activated, while the rest are disabled by properly setting the load-enable signal of their input latches. The output of the selection logic is connected to the select line of a multiplexor. The multiplexor at each output line can be replaced by a single transmission gate in a wired-OR form. As it exhibits a small variation in power dissipation in each subcircuit, the resultant circuit by the proposed algorithm is less input pattern-dependent.
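The behaviour of the decoder-style selection logic can be modelled in a few lines. This is only a behavioural sketch under assumed names: each subcircuit is identified by the cube (a mapping from selection variables to required values) that defines it, and the three cubes in the usage note are those of the partition tree of Fig. 5.

```python
def cube_matches(cube, pattern):
    """True if the input pattern agrees with every literal of the cube."""
    return all(pattern[var] == val for var, val in cube.items())


def load_enables(cubes, pattern):
    """Return one load-enable bit per subcircuit for an input pattern.

    Exactly one bit is 1 when the cubes partition the input space, so a
    single subcircuit is activated and the others keep their latched inputs."""
    enables = [int(cube_matches(cube, pattern)) for cube in cubes]
    if sum(enables) != 1:
        raise ValueError("cubes must partition the input space")
    return enables
```

With `cubes = [{"b": 1, "c": 1}, {"b": 1, "c": 0}, {"b": 0}]`, the pattern `{"a": 0, "b": 1, "c": 0}` enables only the second subcircuit. The same one-hot vector can serve as the select signal of the output multiplexor.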
Fig. 8 Circuit structure obtained by proposed algorithm
4 Experimental results
The proposed algorithm has been evaluated for performance using the MCNC benchmark circuits. In the experiments, we started with an initial circuit that had been first optimised by the Synopsys Design Analyzer. The original MCNC benchmark circuits and the circuits optimised by the Synopsys Design Analyzer are summarised in Table 1. Technology mapping has been performed for the circuits generated by the proposed algorithm using the standard cell provided by the Synopsys Design Analyzer. Power estimations are made at a clock frequency of 50 MHz using a 5 V supply voltage, after performing technology mapping using the lsi-10K technology library.
Table 1: Benchmark circuits used for performance evaluation of proposed algorithm

            Benchmark circuits       Circuits optimised by Synopsys Design Analyzer
Circuit     #ins   #outs   #lits     Area    Delay (ns)   Power (µW)
alu4          14       8     664     2532        17          2218
b1             3       4      23       51         2            59
c8            28      18     286      520         8           582
cc            21      20     136      293         7           305
cm150a        21       1     119      291         2           320
cm162a        14       5      86      220         5           243
cm42a          4      10      43       68         4            90
cmb           16       4      94      208         3           204
cordic        23       2     240      385         6           441
cu            14      11     118      226         4           238
decod         35      16     244      258         2           255
duke2         22      29     982      952         9           700
e64           65      65    2274      965         9           853
f51m           8       8     185      336        22           309
k2            45      45    1027     3365        17          2281
majority       5       1      21       69         4            78
misex2        25      18     214      425         7           394
mux           21       1     134      425         6           384
pcler8        27      17     149      449         9           416
sao2          10       4     156      388        14           442
sct           19      15     202      319        34           340
tcon          17      16      74      219         2           251
vda           17      39     600     1815        13          1379
x2            10       7      91      190         4           199
z4ml           7       4      91      125         6           159
i9            88      63     592     2146        19          2489
dalu          75      16    3067     1220        20          1296
C17            5       2      12       61         2            69
C1908         33      25    1497     1221        29          1425
C7552        207     108    2317     1548        38          5653
C880          60      26     415      852        23          3126
Average       31      20     521      714        11           877

Table 2 shows the results by the precomputation scheme and the proposed scheme for the MCNC benchmark circuits, where power dissipations and areas are compared to those of the circuits optimised by the Synopsys Design Analyzer. The power reduction by the proposed scheme without area constraints is 42.7% and 37.1% when compared to the Synopsys-optimised circuits and the precomputation-based circuits, respectively. The delay is increased by 19.8% when compared to the circuits optimised by the Synopsys Design Analyzer. The area is increased by 59.7% and 49.0% when compared to the Synopsys-optimised circuits and the precomputation-based circuits, respectively. These area increases are incurred by logic duplications in the partitioned co-factor circuits, the input latches and the selection logic.
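The Δ % columns of the result tables express each figure relative to the corresponding Synopsys-optimised circuit. As a worked check, the short sketch below recomputes two entries for ‘alu4’ from the values reported in Tables 1 and 2; the function name and the rounding to one decimal place are assumptions.

```python
def percent_change(reference, value):
    """Relative change of `value` with respect to `reference`, in percent,
    rounded to one decimal place as in the tables."""
    return round(100.0 * (value - reference) / reference, 1)
```

For ‘alu4’, `percent_change(2532, 2381)` reproduces the area change of -6.0% and `percent_change(2218, 1193)` reproduces the power change of -46.2%. Delay percentages cannot be reproduced in the same way, since the tabulated delays appear to be rounded to whole nanoseconds.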
Table 3 shows that the power reduction for the benchmark circuits without area constraints is 42.7%, and that the average power reduction without allowing any area increase is 8.7%. Table 4 shows power reductions of 25.7% and 31.0% with constraints of 120% and 140% of the area of the original circuits, respectively.
Table 2: Power consumption of circuits produced by precomputation-based scheme and proposed scheme

            Precomputation-based scheme                        Proposed scheme
            Area          Delay         Power                  Area           Delay         Power
Circuit     Area    Δ %   (ns)    Δ %   (µW)    Δ %            Area    Δ %    (ns)    Δ %   (µW)    Δ %
alu4        1824  -28.0    97   481.2   1616  -27.2            2381   -6.0     18     7.7   1193  -46.2
b1            51    0.0     2     0.0     59    0.0              62   21.6      3    60.1     25  -58.0
c8           894   71.9    19   141.9    579   -0.6             722   38.8     10    23.8    407  -30.4
cc           293    0.0     7     0.0    305    0.0            1217  315.4      9    13.8    210  -31.1
cm150a       291    0.0     2     0.0    320    0.0             292    0.3      4   100.0    157  -51.0
cm162a       144  -34.5    31   302.2    236   -2.8             235    6.8      7    30.9     95  -61.0
cm42a         68    0.0     4     0.0     90    0.0              91   33.8      2   -43.3     52  -42.2
cmb          208    0.0     3     0.0    204    0.0             291   39.9      5    57.3     93  -54.3
cordic       385    0.0     6     0.0    441    0.0             445   15.6      7    18.3    254  -42.3
cu           213   -5.8    22   448.4    237   -0.3             256   13.3      6    42.6    130  -45.2
decod        258    0.0     2     0.0    255    0.0              60  -76.7      4    65.7     71  -72.1
duke2        928   -2.5    39   325.4    550  -21.4            1042    9.5     11    20.1    342  -51.1
e64          554  -42.6    68   635.0    671  -21.4            1527   58.2     11    20.1    515  -39.6
f51m         336    0.0    22     0.0    309    0.0             332   -1.2     20    -8.9    194  -37.4
k2          1880  -44.1    63   278.7   1268  -44.4            3249   -3.4     14   -15.5   1233  -46.0
majority      69    0.0     4     0.0     78    0.0              74    7.2      4    -2.3     45  -42.6
misex2       425    0.0     7     0.0    394    0.0             484   13.9      6   -14.6    229  -42.0
mux          425    0.0     6     0.0    384    0.0             475   11.8      7    16.1    226  -41.2
pcler8       449    0.0     9     0.0    416    0.0             410   -8.7     11    21.6    158  -62.0
sao2        2164  457.7    54   296.4    384  -13.2             391    0.8     16    14.3    216  -51.2
sct          319    0.0    34     0.0    340    0.0             608   90.6     33    -2.4    205  -39.8
tcon         219    0.0     2     0.0    251    0.0             230    5.0      2   -19.0    118  -52.7
vda         1063  -41.4    46   262.2    790  -42.7            1808   -0.4     14     9.3    679  -50.8
x2           190    0.0     4     0.0    199    0.0             233   22.6      5    24.3    126  -36.7
z4ml         125    0.0     6     0.0    159    0.0             418  234.4      7    20.0    139  -12.9
i9          2146    0.0    19     0.0   2489    0.0           20460  853.4     20     6.8    774  -68.9
dalu        1220    0.0    20     0.0   1296    0.0            1001  -18.0     20     1.2    621  -52.1
C17           61    0.0     2     0.0     69    0.0              72   18.0      4   146.8     40  -42.5
C1908       1221    0.0    29     0.0   1425    0.0            2040   67.1     23   -21.8   1154  -19.0
C7552       1548    0.0    38     0.0   5653    0.0            1548    0.0     42    12.2   5653    0.0
C880         852    0.0    23     0.0   3126    0.0            1596   87.3     25     7.2   3024   -3.3
Average      672   10.7    22   102.3    793   -5.6            1421   59.7     12    19.8    593  -42.7

Experimental results are compared with circuits optimised by Synopsys Design Analyzer
Note that in Table 2 an average reduction of 66.0% in power dissipation has been obtained for the ‘cm162a’, ‘decod’, ‘pcler8’ and ‘i9’ circuits. In these circuits, power reductions are prominent owing to the structure of the logic circuits, in which a unate variable is dominant. When circuits with this structure are bipartitioned, the co-factor subcircuits become smaller and less power-consuming. Circuits with multiple outputs that share inputs tend to have the largest area reductions, but the area increases for partitioned circuits with duplicated input latches, such as ‘cc’, ‘e64’ and ‘i9’. Circuits ‘z4ml’ and ‘C880’ show a relatively slight decrease in power because the selected inputs are less sensitive to switching activities.
The disadvantage of a precomputation-based circuit lies in the increased overheads due to additional circuitry. However, the co-factor circuits generated by the proposed algorithm have delays not exceeding that of the initial circuit. In the proposed scheme, each output line of the selection logic may have a large parasitic capacitance. However, only one of the subcircuits is active at a time, and at most two selection lines are switching at each clock cycle. In the experiment, we used the maximum possible load capacitance value at the output line of the selection logic. Throughout the experiments, the execution times are averages of 50 runs on an UltraSPARC I workstation, because runtime has not been consistently predictable. The proposed algorithm proved to be efficient in the experiments for the MCNC benchmark circuits.
Table 3: Experimental results of circuits obtained by proposed algorithm under area constraints given as those of circuits optimised by Synopsys Design Analyzer

            Circuits generated under constraints of 100%       Circuits generated without area constraints
            of area of Synopsys optimised circuits
            Area          Delay         Power                  Area           Delay         Power
Circuit     Area    Δ %   (ns)    Δ %   (µW)    Δ %            Area    Δ %    (ns)    Δ %   (µW)    Δ %
alu4        2381   -6.0    18     7.7   1193  -46.2            2381   -6.0     18     7.7   1193  -46.2
b1            51    0.0     2     0.0     59    0.0              62   21.6      3    60.1     25  -58.0
c8           520    0.0     8     0.0    582    0.0             722   38.8     10    23.8    407  -30.1
cc           293    0.0     7     0.0    305    0.0            1217  315.4      9    13.8    210  -31.1
cm150a       291    0.0     2     0.0    320    0.0             292    0.3      4   100.0    157  -51.0
cm162a       220    0.0     5     0.0    243    0.0             235    6.8      7    30.9     95  -61.0
cm42a         68    0.0     4     0.0     90    0.0              91   33.8      2   -43.3     52  -42.2
cmb          208    0.0     3     0.0    204    0.0             291   39.9      5    57.3     93  -54.3
cordic       385    0.0     6     0.0    441    0.0             445   15.6      7    18.3    254  -42.3
cu           226    0.0     4     0.0    238    0.0             256   13.3      6    42.6    130  -45.2
decod        189  -26.7     4    65.7     72  -71.9              60  -76.7      4    65.7     71  -72.1
duke2        952    0.0     9     0.0    700    0.0            1042    9.5     11    20.1    342  -51.1
e64          965    0.0     9     0.0    853    0.0            1527   58.2     11    20.1    515  -39.6
f51m         336    0.0    22     0.0    309    0.0             332   -1.2     20    -8.9    194  -37.4
k2          3249   -3.4    14   -15.5   1233  -46.0            3249   -3.4     14   -15.5   1233  -46.0
majority      69    0.0     4     0.0     78    0.0              74    7.2      4    -2.3     45  -42.6
misex2       425    0.0     7     0.0    394    0.0             484   13.9      6   -14.6    229  -42.0
mux          425    0.0     6     0.0    384    0.0             475   11.8      7    16.1    226  -41.2
pcler8       410   -8.7    11    21.6    185  -55.5             410   -8.7     11    21.6    158  -62.0
sao2         388    0.0    14     0.0    442    0.0             391    0.8     16    14.3    216  -51.2
sct          319    0.0    34     0.0    340    0.0             608   90.6     33    -2.4    205  -39.8
tcon         219    0.0     2     0.0    251    0.0             230    5.0      2   -19.0    118  -52.7
vda         1808   -0.4    14     9.3    679  -50.8            1808   -0.4     14     9.3    679  -50.8
x2           190    0.0     4     0.0    199    0.0             233   22.6      5    24.3    126  -36.7
z4ml         125    0.0     6     0.0    159    0.0             418  234.4      7    20.0    139  -12.8
i9          2146    0.0    19     0.0   2489    0.0           20460  853.4     20     6.8    774  -68.9
dalu        1001  -18.0    20     0.0   1296    0.0            1001  -18.0     20     1.2    621  -52.1
C17           61    0.0     2     0.0     69    0.0              72   18.0      4   146.8     40  -42.5
C1908       1221    0.0    29     0.0   1425    0.0            2040   67.1     23   -21.8   1154  -19.0
C7552       1548    0.0    38     0.0   5653    0.0            1548    0.0     42    12.2   5653    0.0
C880         852    0.0    23     0.0   3126    0.0            1596   87.3     25     7.2   3024   -3.3
Average      695   -2.0    11     2.9    775   -8.7            1421   59.7     12    19.8    593  -42.7

5 Conclusion
We have proposed an algorithm based on multiple partitioning for the synthesis of logic circuits under area constraints. The proposed algorithm recursively bipartitions a given circuit such that only a single circuit among the entire pool of partitioned subcircuits can be activated according to the input conditions. Thus, it can successfully reduce unnecessary signal transitions. The algorithm utilises the adaptive simulated annealing technique to search for a probable partitioning solution. To reduce memory usage and execution time, the algorithm is designed to limit unnecessary searches of trial partitionings.
The power consumption by the circuits generated by the algorithm under given area constraints has been greatly reduced. Experimental results show that the proposed algorithm is efficient at designing low-power combinational circuits. Algorithm designs for reducing the search space in the circuits with symmetric inputs, and reducing duplicated areas among the co-factor sub- circuits, are the subject for future research.
6 Acknowledgment
This work was supported by MOIE, MOIC and MOST through ASIC infra-technology projects.
7 References
1 DEVADAS, S., and MALIK, S.: ‘A survey of optimization techniques targeting low power VLSI circuits’. Proceedings of 32nd DAC, June 1995, pp. 242-247
2 CHANDRAKASAN, A., SHENG, T., and BRODERSEN, R.: ‘Low power CMOS digital design’, IEEE J. Solid-State Circuits, 1992, 27, (4), pp. 473-484
3 TSUI, C., PEDRAM, M., and DESPAIN, A.: ‘Exact and approximate methods for switching activity estimation in sequential logic circuits’. Proceedings of 31st DAC, June 1994, pp. 18-23
4 LEE, H.D., LEE, J.S., and HWANG, S.Y.: ‘A novel high level synthesis algorithm for low power ASIC design’, J. Microelectron. Syst. Integration, 1996, 4, (4), pp. 219-232
Table 4: Experimental results of circuits synthesised by proposed algorithm under various values of area constraints

            Area constraint of 120%                            Area constraint of 140%
            Area          Delay         Power                  Area           Delay         Power
Circuit     Area    Δ %   (ns)    Δ %   (µW)    Δ %            Area    Δ %    (ns)    Δ %   (µW)    Δ %
alu4        2381   -6.0    18     7.7   1193  -46.2            2381   -6.0     18     7.7   1193  -46.2
b1            51    0.0     4   115.6     59    0.0              62   21.6      3    60.1     25  -58.0
c8           520    0.0    10    25.5    582    0.0             722   38.8     10    23.8    407  -30.1
cc           293    0.0     9    26.7    305    0.0             405   38.2      9    13.8    216  -29.3
cm150a       292    0.3     2     0.0    157  -51.0             292    0.3      2     0.0    157  -51.0
cm162a       220    0.0     7    39.1    243    0.0             235    6.8      7    30.9     95  -61.0
cm42a        235  245.6     2   -43.3     45  -49.7             235  245.6      2   -43.3     45  -49.7
cmb          208    0.0     5    57.3    204    0.0             208    0.0      5    57.3    204    0.0
cordic       445   15.6     7    18.3    254  -42.3             445   15.6      7    18.3    254  -42.3
cu           256   13.3     6    42.6    130  -45.2             256   13.3      6    42.6    130  -45.2
decod        189  -26.7     4    65.7     72  -71.9             137  -46.9      4    65.7     71  -72.1
duke2        754  -20.8    11    21.6    491  -29.9             754  -20.8     11    21.6    491  -29.9
e64          965    0.0    11    21.6    853    0.0             965    0.0     11    21.6    853    0.0
f51m         336    0.0    24     9.2    416   34.5             332   -1.2     20    -8.9    198  -35.9
k2          3249   -3.4    14   -15.5   1233  -46.0            3249   -3.4     14   -15.5   1233  -46.0
majority      74    7.2     4    -2.3     45  -42.6              74    7.2      4    -2.3     45  -42.6
misex2       200  -52.9     9    26.8    156  -60.3             200  -52.9      9    26.8    156  -60.3
mux          475   11.8     7    16.1    226  -41.2             475   11.8      7    16.1    226  -41.2
pcler8       410   -8.7    11    21.6    185  -55.5             410   -8.7     11    21.6    185  -55.5
sao2         391    0.8    16    14.3    216  -51.2             391    0.8     16    14.3    216  -51.2
sct          148  -53.6    36     5.9    159  -53.2             148  -53.6     36     5.9    159  -53.2
tcon         230    5.0     2   -19.0    118  -52.7             230    5.0      2   -19.0    118  -52.7
vda         1808   -0.4    14     9.3    679  -50.8            1808   -0.4     14     9.3    679  -50.8
x2           190    0.0     6    51.2    199    0.0            1041  447.9      5    24.3    367   84.4
z4ml         125    0.0     8    34.5    159    0.0             125    0.0      8    34.5    159    0.0
i9          2146    0.0    21    10.7   2489    0.0            2146    0.0     21    10.7   2489    0.0
dalu        1001  -18.0    20     1.2   1296    0.0            1001  -18.0     20     1.2   1296    0.0
C17           72   18.0     4   115.2     40  -42.5              72   18.0      4   115.2     40  -42.5
C1908       1221    0.0    31     6.8   1425    0.0            1221    0.0     31     6.8   1425    0.0
C7552       1548    0.0    40     5.3   5653    0.0            1548    0.0     40     5.3   5653    0.0
C880         852    0.0    23     0.0   3126    0.0             852    0.0     23     0.0   3126    0.0
Average      687    4.1    12    22.3    723  -25.7             723   21.3     12    18.3    707  -31.0

Experimental results are compared with circuits optimised by Synopsys Design Analyzer
5 LEE, J.S., LEE, H.D., PARK, C.W., and HWANG, S.Y.: ‘A power-conscious scheduling algorithm for performance-driven datapath synthesis’, Electron. Lett., 1996, 32, (17), pp. 1574-1576
6 SHEN, A., DEVADAS, S., GHOSH, A., and KEUTZER, K.: ‘On average power dissipation and random pattern testability of combinational logic circuits’. Proceedings of ICCAD, November 1992, pp. 402-407
7 LEMONDS, C., and SHETTI, S.: ‘A low power 16 by 16 multiplier using transition reduction circuitry’. Proceedings of international workshop on Low power design, April 1994, pp. 139-142
8 IMAN, S., and PEDRAM, M.: ‘Multi-level network optimization for low power’. Proceedings of ICCAD, November 1994, pp. 371-377
9 IMAN, S., and PEDRAM, M.: ‘Logic extraction and factorization for low power’. Proceedings of 32nd DAC, June 1995, pp. 248-253
10 ALIDINA, M., MONTEIRO, J., DEVADAS, S., and GHOSH, A.: ‘Precomputation-based logic optimization for low power’. Proceedings of ICCAD, November 1994, pp. 74-81
11 KIM, H., CHOI, I.S., and HWANG, S.Y.: ‘Design of heuristic algorithms based on Shannon expansion for the synthesis of logic circuits with low power’, IEE Proc. Circuits Devices Syst., 1997, 144, (6), pp. 355-360
12 CHOI, I.S., KIM, H., SEO, D.W., and HWANG, S.Y.: ‘Kernel-based precomputation scheme for the design of low power combinational circuits’, Electron. Lett., 1996, 32, (14), pp. 1281-1283
13 KIM, H., and HWANG, S.Y.: ‘A heuristic algorithm for low power design of combinational circuits’, Electron. Lett., 1996, 32, (12), pp. 1066-1067
14 NAJM, F.: ‘Power estimation techniques for integrated circuits’. Proceedings of ICCAD, November 1995, pp. 492-499
15 KIRKPATRICK, S., GELATT, C., and VECCHI, M.: ‘Optimization by simulated annealing’, Science, 1983, 220, pp. 671-680
16 INGBER, L.: ‘Adaptive simulated annealing (ASA): lessons learned’, Control Cybern., 1996, 25, pp. 33-54
17 SENTOVICH, E., SAVOJ, H., BRAYTON, R., and SANGIOVANNI-VINCENTELLI, A.: ‘SIS, a system for sequential circuit synthesis’. Memorandum UCB/ERL M92/41, Electronic Research Laboratory, University of California, Berkeley, May 1992