Circuit partitioning algorithm for low-power design under area constraints using simulated annealing

I.-S. Choi and S.-Y. Hwang

Abstract: A synthesis algorithm is proposed for the design of low-power combinational circuits under area constraints. The algorithm partitions a given circuit into several subcircuits such that only one selected subcircuit is activated at a time, thereby reducing unnecessary signal transitions. Partitioning is performed by an adaptive simulated annealing algorithm that employs a cost function modelling power consumption under given area constraints. Experiments have been performed on the MCNC benchmark circuits using the power analysis package provided in the Synopsys Design Analyzer. Results show that the proposed algorithm generates circuits that consume less power than those produced by the area-optimisation package in the Synopsys Design Analyzer and by the precomputation algorithm.

1 Introduction

Demand for portable communication and personal computing applications has increased over recent years. Since the lifetime of portable devices is directly related to the capacity of the batteries, low-power design has become an indispensable factor for extending battery life. Likewise, low-power design is also crucial in high-performance systems. High temperatures in these systems can affect circuit reliability and reduce the lifetime of the system. The cost associated with packaging, cooling and fans to remove heat is increasing significantly. Thus, power consumption has emerged as an important parameter in the design of low-power systems [1].

Several techniques have been proposed to reduce power consumption at various levels of the design hierarchy: system level, architectural level, logic level and device level [1-5]. Most research into power optimisation has focused on reducing switching activity, since the power due to switching activities accounts for over 90% of the total power dissipated by CMOS circuits [3]. Significant progress has been made in the study of high-level low-power synthesis, and a power-conscious synthesis algorithm at behavioural level has been proposed [4, 5]. At the logic level, don’t-care optimisation, path balancing, factorisation, precomputation and circuit-partitioning algorithms have been proposed [6-13].

© IEE, 1999. IEE Proceedings online no. 19990276. DOI: 10.1049/ip-cds:19990276. Paper first received 18th December 1997 and in revised form 22nd July 1998. The authors are with the Department of Electronic Engineering, Sogang University, CPO Box 1142, Seoul, 100-611, Korea.

The don’t-care optimisation method reduces the probability of signal transitions by utilising a set of ODCs (observability don’t-care sets) [6]. The transition probability is reduced by restructuring the circuit topology utilising the don’t-care sets, leading to power reductions. The path balancing technique eliminates power consumed by spurious transitions, which account for 10% to 30% of the dynamic power in combinational logic circuits [7]. To reduce spurious transitions, delays of paths converging at each gate are made roughly equal by selectively adding unit-delay buffers to the inputs of the gates. The addition of buffers does not increase the critical delay of the circuit and eliminates spurious transitions effectively. However, the addition of buffers increases capacitance, which may offset reductions in switching activity power.

Factorisation is a technology-independent multi-level optimisation technique, which reduces the transistor count by factoring a given logic function into multi-level form using a kernel. The modified kernel selection algorithm, utilising a cost function to minimise switching activity, is described elsewhere [8, 9]. Precomputation is a combinational logic optimisation technique, which reduces switching activity by selectively disabling the input of a combinational logic circuit [10]. If the output value is precomputed one clock cycle ahead, the combinational circuit can be turned off in the following clock cycle, reducing the overall switching activity. Alidina et al. proposed an algorithm that generates precomputation logic using universal quantification and ODCs [10]. However, finding an appropriate ODC for the precomputation logic is not always possible, and area and delay penalties for the precomputation logic may also be expected.

The circuit-partitioning algorithm transforms a circuit into multiple subcircuits, of which only one is computed; the others are disabled through the load-enable signals of their input registers. We have proposed two partitioning algorithms [11, 12]: the Shannon expansion-based scheme and the kernel-based scheme. In the Shannon expansion-based scheme, an input variable of a logic function is selected and the function is transformed into two co-factor subcircuits by applying Shannon expansion [11, 13]. Power reductions are achieved by disabling whichever of the two co-factor subcircuits does not contribute to the output evaluation. The kernel-based scheme selects a kernel of a multiple-output function as the precomputation logic, and performs algebraic division using the selected kernel and its complement, thereby generating co-factor subcircuits [12].

IEE Proc.-Circuits Devices Syst., Vol. 146, No. 1, February 1999

These two circuit-partitioning algorithms can be applied to all logic functions and do not require the inputs to be in ODC.

In this paper, we propose a synthesis algorithm for the design of low-power combinational circuits under given area constraints. Adaptive simulated annealing is employed to search for a globally optimal solution, and a binary tree is used to represent a partitioning solution.

2 Background

2.1 Power dissipation model of CMOS circuits

The average power consumption of a CMOS gate is given by

$$P_{average} = P_{switching} + P_{short\text{-}circuit} + P_{leakage} \qquad (1)$$

where $T_c$ is the clock period, $C$ is the node capacitance, $V_{dd}$ is the supply voltage, and $N$ is the average number of switchings per clock cycle at the gate output. $P_{switching}$ is the switching activity power required to charge and discharge circuit nodes [14]. $P_{short\text{-}circuit}$ is the power dissipation due to the current flowing from the supply to ground during transitions at the input. This current is often called the short-circuit current. $Q_{sc}$ denotes the quantity of charge carried by the short-circuit current during a transition at the output. $P_{leakage}$ is the static power drawn by the leakage current $I_{leak}$.

Signal probability $p_s(x)$ of a node $x$ in a logic gate is defined as the duty cycle of the signal or the probability of the signal being at logic ‘1’ in a vector event. Transition probability $p_t(x)$ of a node $x$ is the probability of the signal switching from one state to another [14]. To obtain the internal transition probabilities, one must consider whether the values of the same signal in two consecutive clock cycles are independent. If they are assumed to be independent, the transition probability can be obtained from the signal probability:

$$p_t(x) = 2 p_s(x) p_s(\bar{x}) = 2 p_s(x) (1 - p_s(x)) \qquad (2)$$

The average power dissipation in a CMOS circuit is proportional to the switching activity $p_t(x_i)$ of node $x_i$ in a logic network. The most significant part of power consumption occurs only during transitions at the gate output when the output capacitance is charged and discharged. Switching power consumed by the CMOS gate can be modelled as

$$P_{switching} = \frac{1}{2 T_c} V_{dd}^2 \sum_{i=1}^{n} C_i \, p_t(x_i) \qquad (3)$$

where $C_i$ is the load capacitance of the $i$th node, $V_{dd}$ is the supply voltage, $T_c$ is the clock period and $p_t(x_i)$ is the signal transition probability at node $x_i$. Since the power dissipation due to switching activities accounts for over 90% of the average power consumption in CMOS circuits, only the power drawn by switching transitions is taken into consideration in modelling CMOS power dissipation.

2.2 Precomputation-based design algorithm

Precomputation is a combinational logic optimisation technique, which reduces switching activities by selectively disabling the input latches of a combinational logic circuit. This technique uses universal quantification and ODCs to determine the precomputation logic [10]. Fig. 1a shows a general combinational circuit, in which combinational logic block C has input latch L1. Fig. 1b is a structure based on the precomputation scheme, with a predictor circuit consisting of functions g1 and g2. If g1 or g2 evaluates to 1 during clock cycle $t$, the load-enable (LE) signal to latch L2 is turned off. This implies that the output does not change during clock cycle $t + 1$; hence no switching power is dissipated. A heuristic algorithm that finds the precomputation function for efficient power reductions has been proposed previously [10].

Fig. 1 Circuit structures for low-power consumption
a Combinational circuit
b Structure based on precomputation scheme
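The role of universal quantification in deriving predictor functions can be illustrated with a small sketch. It is not the heuristic of [10]: the 2-bit comparator, the choice of predictor inputs and the helper name `universal_quantify` are all assumptions made for illustration. Here g1 collects the predictor-input patterns for which the output is 1 regardless of the other inputs, and g2 those for which it is 0.

```python
from itertools import product

def universal_quantify(func, num_inputs, kept):
    """Return the assignments of the `kept` inputs for which func is 1
    for every assignment of the remaining inputs."""
    others = [i for i in range(num_inputs) if i not in kept]
    result = set()
    for kept_vals in product((0, 1), repeat=len(kept)):
        inputs = [0] * num_inputs
        for i, v in zip(kept, kept_vals):
            inputs[i] = v
        always_one = True
        for other_vals in product((0, 1), repeat=len(others)):
            for i, v in zip(others, other_vals):
                inputs[i] = v
            if not func(*inputs):
                always_one = False
                break
        if always_one:
            result.add(kept_vals)
    return result

def greater_than(c1, c0, d1, d0):
    """Hypothetical example function: 2-bit comparator, 1 when C > D."""
    return int(2 * c1 + c0 > 2 * d1 + d0)

def not_greater_than(c1, c0, d1, d0):
    return 1 - greater_than(c1, c0, d1, d0)

PREDICTOR_INPUTS = (0, 2)  # positions of c1 and d1 (an assumed choice)
g1 = universal_quantify(greater_than, 4, PREDICTOR_INPUTS)      # output known to be 1
g2 = universal_quantify(not_greater_than, 4, PREDICTOR_INPUTS)  # output known to be 0
# Fraction of predictor patterns (assumed equally likely) that disable the latch.
disabled_fraction = (len(g1) + len(g2)) / 2 ** len(PREDICTOR_INPUTS)
```

In this toy case the most significant bits alone decide the output for two of the four predictor patterns, so the remaining inputs need not be loaded in those cycles.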

2.3 Partitioning algorithm based on Shannon expansion

A function $f(i_1, i_2, \ldots, i_n)$ can be Shannon expanded with respect to an input variable $i_k$ as in eqn. 4, where $f_{i_k}$ and $f_{\bar{i}_k}$ are the cofactors of $f$ with respect to $i_k$ and $\bar{i}_k$, respectively:

$$f = i_k \cdot f_{i_k} + \bar{i}_k \cdot f_{\bar{i}_k} \qquad (4)$$

A combinational circuit can be transformed into a Shannon expansion form using an output multiplexor, as shown in Fig. 2a [11]. Depending on the value of $i_1$, only one co-factor subcircuit is computed while the other is disabled, keeping the last values of its input latch unchanged. If $i_1$ evaluates to 1, circuit $f_{i_1}$ is computed while $f_{\bar{i}_1}$ is disabled; when $i_1$ is 0, $f_{\bar{i}_1}$ is computed and $f_{i_1}$ is disabled. To maximally reduce the power dissipation of the circuit transformed by Shannon expansion, the control variable must be properly selected from the input variables for a given logic function. An optimal configuration is found by considering every input variable as a candidate control variable and selecting the one that yields the lowest power consumption. The precomputation scheme is only applicable to the type of logic functions that have ODCs. On the other hand, the Shannon expansion-based scheme is applicable to all types of logic, and incurs small area and delay overheads. However, its limitations are similar to those of the precomputation scheme, where duplicated inputs increase the area by the number of latches at the input.
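The expansion of eqn. 4 can be checked on a small function. The following is a minimal sketch, assuming functions are given as Python callables over 0/1 inputs; `cofactor`, `shannon_expand` and the example function are illustrative names, not part of the scheme in [11].

```python
from itertools import product

def cofactor(func, index, value):
    """Cofactor of func with input `index` fixed to `value`.
    The result keeps the same argument list and ignores position `index`."""
    def restricted(*inputs):
        fixed = list(inputs)
        fixed[index] = value
        return func(*fixed)
    return restricted

def shannon_expand(func, index):
    """Rebuild func from its two cofactors, as the output multiplexor does."""
    positive = cofactor(func, index, 1)
    negative = cofactor(func, index, 0)
    def expanded(*inputs):
        # The control variable acts as the select line: only one
        # cofactor is evaluated for any given input pattern.
        return positive(*inputs) if inputs[index] else negative(*inputs)
    return expanded

def example(a, b, c):
    """Hypothetical function: f = a.b + a'.c"""
    return (a & b) | ((1 - a) & c)

expanded = shannon_expand(example, 0)
equivalent = all(example(*x) == expanded(*x) for x in product((0, 1), repeat=3))
```

With `a` as the control variable the positive cofactor reduces to `b` and the negative cofactor to `c`, so each subcircuit is smaller than the original function.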

2.4 Kernel-based partitioning algorithm

A kernel-based partitioning algorithm reduces power consumption by partitioning a circuit around an appropriately selected kernel [12]. A kernel is a common sub-expression of a function, and covers both single and multiple output functions. The algorithm selects a kernel as a common logic part and implements it as the selection logic. Fig. 2b shows the kernel-based circuit structure. The kernel-based scheme consists of three subcircuits: a selection logic synthesised for kernel $k$ and two subcircuits generated from the original circuit divided by kernel $k$ and its complement $\bar{k}$.

Fig. 2 Circuit structures for low-power consumption
a Based on Shannon expansion
b Based on kernel selection

Only one subcircuit is activated, while the other is disabled by properly setting the load-enable signal of its input latch. The output of the selection logic is connected to the select line of a multiplexor, which chooses the correct output. The advantage of this circuit structure is that it can be applied to most kinds of logic function, whereas the precomputation-based structure requires an ODC. However, the selection logic and duplicated logic parts in each co-factor subcircuit may incur area overheads.

3 Proposed low-power synthesis algorithm

In combinatorial optimisation problems, both practical and theoretical, the objective is to choose the best solution in terms of the cost function from many possible solutions. A solution for such problems is an arrangement of a set of discrete objects according to a given set of constraints, and the solution space is a set of all possible solutions of the problem. NP-complete problems require computing effort that increases exponentially with the problem size. Efficient approximation algorithms that do not produce a global optimum, but rather a local optimum, have been proposed to improve the execution time.

Simulated annealing is a general-purpose optimisation technique for combinatorial optimisation problems [15]. The algorithm randomly generates a new state or configuration, and the new state is accepted or rejected according to an acceptance rule governed by the parameter analogous to temperature in the physical annealing process. Simulated annealing is a very attractive optimisation algorithm because it produces high-quality solutions and is, in general, easy to implement. For the best optimisation result, a careful design of the basic ingredients is required: formulating the problem so as to obtain an optimal partitioning solution; defining the neighbouring solutions of each solution; choosing a suitable cost function; and defining an annealing schedule. A set of moves is used to search the solution space, and a neighbouring solution is obtained from a solution via one of the moves. As the execution time of the simulated annealing algorithm largely depends on the annealing schedule, an efficient cooling schedule is needed. Many temperature scheduling algorithms have been proposed [16], such as fixed schedule, logarithmic schedule, Boltzmann annealing, simulated quenching, fast annealing, very fast simulated reannealing and adaptive simulated annealing.
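The ingredients listed above can be sketched in a few lines. This is a generic illustration, not the adaptive simulated annealing of [16]: the Metropolis-style acceptance rule and the geometric cooling schedule are assumptions chosen for brevity, and all names and default parameters are hypothetical.

```python
import math
import random

def accept(delta, temperature, rng):
    """Assumed Metropolis-style rule: always accept an improvement,
    accept a worse state with probability exp(-delta / temperature)."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)

def anneal(initial, cost, neighbour, rng,
           t_start=10.0, t_end=1e-3, alpha=0.95, moves_per_t=50):
    """Minimise `cost` starting from `initial`.
    Geometric cooling (t *= alpha) stands in for the adaptive schedule."""
    state = best = initial
    temperature = t_start
    while temperature > t_end:
        for _ in range(moves_per_t):
            candidate = neighbour(state, rng)
            if accept(cost(candidate) - cost(state), temperature, rng):
                state = candidate
                if cost(state) < cost(best):
                    best = state
        temperature *= alpha
    return best
```

A caller supplies the problem-specific parts, for example `anneal(20, lambda x: (x - 3) ** 2, lambda x, r: x + r.choice((-1, 1)), random.Random(0))` for a toy one-dimensional search; in the partitioning problem the state would be a partition tree and the neighbour function a partition or merge move.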

In our proposed algorithm, adaptive simulated annealing is used in partitioning the logic circuit for low-power design [16]. The partitioning algorithm generates circuits consuming less power, but it tends to generate circuits with a large area overhead. In the proposed system, the user can specify the upper limit of the area increase, such that circuits consuming less power can be generated without violating the constraints.

Fig. 3 Benchmark circuit ‘b1’

3.1 Solution space

A partitioning solution can be obtained by bipartitioning a circuit recursively. Fig. 3 shows the MCNC benchmark circuit ‘b1’ optimised by the Synopsys Design Analyzer. Fig. 4 shows the solution space in the form of a search tree for the circuit ‘b1’. Each internal node in the search tree represents a partitioning solution consisting of a set of subcircuits. Each edge represents the neighbouring relation between two connected nodes; its label $(f, v)$ represents a move operation that bipartitions a subcircuit $f$ using $v$ as a selection variable. The number of nodes at level $k$ is proportional to ${}_nC_k$, where $n$ is the number of inputs in the given circuit. The number of possible solutions is proportional to $\sum_{k=1}^{n} {}_nC_k$, since the depth of the tree is equal to $n$.
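The growth of this bound is easy to evaluate. The sketch below only computes the sum of binomial coefficients quoted above (the function name is illustrative); by the binomial theorem the sum equals $2^n - 1$, so the solution space grows exponentially with the number of inputs.

```python
from math import comb

def solution_count_bound(n):
    """Sum of nCk for k = 1..n, the quantity the number of solutions is proportional to."""
    return sum(comb(n, k) for k in range(1, n + 1))

# Circuit 'b1' has 3 inputs (Table 1), giving 3 + 3 + 1 = 7.
b1_bound = solution_count_bound(3)
```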

To manage partitioned subcircuits systematically, a binary partition tree is used to represent a partitioning solution. An internal node of the partition tree has two children, representing the co-factors with respect to a selected variable. Each leaf node represents a partitioned subcircuit. Fig. 5 shows a partition tree for the shaded node of Fig. 4. The leaf nodes in the boxes correspond to the three subcircuits co-factored with respect to the cubes $bc$, $b\bar{c}$ and $\bar{b}$. Fig. 6 shows the resultant circuit representing the partitioning solution in Fig. 5.

Fig. 4 Search tree for circuit ‘b1’

Fig. 5 Partition tree representing the shaded node in Fig. 4

Fig. 6 Resultant circuit corresponding to partitioning solution of Fig. 5

3.2 Neighbouring solutions

Neighbouring solutions can be obtained by modifying the existing solution through a set of moves. In the proposed algorithm, two types of moves are used to locally modify a partitioning solution at a randomly selected leaf node: partition and merge. There are two possible ways of selecting a leaf node in the partition tree: root-to-leaf path enumeration and pre-labelled leaf selection. In the path enumeration method, a leaf node is selected by enumerating the path from the root node to the leaf node, using a sequence of randomly generated binary numbers. In the partition tree of Fig. 5, the leaf nodes corresponding to the circuits divided by the cubes $bc$ and $b\bar{c}$ are selected by the random number sequences ‘11’ and ‘10’, respectively. Each leaf node has a selection probability of $1/2^{depth}$. Since deeper leaf nodes have lower selection probabilities, the partition tree tends to be balanced. The advantage of this method is that the power consumption of the resultant circuit is less input pattern-dependent, because the differences in power consumption between the subcircuits tend to be small. However, since a leaf node can be selected by more than one sequence of random numbers, the search time increases owing to unnecessary retrials.
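Path enumeration can be sketched on the three-leaf tree of Fig. 5. The `Node` class, the labels and the convention that bit 1 follows the positive cofactor are assumptions for illustration; this variant stops as soon as a leaf is reached rather than drawing a fixed-length sequence.

```python
import random
from fractions import Fraction

class Node:
    """Binary partition-tree node; a node without children is a leaf."""
    def __init__(self, label, low=None, high=None):
        self.label, self.low, self.high = label, low, high

    def is_leaf(self):
        return self.low is None and self.high is None

def select_leaf(root, rng):
    """Walk from the root, drawing one random bit per level."""
    node, bits = root, ""
    while not node.is_leaf():
        bit = rng.getrandbits(1)
        bits += str(bit)
        node = node.high if bit else node.low
    return node, bits

def leaf_probabilities(node, prob=Fraction(1)):
    """Exact selection probability of every leaf: 1 / 2**depth."""
    if node.is_leaf():
        return {node.label: prob}
    table = leaf_probabilities(node.low, prob / 2)
    table.update(leaf_probabilities(node.high, prob / 2))
    return table

# Tree of Fig. 5: split on b, then split the b = 1 branch on c.
tree = Node("f", low=Node("~b"),
            high=Node("b", low=Node("b~c"), high=Node("bc")))
probabilities = leaf_probabilities(tree)
```

The shallow leaf is chosen half of the time and each of the two deeper leaves a quarter of the time, which is why further partitioning is biased towards shallow leaves and the tree tends to stay balanced.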

In the pre-labelled leaf selection method, each leaf has a unique integer index and is selected by using a random number. This method has the advantage of requiring fewer repetitions in the leaf selection process than the previous method, but has the disadvantage of generating partitioned blocks with uneven power consumption. To traverse a large partitioning solution space efficiently, the time complexity of the moves must be small. To prevent repeated bipartitioning trials at the same node of the partition tree, each internal node keeps a set of variables that records past bipartitioning trials, and the variables in this set are bypassed in future bipartitioning attempts at the node. Memory overheads can be reduced by bitwise packing of the variable set. Fig. 7 shows the proposed move procedure.
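Bitwise packing of the tried-variable set can be sketched with a single integer used as a bit mask. The class and method names below are hypothetical; the sketch only shows one way such a record could be kept, with bit `i` set once input variable `i` has been tried at a node.

```python
import random

class TriedVariables:
    """Record of bipartitioning trials at one node, packed into an integer."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.mask = 0  # bit i is 1 once variable i has been tried

    def mark(self, index):
        self.mask |= 1 << index

    def is_tried(self, index):
        return bool((self.mask >> index) & 1)

    def pick_untried(self, rng):
        """Pick a random variable not tried before, or None if all were tried."""
        candidates = [i for i in range(self.num_inputs) if not self.is_tried(i)]
        if not candidates:
            return None
        choice = rng.choice(candidates)
        self.mark(choice)
        return choice
```

One machine word then covers up to 32 or 64 input variables per node, instead of one list entry per tried variable.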

    /* T : Partition tree whose leaf nodes represent partitioning results. */
    move (T)
    {
        /* Select a subcircuit randomly. */
        leaf = get_a_random_leaf(T);
        /* Select a Partition or Merge operation. */
        mode = select_a_move_randomly;
        if (mode == Partition) {
            /* Select an input variable which has not been tried yet. */
            variable = select_random_input(leaf);
            Partition_circuit(leaf, variable);
        } else {    /* if mode = Merge */
            if (leaf == root_node)
                return FAIL;
            Merge_circuits(leaf->parent->left, leaf->parent->right);
        }
        return SUCCESS;
    }

Fig. 7 Move procedure in proposed algorithm
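The move procedure of Fig. 7 can be turned into a small runnable sketch. Several details are assumptions rather than the authors' implementation: subcircuits are represented only by their cubes, leaves are chosen uniformly, the two moves are equally likely, and a merge collapses everything below the parent of the selected leaf.

```python
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)
class PartitionNode:
    cube: dict                                  # selection variable -> 0 or 1
    parent: Optional["PartitionNode"] = None
    left: Optional["PartitionNode"] = None
    right: Optional["PartitionNode"] = None
    tried: set = field(default_factory=set)     # variables already tried here

def leaves(node):
    """Collect the leaf nodes, i.e. the current partitioned subcircuits."""
    if node.left is None:
        return [node]
    return leaves(node.left) + leaves(node.right)

def move(root, variables, rng):
    """Apply one random Partition or Merge move; return False on failure."""
    leaf = rng.choice(leaves(root))
    if rng.random() < 0.5:                      # Partition
        untried = [v for v in variables
                   if v not in leaf.cube and v not in leaf.tried]
        if not untried:
            return False
        variable = rng.choice(untried)
        leaf.tried.add(variable)
        leaf.left = PartitionNode({**leaf.cube, variable: 0}, parent=leaf)
        leaf.right = PartitionNode({**leaf.cube, variable: 1}, parent=leaf)
    else:                                       # Merge
        if leaf.parent is None:                 # the root cannot be merged
            return False
        leaf.parent.left = leaf.parent.right = None
    return True
```

Whatever sequence of moves is applied, the cubes of the leaves always cover the whole input space exactly once, which is what allows exactly one subcircuit to be activated for each input pattern.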

The proposed algorithm searches for a partitioning solution with minimum cost while satisfying the area constraints. The cost function is the sum of the power consumption of all the subcircuits in a partitioning solution. If the cost of a neighbouring solution decreases, the generated partitioning solution is accepted. If the cost increases, the move is accepted with a probability that varies according to the temperature. The simulated annealing algorithm terminates when the quality of a solution does not improve for a constant number of moves. The adaptive simulated annealing algorithm employed in the proposed algorithm is efficient at exploring a wide multi-dimensional solution space [16].
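A cost function of this kind can be sketched from eqns. 2 and 3. The data layout (a subcircuit as a dictionary with an `area` and a list of `(capacitance, signal probability)` nodes) and the handling of the area constraint by returning an infinite cost are assumptions; the text does not state how violations are penalised. The defaults of 5 V and a 20 ns clock period correspond to the 5 V, 50 MHz setting used in the experiments.

```python
import math

def transition_probability(p_signal):
    """Eqn. 2: transition probability under temporal independence."""
    return 2.0 * p_signal * (1.0 - p_signal)

def switching_power(nodes, vdd, t_clk):
    """Eqn. 3: nodes is an iterable of (load capacitance, signal probability)."""
    weighted = sum(c * transition_probability(p) for c, p in nodes)
    return vdd ** 2 / (2.0 * t_clk) * weighted

def partition_cost(subcircuits, area_limit, vdd=5.0, t_clk=20e-9):
    """Sum of subcircuit power; solutions over the area limit are rejected
    by an infinite cost (an assumed way of enforcing the constraint)."""
    if sum(s["area"] for s in subcircuits) > area_limit:
        return math.inf
    return sum(switching_power(s["nodes"], vdd, t_clk) for s in subcircuits)
```

For example, a single node with a 1 pF load and signal probability 0.5 dissipates about 0.31 mW under these defaults, and a solution whose total area exceeds `area_limit` can never be preferred over a feasible one.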

The circuit structure obtained by the multiple partitioning algorithm is shown in Fig. 8, where $I = \{i_1, i_2, \ldots, i_n\}$ is the set of primary input variables and $I_s$ is the subset of the primary inputs $I$ used by the selection logic. The selection logic is constructed in the form of a decoder. Its function is to activate a subcircuit by enabling its input latch according to the input pattern. Depending on the value of $I_s$, only one of the subcircuits is activated, while the rest are disabled by properly setting the load-enable signals of their input latches. The output of the selection logic is connected to the select line of a multiplexor. The multiplexor at each output line can be replaced by a single transmission gate in a wired-OR form. As it exhibits a small variation in power dissipation between the subcircuits, the circuit produced by the proposed algorithm is less input pattern-dependent.

Fig. 8 Circuit structure obtained by proposed algorithm

4 Experimental results

The proposed algorithm has been evaluated for performance using the MCNC benchmark circuits. In the experiments, we started with an initial circuit that had first been optimised by the Synopsys Design Analyzer. The original MCNC benchmark circuits and the circuits optimised by the Synopsys Design Analyzer are summarised in Table 1. Technology mapping has been performed for the circuits generated by the proposed algorithm using the standard cells provided by the Synopsys Design Analyzer. Power estimations are made at a clock frequency of 50 MHz using a 5 V supply voltage, after performing technology mapping using the lsi_10k technology library.

Table 1: Benchmark circuits used for performance evaluation of proposed algorithm

Columns #ins, #outs and #lits describe the benchmark circuits; Area, Delay (ns) and Power (µW) are those of the circuits optimised by Synopsys Design Analyzer.

Circuit      #ins  #outs  #lits   Area  Delay  Power
alu4           14      8    664   2532     17   2218
b1              3      4     23     51      2     59
c8             28     18    286    520      8    582
cc             21     20    136    293      7    305
cm150a         21      1    119    291      2    320
cm162a         14      5     86    220      5    243
cm42a           4     10     43     68      4     90
cmb            16      4     94    208      3    204
cordic         23      2    240    385      6    441
cu             14     11    118    226      4    238
decod          35     16    244    258      2    255
duke2          22     29    982    952      9    700
e64            65     65   2274    965      9    853
f51m            8      8    185    336     22    309
k2             45     45   1027   3365     17   2281
majority        5      1     21     69      4     78
misex2         25     18    214    425      7    394
mux            21      1    134    425      6    384
pcler8         27     17    149    449      9    416
sao2           10      4    156    388     14    442
sct            19     15    202    319     34    340
tcon           17     16     74    219      2    251
vda            17     39    600   1815     13   1379
x2             10      7     91    190      4    199
z4ml            7      4     91    125      6    159
i9             88     63    592   2146     19   2489
dalu           75     16   3067   1220     20   1296
C17             5      2     12     61      2     69
C1908          33     25   1497   1221     29   1425
C7552         207    108   2317   1548     38   5653
C880           60     26    415    852     23   3126
Average        31     20    521    714     11    877

Table 2 shows the results of the precomputation scheme and the proposed scheme for the MCNC benchmark circuits, where power dissipations and areas are compared to those of the circuits optimised by the Synopsys Design Analyzer. The power reduction by the proposed scheme without area constraints is 42.7% and 37.1% when compared to the Synopsys-optimised circuits and the precomputation-based circuits, respectively. The delay is increased by 19.8% when compared to the circuits optimised by the Synopsys Design Analyzer. The area is increased by 59.7% and 49.0% when compared to the Synopsys-optimised circuits and the precomputation-based circuits, respectively. These area increases are incurred by logic duplications in the partitioned co-factor circuits, the input latches and the selection logic.

Table 3 shows that the power reduction for the benchmark circuits without area constraints is 42.7%, and the result without allowing any area increases is an average power reduction of 8.7%. Table 4 shows power reductions of 25.7% and 31.0% with constraints of 120% and 140% of the area of the original circuits, respectively.

Table 2: Power consumption of circuits produced by precomputation-based scheme and proposed scheme

The first six data columns are for the precomputation-based scheme, the last six for the proposed scheme. Delay is in ns and Power in µW; each Δ% column is the change in the preceding column.

Circuit      Area     Δ%  Delay     Δ%  Power     Δ%   Area     Δ%  Delay     Δ%  Power     Δ%
alu4         1824  -28.0     97  481.2   1616  -27.2   2381   -6.0     18    7.7   1193  -46.2
b1             51    0.0      2    0.0     59    0.0     62   21.6      3   60.1     25  -58.0
c8            894   71.9     19  141.9    579   -0.6    722   38.8     10   23.8    407  -30.4
cc            293    0.0      7    0.0    305    0.0   1217  315.4      9   13.8    210  -31.1
cm150a        291    0.0      2    0.0    320    0.0    292    0.3      4  100.0    157  -51.0
cm162a        144  -34.5     31  302.2    236   -2.8    235    6.8      7   30.9     95  -61.0
cm42a          68    0.0      4    0.0     90    0.0     91   33.8      2  -43.3     52  -42.2
cmb           208    0.0      3    0.0    204    0.0    291   39.9      5   57.3     93  -54.3
cordic        385    0.0      6    0.0    441    0.0    445   15.6      7   18.3    254  -42.3
cu            213   -5.8     22  448.4    237   -0.3    256   13.3      6   42.6    130  -45.2
decod         258    0.0      2    0.0    255    0.0     60  -76.7      4   65.7     71  -72.1
duke2         928   -2.5     39  325.4    550  -21.4   1042    9.5     11   20.1    342  -51.1
e64           554  -42.6     68  635.0    671  -21.4   1527   58.2     11   20.1    515  -39.6
f51m          336    0.0     22    0.0    309    0.0    332   -1.2     20   -8.9    194  -37.4
k2           1880  -44.1     63  278.7   1268  -44.4   3249   -3.4     14  -15.5   1233  -46.0
majority       69    0.0      4    0.0     78    0.0     74    7.2      4   -2.3     45  -42.6
misex2        425    0.0      7    0.0    394    0.0    484   13.9      6  -14.6    229  -42.0
mux           425    0.0      6    0.0    384    0.0    475   11.8      7   16.1    226  -41.2
pcler8        449    0.0      9    0.0    416    0.0    410   -8.7     11   21.6    158  -62.0
sao2         2164  457.7     54  296.4    384  -13.2    391    0.8     16   14.3    216  -51.2
sct           319    0.0     34    0.0    340    0.0    608   90.6     33   -2.4    205  -39.8
tcon          219    0.0      2    0.0    251    0.0    230    5.0      2  -19.0    118  -52.7
vda          1063  -41.4     46  262.2    790  -42.7   1808   -0.4     14    9.3    679  -50.8
x2            190    0.0      4    0.0    199    0.0    233   22.6      5   24.3    126  -36.7
z4ml          125    0.0      6    0.0    159    0.0    418  234.4      7   20.0    139  -12.9
i9           2146    0.0     19    0.0   2489    0.0  20460  853.4     20    6.8    774  -68.9
dalu         1220    0.0     20    0.0   1296    0.0   1001  -18.0     20    1.2    621  -52.1
C17            61    0.0      2    0.0     69    0.0     72   18.0      4  146.8     40  -42.5
C1908        1221    0.0     29    0.0   1425    0.0   2040   67.1     23  -21.8   1154  -19.0
C7552        1548    0.0     38    0.0   5653    0.0   1548    0.0     42   12.2   5653    0.0
C880          852    0.0     23    0.0   3126    0.0   1596   87.3     25    7.2   3024   -3.3
Average       672   10.7     22  102.3    793   -5.6   1421   59.7     12   19.8    593  -42.7

Experimental results are compared with circuits optimised by Synopsys Design Analyzer

Note that in Table 2 an average 66.0% reduction in power dissipation has been obtained for the ‘cm162a’, ‘decod’, ‘pcler8’ and ‘i9’ circuits. In these circuits, power reductions are prominent owing to the structure of the logic circuits, in which a unate variable is dominant. When circuits with this structure are bipartitioned, the co-factor subcircuits become smaller and less power-consuming. Circuits with multiple outputs that share inputs tend to have the largest area reductions, whereas area increases for partitioned circuits with duplicated input latches, such as ‘cc’, ‘e64’ and ‘i9’. Circuits ‘z4ml’ and ‘C880’ show only a slight decrease in power because the selected inputs are less sensitive to switching activities.

The disadvantage of a precomputation-based circuit lies in the increased overheads due to additional circuitry. However, the co-factor circuits generated by the proposed algorithm have delays not exceeding that of the initial circuit. In the proposed scheme, each output line of the selection logic may have a large parasitic capacitance. However, only one of the subcircuits is active at a time, and at most two selection lines are switching at each clock cycle. In the experiment, we used the maximum possible load capacitance value at the output line of selection logic. Throughout the experiments, the execution times are averages of 50 runs on an Ultra Sparc I workstation, because runtime has not been consistently predictable. The proposed algorithm proved to be efficient in the experiments for the MCNC benchmark circuits.

Table 3: Experimental results of circuits obtained by proposed algorithm under area constraints given as those of circuits optimised by Synopsys Design Analyzer

The first six data columns are for circuits generated under constraints of 100% of the area of the Synopsys-optimised circuits, the last six for circuits generated without area constraints. Delay is in ns and Power in µW; each Δ% column is the change in the preceding column.

Circuit      Area     Δ%  Delay     Δ%  Power     Δ%   Area     Δ%  Delay     Δ%  Power     Δ%
alu4         2381   -6.0     18    7.7   1193  -46.2   2381   -6.0     18    7.7   1193  -46.2
b1             51    0.0      2    0.0     59    0.0     62   21.6      3   60.1     25  -58.0
c8            520    0.0      8    0.0    582    0.0    722   38.8     10   23.8    407  -30.1
cc            293    0.0      7    0.0    305    0.0   1217  315.4      9   13.8    210  -31.1
cm150a        291    0.0      2    0.0    320    0.0    292    0.3      4  100.0    157  -51.0
cm162a        220    0.0      5    0.0    243    0.0    235    6.8      7   30.9     95  -61.0
cm42a          68    0.0      4    0.0     90    0.0     91   33.8      2  -43.3     52  -42.2
cmb           208    0.0      3    0.0    204    0.0    291   39.9      5   57.3     93  -54.3
cordic        385    0.0      6    0.0    441    0.0    445   15.6      7   18.3    254  -42.3
cu            226    0.0      4    0.0    238    0.0    256   13.3      6   42.6    130  -45.2
decod         189  -26.7      4   65.7     72  -71.9     60  -76.7      4   65.7     71  -72.1
duke2         952    0.0      9    0.0    700    0.0   1042    9.5     11   20.1    342  -51.1
e64           965    0.0      9    0.0    853    0.0   1527   58.2     11   20.1    515  -39.6
f51m          336    0.0     22    0.0    309    0.0    332   -1.2     20   -8.9    194  -37.4
k2           3249   -3.4     14  -15.5   1233  -46.0   3249   -3.4     14  -15.5   1233  -46.0
majority       69    0.0      4    0.0     78    0.0     74    7.2      4   -2.3     45  -42.6
misex2        425    0.0      7    0.0    394    0.0    484   13.9      6  -14.6    229  -42.0
mux           425    0.0      6    0.0    384    0.0    475   11.8      7   16.1    226  -41.2
pcler8        410   -8.7     11   21.6    185  -55.5    410   -8.7     11   21.6    158  -62.0
sao2          388    0.0     14    0.0    442    0.0    391    0.8     16   14.3    216  -51.2
sct           319    0.0     34    0.0    340    0.0    608   90.6     33   -2.4    205  -39.8
tcon          219    0.0      2    0.0    251    0.0    230    5.0      2  -19.0    118  -52.7
vda          1808   -0.4     14    9.3    679  -50.8   1808   -0.4     14    9.3    679  -50.8
x2            190    0.0      4    0.0    199    0.0    233   22.6      5   24.3    126  -36.7
z4ml          125    0.0      6    0.0    159    0.0    418  234.4      7   20.0    139  -12.8
i9           2146    0.0     19    0.0   2489    0.0  20460  853.4     20    6.8    774  -68.9
dalu         1001  -18.0     20    0.0   1296    0.0   1001  -18.0     20    1.2    621  -52.1
C17            61    0.0      2    0.0     69    0.0     72   18.0      4  146.8     40  -42.5
C1908        1221    0.0     29    0.0   1425    0.0   2040   67.1     23  -21.8   1154  -19.0
C7552        1548    0.0     38    0.0   5653    0.0   1548    0.0     42   12.2   5653    0.0
C880          852    0.0     23    0.0   3126    0.0   1596   87.3     25    7.2   3024   -3.3
Average       695   -2.0     11    2.9    775   -8.7   1421   59.7     12   19.8    593  -42.7

5 Conclusion

We have proposed an algorithm based on multiple partitioning for the synthesis of logic circuits under area constraints. The proposed algorithm recursively bipartitions a given circuit such that only a single circuit among the entire pool of partitioned subcircuits can be activated according to the input conditions. Thus, it can successfully reduce unnecessary signal transitions. The algorithm utilises the adaptive simulated annealing technique to search for a probable partitioning solution. To reduce memory usage and execution time, the algorithm is designed to limit unnecessary searches of trial partitionings.

The power consumption by the circuits generated by the algorithm under given area constraints has been greatly reduced. Experimental results show that the proposed algorithm is efficient at designing low-power combinational circuits. Algorithm designs for reducing the search space in the circuits with symmetric inputs, and reducing duplicated areas among the co-factor subcircuits, are the subject for future research.

6 Acknowledgment

This work was supported by MOIE, MOIC and MOST through ASIC infra-technology projects.

7 References

1 DEVADAS, S., and MALIK, S.: ‘A survey of optimization techniques targeting low power VLSI circuits’. Proceedings of 32nd DAC, June 1995, pp. 242-247

2 CHANDRAKASAN, A., SHENG, T., and BRODERSEN, R.: ‘Low power CMOS digital design’, IEEE J. Solid-State Circuits, 1992, 27, (4), pp. 473-484

3 TSUI, C., PEDRAM, M., and DESPAIN, A.: ‘Exact and approximate methods for switching activity estimation in sequential logic circuits’. Proceedings of 31st DAC, June 1994, pp. 18-23

4 LEE, H.D., LEE, J.S., and HWANG, S.Y.: ‘A novel high level synthesis algorithm for low power ASIC design’, J. Microelectron. Syst. Integration, 1996, 4, (4), pp. 219-232

Table 4: Experimental results of circuits synthesised by proposed algorithm under various values of area constraints

Benchmark circuits Area Delay Power Area Delay Power

Δ% (ns) Δ% (µW) Δ% Δ% (ns) Δ% (µW) Δ%

a h 4

b l

c8

cc

cm 150a

cm162a

cm42a

cmb

cordic

cu

decod

duke2

e64

f51m

k2

majority

misex2

m ux

pcler8

sa02

sct

tcon

vda

x2

z4ml

i9

dalu

C17

C1908

C7552

C880

Average

2 381

51

520

293

292

220

235

208

445

256

189

754

965

336

3 249

74

200

475

410

39 1

148

230

1 808

190

125

2 146

1001

72

1221

1 548

852

687

-6.0 18 7.7

0.0 4 115.6

0.0 10 25.5

0.0 9 26.7

0.3 2 0.0

0.0 7 39.1

245.6 2 -43.3

0.0 5 57.3

15.6 7 18.3

13.3 6 42.6

-26.7 4 65.7

-20.8 11 21.6

0.0 11 21.6

0.0 24 9.2

-3.4 14 -15.5

7.2 4 -2.3

-52.9 9 26.8

11.8 7 16.1

-8.7 11 21.6

0.8 16 14.3

-53.6 36 5.9

5.0 2 -19.0

-0.4 14 9.3

0.0 6 51.2

0.0 8 34.5

0.0 21 10.7

-18.0 20 1.2

18.0 4 115.2

0.0 31 6.8

0.0 40 5.3

0.0 23 0.0

4.1 12 22.3

1193

59

582

305

157

243

45

204

254

130

72

49 1

853

416

1233

45

156

226

185

216

159

118

679

199

159

2 489

1296

40

1425

5 653

3 126

723

-46.2 2381

0.0 62

0.0 722

0.0 405

-51.0 292

0.0 235

-49.7 235

0.0 208

-42.3 445

-45.2 256

-71.9 137

-29.9 754

0.0 965

34.5 332

-46.0 3249

-42.6 74

-60.3 200

-41.2 475

-55.5 410

-51.2 391

-53.2 148

-52.7 230

-50.8 1808

0.0 1041

0.0 125

0.0 2 146

0.0 1001

-42.5 72

0.0 1221

0.0 1548

0.0 852

-25.7 723

-6.0 18 7.7

21.6 3 60.1

38.8 10 23.8

38.2 9 13.8

0.3 2 0.0

6.8 7 30.9

245.6 2 -43.3

0.0 5 57.3

15.6 7 18.3

13.3 6 42.6

-46.9 4 65.7

-20.8 11 21.6

0.0 11 21.6

-1.2 20 -8.9

-3.4 14 -15.5

7.2 4 -2.3

-52.9 9 26.8

11.8 7 16.1

-8.7 11 21.6

0.8 16 14.3

-53.6 36 5.9

5.0 2 -19.0

-0.4 14 9.3

447.9 5 24.3

0.0 8 34.5

0.0 21 10.7

-18.0 20 1.2

18.0 4 115.2

0.0 31 6.8

0.0 40 5.3

0.0 23 0.0

21.3 12 18.3

1193

25

407

216

157

95

45

204

254

130

71

49 1

853

198

1233

45

156

226

185

216

159

118

679

367

159

2 489

1296

40

1425

5 653

3 126

707

-46.2

-58.0

-30.1

-29.3

-51 .O -61 .O -49.7

0.0

-42.3

-45.2

-72.1

-29.9

0.0

-35.9

46.0

-42.6

-60.3

-41.2

-55.5

-51.2

-53.2

-52.7

-50.8

84.4

0.0 0.0

0.0

-42.5

0.0

0.0

0.0

-31.0

Experimental results are compared with circuits optimised by Synopsys Design Analyzer
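The Δ % entries of Table 4 appear to be relative changes against the Synopsys-optimised reference circuit. Under that assumption, a row showing 0.0 % for one constraint setting gives the reference value directly, and the entry for the other setting can be recomputed from it. The short sketch below checks this reading on two rows; the function name `percent_change` is introduced here for illustration only.

```python
def percent_change(value: float, reference: float) -> float:
    """Relative change of value against reference, in percent."""
    return 100.0 * (value - reference) / reference


# x2, area: 190 at 0.0 % (taken as the reference) against 1041, listed as 447.9 %.
x2_area = round(percent_change(1041, 190), 1)

# c8, power: 582 at 0.0 % (taken as the reference) against 407, listed as -30.1 %.
c8_power = round(percent_change(407, 582), 1)

print(x2_area, c8_power)
```

Both recomputed values agree with the tabulated entries, which supports reading negative Δ % as a reduction relative to the reference.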

5 LEE, J.S., LEE, H.D., PARK, C.W., and HWANG, S.Y.: ‘A power-conscious scheduling algorithm for performance-driven datapath synthesis’, Electron. Lett., 1996, 32, (17), pp. 1574-1576

6 SHEN, A., DEVADAS, S., GHOSH, A., and KEUTZER, K.: ‘On average power dissipation and random pattern testability of combinational logic circuits’. Proceedings of ICCAD, November 1992, pp. 402-407

7 LEMONDS, C., and SHETTI, S.: ‘A low power 16 by 16 multiplier using transition reduction circuitry’. Proceedings of international workshop on Low power design, April 1994, pp. 139-142

8 IMAN, S., and PEDRAM, M.: ‘Multi-level network optimization for low power’. Proceedings of ICCAD, November 1994, pp. 371-377

9 IMAN, S., and PEDRAM, M.: ‘Logic extraction and factorization for low power’. Proceedings of 32nd DAC, June 1995, pp. 248-253

10 ALIDINA, M., MONTEIRO, J., DEVADAS, S., and GHOSH, A.: ‘Precomputation-based logic optimization for low power’. Proceedings of ICCAD, November 1994, pp. 74-81

11 KIM, H., CHOI, I.S., and HWANG, S.Y.: ‘Design of heuristic algorithms based on Shannon expansion for the synthesis of logic circuits with low power’, IEE Proc. Circuits Devices Syst., 1997, 144, (6), pp. 355-360

12 CHOI, I.S., KIM, H., SEO, D.W., and HWANG, S.Y.: ‘Kernel-based precomputation scheme for the design of low power combinational circuits’, Electron. Lett., 1996, 32, (14), pp. 1281-1283

13 KIM, H., and HWANG, S.Y.: ‘A heuristic algorithm for low power design of combinational circuits’, Electron. Lett., 1996, 32, (12), pp. 1066-1067

14 NAJM, F.: ‘Power estimation techniques for integrated circuits’. Proceedings of ICCAD, November 1995, pp. 492-499

15 KIRKPATRICK, S., GELATT, C., and VECCHI, M.: ‘Optimization by simulated annealing’, Science, 1983, 220, pp. 671-680

16 INGBER, L.: ‘Adaptive simulated annealing (ASA) lessons learned’, Control Cybern., 1996, 25, pp. 33-54

17 SENTOVICH, E., SAVOJ, H., BRAYTON, R., and SANGIOVANNI-VINCENTELLI, A.: ‘SIS, a system for sequential circuit synthesis’. Memorandum UCB/ERL M92/41, Electronic Research Laboratory, University of California, Berkeley, May 1992
