Lower Power High-Level Synthesis
August 1999
Prof. Jun-Dong Cho, SungKyunKwan University, VADA Lab.
http://vada.skku.ac.kr

System Partitioning
To decide which components of the system will be realized in hardware and which will be implemented in software
High-quality partitioning is critical in high-level synthesis. To be useful, high-level synthesis algorithms should be able to handle very large systems. Typically, designers partition high-level design specifications manually into procedures, each of which is then synthesized individually. Different partitionings of the high-level specifications may produce substantial differences in the resulting IC chip areas and overall system performance.
To decide whether the system functions are distributed or not. Distributed processors, memories and controllers can lead to significant power savings; the drawback is the increase in area (e.g., a non-distributed versus a distributed design of a vector quantizer).
Circuit Partitioning
Node weights can be taken into account by letting the weight of a node (i,j) in Nc be the sum of the weights of the nodes i and j. We can similarly take edge weights into account by letting the weight of an edge in Ec be the sum of the weights of the edges "collapsed" into it. Furthermore, we can choose the edge (i,j) which matches j to i in the construction of Nc above to have the largest weight of all edges incident on i; this will tend to minimize the weights of the cut edges. This is called heavy edge matching in METIS.
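A greedy matching pass of this kind might be sketched as follows (illustrative Python, not the METIS implementation; the adjacency-dict representation is an assumption):

```python
def heavy_edge_matching(adj):
    """adj: {node: {neighbor: edge_weight}} -> list of matched (i, j) pairs."""
    matched, pairs = set(), []
    for i in adj:
        if i in matched:
            continue
        # among the unmatched neighbors of i, pick the one joined by the heaviest edge
        candidates = [(w, j) for j, w in adj[i].items() if j not in matched]
        if not candidates:
            continue  # i stays unmatched and is copied into the coarse graph
        w, j = max(candidates, key=lambda t: t[0])
        matched.update((i, j))
        pairs.append((i, j))
    return pairs
```

Each matched pair (i, j) is then collapsed into one coarse node whose weight is the sum of the two node weights, as described above.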
Multilevel Kernighan-Lin
Given a partition (Nc+, Nc-) from step (2) of Recursive_partition, it is easily expanded to a partition (N+, N-) in step (3) by associating with each node in Nc+ or Nc- the nodes of N that comprise it. Finally, in step (4) of Recursive_partition, the approximate partition from step (3) is improved using a variation of Kernighan-Lin.
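A minimal sketch of one KL-style refinement step, with a brute-force gain evaluation rather than the bucket data structures real implementations use (the adjacency-dict graph representation is an assumption):

```python
def cut_size(adj, part_a):
    """Total weight of edges crossing the partition (adj is symmetric)."""
    return sum(w for u in adj for v, w in adj[u].items()
               if (u in part_a) != (v in part_a)) // 2  # each edge seen twice

def refine_once(adj, part_a, part_b):
    """One greedy KL-style step: perform the single swap with the best gain."""
    base = cut_size(adj, part_a)
    best = (0, None, None)
    for a in list(part_a):
        for b in list(part_b):
            new_a = (part_a - {a}) | {b}      # partition after swapping a and b
            gain = base - cut_size(adj, new_a)
            if gain > best[0]:
                best = (gain, a, b)
    gain, a, b = best
    if a is not None:
        part_a.remove(a); part_a.add(b)
        part_b.remove(b); part_b.add(a)
    return gain
```

Kernighan-Lin proper performs a whole sequence of tentative swaps and keeps the best prefix; this single-swap pass only illustrates the gain computation.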
WRITE VHDL
Minimizing switching activity in registers
Minimizing switching activity in resources and interconnect
LOW POWER AND FAST SCHEDULING
Behavioral constructs: instructions, operations, variables, arrays, signals, clocked statements (@(posedge clk))
High-Level Synthesis
The allocation task determines the type and quantity of resources used in the RTL design. It also determines the clocking scheme, memory hierarchy and pipelining style. To perform the required trade-offs, the allocation task must determine the exact area and performance values.
The scheduling task schedules operations and memory references into clock cycles. If the number of clock cycles is a constraint, the scheduler has to produce a design with the fewest functional units.
The binding task assigns operations and memory references within each clock cycle to available hardware units. A resource can be shared by different operations if they are mutually exclusive, i.e. they will never execute simultaneously.
Power-reduction techniques surveyed:
Sibling operations [Fang, 96]
Correlation-based resource sharing [Gebotys, 97]
FU shut-down (demand-driven operation) [Alidina, 94]
[Rabaey, 96]
Dual [Sarrafzadeh, 96]
Spurious operations [Hwang, 96]
[Cho, 97]
[Figure: example DFG with adders +1, +2, +3, multiplier *1 and operands a, b, c, d, e, f, g, h]
The total datapath power is estimated as P = PREG + PMUX + PFU + PINT,
where PREG is the power of the registers,
PMUX is the power of the multiplexers,
PFU is the power of the functional units, and
PINT is the power of the physical interconnect capacitance.
High-Level Power Estimation: PREG
Compute the lifetimes of all the variables in the given VHDL code.
Represent the lifetime of each variable as a vertical line from statement i through statement i + n in the column j reserved for the corresponding variable vj.
Determine the maximum number N of overlapping lifetimes by computing the maximum number of vertical lines intersecting any horizontal cut-line.
Estimate the minimal number N of register sets necessary to implement the code by using register sharing. Register sharing is applied whenever a group of variables with the same bit-width bi have non-overlapping lifetimes.
Select a possible mapping of variables onto registers by using register sharing.
Compute the number wi of writes to the variables mapped to the same set of registers. Estimate the activity ni of each register set by dividing wi by the number of statements S: ni = wi / S; hence TRi,max = ni * fclk.
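The lifetime-overlap step above can be sketched with hypothetical helpers, assuming each lifetime is given as a (first-write, last-read) pair of statement indices:

```python
def max_overlap(lifetimes):
    """lifetimes: {var: (start, end)} -> max number of simultaneously live variables."""
    events = []
    for start, end in lifetimes.values():
        events.append((start, +1))    # lifetime begins at this statement
        events.append((end + 1, -1))  # lifetime ends after the last read
    live = peak = 0
    for _, delta in sorted(events):   # sweep over the statement axis
        live += delta
        peak = max(peak, live)
    return peak                        # = N, the minimal register count

def toggle_rate(writes, num_statements, f_clk):
    """n_i = w_i / S writes per statement; TR_max = n_i * f_clk."""
    return writes / num_statements * f_clk
```

`max_overlap` is the horizontal cut-line count from the steps above; real register sharing would additionally group only variables of equal bit-width.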
Power in latches and flip-flops is consumed not only during output transitions, but also on every clock edge, by the internal clock buffers.
The non-switching power PNSK dissipated by the internal clock buffers accounts for about 30% of the average power in a 0.38-micron, 3.3 V technology.
In total, the register power combines the switching and non-switching components.
PCNTR
After scheduling, the control is defined and optimized by the hardware mapper, and further by the logic synthesis process before mapping to layout. Like interconnect, therefore, the control needs to be estimated statistically.
Local control model: the local controllers account for a larger percentage of the total capacitance than the global controller. Here Ntrans is the number of transitions, Nstates is the number of states, Clc is the capacitance switched in any local controller in one sample period, and Bf is the ratio of the number of bus accesses to the number of busses.
Global control model
Ntrans: the number of transitions depends on the assignment, scheduling, optimizations, logic optimization, the standard-cell library used, the amount of glitching, and the statistics of the inputs.
Coarse-grained model
The coarse-grained model provides a fast estimate of the power consumption when no information about the activity of the input data to the functional units is available.
Fine-grained model
The fine-grained model is used when information about the activity of the input data to the functional units is available.
[Figure: power model of an 8 x 8-bit Booth multiplier]
Loop Interchange
If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than execution order (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
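The interchange can be illustrated in Python (illustrative only: Python lists of lists keep each row contiguous, so the row-by-row order touches memory sequentially while the column-by-column order strides across rows; which order wins depends on the storage layout, as noted above):

```python
def sum_col_major(A, n):
    """Column-index outer loop: strided access for row-contiguous storage."""
    s = 0
    for j in range(n):
        for i in range(n):
            s += A[i][j]
    return s

def sum_row_major(A, n):
    """Loops interchanged: sequential access for row-contiguous storage."""
    s = 0
    for i in range(n):
        for j in range(n):
            s += A[i][j]
    return s
```

Both compute the same sum; only the memory-access order, and hence the cache behavior, differs.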
Motion Estimation
Retiming
Flip-flop insertion to minimize hazard activity by moving a flip-flop in a circuit.
Power reduction: by sampling a steady-state signal at a register input, no more glitches are propagated through the following combinational logic.
Common patterns enable the design of less complex architecture and therefore simpler interconnect structure (muxes, buffers, and buses). Regular designs often have less control hardware.
Module Selection
Select the clock period, choose proper hardware modules for all operations (e.g., Wallace or Booth multiplier), and determine where to pipeline (i.e., where to put registers), such that a minimal hardware cost is obtained under the given timing and throughput constraints.
Full pipelining can be ineffective when the clock period mismatches the execution times of the operators; performing operations in sequence without intermediate buffering (chaining) can result in a reduction of the critical path.
When clustering operations into non-pipelined hardware modules, the reusability of these modules over the complete computational graph should be maximized.
During clustering, more expensive but faster hardware may be swapped in for operations on the critical path if the clustering violates timing constraints.
Estimate min and max bounds on the required resources to delimit the design space. The min bounds serve as an initial solution and as entries in a resource utilization table which guides the transformation, assignment and scheduling operations.
The max bound on execution time, tmax, is obtained by a topological ordering of the DFG using ASAP and ALAP.
The minimum bound on the number of resources for each resource class is NRi >= ceil(ORi * dRi / tmax), where NRi is the number of resources of class Ri, dRi is the duration of a single operation, and ORi is the number of operations of that class.
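A minimal sketch of the classical lower bound on the number of resources of each class (NRi >= ceil(ORi * dRi / tmax), using the symbols defined above, and assuming each operation occupies one resource for its full duration):

```python
import math

def min_resources(num_ops, op_duration, t_max):
    """Lower bound on resources: total busy time divided by the cycle budget."""
    return math.ceil(num_ops * op_duration / t_max)
```

For example, 6 operations of 2 cycles each inside a 4-cycle budget need at least 3 resources of that class.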
Find the minimal-area solution subject to the timing constraints.
By checking the critical paths, it is determined whether the proposed graph violates the timing constraints. If so, retiming, pipelining and tree-height reduction can be applied.
After an acceptable graph is obtained, the resource allocation process is initiated, and the graph is transformed to reduce the hardware requirements.
A rejectionless probabilistic iterative search technique (a variant of simulated annealing), in which moves are always accepted, is used; this reduces computational complexity and gives faster convergence.
Scheduling and Binding
The scheduling task selects the control step in which a given operation will happen, i.e., it assigns each operation to an execution cycle.
Sharing: bind a resource to more than one operation. Such operations must not execute concurrently. The graph is scheduled hierarchically in a bottom-up fashion.
Power tradeoffs:
The schedule directly impacts resource sharing.
Energy consumption depends on what the previous instruction was; operations can be reordered to minimize switching on the control path.
Clock selection.
Eliminate slacks.
ASAP Scheduling
Mechanical analogy.
[Figures: probability of scheduling operations into control steps, before and after operation o3 is scheduled to step s2; operator cost for multiplications in (a) and (c)]
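As a sketch, ASAP scheduling assigns each operation to the earliest control step at which all of its DFG predecessors have completed (unit-delay operations and a predecessor-map representation are assumptions here):

```python
def asap_schedule(deps):
    """deps: {op: [predecessor ops]} -> {op: control step (1-based)}."""
    step = {}

    def visit(op):
        if op not in step:
            # an op starts one step after its latest predecessor; sources get step 1
            step[op] = 1 + max((visit(p) for p in deps[op]), default=0)
        return step[op]

    for op in deps:
        visit(op)
    return step
```

ALAP scheduling is the mirror image (latest possible step), and the interval between the two gives each operation's mobility.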
List scheduling: a ready-operation list is maintained under the resource constraints and ordered by a priority function (priority list).
Decompose the computation into strongly connected components (SCCs);
Merge any adjacent trivial SCCs into a sub-part;
Use pipelining to isolate the sub-parts;
For each sub-part:
    if the sub-part is linear, apply optimal unfolding;
Merge linear sub-parts to further optimize;
Schedule the merged sub-parts to minimize memory usage.
SCC decomposition step
Using the standard depth-first search-based algorithm [Tarjan,1972] which has a low order polynomial-time complexity. For any pair of operations A and B within an SCC, there exist both a path from A to B and a path from B to A. The graph formed by all the SCCs is acyclic. Thus, the SCCs can be isolated from each other using pipeline delays, which enables us to optimize each SCC separately.
Identifying SCCs
The first step of the approach is to identify the computation's strongly connected components.
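Tarjan's algorithm [Tarjan, 1972], used in the decomposition step, can be sketched as follows (the adjacency-dict representation is an assumption):

```python
def tarjan_scc(graph):
    """graph: {node: [successors]} -> list of strongly connected components."""
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w); comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs
```

Since the graph formed by the SCCs is acyclic, each component returned here can then be isolated with pipeline delays and optimized separately, as described above.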
Pipeline operations inside a loop.
Overlap execution of operations.
Use pipeline scheduling for loop graph model.
DFG Restructuring
Control Synthesis
Optimum binding
Resource sharing can destroy signal correlations and increase switching activity; it should be done between operations that are strongly connected.
Map operations with correlated input signals to the same units
Regularity: repeated patterns of computation (e.g., (+, *), (*, *), (+, >)) simplify the interconnect (busses, multiplexers, buffers).
Datapath interconnections
Multiplexer-oriented datapath
Bus-oriented datapath
Insertion of Latch (out)
Insertion of latches at the output ports of the functional units
Insertion of Latch (in/out)
Insertion of latches at both the input and output ports of the functional units
Overlapping data transfer with functional-unit execution
Scheduled DFG
Graph model
Register Allocation
Allocation: bind registers and functional modules to variables and operations in the CDFG, and specify the interconnection among modules and registers in terms of MUXes or busses.
Reduce capacitance during allocation by minimizing the number of functional modules, registers, and multiplexers.
Register allocation and variable assignment can have a profound impact on spurious switching activity in a circuit
Judicious variable assignment is a key factor in the elimination of spurious operations
Effect of Register Sharing on FU
For a small increase in the number of registers, it is possible to significantly reduce or eliminate spurious operations
For Design 1, synthesized from Assignment 1, the power consumption was 30.71 mW.
For Design 2, synthesized from Assignment 2, the power consumption was 18.96 mW, in a 1.2-micron standard-cell library.
Operand sharing
To schedule and bind operations to functional units in such a way that the activity of the input operands is reduced.
Operations sharing the same operand are bound to the same functional unit and scheduled in such a way that the function unit can reuse that operand.
We will call operand reutilization (OPR) the situation in which an operand is reused by two operations executed consecutively in the same functional unit.
Loop unrolling
The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality.
Loop Unrolling Effects
Loop overhead is cut in half because two iterations are performed per loop-body execution.
If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body.
Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
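The transformation can be illustrated with a hypothetical scaling loop, unrolled by a factor u = 2:

```python
def scale_rolled(a, c):
    """Original loop: one element per trip, one loop-overhead check each."""
    for i in range(len(a)):
        a[i] = c * a[i]

def scale_unrolled(a, c):
    """Unrolled by u = 2: two elements per trip, half the loop overhead."""
    i, n = 0, len(a)
    while i + 1 < n:
        a[i] = c * a[i]           # the two statements are independent,
        a[i + 1] = c * a[i + 1]   # so they expose instruction parallelism
        i += 2
    if i < n:                     # epilogue for odd trip counts
        a[i] = c * a[i]
```

Both versions compute the same result; the unrolled body simply does more work per iteration, as described in the bullets above.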
Loop Unrolling (IIR filter example)
Loop unrolling localizes the data to reduce the activity at the inputs of the functional units; two output samples are computed in parallel based on two input samples.
Neither the capacitance switched nor the voltage is altered by unrolling itself. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation, the transformation yields a critical path of 3, so the voltage can be dropped.
This obtains a power reduction of 9.4%.
Operand Retaining
An idle functional unit may see its input operands change because of variations in the multiplexer selection signals.
The operand-retaining technique attempts to minimize the useless power consumption of the idle functional units.
Minimize the useless power consumption of idle units
(a) with a proper register binding that minimizes the activity of the functional units;
(b) by wisely defining the control signals of the multiplexers during the idle cycles in such a way that the changes at the inputs of the functional units are minimized (this may require defining some of the don't-care values of the control signals) [RJ94];
(c) by latching the operands of those units that will often be idle.
(c) Latching the operands
This consists of the insertion of latches at the inputs of the functional units, to store the operands only when the unit requires them. Thus, in those cycles in which the unit is idle, no consumption is produced.
The control unit has to be redesigned accordingly, in such a way that the input latches become transparent during those cycles in which the corresponding functional unit must execute an operation.
This is similar to putting the functional units to sleep when they are not needed through gated-clocking strategies [CSB94, BSdM94, AMD94].
LMS Filter Example
The power consumption generated by the idle units (the useless consumption) and the power consumption due to the useful calculations (the useful consumption) are estimated separately; the estimated reduction in total power consumption is 34%.
Interconnect power reduction
A spatially local cluster is a group of algorithm operations that are tightly connected to each other in the flowgraph representation. Two nodes are tightly connected if the shortest distance between them, in terms of the number of edges traversed, is low.
A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware.
Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (over local busses) and relatively few occur between clusters (over global busses). The partitioning information is passed to the architecture netlist and floorplanning tools.
Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.
Hardware Mapping
The last step in the synthesis process maps the allocated, assigned and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks. The result of this process is a structural description of the processor architecture (e.g., sdl input to the Lager IV silicon assembly environment).
The mapping process transforms the flow graph into three structural sub-graphs:
the data path structure graph
the controller state machine graph
the interface graph (between data path control inputs and the controller output signals)
Spectral Partitioning in High-Level Synthesis
The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together; the relative distances are a measure of the tightness of the connections. The eigenvector ordering is used to generate several partitioning solutions.
The area estimates are based on distribution graphs. A distribution graph displays the expected number of operations executed in each time slot.
Local bus power is proportional to the number of local data transfers times the area of the cluster; global bus power is proportional to the number of global data transfers times the total area.
Interconnection Estimation
For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height.
The average global bus length is estimated as the square root of the estimated chip area. The chip-area estimate has three terms, representing white space, the active area of the components, and the wiring area; the coefficients are derived statistically.
Datapath Generation
Register-file recognition and multiplexer reduction: individual registers are merged as much as possible into register files. This reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses) and the number of control signals (since a register file uses a local decoder).
The multiplexer and I/O bus costs are minimized simultaneously (clique partitioning is NP-complete, so simulated annealing is used).
Data path partitioning optimizes the processor floorplan. The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.
[Slides, original text partially lost: a minimum-cost network-flow formulation for register and resource sharing. Variable lifetimes form a flow network with unit edge capacities (Cij = 1) and edge costs Bij; starting from flow = 0, augmenting paths from source S to sink T are found repeatedly, and each path groups compatible variables into one register, e.g., PATH 2: S-b-T gives REG2 = {b}; PATH 3: S-c-g-T gives REG3 = {c, g}; PATH 4: S-d-T gives REG4 = {d}. The same network formulation is applied to resource sharing with multiplexer cost Wmux over the CDFG. The algorithm runs in polynomial time and yields an optimal solution, reported to give about a 15% improvement, within a top-down methodology (DSP, microcontroller, ASIC, etc.).]
Bit-swapping on a programmable DSP data bus can reduce switching activity. Bus switching dominates power dissipation in bus-based systems; off-chip busses can account for about 70% of chip power. In a CPU's ALU datapath, reducing signal transitions on the busses feeding the ALU reduces this component.
[Figure: ALU with input multiplexers mux-in1 and mux-in2, registers reg-in1, reg-in2, reg-in3, and data busses databus-in1 and databus-in2]
C1 = Y1(A2) XOR Y2(B2): a bit in Y2(A2) and a bit in Y2(B2) are swapped bit-wise when the corresponding bit in C1 is "1".
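A hedged sketch of the quantity such bus-encoding schemes aim to minimize, namely the number of bit transitions between consecutive words on the bus (this helper is illustrative, not the bit-swapping encoder itself):

```python
def bus_transitions(words):
    """Total Hamming distance between successive bus words (bit toggles)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
```

Comparing this count before and after an encoding such as bit-swapping shows the switching (and hence dynamic power) saved on the bus.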
Verification: simulation with test vectors.
References
[1] D. Gajski and N. Dutt, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.
[2] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, 1994.
[3] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE J. of Solid-State Circuits, pp. 473-484, 1992.
[4] A. P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R. W. Brodersen, "Optimizing power using transformations," IEEE Tr. on CAD/ICAS, pp. 12-31, Jan. 1995.
[5] E. Musoll and J. Cortadella, "Scheduling and resource binding for low power," Int'l Symp. on System Synthesis, pp. 104-109, Apr. 1995.
[6] Y. Fang and A. Albicki, "Joint scheduling and allocation for low power," in Proc. of Int'l Symp. on Circuits & Systems, pp. 556-559, May 1996.
[7] J. Monteiro and P. Ashar, "Scheduling techniques to enable power management," 33rd Design Automation Conference, 1996.
[8] R. S. Martin and J. P. Knight, "Optimizing power in ASIC behavioral synthesis," IEEE Design & Test of Computers, pp. 58-70, 1995.
[9] R. Mehra and J. Rabaey, "Exploiting regularity for low power design," IEEE Custom Integrated Circuits Conference, pp. 177-182, 1996.
[10] A. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low power CMOS digital design," Journal of Solid-State Circuits, pp. 473-484, 1992.
[11] R. Mehra and J. Rabaey, "Behavioral level power estimation and exploration," in Proc. of Int'l Symp. on Low Power Design, pp. 197-202, Apr. 1994.
[12] A. Raghunathan and N. K. Jha, "An iterative improvement algorithm for low power data path synthesis," in Proc. of Int'l Conf. on Computer-Aided Design, pp. 597-602, Nov. 1995.
[13] R. Mehra and J. Rabaey, "Low power architectural synthesis and the impact of exploiting locality," Journal of VLSI Signal Processing, 1996.
[14] M. B. Srivastava, A. P. Chandrakasan, and R. W. Brodersen, "Predictive system shutdown and other architectural techniques for energy efficient programmable computation," IEEE Tr. on VLSI Systems, pp. 42-55, Mar. 1996.
[15] A. Abnous and J. M. Rabaey, "Ultra low power domain specific multimedia processors," in Proc. of IEEE VLSI Signal Processing Workshop, Oct. 1996.
[16] M. C. McFarland, A. C. Parker, and R. Camposano, "The high-level synthesis of digital systems," Proceedings of the IEEE, vol. 78, no. 2, February 1990.
[17] A. Chandrakasan, S. Sheng, and R. Brodersen, "Low power CMOS digital design," IEEE J. of Solid-State Circuits, April 1992.
[18] A. Chandrakasan and R. Brodersen, Low Power Digital CMOS Design. Kluwer Academic Publishers, 1995.
[19] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou, "Precomputation-based sequential logic optimization for low power," IEEE International Conference on Computer-Aided Design, 1994.
[20] J. Monteiro, S. Devadas, and A. Ghosh, "Retiming sequential circuits for low power," in Proc. of the IEEE International Conference on Computer-Aided Design, November 1993.
[21] F. J. Kurdahi and A. C. Parker, "REAL: A program for register allocation," in Proc. of the 24th Design Automation Conference, ACM/IEEE, pp. 210-215, June 1987.
[22] A. Wolfe, "A case study in low-power system level design," in Proc. of the IEEE International Conference on Computer Design, Oct. 1995.
[23] T. D. Burd and R. W. Brodersen, "Energy efficient CMOS microprocessor design," in Proc. 28th Annual Hawaii International Conf. on System Sciences, January 1995.
[24] A. Dasgupta and R. Karri, "Simultaneous scheduling and binding for power minimization during microarchitectural synthesis," in Int. Symposium on Low Power Design, pp. 69-74, April 1995.
[25] R. S. Martin, "Optimizing power consumption, area and delay in behavioral synthesis," PhD thesis, Department of Electronics, Faculty of Engineering, Carleton University, March 1995.
[26] A. Matsuzawa, "Low-power portable design," in Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, March 1996. Invited lecture.
[27] J. D. Meindl, "Low-power microelectronics: retrospect and prospect," Proceedings of the IEEE, 83(4):619-635, April 1995.
[Figure: example CDFG fragments with add, abs, comparison, subtraction, division and multiplication nodes]
* * *