ECE 465
High Level Design Strategies
Lecture Notes # 9
Shantanu Dutt
Electrical & Computer Engineering
University of Illinois at Chicago
Outline
• Circuit Design Problem
• Solution Approaches:
  – Truth Table (TT) vs. Computational/Algorithmic: yes, hardware, just like software, can implement any algorithm (after all, software runs on hardware)!
  – Flat vs. Divide-&-Conquer
  – Divide-&-Conquer:
    • Associative operations/functions
    • General operations/functions
  – Other Design Strategies for fast circuits:
    • Design for all cases
    • Speculative computation
    • Best of both worlds (best average and best worst-case)
    • Pipelining
• Summary
Circuit Design Problem
• Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B.
• Approach 1: The TT approach: write down a TT with 16 input bits (2^16 = 65,536 rows), derive a logic expression from it, minimize it, obtain a gate-based realization, etc.!

A         B         F
00000000  00000000  0
00000000  00000001  0
...       ...       ...
00000001  00000000  1
...       ...       ...
11111111  11111111  0

– Too cumbersome and time-consuming
– Fraught with possibility of human error
– Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
– Will generally have high hardware cost (including wiring, which can be unstructured and messy) and delay
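
To get a feel for the size of the flat TT, the short sketch below enumerates it in software. The function name `comparator_truth_table` is made up for this illustration and is not part of the lecture.

```python
def comparator_truth_table(width=8):
    """Yield (A, B, F) rows of the flat truth table, with F = 1 iff A > B."""
    for a in range(2 ** width):
        for b in range(2 ** width):
            yield a, b, int(a > b)

rows = list(comparator_truth_table(8))
num_rows = len(rows)                    # one row per (A, B) pair: 2^16
num_ones = sum(f for _, _, f in rows)   # rows in which F = 1
print(num_rows, num_ones)
```

Every one of these rows would have to be written down and minimized by hand in the TT approach, which is what makes it cumbersome.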
Circuit Design Problem (contd)
• Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute:
• Approach 2(a): Flat computational/programming approach:
– Note: A TT can be expressed as a sequence of “if-then-else’s”
– If A = 00000000 and B = 00000000 then F = 0
  else if A = 00000000 and B = 00000001 then F = 0
  ...
  else if A = 00000001 and B = 00000000 then F = 1
  ...
– Essentially a re-hashing of the TT: same problems as the TT approach
Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Approach 2(b): Structured algorithmic approach:
– Be more innovative: think of the structure/properties of the problem that can be used to solve it in a hierarchical or divide-&-conquer (D&C) manner.
– D&C approach: See if the problem can be:
  • “broken up” into 2 or more smaller subproblems of the same or different type(s). Two kinds of breaks are possible:
    - by # of operands: partition the set of n operands into 2 or more subsets of operands (e.g., adding n numbers)
    - by operand size: break a constant # of n-bit operands into smaller-size operands (this mainly applies when the # of operands is a constant, e.g., add/mult of 2 #s)
  • whose solns can be “stitched up” (by a stitch-up function) to give a soln. to the parent problem
  • also, consider whether there is a dependency between the sub-probs (results of some required to solve the other(s))
– Do this recursively for each subprob until the subprobs are small enough (the leaf problems) for TT solutions
– If the subproblems are of a similar kind (but of smaller size) to the root problem then the breakup and stitching will also be similar; if not, they have to be broken up differently

[Figure: D&C tree. Root problem A is broken into Subprob. A1 and Subprob. A2, which are broken into A1,1, A1,2 and A2,1, A2,2. The solns to A1 and A2 are stitched up to form the complete soln to A; do this recursively until the subprob. size is s.t. a TT-based design is doable. Legend: D&C breakup arrows; data/signal flow to solve a higher-level problem; possible data flow betw. sub-problems (data dependency?).]
Circuit Design Problem: Strategy 1: Divide-&-Conquer
• Especially for D&C breakups in which (a) the subproblems are of the same problem type as the root problem, and (b) there is no data dependency between subproblems, the final circuit will be a “tree” of stitch-up functions (of either the same size or different sizes at different levels, depending on the problem) with leaf functions at the bottom of the tree, as shown in the figure below for a 2-way breakup of each problem/sub-problem.
• A tree is an interconnection structure with nodes and edges/arcs connecting the nodes, so that the nodes can be arranged in a levelized manner such that each node (other than the root) is connected to a unique node called its parent at a higher level (generally a lower #’ed level, where the top level is numbered level 1, and the bottom or leaf level has the highest level #). A binary tree is one in which each node has at most two children (leaf nodes have none).
• Solving a problem using D&C generally yields a fast, low-cost and streamlined design (the wiring required is structured, not jumbled up and messy).

[Figure: binary tree of stitch-up functions, Level 1 (root), Level 2, ..., Level (log n), n = # of leaf nodes; the bottom level holds the leaf functions, each with 2 i/ps.]

Note: breaking an n-bit/n-operand problem down to 2-bit/2-operand problems takes (log n)-1 levels of breakups and gives (log n) levels of logic nodes: leaf functions (1 level) and stitch-ups ((log n)-1 levels).
Shift Gears: Design of a Parity Detection Circuit: An n-input XOR

[Figure (a): A linearly-connected circuit of 2-i/p XORs with i/ps x(0), x(1), x(2), x(3), ..., x(15) and o/p f.]

• No concurrency in design (a): the actual problem has available concurrency, though, and it is not exploited well in the above “linear” design
• Complete sequentialization, leading to a delay that is linear in the # of bits n (delay = (n-1)*td, td = delay of 1 gate)
• All the available concurrency is exploited in design (b), a parity tree (see next slide).
• Question: When can we have a circuit for an operation/function on multiple operands built of “gates” performing the same operation for fewer (generally a small number, betw. 2 and 5) operands?
• Answer: When the operation is associative. An oper. “x” is said to be associative if:
  a x b x c = (a x b) x c = a x (b x c)
  or, stated as a function: f(a, b, c) = f(f(a,b), c) = f(a, f(b,c))
• Note: An operation/function that is not associative (e.g., NAND of n bits/operands) can still be broken up into smaller operations, just not of the same type as the original operation
• Associativity implies that, for example, if we have 4 operands a x b x c x d = f(a,b,c,d), we can perform this either as:
  – a x (b x (c x d)) [getting a linear delay of 3 units, or in general n-1 units for n operands], i.e., in function notation: f(a, f(b, f(c,d)))
  – or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that “x” is associative], i.e., in function notation: f(f(a,b), f(c,d))
• Is XOR associative? Yes.
• The parenthesisation corresp. to the above ckt is:
  – (...(((x(0) xor x(1)) xor x(2)) xor x(3)) xor ... xor x(15))
• All these Qs can be answered “automatically” by the D&C approach
Shift Gears: Design of a Parity Detection Circuit: A Series of XORs

[Figure (b): 16-bit parity tree. I/ps x(15), x(14), ..., x(1), x(0) feed 8 XORs with o/ps w(3,0) ... w(3,7), then 4 XORs with o/ps w(2,0) ... w(2,3), then 2 XORs with o/ps w(1,0), w(1,1), then a final XOR with o/p w(0,0) = f.]

Delay = (# of levels in XOR tree) * td = log2(n) * td

An example of simple designer ingenuity. A bad design would have resulted in a linear delay; an ingenious (simple enough though) & well-informed design results in a log delay, and both have the same gate i/p cost.

• If we have 4 operands a x b x c x d, we can perform this either as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that “x” is associative].
• We can extend this idea to n operands (and n-1 operations) to perform as many of the pairwise operations as possible in parallel (and do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a (log2 n) delay.
• In fact, any parenthesisation of operands is correct for an associative operation/function, but the above one is fastest. Surprisingly, any parenthesisation leads to the same h/w cost: n-1 2-i/p gates, i.e., 2(n-1) gate i/ps. Why? Analyze.

Parenthesization of tree-circuit: (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))
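
The two parenthesisations can be compared in software. The sketch below is only a model (each 2-i/p XOR counts as one gate with one unit of delay), and the names `xor_chain` and `xor_tree` are illustrative, not from the slides.

```python
def xor_chain(bits):
    """Linear design (a): ((x0 ^ x1) ^ x2) ^ ... Returns (value, gates, delay)."""
    value, gates, delay = bits[0], 0, 0
    for b in bits[1:]:
        value ^= b
        gates += 1
        delay += 1          # each gate waits for the previous gate's o/p
    return value, gates, delay

def xor_tree(bits):
    """Tree design (b): balanced parenthesisation. Returns (value, gates, delay)."""
    if len(bits) == 1:
        return bits[0], 0, 0
    mid = len(bits) // 2
    lv, lg, ld = xor_tree(bits[:mid])
    rv, rg, rd = xor_tree(bits[mid:])
    # the two halves are computed concurrently, so only the slower one counts
    return lv ^ rv, lg + rg + 1, max(ld, rd) + 1

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
chain_result = xor_chain(bits)
tree_result = xor_tree(bits)
```

For these 16 bits both designs use 15 gates and give the same parity; only the delay differs (15 units for the chain, 4 for the tree).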
D&C for Associative Operations
• Let $f(x_{n-1}, \dots, x_0)$ be an associative function.
• Can the D&C approach be used to yield an efficient, streamlined n-bit xor/parity function w/o having to go through an involved process as we saw for the parity detector? Can it lead automatically to a tree-based ckt?
• What is the D&C principle involved here?

[Figure: the root $f(x_{n-1}, \dots, x_0)$ is broken up into $f(x_{n-1}, \dots, x_{n/2})$ and $f(x_{n/2-1}, \dots, x_0)$, whose o/ps a and b feed the stitch-up function f(a,b). The stitch-up function is the same as the original function for 2 inputs, i.e.,
$f(x_{n-1}, \dots, x_0) = f(f(x_{n-1}, \dots, x_{n/2}), f(x_{n/2-1}, \dots, x_0))$]

• Using the D&C approach for an associative operation results in a breakup by # of operands and the stitch-up function being the same as the original function (this is not the case for non-assoc. operations), but w/ a constant # of operands (2, if the original problem is broken into 2 subproblems); see the formulation in the figure.
• Also, there are no dependencies between sub-problems
• If the two sub-problems of the D&C approach are balanced (of the same size or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type seen earlier for the xor/parity function, of (log n) delay
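
The breakup-by-operands formulation above can be written as a generic recursive routine. This is a minimal software sketch, assuming the caller passes a 2-input associative function; `dc_reduce` is a hypothetical name chosen for this example.

```python
from operator import xor, and_, add

def dc_reduce(f, operands):
    """Apply an associative 2-input function f to all operands by balanced D&C."""
    if len(operands) == 1:          # leaf: a single operand needs no gate
        return operands[0]
    mid = len(operands) // 2        # balanced breakup by # of operands
    left = dc_reduce(f, operands[:mid])
    right = dc_reduce(f, operands[mid:])
    return f(left, right)           # stitch-up = the same function on 2 inputs

parity = dc_reduce(xor, [1, 0, 1, 1])
all_ones = dc_reduce(and_, [1, 1, 1, 0])
total = dc_reduce(add, list(range(10)))
```

Because the two recursive calls do not depend on each other, a hardware realization can evaluate them concurrently, which is where the (log n) delay comes from.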
D&C for Associative Operations (cont’d)
• Parity detector example

[Figure: the 16-bit parity tree (i/ps x(15), x(14), ..., x(1), x(0); o/ps w(3,0) ... w(3,7), w(2,0) ... w(2,3), w(1,0), w(1,1), w(0,0) = f), viewed as a D&C breakup by operands: 16-bit parity = two 8-bit parity sub-problems, with stitch-up function = 2-bit parity/xor.]

Delay = (# of levels in XOR tree) * td = log2(n) * td
D&C Approach for Non-Associative Opers: n-bit > Comparator
• O/P = 1 if A > B, else 0
• Is this associative? The issue of associativity mainly applies to n operands, not to the n bits of 2 operands
• For a non-associative function, determine the properties that allow a break-up & a correct stitch-up function to be found
• Useful property: At any level, the comp. of the MS (most significant) half determines the o/p if its result is > or <; else the comp. of the LS ½ determines the o/p
• Can thus break up the problem at any level into MS ½ and LS ½ comparisons & based on their results determine which o/p to choose for the higher-level (parent) result
• However, need to solve an extended version of the root problem in the sub-probs to be able to realize the stitch-up function: need to think through the problem almost from scratch; there is no one-size-fits-all recipe!
• No sub-problem dependency

[Figure: D&C tree, breakup by size/bits. Root A: Comp. A[7..0], B[7..0], with sub-probs A1: Comp A[7..4], B[7..4] and A2: Comp A[3..0], B[3..0]; stitch-up: if A1 result is > or < take A1 result, else take A2 result. A1 is broken into A1,1: Comp A[7..6], B[7..6] and A1,2: Comp A[5,4], B[5,4]; stitch-up: if A1,1 result is > or < take A1,1 result, else take A1,2 result. A1,1 is broken into A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6]; stitch-up: if A1,1,1 result is > or < take A1,1,1 result, else take A1,1,2 result. The leaves are small enough to be designed using a TT.]
D&C Approach for Non-Associative Opers: n-bit > Comparator (cont’d)
A[i]  B[i]  f1(i)  f2(i)
0     0     0      1
0     1     0      0
1     0     1      0
1     1     0      1

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }    /* f2(i) o/p is an i/p to the stitch logic */
    /* f2(i) = 1 means that the f1( ), f2( ) o/ps of the LS ½ of this subtree should be selected by the stitch logic as the parent's o/ps */
else if A[i] < B[i] then { f1(i) = 0;           /* indicates < */
                           f2(i) = 0 }          /* indicates f1(i), f2(i) o/ps should be selected as parent’s o/p */
else if A[i] > B[i] then { f1(i) = 1;           /* indicates > */
                           f2(i) = 0 }          /* indicates f1(i), f2(i) o/ps should be selected as parent’s o/p */

The TT may be derived directly or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT.
[Figure: the comparator D&C tree (breakup by size/bits) with the same sub-probs and stitch-up rules as on the previous slide; the leaf (2-bit 2-o/p comparator) is small enough to be designed using a TT.]
Comparator Circuit Design Using D&C (contd.)

[Figure: the comparator D&C tree of the previous slides (Comp. A[7..0], B[7..0] broken up by size/bits down to Comp A[7], B[7] and Comp A[6], B[6], with the same stitch-up rules).]

• Once the D&C tree is formulated it is easy to get the low-level & stitch-up designs
• Stitch-up design shown here

Stitch-up logic details for subprobs i & i-1:
If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) }        /* select MS ½ comp. o/ps */
else { my_op1 = f1(i-1); my_op2 = f2(i-1) }                 /* select LS ½ comp. o/ps */

Stitch-up logic (Compact TT): i/ps f1(i), f2(i), f1(i-1), f2(i-1); o/ps my_op1, my_op2

f1(i)  f2(i)  f1(i-1)  f2(i-1)  my_op1   my_op2
X      0      X        X        f1(i)    f2(i)
X      1      X        X        f1(i-1)  f2(i-1)

OR (Direct design): a 2-bit 2:1 Mux with 2-bit data i/ps I0 = f(i) = f1(i),f2(i) and I1 = f(i-1), select i/p f2(i), and 2-bit o/p my_op.

Leaf (1-bit comparator) TT:

A[i]  B[i]  f1(i)  f2(i)
0     0     0      1
0     1     0      0
1     0     1      0
1     1     0      1
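
The leaf TT and the stitch-up logic can be checked together with a small software model. This is a sketch under stated assumptions: bit lists are given MS bit first, and the helper names (`leaf`, `stitch`, `compare`, `to_bits`, `greater_than`) are invented for this example.

```python
def leaf(a_bit, b_bit):
    """1-bit comparator TT: returns (f1, f2) = (A[i] > B[i], A[i] == B[i])."""
    return int(a_bit > b_bit), int(a_bit == b_bit)

def stitch(ms, ls):
    """2-bit 2:1 mux: pass the MS-half result unless it reports equality (f2 = 1)."""
    return ls if ms[1] == 1 else ms

def compare(a_bits, b_bits):
    """D&C comparator on bit lists (MS bit first); returns (f1, f2) of the root."""
    if len(a_bits) == 1:
        return leaf(a_bits[0], b_bits[0])
    mid = len(a_bits) // 2
    ms = compare(a_bits[:mid], b_bits[:mid])    # MS half
    ls = compare(a_bits[mid:], b_bits[mid:])    # LS half, independent of the MS half
    return stitch(ms, ls)

def to_bits(value, width):
    """Unsigned integer -> list of bits, MS bit first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]

def greater_than(a, b, width=8):
    """F = 1 iff A > B, i.e., the f1 o/p at the root of the tree."""
    return compare(to_bits(a, width), to_bits(b, width))[0]
```

For example, `greater_than(185, 39)` returns 1 and `greater_than(39, 185)` returns 0. Note that only f1 is needed at the root, which matches the final 1-bit mux in the full design.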
Comparator Circuit Design Using D&C – Final Design
[Figure: eight 1-bit comparators with i/ps A[i], B[i] and 2-bit o/ps f(i), i = 7..0, at the leaf level. Four 2-bit 2:1 Muxes my(3), my(2), my(1), my(0) follow, with select i/ps f2(7) = f(7)(2), f(5)(2), f(3)(2), f(1)(2). Two 2-bit 2:1 Muxes my(5), my(4) follow, with select i/ps my(3)(2) and my(1)(2). A final 1-bit 2:1 Mux with data i/ps my(5)(1), my(4)(1) and select i/p my(5)(2) gives F = my1(6). Log n levels of Muxes. Critical path marked (all paths in this ckt are critical).]

• Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 1-bit comp.
• Note parallelism at work: multiple logic blocks are processing simult.
• Delay(n-bit comp.) = (log n)*(delay of 2:1 Mux) + delay of 1-bit comp.
• H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Mux)) + 8*(H/W_cost(1-bit comp.))
• H/W_cost(n-bit comp.) = (n-1)*(H/W_cost(2:1 Mux)) + n*(H/W_cost(1-bit comp.))
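
The delay and cost expressions above reduce to three counts, which the sketch below computes for n a power of 2. The function name `comparator_metrics` is an assumption made for this example.

```python
from math import log2

def comparator_metrics(n):
    """Block counts for the D&C comparator on n-bit operands (n a power of 2)."""
    mux_levels = int(log2(n))        # stitch-up levels on the critical path
    num_muxes = n - 1                # one stitch-up per internal node of the tree
    num_leaf_comparators = n         # one 1-bit comparator per bit position
    return mux_levels, num_muxes, num_leaf_comparators

metrics_8 = comparator_metrics(8)
metrics_64 = comparator_metrics(64)
```

Multiplying these counts by the delay or cost of a 2:1 Mux and of a 1-bit comparator gives the expressions on the slide.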
Mux Design Using D&C

$2^n$:1 mux problem: When control inputs = j, $0 \le j \le 2^n - 1$, input $I_j$ is connected to the output.

Two sets of operands: data operands ($2^n$) and control/select operand (n bits)

(a) Top-Down design (D&C)

[Figure: the $2^n$:1 MUX with data i/ps $I_0 \dots I_{2^n-1}$ and select i/ps $S_{n-1} \dots S_0$ is broken up into two $2^{n-1}$:1 MUXes, both with select i/ps $S_{n-2} \dots S_0$. One takes data i/ps $I_0 \dots I_{2^{n-1}-1}$ (all bits except msb should have different combinations; msb should be at a constant value, here 0); the other takes $I_{2^{n-1}} \dots I_{2^n-1}$ (all bits except msb should have different combinations; msb should be at a constant value, here 1). The MSB value should differ among these 2 groups. Stitch-up: a 2:1 Mux with select i/p $S_{n-1}$. The breakup is by operands (data), with a simultaneous breakup by bits (select).]
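
The top-down breakup can be mirrored directly in a recursive routine. This is an illustrative sketch only; `mux2` and `mux_top_down` are hypothetical names, and the select bits are passed MS bit first.

```python
def mux2(i0, i1, s):
    """2:1 mux: o/p is i0 when s = 0 and i1 when s = 1."""
    return i1 if s else i0

def mux_top_down(inputs, select):
    """2^n:1 mux by D&C. inputs has 2^n entries; select = [S_{n-1}, ..., S_0]."""
    if len(inputs) == 1:                        # nothing left to select
        return inputs[0]
    half = len(inputs) // 2
    msb, rest = select[0], select[1:]
    low = mux_top_down(inputs[:half], rest)     # data i/ps whose index has msb = 0
    high = mux_top_down(inputs[half:], rest)    # data i/ps whose index has msb = 1
    return mux2(low, high, msb)                 # stitch-up on S_{n-1}

data = list("abcdefgh")
selected = [mux_top_down(data, [(j >> 2) & 1, (j >> 1) & 1, j & 1]) for j in range(8)]
```

With the 8 data i/ps above, select value j picks `data[j]` for every j from 0 to 7.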
Opening up the 8:1 MUX’s hierarchical design and a top-down view

[Figure: the 8:1 MUX (data i/ps I0 ... I7, select i/ps S2 S1 S0, o/p Z) opened up into a tree of 2:1 MUXes. Four 2:1 MUXes with select S0 take the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7) and pass on I0, I2, I4, I6 when S0 = 0; two 2:1 MUXes with select S1 follow; a final 2:1 MUX with select S2 gives Z. The two halves each form a 4:1 Mux: in each, all bits except msb should have different combinations and msb should be at a constant value (0 in one, 1 in the other); the MSB value should differ among these 2 groups. Example annotations: I2 and I6 are selected when S0 = 0, S1 = 1, and these i/ps should differ in S2; I6 is selected when S0 = 0, S1 = 1, S2 = 1.]

• Cost: Number of 2:1 muxes?
• Delay in number of 2:1 mux delay units?
Top-Down vs Bottom-Up: Mux Design
(b) Bottom-Up (“Reduce-and-Accumulate”)

[Figure: $2^{n-1}$ 2:1 MUXes, all with select i/p S0, take the data i/p pairs (I0, I1), (I2, I3), ..., $(I_{2^n-2}, I_{2^n-1})$; their o/ps feed a $2^{n-1}$:1 MUX with select i/ps $S_{n-1} \dots S_1$.]

• Generally better to try top-down (D&C) first. For example, it will be much more difficult to solve the comparator problem bottom-up.
An 8:1 MUX example (bottom-up)

[Figure: an 8:1 MUX (data i/ps I0 ... I7, select i/ps S2 S1 S0, o/p Z) built bottom-up. Four 2:1 MUXes with select S0 take the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7); their o/ps feed a 4:1 MUX with select i/ps S2 S1, whose o/p is Z. I0, I2, I4, I6 are selected when S0 = 0; I1, I3, I5, I7 are selected when S0 = 1.]

The two inputs of a first-level 2:1 Mux should have different lsb or S0 values, since their sel. is based on S0 (all other remaining, i.e., unselected, bit values should be the same). Similarly for the other i/p pairs at 2:1 Muxes at this level.
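
The bottom-up construction can be modelled level by level, counting the 2:1 muxes used and the number of levels along the way. This is a sketch with invented names (`mux2`, `mux_bottom_up`); select bits are passed MS bit first and consumed starting from S0.

```python
def mux2(i0, i1, s):
    """2:1 mux: o/p is i0 when s = 0 and i1 when s = 1."""
    return i1 if s else i0

def mux_bottom_up(inputs, select):
    """2^n:1 mux built bottom-up. select = [S_{n-1}, ..., S_0].
    Returns (output, # of 2:1 muxes used, # of mux levels)."""
    level_inputs = list(inputs)
    count, levels = 0, 0
    for s in reversed(select):       # S0 first, then S1, ..., S_{n-1}
        level_inputs = [mux2(level_inputs[k], level_inputs[k + 1], s)
                        for k in range(0, len(level_inputs), 2)]
        count += len(level_inputs)   # one 2:1 mux per pair at this level
        levels += 1
    return level_inputs[0], count, levels

data = list("abcdefgh")
results = [mux_bottom_up(data, [(j >> 2) & 1, (j >> 1) & 1, j & 1]) for j in range(8)]
```

For the 8:1 case the model uses 4 + 2 + 1 = 7 two-input muxes in 3 levels, which is one way to check an answer to the cost and delay questions asked for the top-down design (both designs unfold to the same tree of 2:1 muxes).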
Multiplier D&C

• Multiplication D&C idea:
• $A \times B = (2^{n/2} A_h + A_l)(2^{n/2} B_h + B_l)$, where $A_h$ is the higher n/2 bits of A, and $A_l$ the lower n/2 bits
  $= 2^n A_h B_h + 2^{n/2} A_h B_l + 2^{n/2} A_l B_h + A_l B_l = P_H + P_{M1} + P_{M2} + P_L$
• Example: 10111001 (= 185) X 00100111 (= 39) = 0001110000101111 (= 7215)
  D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
  = 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
  = 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111

    bbbbbbbb00111111 = PL
  + bbbb01001101bbbb = PM1
  + bbbb00010010bbbb = PM2
  + 00010110bbbbbbbb = PH
  ____________________
    0001110000101111 = 7215

[Figure: breakup by bits (operand size). A X B (n-bit mult) is broken into four (n/2)-bit mults: Ah X Bh (o/p W), Ah X Bl (o/p X), Al X Bh (o/p Y) and Al X Bl (o/p Z), each o/p n bits wide. Stitch up: align and add = $2^n W + 2^{n/2} X + 2^{n/2} Y + Z$.]

Stitch-Up Design 1 (inefficient)

[Figure: PL, PM1, PM2 and PH, each extended from n to 2n bits, are summed by three 2n-bit adders (PL + PM1, PM2 + PH, and then the two sums) to give the 2n-bit product.]

• Cost = 3 2n-bit adders = 6n FAs (full adders) for RCAs (ripple-carry adders)
• Critical path: Delay (using RCAs) =
  (a) too high-level analysis: 2*((2n)-bit adder delay) = 4n*(FA delay)
  (b) more exact, considering the overall critical path: (i + 2n - i + 1) = 2n+1 FA delays
• What is the delay of the n-bit multiplier using such a stitch up (# 1)?
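
The 4-way breakup and the align-and-add stitch-up can be checked against the worked example with a short software model. This is a sketch, assuming n is a power of 2 and unsigned operands; `dc_multiply` is a name made up for this illustration.

```python
def dc_multiply(a, b, n):
    """Multiply two n-bit unsigned ints (n a power of 2) by D&C on operand size."""
    if n == 1:
        return a & b                        # leaf: a 1-bit multiply is a 2-i/p AND
    half = n // 2
    mask = (1 << half) - 1
    ah, al = a >> half, a & mask            # higher and lower n/2 bits of A
    bh, bl = b >> half, b & mask            # higher and lower n/2 bits of B
    ph = dc_multiply(ah, bh, half)
    pm1 = dc_multiply(ah, bl, half)
    pm2 = dc_multiply(al, bh, half)
    pl = dc_multiply(al, bl, half)
    # stitch-up: align and add = 2^n*PH + 2^(n/2)*(PM1 + PM2) + PL
    return (ph << n) + ((pm1 + pm2) << half) + pl

product = dc_multiply(0b10111001, 0b00100111, 8)
```

The four recursive calls have no data dependency on one another, so in hardware the four (n/2)-bit multipliers can work concurrently; only the stitch-up has to combine their o/ps.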
[Figure: two RCAs in series adding X, Y and then Z (i/ps z0 ... z7), built from FAs (FA7 ...); FA carry o/p $c_{i+1} = a_i b_i + a_i c_i + b_i c_i$; the delay of one n-bit RCA is 5n i/p delay units. Critical paths (3 of n) going through (n+1) FAs are marked.]

Delay for adding 3 numbers X, Y, Z using two RCAs?
Ans: (n+1) FA delay units, or 5(n+1) i/p delay units
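
The (n+1) answer can be reproduced with a simple arrival-time model. The sketch below assumes all operand bits arrive at time 0, every FA takes 1 delay unit for both its sum and carry o/ps, and the carry-out of the first RCA is ignored; `two_rca_delay` is a hypothetical name.

```python
def two_rca_delay(n):
    """FA-delay units until the last sum bit of (X + Y) + Z is ready."""
    first = []                      # ready time of FA i in RCA 1 (X + Y)
    t_carry = 0
    for i in range(n):
        t = t_carry + 1             # waits only for the carry from FA i-1
        first.append(t)
        t_carry = t
    second = []                     # ready time of FA i in RCA 2 (sum + Z)
    t_carry = 0
    for i in range(n):
        t = max(first[i], t_carry) + 1   # waits for RCA 1's sum bit and its own carry-in
        second.append(t)
        t_carry = t
    return second[-1]

delay_8 = two_rca_delay(8)
delay_16 = two_rca_delay(16)
```

The second RCA starts rippling one FA delay behind the first instead of waiting for it to finish, which is why the total is n+1 rather than 2n.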
Multiplier D&C (cont’d)

Stitch-Up Design 2 (efficient)

[Figure: PL, PM1, PM2 and PH are split into (n/2)-bit pieces and summed with (n/2)-bit adders, staggered by n/2 bits (00…0 and Cin fed where no operand is present); intermediate sums feed the next adder. Arrows in the adds show the Couts of lower-order adds propagating as Cin to the next higher-order adds. Delay annotations (in FA delays): @ del = n/2, @ del = n/2 + 1, @ del = 2[n/2], Cin @ del = 2[n/2] + 1, @ del = 3[n/2] + 1; lsb @ del = n/2 + 2; lsb of MS half @ del = n/2 + 2.]

• Critical path: Delay = 3*((n/2)-bit adder delay) = 1.5n*(FA delay) for RCAs
• Cost = 5 (n/2)-bit adders = 2.5n FAs for RCAs

• Ex: 10111001 (= 185) X 00100111 (= 39) = 0001110000101111 (= 7215)
  D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
  = 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
  = 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111

    bbbbbbbb00111111 = PL
  + bbbb01001101bbbb = PM1
  + bbbb00010010bbbb = PM2
  + 00010110bbbbbbbb = PH
  ____________________
    0001110000101111 = 7215
Multiplier D&C (cont’d)
• The (1.5n + 1) FA delay units is the delay assuming PL … PH have been computed.
• What is the delay of the entire multiplier?
• Does the stitch-up need to wait for all bits of PL … PH to be available before it “starts”?
• The stitch-up of a level can start when the lsb of the msb half of the product bits of each of the 4 products PL … PH is available: for the top level this is at n/4 + 2 after the previous level’s such input is avail. (= when the lsb of the msb half of the i/p n-bit prod. is avail.; see the analysis for the 2n-bit product in the figure)
• Using RCAs, the delay in FA delay units is:
  $(n-1) + \{ (n/2 + 2) + (n/4 + 2) + \dots + (2 + 2) + 2 + 1/3 \}$
  where (n-1) is the delay after the lsb of the msb half is avail. at the top-level o/p; the terms (n/2 + 2) … (2 + 2) stop at the 4-bit mult; the 2 is the boundary-case 2-bit mult delay at bit 3 (the lsb of the msb half of the 4-bit product); and 1/3 is the delay of a 2-i/p AND gate in FA delay units (an FA, using 2-i/p gates, is 3 2-i/p gate delays)
  $= (n-1) + \{ \tfrac{1}{2} \sum_{i=0}^{\log n} 2^i + 2\log n - 1.17 \}$ [corrective term: $-[\tfrac{1}{2}(2^1 + 2^0) - 1/3]$ for taking the prev. summation down to i = 1, 0]
  $= n - 1 + \tfrac{1}{2}[2n - 1] + 2\log n - 1.17 \approx 2(n + \log n) = \Theta(2n)$ FA delays, similar to the well-known array multiplier that uses carry-save adders
• Why do we need 2 FA delay units for a 2-bit mult after 4 1-bit prods of 1-bit mults avail?
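
As a sanity check on the algebra, the sketch below adds up the individual delay terms listed above and compares the result with the simplified expression 2n + 2 log n - 2.67. It assumes n is a power of 2 with n >= 4; the function names are invented for this example.

```python
from math import log2

def multiplier_delay(n):
    """Sum the listed delay terms (in FA delay units) for an n-bit multiplier."""
    total = n - 1                    # after lsb of msb half is avail. at top-level o/p
    size = n
    while size >= 4:                 # levels from the n-bit mult down to the 4-bit mult
        total += size / 2 + 2
        size //= 2
    total += 2                       # boundary case: 2-bit mult
    total += 1 / 3                   # 1-bit mult: a 2-i/p AND gate, in FA delay units
    return total

def multiplier_delay_closed_form(n):
    """n - 1 + (1/2)(2n - 1) + 2 log n - (1.5 - 1/3), simplified."""
    return 2 * n + 2 * log2(n) - 2 - 2 / 3

delay_sum_8 = multiplier_delay(8)
delay_closed_8 = multiplier_delay_closed_form(8)
```

For n = 8 both give about 19.33 FA delays, and the two agree for larger powers of 2 as well.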
[Figure: Stitch-Up Design 2 (efficient), as on the previous slide: PL, PM1, PM2, PH summed with (n/2)-bit adders, with delay annotations @ del = n/2, @ del = n/2 + 1, @ del = 2[n/2], Cin @ del = 2[n/2] + 1, @ del = 3[n/2] + 1; lsb @ del = n/2 + 1; lsb @ del = n/2 + 2; lsb of MS half @ del = n/2 + 2.]

• We were able to obtain this similar-to-array-multiplier design using D&C & basic D&C guidelines. It did not require extensive ingenuity, as it might have for the designers of the array multiplier
• But it needed some ingenuity in efficient stitch-up and skillful analysis
• We can obtain an even faster multiplier ($\Theta(\log n)$ delay) using D&C and carry-save adders for stitch-up; see appendix
[Figure: recursion tree of stitch-ups. SU2(n) at the root (2n-bit o/p) takes four n-bit i/ps from four SU2(n/2) blocks, each of which takes (n/2)-bit i/ps from four SU2(n/4) blocks, and so on. SU2 = Stitch-up design 2 for multiplication.]

• What is its cost in terms of # of FAs (RCAs)?
• The level below the root (root = 1st level) has 4 (n/2)-bit multiplies to generate the PL … PH of the root, 16 (n/4)-bit multiplies in the next level, and so on, up to 2-bit mults. at level log n.
• Thus FAs used
  $= 2.5[n + 4(n/2) + 16(n/4) + \dots + 4^{\log n - 2}(n/2^{\log n - 2})] + 4^{\log n - 1}(2) + 4^{\log n}(1/8)$
  [the last two terms are for the boundary cases of 2-bit and 1-bit multipliers, which require 2 and 1/8 FAs each, resp.; see Qs below]
  $= 2.5n \sum_{i=0}^{\log n - 2} 2^i + 2(n/2)^2 + (1/8)n^2$
  $= 2.5[n(n/2 - 1)/(2 - 1)] + 0.625n^2$
  $= 1.25n^2 - 2.5n + 0.625n^2 \approx 1.875n^2 = \Theta(n^2)$
• Why do we need 2 FA cost units for a 2-bit multiplication (with 4 1-bit products of 1-bit mults available)? Can we count an even lower cost for 2-bit multiplication (when 4 1-bit prods. avail)?
• Assuming we use only 2-input gates, why do we add (1/8) FA cost units for each 1-bit multiplier (which is a 2-i/p AND gate)? Hint: The cost of 2-i/p xor/xnor gates is twice that of 2-i/p and/or/nand/nor gates (why? Look at the transistor-level design)
• Using carry-save adders or CSvA’s [see appendix], the cost is similar (quadratic in n, i.e., $\Theta(n^2)$).
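
The FA count can also be obtained from the recursion itself and compared with the closed form 1.875n^2 - 2.5n. The sketch below uses the per-level costs stated on the slide (2.5n FAs per SU2 stitch-up of an n-bit multiply, 2 FAs per 2-bit stitch-up, 1/8 FA per 1-bit multiplier); the function names are assumptions for this example.

```python
def multiplier_fa_cost(n):
    """FA cost units of the D&C multiplier with stitch-up design 2 (n a power of 2)."""
    if n == 1:
        return 1 / 8                              # a 2-i/p AND gate
    if n == 2:
        return 2 + 4 * multiplier_fa_cost(1)      # boundary case: 2-bit multiplier
    return 2.5 * n + 4 * multiplier_fa_cost(n // 2)

def multiplier_fa_cost_closed_form(n):
    return 1.875 * n ** 2 - 2.5 * n

cost_8 = multiplier_fa_cost(8)
```

For n = 8 this gives 100 FA cost units from either expression, and the quadratic term dominates as n grows.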
D&C Example Where a “Straightforward” Breakup Does Not Work

• Problem: n-bit Majority Function (MF): Output f = 1 when a majority of bits (> n/2) is 1, else f = 0

[Figure: Root problem A: n-bit MF [MF(n)], with o/p f from the stitch-up (SU) of Subprob. A1: MF(LS n/2 bits), o/p f1, and Subprob. A2: MF(MS n/2 bits), o/p f2.]

• Need to ask (general Qs for any problem): Is the stitch-up function SU required in the above straightforward breakup of MF(n) into two MFs for the MS and LS n/2 bits:
  – Computable?
  – Efficient in both hardware and speed?
• Try all 4 combinations of f1, f2 values and check if it is possible for any function w/ i/ps f1, f2 to determine the correct f value:
  – f1 = 0, f2 = 0: # of 1’s in minority (<= n/4) in both halves, so totally # of 1’s <= n/2 => f = 0
  – f1 = 1, f2 = 1: # of 1’s in majority (> n/4) in both halves, so totally # of 1’s > n/2 => f = 1
  – f1 = 0, f2 = 1: # of 1’s <= n/4 in LS n/2 and > n/4 in MS n/2, but this does not imply whether the total # of 1’s is <= n/2 or > n/2. So no function can determine the correct f value (it will need more info, like the exact count of 1’s)
  – f1 = 1, f2 = 0: same situation as the f1 = 0, f2 = 1 case.
• Thus the stitch-up function is not even computable in the above breakup of MF(n).
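
The non-computability argument can be confirmed by brute force for small n: the sketch below looks for (f1, f2) pairs that occur with both f = 0 and f = 1. The names `majority` and `stitch_up_conflicts` are illustrative only.

```python
from itertools import product

def majority(bits):
    """MF: 1 when more than half of the bits are 1."""
    return int(sum(bits) > len(bits) / 2)

def stitch_up_conflicts(n):
    """Return the (f1, f2) pairs for which f is not determined by f1 and f2 alone."""
    seen = {}
    conflicts = set()
    for bits in product([0, 1], repeat=n):
        ms, ls = bits[:n // 2], bits[n // 2:]
        key = (majority(ls), majority(ms))       # (f1, f2)
        f = majority(bits)
        if key in seen and seen[key] != f:
            conflicts.add(key)                   # same (f1, f2), different f
        seen.setdefault(key, f)
    return conflicts

conflicts_4 = stitch_up_conflicts(4)
```

For n = 4, for instance, MS = 11 with LS = 00 gives f = 0 while MS = 11 with LS = 01 gives f = 1, yet both have f1 = 0, f2 = 1.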
D&C Example Where a “Straightforward” Breakup Does Not Work (contd.)

• Try another breakup, this time of MF(n) into functions that are different from MF.

[Figure: Root problem A: n-bit MF [MF(n)], with o/p f. Subprob. A1: Count # of 1’s in the n bits (D&C tree for A1); its o/p f1, (log n)+1 bits wide, feeds Subprob. A2: > compare of the A1 o/p and floor(n/2) (D&C tree for A2).]

• Have seen (log n) delay for > comparator for two n-bit #s using D&C
• Can we do 1-counting using D&C? How much time will this take?
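
One possible shape for the 1-counting sub-problem is sketched below. It assumes (this is not stated on the slide) that the count is broken up by operands with an add as the stitch-up, since the count of the whole equals the sum of the counts of the halves; the function names are hypothetical.

```python
from itertools import product

def count_ones(bits):
    """1-counting by D&C: count each half independently, stitch up with an add."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return count_ones(bits[:mid]) + count_ones(bits[mid:])

def majority_by_count(bits):
    """MF(n) = (count of 1's) > floor(n/2): sub-problem A1 followed by A2."""
    return int(count_ones(bits) > len(bits) // 2)

checks = [majority_by_count(b) == int(sum(b) > len(b) / 2)
          for n in (5, 6) for b in product([0, 1], repeat=n)]
```

Here A2 depends on the o/p of A1, so unlike the earlier examples the two sub-problems cannot run concurrently.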
Dependency Resolution in D&C:
(1) The Wait Strategy
• Strategy 1: Wait for required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits
• No concurrency between A1 and A2: t(A) = t(A1) + t(A2) + t(stitch-up) = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problems of the same size
• Note that w/ no dependency, the delay expression is: t(A) = max{t(A1), t(A2)} + t(stitch-up) = t(A1) + t(stitch-up) if A1 and A2 are the same problems of the same size
Figure: Root problem A with Subprob. A1 and Subprob. A2; data flows from A1 to A2.
• So far we have seen D&C breakups in which there is no data dependency
between the two (or more) subproblems of the breakup
• Data dependency leads to increased delays
• We now look at various ways of speeding up designs that have subproblem
dependencies in their D&C breakups
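To see how much the dependency costs, the two delay expressions can be applied recursively down the D&C tree. The sketch below is illustrative only: the function names, the unit leaf delay, and the stitch-up values are assumptions chosen for the example.

```python
def delay_wait(n, leaf=1, stitch=0):
    """t(A) = 2*t(A1) + t(stitch-up): A2 waits for A1 at every level."""
    if n == 1:
        return leaf
    return 2 * delay_wait(n // 2, leaf, stitch) + stitch


def delay_no_dependency(n, leaf=1, stitch=1):
    """t(A) = t(A1) + t(stitch-up): A1 and A2 run concurrently."""
    if n == 1:
        return leaf
    return delay_no_dependency(n // 2, leaf, stitch) + stitch


for n in (8, 16, 64):
    print(n, delay_wait(n), delay_no_dependency(n))
```

With these assumed values, the Wait recurrence grows linearly in n while the dependency-free one grows as (log n) + 1.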
![Page 27: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/27.jpg)
Adder Design using D&C
• Example: Ripple-Carry Adder (RCA)
– Stitching up: Carry from LS n/2 bits is input to carry-in of MS n/2 bits at each level of the D&C tree.
– Leaf subproblem: Full Adder (FA)
Figure (a): D&C for Ripple-Carry Adder. "Add n-bit #s X, Y" is broken into "Add MS n/2 bits of X, Y" and "Add LS n/2 bits of X, Y", and so on down to FA leaves.
![Page 28: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/28.jpg)
• Note: Gate delay is proportional to the # of inputs. Generally there is a series connection of transistors in either the up or down network whose length = # of inputs, so the R’s of the transistors in series add up and R is prop. to the # of inputs; hence the delay ~ RC (C is the capacitive load) is prop. to the # of inputs.
• The 5-i/p gate delay stated above for a FA is correct if we have 2-3 i/p gates available
(why?), otherwise, if only 2-i/p gates are available, then the delay will be 6-i/p gate delays
(why?).
• Assume each gate i/p contributes 2 ps of delay
• For a 16-bit adder the delay will be 160 ps
• For a 64 bit adder the delay will be 640 ps
Example of the Wait Strategy in Adder Design
FA7
![Page 29: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/29.jpg)
Adder Design using D&C—Lookahead Wait (not in syllabus)
• Example: Carry-Lookahead Adder (CLA)
– Division: 4 subproblems per level
– Stitching up: A more complex stitching up process (generation of global “super” P,G’s to connect up the subproblems)
– Leaf subproblem: 4-bit basic CLA with small p, g bits.
• For fast designs, more intricate techniques for complex stitching up (like P, G generation in CLA) may need to be devised that are not directly suggested by D&C. But D&C is a good starting point.
Figure (a): D&C for Carry-Lookahead Adder w/ Linear Global P, G Ckt. "Add n-bit #s X, Y" is broken into: Add ms n/4 bits, Add 3rd n/4 bits, Add 2nd n/4 bits, Add ls n/4 bits, each producing local P, G.
• Linear connection of local P, G’s from each unit to determine global or super P, G for each unit. But linear delay, so not much better than RCA.
• But the global (P, G) for each unit is an associative function, so it can be done in max log (n/4) time. Carry-ins to the last 3 (n/4)-bit adds are determined in constant time using the combined (P, G)’s, and it takes another log (n/4) time for all carry-ins to each bit add to be determined.
Figure (b): D&C for Carry-Lookahead Adder w/ a Tree-like Global P, G Ckt. Same breakup into four (n/4)-bit adds, each producing local P, G.
• Tree connection of local P, G’s from each unit to determine global P, G for each unit (P is associative) to do a prefix computation.
![Page 30: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/30.jpg)
Dependency Resolution in D&C:
(2) The “Design-for-all-cases-&-select (DAC)” Strategy
Figure: Root problem A with Subprob. A1 and four copies of Subprob. A2, one per hardwired i/p (I/p00, I/p01, I/p10, I/p11). The copies feed the 00, 01, 10, 11 i/ps of a 4-to-1 Mux that has a select i/p.
• Strategy 2: DAC: For a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p to replace the one from A1.
• Select the correct o/p from all the copies of A2 via a (2^k)-to-1 Mux that is selected by the k-bit o/p from A1 when it becomes available (e.g., carry-select adder)
• t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up) = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problems
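The cost/delay trade-off of DAC versus Wait can be tabulated with a few lines of Python. This is a sketch under assumptions: the function names and the sample delay values (10, 1, 0) are made up for illustration, and only the two delay expressions and the copy count come from the slides.

```python
def dac_copies(k):
    """# of A2 copies (and Mux data i/ps) needed for a k-bit i/p from A1."""
    return 2 ** k


def dac_delay(t_a1, t_a2, t_mux, t_stitch):
    """t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up)."""
    return max(t_a1, t_a2) + t_mux + t_stitch


def wait_delay(t_a1, t_a2, t_stitch):
    """t(A) = t(A1) + t(A2) + t(stitch-up)."""
    return t_a1 + t_a2 + t_stitch


# Hypothetical numbers: t(A1) = t(A2) = 10 units, Mux = 1 unit, no stitch-up delay.
print(dac_copies(2), dac_delay(10, 10, 1, 0), wait_delay(10, 10, 0))
```

The copy count doubles with every extra bit of dependency, which is why DAC is practical mainly for small k such as a single carry bit.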
![Page 31: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/31.jpg)
(2) The “Design-for-all-cases-&-select (DAC)” Strategy: How this looks
across 2 levels
Figure: Root problem with Subprob. A1 and Subprob. A2 and a data dependency between them. A1 is broken into Subprob. A11 and four copies of Subprob. A12 (hardwired i/ps I/p00, I/p01, I/p10, I/p11) feeding a 4-to-1 Mux with a select i/p. A2 is broken in the same way into Subprob. A21 and four copies of Subprob. A22 feeding a 4-to-1 Mux with a select i/p.
Data dependency resolved via DAC at the 2nd level breakup. Two options for the 1st level
breakup: Wait or DAC
![Page 32: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/32.jpg)
(2) The “Design-for-all-cases-&-select (DAC)” Strategy: How this looks
across 2 levels (cont’d)
Data dependency resolved via DAC at the 2nd level breakup. Choosing option Wait at the 1st level
breakup,
Figure: Root problem with Subprob. A1 and Subprob. A2 connected by Wait. A1 consists of Subprob. A11 and four copies of Subprob. A12 (hardwired i/ps I/p00, I/p01, I/p10, I/p11) feeding a 4-to-1 Mux with a select i/p. A2 consists of Subprob. A21 and four copies of Subprob. A22 feeding a 4-to-1 Mux with a select i/p.
![Page 33: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/33.jpg)
(2) The “Design-for-all-cases-&-select (DAC)” Strategy: How this looks across 2 levels (cont’d)
Data dependency resolved via DAC at the 2nd level breakup. Choosing option DAC at the 1st level breakup,
Figure: Root problem with Subprob. A1 and Subprob. A2 connected by DAC. Blocks shown: Subprob. A11; four copies of Subprob. A12 with a 4-to-1 Mux; four copies of Subprob. A21 with a 4-to-1 Mux; copies of Subprob. A22 with their 4-to-1 Muxes. Each set of copies has hardwired i/ps I/p00, I/p01, I/p10, I/p11, and each Mux has a select i/p.
Note: The DAC based replication will apply to the smallest subproblem (subcircuit) of A2 that directly depends on A1’s o/ps, but does not use the DAC strategy, and not to all of A2. Similarly for lower-level DACs.
![Page 34: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/34.jpg)
(2) The “Design-for-all-cases-&-select (DAC)” Strategy (cont’d)
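Figure contents: Root problem A is broken into Subprob. A1 and Subprob. A2 (DAC, SUP). These are broken into Subprob. A1,1, Subprob. A1,2, Subprob. A2,1 and Subprob. A2,2 (DAC and SUP for each pair). The next level uses Wait and SUP for each of the four subproblems.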
Generally, wait strategy will be used at all lower levels after the 1st wait level
• The DAC strategy has a MUX delay involved, and at small subproblems, the delay of a subproblem may be smaller than a MUX delay or may not be sufficiently large to warrant extra replication or mux cost.
• Thus a mix of DAC and Wait strategies, as shown in the above figure, may be faster, w/ DAC used at higher levels and Wait at lower levels.
Figure: A D&C tree w/ a mix of DAC and Wait strategies for dependency resolution between subproblems
![Page 35: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/35.jpg)
Example of the DAC Strategy in Adder Design
Figure labels: Simplified Mux; Cout.
• For a 16-bit adder, the delay is (9*4 – 4)*2 = 64 ps (2 ps is the delay for a single
i/p); a 60% improvement ((160-64)*100/160) over RCA
• For a 64-bit adder, the delay is (9*8 – 4)*2 = 136 ps; a 79% improvement over RCA
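The arithmetic behind these two bullets can be checked with a short script. It is a sketch only: the function names and the parameter `m` are hypothetical, the RCA model (5 i/p gate delays per FA, 2 ps per gate i/p) is taken from the earlier Wait-strategy slide, and the expression (9*m - 4)*2 is copied from the bullets with m = 4 for 16 bits and m = 8 for 64 bits.

```python
GATE_INPUT_DELAY_PS = 2   # from the slides: each gate i/p contributes 2 ps
FA_INPUT_DELAYS = 5       # from the slides: a FA costs 5 i/p gate delays


def rca_delay_ps(n_bits):
    """Ripple-carry (Wait) delay: one FA delay per bit."""
    return n_bits * FA_INPUT_DELAYS * GATE_INPUT_DELAY_PS


def dac_delay_ps(m):
    """Slide expression (9*m - 4)*2; m is the factor used in the bullets."""
    return (9 * m - 4) * GATE_INPUT_DELAY_PS


def improvement_pct(baseline_ps, new_ps):
    return (baseline_ps - new_ps) * 100 / baseline_ps


for n_bits, m in [(16, 4), (64, 8)]:
    rca, dac = rca_delay_ps(n_bits), dac_delay_ps(m)
    print(n_bits, rca, dac, improvement_pct(rca, dac))
```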
![Page 36: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/36.jpg)
Dependency Resolution in D&C:
(3) Speculative Strategy
• Strategy 3: Have a single copy of A2 but choose a highly likely value of the k-bit i/p and perform A1, A2 concurrently. If, after the k-bit i/p from A1 is available, the selection turns out to be incorrect, re-do A2 w/ the correct available value.
• t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1-p(correct-choice))*[t(A1) + t(A2)] + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.
• For t(A1) = t(A2), this becomes: t(A) = p(correct-choice)*t(A1) + (1-p(correct-choice))*2t(A1) + t(stitch-up) = t(A1) + (1-p(correct-choice))*t(A1) + t(stitch-up)
• Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless in such designs
• Need an FSM controller for determining if the estimate is correct and, if not, for redoing A2 (allowing more time for generating the completion signal).
Figure: Root problem A with Subprob. A1 and Subprob. A2. A 2-to-1 Mux (i/ps I0, I1) feeds A2 with either the estimate (e.g., 01, based on analysis or stats) or o/p(A1→A2), i.e., the o/p of A1 that feeds A2. The FSM controller drives the select i/p to the Mux and generates the completion signal.
FSM Controller: On getting the completion signal from A2: if o/p(A1→A2) = estimate(A2) (compare when A1 generates a completion signal), then generate a completion signal after some delay (in a subsequent state) corresponding to the stitch-up delay; else set i/p to A2 = o/p(A1→A2) and generate the completion signal after a delay of A2 + stitch-up.
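The expected-delay expression for the speculative strategy is easy to evaluate numerically. The sketch below assumes hypothetical values (t(A1) = t(A2) = 10 units, stitch-up = 1 unit, 90% correct guesses); only the formula itself comes from the slide, and the function name is illustrative.

```python
def speculative_delay(p_correct, t_a1, t_a2, t_stitch):
    """Average t(A) = p*max(t(A1), t(A2)) + (1-p)*(t(A1) + t(A2)) + t(stitch-up)."""
    hit = max(t_a1, t_a2)     # guess was right: A1 and A2 ran concurrently
    miss = t_a1 + t_a2        # guess was wrong: A2 is redone after A1 finishes
    return p_correct * hit + (1 - p_correct) * miss + t_stitch


# Hypothetical numbers, chosen only for illustration.
print(speculative_delay(0.9, 10, 10, 1))
```

This is an average, not a guaranteed delay, which is why the design needs a completion signal rather than a fixed worst-case timing assumption.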
![Page 37: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/37.jpg)
Dependency Resolution in D&C: (4) The “Independent Pre-Computation” Strategy
• Strategy 4: Reconfigure the design of A2 so that it can do as much processing as possible that is independent
of the i/p from A1 (A2_indep). This is the “independent” computation that prepares for the final computation of A2
(A2_dep) that can start once A2_indep and A1 are done.
• t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)
• E.g., let a1 be the i/p from A1 to A2, and let A2 have the logic a2 = v’x’ + uvx + w’xy + wz’a1 + u’xa1. If this were implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is re-structured as a2 = (v’x’ + uvx + w’xy) + (wz’ + u’x)a1, and if the logic in the 2 brackets (A2_indep) is computed before a1 is available, then the delay is only 4 delay units after a1 is available.
• Such a strategy requires factoring of the external i/p a1 in the logic for a2, and grouping & implementing all the
non-a1 logic (A2_indep), and then adding logic (A2_dep) to “connect” up the non-a1 logic to a1 as the last stage.
• May not always work very well or at all (e.g, for addition, we need the carry out of A1 to start A2; it has an
A2_indep, but helps only a little; how much?)
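Figure (concept): Root problem A with Subprob. A1 and Subprob. A2; A2 is split into A2_indep and A2_dep, and the data flow from A1 enters A2_dep.
Figure (example of an unstructured logic for A2): i/ps w’ x y, w z’ a1, u’ x a1, v’ x’, u v x produce a2; critical path after a1 avail: 8-unit delay.
Figure (restructured A2): A2_indep takes w’ x y, w z’, u’ x, v’ x’, u v x; A2_dep combines its o/ps with a1 to produce a2; critical path after a1 avail: 4-unit delay.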
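The restructuring is only valid if the factored form computes the same function as the original. A brute-force truth-table check, sketched here in Python with illustrative function names, confirms this for all 2^7 input combinations.

```python
from itertools import product


def a2_flat(u, v, w, x, y, z, a1):
    """Original logic: a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1."""
    return bool((not v and not x) or (u and v and x) or (not w and x and y)
                or (w and not z and a1) or (not u and x and a1))


def a2_factored(u, v, w, x, y, z, a1):
    """Restructured logic: a2 = (v'x' + uvx + w'xy) + (wz' + u'x)a1."""
    indep_sum = (not v and not x) or (u and v and x) or (not w and x and y)
    indep_coeff = (w and not z) or (not u and x)
    # A2_dep: one AND with a1 followed by one OR.
    return bool(indep_sum or (indep_coeff and a1))


equivalent = all(a2_flat(*bits) == a2_factored(*bits)
                 for bits in product([False, True], repeat=7))
print(equivalent)
```

Only the last AND and OR in `a2_factored` depend on a1, which corresponds to the 4-unit critical path after a1 is available.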
![Page 38: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/38.jpg)
D&C Summary
• For complex digital design, we need to think of the “computation” underlying the design in a structured manner: are there properties of this computation that can be exploited for the D&C approach? Think of:
– Breakup into >= 2 subprobs via breakup of (# of operands) or (operand sizes [bits])
– Stitch-up (is it computable?)
– Leaf functions
– Dependencies between sub-problems and how to resolve them
• The design is then developed in a structured manner & the corresponding circuit may be synthesized by hand or described compactly using a HDL (e.g., structural VHDL)
• For an operation/func x on n operands (a_{n-1} x a_{n-2} x … x a_0), if x is associative, the D&C approach gives an “easy” stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). As a result, a tree-structured circuit with Θ(log n) delay can be synthesized instead of a linearly-connected circuit with Θ(n) delay.
• If x is non-associative or has only a small # of operands (e.g., 2), more ingenuity and determination of properties of x is needed to determine the breakup and the stitch-up function. The resulting design may or may not be tree-structured
• If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:
– Wait (slowest, least hardware cost)
– Design-for-all-cases (high speed, high hardware cost)
– Speculative (medium speed, medium hardware cost)
– Independent pre-computation (medium-slow speed, low hardware cost)
![Page 39: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/39.jpg)
Strategy 2: A general view of DAC computations (w/ or w/o D&C)
• If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don’t wait for the “previous” computation to finish before starting the next one
• Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value.
• All the different o/p’s of the diff. copies of B are Mux’ed using prev. stage A’s o/p
• E.g. design: Carry-Select Adder (at each stage performs two additions, one for carry-in of 0 and another for carry-in of 1 from the previous stage)
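Figure (a): Original design: A (i/ps x, y) feeds B, which produces z. Time = T(A)+T(B)
Figure (b): DAC computation: A (i/ps x, y) runs alongside four copies of B with hardwired i/ps, B(0,0), B(0,1), B(1,0), B(1,1), which feed a 4:1 Mux that produces z. Time = max(T(A),T(B)) + T(Mux). Works well when T(A) approx = T(B) and T(A) >> T(Mux)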
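As a behavioral illustration of DAC, here is a minimal Python model of a single-level carry-select adder. It is a sketch, not a hardware description: the function names, the 16-bit default width, and the two-way split are assumptions, and ordinary integer addition stands in for the half-width adders.

```python
def add_bits(x, y, cin, width):
    """Behavioral width-bit adder: returns (sum, carry-out)."""
    total = x + y + cin
    return total & ((1 << width) - 1), total >> width


def carry_select_add(x, y, width=16):
    """Stage A adds the LS half; stage B is computed for both possible carry-ins."""
    half = width // 2
    mask = (1 << half) - 1
    lo_sum, lo_carry = add_bits(x & mask, y & mask, 0, half)   # stage A
    hi_if_0 = add_bits(x >> half, y >> half, 0, half)          # copy B(0)
    hi_if_1 = add_bits(x >> half, y >> half, 1, half)          # copy B(1)
    hi_sum, cout = hi_if_1 if lo_carry else hi_if_0            # 2:1 Mux on A's o/p
    return (hi_sum << half) | lo_sum, cout


print(carry_select_add(0x12FF, 0x0001))
```

In hardware the three `add_bits` calls would run concurrently, so the delay is that of one half-width add plus the Mux.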
![Page 40: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/40.jpg)
Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!
• Use 2 circuits with different worst-case and average-case behaviors
• Use the first available output
• Get the best of both (ave-case, worst-case) worlds
• In the above schematic, we get the good ave case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance: best case = Θ(1) subs, ave case = Θ(n/2.8) subs, worst case: Θ(n) subs
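Figure: a Unary Division Ckt (good ave case: Θ(n/2.8) subs, bad worst case: Θ(2^n) subs) and a Non-Restoring Div. Ckt (bad ave case [Θ(n) subs], good worst case: Θ(n) subs) each take the inputs from registers on start and each produce an output. Their done1 and done2 signals go to an Ext. FSM, which drives the Mux select so that the first available output goes to the output register.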
Approximate analysis:
• Avg. dividend D value = 2^(n-1)
• For divisor V values in the “lower half range” [1, 2^(n-1)], the average quotient Q value is the Harmonic series (1 + 1/2 + 1/3 + … + 1/2^(n-1)). This comes from a Q value of x for V in the range (2^(n-1)/x) to (2^(n-1)/(x-1)) - 1, i.e., for approx. 2^(n-1)/x^2 #s, which have a probability of 1/x^2, giving a probabilistic value of x(1/x^2) = 1/x in the average Q calculation.
• The above summation is ~ ln(2^(n-1)) ~ (n-1)/1.4 (integration of 1/k from k = 1 to 2^(n-1))
• Q for divisors in the upper half range [2^(n-1) + 1, 2^n] is 0 ⇒ overall avg. quotient = (n-1)/2.8 ⇒ avg. subtractions needed = 1 + (n-1)/2.8 = Θ(n/2.8)
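The (n-1)/2.8 estimate can be sanity-checked by brute force. The sketch below fixes the dividend at its average value 2^(n-1), as the approximate analysis does, and averages the quotient over all divisors in [1, 2^n]; the function names are illustrative, and the comparison is only expected to agree roughly.

```python
def avg_quotient(n):
    """Average of floor(D/V) over V in [1, 2^n], with D fixed at 2^(n-1)."""
    d = 2 ** (n - 1)
    divisors = range(1, 2 ** n + 1)
    return sum(d // v for v in divisors) / len(divisors)


def slide_estimate(n):
    """Approximate average quotient from the analysis: (n-1)/2.8."""
    return (n - 1) / 2.8


for n in (8, 10, 12):
    print(n, round(avg_quotient(n), 3), round(slide_estimate(n), 3))
```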
![Page 41: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/41.jpg)
Strategy 4a: Pipeline It! (Synchronous Pipeline)
Original ckt
or datapath
Stage 1
Stage 2
Stage k
Conversion to a simple level-partitioned pipeline (all modules/gates at the same level from i/ps belong to the same stage). Note that level partition may not always be possible but other pipeline-able partitions may be.
• Throughput is defined as # of outputs / sec
• Non-pipelined throughput = (1 / D), where D = delay of original ckt’s datapath
• Pipeline throughput = 1/ (max stage delay + register delay)
• Special case: If original ckt’s datapath is divided into k stages, each of equal delay, and
dr is the delay of a register, then pipeline throughput = 1/((D/k)+dr).
• If dr is negligible compared to D/k, then pipeline throughput = k/D, k times that of the
original ckt
• In general, the registers can be clocked w/ clock period Tclk = max stage delay + register
delay
Clock
Registers
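The throughput expressions above can be evaluated directly. This is a small sketch with hypothetical numbers (a 12-unit datapath cut into 4 equal stages); the function names are assumptions, while the formulas are the ones in the bullets.

```python
def clock_period(stage_delays, reg_delay):
    """Tclk = max stage delay + register delay."""
    return max(stage_delays) + reg_delay


def pipeline_throughput(stage_delays, reg_delay):
    """Outputs per unit time once the pipeline is full."""
    return 1.0 / clock_period(stage_delays, reg_delay)


D, k = 12.0, 4                      # hypothetical datapath delay and stage count
stages = [D / k] * k                # special case: equal-delay stages
print(1.0 / D)                              # non-pipelined throughput
print(pipeline_throughput(stages, 0.0))     # k/D when dr is negligible
print(pipeline_throughput(stages, 1.0))     # 1/((D/k) + dr) with dr = 1
```

With unequal stages the slowest stage sets the clock, so balancing stage delays matters as much as the number of stages.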
![Page 42: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/42.jpg)
Strategy 4a: (Synchronous) Pipeline It! (contd.)
• Comparator o/p produced every 1 unit of time, instead of every (log n) + 1 units of time w/o pipelining, where 1 time unit here = delay of a mux or a 1-bit comparator (both will have the same or similar delay)
• We can reduce reg. cost by inserting registers at every 2 levels; throughput then decreases to 1 o/p per every 2 units of time
Figure: pipelined 8-bit comparator tree. First level: eight 1-bit comparators on A[i], B[i] (i = 7..0) with 2-bit o/ps f(7)..f(0). Then log n levels of Muxes: 2-bit 2:1 Muxes my(3), my(2), my(1), my(0); 2-bit 2:1 Muxes my(5), my(4); and a final 1-bit 2:1 Mux producing F = my1(6).
Legend: registers are placed between levels. Ij(t=k) = processed version of i/p Ij @ time k. We assume the delay of each basic module is 1 unit.
Time axis: a new i/p enters every time unit while earlier i/ps advance one stage per unit (stg. 2 i/ps, stage 3 i/ps, stage 4 i/ps, output), e.g., I1(t=0); I1(t=1), I2(t=1); I1(t=2), I2(t=2), I3(t=2); and so on.
![Page 43: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/43.jpg)
Strategy 4a: (Synchronous) Pipeline It! (contd.)
O/p produced every 2 unit’s of FA delay instead of every n units of FA delay in an n-bit RCA
Figure: Pipelined Ripple Carry Adder. Labels: next 3 S0, S1 o/ps; S1, S0 o/ps for i/ps recvd 4 cc back; S3, S2 o/ps for i/ps recvd 4 cc back; S5, S4 o/ps for i/ps recvd 4 cc back; S7, S6 o/ps for i/ps recvd 4 cc back.
• Problem: I/P and O/P data direction is not the same as the computation direction. They are perpendicular! In other words, each stage has i/ps & o/ps, rather than i/ps and o/ps only appearing at the beginning and end, resp., of the pipeline. I/ps need to stream in with delay = single stage delay.
• Thus at the i/p of each stage, we need regs to hold new i/ps until earlier i/ps have been processed by that stage. More regs are needed for later stages, as they will process their current i/ps later, by which time more i/ps will have streamed in.
• Similarly, at the o/p of each stage, we need to hold o/ps in regs until the last stage’s o/p appears for the earliest i/p still being processed. Thus more regs are needed in earlier stages, as they will have produced more o/ps by the time the o/p corresponding to the earliest i/p being processed appears.
Assume 1 cc = 2 FA + register delay
Intermediate or output
register:
Input register:
![Page 44: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/44.jpg)
Strategy 4b: Pipeline It!—Wave (or Asynchronous) Pipelining
• Wave pipelining is essentially pipelining of combinational circuits w/o the use of registers
and clocking, which are expensive and consume significant amount of power.
• The min. safe input period (MSIP: interval at which subsequent inputs are given to the wave-
pipelined circuit) is determined by the difference in max and min delays across various
module or gate outputs and inputs, respectively, and is designed so that a later input’s
processing, however fast it might be, will never “overwrite” that data at the input of any gate
or module that is still processing the previous input.
m1 m2
tmin(i/p:m2) tmax(o/p:m2)
• Consider two modules m1 and m2 above (the modules could either be subcircuits like a full
adder or even more complex or can be as simple as a single gate). Let tmin(i/p:mj) be the min-
delay (i.e., min arrival time) over all i/ps to mj, and let tmax(o/p:mj) be the max delay at the
o/p(s) of mj.
• Certainly a safe i/p period (SIP) for wave pipelining the above configuration is tmax(o/p:m2). But can we do it safely at a higher rate (lower period)?
• The min safe i/p period (MSIP) at the circuit i/ps, for m2, corresponds to a situation in which any new i/p appears at m2 only after the processing of its current i/ps is over, so that its current i/ps are held stable while it is still processing them. Thus if the 1st i/p appears at time 0 at the ckt i/p, the 2nd i/p should appear at m2 no earlier than 0 + tmax(o/p:m2) ⇒ the 2nd i/p to the circuit itself, i.e., at m1, should not appear before tmax(o/p:m2) - tmin(i/p:m2).
• Similarly, for safe operation of m1 on the 1st i/p, the 2nd i/p should not appear before 0 + tmax(o/p:m1) - tmin(i/p:m1) = tmax(o/p:m1), as tmin(i/p:m1) = 0.
• Thus for safe operation of both m1 and m2 on the 1st i/p, the 2nd i/p should not appear at the ckt i/ps before max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2)), the MSIP after the 1st i/p
![Page 45: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/45.jpg)
Strategy 4b: Pipeline It!—Wave Pipelining (contd.)
• The Q. now is whether tsafe(1) (the min safe i/p period after the 1st i/p) = max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2)) will also be a safe period after the ith i/p for i > 1.
• The 2nd o/p from m2 appears at time tsafe(1) + tmax(o/p:m2). Consider that the 3rd i/p appears at the ckt i/p tsafe(1) time after the 2nd i/p. Thus the 3rd i/p appears at m2 at 2 tsafe(1) + tmin(i/p:m2) >= tsafe(1) + tmax(o/p:m2) (since tsafe(1) = max(tmax(o/p:m1), tmax(o/p:m2) - tmin(i/p:m2)) >= tmax(o/p:m2) - tmin(i/p:m2)), i.e., no earlier than when the 2nd o/p of m2 appears, and is thus safe.
• A similar safety analysis will show that tsafe(1) is a safe i/p rate period for any ith i/p for both m2 and m1, for any i. Since it is also the min. such period (at the module level) for the 1st i/p (and in fact for any ith i/p), as we have established earlier, tsafe(1) is the min. safe i/p rate period for any ith i/p for both m2 and m1. We term this min. i/p rate period tsafe (= tsafe(1)).
• If there are k modules m1, …, mk, a simple extension of the above analysis gives us that the MSIP tsafe = max over i = 1 to k of {tmax(o/p:mi) - tmin(i/p:mi)}; see Fig. 2.
Fig. 1: modules m1 and m2, annotated with tmin(i/p:m2) and tmax(o/p:m2).
Fig. 2: a general module mj (m1 and m2 also shown). Slowest (i-1)’th o/p of mj: (i-2)tsafe + tmax(o/p:mj). Safe for mj if fastest i’th i/p: (i-1)tsafe + tmin(i/p:mj) >= slowest (i-1)’th o/p = (i-2)tsafe + tmax(o/p:mj), i.e., if tsafe >= tmax(o/p:mj) - tmin(i/p:mj). Thus safe.
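The k-module MSIP formula reduces to a one-line computation. The sketch below uses a hypothetical function name `msip` and represents each module as a (tmin at i/p, tmax at o/p) pair; the sample values are the module delays quoted in the wave pipelining Examples 1 to 3 of these notes.

```python
def msip(modules):
    """tsafe = max over modules of (tmax(o/p:mi) - tmin(i/p:mi)).

    modules: list of (tmin_input_ps, tmax_output_ps) pairs, one per module.
    """
    return max(tmax_out - tmin_in for tmin_in, tmax_out in modules)


example1 = [(0, 4), (1, 10)]                    # two modules, unbalanced logic
example2 = [(0, 6), (4, 12)]                    # comparator tree, two modules
example3 = [(0, 3), (2, 6), (4, 9), (6, 12)]    # comparator tree, one module per level
print(msip(example1), msip(example2), msip(example3))
```

The same circuit gives a lower MSIP when analyzed with more, smaller modules, as Examples 2 and 3 show.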
![Page 46: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/46.jpg)
Strategy 4b: Pipeline It!—Wave Pipelining (contd.)
• Interesting Qs:
1. Is tsafe a safe period not just at the module level but for every gate gj in the circuit (e.g., will tsafe be a safe period for every gate gj in m2), which it needs to be in order for this period to work?
2. Can we have a better (smaller) MSIP determination by considering the above analysis at the level of individual gates rather than at the level of modules?
Fig. 1: modules m1 and m2, annotated with tmin(i/p:m2) and tmax(o/p:m2).
Fig. 2 (finer analysis): Finer granularity analysis by splitting m2 into m2a U m2b, annotated with tmin(i/p:m2) = tmin(i/p:m2a), tmax(o/p:m2a), tmin(i/p:m2b), and tmax(o/p:m2b) = tmax(o/p:m2). Does the MSIP increase, decrease or remain unchanged?
![Page 47: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/47.jpg)
Strategy 4b: Pipeline It!—Wave Pipelining: Example 1
• Let max delay of a 2-i/p gate be 2 ps and min delay be 1 ps.
• What is the tsafe for this ckt for the two modules shown?
• tsafe = max (4 ps, 10-1 = 9 ps) = 9 ps
• Thus MSIP = 9 ps, only a little better than the 10 ps corresponding to the max o/p delay. So this is not a circuit that can be effectively wave pipelined.
• Can the circuit be modified in a simple way to achieve effective wave pipelining?
• Generally a ckt that has more balanced max o/p and min i/p delay for each module and gate is one that can be effectively wave pipelined, i.e., one whose MSIP is much lower than the max o/p delay of the circuit
Figure: the unstructured logic for a2 (i/ps w’ x y, w z’ a1, u’ x a1, v’ x’, u v x) split into modules m1 and m2, with tmax(o/p:m1) = 4 ps, tmin(i/p:m2) = 1 ps, tmax(o/p:m2) = 10 ps.
![Page 48: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/48.jpg)
Strategy 4b: Pipeline It! Wave Pipelining: Example 2
• Let max delay of a basic unit (1-bit comp., 2:1 mux) be 3 ps and min delay be 2 ps.
• What is the tsafe for this ckt for the two modules shown?
• tsafe = max (6-0 = 6 ps, 12-4 = 8 ps) = 8 ps
• Thus MSIP = 8 ps, 33% lower than the 12 ps corresponding to the max o/p delay.
• So this is a circuit that can be reasonably effectively wave pipelined. This is due to the balanced nature of the circuit, where all i/p to o/p paths are of the same length (the diff. betw. max and min delays comes from the max and min delays of the components or gates themselves).
Figure: the 8-bit comparator tree (eight 1-bit comparators on A[i], B[i] followed by log n levels of 2:1 Muxes, final o/p F = my1(6)) split into two modules m1 and m2, with tmax(o/p:m1) = 6 ps, tmin(i/p:m2) = 4 ps, tmax(o/p:m2) = 12 ps.
• What if we divide the circuit into 4 modules, each
corresponding to a level of the circuit, and did the
analysis for that? See next slide.
![Page 49: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/49.jpg)
Strategy 4b: Pipeline It! Wave Pipelining: Example 3
• Let max delay of a basic unit (1-bit comp., 2:1 mux) be 3 ps and min delay be 2 ps.
• What is the tsafe for this ckt for the 4 modules shown?
• tsafe = max (3-0, 6-2, 9-4, 12-6) ps = 6 ps
• Thus MSIP = 6 ps, 50% lower than the 12 ps corresponding to the max o/p delay. So we get a better (lower) MSIP if we analyze the ckt at a finer granularity.
Figure: the same 8-bit comparator tree split into 4 modules m1, m2, m3, m4, one per level, with max o/p delays of 3 ps, 6 ps, 9 ps, 12 ps and min i/p delays of 0 ps, 2 ps, 4 ps, 6 ps.
![Page 50: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/50.jpg)
Appendix: Θ(log n) delay multiplier
Multiplier D&C (cont’d): Carry-Save Addition
![Page 51: ECE 465 High Level Design Strategies · A1 Subprob. A2 A1,1 A1,2 A2,1 A2,2 Root problem A Stitch-up of solns to A1 and A2 to form the complete soln to A Do recursively until subprob-size](https://reader033.vdocuments.net/reader033/viewer/2022060300/5f0825037e708231d4208f46/html5/thumbnails/51.jpg)
Appendix: Θ(log n) delay multiplier
Multiplier D&C (cont’d): Carry-Save Add. Based Stitch-Up
• Using CSvAs (carry-save adders): each sub-prod., e.g., PL, is formed of 2 nos. (sum bits and carry bits), so there are 8 n-bit #s to be CSvA’ed in the final stitch-up; this takes a delay of approx. 5 units if done in seq. but only 4 units if done in parallel. We then get 2 final nos. (carries # and sums #) that are added by a carry-propagate adder like a CLA, which takes Θ(log n) time. The overall multiplier delay is Θ(4*log n) [4 time units at each of the (log n - 2) levels (need at least 2-bit inputs for the above structure to be valid) + at most 2 time units for the bottom two levels (why?)] + Θ(log n) = Θ(log n), similar to a Wallace-tree mult.
• We were able to obtain this fast design using D&C (and did not need the extensive ingenuity that W-T multiplier designers must have needed)!
• Hardware cost (# of FAs), ignoring final carry-prop. adder for the entire mult.? Exercise.
Fig.: Stitch-up # 3: Adding 6 numbers in parallel using CSvA’s takes 3 units of time and 4 CSvA’s. I/ps shown: S(PL), C(PL), S(PM1), C(PM1), S(PM2), C(PM2), S(PH), C(PH).
Fig.: Separate (and thus parallel) carry-save adds for each of the 4 (n/2)-bit groups (each n/2 (C & S)) shown at the top level of multiplication. Group labels: add 6 #s using CSvA’s: 3 delay units; no CSvA needed; add 4 #s using CSvA’s; add 7 #s using CSvA’s (7 lsb bits need to be added): 4 delay units.
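The level counts quoted above (3 units for 6 #s, 4 units for 7 or 8 #s) can be reproduced with a behavioral model of carry-save reduction. This is a sketch on Python integers, not a gate-level design: the names `carry_save_add` and `csa_tree` are hypothetical, and the grouping of operands into threes at each level is one simple choice among several.

```python
def carry_save_add(a, b, c):
    """3-to-2 reduction: returns (sums #, carries #) with a + b + c == sums + carries."""
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1
    return sum_bits, carry_bits


def csa_tree(numbers):
    """Reduce to at most 2 numbers; return (numbers, levels, # of CSvAs used)."""
    nums, levels, csa_count = list(numbers), 0, 0
    while len(nums) > 2:
        full = len(nums) - len(nums) % 3           # operands that fit into groups of 3
        nxt = []
        for i in range(0, full, 3):                # these CSvAs work in parallel
            nxt.extend(carry_save_add(*nums[i:i + 3]))
            csa_count += 1
        nxt.extend(nums[full:])                    # leftovers pass to the next level
        nums, levels = nxt, levels + 1
    return nums, levels, csa_count


final, levels, used = csa_tree([3, 5, 7, 11, 13, 17])
print(sum(final), levels, used)
```

The two remaining numbers would then go to a carry-propagate adder such as a CLA for the final sum.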