
Recent Developments in Theory and Implementation of Parallel Prefix Adders

Neil Burgess
Division of Electronics
Cardiff School of Engineering
Cardiff University

Motivation

• Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI
  – large fan-out points
  – wide wiring channels
• Recent insights: can remove both and do...
  – absolute difference
  – late increment
  – media processing

Structure of Presentation

• Parallel Prefix Adder theory
  – Kogge-Stone, Ladner-Fisher
• New log-depth prefix trees
  – Knowles’ “family of adders”
• New applications of prefix adders
  – late operations, media adder

I. Parallel Prefix Adder theory

Prefix adder structure

[Figure: block diagram. Operands A(0:w-1) and B(0:w-1) feed the bit propagate and generate cells, which produce g(0:w-1) and p(0:w-1); the prefix carry tree produces c(1:w); the sum cells (XOR gates) produce s(0:w)]

Prefix Equations - 1

• $g(i) = a(i) \wedge b(i)$ “carry generate”
• $p(i) = a(i) \oplus b(i)$ “carry propagate”
• $k(i) = \overline{a(i) \vee b(i)}$ “carry kill”
• g(i), p(i), & k(i) are mutually exclusive
  – Use any two: g(i) & k(i) = NAND & NOR
  – p(i) needed as well: $s(i) = p(i) \oplus c(i)$
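The bit-level definitions can be checked with a minimal Python sketch. The function names `bit_signals` and `ripple_add` are illustrative only, and the carry is rippled serially here purely as a reference model, not as a prefix tree.

```python
def bit_signals(a: int, b: int, w: int):
    """Per-bit generate, propagate and kill lists for w-bit operands."""
    g, p, k = [], [], []
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g.append(ai & bi)        # generate: both operand bits are 1
        p.append(ai ^ bi)        # propagate: exactly one operand bit is 1
        k.append(1 - (ai | bi))  # kill: both operand bits are 0
    return g, p, k


def ripple_add(a: int, b: int, w: int) -> int:
    """Reference adder: s(i) = p(i) XOR c(i), carries rippled bit by bit."""
    g, p, _ = bit_signals(a, b, w)
    carry, total = 0, 0
    for i in range(w):
        total |= (p[i] ^ carry) << i
        carry = g[i] | (p[i] & carry)   # this is c(i+1)
    return total | (carry << w)         # carry-out becomes bit w
```

Exactly one of g(i), p(i), k(i) is 1 for every bit, which is why any two of them determine the third.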

Prefix Equations - 2

• Generate and Not Kill signals are combined to form “Group Signals”

$G^{x}_{z}$  $\overline{K}^{x}_{z}$  interpretation
0  0  c(x+1) = 0
0  1  c(x+1) = c(z)
1  0  Don’t care
1  1  c(x+1) = 1

Prefix Equations - Interpretation

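• Group signals yield carry signals:
• Tree outputs: $c(i+1) = G^{i}_{0}$
• Tree inputs: $G^{i}_{i} = g(i)$ ; $\overline{K}^{i}_{i} = \overline{k}(i)$

$\overline{K}^{x}_{z} = \overline{K}^{x}_{y} \cdot \overline{K}^{y-1}_{z}$
$G^{x}_{z} = G^{x}_{y} + P^{x}_{y} \cdot G^{y-1}_{z}$
$(G, \overline{K})^{x}_{z} = (G, \overline{K})^{x}_{y} \circ (G, \overline{K})^{y-1}_{z}$

(In the tree, the not-kill signal $\overline{K}^{x}_{y}$ takes the place of $P^{x}_{y}$: a group that generates is never killed, so both forms give the same $G^{x}_{z}$.)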
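A minimal sketch of the group-signal operator, assuming each group is held as a pair (G, nK) where nK stands for "not kill". The names `prefix_op` and `carries_serial` are hypothetical, and the operator is folded serially from bit 0 upward only to show that c(i+1) equals the group generate over bits 0..i.

```python
def prefix_op(hi, lo):
    """Merge the (G, nK) pair of an upper group with that of the group below it."""
    g_hi, nk_hi = hi
    g_lo, nk_lo = lo
    # The merged group generates if the upper part generates, or if the upper
    # part does not kill and the lower part generates.
    return (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)


def carries_serial(a: int, b: int, w: int):
    """Return [c(1), ..., c(w)] with c(0) = 0, using only prefix_op."""
    group = None
    carries = []
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        bit = (ai & bi, ai | bi)          # tree inputs for bit i: (g, not kill)
        group = bit if group is None else prefix_op(bit, group)
        carries.append(group[0])          # c(i+1) is G over bits 0..i
    return carries
```

A prefix tree computes the same group signals, but arranges the `prefix_op` applications in logarithmic depth instead of a chain.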

Prefix Equations - characteristics

• Associative
  – sub-terms may be pre-computed in parallel

[Figure: 4-bit prefix tree with inputs g(0),k(0) to g(3),k(3); G K sub-terms such as $(G,K)^{1}_{0}$ are pre-computed in parallel; outputs c(1) to c(4)]

Prefix equations - characteristics

• Idempotent
  – sub-terms may be “overlapped”

[Figure: 3-bit example with inputs g(0),k(0) to g(2),k(2); the overlapping sub-terms $(G,K)^{1}_{0}$ and $(G,K)^{2}_{1}$ feed $(G,K)^{2}_{0}$; outputs c(1) to c(3)]
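Both characteristics can be confirmed by brute force. The sketch below assumes the (G, not-kill) pair representation and a hypothetical operator `prefix_op`; it enumerates the three reachable pair values and checks associativity, idempotency, and the 3-bit overlap case in which bit 1 appears in both sub-terms.

```python
from itertools import product

# Reachable (G, not-kill) values; (1, 0) is the don't-care combination.
VALID = [(0, 0), (0, 1), (1, 1)]


def prefix_op(hi, lo):
    g_hi, nk_hi = hi
    g_lo, nk_lo = lo
    return (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)


def is_associative() -> bool:
    return all(prefix_op(prefix_op(x, y), z) == prefix_op(x, prefix_op(y, z))
               for x, y, z in product(VALID, repeat=3))


def is_idempotent() -> bool:
    return all(prefix_op(x, x) == x for x in VALID)


def overlap_is_harmless() -> bool:
    """(x2 o x1) o (x1 o x0) equals x2 o (x1 o x0) although x1 is used twice."""
    return all(prefix_op(prefix_op(x2, x1), prefix_op(x1, x0))
               == prefix_op(x2, prefix_op(x1, x0))
               for x2, x1, x0 in product(VALID, repeat=3))
```

Associativity is what permits tree-shaped evaluation; idempotency is what permits two sub-terms to share bits.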

4-bit Ladner-Fisher prefix tree

• 1 sub-term pre-computed

• Logarithmic depth

• Fan-out = 2 in 2nd row (laterally)

[Figure: 4-bit Ladner-Fisher tree with inputs g(0),k(0) to g(3),k(3); first row computes $(G,K)^{1}_{0}$ and $(G,K)^{3}_{2}$, second row computes $(G,K)^{2}_{0}$ and $(G,K)^{3}_{0}$; outputs c(1) to c(4)]

8-bit Ladner-Fisher prefix tree

• Log depth; lateral fan-out = 4 in 3rd row

• No exploitation of idempotency

[Figure: 8-bit Ladner-Fisher tree with inputs g(0),k(0) to g(7),k(7) and outputs c(1) to c(8)]

16-bit Ladner-Fisher prefix tree

• Log depth with large fan-out in final row
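The Ladner-Fisher wiring pattern can be sketched as follows, assuming the (G, not-kill) pair representation and a hypothetical operator `prefix_op`. At level j, every bit whose index has bit j set takes its lower operand from the last bit of the preceding block of 2^j bits, so one source drives up to 2^j cells. The cell counter is included only to compare against the counts quoted for these trees.

```python
def prefix_op(hi, lo):
    g_hi, nk_hi = hi
    g_lo, nk_lo = lo
    return (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)


def ladner_fisher_carries(a: int, b: int, w: int):
    """Return ([c(1), ..., c(w)], number_of_prefix_cells)."""
    nodes = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1)
             for i in range(w)]
    cells = 0
    level = 0
    while (1 << level) < w:
        nxt = list(nodes)
        for i in range(w):
            if (i >> level) & 1:
                # Last bit of the block of 2**level bits just below bit i's block.
                src = ((i >> level) << level) - 1
                nxt[i] = prefix_op(nodes[i], nodes[src])
                cells += 1
        nodes = nxt
        level += 1
    return [g for g, _ in nodes], cells
```

For w = 4, 8 and 16 the counter gives 4, 12 and 32 cells.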

4-bit Kogge-Stone prefix graph

• Fan-out = 1 (laterally)

• 1 extra cell

• parallel wires in 2nd row

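[Figure: 4-bit Kogge-Stone graph with inputs g(0),k(0) to g(3),k(3); first row computes $(G,K)^{1}_{0}$, $(G,K)^{2}_{1}$ and $(G,K)^{3}_{2}$, second row computes $(G,K)^{2}_{0}$ and $(G,K)^{3}_{0}$; outputs c(1) to c(4)]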

8-bit Kogge-Stone prefix graph

• More cells & wiring than Ladner-Fisher

[Figure: 8-bit Kogge-Stone graph with inputs g(0),k(0) to g(7),k(7) and outputs c(1) to c(8)]

16-bit Kogge-Stone prefix graph

• Low fan-out but wider wiring channels

• No exploitation of idempotency
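For comparison, a Kogge-Stone sketch under the same assumptions ((G, not-kill) pairs, hypothetical `prefix_op`). At each level every bit at or above the current distance combines with the node that distance below it, so each node drives one cell laterally, at the cost of more cells and more parallel wires.

```python
def prefix_op(hi, lo):
    g_hi, nk_hi = hi
    g_lo, nk_lo = lo
    return (g_hi | (nk_hi & g_lo), nk_hi & nk_lo)


def kogge_stone_carries(a: int, b: int, w: int):
    """Return ([c(1), ..., c(w)], number_of_prefix_cells)."""
    nodes = [(((a >> i) & (b >> i)) & 1, ((a >> i) | (b >> i)) & 1)
             for i in range(w)]
    cells = 0
    dist = 1                      # lateral wire length at this level
    while dist < w:
        nxt = list(nodes)
        for i in range(dist, w):
            nxt[i] = prefix_op(nodes[i], nodes[i - dist])
            cells += 1
        nodes = nxt
        dist *= 2
    return [g for g, _ in nodes], cells
```

The counter gives 5, 17 and 49 cells for w = 4, 8 and 16, against 4, 12 and 32 for the Ladner-Fisher pattern.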

Black cells and grey cells

• Carries, $c(i) = G^{i-1}_{0}$; $\overline{K}^{i-1}_{0}$ terms not needed

• G-only cells called and coloured “grey”

The story so far…

• Parallel prefix adders available in VLSI
• Log-depth adders possible:
  – high fan-outs {1,2,4,8…} & low cell count
  – low fan-outs {1,1,1,1…} & high cell count
• Problematic in VLSI (buffering, area)
• Idempotency of the prefix operator not exploited

II. Knowles’ “Family of Adders”

Log-depth prefix trees

• In VLSI:
  – L-F trees require too much buffering delay
  – K-S trees require too much area (wire flux)
• Fan-outs characterised as:
  – {1,2,4,8…} Ladner-Fisher
  – {1,1,1,1…} Kogge-Stone

Knowles’ insight

• Use other fan-out schemes
• 5 possible 8-bit log-depth prefix trees:
  – {1,1,1} 17 cells Kogge-Stone
  – {1,1,2} 17 cells uses idempotency
  – {1,1,4} 14 cells no idempotency
  – {1,2,2} 14 cells no idempotency
  – {1,2,4} 12 cells Ladner-Fisher

Knowles’ 8-bit prefix trees

• All trees are log-depth

[Figure: the five 8-bit prefix trees {1,1,1}, {1,1,2}, {1,2,2}, {1,1,4} and {1,2,4}]

Tree construction rules

• Levels are labelled 0,1,2...

• Fan-out at jth level, $2^{k}$, satisfies $2^{k} \le 2^{j}$
• Fan-out at jth level $\le$ fan-out at (j+1)th level
• Lateral wire length at jth level is $2^{j}$
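The fan-out rules are enough to enumerate the family. The sketch below assumes the rules read "fan-out at level j is a power of two no larger than 2^j" and "fan-outs never decrease from one level to the next"; `knowles_family` is an illustrative name.

```python
from itertools import product


def knowles_family(levels: int):
    """All fan-out tuples (f_0, ..., f_{levels-1}) obeying the two rules."""
    # Level j may use fan-out 1, 2, 4, ..., 2**j.
    choices = [[1 << k for k in range(j + 1)] for j in range(levels)]
    return [f for f in product(*choices)
            if all(f[j] <= f[j + 1] for j in range(levels - 1))]
```

With 3 levels (8-bit trees) this yields 5 schemes, and with 4 levels (16-bit trees) it yields 14, running from the Kogge-Stone scheme of all ones to the Ladner-Fisher scheme of doubling fan-outs.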

Knowles’ 16-bit trees - I

• {1,1,1,1} 49 cells    {1,1,1,8} 42 cells
• {1,1,1,2} 49 cells    {1,2,2,2} 42 cells
• {1,1,1,4} 49 cells    {1,1,4,4} 40 cells
• {1,1,2,2} 49 cells    {1,1,4,8} 36 cells
• {1,1,2,4} 49 cells    {1,2,2,8} 36 cells
• {1,1,2,8} 42 cells    {1,2,4,4} 36 cells
• {1,2,2,4} 42 cells    {1,2,4,8} 32 cells

Knowles’ 16-bit trees - II

• {1,1,1,1}               {1,1,1,8}
• {1,1,1,2} Idempotent    {1,2,2,2}
• {1,1,1,4} Idempotent    {1,1,4,4}
• {1,1,2,2} Idempotent    {1,1,4,8}
• {1,1,2,4} Idempotent    {1,2,2,8}
• {1,1,2,8} Idempotent    {1,2,4,4}
• {1,2,2,4} Idempotent    {1,2,4,8}

Knowles’ 16-bit trees - III

• {1,1,1,1}         {1,1,1,8} R
• {1,1,1,2} I       {1,2,2,2} R
• {1,1,1,4} I       {1,1,4,4} R
• {1,1,2,2} I       {1,1,4,8} R
• {1,1,2,4} I       {1,2,2,8} R
• {1,1,2,8} R, I    {1,2,4,4} R
• {1,2,2,4} R, I    {1,2,4,8} R

Quick way of spotting R, I

• Define span(l) as distance from start of wire to first cell in lth level
• $\mathrm{span}(l) = 2^{l} - \mathrm{fanout}(l) + 1$
• tree characteristics
  – R if span(j) = span(k) for j < k
  – I if span(i) + span(j) = span(k) for i < j < k

Examples of R & I spotting

fanout(l)    span(l)      characteristic
• [1,1,1,1]  [1,2,4,8]    neither R nor I
• [1,1,2,2]  [1,2,3,7]    I only
• [1,2,2,2]  [1,1,3,7]    R only
• [1,2,2,4]  [1,1,3,5]    R & I
• Are R & I adders “best”?
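The span calculation is easy to reproduce. The sketch below assumes span(l) = 2^l - fanout(l) + 1, which matches all four example rows, and assumes the R test means "two levels share the same span". Both readings are reconstructions, and the I test is left out because its exact form is less certain; `spans` and `is_R` are illustrative names.

```python
def spans(fanouts):
    """span(l) = 2**l - fanout(l) + 1 for each level l."""
    return [(1 << level) - f + 1 for level, f in enumerate(fanouts)]


def is_R(fanouts) -> bool:
    """Assumed R test: some pair of levels j < k has span(j) == span(k)."""
    s = spans(fanouts)
    return len(set(s)) < len(s)
```

For example, `spans([1, 2, 2, 4])` gives `[1, 1, 3, 5]`, and `is_R` is true for that scheme because levels 0 and 1 both have span 1.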

VLSI design of prefix adders

• Adders laid out as rectangular array of prefix cells (and gaps)
• Assume cells measure 10 µm × 4 µm
  – 2 cells per significance ⇒ 20 µm / bit
• Key design parameters:
  – buffering (area & delay)
  – wiring channels (area)

16-bit adder example

• Assumptions
• Maximum fan-out without buffering:
  – 3 cells + 80 µm wire (4 cell widths)
• Maximum fan-out with buffering:
  – 9 cells + 240 µm wire (12 cell widths)
• Employ {1,2,2,4} architecture

{1,2,2,4} prefix adder layout

[Figure: layout of the {1,2,2,4} adder as a rectangular array of g and k bit cells, black (K G) and grey (G) prefix cells, buffers (buf) and xor cells]

Area vs Time for 32-bit adders

[Figure: plot of area (axis ticks 24 to 40) against delay (axis ticks 12 to 14) for 32-bit adders; labelled points: K-S {1,1,1,1,1}, {1,1,2,2,2}, {1,2,2,4,4} [1,1,3,5,13], and L-F {1,2,4,8,16}]

32-bit prefix tree adders

• Exploitable trade-off between adder’s delay and area
  – Kogge-Stone adder 16% faster than Ladner-Fisher but 66% larger
  – {1,2,2,4,4} adder 8% faster than Ladner-Fisher but only 3% larger
  – buffering also trades off speed for area

III. New applications of prefix adders

Other addition operations

• Late increment
  – Mod $2^{w}-1$ addition for Reed-Solomon coding
  – floating-point rounding
• Late complement
  – absolute difference for video motion estimation
  – sign-magnitude addition
• Typically use 2 adders and a MUX

Increments in prefix trees

• Row of prefix cells = ‘late +1’ operation

• Ladner-Fisher comprises many late +1’s
  – 1 8-bit, 2 4-bit, 4 2-bit, & 8 1-bit

Late increment tree

• Adder returns A+B if inc = 0
• Adder returns A+B+1 if inc = 1

[Figure: late increment tree with control input inc]

Late increment logic

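• “Late Carry” lc(i) set high if:
  – c(i) = 1 or
  – inc = 1 and $(a(n), b(n)) \neq (0,0)\ \forall n: 0 \le n < i$

[Figure: late-carry cell with inputs inc, $\overline{K}^{i-1}_{0}$, $c(i) = G^{i-1}_{0}$ and p(i); outputs lc(i) and s(i)]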
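The late-carry rule can be modelled bit by bit. This is a sketch under the assumption that lc(i) = c(i) OR (inc AND "no bit below i is killed"), with the group signals accumulated serially rather than by a tree; `late_increment_add` is an illustrative name.

```python
def late_increment_add(a: int, b: int, inc: int, w: int) -> int:
    """Return a + b + inc as a (w+1)-bit value, with inc applied after the carries."""
    g_grp, nk_grp = 0, 1     # empty group below bit 0: no carry, nothing killed
    total = 0
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        lc = g_grp | (inc & nk_grp)          # late carry into bit i
        total |= ((ai ^ bi) ^ lc) << i       # s(i) = p(i) XOR lc(i)
        # Extend the group to cover bits 0..i using (g, not kill) of bit i.
        g_grp, nk_grp = (ai & bi) | ((ai | bi) & g_grp), (ai | bi) & nk_grp
    carry_out = g_grp | (inc & nk_grp)
    return total | (carry_out << w)
```

Because inc enters only in the final OR-AND step, the carry computation for A+B is complete before the increment decision is needed.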

Late complement theory

• In 2’s-complement, $\overline{N} = -(N+1)$
• $A + \overline{B} = A - B - 1$
  * late increment then yields $A - B$
• $\overline{(A + \overline{B})} = -(A - B - 1 + 1) = B - A$
• Absolute difference readily available

Absolute difference logic

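• If c(w) = 0, result negative
  – if c(w) = 0, invert all the bits
  – else always perform late increment with $\overline{K}^{i-1}_{0}$

[Figure: absolute-difference sum cell with inputs p(i), c(i), c(w) and $\overline{K}^{i-1}_{0}$; output s(i)]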
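A sketch of the absolute-difference selection, assuming the adder is fed A and the bitwise complement of B, and that the carry-out c(w) chooses between inverting the sum bits and applying the late increment. The name `abs_diff` is illustrative and the group signals are again accumulated serially.

```python
def abs_diff(a: int, b: int, w: int) -> int:
    """Return |a - b| for unsigned w-bit a and b using one pass of carries."""
    nb = ~b & ((1 << w) - 1)             # bitwise complement of B
    g_grp, nk_grp = 0, 1                 # empty group below bit 0
    rows = []
    for i in range(w):
        ai, bi = (a >> i) & 1, (nb >> i) & 1
        rows.append((ai ^ bi, g_grp, nk_grp))   # p(i), c(i), not-kill below i
        g_grp, nk_grp = (ai & bi) | ((ai | bi) & g_grp), (ai | bi) & nk_grp
    cw = g_grp                           # carry-out of A + not(B)
    result = 0
    for i, (p, c, nk) in enumerate(rows):
        if cw:
            bit = p ^ (c | nk)           # late increment: A - B
        else:
            bit = 1 ^ p ^ c              # invert the sum bit: B - A
        result |= bit << i
    return result
```

Only one carry computation is performed; the choice between the two results is made in the sum cells.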

Summary of “late” ops

• Available on all prefix adders
• Extra delay: 1 gate’s delay + buffering
• Extra hardware: w black cells
• This technique used in floating-point units
  – late increment for rounding
  – late complement for true subtraction

Media (“packed”) arithmetic

• Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations
• Examples:
  – 32-bit adder → 4 8-bit adders
  – 32-bit multiplier → 2 16-bit multipliers

Partitioning an adder

• Criteria:
  – support carries propagating within sub-adders
  – prevent carries propagating between sub-adders
• Solutions:
  – put AND gates on carry chains ⇒ slower adder
  – put dummy 0’s on operand bits ⇒ larger adder
• Use prefix adder!!

Packed prefix adder - 1

• Force $\overline{k}(n) = 0$ at partition points
  – prevents carries propagating across bit n
  – exploits don’t care condition $(g,\overline{k}) = (1,0)$
• Implementation
  – change $\overline{k}(n)$ gate to (2,1) OR-AND gate
  – delay-neutral modification

Packed prefix adder - 2

• Force $c(n) = G^{n-1}_{0} = 0$ at partition points
  – prevents c(n) → s(n) errors
• Implementation
  – insert AND gates (off critical path) or
  – change $G^{n-1}_{0}$ gate to ({2,1},1) complex gate
  – BUT need $G^{n-1}_{0}$ signal for sub-adder overflows

Packed prefix adder - 3

• Sub-adder carries complete early
• Extraneous cells automatically do nothing

[Figure: packed prefix tree annotated “Force k(n) = 0” and “Force c(n) = 0”]
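The two forcing steps can be modelled together. The sketch assumes that a partition point n is the least significant bit of a sub-adder, that k(n) in the forcing rule is the not-kill signal, and that sub-adder carry-outs are simply dropped; `packed_add` and its `partitions` argument are illustrative names.

```python
def packed_add(a: int, b: int, w: int, partitions) -> int:
    """Add w-bit words lane by lane; `partitions` holds the LSB index of each upper lane."""
    g_grp, nk_grp = 0, 1                     # empty group below bit 0
    result = 0
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        at_cut = i in partitions
        c = 0 if at_cut else g_grp           # force c(n) = 0 for the sum bit
        nk = 0 if at_cut else (ai | bi)      # force not-kill to 0 at the partition
        result |= ((ai ^ bi) ^ c) << i
        # With nk = 0 the group restarts at bit n: only g(n) survives.
        g_grp, nk_grp = (ai & bi) | (nk & g_grp), nk & nk_grp
    return result
```

For example, `packed_add(x, y, 32, {8, 16, 24})` behaves as four independent 8-bit adders, while an empty partition set gives an ordinary 32-bit sum modulo 2^32.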

Last Slide

• Recent developments in prefix adders:
  – new “family” of log-depth trees
  – late operations
  – packed arithmetic for media processing
• Future possibilities:
  – systematic exploitation of idempotency
  – trees with reduced buffering
  – combine packed arithmetic/late ops

ANY QUESTIONS OR COMMENTS?
