Post on 31-Mar-2015
Recent Developments in Theory and Implementation of Parallel Prefix Adders
Neil Burgess
Division of Electronics, Cardiff School of Engineering, Cardiff University
Motivation
• Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI
  – large fan-out points
  – wide wiring channels
• Recent insights: can remove both and do...
  – absolute difference
  – late increment
  – media processing
Structure of Presentation
• Parallel Prefix Adder theory
  – Kogge-Stone, Ladner-Fisher
• New log-depth prefix trees
  – Knowles’ “family of adders”
• New applications of prefix adders
  – late operations, media adder
I. Parallel Prefix Adder theory
Prefix adder structure
[Figure: block diagram. Operands A(0:w-1) and B(0:w-1) feed the bit propagate and generate cells, which produce g(0:w-1) and p(0:w-1); the prefix carry tree produces c(1:w); the sum cells (XOR gates) produce s(0:w).]
Prefix Equations - 1
• $g(i) = a(i) \wedge b(i)$ “carry generate”
• $p(i) = a(i) \oplus b(i)$ “carry propagate”
• $k(i) = \overline{a(i) \vee b(i)}$ “carry kill”
• g(i), p(i), & k(i) are mutually exclusive
  – Use any two: g(i) & k(i) = NAND & NOR
  – p(i) needed as well: $s(i) = p(i) \oplus c(i)$
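The bit-level signals can be checked exhaustively. The following is a minimal sketch in Python; the function name `bit_signals` is an assumption made for illustration and is not part of the slides.

```python
from itertools import product

def bit_signals(a: int, b: int) -> tuple[int, int, int]:
    """Return (g, p, k) for one bit position with operand bits a and b."""
    g = a & b          # carry generate: AND
    p = a ^ b          # carry propagate: XOR
    k = 1 - (a | b)    # carry kill: NOR
    return g, p, k

for a, b in product((0, 1), repeat=2):
    g, p, k = bit_signals(a, b)
    # Exactly one of the three signals is high for any input pair.
    assert g + p + k == 1
    # p is implied by the other two, so g and k are enough for the carry tree.
    assert p == 1 - (g | k)
```

Because the three signals are mutually exclusive, any two of them determine the third; p is still formed separately because the sum cell needs it.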
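Prefix Equations - 2
• Generate and Not Kill signals are combined to form “Group Signals”:
  $G^x_z = G^x_y + P^x_y \, G^{y-1}_z$
  $K^x_z = K^x_y \, K^{y-1}_z$
  $(G,K)^x_z = (G,K)^x_y \circ (G,K)^{y-1}_z$
  (here K denotes the group not-kill signal; using $K^x_y$ in place of $P^x_y$ gives the same $G^x_z$)
Prefix Equations - Interpretation
• Group signals yield carry signals:

  $G^x_z$  $K^x_z$  interpretation
  0        0        c(x+1) = 0
  0        1        c(x+1) = c(z)
  1        0        Don’t care
  1        1        c(x+1) = 1

• Tree outputs: $c(i+1) = G^i_0$
• Tree inputs: $G^i_i = g(i)$; $K^i_i = \overline{k(i)}$ (not kill)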
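The group-signal recurrence can be exercised with a small model. This is a sketch under stated assumptions: the names `combine`, `carries` and `add` are hypothetical, the not-kill input is taken to be a(i) OR b(i), c(0) is assumed to be 0, and the scan is done serially rather than with a tree.

```python
def combine(hi, lo):
    """Prefix operator: (G, K) of the upper group over (G, K) of the lower group."""
    g_hi, k_hi = hi
    g_lo, k_lo = lo
    return (g_hi | (k_hi & g_lo), k_hi & k_lo)

def carries(a: int, b: int, w: int) -> list[int]:
    """Return [c(1), ..., c(w)], using c(i+1) = G^i_0 and a serial scan."""
    out = []
    acc = None
    for i in range(w):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        leaf = (ai & bi, ai | bi)          # (g(i), not-kill(i))
        acc = leaf if acc is None else combine(leaf, acc)
        out.append(acc[0])                 # G^i_0
    return out

def add(a: int, b: int, w: int) -> int:
    """w-bit addition returning a (w+1)-bit result s(0:w)."""
    c = [0] + carries(a, b, w)             # assumed c(0) = 0
    s = 0
    for i in range(w):
        p = ((a >> i) ^ (b >> i)) & 1
        s |= (p ^ c[i]) << i               # s(i) = p(i) XOR c(i)
    return s | (c[w] << w)                 # s(w) is the carry out
```

A parallel prefix adder computes the same G terms, only with the `combine` calls arranged as a tree instead of a chain.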
Prefix Equations - characteristics
• Associative
  – sub-terms may be pre-computed in parallel
[Figure: 4-bit example. Inputs g(0),k(0) to g(3),k(3); sub-terms such as $(G,K)^1_0$ and $(G,K)^3_2$ are formed in parallel and then combined; outputs c(1) to c(4).]
Prefix Equations - characteristics
• Idempotent
  – sub-terms may be “overlapped”
[Figure: 3-bit example. Inputs g(0),k(0) to g(2),k(2); the overlapping sub-terms $(G,K)^2_1$ and $(G,K)^1_0$ combine to give $(G,K)^2_0$; outputs c(1) to c(3).]
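Both characteristics can be confirmed by brute force over every possible (G, K) pair, including the don't-care pair (1, 0). The sketch below assumes the operator form G = G_hi + K_hi·G_lo, K = K_hi·K_lo; the names `combine`, `ASSOCIATIVE`, `IDEMPOTENT` and `overlap_ok` are hypothetical.

```python
from itertools import product

PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # every (G, K) value

def combine(hi, lo):
    """Prefix operator on (G, K) pairs: upper group over lower group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

# Associativity: grouping of the sub-terms does not matter.
ASSOCIATIVE = all(
    combine(combine(x, y), z) == combine(x, combine(y, z))
    for x, y, z in product(PAIRS, repeat=3)
)

# Idempotency: combining a term with itself changes nothing.
IDEMPOTENT = all(combine(x, x) == x for x in PAIRS)

def overlap_ok() -> bool:
    """Check that the overlapped terms [2:1] and [1:0] still give [2:0]."""
    for x2, x1, x0 in product(PAIRS, repeat=3):
        overlapped = combine(combine(x2, x1), combine(x1, x0))
        direct = combine(x2, combine(x1, x0))
        if overlapped != direct:
            return False
    return True
```

The overlap check is the property a tree relies on when two sub-terms share bit positions: the shared bits are counted twice but the result is unchanged.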
4-bit Ladner-Fisher prefix tree
• 1 sub-term pre-computed
• Logarithmic depth
• Fan-out = 2 in 2nd row (laterally)
[Figure: 4-bit Ladner-Fisher tree. Inputs g(0),k(0) to g(3),k(3); the first row forms $(G,K)^1_0$ and $(G,K)^3_2$, the second row forms $(G,K)^2_0$ and $(G,K)^3_0$; outputs c(1) to c(4).]
8-bit Ladner-Fisher prefix tree
• Log depth; lateral fan-out = 4 in 3rd row
• No exploitation of idempotency
[Figure: 8-bit Ladner-Fisher tree with inputs g(0),k(0) to g(7),k(7) and outputs c(1) to c(8).]
16-bit Ladner-Fisher prefix tree
• Log depth with large fan-out in final row
4-bit Kogge-Stone prefix graph
• Fan-out = 1 (laterally)
• 1 extra cell
• parallel wires in 2nd row
[Figure: 4-bit Kogge-Stone graph. Inputs g(0),k(0) to g(3),k(3); the first row forms $(G,K)^1_0$, $(G,K)^2_1$ and $(G,K)^3_2$, the second row forms $(G,K)^2_0$ and $(G,K)^3_0$; outputs c(1) to c(4).]
8-bit Kogge-Stone prefix graph
• More cells & wiring than Ladner-Fisher
[Figure: 8-bit Kogge-Stone graph with inputs g(0),k(0) to g(7),k(7) and outputs c(1) to c(8).]
16-bit Kogge-Stone prefix graph
• Low fan-out but wider wiring channels
• No exploitation of idempotency
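The two classical graphs can be compared with a short behavioural model. This is a sketch, assuming the Ladner-Fisher trees shown here follow the fan-out pattern {1,2,4,8…} with one cell per upper-half position in each level; the function names are hypothetical, and every cell is counted without distinguishing black from grey.

```python
def combine(hi, lo):
    """Prefix operator on (G, not-kill) pairs: upper group over lower group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

def leaves(a: int, b: int, w: int):
    """Tree inputs (g(i), not-kill(i)) for bits 0..w-1."""
    return [((a >> i) & (b >> i) & 1, ((a >> i) | (b >> i)) & 1)
            for i in range(w)]

def kogge_stone(x):
    """Return ([(G, K)^i_0 for each i], cell count); lateral fan-out 1."""
    x, w, cells, d = list(x), len(x), 0, 1
    while d < w:
        prev = list(x)
        for i in range(d, w):              # every position reaches back d bits
            x[i] = combine(prev[i], prev[i - d])
            cells += 1
        d *= 2
    return x, cells

def ladner_fisher(x):
    """Return ([(G, K)^i_0 for each i], cell count); fan-out 1, 2, 4, ..."""
    x, w, cells, d = list(x), len(x), 0, 1
    while d < w:
        prev = list(x)
        for i in range(w):
            if i & d:                      # upper half of a 2d-wide block
                src = (i // d) * d - 1     # top bit of the lower half
                x[i] = combine(prev[i], prev[src])
                cells += 1
        d *= 2
    return x, cells
```

Under these assumptions the model gives 17 and 12 cells for the 8-bit graphs and 49 and 32 cells for the 16-bit graphs, while both produce identical group signals.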
Black cells and grey cells
• Carries: $c(i) = G^{i-1}_0$; $K^{i-1}_0$ terms not needed
• G-only cells called and coloured “grey”
The story so far…
• Parallel prefix adders available in VLSI
• Log-depth adders possible:
  – high fan-outs {1,2,4,8…} & low cell count
  – low fan-outs {1,1,1,1…} & high cell count
• Problematic in VLSI (buffering, area)
• Idempotency of ‘∘’ operator not exploited
II. Knowles’ “Family of Adders”
Log-depth prefix trees
• In VLSI:
  – L-F trees require too much buffering delay
  – K-S trees require too much area (wire flux)
• Fan-outs characterised as:
  – {1,2,4,8…} Ladner-Fisher
  – {1,1,1,1…} Kogge-Stone
Knowles’ insight
• Use other fan-out schemes
• 5 possible 8-bit log-depth prefix trees:
  – {1,1,1} 17 cells Kogge-Stone
  – {1,1,2} 17 cells uses idempotency
  – {1,1,4} 14 cells no idempotency
  – {1,2,2} 14 cells no idempotency
  – {1,2,4} 12 cells Ladner-Fisher
Knowles’ 8-bit prefix trees
• All trees are log-depth
[Figure: the five 8-bit prefix trees {1,1,1}, {1,1,2}, {1,2,2}, {1,1,4} and {1,2,4}.]
Tree construction rules
• Levels are labelled 0,1,2...
• Fan-out at jth level, $2^k$, satisfies $2^k \le 2^j$
• Fan-out at jth level $\le$ fan-out at (j+1)th level
• Lateral wire length at jth level is $2^j$
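The two fan-out rules are enough to enumerate the family. The sketch below reads the rules as "fan-out at level j is a power of two no greater than 2^j" and "fan-outs never decrease from one level to the next"; the function name `knowles_fanouts` is hypothetical.

```python
from itertools import product

def knowles_fanouts(levels: int) -> list[tuple[int, ...]]:
    """All fan-out vectors that satisfy the two fan-out rules."""
    # Level j may use any power of two up to and including 2^j.
    choices = [[2 ** k for k in range(j + 1)] for j in range(levels)]
    return [
        f for f in product(*choices)
        # Fan-out at level j must not exceed fan-out at level j+1.
        if all(f[j] <= f[j + 1] for j in range(levels - 1))
    ]
```

With 3 levels (8 bits) this yields 5 vectors and with 4 levels (16 bits) it yields 14, matching the number of trees listed for each width.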
Knowles’ 16-bit trees - I
• {1,1,1,1} 49 cells    {1,1,1,8} 42 cells
• {1,1,1,2} 49 cells    {1,2,2,2} 42 cells
• {1,1,1,4} 49 cells    {1,1,4,4} 40 cells
• {1,1,2,2} 49 cells    {1,1,4,8} 36 cells
• {1,1,2,4} 49 cells    {1,2,2,8} 36 cells
• {1,1,2,8} 42 cells    {1,2,4,4} 36 cells
• {1,2,2,4} 42 cells    {1,2,4,8} 32 cells
Knowles’ 16-bit trees - II
• {1,1,1,1}               {1,1,1,8}
• {1,1,1,2} Idempotent    {1,2,2,2}
• {1,1,1,4} Idempotent    {1,1,4,4}
• {1,1,2,2} Idempotent    {1,1,4,8}
• {1,1,2,4} Idempotent    {1,2,2,8}
• {1,1,2,8} Idempotent    {1,2,4,4}
• {1,2,2,4} Idempotent    {1,2,4,8}
Knowles’ 16-bit trees - III
• {1,1,1,1}         {1,1,1,8} R
• {1,1,1,2} I       {1,2,2,2} R
• {1,1,1,4} I       {1,1,4,4} R
• {1,1,2,2} I       {1,1,4,8} R
• {1,1,2,4} I       {1,2,2,8} R
• {1,1,2,8} R, I    {1,2,4,4} R
• {1,2,2,4} R, I    {1,2,4,8} R
Quick way of spotting R, I
• Define span(l) as distance from start of wire to first cell in lth level
• $\mathrm{span}(l) = 2^l - \mathrm{fanout}(l) + 1$
• tree characteristics
  – R if $\mathrm{span}(j) \ge \mathrm{span}(k)$ for j < k
  – I if span(i) + span(j) = span(k) for i < j < k
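The span formula and the R test are easy to tabulate. This is a sketch with two labelled assumptions: the span formula is taken as 2^l − fanout(l) + 1, which reproduces the four span vectors on the "Examples of R & I spotting" slide, and the comparison in the R rule is assumed to be "greater than or equal", since the operator did not survive in the source. The I rule is not modelled here. The names `spans` and `is_r` are hypothetical.

```python
def spans(fanout: list[int]) -> list[int]:
    """span(l) = 2^l - fanout(l) + 1 for each level l (assumed form)."""
    return [2 ** l - f + 1 for l, f in enumerate(fanout)]

def is_r(fanout: list[int]) -> bool:
    """R if span(j) >= span(k) for some j < k (assumed comparison)."""
    s = spans(fanout)
    return any(
        s[j] >= s[k]
        for j in range(len(s))
        for k in range(j + 1, len(s))
    )
```

Under these assumptions `is_r` marks exactly the 16-bit trees that carry an R in the listing above, and `spans([1, 2, 2, 4, 4])` gives [1, 1, 3, 5, 13].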
Examples of R & I spotting
  fanout(l)   span(l)     characteristic
• [1,1,1,1]   [1,2,4,8]   neither R nor I
• [1,1,2,2]   [1,2,3,7]   I only
• [1,2,2,2]   [1,1,3,7]   R only
• [1,2,2,4]   [1,1,3,5]   R & I
• Are R & I adders “best”?
VLSI design of prefix adders
• Adders laid out as rectangular array of prefix cells (and gaps)
• Assume cells measure 10 µm × 4 µm
  – 2 cells per significance ⇒ 20 µm / bit
• Key design parameters:
  – buffering (area & delay)
  – wiring channels (area)
16-bit adder example
• Assumptions
• Maximum fan-out without buffering:
  – 3 cells + 80 µm wire (4 cell widths)
• Maximum fan-out with buffering:
  – 9 cells + 240 µm wire (12 cell widths)
• Employ {1,2,2,4} architecture
{1,2,2,4} prefix adder layout
[Figure: layout of the {1,2,2,4} prefix adder as a rectangular array: a row of bit cells (g, k), rows of black (K G) and grey (G) prefix cells, a row of buffers (buf), and a row of xor sum cells.]
Area vs Time for 32-bit adders
[Figure: area (axis ticks 24 to 40) plotted against delay (axis ticks 12 to 14) for 32-bit adders. Labelled points: K-S {1,1,1,1,1}, {1,1,2,2,2}, {1,2,2,4,4}, L-F {1,2,4,8,16}, and [1,1,3,5,13].]
32-bit prefix tree adders
• Exploitable trade-off between adder’s delay and area
  – Kogge-Stone adder 16% faster than Ladner-Fisher but 66% larger
  – {1,2,2,4,4} adder 8% faster than Ladner-Fisher but only 3% larger
  – buffering also trades off speed for area
III. New applications of prefix adders
Other addition operations
• Late increment
  – Mod $2^w - 1$ addition for Reed-Solomon coding
  – floating-point rounding
• Late complement
  – absolute difference for video motion estimation
  – sign-magnitude addition
• Typically use 2 adders and a MUX
Increments in prefix trees
• Row of prefix cells = ‘late +1’ operation
• Ladner-Fisher comprises many late +1’s
  – 1 × 8-bit, 2 × 4-bit, 4 × 2-bit, & 8 × 1-bit
Late increment tree
• Adder returns A+B if inc = 0
• Adder returns A+B+1 if inc = 1
[Figure: late increment tree with inc input.]
Late increment logic
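• “Late Carry” lc(i) set high if:
  – c(i) = 1 or
  – inc = 1 and $(a(n), b(n)) \ne (0,0) \;\; \forall n: 0 \le n < i$
[Figure: late increment cell with inputs p(i), inc, $K^{i-1}_0$ and $c(i) = G^{i-1}_0$, producing lc(i) and s(i).]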
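The late carry rule can be modelled bit by bit. This is a behavioural sketch, not the gate-level design: it assumes lc(i) = G^{i-1}_0 OR (inc AND K^{i-1}_0) with K as the not-kill group signal, takes the empty group at i = 0 as (G, K) = (0, 1), and forms the group signals serially as a stand-in for the prefix tree. The name `late_increment_add` is hypothetical.

```python
def late_increment_add(a: int, b: int, inc: int, w: int) -> int:
    """Return the w-bit sum A + B + inc (mod 2^w) using a late carry."""
    g = [(a >> i) & (b >> i) & 1 for i in range(w)]
    nk = [((a >> i) | (b >> i)) & 1 for i in range(w)]   # not-kill(i)
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(w)]
    G, K = 0, 1                 # G^{i-1}_0 and K^{i-1}_0 for i = 0
    s = 0
    for i in range(w):
        lc = G | (K & inc)      # late carry into bit i
        s |= (p[i] ^ lc) << i   # s(i) = p(i) XOR lc(i)
        # Extend the group [i-1:0] to [i:0].
        G, K = g[i] | (nk[i] & G), nk[i] & K
    return s
```

Because inc enters only in the final OR-AND step, the G and K terms do not depend on it, which is what allows the increment decision to arrive late.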
Late complement theory
• In 2’s-complement, $\overline{N} = -(N+1)$
• $A + \overline{B} = A - B - 1$
  * late increment then yields $A - B$
• $\overline{(A + \overline{B})} = -(A - B - 1 + 1) = B - A$
• Absolute difference readily available
Absolute difference logic
• If c(w) = 0, result negative
  – if c(w) = 0, invert all the bits
  – else always perform late increment with $K^{i-1}_0$
[Figure: absolute difference cell with inputs p(i), c(i), $K^{i-1}_0$ and c(w), producing s(i).]
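At word level the selection rule reduces to a few lines. This is a sketch that models the arithmetic only, with ordinary integer addition standing in for the prefix adder; the name `abs_diff` is hypothetical and the operands are assumed to be unsigned w-bit values.

```python
def abs_diff(a: int, b: int, w: int) -> int:
    """Return |A - B| for unsigned w-bit operands via late complement."""
    mask = (1 << w) - 1
    total = a + (~b & mask)        # A + NOT(B) = A - B - 1, plus a carry out
    c_w = total >> w               # carry out of the adder, c(w)
    if c_w == 0:
        return ~total & mask       # A <= B: invert all bits to get B - A
    return (total + 1) & mask      # A > B: late increment to get A - B
```

A single addition is performed in either case; only the final correction (invert or increment) depends on c(w).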
Summary of “late” ops
• Available on all prefix adders
• Extra delay: 1 gate’s delay + buffering
• Extra hardware: w black cells
• This technique used in floating-point units
  – late increment for rounding
  – late complement for true subtraction
Media (“packed”) arithmetic
• Fundamental strategy: Use full wordlength hardware for multiple sub-wordlength computations
• Examples:
  – 32-bit adder → 4 × 8-bit adders
  – 32-bit multiplier → 2 × 16-bit multipliers
Partitioning an adder
• Criteria:
  – support carries propagating within sub-adders
  – prevent carries propagating between sub-adders
• Solutions:
  – put AND gates on carry chains ⇒ slower adder
  – put dummy 0’s on operand bits ⇒ larger adder
• Use prefix adder!!
Packed prefix adder - 1
• Force $\overline{k}(n) = 0$ (the not-kill tree input) at partition points
  – prevents carries propagating across bit n
  – exploits don’t care condition $(g,\overline{k}) = (1,0)$
• Implementation
  – change $\overline{k}(n)$ gate to (2,1) OR-AND gate
  – delay-neutral modification
Packed prefix adder - 2
• Force $c(n) = G^{n-1}_0 = 0$ at partition points
  – prevents c(n) → s(n) errors
• Implementation
  – insert AND gates (off critical path) or
  – change $G^{n-1}_0$ gate to ({2,1},1) complex gate
  – BUT need $G^{n-1}_0$ signal for sub-adder overflows
Packed prefix adder - 3
• Sub-adder carries complete early
• Extraneous cells automatically do nothing
[Figure: packed prefix tree annotated “Force k(n) = 0” and “Force c(n) = 0” at the partition points.]
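The two forcing steps can be tried on a behavioural model. This is a sketch under stated assumptions: partition points are taken to be the lowest bit of each upper sub-word, the forced tree input is the not-kill signal a(n) OR b(n), and a Kogge-Stone arrangement is used only as one possible tree. The names `combine` and `packed_add` are hypothetical.

```python
def combine(hi, lo):
    """Prefix operator on (G, not-kill) pairs: upper group over lower group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

def packed_add(a: int, b: int, w: int = 32, lane: int = 8) -> int:
    """Add two w-bit words as independent lane-bit sub-words."""
    is_part = [i > 0 and i % lane == 0 for i in range(w)]   # partition points n
    g = [(a >> i) & (b >> i) & 1 for i in range(w)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(w)]
    # Not-kill inputs, forced to 0 at every partition point.
    nk = [0 if is_part[i] else ((a >> i) | (b >> i)) & 1 for i in range(w)]
    x = list(zip(g, nk))
    d = 1
    while d < w:                              # Kogge-Stone style prefix tree
        prev = list(x)
        for i in range(d, w):
            x[i] = combine(prev[i], prev[i - d])
        d *= 2
    s = 0
    for i in range(w):
        c_i = x[i - 1][0] if i > 0 else 0     # c(i) = G^{i-1}_0
        if is_part[i]:
            c_i = 0                           # force c(n) = 0 at partition points
        s |= (p[i] ^ c_i) << i
    return s
```

For example, `packed_add(0x01FF80FF, 0x01018001)` adds four byte lanes independently; the unforced value of G^{n-1}_0 at each partition point remains available in `x` as the overflow of the sub-adder below it.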
Last Slide
• Recent developments in prefix adders:
  – new “family” of log-depth trees
  – late operations
  – packed arithmetic for media processing
• Future possibilities:
  – systematic exploitation of idempotency
  – trees with reduced buffering
  – combine packed arithmetic/late ops
ANY QUESTIONS OR COMMENTS?