Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of...
-
Upload
bryanna-roderick -
Category
Documents
-
view
218 -
download
1
Transcript of Recent Developments in Theory and Implementation of Parallel Prefix Adders Neil Burgess Division of...
Recent Developments inTheory and
Implementationof Parallel Prefix Adders
Neil BurgessDivision of Electronics
Cardiff School of EngineeringCardiff University
Motivation
• Parallel Prefix Adders (e.g. Kogge-Stone) mostly ignored for deep submicron VLSI– large fan-out points– wide wiring channels
• Recent insights: can remove both and do...– absolute difference– late increment– media processing
Structure of Presentation
• Parallel Prefix Adder theory– Kogge-Stone, Ladner-Fisher
• New log-depth prefix trees– Knowles’ “family of adders”
• New applications of prefix adders– late operations, media adder
I.
Parallel Prefix Adder theory
Prefix adder structureA(0:w-1)
Bit propagate and generate cells
g(0:w-1)p(0:w-1)
B(0:w-1)
c(1:w)
Prefix carry tree
s(0:w)
Sum cells (XOR gates)
Prefix Equations - 1
• g(i) = a(i) b(i) “carry generate”• p(i) = a(i) b(i)“carry propagate”• k(i) = {a(i) b(i)} “carry kill”
• g(i), p(i), & k(i) are mutually exclusive– Use any two: g(i) & k(i) = NAND & NOR– p(i) needed as well: s(i) = p(i) c(i)
Prefix Equations - 2
• Generate and Not Kill signals are com-bined to form “Group Signals”Gx
z Kxz interpretation
0 0 c(x+1) = 00 1 c(x+1) = c(z)1 0 Don’t care1 1 c(x+1) = 1
Prefix Equations - Interpretation
• Group signals yield carry signals:
• Tree outputs: c(i+1) = Gi0
• Tree inputs: Gii = g(i) ; Ki
i = k(i)
zy
yx
zx
zy
yx
yx
zx
KKK
GPGG
1
1
zy
yx
zx GKGKGK 1
Prefix Equations - characteristics
• Associative– sub-terms may be pre-computed in
parallel g (0 ), (0 )kg (0 ), (0 )k g (1 ), (1 )kg (1 ), (1 )k g (2 ), (2 )kg (2 ), (2 )k g k(3 ), (3 )g k(3 ), (3 )
G K10G K
10G K
G K
G K
G K
3
3
2
3
2
0
0
0
c (4 )c (4 ) c (3 ) c (2 )c (2 ) c (1 )c (1 )
Prefix equations - characteristics
• Idempotent– sub-terms may be “overlapped”
g(0), k(0)g(0), k(0) g(1), k(1)g(1), k(1) g(2), k(2)g(2), k(2)
GK10 GKGK
2211
GKGK2200
c(3)c(3) c(2)c(2) c(1)c(1)
4-bit Ladner-Fisher prefix tree
• 1 sub-term pre-computed
• Logarithmic depth
• Fan-out = 2 in 2nd row (laterally)
g (0 ), (0 )kg (1 ), (1 )kg (2 ), (2 )kg k(3 ), (3 )
G K10G K
G K G K
3
3 2
2
0 0
c (4 ) c (3 ) c (2 ) c (1 )
8-bit Ladner-Fisher prefix tree
• Log depth; lateral fan-out = 4 in 3rd row
• No exploitation of idempotencyg (0 ), (0 )k
c (1 )
g k(3 ), (3 )
c (4 )
g k(7 ), (7 )
c (8 )
16-bit Ladner-Fisher prefix tree
• Log depth with large fan-out in final row
4-bit Kogge-Stone prefix graph
• Fan-out = 1(laterally)
• 1 extra cell
• parallel wires in 2nd row
g (0 ), (0 )kg (1 ), (1 )kg (2 ), (2 )kg k(3 ), (3 )
G K10G K
21G K
G K G K
3
3 2
2
0 0
c (4 ) c (3 ) c (2 ) c (1 )
8-bit Kogge-Stone prefix graph
• More cells & wiring than Ladner-Fisher g (0 ), (0 )k
c (1 )
g k(3 ), (3 )
c (4 )
g k(7 ), (7 )
c (8 )
16-bit Kogge-Stone prefix graph
• Low fan-out but wider wiring channels
• No exploitation of idempotency
Black cells and grey cells
• Carries, c(i) = Gi-10; Ki-1
0 terms not needed
• G-only cells called and coloured “grey”
The story so far…
• Parallel prefix adders available in VLSI• Log-depth adders possible:
– high fan-outs {1,2,4,8…} & low cell count– low fan-outs {1,1,1,1…} & high cell count
• Problematic in VLSI (buffering, area)• Idempotency of ‘’ operator not
exploited
II.
Knowles’“Family of Adders”
Log-depth prefix trees
• In VLSI:– L-F trees require too much buffering
delay– K-S trees require too much area (wire
flux)
• Fan-outs characterised as:– {1,2,4,8…} Ladner-Fisher– {1,1,1,1…} Kogge-Stone
Knowles’ insight
• Use other fan-out schemes• 5 possible 8-bit log-depth prefix
trees:– {1,1,1} 17 cells Kogge-Stone– {1,1,2} 17 cells uses idempotency– {1,1,4} 14 cells no idempotency– {1,2,2} 14 cells no idempotency– {1,2,4} 12 cells Ladner-Fisher
Knowles’ 8-bit prefix trees
• All trees are log-depth
{ 1 ,1 ,1 }
{ 1 ,1 ,2 }
{ 1 ,2 ,2 }
{ 1 ,1 ,4 }
{ 1 ,2 ,4 }
Tree construction rules
• Levels are labelled 0,1,2...
• Fan-out at jth level, 2k , satisfies 2k 2j
• Fan-out at jth level fan-out at j+1th level
• Lateral wire length at jth level is 2j
Knowles’ 16-bit trees - I
• {1,1,1,1} 49 cells {1,1,1,8} 42 cells
• {1,1,1,2} 49 cells {1,2,2,2} 42 cells• {1,1,1,4} 49 cells {1,1,4,4} 40 cells• {1,1,2,2} 49 cells {1,1,4,8} 36 cells• {1,1,2,4} 49 cells {1,2,2,8} 36 cells• {1,1,2,8} 42 cells {1,2,4,4} 36 cells• {1,2,2,4} 42 cells {1,2,4,8} 32 cells
Knowles’ 16-bit trees - II
• {1,1,1,1} {1,1,1,8}• {1,1,1,2} Idempotent {1,2,2,2}• {1,1,1,4} Idempotent {1,1,4,4}• {1,1,2,2} Idempotent {1,1,4,8} • {1,1,2,4} Idempotent {1,2,2,8} • {1,1,2,8} Idempotent {1,2,4,4} • {1,2,2,4} Idempotent {1,2,4,8}
Knowles’ 16-bit trees - III
• {1,1,1,1} {1,1,1,8} R• {1,1,1,2} I {1,2,2,2} R• {1,1,1,4} I {1,1,4,4} R• {1,1,2,2} I {1,1,4,8} R• {1,1,2,4} I {1,2,2,8} R• {1,1,2,8} R, I {1,2,4,4} R• {1,2,2,4} R, I {1,2,4,8} R
Quick way of spotting R, I
• Define span(l) as distance from start of wire to first cell in lth level
• span(l) = 2l fanout(l) 1• tree characteristics
– R if span(j) span(k) for j < k– I if span(i) + span(j) = span(k) for i < j <
k
Examples of R & I spotting
fanout(l) span(l) characteristic• [1,1,1,1] [1,2,4,8] neither R nor
I• [1,1,2,2] [1,2,3,7] I only• [1,2,2,2] [1,1,3,7] R only• [1,2,2,4] [1,1,3,5] R & I• Are R & I adders “best”?
VLSI design of prefix adders
• Adders laid out as rectangular array of prefix cells (and gaps)
• Assume cells measure 10m 4m– 2 cells per significance 20m / bit
• Key design parameters:– buffering (area & delay)– wiring channels (area)
16-bit adder example
• Assumptions• Maximum fan-out without
buffering:– 3 cells + 80m wire (4 cell widths)
• Maximum fan-out with buffering:– 9 cells + 240m wire (12 cell widths)
• Employ {1,2,2,4} architecture
{1,2,2,4} prefix adder layout
g
xor
xor
b u f b u f b u f b u f b u f b u f b u f b u f b u f b u f b u f b u f
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
xor
k
K G
K G
K G
G G G G G
G
G
G
G
G
G
G
G
G
G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G
K G K G
K G
K G K G
K G
k k k k k k k k k k k k k k kg g g g g g g g g g g g g g g
Area vs Time for 32-bit adders
Delay12 12.5 13 13.5 14
24
26
28
30
32
34
36
38
40
Area K-S {1,1,1,1,1}
{1,1,2,2,2}
L-F {1,2,4,8,16}{1,2,2,4,4}
[1,1,3,5,13]
32-bit prefix tree adders
• Exploitable trade-off between adder’s delay and area– Kogge-Stone adder 16% faster than
Ladner-Fisher but 66% larger– {1,2,2,4,4} adder 8% faster than Ladner-
Fisher but only 3% larger– buffering also trades off speed for area
III.
New applications of prefix adders
Other addition operations
• Late increment– Mod 2w-1 addition for Reed-Solomon coding– floating-point rounding
• Late complement– absolute difference for video motion
estimation– sign-magnitude addition
• Typically use 2 adders and a MUX
Increments in prefix trees
• Row of prefix cells = ‘late +1’ operation
• Ladner-Fisher comprises many late +1’s– 1 8-bit, 2 4-bit, 4 2-bit, & 8 1-bit
Late increment tree
• Adder returns A+B if inc = 0• Adder returns A+B+1 if inc = 1
inc
Late increment logic
• “Late Carry” lc(i) set high if:– c(i) = 1 or– inc = 1 and a(n),b(n) 0,0 n: 0 n < i
p(i)
s(i)
incKi-1
0
c(i) = G i-10
lc(i)
Late complement theory
• In 2’s-complement, N = -(N+1)• A + B = A B 1
* late increment then yields A B
(A + B) = -(A B 1+1) = B A
• Absolute difference readily available
Absolute difference logic
• If c(w) = 0, result negative– if c(w) = 0, invert all the bits– else always perform late increment with
Ki-10
p(i)
s(i)
c(w)
Ki-10
c(i)
Summary of “late” ops
• Available on all prefix adders• Extra delay: 1 gate’s delay +
buffering• Extra hardware: w black cells • This technique used in floating-point
units– late increment for rounding– late complement for true subtraction
Media (“packed”) arithmetic
• Fundamental strategy:Use full wordlength hardware for
multiple sub-wordlength computations
• Examples:– 32-bit adder 4 8-bit adders– 32-bit multiplier 2 16-bit multipliers
Partitioning an adder
• Criteria:– support carries propagating within sub-adders– prevent carries propagating between sub-
adders
• Solutions:– put AND gates on carry chains slower adder– put dummy 0’s on operand bits larger adder
• Use prefix adder!!
Packed prefix adder - 1
• Force k(n) = 0 at partition points– prevents carries propagating across bit n– exploits don’t care condition (g,k) = (1,0)
• Implementation– change k(n) gate to (2,1) OR-AND gate– delay-neutral modification
Packed prefix adder - 2
• Force c(n) = Gn-10 = 0 at partition
points– prevents c(n) s(n) errors
• Implementation– insert AND gates (off critical path) or
– change Gn-10 gate to ({2,1},1) complex gate
– BUT need Gn-10 signal for sub-adder overflows
Packed prefix adder - 3
• Sub-adder carries complete early• Extraneous cells automatically do
nothing Force k(n) = 0
Force c(n) = 0
Last Slide
• Recent developments in prefix adders:– new “family” of log-depth trees– late operations– packed arithmetic for media processing
• Future possibilities:– systematic exploitation of idempotency – trees with reduced buffering– combine packed arithmetic/late ops
ANY QUESTIONS OR COMMENTS?