Digital Arithmetic CSE 237D: Spring 2008 Topic #8 Professor Ryan Kastner.
Digital Arithmetic
CSE 237D: Spring 2008 Topic #8
Professor Ryan Kastner
Data Representation
- Floating point representation: large dynamic range and high precision, but costly
- Fixed point representation: requires fewer resources, with comparable performance
- Bitwidth analysis trades off estimation accuracy against the number of fixed-point bits; for this application, 8 bits is sufficient
Moorea Modem Receiver Specification
Note: 112 samples/symbol + 112 samples for channel clearing.
[Figure: receiver block diagram - four MatchingPursuitCore units feed an arg min_i stage implementing a generalized multiple hypothesis test (GMHT)]
Walsh/m-Sequence Waveforms
- Chip rate: 5 kcps, approx. 5 kHz bandwidth; uses a 25 kHz carrier.
- Uses a 7-chip m-sequence c per Walsh symbol, 8 bits per Walsh symbol b_i. Composite symbol duration is thus T = 11.2 msec (longer than the maximum multipath spread).
- Symbol rate is 266 bps, or 133 bps using an 11.2 msec time guard band for channel clearing.
Transmitted Signal
[Figure: Walsh/m-sequence signal parameters - 8 Walsh symbols modulating the chip pattern 1 1 -1 1 -1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1 1 -1 1 1 1]
Channel Estimation
- Goal: map matching pursuits to a reconfigurable device
- Parameterizable: number of samples, data representation
- Tradeoffs: provides designs with various area, latency, energy, ...
Matching Pursuits Algorithm
Matching Pursuits Core
Reconfigurable System
MP( r, S, A, a )
 1  for i = 1, 2, ..., N_S              // compute matched filter (MF) outputs
 2      v_i <- S_i^T r
 3      f_i <- 0
 4      g_i <- 0
 5  end for
 6  q_0 <- 0                            // do successive interference cancellation
 7  for j = 1, 2, ..., N_f              // update MF outputs
 8      V_j <- V_{j-1} - f_{q_{j-1}} A_{q_{j-1}}
 9      for k = 0, 1, ..., N_S - 1
10          g_k <- v_k / a_k
11          Q_k <- (v_k)* g_k
12      end for
13      q_j <- argmax over k not in {q_1, ..., q_{j-1}} of { Q_k }
14      f_{q_j} <- g_{q_j}
15  end for
16  return ( f )
[Figure: matching pursuits core datapath mapped to FPGA resources - CLBs, Block RAM, and an IP core multiplier implementing the multiply, add/subtract, and control logic for v_i, g_k, Q_k, and f_{q_j}]
System Design Tools
In Depth: Data Representation
History of Number Systems
- Oldest number system? Fingers, but only 10; toes, but only 20
- Base 10, "digit"-al
- Roman schools taught finger counting, including multiplication/division on hands/toes
"Counting in binary is just like counting in decimal if you are all thumbs." ~ Glaser and Way
- Sand tables: stones in the sand; three grooves with up to ten stones per groove
- "Calculate" is said to derive from the Latin word "calcis" because limestone was used in the first sand tables.
"Base eight is just like base ten really, if you're missing two fingers." ~ Tom Lehrer
Key Idea: Formal Notation
Notches on bones – 8500 BC in Africa, Europe
Count in multiples of some basic number: 5 or 10 based on fingers; the Mayans used 360, the Babylonians 60
Greeks, Romans extended this – fundamentally still the same
Positional notation key – same symbol in different spots has different meaning
Numbers
Any number system requires:
- A set of digits
- A set of possible values for the digits
- Rules for interpreting the digits and values as a number

Example: Roman numerals, where symbols represent values:
1 = I 100 = C
5 = V 500 = D
10 = X 1000 = M
50 = L
For example: 2008 = MMVIII
Unsigned Number Systems

Unsigned integer decimal system:
- Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
- Set of values for the digits: S_i = {0, 1, 2, ..., 9}
- Rule for determining the number: X = sum for i = 0 to n-1 of X_i * 10^i

Unsigned binary system:
- Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
- Set of values for the digits: S_i = {0, 1}
- Rule for determining the number: X = sum for i = 0 to n-1 of X_i * 2^i

Source: Parhami
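Both rules are the same weighted positional sum, differing only in radix; a minimal sketch in Python (the function name is illustrative, not from the slides):

```python
def digit_vector_value(digits, radix):
    # Value of digit vector X = (X_{n-1}, ..., X_1, X_0):
    # sum over i of X_i * radix**i, with X_0 the rightmost digit
    return sum(x * radix ** i for i, x in enumerate(reversed(digits)))
```

Calling it with radix 10 or 2 reproduces the two rules above.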
Other Useful Encodings
Some 4-bit number representation formats:
- Base-2 logarithm
- Exponent in {-2, -1, 0, 1}
- Significand in {0, 1, 2, 3}
Source: Parhami
Encoding Numbers in 4 Bits
[Figure: number line from -16 to +16 showing the values reachable by each 4-bit format]
- Unsigned integers
- Signed-magnitude
- 3 + 1 fixed-point, xxx.x
- Signed fraction, .xxx
- 2's-compl. fraction, x.xxx
- 2 + 2 floating-point, s * 2^e, e in [-2, 1], s in [0, 3]
- 2 + 2 logarithmic (log x = xx.xx)
Source: Parhami
Sign and Magnitude Representation
[Figure: 4-bit code wheel - bit patterns 0000..1111 around the circle map to signed-magnitude values +0..+7 and -0..-7; e.g. 0010 is +2 and 1110 is -6. Incrementing the representation increments positive values but decrements negative ones.]
Source: Parhami
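The wheel can be mimicked in software; a small sketch of 4-bit sign-magnitude encoding (function names are illustrative), which also exposes the two representations of zero:

```python
def sm_encode(value, k=4):
    # k-bit sign-magnitude: sign bit followed by (k-1) magnitude bits
    sign = 1 if value < 0 else 0
    mag = abs(value)
    assert mag < (1 << (k - 1)), "magnitude out of range"
    return (sign << (k - 1)) | mag

def sm_decode(bits, k=4):
    sign = (bits >> (k - 1)) & 1
    mag = bits & ((1 << (k - 1)) - 1)
    return -mag if sign else mag
```

Note that 0000 and 1000 both decode to zero (+0 and -0).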
Sign and Magnitude Adder
[Figure: signed-magnitude adder/subtractor - selective-complement blocks on inputs x and y and on the sum s, an unsigned adder with c_in/c_out, and control logic that derives the complement signals Compl x, Compl s and the result sign from Sign x, Sign y, and the Add/Sub control]
Source: Parhami
Biased Representations
[Figure: 4-bit code wheel for the biased-by-8 representation - bit patterns 0000..1111 map to signed values -8..+7; e.g. 0010 is -6 and 1110 is +6. Incrementing the representation always increments the value.]
Source: Parhami
Arithmetic with Biased Numbers
- Addition/subtraction of biased numbers:
  x + y + bias = (x + bias) + (y + bias) - bias
  x - y + bias = (x + bias) - (y + bias) + bias
- A power-of-2 (or 2^a - 1) bias simplifies addition/subtraction
- Comparison of biased numbers: compare like ordinary unsigned numbers; find the true difference by ordinary subtraction
- We seldom perform arbitrary arithmetic on biased numbers. Main application: the exponent field of floating-point numbers
Source: Parhami
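The two bias identities can be checked directly; a sketch assuming an integer biased-by-8 format as in the 4-bit example (names illustrative):

```python
BIAS = 8  # biased-by-8, matching the 4-bit wheel above

def enc(x):
    return x + BIAS          # biased representation of x

def dec(r):
    return r - BIAS

def biased_add(rx, ry):
    # (x + bias) + (y + bias) - bias  represents  x + y
    return rx + ry - BIAS

def biased_sub(rx, ry):
    # (x + bias) - (y + bias) + bias  represents  x - y
    return rx - ry + BIAS
```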
One’s Complement Number Representation
- One's complement = digit complement (diminished radix complement) system for r = 2
- M = 2^k - ulp
- (2^k - ulp) - x = x_compl
- Range of representable numbers with k whole bits: from -2^{k-1} + ulp to 2^{k-1} - ulp
[Figure: 4-bit code wheel - unsigned patterns 0000..1111 map to 1's-complement values +0..+7 and -7..-0; e.g. 0010 is +2 and 1110 is -1.]
Source: Parhami
Two’s Complement Number Representation
- Two's complement = radix complement system for r = 2
- M = 2^k
- 2^k - x = [(2^k - ulp) - x] + ulp = x_compl + ulp
- Range of representable numbers with k whole bits: from -2^{k-1} to 2^{k-1} - ulp
[Figure: 4-bit code wheel - unsigned patterns 0000..1111 map to 2's-complement values 0..+7 and -8..-1; e.g. 0010 is +2 and 1110 is -2.]
Source: Parhami
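The identity 2^k - x = x_compl + ulp is easy to verify in code; a sketch for 4-bit integers, where ulp = 1 (names illustrative):

```python
K = 4  # k whole bits

def twos_compl(x, k=K):
    # 2^k - x == bitwise (one's) complement of x, plus ulp (here ulp = 1)
    mask = (1 << k) - 1          # 2^k - ulp
    ones = (~x) & mask           # (2^k - ulp) - x, the one's complement
    return (ones + 1) & mask     # add ulp, wrap mod 2^k
```

Applying it twice recovers the original value, as expected of a complement.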
Two’s Complement Adder/Subtractor
[Figure: an adder whose y input passes through a mux selecting y (add) or y-complement (subtract); the add/sub control line (0 for addition, 1 for subtraction) also drives c_in, so s = x +/- y. The mux can be replaced with k XOR gates.]
Source: Parhami
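The mux-to-XOR observation can be sketched behaviorally (a functional model, not a gate netlist; names are illustrative):

```python
def add_sub(x, y, sub, k=4):
    # k-bit two's complement add/sub: XOR each bit of y with the control bit
    # (k XOR gates in place of the mux) and feed the same control bit to c_in,
    # so the result is x + y when sub = 0 and x + ~y + 1 = x - y when sub = 1
    mask = (1 << k) - 1
    y_eff = (y ^ (mask if sub else 0)) & mask
    return (x + y_eff + sub) & mask   # sub doubles as c_in
```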
Sign and Magnitude vs Two’s Complement
[Figure: the signed-magnitude adder and the two's-complement adder/subtractor from the previous slides, side by side]
- Signed-magnitude adder/subtractor is significantly more complex than a simple adder
- Two's-complement adder/subtractor needs very little hardware other than a simple adder
Fixed Point Representations
- Allows us to use rational numbers: a/b
- Numbers represented in the form: X = X_{a-1} X_{a-2} ... X_1 X_0 . X_{-1} X_{-2} ... X_{-b}
- Value: X = sum for i = -b to a-1 of X_i * 2^i
- Unsigned mapping: X = (1/2^b) * sum for i = 0 to n-1 of 2^i * X_i
- Two's complement mapping: X = (1/2^b) * [ -2^{n-1} * X_{n-1} + sum for i = 0 to n-2 of 2^i * X_i ]
Fixed Point Properties
- Resolution: smallest non-zero magnitude. Directly related to the number of fractional bits (b). Unsigned binary fixed point: resolution = 1/2^b
- Range: difference between the most positive and most negative number. Unsigned binary fixed point: range = 2^a - 2^{-b}. Largely dependent on the number of integer bits
- Accuracy: magnitude of the maximum difference between a real value and its representation. Unsigned binary fixed point: accuracy = 1/2^{b+1}, i.e. accuracy(x) = resolution(x)/2. With one fractional bit, the worst-case value is 1/4, since it is 1/4 away from both 0 and 1/2, which are representable with 1 fractional bit.
Example
Denote unsigned fixed point systems as U(a,b). Given the fixed point number system U(6,2):
- What number does 8A_16 represent?
- What is the range of U(6,2)? What is the resolution? What is the accuracy?
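These questions can be answered mechanically; a sketch for U(6,2) (function name is illustrative):

```python
def u_value(bits, a, b):
    # Interpret an (a+b)-bit pattern as the unsigned fixed point number U(a,b)
    assert 0 <= bits < (1 << (a + b))
    return bits / (1 << b)

a, b = 6, 2
val = u_value(0x8A, a, b)        # 1000_1010 -> 100010.10 = 34.5
resolution = 2.0 ** -b           # 1/2^b = 0.25
rng = 2.0 ** a - 2.0 ** -b       # 2^a - 2^-b = 63.75
accuracy = resolution / 2        # 1/2^(b+1) = 0.125
```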
Rules of Fixed Point Arithmetic
- Unsigned wordlength U(a,b): a + b bits
- Signed wordlength S(a,b): a + b + 1 bits
- Unsigned range U(a,b): 0 <= x <= 2^a - 2^{-b}
- Signed range S(a,b): -2^a <= x <= 2^a - 2^{-b}
- Addition: Z(a+1,b) = X(a1,b1) + Y(a2,b2); X and Y must be scaled, i.e. a1 = a2 and b1 = b2
- Unsigned multiplication: U(a1,b1) x U(a2,b2) = U(a1 + a2, b1 + b2)
- Signed multiplication: S(a1,b1) x S(a2,b2) = S(a1 + a2 + 1, b1 + b2)
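The unsigned multiplication rule says the raw integer product is already the right bit pattern; only the binary point moves to b1 + b2 fractional bits. A sketch (names and values illustrative):

```python
def umul(x_bits, a1, b1, y_bits, a2, b2):
    # U(a1,b1) x U(a2,b2) = U(a1+a2, b1+b2): multiply the raw integers;
    # the product's binary point sits b1+b2 bits from the right
    return x_bits * y_bits, (a1 + a2, b1 + b2)

# 1.5 is 0b11 in U(2,1); 2.25 is 0b1001 in U(2,2)
prod, (pa, pb) = umul(0b11, 2, 1, 0b1001, 2, 2)
```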
In Depth: Arithmetic Operations
1 Bit Addition
- Half Adder (HA): inputs A, B; outputs S, C_out. A (2 : 2) counter.
- Full Adder (FA): inputs A, B, C_in; outputs S, C_out. Built from two HAs; a (3 : 2) counter.
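A behavioral model of the two counters (names illustrative):

```python
def half_adder(a, b):
    # (2 : 2) counter: sum = a XOR b, carry = a AND b
    return a ^ b, a & b

def full_adder(a, b, cin):
    # (3 : 2) counter: two half-adders plus an OR for the carries
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2
```

For every input combination, 2*C_out + S equals the number of 1s among the inputs, which is what "counter" means here.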
Half Adder Implementations
[Figure: (a) AND/XOR half-adder; (b) NOR-gate half-adder; (c) NAND-gate half-adder with complemented carry - each producing s and c from x and y]
Source: Parhami
Full Adder Implementations
[Figure: (a) FA built of two half-adders; (b) FA built as a two-level AND-OR circuit; (c) mux-based FA suitable for CMOS realization - each producing s and c_out from x, y, and c_in]
Source: Parhami
Bit Serial Addition
Perform addition one bit at a time Xi + Yi + C0-(i-1)
Result stored in registered that is right shifted Slow but small area
Ripple Carry Adder
[Figure: n-bit ripple carry adder - a chain of full adders with inputs A_i, B_i and outputs S_i, the carry rippling from C_in through each FA to C_out]
- "Bit parallel adder". Area, delay?
- Abstraction: an n-bit two-operand adder with n-bit inputs A, B, n-bit output S, and C_in/C_out
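A behavioral sketch of the ripple carry adder, one full-adder step per bit (names illustrative); the loop-carried `c` is exactly the rippling carry, so delay grows as O(n):

```python
def ripple_carry_add(x, y, cin=0, n=8):
    # Chain n full adders; the carry ripples from bit 0 to bit n-1
    s, c = 0, cin
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        s |= (xi ^ yi ^ c) << i                  # sum bit
        c = (xi & yi) | (c & (xi ^ yi))          # generate OR (propagate AND carry)
    return s, c                                  # n-bit sum and c_out
```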
Another View of Ripple Carry Adder
[Figure: each bit position i derives generate/propagate signals G_i, P_i from A_i, B_i; a carry network turns C_0 and the (G_i, P_i) pairs into the carries C_1..C_4]
Faster Addition
- We need to break the carry chain
- The carry recurrence: c_{i+1} = g_i + p_i * c_i
- Observation: a carry only propagates in certain situations

Bit positions: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
operand bits:   1  0  1  1  0  1 1 0 0 1 1 0 1 1 1 0
carries:   cout 0  1  0  1  1  0 0 1 1 1 0 0 0 0 1 1 cin
Carry chains and their lengths: 4, 6, 3, 2
[Figure: the unrolled carry chain - each carry c_1 .. c_k computed from the generate/propagate pairs (g_0, p_0) .. (g_{k-1}, p_{k-1}) and c_0]
Manchester Adder
[Figure: a chain of switched carry chain (SCC) cells, one per bit; a KGP block derives Kill, Generate, Propagate signals K_i, G_i, P_i from A_i, B_i, and each SCC cell kills, generates, or passes the carry from C_i to C_{i+1}]
Carry Look Ahead

A B | C-out
0 0 | 0      "kill"
0 1 | C-in   "propagate"
1 0 | C-in   "propagate"
1 1 | 1      "generate"

G = A and B
P = A xor B

C0 = Cin
C1 = G0 + C0*P0
C2 = G1 + G0*P1 + C0*P0*P1
C3 = G2 + G1*P2 + G0*P1*P2 + C0*P0*P1*P2
C4 = ...

[Figure: 4-bit adder with per-bit G/P blocks feeding a lookahead network that produces every carry directly]
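The recurrence and its unrolled equations can be checked with a short model (names illustrative):

```python
def cla_carries(x, y, c0=0, n=4):
    # g_i = x_i AND y_i, p_i = x_i XOR y_i; iterating c_{i+1} = g_i + p_i*c_i
    # yields the same carries as the unrolled two-level AND-OR equations
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))
    return c                         # [c0, c1, ..., cn]

x, y = 0b1011, 0b0110
c = cla_carries(x, y)
s = 0
for i in range(4):
    s |= ((((x >> i) ^ (y >> i)) & 1) ^ c[i]) << i   # sum bit: p_i XOR c_i
total = s | (c[4] << 4)
```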
Plumbing as Carry Lookahead Analogy
[Figure: water-pipe analogy - a carry c_{i+1} "flows" if some g_j opens a valve and every intervening p is open, illustrated for c_1, c_2, and c_4]
2 Bit Carry Lookahead Adder
[Figure: two bit slices (A_0, B_0 and A_1, B_1) with G/P logic producing S_0, S_1, C_1, C_2, plus the block-level signals]
Block generate/propagate:
P^(1) = P0*P1
G^(1) = G0*P1 + G1
Source: Parhami
4 Bit Carry Look Ahead
- Complexity reduced by deriving the carry-out indirectly, but this increases the critical path
- Full carry lookahead is quite practical for a 4-bit adder:
c1 = g0 + c0*p0
c2 = g1 + g0*p1 + c0*p0*p1
c3 = g2 + g1*p2 + g0*p1*p2 + c0*p0*p1*p2
c4 = g3 + g2*p3 + g1*p2*p3 + g0*p1*p2*p3 + c0*p0*p1*p2*p3
Carry Look Ahead, Multiple Levels
[Figure: a 16-bit adder built from four 4-bit lookahead blocks over bits 0..15; each block exports its G_i, P_i to a second-level carry lookahead generator that produces C_1, C_2, C_3, and C_16 from C_0]

Cascaded Carry Look-ahead (16-bit): Abstraction
Four 4-bit adders, each exporting block G and P, feed a CLA unit computing:
C1 = G0 + C0*P0
C2 = G1 + G0*P1 + C0*P0*P1
C3 = G2 + G1*P2 + G0*P1*P2 + C0*P0*P1*P2
C4 = ...
Carry Lookahead Generator Plumbing Analogy
[Figure: plumbing analogy - (g_i, p_i) pairs combined into block generate/propagate signals and the block carry]

4 Bit Hierarchical CLA
[Figure: two 2-bit CLAs (inputs A_0..A_3, B_0..B_3) produce first-level signals G^(1)_0, P^(1)_0, G^(1)_1, P^(1)_1 and carry C_2; a 2-bit carry lookahead generator (CLG) combines them:]
P^(2) = P^(1)_0 * P^(1)_1
G^(2) = G^(1)_0 * P^(1)_1 + G^(1)_1
C4 = C0*P^(1)_0*P^(1)_1 + G^(1)_0*P^(1)_1 + G^(1)_1
8 Bit Hierarchical CLA
[Figure: four 2-bit CLAs produce (G^(1)_i, P^(1)_i) for i = 0..3 and the carries C_2, C_4, C_6, C_8; two 2-bit CLGs combine them into second-level signals (G^(2)_0, P^(2)_0) and (G^(2)_1, P^(2)_1), which a third 2-bit CLG combines with C_0]
Design Trick: Guess (or "Precompute")
- Two n-bit adders in series: CP(2n) = 2*CP(n)
- Instead, compute the upper n bits twice, once for each possible carry-in, and select the right result with a mux: CP(2n) = CP(n) + CP(mux)
- This is the carry-select adder
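The precompute-and-select trick can be sketched behaviorally, with one mux level in place of the second carry ripple (names illustrative):

```python
def carry_select_add(x, y, n=8):
    # Split into two n/2-bit halves; compute the upper half for BOTH possible
    # carry-ins in parallel, then select with the lower half's real carry
    h = n // 2
    mask = (1 << h) - 1
    lo = (x & mask) + (y & mask)
    c_lo, s_lo = lo >> h, lo & mask
    xu, yu = (x >> h) & mask, (y >> h) & mask
    hi0 = xu + yu            # guess: carry-in = 0
    hi1 = xu + yu + 1        # guess: carry-in = 1
    hi = hi1 if c_lo else hi0                    # the mux
    return ((hi & mask) << h) | s_lo, hi >> h    # sum and c_out
```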
Pipelined Ripple Carry Adder
[Figure: an n-bit ripple carry adder with flip-flop stages inserted between bit slices (n-1 FFs, n-2 FFs, n-3 FFs, ...) so that several additions are in flight at once]
Multiple Operand Addition
- Many applications require summation of many operands
- What is the best way to compute this?
[Figure: dot-notation examples - the shifted partial products of a multiplication p = a * x, and the seven operands p^(0)..p^(6) of an inner product summed into s]
Serial Implementation
[Figure: a two-operand carry propagate adder in a loop - operand O_i[n] and the running sum S_i[n + log i] feed the CPA, producing S_{i+1}[n + log(i+1)] into register S]
T_serial-multi-add = O(m log(n + log m)) = O(m log n + m log log m)
Therefore, addition time grows superlinearly with m when n is fixed and logarithmically with n for a given m
Parallel Implementation
[Figure: a binary tree of carry propagate adders (CPAs) of depth O(log m), reducing m n-bit operands O_1[n]..O_m[n] to S[n + log m]]
T_tree-fast-multi-add = O(log n + log(n + 1) + ... + log(n + log2(m) - 1))
                      = O(log m log n + log m log log m)
Can we do this faster?
Carry Save Adder (CSA)
[Figure: an n-bit carry save adder is simply n full adders in parallel, one per bit position, with no carry chain between them. Three n-bit operands O1[n], O2[n], O3[n] go in; an n-bit sum word S[n] and an n-bit carry word C[n] come out.]
Source: Parhami
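Because the n full adders are independent, the whole CSA is two bitwise expressions; a sketch (names illustrative):

```python
def carry_save_add(o1, o2, o3):
    # n independent full adders: per-bit sum word plus per-bit carry word,
    # the carries weighted x2 (shifted left) but NOT propagated
    s = o1 ^ o2 ^ o3
    c = ((o1 & o2) | (o1 & o3) | (o2 & o3)) << 1
    return s, c

s, c = carry_save_add(25, 17, 22)
```

A single carry-propagate addition of S and C then finishes the job.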
Carry Save Adders
[Figure: cutting the carry chain of a row of FAs turns a carry-propagate adder into a carry-save adder (CSA), also called a (3; 2)-counter or 3-to-2 reduction circuit. Carry-propagate adder (CPA) and carry-save adder functions shown in dot notation, with full- and half-adder blocks, their inputs, and outputs.]
Serial CSA Implementation
[Figure: the serial accumulation loop with the CPA replaced by a carry save adder - operand O_i[n] and the running pair S_i[n + log i], C_i[n + log i] feed the CSA; the outputs S_{i+1}, C_{i+1} go into registers S and C]
T_serial-csa-multi-add = O(m)
At the end, two operands remain (C, S)
Carry Propagate Adder
[Figure: a final CPA adds the sum word and the shifted carry word bit by bit - S[i] with C[i-1] at each position, an HA at the low end - producing T[1]..T[n+2] and C_out]
Final Reduction (2:1)
[Figure: a chain of CSAs reduces six n-bit operands O1..O6 to a single sum/carry pair - CSA(O1, O2, O3) and CSA(O4, O5, O6) produce intermediate pairs S1[n:1], C1[n+1:2]; further CSAs combine the intermediate S and C words into S3[n+2:1] and C3[n+2:3]. The bit alignments are shown in dot notation, with each carry word offset one position left of its sum word.]
Carry Save Arithmetic
[Figure: a tree of four CSAs reduces six operands A..F to a sum/carry pair, followed by a CLA for the final addition]
- Delay = 3 + log2(M + 3), where 3 is the height of the CSA tree and M is the bitwidth of the operands
- Tree height = log1.5(N/2) for N operands
Carry Save Arithmetic
Using ripple carry adders (RCAs) instead:
[Figure: a linear chain of RCAs of widths (M+1), (M+2), (M+3), (M+4), (M+5) summing the same operands, with Delay = (M+5) + 4]
[Chart: delay (in full-adder delays) vs. number of operands, 2..50 - the RCA chain grows far faster than the CSA tree]
[Chart: area (in full-adder units) vs. number of operands - RCA and CSA are comparable]
- Delay through the CSA network = 3 + log1.5(M + 3)
Source: Parhami
Example Reduction by a CSA Tree
Addition of seven 6-bit numbers in dot notation: 12 FAs, then 6 FAs, then 6 FAs, then 4 FAs + 1 HA, then a 7-bit adder.
Total cost = 7-bit adder + 28 FAs + 1 HA

The same seven-operand addition in tabular form (dots per bit position):

Bit position:  8 7 6 5 4 3 2 1 0
                   7 7 7 7 7 7      12 FAs
                 2 5 5 5 5 5 3      6 FAs
                 3 4 4 4 4 4 1      6 FAs
               1 2 3 3 3 3 2 1      4 FAs + 1 HA
               2 2 2 2 2 1 2 1      7-bit adder (carry-propagate)
               1 1 1 1 1 1 1 1 1

- A full-adder compacts 3 dots into 2 (compression ratio of 1.5)
- A half-adder rearranges 2 dots (no compression, but still useful)
Source: Parhami
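The 7-to-2 reduction can be emulated by repeatedly applying a (3; 2) reduction until two operands remain; a sketch with illustrative 6-bit values (not the slides' example):

```python
def csa(o1, o2, o3):
    # (3; 2) reduction: bitwise sum word plus shifted bitwise carry word
    s = o1 ^ o2 ^ o3
    c = ((o1 & o2) | (o1 & o3) | (o2 & o3)) << 1
    return s, c

ops = [37, 12, 63, 9, 41, 5, 58]     # seven 6-bit numbers (illustrative)
while len(ops) > 2:                  # each CSA replaces 3 operands with 2
    s, c = csa(ops[0], ops[1], ops[2])
    ops = ops[3:] + [s, c]
```

The surviving pair is then added by one carry-propagate adder.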
Wallace and Dadda Reduction Trees
- Wallace tree: reduce the number of operands at the earliest possible opportunity
- Dadda tree: postpone the reduction to the extent possible without causing added delay
- Adding seven 6-bit numbers using Dadda's strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA, the same total as the eager reduction.

Dadda height sequence:
h    | 2 | 3 | 4 | 5  | 6
n(h) | 4 | 6 | 9 | 13 | 19

Source: Parhami
Generalized Parallel Counters
- A (5, 5; 4)-counter: dot notation, and its use for reducing five numbers to two numbers (multicolumn reduction)
- A (2, 3; 3)-counter: unequal columns
- Generalized parallel counter = parallel compressor
Compressors
- Compressors allow for carry-ins and carry-outs
[Figure: a [4:2] compressor at bit i, built from two FAs - inputs O1[i]..O4[i] plus Cin[i] from bit i-1; outputs S[i], C[i], and Cout[i] to bit i+1]
[4 : 2] Compressor Adder
[Figure: an n-bit [4:2] adder - one [4:2] compressor per bit position, taking four n-bit operands O1[n]..O4[n] and producing sum and carry words S[n], C[n], with the intermediate carries passed between adjacent bit positions]
Higher Order Compressors
[Figure: a [5:2] compressor per bit position, built from three FAs - inputs O1[i]..O5[i] plus intermediate carries from bit i-1; outputs S[i] and C[i]]
Linear System Optimization
- Linear systems are ubiquitous in signal processing applications
- We have developed many methods for optimizing them to hardware, software, and FPGAs [ASAP04, ASPDAC05, DATE06, ICCD06, Journal of VLSI Signal Processing 07]
- The 1-D linear systems on the previous slide are FIR filters
[Equation: a 4-point linear transform y = C x, where C is a 4x4 matrix of cosine constants (cos 0, cos(k pi/8), cos(k pi/4), ...) multiplying inputs x0..x3 to produce y0..y3]
[Figure: FIR filter structure - input X[n] feeds a chain of z^-1 delays; the taps are multiplied by coefficients h0, h1, ..., hL-2, hL-1 and summed to produce y[n]]
FIR Filter Implementations: Multiply Accumulate Method
- Convolution of the latest L input samples, where L is the number of coefficients h(k) of the filter and x(n) is the input time series: y[n] = sum over k = 0, 1, ..., L-1 of h[k] * x[n-k]
- Disadvantages:
  - Large area on FPGA due to the multipliers, even though the full flexibility of general purpose multipliers is not required
  - Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
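The MAC formulation is a direct sum, one multiply-accumulate per tap; a sketch with an illustrative 3-tap filter (names and values are not from the slides):

```python
def fir_mac(h, x, n):
    # y[n] = sum over k of h[k] * x[n-k], skipping samples before time 0
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))

h = [1, 2, 1]        # illustrative 3-tap coefficients
x = [3, 4, 5, 6]     # illustrative input samples
```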
FIR Filter Implementations: Distributed Arithmetic
Summation of inner products, where A_k are constant coefficients and X_k the input data:

Y = sum for k = 1 to K of A_k * X_k

We can write each input in two's complement fractional form (X_{k0} is the sign bit, X_{kb} the remaining bits):

X_k = -X_{k0} + sum for b = 1 to B-1 of X_{kb} * 2^{-b}

Substituting this into the above yields:

Y = sum for k = 1 to K of A_k * [ -X_{k0} + sum for b = 1 to B-1 of X_{kb} * 2^{-b} ]

Exchanging the order of the summations:

Y = sum for b = 1 to B-1 of [ sum for k = 1 to K of A_k * X_{kb} ] * 2^{-b}
  + sum for k = 1 to K of A_k * (-X_{k0})
From the previous slide: how do we compute the bracketed term?
- It multiplies bit b of each input by the binary constants A_1, A_2, ..., A_K
- Questions (assume b = 1; this generalizes to any b):
  - What if every X_{k1} is 0, i.e. the bits are [0 0 0 ... 0]?
  - What if X_{11} = 1 and the rest are 0, i.e. [0 0 0 ... 1]?
  - What if X_{11} = 1, X_{21} = 1 and the rest are 0, i.e. [0 0 ... 1 1]?
FIR Filter Implementations: Distributed Arithmetic
Looking at the summations in a different way:

Y = sum for k = 1 to K of A_k * [ -X_{k0} + sum for b = 1 to B-1 of X_{kb} * 2^{-b} ]

Expanded over k:

Y =  A_1 * ( -X_{10} + X_{11}*2^{-1} + X_{12}*2^{-2} + ... + X_{1(B-1)}*2^{-(B-1)} )
   + A_2 * ( -X_{20} + X_{21}*2^{-1} + X_{22}*2^{-2} + ... + X_{2(B-1)}*2^{-(B-1)} )
   ...
   + A_K * ( -X_{K0} + X_{K1}*2^{-1} + X_{K2}*2^{-2} + ... + X_{K(B-1)}*2^{-(B-1)} )

Regrouped by bit position:

Y =  [ A_1*X_{11} + A_2*X_{21} + A_3*X_{31} + ... + A_K*X_{K1} ] * 2^{-1}
   + [ A_1*X_{12} + A_2*X_{22} + A_3*X_{32} + ... + A_K*X_{K2} ] * 2^{-2}
   ...
   + [ A_1*X_{1(B-1)} + A_2*X_{2(B-1)} + ... + A_K*X_{K(B-1)} ] * 2^{-(B-1)}
   + A_1*(-X_{10}) + A_2*(-X_{20}) + A_3*(-X_{30}) + ... + A_K*(-X_{K0})
Each bracketed term is a function of only the K bits X_{1b}, ..., X_{Kb}, so it can be precomputed into a 2^K-entry LUT:

Address   Value
00...00   0
00...01   A1
00...10   A2
00...11   A1 + A2
...       ...
11...11   A1 + A2 + ... + AK

- Precision of the stored constants: usually equal to the precision B of the input data
- The LUT output is accumulated with a shift (>>) to apply the 2^{-b} weights, producing Y
FIR Filter Implementations: Distributed Arithmetic
- Advantages: replaces multiplication with LUT lookups; coefficients stored in LUTs
- Disadvantages:
  - Performance limited: the next input sample can be processed only after every bit of the current input sample has been processed
  - Increasing the number of bits to be processed has a significant effect on resource utilization
  - A larger scaling accumulator is needed for a higher number of bits, which increases the critical path delay
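The LUT-plus-shift-accumulate scheme can be sketched in software for unsigned inputs (the sign-bit correction term is omitted, and integer weights 2^b replace the slides' fractional 2^-b, which only rescales the result; all names and values are illustrative):

```python
A = [3, 5, 7]        # constant coefficients A_1..A_K (illustrative)
K = len(A)

# 2^K-entry LUT: the entry at address (X_Kb ... X_1b) holds the sum of the
# coefficients whose address bit is 1 -- the bracketed term for one bit b
LUT = [sum(A[k] for k in range(K) if (addr >> k) & 1) for addr in range(1 << K)]

def da_dot(xs, B=4):
    # For each bit position b, form the address from bit b of every input,
    # look it up, shift by b, and accumulate
    y = 0
    for b in range(B):
        addr = sum(((xs[k] >> b) & 1) << k for k in range(K))
        y += LUT[addr] << b
    return y
```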
FIR Filter Implementations: Distributed Arithmetic
[Figure: serial DA implementation - input bits x0[i]..x3[i] and x4[i]..x7[i] address two LUTs; the LUT outputs are added and fed to a scaling accumulator (register plus shift)]

LUT contents:
Address  Data
0000     0
0001     C0
0010     C1
...      ...
1111     C0 + C1 + C2 + C3
FIR Filter Implementations: Distributed Arithmetic
[Figure: replicated DA implementation - a second bank of LUTs processes bit i+1 while the first processes bit i, their outputs combined before the scaling accumulator]
- Performance improved by replication: process multiple bits at a time
- Significant effect on resource utilization: more LUTs, and a larger scaling accumulator
[Figure: two FIR filter structures - a transposed form in which X[n] feeds a multiplier block producing y0, y1, y2, ..., yL-1, combined through z^-1 delays and adders into y[n], and the direct form with coefficients h0 .. hL-1]
FIR Filter Implementations: Add and Shift Method
Idea: Constant Multiplication to Shift/Add
- Multiplication is expensive in hardware; decompose constant multiplications into shifts and additions:
  13*X = (1101)_2*X = X + X<<2 + X<<3
- Signed digits can reduce the number of additions/subtractions: Canonical Signed Digits (CSD) (Knuth '74)
  (57)_10 = (0111001)_2 = (100-1001)_CSD
- Further reduction is possible by common subexpression elimination: up to 50% reduction (R. Hartley, TCS '96)
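Both decompositions are easy to sanity-check (function names are illustrative):

```python
def mul13(x):
    # 13*X = (1101)_2 * X = X + X<<2 + X<<3
    return x + (x << 2) + (x << 3)

def mul57_csd(x):
    # 57 = (100-1001)_CSD = 64 - 8 + 1: the CSD form turns the run of
    # adjacent 1s in (0111001)_2 into a single subtraction
    return (x << 6) - (x << 3) + x
```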
Introduction
Common subexpressions = common digit patterns:
  F1 = 7*X  = (0111)*X = X + X<<1 + X<<2
  F2 = 13*X = (1101)*X = X + X<<2 + X<<3
Extracting the shared pattern "0101", i.e. D1 = X + X<<2:
  F1 = D1 + X<<1
  F2 = D1 + X<<3
This reduces 4 additions and 4 shifts to 3 additions and 3 shifts.
- Good for a single variable: FIR filters (transposed form)
- Multiple variables? (DFT, DCT, etc.?)
Linear Systems and Polynomial Transformation
H.264 Integer Transform:

| Y0 |   | 1  1  1  1 | | X0 |
| Y1 | = | 2  1 -1 -2 | | X1 |
| Y2 |   | 1 -1 -1  1 | | X2 |
| Y3 |   | 1 -2  2 -1 | | X3 |

Decomposing the constant multiplications:
Y0 = X0 + X1 + X2 + X3
Y1 = X0<<1 + X1 - X2 - X3<<1
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1<<1 + X2<<1 - X3
(12 additions, 4 shifts)
Linear Systems and Polynomial Transformation
H.264 Integer Transform, after the polynomial transformation: each left shift by one is written as multiplication by a variable L (X<<1 becomes X*L):
Y0 = X0 + X1 + X2 + X3
Y1 = X0*L + X1 - X2 - X3*L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1*L + X2*L - X3
(12 additions, 4 shifts)
H.264 Example
Select D0 = (X0 + X3):
Y0 = D0 + X1 + X2
Y1 = X0*L + X1 - X2 - X3*L
Y2 = D0 - X1 - X2
Y3 = X0 - X1*L + X2*L - X3

Select D1 = (X1 - X2):
Y0 = D0 + X1 + X2
Y1 = X0*L + D1 - X3*L
Y2 = D0 - X1 - X2
Y3 = X0 - D1*L - X3

Select D2 = (X1 + X2):
Y0 = D0 + D2
Y1 = X0*L + D1 - X3*L
Y2 = D0 - D2
Y3 = X0 - D1*L - X3

Select D3 = (X0 - X3):
Y0 = D0 + D2
Y1 = D1 + D3*L
Y2 = D0 - D2
Y3 = D3 - D1*L
Final Implementation
Extracting 4 divisors:
D0 = X0 + X3    Y0 = D0 + D2
D1 = X1 - X2    Y1 = D1 + D3*L
D2 = X1 + X2    Y2 = D0 - D2
D3 = X0 - X3    Y3 = D3 - D1*L

Cost: 8 additions, 2 shifts
Original: 12 additions, 4 shifts
Rectangle covering: 10 additions, 3 shifts
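The extracted-divisor version can be checked against the original matrix (test values are illustrative):

```python
M = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]

def h264_fast(x0, x1, x2, x3):
    # 4 divisor adds, then 4 more adds and 2 shifts: 8 additions, 2 shifts
    d0, d1 = x0 + x3, x1 - x2
    d2, d3 = x1 + x2, x0 - x3
    return [d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)]

x = [4, -3, 7, 2]    # illustrative test vector
direct = [sum(M[r][c] * x[c] for c in range(4)) for r in range(4)]
```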
FPGA FIR Filter Implementations: Add and Shift Method
Example: F1 = A + B + C + D, F2 = A + B + C + E
[Figure: the unoptimized expression trees for F1 and F2, and the optimized trees after extracting the common expression (A + B + C), or alternatively (A + B)]
Filter Implementation Using Add and Shift Method
Filter Implementation Using Xilinx Coregen (PDA)
Filter(# taps)
Slices LUTs FFsPerformance
(Msps)
6 264 213 509 251
10 474 406 916 222
13 386 334 749 252
20 856 705 1650 250
28 1294 1145 2508 227
41 2154 1719 4161 223
61 3264 2591 6303 192
119 6009 4821 11551 203
151 7579 6098 14611 180
Filter(# taps)
Slices LUTs FFsPerformance
(Msps)
6 524 774 1012 245
10 781 1103 1480 222
13 929 1311 1775 199
20 1191 1631 2288 199
28 1774 2544 3381 199
41 2475 3642 4748 222
61 3528 5335 6812 199
119 6484 9754 12539 205
151 8274 12525 15988 199
Resource Utilization + Performance Results
Experimental Results: DA vs. Add and Shift Method
[Chart: % reduction in SLICEs, LUTs, and FFs achieved by Add and Shift over DA, for filters of 6 to 152 taps]
[Chart: dynamic power consumption (mW) vs. filter size (# of taps), Add/Shift vs. Coregen]
Experimental Results: MAC vs. Add and Shift Method

                Add Shift Method | MAC filter
Filter (# taps) | Slices | Msps  | Slices | Msps
6               | 264    | 296   | 219    | 262
10              | 475    | 296   | 418    | 253
13              | 387    | 296   | 462    | 253
20              | 851    | 271   | 790    | 251
28              | 1303   | 305   | 886    | 251
41              | 2178   | 296   | 1660   | 243
61              | 3284   | 247   | 1947   | 242
119             | 6025   | 294   | 3581   | 241
151             | 7623   | 294   | 7631   | 215
Experimental Results: MAC vs. Add and Shift Method
[Chart: resource utilization (# of slices) vs. # of taps, MAC vs. Add and Shift]
[Chart: performance (Msps) vs. # of taps, Add and Shift vs. MAC]
CSA CSE for Linear Systems
Example (D^S and D^C denote the sum and carry words produced by the CSA computing divisor D):
Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3

D1 = X1 + X2 + X2<<1
Y1 = (D1^S + D1^C) + X1<<2 + X2<<2
Y2 = (D1^S + D1^C)<<2

Algebraic methods: greedy iterative algorithm
- Extracts the "best" 3-term divisor
- Rewrites the expressions containing it
- Terminates when there are no more common subexpressions

F1 = a + b + c + d + e
F2 = a + b + c + d + f
>> D1 = a + b + c
F1 = D1^S + D1^C + d + e
F2 = D1^S + D1^C + d + f
>> D2 = D1^S + D1^C + d
F1 = D2^S + D2^C + e
F2 = D2^S + D2^C + f
Experimental Results
[Chart: number of CSAs per benchmark example, original vs. optimized - average 38.4% reduction]
Experimental Results: FPGA Synthesis
- Virtex II FPGAs; synthesized the designs and performed place & route
[Chart: % reduction in LUTs and slices for H.264, DCT8, IDCT8, and 6/20/41-tap FIR examples]
- Avg 14.1% reduction in # slices and avg 12.9% reduction in # LUTs
- Avg 5.7% increase in the delay
Conclusions
- Optimized an acoustic modem by focusing on channel estimation and FIR filters
- In-depth study of parallelization, number representation, arithmetic, and linear system optimization