Digital Arithmetic CSE 237D: Spring 2008 Topic #8 Professor Ryan Kastner.
Digital Arithmetic
CSE 237D: Spring 2008 Topic #8
Professor Ryan Kastner
Data Representation
- Floating point representation: large dynamic range and high precision, but costly
- Fixed point representation: requires fewer resources, with comparable performance
- Bitwidth analysis trades off estimation accuracy against the number of fixed-point bits; for this application, 8 bits is sufficient
Moorea Modem Receiver Specification
Note: 112 samples/symbol + 112 samples for channel clearing.
[Figure: receiver block diagram - four MatchingPursuitCore units feed an arg min_i stage implementing a generalized multiple hypothesis test (GMHT)]
Walsh/m-Sequence Waveforms
- Chip rate: 5 kcps, approx. 5 kHz bandwidth; uses a 25 kHz carrier.
- Uses a 7-chip m-sequence c per Walsh symbol, 8 bits per Walsh symbol b_i. Composite symbol duration is thus T = 11.2 msec (longer than the maximum multipath spread).
- Symbol rate is 266 bps, or 133 bps using an 11.2 msec time guard band for channel clearing.
Transmitted Signal
[Figure: Walsh/m-sequence signal parameters - 8 Walsh symbols modulating the chip pattern 1 1 -1 1 -1 -1 -1 1 1 -1 1 -1 -1 -1 -1 -1 1 -1 1 1 1]
Channel Estimation
- Goal: map matching pursuits to a reconfigurable device
- Parameterizable: number of samples, data representation
- Tradeoffs: provides designs with various area, latency, energy, ...
Matching Pursuits Algorithm
Matching Pursuits Core
Reconfigurable System
MP( r, S, A, a )
 1  for i = 1, 2, ..., N_S              // compute matched filter (MF) outputs
 2      v_i <- S_i^T r
 3      f_i <- 0
 4      g_i <- 0
 5  end for
 6  q_0 <- 0                            // do successive interference cancellation
 7  for j = 1, 2, ..., N_f              // update MF outputs
 8      V_j <- V_{j-1} - f_{q_{j-1}} A_{q_{j-1}}
 9      for k = 0, 1, ..., N_S - 1
10          g_k <- v_k / a_k
11          Q_k <- (v_k)* g_k
12      end for
13      q_j <- argmax over k not in {q_1, ..., q_{j-1}} of { Q_k }
14      f_{q_j} <- g_{q_j}
15  end for
16  return ( f )
[Figure: matching pursuits core datapath mapped to FPGA resources - CLBs, Block RAM, and an IP core multiplier implementing the multiply, add/subtract, and control logic for v_i, g_k, Q_k, and f_{q_j}]
System Design Tools
In Depth: Data Representation
History of Number Systems
- Oldest number system? Fingers, but only 10; toes, but only 20
- Base 10, "digit"-al
- Roman schools taught finger counting, including multiplication/division on hands/toes
"Counting in binary is just like counting in decimal if you are all thumbs." ~ Glaser and Way
- Sand tables: stones in the sand; three grooves with up to ten stones per groove
- "Calculate" is said to derive from the Latin word "calcis" because limestone was used in the first sand tables.
"Base eight is just like base ten really, if you're missing two fingers." ~ Tom Lehrer
Key Idea: Formal Notation
Notches on bones – 8500 BC in Africa, Europe
Count in multiples of some basic number: 5 or 10 based on fingers; the Mayans used 360, the Babylonians 60
Greeks, Romans extended this – fundamentally still the same
Positional notation key – same symbol in different spots has different meaning
Numbers
Any number system requires:
- A set of digits
- A set of possible values for the digits
- Rules for interpreting the digits and values as a number

Example: Roman numerals, where symbols represent values:
1 = I 100 = C
5 = V 500 = D
10 = X 1000 = M
50 = L
For example: 2008 = MMVIII
Unsigned Number Systems

Unsigned integer decimal system:
- Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
- Set of values for the digits: S_i = {0, 1, 2, ..., 9}
- Rule for determining the number: X = sum for i = 0 to n-1 of X_i * 10^i

Unsigned binary system:
- Set of digits represented by a digit vector X = (X_{n-1}, X_{n-2}, ..., X_1, X_0)
- Set of values for the digits: S_i = {0, 1}
- Rule for determining the number: X = sum for i = 0 to n-1 of X_i * 2^i

Source: Parhami
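Both rules are the same weighted positional sum, differing only in radix; a minimal sketch in Python (the function name is illustrative, not from the slides):

```python
def digit_vector_value(digits, radix):
    # Value of digit vector X = (X_{n-1}, ..., X_1, X_0):
    # sum over i of X_i * radix**i, with X_0 the rightmost digit
    return sum(x * radix ** i for i, x in enumerate(reversed(digits)))
```

Calling it with radix 10 or 2 reproduces the two rules above.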
Other Useful Encodings
Some 4-bit number representation formats:
- Base-2 logarithm
- Exponent in {-2, -1, 0, 1}
- Significand in {0, 1, 2, 3}
Source: Parhami
Encoding Numbers in 4 Bits
[Figure: number line from -16 to +16 showing the values reachable by each 4-bit format]
- Unsigned integers
- Signed-magnitude
- 3 + 1 fixed-point, xxx.x
- Signed fraction, .xxx
- 2's-compl. fraction, x.xxx
- 2 + 2 floating-point, s * 2^e, e in [-2, 1], s in [0, 3]
- 2 + 2 logarithmic (log x = xx.xx)
Source: Parhami
Sign and Magnitude Representation
[Figure: 4-bit code wheel - bit patterns 0000..1111 around the circle map to signed-magnitude values +0..+7 and -0..-7; e.g. 0010 is +2 and 1110 is -6. Incrementing the representation increments positive values but decrements negative ones.]
Source: Parhami
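The wheel can be mimicked in software; a small sketch of 4-bit sign-magnitude encoding (function names are illustrative), which also exposes the two representations of zero:

```python
def sm_encode(value, k=4):
    # k-bit sign-magnitude: sign bit followed by (k-1) magnitude bits
    sign = 1 if value < 0 else 0
    mag = abs(value)
    assert mag < (1 << (k - 1)), "magnitude out of range"
    return (sign << (k - 1)) | mag

def sm_decode(bits, k=4):
    sign = (bits >> (k - 1)) & 1
    mag = bits & ((1 << (k - 1)) - 1)
    return -mag if sign else mag
```

Note that 0000 and 1000 both decode to zero (+0 and -0).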
Sign and Magnitude Adder
[Figure: signed-magnitude adder/subtractor - selective-complement blocks on inputs x and y and on the sum s, an unsigned adder with c_in/c_out, and control logic that derives the complement signals Compl x, Compl s and the result sign from Sign x, Sign y, and the Add/Sub control]
Source: Parhami
Biased Representations
[Figure: 4-bit code wheel for the biased-by-8 representation - bit patterns 0000..1111 map to signed values -8..+7; e.g. 0010 is -6 and 1110 is +6. Incrementing the representation always increments the value.]
Source: Parhami
Arithmetic with Biased Numbers
- Addition/subtraction of biased numbers:
  x + y + bias = (x + bias) + (y + bias) - bias
  x - y + bias = (x + bias) - (y + bias) + bias
- A power-of-2 (or 2^a - 1) bias simplifies addition/subtraction
- Comparison of biased numbers: compare like ordinary unsigned numbers; find the true difference by ordinary subtraction
- We seldom perform arbitrary arithmetic on biased numbers. Main application: the exponent field of floating-point numbers
Source: Parhami
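The two bias identities can be checked directly; a sketch assuming an integer biased-by-8 format as in the 4-bit example (names illustrative):

```python
BIAS = 8  # biased-by-8, matching the 4-bit wheel above

def enc(x):
    return x + BIAS          # biased representation of x

def dec(r):
    return r - BIAS

def biased_add(rx, ry):
    # (x + bias) + (y + bias) - bias  represents  x + y
    return rx + ry - BIAS

def biased_sub(rx, ry):
    # (x + bias) - (y + bias) + bias  represents  x - y
    return rx - ry + BIAS
```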
One’s Complement Number Representation
- One's complement = digit complement (diminished radix complement) system for r = 2
- M = 2^k - ulp
- (2^k - ulp) - x = x_compl
- Range of representable numbers with k whole bits: from -2^{k-1} + ulp to 2^{k-1} - ulp
[Figure: 4-bit code wheel - unsigned patterns 0000..1111 map to 1's-complement values +0..+7 and -7..-0; e.g. 0010 is +2 and 1110 is -1.]
Source: Parhami
Two’s Complement Number Representation
- Two's complement = radix complement system for r = 2
- M = 2^k
- 2^k - x = [(2^k - ulp) - x] + ulp = x_compl + ulp
- Range of representable numbers with k whole bits: from -2^{k-1} to 2^{k-1} - ulp
[Figure: 4-bit code wheel - unsigned patterns 0000..1111 map to 2's-complement values 0..+7 and -8..-1; e.g. 0010 is +2 and 1110 is -2.]
Source: Parhami
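The identity 2^k - x = x_compl + ulp is easy to verify in code; a sketch for 4-bit integers, where ulp = 1 (names illustrative):

```python
K = 4  # k whole bits

def twos_compl(x, k=K):
    # 2^k - x == bitwise (one's) complement of x, plus ulp (here ulp = 1)
    mask = (1 << k) - 1          # 2^k - ulp
    ones = (~x) & mask           # (2^k - ulp) - x, the one's complement
    return (ones + 1) & mask     # add ulp, wrap mod 2^k
```

Applying it twice recovers the original value, as expected of a complement.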
Two’s Complement Adder/Subtractor
[Figure: an adder whose y input passes through a mux selecting y (add) or y-complement (subtract); the add/sub control line (0 for addition, 1 for subtraction) also drives c_in, so s = x +/- y. The mux can be replaced with k XOR gates.]
Source: Parhami
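The mux-to-XOR observation can be sketched behaviorally (a functional model, not a gate netlist; names are illustrative):

```python
def add_sub(x, y, sub, k=4):
    # k-bit two's complement add/sub: XOR each bit of y with the control bit
    # (k XOR gates in place of the mux) and feed the same control bit to c_in,
    # so the result is x + y when sub = 0 and x + ~y + 1 = x - y when sub = 1
    mask = (1 << k) - 1
    y_eff = (y ^ (mask if sub else 0)) & mask
    return (x + y_eff + sub) & mask   # sub doubles as c_in
```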
Sign and Magnitude vs Two’s Complement
[Figure: the signed-magnitude adder and the two's-complement adder/subtractor from the previous slides, side by side]
- Signed-magnitude adder/subtractor is significantly more complex than a simple adder
- Two's-complement adder/subtractor needs very little hardware other than a simple adder
Fixed Point Representations
- Allows us to use rational numbers: a/b
- Numbers represented in the form: X = X_{a-1} X_{a-2} ... X_1 X_0 . X_{-1} X_{-2} ... X_{-b}
- Value: X = sum for i = -b to a-1 of X_i * 2^i
- Unsigned mapping: X = (1/2^b) * sum for i = 0 to n-1 of 2^i * X_i
- Two's complement mapping: X = (1/2^b) * [ -2^{n-1} * X_{n-1} + sum for i = 0 to n-2 of 2^i * X_i ]
Fixed Point Properties
- Resolution: smallest non-zero magnitude. Directly related to the number of fractional bits (b). Unsigned binary fixed point: resolution = 1/2^b
- Range: difference between the most positive and most negative number. Unsigned binary fixed point: range = 2^a - 2^{-b}. Largely dependent on the number of integer bits
- Accuracy: magnitude of the maximum difference between a real value and its representation. Unsigned binary fixed point: accuracy = 1/2^{b+1}, i.e. accuracy(x) = resolution(x)/2. With one fractional bit, the worst-case value is 1/4, since it is 1/4 away from both 0 and 1/2, which are representable with 1 fractional bit.
Example
Denote unsigned fixed point systems as U(a,b). Given the fixed point number system U(6,2):
- What number does 8A_16 represent?
- What is the range of U(6,2)? What is the resolution? What is the accuracy?
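These questions can be answered mechanically; a sketch for U(6,2) (function name is illustrative):

```python
def u_value(bits, a, b):
    # Interpret an (a+b)-bit pattern as the unsigned fixed point number U(a,b)
    assert 0 <= bits < (1 << (a + b))
    return bits / (1 << b)

a, b = 6, 2
val = u_value(0x8A, a, b)        # 1000_1010 -> 100010.10 = 34.5
resolution = 2.0 ** -b           # 1/2^b = 0.25
rng = 2.0 ** a - 2.0 ** -b       # 2^a - 2^-b = 63.75
accuracy = resolution / 2        # 1/2^(b+1) = 0.125
```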
Rules of Fixed Point Arithmetic
- Unsigned wordlength U(a,b): a + b bits
- Signed wordlength S(a,b): a + b + 1 bits
- Unsigned range U(a,b): 0 <= x <= 2^a - 2^{-b}
- Signed range S(a,b): -2^a <= x <= 2^a - 2^{-b}
- Addition: Z(a+1,b) = X(a1,b1) + Y(a2,b2); X and Y must be scaled, i.e. a1 = a2 and b1 = b2
- Unsigned multiplication: U(a1,b1) x U(a2,b2) = U(a1 + a2, b1 + b2)
- Signed multiplication: S(a1,b1) x S(a2,b2) = S(a1 + a2 + 1, b1 + b2)
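The unsigned multiplication rule says the raw integer product is already the right bit pattern; only the binary point moves to b1 + b2 fractional bits. A sketch (names and values illustrative):

```python
def umul(x_bits, a1, b1, y_bits, a2, b2):
    # U(a1,b1) x U(a2,b2) = U(a1+a2, b1+b2): multiply the raw integers;
    # the product's binary point sits b1+b2 bits from the right
    return x_bits * y_bits, (a1 + a2, b1 + b2)

# 1.5 is 0b11 in U(2,1); 2.25 is 0b1001 in U(2,2)
prod, (pa, pb) = umul(0b11, 2, 1, 0b1001, 2, 2)
```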
In Depth: Arithmetic Operations
1 Bit Addition
- Half Adder (HA): inputs A, B; outputs S, C_out. A (2 : 2) counter.
- Full Adder (FA): inputs A, B, C_in; outputs S, C_out. Built from two HAs; a (3 : 2) counter.
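A behavioral model of the two counters (names illustrative):

```python
def half_adder(a, b):
    # (2 : 2) counter: sum = a XOR b, carry = a AND b
    return a ^ b, a & b

def full_adder(a, b, cin):
    # (3 : 2) counter: two half-adders plus an OR for the carries
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2
```

For every input combination, 2*C_out + S equals the number of 1s among the inputs, which is what "counter" means here.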
Half Adder Implementations
[Figure: (a) AND/XOR half-adder; (b) NOR-gate half-adder; (c) NAND-gate half-adder with complemented carry - each producing s and c from x and y]
Source: Parhami
Full Adder Implementations
[Figure: (a) FA built of two half-adders; (b) FA built as a two-level AND-OR circuit; (c) mux-based FA suitable for CMOS realization - each producing s and c_out from x, y, and c_in]
Source: Parhami
Bit Serial Addition
Perform addition one bit at a time Xi + Yi + C0-(i-1)
Result stored in registered that is right shifted Slow but small area
Ripple Carry Adder
[Figure: n-bit ripple carry adder - a chain of full adders with inputs A_i, B_i and outputs S_i, the carry rippling from C_in through each FA to C_out]
- "Bit parallel adder". Area, delay?
- Abstraction: an n-bit two-operand adder with n-bit inputs A, B, n-bit output S, and C_in/C_out
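A behavioral sketch of the ripple carry adder, one full-adder step per bit (names illustrative); the loop-carried `c` is exactly the rippling carry, so delay grows as O(n):

```python
def ripple_carry_add(x, y, cin=0, n=8):
    # Chain n full adders; the carry ripples from bit 0 to bit n-1
    s, c = 0, cin
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        s |= (xi ^ yi ^ c) << i                  # sum bit
        c = (xi & yi) | (c & (xi ^ yi))          # generate OR (propagate AND carry)
    return s, c                                  # n-bit sum and c_out
```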
Another View of Ripple Carry Adder
[Figure: each bit position i derives generate/propagate signals G_i, P_i from A_i, B_i; a carry network turns C_0 and the (G_i, P_i) pairs into the carries C_1..C_4]
Faster Addition
- We need to break the carry chain
- The carry recurrence: c_{i+1} = g_i + p_i * c_i
- Observation: a carry only propagates in certain situations

Bit positions: 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
operand bits:   1  0  1  1  0  1 1 0 0 1 1 0 1 1 1 0
carries:   cout 0  1  0  1  1  0 0 1 1 1 0 0 0 0 1 1 cin
Carry chains and their lengths: 4, 6, 3, 2
[Figure: the unrolled carry chain - each carry c_1 .. c_k computed from the generate/propagate pairs (g_0, p_0) .. (g_{k-1}, p_{k-1}) and c_0]
Manchester Adder
[Figure: a chain of switched carry chain (SCC) cells, one per bit; a KGP block derives Kill, Generate, Propagate signals K_i, G_i, P_i from A_i, B_i, and each SCC cell kills, generates, or passes the carry from C_i to C_{i+1}]
Carry Look Ahead

A B | C-out
0 0 | 0      "kill"
0 1 | C-in   "propagate"
1 0 | C-in   "propagate"
1 1 | 1      "generate"

G = A and B
P = A xor B

C0 = Cin
C1 = G0 + C0*P0
C2 = G1 + G0*P1 + C0*P0*P1
C3 = G2 + G1*P2 + G0*P1*P2 + C0*P0*P1*P2
C4 = ...

[Figure: 4-bit adder with per-bit G/P blocks feeding a lookahead network that produces every carry directly]
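The recurrence and its unrolled equations can be checked with a short model (names illustrative):

```python
def cla_carries(x, y, c0=0, n=4):
    # g_i = x_i AND y_i, p_i = x_i XOR y_i; iterating c_{i+1} = g_i + p_i*c_i
    # yields the same carries as the unrolled two-level AND-OR equations
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    c = [c0]
    for i in range(n):
        c.append(g[i] | (p[i] & c[i]))
    return c                         # [c0, c1, ..., cn]

x, y = 0b1011, 0b0110
c = cla_carries(x, y)
s = 0
for i in range(4):
    s |= ((((x >> i) ^ (y >> i)) & 1) ^ c[i]) << i   # sum bit: p_i XOR c_i
total = s | (c[4] << 4)
```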
Plumbing as Carry Lookahead Analogy
[Figure: water-pipe analogy - a carry c_{i+1} "flows" if some g_j opens a valve and every intervening p is open, illustrated for c_1, c_2, and c_4]
2 Bit Carry Lookahead Adder
[Figure: two bit slices (A_0, B_0 and A_1, B_1) with G/P logic producing S_0, S_1, C_1, C_2, plus the block-level signals]
Block generate/propagate:
P^(1) = P0*P1
G^(1) = G0*P1 + G1
Source: Parhami
4 Bit Carry Look Ahead
- Complexity reduced by deriving the carry-out indirectly, but this increases the critical path
- Full carry lookahead is quite practical for a 4-bit adder:
c1 = g0 + c0*p0
c2 = g1 + g0*p1 + c0*p0*p1
c3 = g2 + g1*p2 + g0*p1*p2 + c0*p0*p1*p2
c4 = g3 + g2*p3 + g1*p2*p3 + g0*p1*p2*p3 + c0*p0*p1*p2*p3
Carry Look Ahead, Multiple Levels
[Figure: a 16-bit adder built from four 4-bit lookahead blocks over bits 0..15; each block exports its G_i, P_i to a second-level carry lookahead generator that produces C_1, C_2, C_3, and C_16 from C_0]

Cascaded Carry Look-ahead (16-bit): Abstraction
Four 4-bit adders, each exporting block G and P, feed a CLA unit computing:
C1 = G0 + C0*P0
C2 = G1 + G0*P1 + C0*P0*P1
C3 = G2 + G1*P2 + G0*P1*P2 + C0*P0*P1*P2
C4 = ...
Carry Lookahead Generator Plumbing Analogy
[Figure: plumbing analogy - (g_i, p_i) pairs combined into block generate/propagate signals and the block carry]

4 Bit Hierarchical CLA
[Figure: two 2-bit CLAs (inputs A_0..A_3, B_0..B_3) produce first-level signals G^(1)_0, P^(1)_0, G^(1)_1, P^(1)_1 and carry C_2; a 2-bit carry lookahead generator (CLG) combines them:]
P^(2) = P^(1)_0 * P^(1)_1
G^(2) = G^(1)_0 * P^(1)_1 + G^(1)_1
C4 = C0*P^(1)_0*P^(1)_1 + G^(1)_0*P^(1)_1 + G^(1)_1
8 Bit Hierarchical CLA
[Figure: four 2-bit CLAs produce (G^(1)_i, P^(1)_i) for i = 0..3 and the carries C_2, C_4, C_6, C_8; two 2-bit CLGs combine them into second-level signals (G^(2)_0, P^(2)_0) and (G^(2)_1, P^(2)_1), which a third 2-bit CLG combines with C_0]
Design Trick: Guess (or "Precompute")
- Two n-bit adders in series: CP(2n) = 2*CP(n)
- Instead, compute the upper n bits twice, once for each possible carry-in, and select the right result with a mux: CP(2n) = CP(n) + CP(mux)
- This is the carry-select adder
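The precompute-and-select trick can be sketched behaviorally, with one mux level in place of the second carry ripple (names illustrative):

```python
def carry_select_add(x, y, n=8):
    # Split into two n/2-bit halves; compute the upper half for BOTH possible
    # carry-ins in parallel, then select with the lower half's real carry
    h = n // 2
    mask = (1 << h) - 1
    lo = (x & mask) + (y & mask)
    c_lo, s_lo = lo >> h, lo & mask
    xu, yu = (x >> h) & mask, (y >> h) & mask
    hi0 = xu + yu            # guess: carry-in = 0
    hi1 = xu + yu + 1        # guess: carry-in = 1
    hi = hi1 if c_lo else hi0                    # the mux
    return ((hi & mask) << h) | s_lo, hi >> h    # sum and c_out
```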
Pipelined Ripple Carry Adder
[Figure: an n-bit ripple carry adder with flip-flop stages inserted between bit slices (n-1 FFs, n-2 FFs, n-3 FFs, ...) so that several additions are in flight at once]
Multiple Operand Addition
- Many applications require summation of many operands
- What is the best way to compute this?
[Figure: dot-notation examples - the shifted partial products of a multiplication p = a * x, and the seven operands p^(0)..p^(6) of an inner product summed into s]
Serial Implementation
[Figure: a two-operand carry propagate adder in a loop - operand O_i[n] and the running sum S_i[n + log i] feed the CPA, producing S_{i+1}[n + log(i+1)] into register S]
T_serial-multi-add = O(m log(n + log m)) = O(m log n + m log log m)
Therefore, addition time grows superlinearly with m when n is fixed and logarithmically with n for a given m
Parallel Implementation
[Figure: a binary tree of carry propagate adders (CPAs) of depth O(log m), reducing m n-bit operands O_1[n]..O_m[n] to S[n + log m]]
T_tree-fast-multi-add = O(log n + log(n + 1) + ... + log(n + log2(m) - 1))
                      = O(log m log n + log m log log m)
Can we do this faster?
Carry Save Adder (CSA)
[Figure: an n-bit carry save adder is simply n full adders in parallel, one per bit position, with no carry chain between them. Three n-bit operands O1[n], O2[n], O3[n] go in; an n-bit sum word S[n] and an n-bit carry word C[n] come out.]
Source: Parhami
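Because the n full adders are independent, the whole CSA is two bitwise expressions; a sketch (names illustrative):

```python
def carry_save_add(o1, o2, o3):
    # n independent full adders: per-bit sum word plus per-bit carry word,
    # the carries weighted x2 (shifted left) but NOT propagated
    s = o1 ^ o2 ^ o3
    c = ((o1 & o2) | (o1 & o3) | (o2 & o3)) << 1
    return s, c

s, c = carry_save_add(25, 17, 22)
```

A single carry-propagate addition of S and C then finishes the job.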
Carry Save Adders
[Figure: cutting the carry chain of a row of FAs turns a carry-propagate adder into a carry-save adder (CSA), also called a (3; 2)-counter or 3-to-2 reduction circuit. Carry-propagate adder (CPA) and carry-save adder functions shown in dot notation, with full- and half-adder blocks, their inputs, and outputs.]
Serial CSA Implementation
[Figure: the serial accumulation loop with the CPA replaced by a carry save adder - operand O_i[n] and the running pair S_i[n + log i], C_i[n + log i] feed the CSA; the outputs S_{i+1}, C_{i+1} go into registers S and C]
T_serial-csa-multi-add = O(m)
At the end, two operands remain (C, S)
Carry Propagate Adder
[Figure: a final CPA adds the sum word and the shifted carry word bit by bit - S[i] with C[i-1] at each position, an HA at the low end - producing T[1]..T[n+2] and C_out]
Final Reduction (2:1)
[Figure: a chain of CSAs reduces six n-bit operands O1..O6 to a single sum/carry pair - CSA(O1, O2, O3) and CSA(O4, O5, O6) produce intermediate pairs S1[n:1], C1[n+1:2]; further CSAs combine the intermediate S and C words into S3[n+2:1] and C3[n+2:3]. The bit alignments are shown in dot notation, with each carry word offset one position left of its sum word.]
Carry Save Arithmetic
[Figure: a tree of four CSAs reduces six operands A..F to a sum/carry pair, followed by a CLA for the final addition]
- Delay = 3 + log2(M + 3), where 3 is the height of the CSA tree and M is the bitwidth of the operands
- Tree height = log1.5(N/2) for N operands
Carry Save Arithmetic
Using ripple carry adders (RCAs) instead:
[Figure: a linear chain of RCAs of widths (M+1), (M+2), (M+3), (M+4), (M+5) summing the same operands, with Delay = (M+5) + 4]
[Chart: delay (in full-adder delays) vs. number of operands, 2..50 - the RCA chain grows far faster than the CSA tree]
[Chart: area (in full-adder units) vs. number of operands - RCA and CSA are comparable]
- Delay through the CSA network = 3 + log1.5(M + 3)
Source: Parhami
Example Reduction by a CSA Tree
Addition of seven 6-bit numbers in dot notation: 12 FAs, then 6 FAs, then 6 FAs, then 4 FAs + 1 HA, then a 7-bit adder.
Total cost = 7-bit adder + 28 FAs + 1 HA

The same seven-operand addition in tabular form (dots per bit position):

Bit position:  8 7 6 5 4 3 2 1 0
                   7 7 7 7 7 7      12 FAs
                 2 5 5 5 5 5 3      6 FAs
                 3 4 4 4 4 4 1      6 FAs
               1 2 3 3 3 3 2 1      4 FAs + 1 HA
               2 2 2 2 2 1 2 1      7-bit adder (carry-propagate)
               1 1 1 1 1 1 1 1 1

- A full-adder compacts 3 dots into 2 (compression ratio of 1.5)
- A half-adder rearranges 2 dots (no compression, but still useful)
Source: Parhami
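The 7-to-2 reduction can be emulated by repeatedly applying a (3; 2) reduction until two operands remain; a sketch with illustrative 6-bit values (not the slides' example):

```python
def csa(o1, o2, o3):
    # (3; 2) reduction: bitwise sum word plus shifted bitwise carry word
    s = o1 ^ o2 ^ o3
    c = ((o1 & o2) | (o1 & o3) | (o2 & o3)) << 1
    return s, c

ops = [37, 12, 63, 9, 41, 5, 58]     # seven 6-bit numbers (illustrative)
while len(ops) > 2:                  # each CSA replaces 3 operands with 2
    s, c = csa(ops[0], ops[1], ops[2])
    ops = ops[3:] + [s, c]
```

The surviving pair is then added by one carry-propagate adder.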
Wallace and Dadda Reduction Trees
- Wallace tree: reduce the number of operands at the earliest possible opportunity
- Dadda tree: postpone the reduction to the extent possible without causing added delay
- Adding seven 6-bit numbers using Dadda's strategy: 6 FAs, 11 FAs, 7 FAs, 4 FAs + 1 HA, then a 7-bit adder. Total cost = 7-bit adder + 28 FAs + 1 HA, the same total as the eager reduction.

Dadda height sequence:
h    | 2 | 3 | 4 | 5  | 6
n(h) | 4 | 6 | 9 | 13 | 19

Source: Parhami
Generalized Parallel Counters
- A (5, 5; 4)-counter: dot notation, and its use for reducing five numbers to two numbers (multicolumn reduction)
- A (2, 3; 3)-counter: unequal columns
- Generalized parallel counter = parallel compressor
Compressors
- Compressors allow for carry-ins and carry-outs
[Figure: a [4:2] compressor at bit i, built from two FAs - inputs O1[i]..O4[i] plus Cin[i] from bit i-1; outputs S[i], C[i], and Cout[i] to bit i+1]
[4 : 2] Compressor Adder
[Figure: an n-bit [4:2] adder - one [4:2] compressor per bit position, taking four n-bit operands O1[n]..O4[n] and producing sum and carry words S[n], C[n], with the intermediate carries passed between adjacent bit positions]
Higher Order Compressors
[Figure: a [5:2] compressor per bit position, built from three FAs - inputs O1[i]..O5[i] plus intermediate carries from bit i-1; outputs S[i] and C[i]]
Linear System Optimization
- Linear systems are ubiquitous in signal processing applications
- We have developed many methods for optimizing them to hardware, software, and FPGAs [ASAP04, ASPDAC05, DATE06, ICCD06, Journal of VLSI Signal Processing 07]
- The 1-D linear systems on the previous slide are FIR filters
[Equation: a 4-point linear transform y = C x, where C is a 4x4 matrix of cosine constants (cos 0, cos(k pi/8), cos(k pi/4), ...) multiplying inputs x0..x3 to produce y0..y3]
[Figure: FIR filter structure - input X[n] feeds a chain of z^-1 delays; the taps are multiplied by coefficients h0, h1, ..., hL-2, hL-1 and summed to produce y[n]]
FIR Filter Implementations: Multiply Accumulate Method
- Convolution of the latest L input samples, where L is the number of coefficients h(k) of the filter and x(n) is the input time series: y[n] = sum over k = 0, 1, ..., L-1 of h[k] * x[n-k]
- Disadvantages:
  - Large area on FPGA due to the multipliers, even though the full flexibility of general purpose multipliers is not required
  - Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
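The MAC formulation is a direct sum, one multiply-accumulate per tap; a sketch with an illustrative 3-tap filter (names and values are not from the slides):

```python
def fir_mac(h, x, n):
    # y[n] = sum over k of h[k] * x[n-k], skipping samples before time 0
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))

h = [1, 2, 1]        # illustrative 3-tap coefficients
x = [3, 4, 5, 6]     # illustrative input samples
```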
FIR Filter Implementations: Distributed Arithmetic
Summation of inner products, where A_k are constant coefficients and X_k the input data:

Y = sum for k = 1 to K of A_k * X_k

We can write each input in two's complement fractional form (X_{k0} is the sign bit, X_{kb} the remaining bits):

X_k = -X_{k0} + sum for b = 1 to B-1 of X_{kb} * 2^{-b}

Substituting this into the above yields:

Y = sum for k = 1 to K of A_k * [ -X_{k0} + sum for b = 1 to B-1 of X_{kb} * 2^{-b} ]

Exchanging the order of the summations:

Y = sum for b = 1 to B-1 of [ sum for k = 1 to K of A_k * X_{kb} ] * 2^{-b}
  + sum for k = 1 to K of A_k * (-X_{k0})
From the previous slide: how do we compute the bracketed term?
- It multiplies bit b of each input by the binary constants A_1, A_2, ..., A_K
- Questions (assume b = 1; this generalizes to any b):
  - What if every X_{k1} is 0, i.e. the bits are [0 0 0 ... 0]?
  - What if X_{11} = 1 and the rest are 0, i.e. [0 0 0 ... 1]?
  - What if X_{11} = 1, X_{21} = 1 and the rest are 0, i.e. [0 0 ... 1 1]?
FIR Filter Implementations: Distributed Arithmetic
Looking at the summations in a different way:

Y = sum for k = 1 to K of A_k * [ -X_{k0} + sum for b = 1 to B-1 of X_{kb} * 2^{-b} ]

Expanded over k:

Y =  A_1 * ( -X_{10} + X_{11}*2^{-1} + X_{12}*2^{-2} + ... + X_{1(B-1)}*2^{-(B-1)} )
   + A_2 * ( -X_{20} + X_{21}*2^{-1} + X_{22}*2^{-2} + ... + X_{2(B-1)}*2^{-(B-1)} )
   ...
   + A_K * ( -X_{K0} + X_{K1}*2^{-1} + X_{K2}*2^{-2} + ... + X_{K(B-1)}*2^{-(B-1)} )

Regrouped by bit position:

Y =  [ A_1*X_{11} + A_2*X_{21} + A_3*X_{31} + ... + A_K*X_{K1} ] * 2^{-1}
   + [ A_1*X_{12} + A_2*X_{22} + A_3*X_{32} + ... + A_K*X_{K2} ] * 2^{-2}
   ...
   + [ A_1*X_{1(B-1)} + A_2*X_{2(B-1)} + ... + A_K*X_{K(B-1)} ] * 2^{-(B-1)}
   + A_1*(-X_{10}) + A_2*(-X_{20}) + A_3*(-X_{30}) + ... + A_K*(-X_{K0})
Each bracketed term is a function of only the K bits X_{1b}, ..., X_{Kb}, so it can be precomputed into a 2^K-entry LUT:

Address   Value
00...00   0
00...01   A1
00...10   A2
00...11   A1 + A2
...       ...
11...11   A1 + A2 + ... + AK

- Precision of the stored constants: usually equal to the precision B of the input data
- The LUT output is accumulated with a shift (>>) to apply the 2^{-b} weights, producing Y
FIR Filter Implementations: Distributed Arithmetic
- Advantages: replaces multiplication with LUT lookups; coefficients stored in LUTs
- Disadvantages:
  - Performance limited: the next input sample can be processed only after every bit of the current input sample has been processed
  - Increasing the number of bits to be processed has a significant effect on resource utilization
  - A larger scaling accumulator is needed for a higher number of bits, which increases the critical path delay
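The LUT-plus-shift-accumulate scheme can be sketched in software for unsigned inputs (the sign-bit correction term is omitted, and integer weights 2^b replace the slides' fractional 2^-b, which only rescales the result; all names and values are illustrative):

```python
A = [3, 5, 7]        # constant coefficients A_1..A_K (illustrative)
K = len(A)

# 2^K-entry LUT: the entry at address (X_Kb ... X_1b) holds the sum of the
# coefficients whose address bit is 1 -- the bracketed term for one bit b
LUT = [sum(A[k] for k in range(K) if (addr >> k) & 1) for addr in range(1 << K)]

def da_dot(xs, B=4):
    # For each bit position b, form the address from bit b of every input,
    # look it up, shift by b, and accumulate
    y = 0
    for b in range(B):
        addr = sum(((xs[k] >> b) & 1) << k for k in range(K))
        y += LUT[addr] << b
    return y
```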
FIR Filter Implementations: Distributed Arithmetic
[Figure: serial DA implementation - input bits x0[i]..x3[i] and x4[i]..x7[i] address two LUTs; the LUT outputs are added and fed to a scaling accumulator (register plus shift)]

LUT contents:
Address  Data
0000     0
0001     C0
0010     C1
...      ...
1111     C0 + C1 + C2 + C3
FIR Filter Implementations: Distributed Arithmetic
[Figure: replicated DA implementation - a second bank of LUTs processes bit i+1 while the first processes bit i, their outputs combined before the scaling accumulator]
- Performance improved by replication: process multiple bits at a time
- Significant effect on resource utilization: more LUTs, and a larger scaling accumulator
[Figure: two FIR filter structures - a transposed form in which X[n] feeds a multiplier block producing y0, y1, y2, ..., yL-1, combined through z^-1 delays and adders into y[n], and the direct form with coefficients h0 .. hL-1]
FIR Filter Implementations: Add and Shift Method
Idea: Constant Multiplication to Shift/Add
- Multiplication is expensive in hardware; decompose constant multiplications into shifts and additions:
  13*X = (1101)_2*X = X + X<<2 + X<<3
- Signed digits can reduce the number of additions/subtractions: Canonical Signed Digits (CSD) (Knuth '74)
  (57)_10 = (0111001)_2 = (100-1001)_CSD
- Further reduction is possible by common subexpression elimination: up to 50% reduction (R. Hartley, TCS '96)
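Both decompositions are easy to sanity-check (function names are illustrative):

```python
def mul13(x):
    # 13*X = (1101)_2 * X = X + X<<2 + X<<3
    return x + (x << 2) + (x << 3)

def mul57_csd(x):
    # 57 = (100-1001)_CSD = 64 - 8 + 1: the CSD form turns the run of
    # adjacent 1s in (0111001)_2 into a single subtraction
    return (x << 6) - (x << 3) + x
```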
Introduction
Common subexpressions = common digit patterns:
  F1 = 7*X  = (0111)*X = X + X<<1 + X<<2
  F2 = 13*X = (1101)*X = X + X<<2 + X<<3
Extracting the shared pattern "0101", i.e. D1 = X + X<<2:
  F1 = D1 + X<<1
  F2 = D1 + X<<3
This reduces 4 additions and 4 shifts to 3 additions and 3 shifts.
- Good for a single variable: FIR filters (transposed form)
- Multiple variables? (DFT, DCT, etc.?)
Linear Systems and Polynomial Transformation
H.264 Integer Transform:

| Y0 |   | 1  1  1  1 | | X0 |
| Y1 | = | 2  1 -1 -2 | | X1 |
| Y2 |   | 1 -1 -1  1 | | X2 |
| Y3 |   | 1 -2  2 -1 | | X3 |

Decomposing the constant multiplications:
Y0 = X0 + X1 + X2 + X3
Y1 = X0<<1 + X1 - X2 - X3<<1
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1<<1 + X2<<1 - X3
(12 additions, 4 shifts)
Linear Systems and Polynomial Transformation
H.264 Integer Transform, after the polynomial transformation: each left shift by one is written as multiplication by a variable L (X<<1 becomes X*L):
Y0 = X0 + X1 + X2 + X3
Y1 = X0*L + X1 - X2 - X3*L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1*L + X2*L - X3
(12 additions, 4 shifts)
H.264 Example
Select D0 = (X0 + X3):
Y0 = D0 + X1 + X2
Y1 = X0*L + X1 - X2 - X3*L
Y2 = D0 - X1 - X2
Y3 = X0 - X1*L + X2*L - X3

Select D1 = (X1 - X2):
Y0 = D0 + X1 + X2
Y1 = X0*L + D1 - X3*L
Y2 = D0 - X1 - X2
Y3 = X0 - D1*L - X3

Select D2 = (X1 + X2):
Y0 = D0 + D2
Y1 = X0*L + D1 - X3*L
Y2 = D0 - D2
Y3 = X0 - D1*L - X3

Select D3 = (X0 - X3):
Y0 = D0 + D2
Y1 = D1 + D3*L
Y2 = D0 - D2
Y3 = D3 - D1*L
Final Implementation
Extracting 4 divisors:
D0 = X0 + X3    Y0 = D0 + D2
D1 = X1 - X2    Y1 = D1 + D3*L
D2 = X1 + X2    Y2 = D0 - D2
D3 = X0 - X3    Y3 = D3 - D1*L

Cost: 8 additions, 2 shifts
Original: 12 additions, 4 shifts
Rectangle covering: 10 additions, 3 shifts
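The extracted-divisor version can be checked against the original matrix (test values are illustrative):

```python
M = [[1, 1, 1, 1],
     [2, 1, -1, -2],
     [1, -1, -1, 1],
     [1, -2, 2, -1]]

def h264_fast(x0, x1, x2, x3):
    # 4 divisor adds, then 4 more adds and 2 shifts: 8 additions, 2 shifts
    d0, d1 = x0 + x3, x1 - x2
    d2, d3 = x1 + x2, x0 - x3
    return [d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)]

x = [4, -3, 7, 2]    # illustrative test vector
direct = [sum(M[r][c] * x[c] for c in range(4)) for r in range(4)]
```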
FPGA FIR Filter Implementations: Add and Shift Method
Example: F1 = A + B + C + D, F2 = A + B + C + E
[Figure: the unoptimized expression trees for F1 and F2, and the optimized trees after extracting the common expression (A + B + C), or alternatively (A + B)]
Filter Implementation Using Add and Shift Method
Filter Implementation Using Xilinx Coregen (PDA)
Filter(# taps)
Slices LUTs FFsPerformance
(Msps)
6 264 213 509 251
10 474 406 916 222
13 386 334 749 252
20 856 705 1650 250
28 1294 1145 2508 227
41 2154 1719 4161 223
61 3264 2591 6303 192
119 6009 4821 11551 203
151 7579 6098 14611 180
Filter(# taps)
Slices LUTs FFsPerformance
(Msps)
6 524 774 1012 245
10 781 1103 1480 222
13 929 1311 1775 199
20 1191 1631 2288 199
28 1774 2544 3381 199
41 2475 3642 4748 222
61 3528 5335 6812 199
119 6484 9754 12539 205
151 8274 12525 15988 199
Resource Utilization + Performance Results
Experimental Results: DA vs. Add and Shift Method
[Chart: % reduction in SLICEs, LUTs, and FFs achieved by Add and Shift over DA, for filters of 6 to 152 taps]
[Chart: dynamic power consumption (mW) vs. filter size (# of taps), Add/Shift vs. Coregen]
Experimental Results: MAC vs. Add and Shift Method

                Add Shift Method | MAC filter
Filter (# taps) | Slices | Msps  | Slices | Msps
6               | 264    | 296   | 219    | 262
10              | 475    | 296   | 418    | 253
13              | 387    | 296   | 462    | 253
20              | 851    | 271   | 790    | 251
28              | 1303   | 305   | 886    | 251
41              | 2178   | 296   | 1660   | 243
61              | 3284   | 247   | 1947   | 242
119             | 6025   | 294   | 3581   | 241
151             | 7623   | 294   | 7631   | 215
Experimental Results: MAC vs. Add and Shift Method
[Chart: resource utilization (# of slices) vs. # of taps, MAC vs. Add and Shift]
[Chart: performance (Msps) vs. # of taps, Add and Shift vs. MAC]
CSA CSE for Linear Systems
Example (D^S and D^C denote the sum and carry words produced by the CSA computing divisor D):
Y1 = X1 + X1<<2 + X2 + X2<<1 + X2<<2
Y2 = X1<<2 + X2<<2 + X2<<3

D1 = X1 + X2 + X2<<1
Y1 = (D1^S + D1^C) + X1<<2 + X2<<2
Y2 = (D1^S + D1^C)<<2

Algebraic methods: greedy iterative algorithm
- Extracts the "best" 3-term divisor
- Rewrites the expressions containing it
- Terminates when there are no more common subexpressions

F1 = a + b + c + d + e
F2 = a + b + c + d + f
>> D1 = a + b + c
F1 = D1^S + D1^C + d + e
F2 = D1^S + D1^C + d + f
>> D2 = D1^S + D1^C + d
F1 = D2^S + D2^C + e
F2 = D2^S + D2^C + f
Experimental Results
[Chart: number of CSAs per benchmark example, original vs. optimized - average 38.4% reduction]
Experimental Results: FPGA Synthesis
- Virtex II FPGAs; synthesized the designs and performed place & route
[Chart: % reduction in LUTs and slices for H.264, DCT8, IDCT8, and 6/20/41-tap FIR examples]
- Avg 14.1% reduction in # slices and avg 12.9% reduction in # LUTs
- Avg 5.7% increase in the delay
Conclusions
- Optimized an acoustic modem by focusing on channel estimation and FIR filters
- In-depth study of parallelization, number representation, arithmetic, and linear system optimization