ECE 465 High Level Design Strategies Lecture Notes # 9 Shantanu Dutt Electrical & Computer Engineering University of Illinois at Chicago

ECE 465 High Level Design Strategies

Lecture Notes # 9

Shantanu Dutt

Electrical & Computer Engineering

University of Illinois at Chicago

Outline

• Circuit Design Problem
• Solution Approaches:
  – Truth Table (TT) vs. Computational/Algorithmic: yes, hardware, just like software, can implement any algorithm!
  – Flat vs. Divide-&-Conquer
  – Divide-&-Conquer:
    • Associative operations/functions
    • General operations/functions
  – Other design strategies for fast circuits:
    • Speculative computation
    • Best of both worlds (best average and best worst-case)
    • Pipelining
• Summary

Circuit Design Problem

• Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B.

• Approach 1: The TT approach -- Write down a 16-input TT (2^16 = 65,536 rows), derive a logic expression from it, minimize it, obtain a gate-based realization, etc.!

A         B         F
00000000  00000000  0
00000000  00000001  0
...
00000001  00000000  1
...
11111111  11111111  0

– Too cumbersome and time-consuming
– Fraught with possibility of human error
– Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
– Will generally have high hardware cost (including wiring) and delay

Circuit Design Problem (contd)

• Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute:

• Approach 2(a): Flat computational/programming approach:
  – Note: A TT can be expressed as a sequence of “if-then-else’s”
  – If A = 00000000 and B = 00000000 then F = 0
    else if A = 00000000 and B = 00000001 then F = 0
    ……….
    else if A = 00000001 and B = 00000000 then F = 1
    ……….
  – Essentially a re-hashing of the TT – same problems as the TT approach

Circuit Design Problem: Strategy 1: Divide-&-Conquer

• Approach 2(b): Structured algorithmic approach:
  – Be more innovative: think of the structure/properties of the problem that can be used to solve it in a hierarchical or divide-&-conquer (D&C) manner.
  – D&C approach: See if the problem can be:
    • “broken up” into 2 or more smaller subproblems; two types of breaks are possible:
      – by # of operands: partition the set of n operands into 2 or more subsets of operands
      – by operand size: break a constant # of n-bit operands into smaller-size operands (this mainly applies when the # of operands is a constant, e.g., addition of 2 #s)
    • whose solns can be “stitched up” (by a stitch-up function) to give a soln. to the parent prob.
    • also, consider whether there is a dependency between the sub-probs (results of some required to solve the other(s))
  – Do this recursively for each large subprob. until the subprobs are small enough (the leaf problems) for TT solutions
  – If the subproblems are of a similar kind (but of smaller size) to the root prob., then the breakup and stitching will also be similar; if not, they have to be broken up differently

[Figure: D&C tree. Root problem A is broken up into Subprob. A1 and Subprob. A2, and these into A1,1, A1,2, A2,1, A2,2; this is done recursively until the subprob. size is s.t. a TT-based design is doable. The solns to A1 and A2 are stitched up to form the complete soln to A. Legend: D&C breakup arrows; data/signal flow to solve a higher-level problem; possible data flow betw. sub-problems (data dependency?).]

Circuit Design Problem: Strategy 1: Divide-&-Conquer

• Especially for D&C breakups in which (a) the subproblems are of the same problem type as the root problem, and (b) there is no data dependency between subproblems, the final circuit will be a “tree” of stitch-up functions (of either the same size or different sizes at different levels; this depends on the problem being solved) with leaf functions at the bottom of the tree, as shown in the figure below for a 2-way breakup of each problem/sub-problem.

[Figure: tree of stitch-up functions with leaf functions at the bottom]

Shift Gears: Design of a Parity Detection Circuit—A Series of XORs

[Figure (a): a linearly-connected circuit. A chain of 2-i/p XOR gates takes x(0), x(1), x(2), x(3), …, x(15) in sequence and produces f.]

• No concurrency in design (a): the actual problem has available concurrency, though, and it is not exploited well in the above “linear” design
• Complete sequentialization, leading to a delay that is linear in the # of bits n (delay = (n-1)*td, i.e., roughly n*td), td = delay of 1 gate
• All the available concurrency is exploited in design (b): a parity tree (see next slide).
• Question: When can we have a circuit for an operation/function on multiple operands built of “gates” performing the same operation for fewer (generally a small number betw. 2-5) operands?
• Answer:
  (1) It should be possible to break down the n-operand function into multiple operations w/ fewer operands.
  (2) The operation should be associative. An oper. “x” is said to be associative if: a x b x c = (a x b) x c = a x (b x c).
• This implies that, for example, if we have 4 operands in a x b x c x d, we can perform this either as:
  – a x (b x (c x d)) [getting a linear delay of 3 units, or in general n-1 units for n operands]
  – or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that “x” is associative].
• Is XOR associative?
• The parenthesisation corresp. to the above ckt is:
  – (…(((x(0) xor x(1)) xor x(2)) xor x(3)) xor …) xor x(15)

Shift Gears: Design of a Parity Detection Circuit—A Series of XORs

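(b) 16-bit parity tree

Delay = (# of levels in XOR tree) * td = log2(n) * td

[Figure (b): 16-bit parity tree. Inputs x(15), x(14), …, x(1), x(0) feed 8 XOR gates with o/ps w(3,7), …, w(3,0); these feed 4 gates with o/ps w(2,3), …, w(2,0); these feed 2 gates with o/ps w(1,1), w(1,0); the root gate o/p is w(0,0) = f.]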

An example of simple designer ingenuity: a bad design would have resulted in a linear delay; an ingenious (though simple enough) & well-informed design results in a log delay, and both have the same gate i/p cost.

• If we have 4 operands in a x b x c x d, we can perform this either as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that “x” is associative].
• We can extend this idea to n operands (and n-1 operations) to perform as many of the pairwise operations as possible in parallel (and do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a (log2 n) delay.
• In fact, any parenthesisation of operands is correct for an associative operation/function, but the above one is fastest. Surprisingly, any parenthesisation leads to the same h/w cost: n-1 2-i/p gates, i.e., 2(n-1) gate i/ps. Why? Analyze.

Parenthesization of tree-circuit: (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))
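The two parenthesisations can be compared with a small software model. This is only a sketch: the function names `chain_parity` and `tree_parity` are made up for illustration, and "levels" simply counts 2-i/p XOR gates on the longest path (1 level = td).

```python
from functools import reduce
from operator import xor

def chain_parity(bits):
    """Linear design (a): (((x0 ^ x1) ^ x2) ^ ...). Returns (parity, levels, gates)."""
    n = len(bits)
    return reduce(xor, bits), n - 1, n - 1

def tree_parity(bits):
    """Tree design (b): XOR neighbouring pairs, level by level."""
    level, depth, gates = list(bits), 0, 0
    while len(level) > 1:
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        gates += len(nxt)
        if len(level) % 2:              # an odd leftover passes through to the next level
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth, gates

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(chain_parity(bits))   # (1, 15, 15)
print(tree_parity(bits))    # (1, 4, 15)
```

Both designs use n-1 = 15 gates; only the number of levels differs (15 vs. log2(16) = 4).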

D&C for Associative Operations

• Let f(x_{n-1}, …, x_0) be an associative function.
• What is the D&C principle involved in the design of an n-bit xor/parity function? Can it also lead automatically to a tree-based ckt?
• Using the D&C approach for an associative operation results in a breakup by # of operands, with the stitch-up function being the same as the original function (this is not the case for non-assoc. operations), but w/ a constant # of operands (2, if the original problem is broken into 2 subproblems)
• Also, there are no dependencies between sub-problems
• If the two sub-problems of the D&C approach are balanced (of the same size or as close to it as possible), then unfolding the D&C results in a balanced operation tree, of the type seen earlier for the xor/parity function, with (log n) delay

[Figure: f(x_{n-1}, …, x_0) is broken up into f(x_{n-1}, …, x_{n/2}) and f(x_{n/2-1}, …, x_0), with o/ps a and b; the stitch-up function f(a,b) is the same as the original function for 2 inputs.]

D&C for Associative Operations (cont’d)

• Parity detector example

Delay = (# of levels in XOR tree) * td = log2(n) * td

[Figure: the 16-bit parity tree (inputs x(15) … x(0), internal o/ps w(3,7) … w(3,0), w(2,3) … w(2,0), w(1,1), w(1,0), root o/p w(0,0) = f) viewed as a D&C breakup by operands: two 8-bit parity subproblems whose o/ps are combined by the stitch-up function, a 2-bit parity/xor.]

D&C Approach for Non-Associative Opers: n-bit > Comparator

• O/P = 1 if A > B, else 0
• Is this associative? Not clear for a breakup by bits of the 2 operands: the issue of associativity mainly applies to n operands, not to the n bits of 2 operands
• For a non-associative func, determine its properties that allow determining a break-up & a correct stitch-up function
• Useful property: At any level, the comp. of the MS (most significant) half determines the o/p if its result is > or <; else the comp. of the LS half determines the o/p
• Can thus break up the problem at any level into MS-half and LS-half comparisons and, based on their results, determine which o/p to choose for the higher-level (parent) result
• No sub-prob. dependency

[Figure: D&C tree, breakup by size/bits. Root A: Comp. A[7..0], B[7..0] is broken up into A1: Comp A[7..4], B[7..4] and A2: Comp A[3..0], B[3..0]; stitch-up of the solns to A1 and A2: if A1 reslt is > or < take A1 reslt, else take A2 reslt. A1 is broken up into A1,1: Comp A[7..6], B[7..6] and A1,2: Comp A[5,4], B[5,4] (if A1,1 reslt is > or < take A1,1 reslt, else take A1,2 reslt). A1,1 is broken up into A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6] (if A1,1,1 reslt is > or < take A1,1,1 reslt, else take A1,1,2 reslt). The leaves are small enough to be designed using a TT.]

D&C Approach for Non-Associative Opers: n-bit > Comparator (cont’d)

A[i]  B[i]  f1(i)  f2(i)
0     0     0      1
0     1     0      0
1     0     1      0
1     1     0      1

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }
    /* f2(i) o/p is an i/p to the stitch logic; f2(i) = 1 means the f1( ), f2( ) o/ps of the LS half of this subtree should be selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; /* indicates < */ f2(i) = 0 }
    /* f2(i) = 0 indicates f1(i), f2(i) o/ps should be selected by stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; /* indicates > */ f2(i) = 0 }
    /* f2(i) = 0 indicates f1(i), f2(i) o/ps should be selected by stitch logic as its o/ps */

The TT may be derived directly or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT.

[Figure: the same comparator D&C tree (breakup by size/bits), from Comp. A[7..0], B[7..0] down to the leaves Comp A[7], B[7] and Comp A[6], B[6]; each leaf is small enough to be designed using a TT (2-bit 2-o/p comparator).]

Comparator Circuit Design Using D&C (contd.)

[Figure: the same comparator D&C tree, from Comp. A[7..0], B[7..0] down to the leaves A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6].]

• Once the D&C tree is formulated it is easy to get the low-level & stitch-up designs
• Stitch-up design shown here

Leaf TT:

A[i]  B[i]  f1(i)  f2(i)
0     0     0      1
0     1     0      0
1     0     1      0
1     1     0      1

Stitch-up logic details:
If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) } /* select MS ½ comp. o/ps */
else /* select LS ½ comp. o/ps */
{ my_op1 = f1(i-1); my_op2 = f2(i-1) }

(Compact TT)

f1(i)  f2(i)  f1(i-1)  f2(i-1)  |  my_op1   my_op2
X      0      X        X        |  f1(i)    f2(i)
X      1      X        X        |  f1(i-1)  f2(i-1)

OR

(Direct design) [Figure: stitch-up logic as a 2-bit 2:1 Mux with 2-bit data i/ps I0 = f(i) = (f1(i), f2(i)) and I1 = f(i-1), select i/p f2(i), and 2-bit o/p my_op = (my_op1, my_op2).]
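A behavioural sketch of the comparator D&C, assuming the leaf TT and the Mux stitch-up described above. The function names (`leaf`, `stitch`, `compare`, `greater_than`) are illustrative only; this models the logic, not gate-level timing.

```python
def leaf(a_bit, b_bit):
    """1-bit comparison TT: returns (f1, f2); f2 = 1 means 'equal, defer to the LS half'."""
    return (1 if a_bit > b_bit else 0, 1 if a_bit == b_bit else 0)

def stitch(ms, ls):
    """2-bit 2:1 Mux: pass the MS-half result unless its f2 says 'equal'."""
    return ms if ms[1] == 0 else ls

def compare(a_bits, b_bits):
    """a_bits, b_bits: lists of bits, MS bit first. Returns (f1, f2) for the whole range."""
    if len(a_bits) == 1:
        return leaf(a_bits[0], b_bits[0])
    mid = len(a_bits) // 2
    return stitch(compare(a_bits[:mid], b_bits[:mid]),
                  compare(a_bits[mid:], b_bits[mid:]))

def greater_than(a, b, n=8):
    """F = f1 of the root: 1 if a > b, else 0 (a, b are n-bit unsigned ints)."""
    to_bits = lambda v: [(v >> i) & 1 for i in range(n - 1, -1, -1)]
    return compare(to_bits(a), to_bits(b))[0]

print(greater_than(0b00000001, 0b00000000))  # 1
print(greater_than(0b11111111, 0b11111111))  # 0
```

The two recursive calls in `compare` do not depend on each other, which is what lets the hardware evaluate both halves in parallel.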

Comparator Circuit Design Using D&C – Final Design

[Figure: final 8-bit comparator design. Eight 1-bit comparators produce the 2-bit o/ps f(7), …, f(0) from (A[7], B[7]), …, (A[0], B[0]). A first level of four 2-bit 2:1 Muxes produces my(3), my(2), my(1), my(0), selected by f2(7) = f(7)(2), f(5)(2), f(3)(2) and f(1)(2), resp. A second level of two 2-bit 2:1 Muxes produces my(5), selected by my(3)(2), and my(4), selected by my(1)(2). A final 1-bit 2:1 Mux, selected by my(5)(2), chooses between my(5)(1) and my(4)(1) to give F = my1(6). There are log n levels of Muxes.]

• Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 2-bit comp.
• Note parallelism at work: multiple logic blocks are processing simult.
• Delay(n-bit comp.) = (log n)*(delay of 2:1 Mux) + delay of 2-bit comp.
• H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Mux)) + 8*(H/W_cost(2-bit comp.))
• H/W_cost(n-bit comp.) = (n-1)*(H/W_cost(2:1 Mux)) + n*(H/W_cost(2-bit comp.))
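These expressions follow from the D&C recurrences. The sketch below uses assumed symbols: D(n) and C(n) for the delay and cost of the n-bit comparator, d_mux and c_mux for one stitch-up Mux, and d_comp and c_comp for one leaf comparator. It also treats all stitch-up Muxes as identical, although the root Mux in the final design is only 1 bit wide.

```latex
\begin{align*}
D(n) &= D(n/2) + d_{\text{mux}}, \quad D(1) = d_{\text{comp}}
  &&\Rightarrow\quad D(n) = (\log_2 n)\, d_{\text{mux}} + d_{\text{comp}} \\
C(n) &= 2\,C(n/2) + c_{\text{mux}}, \quad C(1) = c_{\text{comp}}
  &&\Rightarrow\quad C(n) = (n-1)\, c_{\text{mux}} + n\, c_{\text{comp}}
\end{align*}
```

For n = 8 this gives 3 Mux delays plus one leaf delay, and 7 Muxes plus 8 leaf comparators.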

D&C: Mux Design

(a) Top-Down design (D&C)

[Figure (a): a 2^n:1 MUX with data i/ps I0, …, I(2^n - 1) and select i/ps Sn-1, …, S0 is broken up into two 2^(n-1):1 MUXes, each selected by Sn-2, …, S0. One takes the i/ps I0, …, I(2^(n-1) - 1): all bits of the i/p index except the msb take all different combinations, and the msb is at a constant value (here 0). The other takes I(2^(n-1)), …, I(2^n - 1), with the msb at a constant value (here 1). The msb value differs between these 2 groups, so the stitch-up is a 2:1 Mux selected by Sn-1. This is a breakup by operands (data) and a simultaneous breakup by bits (select).]

Two sets of operands: data operands (2^n) and the control/select operand (n bits)

Opening up the 8:1 MUX’s hierarchical design and a top-down view

[Figure: an 8:1 MUX with data i/ps I0, …, I7, select i/ps S2 S1 S0 and o/p Z, opened up into three levels of 2:1 MUXes. The first level has four 2:1 MUXes, all selected by S0, on the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7); the second level has two 2:1 MUXes selected by S1; the final 2:1 MUX, selected by S2, produces Z. Each half is a 4:1 Mux: within a half, all bits of the i/p index except the msb take all different combinations, with the msb at a constant value (0 for one half, 1 for the other); the msb value differs between these 2 groups. Annotations: with S0 = 0 the first level passes I0, I2, I4, I6; I2 and I6 are selected when S0 = 0, S1 = 1, and these i/ps differ in S2; I6 is selected when S0 = 0, S1 = 1, S2 = 1.]

Top-Down vs Bottom-Up: Mux Design

(b) Bottom-Up (“Divide-and-Accumulate”)

[Figure (b): 2^(n-1) 2:1 MUXes, each selected by S0, take the i/p pairs (I0, I1), (I2, I3), …, (I(2^n - 2), I(2^n - 1)); their o/ps feed a 2^(n-1):1 MUX selected by Sn-1, …, S1.]

• Generally better to try top-down (D&C) first

An 8:1 MUX example (bottom-up)

[Figure: an 8:1 MUX with data i/ps I0, …, I7, select i/ps S2 S1 S0 and o/p Z, built from four 2:1 MUXes selected by S0 on the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7), followed by a 4:1 MUX selected by S2 S1 that produces Z. I1, I3, I5, I7 are selected when S0 = 1; I0, I2, I4, I6 are selected when S0 = 0.]

The two inputs of each 2:1 Mux should have different lsb or S0 values, since their sel. is based on S0 (all other remaining, i.e., unselected, bit values should be the same). Similarly for other i/p pairs at 2:1 Muxes at this level.
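Both Mux constructions can be sketched as recursive functions and checked against each other. The names `mux2`, `mux_top_down` and `mux_bottom_up` are illustrative; `sel` is assumed to list the select bits msb first, i.e., [Sn-1, …, S0].

```python
def mux2(i0, i1, s):
    """2:1 Mux primitive."""
    return i1 if s else i0

def mux_top_down(inputs, sel):
    """Top-down D&C: two 2^(n-1):1 Muxes on Sn-2..S0, stitched by a 2:1 Mux on Sn-1."""
    if not sel:
        return inputs[0]
    half = len(inputs) // 2
    lo = mux_top_down(inputs[:half], sel[1:])   # i/p indices with msb = 0
    hi = mux_top_down(inputs[half:], sel[1:])   # i/p indices with msb = 1
    return mux2(lo, hi, sel[0])

def mux_bottom_up(inputs, sel):
    """Bottom-up: 2^(n-1) 2:1 Muxes on S0 feed a 2^(n-1):1 Mux on Sn-1..S1."""
    if not sel:
        return inputs[0]
    s0 = sel[-1]
    paired = [mux2(inputs[i], inputs[i + 1], s0) for i in range(0, len(inputs), 2)]
    return mux_bottom_up(paired, sel[:-1])

data = ["I0", "I1", "I2", "I3", "I4", "I5", "I6", "I7"]
print(mux_top_down(data, [1, 1, 0]))   # I6  (S2 = 1, S1 = 1, S0 = 0)
print(mux_bottom_up(data, [1, 1, 0]))  # I6
```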

• Multiplication D&C idea:
• A x B = (2^(n/2)*Ah + Al)(2^(n/2)*Bh + Bl), where Ah is the higher n/2 bits of A, and Al the lower n/2 bits
  = 2^n*Ah*Bh + 2^(n/2)*Ah*Bl + 2^(n/2)*Al*Bh + Al*Bl
  = PH + PM1 + PM2 + PL
• Example: 10111001 (= 185) X 00100111 (= 39) = 0001110000101111 (= 7215)
  D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
  = 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
  = 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111

    bbbbbbbb00111111 = PL
  + bbbb01001101bbbb = PM1
  + bbbb00010010bbbb = PM2
  + 00010110bbbbbbbb = PH
  _____________________
    0001110000101111 = 7215
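The breakup can be checked with a short recursive model. This is a sketch only: `dc_multiply` is a hypothetical name, n is assumed to be a power of 2, and the recursion is taken down to 1-bit products (a 2-i/p AND).

```python
def dc_multiply(a, b, n):
    """Multiply two n-bit unsigned ints by breaking each into MS and LS halves."""
    if n == 1:
        return a & b                      # leaf: 1-bit multiplier = 2-i/p AND gate
    h = n // 2
    mask = (1 << h) - 1
    ah, al = a >> h, a & mask
    bh, bl = b >> h, b & mask
    ph = dc_multiply(ah, bh, h)           # PH  before alignment
    pm1 = dc_multiply(ah, bl, h)          # PM1 before alignment
    pm2 = dc_multiply(al, bh, h)          # PM2 before alignment
    pl = dc_multiply(al, bl, h)           # PL
    # Stitch-up: align (shift) and add
    return (ph << n) + ((pm1 + pm2) << h) + pl

print(dc_multiply(0b10111001, 0b00100111, 8))  # 7215
```

The four recursive products are independent of each other; only the stitch-up additions depend on them.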

Multiplier D&C

[Figure: D&C tree, breakup by bits (operand size). A X B, an n-bit mult, is broken up into four (n/2)-bit mults: Ah X Bh, Ah X Bl, Al X Bh and Al X Bl, with n-bit o/ps W, X, Y and Z, resp. Stitch up: align and add = 2^n*W + 2^(n/2)*X + 2^(n/2)*Y + Z.]

Stitch-Up Design 1 (inefficient)

[Figure: PL, PM1, PM2 and PH, each extended from n to 2n bits, are added using 2n-bit adders to give the 2n-bit product.]

• Cost = 3 2n-bit adders = 6n FAs (full adders) for RCAs (ripple-carry adders)
• Critical path: Delay (using RCAs) =
  (a) too high-level analysis: 2*((2n)-bit adder delay) = 4n*(FA delay)
  (b) more exact, considering the overall critical path: (i + 2n - i + 1) = 2n+1 FA delays
• What is the delay of the n-bit multiplier using such a stitch up (# 1)?

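[Figure: two RCAs (chains of full adders, FA7 being the MS one) connected in sequence to add three numbers; bits z7 … z0 are labeled.]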

Delay for adding 3 numbers X, Y, Z using two RCAs?

Ans: (n+1) FA delay units or 2(n+1) 2-i/p gate delay units
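The (n+1) answer can be reproduced with a simple arrival-time model. The assumptions here are mine: all operand bits are available at time 0, every FA takes 1 delay unit for both its sum and carry o/ps, and the second RCA adds Z to the sum bits of the first. The name `three_operand_rca_delay` is illustrative.

```python
def three_operand_rca_delay(n):
    """Arrival time (in FA delays) of the MS sum bit when RCA #2 adds Z to RCA #1's o/p."""
    t1_prev = t2_prev = 0
    for _ in range(n):
        t1 = t1_prev + 1               # FA i of RCA #1 waits only for its carry-in
        t2 = max(t1, t2_prev) + 1      # FA i of RCA #2 waits for sum bit i and its carry-in
        t1_prev, t2_prev = t1, t2
    return t2_prev

print(three_operand_rca_delay(8))  # 9
```

The second RCA trails the first by just one FA delay, because its bit i can start as soon as sum bit i of the first RCA is ready.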

Multiplier D&C (cont’d)

• This (1.5n FA delay units) is the delay assuming PL … PH have been computed.

• What is the delay of the entire multiplier? Note that the stitch-up of a level can start when the lsb of the msb half of the product bits of each of the 4 products PL … PH is available: for the top level this is at n/4 + 2 after the previous level’s such input is available

• Using RCAs: (n-1) [this is the delay after the lsb of the msb half is avail. at the top level] + (n/2 + 2) + (n/4 + 2) + … + (2 + 2) (stopping at the 4-bit mult) + 2 [boundary-case 2-bit mult delay at bit 3] + 1/3 [this is the delay of one 2-i/p gate translated into FA delay units; 1 FA delay = 3 2-i/p gate delays]
  = (n-1) + (1/2)[Σ_{i=0}^{log n} 2^i] + 2(log n + 1) – 1.17 [corrective term for taking the prev. summation up to i = 1, 0]
  = n - 1 + (1/2)[2n - 1] + 2(log n + 1) - 1.17
  ~ 2(n + log n) ~ Θ(2n) FA delays, similar to the well-known array multiplier that uses carry-save adders

Stitch-Up Design 2 (efficient)

[Figure: PL, PM1, PM2 and PH are aligned in four (n/2)-bit columns and added with (n/2)-bit adders, forming intermediate sums; a 00…0 i/p and Cin/Cout connections link the adders. Arrows in the adds show the Couts of lower-order adds propagating as Cin to the next higher-order adds. Annotated signal arrival times (in FA delays): n/2, n/2+1, n/2+2 (lsb of MS half), 2[n/2]+1 (including a Cin), 2[n/2]+2 and 3[n/2]+1.]

• Cost = 5 (n/2)-bit adders = 2.5n FAs for RCAs
• Critical path: Delay = 3*((n/2)-bit adder delay) = 1.5n*(FA delay) for RCAs

• Ex: 10111001 (= 185) X 00100111 (= 39) = 0001110000101111 (= 7215)
  D&C breakup: (10111001) X (00100111) = (2^4(1011) + 1001) X (2^4(0010) + 0111)
  = 2^8(1011 X 0010) + 2^4(1011 X 0111 + 1001 X 0010) + 1001 X 0111
  = 2^8(00010110) + 2^4(01001101 + 00010010) + 00111111

    bbbbbbbb00111111 = PL
  + bbbb01001101bbbb = PM1
  + bbbb00010010bbbb = PM2
  + 00010110bbbbbbbb = PH
  _____________________
    0001110000101111 = 7215

We were able to obtain this similar-to-array-multiplier design using basic D&C guidelines, and it did not require extensive ingenuity, as it might have for the designers of the array multiplier.

[Figure: tree of stitch-ups. The root SU2(n), with a 2n-bit o/p, takes four n-bit i/ps from four SU2(n/2) blocks; each of these takes (n/2)-bit i/ps from four SU2(n/4) blocks, and so on.]

SU2 = Stitch-up design 2 for multiplication

• What is its cost in terms of # of FAs (RCAs)?
• The level below the root (root = 1st level) has 4 (n/2)-bit multiplies to generate the PL … PH of the root, 16 (n/4)-bit multiplies in the next level, and so on, up to 2-bit mults. at level log n.
• Thus FAs used = 2.5[n + 4(n/2) + 16(n/4) + …] + 4^(log n - 1)*(2) + 4^(log n)*(1/7) [the last two terms are for the boundary cases of 2-bit and 1-bit multipliers, which require 2 and 1/7 FAs each, resp.]
  = 2.5n(Σ_{i=0}^{log n - 2} 2^i) + 2(n/2)^2 + (1/7)n^2
  = 2.5n(n/2 - 1)/(2 - 1) + 0.64n^2
  = 1.25n^2 - 2.5n + 0.64n^2 ~ 1.89n^2 = Θ(n^2).
• Why do we add (1/7) FA cost units for each 1-bit multiplier (which is a 2-i/p AND gate)?
• Using CSvA’s [see later], the cost is similar (quadratic in n, i.e., Θ(n^2)).
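The FA count can be cross-checked by evaluating the cost recurrence directly and comparing it with the closed form. This is a sketch under the stated boundary costs (2 FAs of stitch-up for a 2-bit multiplier, 1/7 FA per 1-bit multiplier); `fa_cost` and `closed_form` are illustrative names, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def fa_cost(n):
    """FA count of the n-bit D&C multiplier with stitch-up design 2 (n a power of 2)."""
    if n == 1:
        return Fraction(1, 7)                       # one 2-i/p AND gate, counted as 1/7 FA
    stitch = 2 if n == 2 else Fraction(5, 2) * n    # SU2(n): five (n/2)-bit RCAs = 2.5n FAs
    return stitch + 4 * fa_cost(n // 2)

def closed_form(n):
    """1.25n^2 - 2.5n + 2(n/2)^2 + (1/7)n^2, kept as exact fractions."""
    return (Fraction(5, 4) * n * n - Fraction(5, 2) * n
            + 2 * Fraction(n, 2) ** 2 + Fraction(1, 7) * n * n)

for n in (4, 8, 16, 32):
    print(n, float(fa_cost(n)), float(closed_form(n)))
```

The ratio to n^2 approaches 1.25 + 0.5 + 1/7 ≈ 1.89 as n grows, matching the quadratic estimate.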

Multiplier D&C (cont’d): Carry-Save Addition

Multiplier D&C (cont’d): Carry-Save Add. Based Stitch-Up

• Using CSvAs (carry-save adders): each sub-prod., e.g., PL, is formed of 2 nos. (sum bits and carry bits), so there are 8 n-bit #s to be CSvA’ed in the final stitch-up; this takes a delay of approx. 5 units if done in seq. but only 4 units if done in parallel. We then get 2 final nos. (a carries # and a sums #) that are added by a carry-propagate adder like a CLA, which takes Θ(log n) time. The overall multiplier delay is Θ(4*log n) [4 time units at each of the (log n - 2) levels (need at least 2-bit inputs for the above structure to be valid) + at most 2 time units for the bottom two levels (why?)] + Θ(log n) = Θ(log n), similar to a Wallace-tree mult.

• We were able to obtain this fast design using D&C (and did not need the extensive ingenuity that W-T multiplier designers must have needed)!

• Hardware cost (# of FAs), ignoring final carry-prop. adder for the entire mult.? Exercise.

Fig.: Stitch-up # 3: Adding 6 numbers in parallel using CSvA’s takes 3 units of time and 4 CSvA’s. [Figure: the numbers S(PL), C(PL), S(PM1), C(PM1), S(PM2), C(PM2), S(PH), C(PH) feed a tree of 4 CSvAs.]

Fig.: Separate (and thus parallel) carry-save adds for each of the 4 (n/2)-bit groups shown at the top level of multiplication. [Figure: four (n/2)-bit (C & S) groups, annotated: “Add 6 #s using CSvA’s: 3 delay units”; “No CSvA needed”; “Add 4 #s using CSvA’s”; “Add 7 #s using CSvA’s (7 lsb bits need to be added): 4 delay units”.]
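A carry-save adder reduces 3 numbers to 2 (a sum word and a carry word) without propagating carries. The sketch below models this on Python integers and counts levels and CSvAs for a parallel reduction; `csa` and `csa_tree` are illustrative names, and the grouping of words into triples is one possible arrangement, not necessarily the one in the figures.

```python
def csa(a, b, c):
    """3:2 carry-save adder on non-negative ints: returns (sum word, carry word)."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def csa_tree(nums):
    """Reduce nums to 2 words. Returns (words, # of CSvA levels, # of CSvAs)."""
    words, levels, count = list(nums), 0, 0
    while len(words) > 2:
        nxt = []
        full = len(words) - len(words) % 3
        for i in range(0, full, 3):        # CSvAs of one level work in parallel
            nxt.extend(csa(*words[i:i + 3]))
            count += 1
        nxt.extend(words[full:])           # leftover words pass through
        words, levels = nxt, levels + 1
    return words, levels, count

words, levels, count = csa_tree([3, 5, 7, 9, 11, 13])
print(sum(words), levels, count)           # 48 3 4
```

With 8 or 7 input numbers the same function reports 4 levels, which agrees with the delay units quoted for those cases; the 2 remaining words still need one carry-propagate add.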

D&C Example Where a “Straightforward” Breakup Does Not Work

• Problem: n-bit Majority Function (MF): Output f = 1 when a majority of bits is 1, else f = 0

[Figure: Root problem A: n-bit MF [MF(n)] is broken up into Subprob. A1: MF(LS n/2 bits), with o/p f1, and Subprob. A2: MF(MS n/2 bits), with o/p f2; a stitch-up (SU) combines f1 and f2 into f.]

• Need to ask (general Qs for any problem): Is the stitch-up function SU required in the above straightforward breakup of MF(n) into two MF’s for the MS and LS n/2 bits:
  – Computable?
  – Efficient in both hardware and speed?
• Try all 4 combinations of f1, f2 values and check if it is possible for any function w/ i/ps f1, f2 to determine the correct f value:
  – f1 = 0, f2 = 0: # of 1’s is in the minority (<= n/4) in both halves, so the total # of 1’s <= n/2 ⇒ f = 0
  – f1 = 1, f2 = 1: # of 1’s is in the majority (> n/4) in both halves, so the total # of 1’s > n/2 ⇒ f = 1
  – f1 = 0, f2 = 1: # of 1’s <= n/4 in the LS n/2 bits and > n/4 in the MS n/2 bits, but this does not imply whether the total # of 1’s is <= n/2 or > n/2. So no function can determine the correct f value (it will need more info, like the exact count of 1’s)
  – f1 = 1, f2 = 0: same situation as the f1 = 0, f2 = 1 case.
• Thus the stitch-up function is not even computable in the above breakup of MF(n).
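A concrete pair of 8-bit inputs makes the non-computability visible: both produce the same (f1, f2) but need different f. The helper names `mf` and `halves` and the two example vectors are chosen for illustration only.

```python
def mf(bits):
    """Majority function: 1 if more than half of the bits are 1."""
    return 1 if sum(bits) > len(bits) / 2 else 0

def halves(bits):
    """bits listed MS bit first. Returns (f1, f2, f): MF of LS half, MS half, whole word."""
    mid = len(bits) // 2
    return mf(bits[mid:]), mf(bits[:mid]), mf(bits)

x = [1, 1, 1, 1, 0, 0, 0, 0]   # four 1's in total: no majority
y = [1, 1, 1, 1, 1, 1, 0, 0]   # six 1's in total: majority
print(halves(x))               # (0, 1, 0)
print(halves(y))               # (0, 1, 1)
```

Since the stitch-up would see identical i/ps (f1, f2) = (0, 1) in both cases, it cannot produce both required o/ps.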

D&C Example Where a “Straightforward” Breakup Does Not Work (contd.)

• Try another breakup, this time of MF(n) into functions that are different from MF.

[Figure: Root problem A: n-bit MF [MF(n)]. Subprob. A1: count the # of 1’s in the n bits; its o/p f1 has (log n)+1 bits. Subprob. A2: (>) compare of the A1 o/p and floor(n/2), giving f.]

• Have seen a (log n)-delay (>) comparator for two n-bit #s using D&C
• Can we do 1-counting using D&C? How much time will this take?
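A functional sketch of this second breakup: 1-counting is done by D&C on the # of operands (the stitch-up is an addition of the two half counts), and the result is compared with floor(n/2). The names `count_ones` and `majority` are illustrative, and the sketch does not answer the timing question, which depends on the adders used in the stitch-ups.

```python
def count_ones(bits):
    """D&C 1-counter: count each half independently, then add (the stitch-up)."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return count_ones(bits[:mid]) + count_ones(bits[mid:])

def majority(bits):
    """A2: (>) compare of the count with floor(n/2)."""
    return 1 if count_ones(bits) > len(bits) // 2 else 0

print(majority([1, 1, 1, 1, 0, 0, 0, 0]))  # 0
print(majority([1, 1, 1, 1, 1, 1, 0, 0]))  # 1
```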

Dependency Resolution in D&C: (1) The Wait Strategy

• So far we have seen D&C breakups in which there is no data dependency between the two (or more) subproblems of the breakup
• Data dependency leads to increased delays
• We now look at various ways of speeding up designs that have subproblem dependencies in their D&C breakups

[Figure: Root problem A with Subprob. A1 and Subprob. A2; data flows from A1 to A2.]

• Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits
• No concurrency between A1 and A2:
  t(A) = t(A1) + t(A2) + t(stitch-up)
       = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problems of the same size (w/ different i/ps)

Adder Design using D&C

• Example: Ripple-Carry Adder (RCA)
  – Stitching up: Carry from LS n/2 bits is input to carry-in of MS n/2 bits at each level of the D&C tree.
  – Leaf subproblem: Full Adder (FA)

[Figure (a): D&C for Ripple-Carry Adder. “Add n-bit #s X, Y” is broken up into “Add MS n/2 bits of X, Y” and “Add LS n/2 bits of X, Y”, recursively down to FA leaves.]
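The RCA breakup can be written as a recursion in which the MS half takes the carry of the LS half as its carry-in. This is a behavioural sketch with illustrative names (`full_adder`, `dc_add`, `to_bits`, `from_bits`); bit lists are LS bit first.

```python
def full_adder(x, y, cin):
    """Leaf subproblem: returns (sum bit, carry out)."""
    return x ^ y ^ cin, (x & y) | (x & cin) | (y & cin)

def dc_add(x_bits, y_bits, cin=0):
    """x_bits, y_bits: LS bit first. Returns (sum bits LS first, carry out)."""
    if len(x_bits) == 1:
        s, cout = full_adder(x_bits[0], y_bits[0], cin)
        return [s], cout
    mid = len(x_bits) // 2
    ls_sum, c_mid = dc_add(x_bits[:mid], y_bits[:mid], cin)    # A1: LS half
    ms_sum, cout = dc_add(x_bits[mid:], y_bits[mid:], c_mid)   # A2 must wait for A1's carry
    return ls_sum + ms_sum, cout

def to_bits(v, n):
    return [(v >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

s, c = dc_add(to_bits(185, 8), to_bits(39, 8))
print(from_bits(s), c)   # 224 0
```

The second recursive call uses `c_mid`, so unlike the comparator the two halves cannot be evaluated concurrently; this is the data dependency that the Wait strategy accepts.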

Example of the Wait Strategy in Adder Design

[Figure: ripple-carry adder built from a chain of FAs (FA7, …).]

• Note: Gate delay is proportional to the # of inputs (since, generally, there is a series connection of transistors in either the up or down network, w/ # of transistors in series = # of inputs ⇒ the R’s of the transistors in series add up and are prop. to the # of inputs ⇒ delay ~ RC (C is the capacitive load) is prop. to the # of inputs)
• The 5-i/p gate delay stated above for a FA is correct if we have 2- and 3-i/p gates available (why?); otherwise, if only 2-i/p gates are available, the delay will be 6-i/p gate delays (why?).
• Assume each gate i/p contributes 2 ps of delay
• For a 16-bit adder the delay will be 160 ps (16 FAs x 5 i/p delays x 2 ps)
• For a 64-bit adder the delay will be 640 ps (64 FAs x 5 i/p delays x 2 ps)

Adder Design using D&C—Lookahead Wait (not in syllabus)

• Example: Carry-Lookahead Adder (CLA)
  – Division: 4 subproblems per level
  – Stitching up: A more complex stitching-up process (generation of global or “super” P, G’s to connect up the subproblems)
  – Leaf subproblem: 4-bit basic CLA with small p, g bits.

• More intricate techniques (like P, G generation in CLA) for complex stitching up for fast designs may need to be devised that are not directly suggested by D&C. But D&C is a good starting point.

[Figure (a): D&C for Carry-Lookahead Adder w/ Linear Global P, G Ckt. “Add n-bit #s X, Y” is broken up into “Add ms n/4 bits”, “Add 3rd n/4 bits”, “Add 2nd n/4 bits” and “Add ls n/4 bits”, each producing a local P, G.]

Linear connection of local P, G’s from each unit to determine the global or super P, G for each unit. But this has linear delay, so it is not much better than an RCA.

But the global P for each unit is an associative function. So it can be done in max log n time (for the last unit; less time for earlier units).

[Figure (b): D&C for Carry-Lookahead Adder w/ a Tree-like Global P, G Ckt: the same 4-way breakup, w/ a tree connection of local P, G’s from each unit to determine the global P, G for each unit (P is associative), i.e., to do a prefix computation.]

Dependency Resolution in D&C:(2) The “Design-for-all-cases-&-select (DAC)” Strategy

[Figure: Root problem A. Subprob. A1 supplies the select i/p of a 4-to-1 Mux; four copies of Subprob. A2, w/ hardwired i/ps 00, 01, 10 and 11, feed the Mux data i/ps I/p00, I/p01, I/p10 and I/p11.]

• Strategy 2: DAC: For a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p to replace the one from A1.
• Select the correct o/p from all the copies of A2 via a (2^k)-to-1 Mux that is selected by the k-bit o/p from A1 when it becomes available (e.g., carry-select adder)
• t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up)
       = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problems
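For the carry-select adder, k = 1 (the carry from the LS half), so 2 copies of the MS-half adder are needed. The sketch below is one level of DAC only; `add_block` is a stand-in for any adder sub-block (it uses integer addition rather than gates), and both function names are illustrative.

```python
def add_block(x, y, cin, width):
    """Stand-in for an adder sub-block: returns (sum mod 2^width, carry out)."""
    total = x + y + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(x, y, n):
    """One level of DAC on an n-bit add (n even). Returns (n-bit sum, carry out)."""
    h = n // 2
    mask = (1 << h) - 1
    xl, xh, yl, yh = x & mask, x >> h, y & mask, y >> h
    ls_sum, c_mid = add_block(xl, yl, 0, h)                  # A1
    copies = [add_block(xh, yh, c, h) for c in (0, 1)]       # 2 copies of A2, hardwired cin
    ms_sum, cout = copies[c_mid]                             # 2-to-1 Mux selected by A1's carry
    return (ms_sum << h) | ls_sum, cout

print(carry_select_add(185, 39, 8))   # (224, 0)
print(carry_select_add(200, 100, 8))  # (44, 1)
```

In hardware, A1 and both copies of A2 run at the same time, so the extra delay after A1 is just the Mux.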

(2) The “Design-for-all-cases-&-select (DAC)” Strategy (cont’d)

[Figure: A D&C tree with a mix of DAC and Wait strategies for dependency resolution between subproblems. The root problem A and the next level use DAC (w/ multiple copies of Subprob. A2 and a stitch-up (SUP) at each node); the lower levels use Wait. Generally, the wait strategy will be used at all lower levels after the 1st wait level.]

• The DAC strategy has a MUX delay involved, and for small subproblems, the delay of a subproblem may be smaller than a MUX delay.

• Thus a mix of DAC and Wait strategies, as shown in the above figure, may be faster, w/ DAC used at higher levels and Wait at lower levels.

Example of the DAC Strategy in Adder Design

[Figure: adder design using the DAC strategy, w/ a simplified Mux and Cout.]

• For a 16-bit adder, the delay is (9*4 – 4)*2 = 64 ps (2 ps is the delay for a single i/p); a 60% improvement ((160-64)*100/160) over the RCA
• For a 64-bit adder, the delay is (9*8 – 4)*2 = 136 ps; a 79% improvement over the RCA

Dependency Resolution in D&C:(3) Speculative Strategy

• Speculative Strategy: Have a single copy of A2 but choose a highly likely value of the k-bit i/p and perform A1, A2 concurrently. If after k-bit i/p from A1 is available and selection is incorrect, re-do A2 w/ correct available value.• t(A) = p(correct-choice)*(max(t(A1), t(A2)) + (1-p(correct-choice))*[t(A2) + t(A1)) + t(stitch-up), where p(correct-choice) is probability that our choice of the k-bit i/p for A2 is correct.• For t(A1) = t(A2), this becomes: t(A) = p(correct-choice)*t(A1) + (1-p(correct-choice))*2t(A1)+ t(stitch-up) = t(A1) + (1-p(correct-choice))*t(A1)+ t(stitch-up)• Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless is such designs• Need an FSM controller for determining if guess is correct and if not, then redoing A2 (allowing more time for generating the completion signal) .

Figure: Root problem A with Subprob. A1 and Subprob. A2. A 2-to-1 mux (inputs I0, I1) feeds the k-bit i/p of A2 with either the estimate (guess), based on analysis or stats, or o/p(A1→A2); the FSM controller drives the select i/p to the mux.

FSM Controller: If o/p(A1→A2) = guess(A2), then generate a completion signal after some delay corresponding to stitch-up; else set the i/p to A2 = o/p(A1→A2) and generate the completion signal after a delay of A2 + stitch-up.
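The expected-delay formula for the speculative strategy can be evaluated directly. The sketch below is illustrative only: `speculative_time` is a name introduced here, and the delay values in the loop are hypothetical.

```python
def speculative_time(t_a1, t_a2, t_stitch, p_correct):
    """Expected t(A) under the speculative strategy.

    p_correct is the probability that the guessed k-bit i/p for A2 is right.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    hit = max(t_a1, t_a2)   # guess right: A1 and A2 overlap completely
    miss = t_a1 + t_a2      # guess wrong: A2 is redone after A1 finishes
    return p_correct * hit + (1.0 - p_correct) * miss + t_stitch


# Hypothetical numbers: equal sub-problem delays of 10 units, stitch-up of 2.
for p in (0.5, 0.9, 1.0):
    print(p, speculative_time(10, 10, 2, p))
```

With these numbers the expected delay falls from 17 units at p = 0.5 to 12 units at p = 1.0, which is why the quality of the guess matters so much.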


Dependency Resolution in D&C:(4) The “Independent Pre-Computation” Strategy

• Strategy 4: Reconfigure the design of A2 so that it can do as much processing as possible that is independent of the i/p from A1 (A2_indep). This “independent” computation prepares for the final computation of A2 (A2_dep), which can start once A2_indep and A1 are done.

• t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)

• E.g., let a1 be the i/p from A1 to A2, and let A2 have the logic a2 = v’x’ + uvx + w’xy + wz’a1 + u’xa1. If this is implemented using 2-i/p AND/OR gates, the delay is 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is restructured as a2 = (v’x’ + uvx + w’xy) + (wz’ + u’x)a1, and the logic in the 2 brackets is computed before a1 is available (this constitutes A2_indep), then the delay is only 4 delay units after a1 is available.

• Such a strategy requires factoring out the external i/p a1 in the logic for a2, grouping & implementing all the non-a1 logic, and then adding logic to “connect” the non-a1 logic to a1 as the last stage.

Figure (concept): Root problem A, with data flowing from Subprob. A1 into Subprob. A2; A2 is split into A2_indep and A2_dep.

Figure (example): An unstructured logic for A2 (product terms w’xy, wz’a1, u’xa1, v’x’, uvx) with a critical path of 8 delay units after a1 is available, and the restructured logic split into A2_indep and A2_dep with a critical path of 4 delay units after a1 is available.
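The restructuring of a2 can be checked exhaustively, since there are only 2^7 input combinations. The sketch below uses function names introduced here (`a2_flat`, `a2_indep`, `a2_dep`). The delay figures assume balanced trees of 2-i/p gates with a1 entering at the last AND of its product term.

```python
from itertools import product
from math import ceil, log2

GATE_DELAY = 2  # a 2-i/p gate costs 2 units (1 unit per i/p), as on the slide


def a2_flat(u, v, w, x, y, z, a1):
    """a2 = v'x' + uvx + w'xy + wz'a1 + u'xa1, with a1 buried in two product terms."""
    return bool((not v and not x) or (u and v and x) or (not w and x and y)
                or (w and not z and a1) or (not u and x and a1))


def a2_indep(u, v, w, x, y, z):
    """The two bracketed sub-expressions; neither needs a1."""
    no_a1_terms = bool((not v and not x) or (u and v and x) or (not w and x and y))
    a1_factor = bool((w and not z) or (not u and x))
    return no_a1_terms, a1_factor


def a2_dep(no_a1_terms, a1_factor, a1):
    """Last stage once a1 arrives: one 2-i/p AND, then one 2-i/p OR."""
    return bool(no_a1_terms or (a1_factor and a1))


def forms_agree():
    """Exhaustively compare the flat and restructured forms over all 2^7 inputs."""
    for u, v, w, x, y, z, a1 in product((False, True), repeat=7):
        if a2_flat(u, v, w, x, y, z, a1) != a2_dep(*a2_indep(u, v, w, x, y, z), a1):
            return False
    return True


# Delay after a1 is available, assuming balanced trees of 2-i/p gates.
flat_delay = GATE_DELAY * (1 + ceil(log2(5)))  # 1 AND level + 3 OR levels over 5 terms
dep_delay = GATE_DELAY * 2                     # 1 AND level + 1 OR level

print(forms_agree(), flat_delay, dep_delay)  # True 8 4
```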


D&C Summary

• For complex digital design, we need to think of the “computation” underlying the design in a structured manner: are there properties of this computation that can be exploited for a faster, less expensive, modular design? Is it amenable to the D&C approach? Think of:

– Breakup into >= 2 subprobs via breakup of (# of operands) or (operand sizes [bits])
– Stitch-up (is it computable?)
– Leaf functions
– Dependencies between sub-problems and how to resolve them

• The design is then developed in a structured manner & the corresponding circuit may be synthesized by hand or described compactly using a HDL (e.g., structural VHDL)

• For an operation/func x on n operands (a_{n-1} x a_{n-2} x … x a_0), if x is associative, the D&C approach gives an “easy” stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). This yields a tree-structured circuit with Θ(log n) delay instead of a linearly connected circuit with Θ(n) delay.

• If x is non-associative, more ingenuity and determination of properties of x is needed to determine the breakup and the stitch-up function. The resulting design may or may not be tree-structured

• If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:

– Wait (slowest, least hardware cost)
– Design-for-all-cases (high speed, high hardware cost)
– Speculative (medium speed, medium hardware cost)
– Independent pre-computation (medium-slow speed, low hardware cost)
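The delay contrast for an associative operation can be seen in a small software model. This is a sketch, not a circuit description: `linear_reduce` and `tree_reduce` are names introduced here, and "depth" counts operator delays along the longest path.

```python
from operator import xor


def linear_reduce(op, xs):
    """Chain the operands one after another; depth is n - 1 operator delays."""
    acc, depth = xs[0], 0
    for x in xs[1:]:
        acc = op(acc, x)
        depth += 1
    return acc, depth


def tree_reduce(op, xs):
    """D&C: solve each half, then stitch up with one more application of op."""
    if len(xs) == 1:
        return xs[0], 0
    mid = len(xs) // 2
    left, d_left = tree_reduce(op, xs[:mid])
    right, d_right = tree_reduce(op, xs[mid:])
    # The halves are independent, so their delays overlap; the stitch-up adds 1.
    return op(left, right), max(d_left, d_right) + 1


bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]  # 16-bit parity via XOR
print(linear_reduce(xor, bits))  # (1, 15)
print(tree_reduce(xor, bits))    # (1, 4)
```

Because the tree version keeps the operands in their original order, it relies only on associativity, not on commutativity.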


Strategy 2: A general view of DAC computations (w/ or w/o D&C)

• If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don’t wait for the “previous” computation to finish before starting the next one

• Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage, there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value.

• The o/ps of the different copies of B are mux’ed using the prev. stage A’s o/p as the select

• E.g. design: Carry-Select Adder (each stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the previous stage)

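Figure: (a) Stage A feeds its o/ps x, y into stage B, which produces z. (b) Four copies of B, namely B(0,0), B(0,1), B(1,0) and B(1,1), each wired to one combination of input values, run in parallel with A; A’s o/ps x, y select the final z through a 4:1 mux.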

(a) Original design: Time = T(A)+T(B)

(b) Speculative computation: Time = max(T(A),T(B)) + T(Mux). Works well when T(A) approx = T(B) and T(A) >> T(Mux)
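The carry-select adder mentioned above can be modeled behaviorally as follows. This is a minimal sketch under stated assumptions: the block size of 4 bits and the names `block_add` and `carry_select_add` are choices made here, and the software loop runs the blocks one after another even though the two copies in each block would operate concurrently in hardware.

```python
def block_add(a_blk, b_blk, cin, block):
    """Behavioral model of one small adder block: returns (sum bits, carry-out)."""
    total = a_blk + b_blk + cin
    return total & ((1 << block) - 1), total >> block


def carry_select_add(a, b, width=16, block=4):
    """Add two width-bit numbers block by block using the DAC strategy."""
    if width % block != 0:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    result, carry = 0, 0
    for lo in range(0, width, block):
        a_blk, b_blk = (a >> lo) & mask, (b >> lo) & mask
        # Two copies of the block adder, one per possible carry-in. Neither
        # depends on the previous block, so in hardware both run concurrently.
        sum0, cout0 = block_add(a_blk, b_blk, 0, block)
        sum1, cout1 = block_add(a_blk, b_blk, 1, block)
        # 2:1 mux: the previous block's carry-out selects the right copy.
        blk_sum, carry = (sum1, cout1) if carry else (sum0, cout0)
        result |= blk_sum << lo
    return result, carry


print(carry_select_add(0xFFFF, 0x0001))  # (0, 1)
print(carry_select_add(0x1234, 0x0FF0))  # (8740, 0)
```

Only the mux selection depends on the previous block, which is where the Time = max(T(A),T(B)) + T(Mux) expression comes from.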


Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!

• Use 2 circuits with different worst-case and average-case behaviors

• Use the first available output

• Get the best of both (ave-case, worst-case) worlds

• In the schematic, we get the good ave-case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance: ave. case = Θ(n/2.8) subs, worst case = Θ(n) subs

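Figure: The inputs are held in registers that feed two circuits running in parallel: a Unary Division Ckt (good ave case: Θ(n/2.8) subs; bad worst case: Θ(2^n) subs) and a Non-Restoring Div. Ckt (bad ave case: Θ(n) subs; good worst case: Θ(n) subs). An external FSM takes start, done1 and done2 and drives the select of a mux that passes the first available output to the output register.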

Approximate analysis: Avg. dividend value = 2^(n-1)

For divisor values in the “lower half range” [1, 2^(n-1)], the average quotient value is the Harmonic series (1 + 1/2 + 1/3 + … + 1/2^(n-1)) ~ ln(2^(n-1)) ~ (n-1)/1.4 (integration of 1/k from 1 to 2^(n-1)). The quotient for divisors in the upper half range [2^(n-1) + 1, 2^n] is 0, so the overall avg. quotient = (n-1)/2.8 and the avg. # of subtractions needed = 1 + (n-1)/2.8 = Θ(n/2.8)
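The approximate analysis can be compared against a brute-force count for a small n. The sketch below follows the slide in fixing the dividend at its average value 2^(n-1). The function names are introduced here, and `best_of_both_subs` rests on an assumption: that the non-restoring divider always takes n subtraction steps and that a step costs the same in both circuits.

```python
def unary_division_subs(dividend, divisor):
    """Subtractions used by repeated-subtraction division (divisor >= 1).

    The loop stops at the first negative result, so the count is quotient + 1.
    """
    count = 0
    while dividend >= 0:
        dividend -= divisor
        count += 1
    return count


def average_unary_subs(n):
    """Average over all divisors in [1, 2^n] with the dividend fixed at 2^(n-1)."""
    dividend = 2 ** (n - 1)
    divisors = range(1, 2 ** n + 1)
    return sum(unary_division_subs(dividend, d) for d in divisors) / len(divisors)


def best_of_both_subs(dividend, divisor, n):
    """Take whichever circuit finishes first.

    Assumption: the non-restoring divider always needs n subtraction steps,
    and a step costs the same in both circuits.
    """
    return min(unary_division_subs(dividend, divisor), n)


n = 8
print(average_unary_subs(n), 1 + (n - 1) / 2.8)  # measured vs. slide estimate
print(best_of_both_subs(2 ** n - 1, 1, n))       # unary alone would need 2^n subs here
```

For n = 8 the measured average lands close to the slide's estimate of 1 + 7/2.8 = 3.5, while the combined design caps the worst case at n steps.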


Strategy 4: Pipeline It!

Figure: An original ckt or datapath converted to a simple level-partitioned pipeline of Stage 1, Stage 2, …, Stage k, with clocked registers between the stages (a level partition may not always be possible, but other pipelineable partitions may be).

• Throughput is defined as # of outputs / sec

• Non-pipelined throughput = 1/D, where D = delay of the original ckt’s datapath

• Pipeline throughput = 1/(max stage delay + register delay)

• Special case: If the original ckt’s datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr).

• If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the original ckt

• An FSM controller may be needed for non-uniform stage delays; not needed otherwise
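The throughput formulas above translate directly into code. In this sketch the function names and the delay values (in ns) are hypothetical and chosen only to show the effect of register delay and of unbalanced stages.

```python
def non_pipelined_throughput(d_total):
    """Outputs per unit time for the original datapath of delay D."""
    return 1.0 / d_total


def pipeline_throughput(stage_delays, d_reg):
    """Outputs per unit time: the clock period is set by the slowest stage plus a register."""
    return 1.0 / (max(stage_delays) + d_reg)


D = 8.0  # hypothetical datapath delay in ns
balanced = [2.0, 2.0, 2.0, 2.0]    # n = 4 equal stages, each D/n
unbalanced = [1.0, 1.0, 4.0, 2.0]  # same total delay, one slow stage

base = non_pipelined_throughput(D)
print(pipeline_throughput(balanced, 0.0) / base)    # 4.0: the ideal n-fold gain
print(pipeline_throughput(balanced, 0.2) / base)    # below 4 once dr is counted
print(pipeline_throughput(unbalanced, 0.2) / base)  # limited by the 4 ns stage
```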



Strategy 4: Pipeline It! (contd.)

Figure: Pipelined 8-bit comparator. Eight 1-bit comparators, f(7) … f(0), one per bit pair A[i], B[i], each produce a 2-bit o/p; these feed a tree of 2-bit 2:1 muxes (my(0) … my(5), log n levels of muxes), and a final 1-bit 2:1 mux produces F. Registers are inserted between levels.

• A comparator o/p is produced every 1 unit of time, instead of every (log n + 1) units of time, where 1 time unit here = delay of a mux or 1-bit comparator (both will have the same or similar delay)

• We can reduce the reg. cost by inserting registers only at every 2 levels; the throughput then decreases to 1 o/p per 2 units of time
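The "one o/p per time unit" behavior can be seen in a cycle-level model of a linear pipeline. This is a generic sketch, not a model of the comparator itself: `run_pipeline` is a name introduced here, the 4 stages stand in for log n + 1 levels with n = 8, and each placeholder stage just adds 1 to its input.

```python
def run_pipeline(stages, inputs):
    """Clock a linear pipeline; regs[i] holds the o/p latched after stage i.

    Returns (cycle, value) pairs for every value that reaches the last register.
    """
    regs = [None] * len(stages)
    outputs = []
    stream = list(inputs) + [None] * len(stages)  # extra cycles to drain the pipe
    for cycle, item in enumerate(stream, start=1):
        # Update back to front so each stage reads what was latched last cycle.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else stages[i](regs[i - 1])
        regs[0] = None if item is None else stages[0](item)
        if regs[-1] is not None:
            outputs.append((cycle, regs[-1]))
    return outputs


stages = [lambda v: v + 1] * 4  # placeholder logic for 4 levels
print(run_pipeline(stages, [10, 20, 30]))  # [(4, 14), (5, 24), (6, 34)]
```

The first result appears after 4 cycles (the latency), and every later result follows exactly 1 cycle after the previous one.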



Strategy 4: Pipeline It! (contd.)

Adder o/p produced every 2 units of FA delay instead of every n units of FA delay in an n-bit RCA

Figure: Pipelined Ripple Carry Adder, with the sum o/ps S7 … S0 produced in pairs. Legend: intermediate & output registers; input registers. Recoverable labels: “Next 3 S0, S1 o/ps”; S1, S0 o/ps, S3, S2 o/ps, S5, S4 o/ps and S7, S6 o/ps, each for i/ps recvd 4 cc back.

Problem: I/P and O/P data direction is not the same as the computation direction. They are perpendicular!