CS141-L2-1 Tarun Soni, Summer ‘03
Performance, ALUs and such like
The good news: no quiz today !
Homework #1 is on the net now, so are the slides from the previous class. Home page is www.cs.ucsd.edu/~tsoni/cse141
Finals will be on the last day of class, no special time slot.
Add-drops will be handled at the break.
Today: Chap 2 and 4 of the text.
CS141-L2-2 Tarun Soni, Summer ‘03
The Story so far:
Basically we learnt about Instruction Set Architectures:
- Computer organization: the concept of abstraction
- Instruction Set Architectures: definition, types, examples
- Instruction formats: operands, addressing modes
- Operations: load, store, arithmetic, logical
- Control instructions: branch, jump, procedures
- Stacks
CS141-L2-3 Tarun Soni, Summer ‘03
MIPS Software Register Conventions
 0      zero     constant 0
 1      at       reserved for assembler
 2-3    v0-v1    expression evaluation & function results
 4-7    a0-a3    arguments
 8-15   t0-t7    temporaries: caller saves (callee can clobber)
16-23   s0-s7    callee saves (caller can clobber)
24-25   t8-t9    temporaries (cont'd)
26-27   k0-k1    reserved for OS kernel
28      gp       pointer to global area
29      sp       stack pointer
30      fp       frame pointer
31      ra       return address (HW)
CS141-L2-4 Tarun Soni, Summer ‘03
Example: Swap()
• Can we figure out the code?
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

swap:                    // $4 = v, $5 = k
    muli $2, $5, 4       // $2 = k*4
    add  $2, $4, $2      // $2 = v + 4*k
    lw   $15, 0($2)      // $15 = temp = *($2+0) = *(v+k)
    lw   $16, 4($2)      // $16 = *($2+4) = *(v+k+1)
    sw   $16, 0($2)      // *(v+k) = $16 = *(v+k+1)
    sw   $15, 4($2)      // *(v+k+1) = $15 = temp
    jr   $31             // return
CS141-L2-5 Tarun Soni, Summer ‘03
Example: Leaf_procedure()
• Procedures?

int PairDiff(int a, int b, int c, int d)
{
    int temp;
    temp = (a + b) - (c + d);
    return temp;
}

Assume the caller puts a, b, c, d in $a0-$a3 and wants the result in $v0.

PairDiff:
    sub $sp, $sp, 12    // make space for 3 temp locations
    sw  $t1, 8($sp)     // save $t1 (optional under the MIPS convention)
    sw  $t0, 4($sp)     // save $t0 (optional under the MIPS convention)
    sw  $s0, 0($sp)     // save $s0
    add $t0, $a0, $a1   // t0 = a+b
    add $t1, $a2, $a3   // t1 = c+d
    sub $s0, $t0, $t1   // s0 = t0-t1
    add $v0, $s0, $zero // store return value in $v0
    lw  $s0, 0($sp)     // restore registers
    lw  $t0, 4($sp)     // (optional under the MIPS convention)
    lw  $t1, 8($sp)     // (optional under the MIPS convention)
    add $sp, $sp, 12    // 'pop' the stack
    jr  $ra             // the actual return to the calling routine
CS141-L2-6 Tarun Soni, Summer ‘03
Example: Nested_procedure()
• What about nested procedures? $ra ??
• Recursive procedures?
int fact(int n)
{
    if (n < 1) return 1;
    else return n * fact(n-1);
}

Assume $a0 = n.

fact:
    sub  $sp, $sp, 8     // make space for 2 temp locations
    sw   $ra, 4($sp)     // save return address
    sw   $a0, 0($sp)     // save argument n
    slti $t0, $a0, 1     // test for n < 1
    beq  $t0, $zero, L1  // if (n >= 1) goto L1
    add  $v0, $zero, 1   // $v0 = 1              (n < 1 case)
    add  $sp, $sp, 8     // 'pop' the stack
    jr   $ra             // return
L1: sub  $a0, $a0, 1     // n--                  (n >= 1 case)
    jal  fact            // call fact again
    lw   $a0, 0($sp)     // fact() returns here; restore n
    lw   $ra, 4($sp)     // restore return address
    add  $sp, $sp, 8     // 'pop' the stack
    mul  $v0, $a0, $v0   // $v0 = n * fact(n-1)
    jr   $ra             // return to caller
CS141-L2-7 Tarun Soni, Summer ‘03
Comparing Instruction Set Architectures
Design-time metrics:
° Can it be implemented, in how long, at what cost?
° Can it be programmed? Ease of compilation?
Static Metrics:
° How many bytes does the program occupy in memory?
Dynamic Metrics:
° How many instructions are executed?
° How many bytes does the processor fetch to execute the program?
° How many clocks are required per instruction?
° How "lean" a clock is practical?
Best Metric: Time to execute the program!
This depends on instruction set, processor organization, and compilation techniques.
Time = Instruction Count × CPI × Cycle Time
CS141-L2-8 Tarun Soni, Summer ‘03
Computer Performance
Measuring and discussing computer system performance, or:
"My computer is faster than your computer"
CS141-L2-9 Tarun Soni, Summer ‘03
SPEC Performance
[Graph: performance (0-350) vs. year (1982-1995). After the RISC introduction, RISC processors pull away from Intel x86; the earlier trend was about 35%/yr.]
performance now improves 50% per year (2x every 1.5 years)
But what is performance ??
CS141-L2-10 Tarun Soni, Summer ‘03
Performance depends on the eyes of the beholder?
• Purchasing perspective: given a collection of machines, which has the
  • best performance?
  • least cost?
  • best performance / cost?
• Design perspective: faced with design options, which has the
  • best performance improvement?
  • least cost?
  • best performance / cost?
• Both require
  – a basis for comparison
  – a metric for evaluation
• Our goal is to understand the cost & performance implications of architectural choices
CS141-L2-11 Tarun Soni, Summer ‘03
Two ideas
° Time to do the task (Execution Time)
– execution time, response time, latency
° Tasks per day, hour, week, sec, ns. .. (Performance)
– throughput, bandwidth
Response time and throughput often are in opposition
Plane        Speed      DC to Paris  Passengers  Throughput (pmph)
Boeing 747   610 mph    6.5 hours    470         286,700
Concorde     1350 mph   3 hours      132         178,200
Which has higher performance?
• How much faster is the Concorde compared to the 747?
• How much bigger is the 747 than the Douglas DC-8?
CS141-L2-12 Tarun Soni, Summer ‘03
° Time to do the task from start to finish
– execution time, response time, latency
° Tasks per unit time
– throughput, bandwidth (the term "bandwidth" is used mostly for data movement)
Two mechanisms of getting to the Bay Area:

Vehicle     Speed    Time to Bay Area  Passengers  Throughput (pm/h)
Ferrari     160 mph  3.1 hours         2           320
Greyhound   65 mph   7.7 hours         60          3900

Response time and throughput often are in opposition.
CS141-L2-13 Tarun Soni, Summer ‘03
Relative performance ?
• can be confusing
A runs in 12 seconds, B runs in 20 seconds
– A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
– B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
• needs a precise definition
CS141-L2-14 Tarun Soni, Summer ‘03
Relative performance ?
• Performance is in units of things-per-second – bigger is better
• If we are primarily concerned with response time:

    performance(X) = 1 / execution_time(X)

"X is n times faster than Y" means:

    n = Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X)
CS141-L2-15 Tarun Soni, Summer ‘03
How many times ?
• Time: Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput: Concorde vs. Boeing 747?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  – Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster"
• Boeing is 1.6 times ("60%") faster in terms of throughput
• Concorde is 2.2 times ("120%") faster in terms of flying time
We will focus primarily on execution time for a single job
CS141-L2-16 Tarun Soni, Summer ‘03
Some grammar?

• "times faster than" (or "times as fast as") – a multiplicative factor relating the quantities
  – "X was 3 times faster than Y": speed(X) = 3 × speed(Y)
• "percent faster than" – implies an additive relationship
  – "X was 25% faster than Y": speed(X) = (1 + 25/100) × speed(Y)
• "percent slower than" – implies subtraction
  – "X was 5% slower than Y": speed(X) = (1 - 5/100) × speed(Y)
  – "100% slower" means it doesn't move at all!
• "times slower than" or "times as slow as" – is awkward
  – "X was 3 times slower than Y" means speed(X) = (1/3) × speed(Y)
CS141-L2-17 Tarun Soni, Summer ‘03
Avoid Linguistic Confusion

X is r times faster than Y:  speed(X) = r × speed(Y)
Y is r times slower than X:  speed(Y) = (1/r) × speed(X)

X is r times faster than Y, and Y is s times faster than Z:
  speed(X) = r × speed(Y) = rs × speed(Z), so X is rs times faster than Z.
(You cannot chain "%" numbers this way!)

Easiest way to avoid confusion:
- Convert "% faster" to "times faster", then do the calculation and convert back if needed.
- Example: change "25% faster" to "5/4 times faster".
CS141-L2-18 Tarun Soni, Summer ‘03
Which time anyway?

• User CPU time? (time the CPU spends running your code)
• Total CPU time (user + kernel)? (includes operating-system code)
• Wallclock time? (total elapsed time)
  – includes time spent waiting for I/O, other users, ...
• The answer depends...
  – For measuring processor speed, we can use total CPU time.
  – If there is no I/O or interrupts, wallclock may be better:
    • more precise (microseconds rather than 1/100 sec)
    • can measure individual sections of code

> time foo
... foo's results ...
90.7u 12.9s 2:39 65%
>

(90.7u 12.9s is user + kernel time; 2:39 is wallclock.)
CS141-L2-19 Tarun Soni, Summer ‘03
Metrics of Performance
Each level of the system stack has its own natural metric:
- Application: answers per month, useful operations per second
- Programming language / compiler / ISA: (millions of) instructions per second – MIPS; (millions of) floating-point operations per second – MFLOP/s
- Datapath & control: megabytes per second
- Function units, transistors, wires, pins: cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused.
CS141-L2-20 Tarun Soni, Summer ‘03
Levels of benchmarking
• Actual target workload
  + representative
  – very specific, non-portable, difficult to run or measure, hard to identify cause
• Full application benchmarks
  + portable, widely used, improvements useful in reality
  – less representative
• Small "kernel" benchmarks
  + easy to run, early in the design cycle
  – easy to "fool"
• Microbenchmarks
  + identify peak capability and potential bottlenecks
  – "peak" may be a long way from application performance
CS141-L2-21 Tarun Soni, Summer ‘03
Cycle Time
• Instead of reporting execution time in seconds, we often use cycles
• Clock “ticks” indicate when to start activities (one abstraction):
• cycle time = time between ticks = seconds per cycle
• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
    seconds/program = (cycles/program) × (seconds/cycle)

A 200 MHz clock has a cycle time of:

    1 / (200 × 10^6) seconds = 5 × 10^-9 seconds = 5 nanoseconds
CS141-L2-22 Tarun Soni, Summer ‘03
Cycle Time
CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

(seconds = instructions × cycles/instruction × seconds/cycle)

• Improve performance => reduce execution time
  – reduce instruction count (programmer, compiler)
  – reduce cycles per instruction (ISA, machine designer)
  – reduce clock cycle time (hardware designer, physicist)
CS141-L2-23 Tarun Soni, Summer ‘03
Performance Variation
CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

Case                                          Number of instructions  CPI        Clock cycle time
Same machine, different programs              different               similar    same
Same programs, different machines, same ISA   same                    different  different
Same programs, different machines             somewhat different      different  different
CS141-L2-24 Tarun Soni, Summer ‘03
Amdahl’s Law
Execution Time After Improvement =
Execution Time Unaffected +
( Execution Time Affected / Amount of Improvement )
• Example:
"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making it 5 times faster?
• Principle: Make the common case fast
CS141-L2-25 Tarun Soni, Summer ‘03
MIPS, MFLOPS etc.
• MIPS – million instructions per second

    MIPS = (number of instructions executed in program) / (execution time in seconds × 10^6)
         = clock rate / (CPI × 10^6)

• MFLOPS – million floating-point operations per second

    MFLOPS = (number of floating-point operations executed in program) / (execution time in seconds × 10^6)

• program-independent? • deceptive!
CS141-L2-26 Tarun Soni, Summer ‘03
Example RISC Processor
Base Machine (Reg / Reg), typical mix:

Op      Freq   Cycles   CPI(i)   % Time
ALU     50%    1        0.5      23%
Load    20%    5        1.0      45%
Store   10%    3        0.3      14%
Branch  20%    2        0.4      18%
Total CPI:              2.2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
CS141-L2-27 Tarun Soni, Summer ‘03
SPEC
Which Programs?
• peak throughput measures (simple programs)
• synthetic benchmarks (Whetstone, Dhrystone, ...)
• real applications
• SPEC (best of both worlds, but with problems of their own)
  – System Performance Evaluation Cooperative
  – provides a common set of real applications along with strict guidelines for how to run them
  – provides a relatively unbiased means to compare machines
CS141-L2-28 Tarun Soni, Summer ‘03
SPEC89
• Compiler “enhancements” and performance
[Graph: SPEC performance ratios (0-800) for the SPEC89 benchmarks tomcatv, fpppp, matrix300, eqntott, li, nasa7, doduc, spice, espresso, and gcc, comparing the base compiler against an "enhanced" compiler.]
CS141-L2-29 Tarun Soni, Summer ‘03
SPEC CPU2000 Suite

• SPECint2000
  • gzip and bzip2 – compression
  • gcc – compiler; 205K lines of messy code!
  • crafty – chess program
  • parser – word processing
  • vortex – object-oriented database
  • perlbmk – PERL interpreter
  • eon – computer visualization
  • vpr, twolf – CAD tools for VLSI
  • mcf, gap – "combinatorial" programs
• SPECfp2000 – 10 Fortran, 3 C programs
  – scientific application programs (physics, chemistry, image processing, number theory, ...)
CS141-L2-30 Tarun Soni, Summer ‘03
Performance is always misleading
• Performance is specific to a particular program or programs
  – total execution time is a consistent summary of performance
• For a given architecture, performance increases come from:
  – increases in clock rate (without adverse CPI effects)
  – improvements in processor organization that lower CPI
  – compiler enhancements that lower CPI and/or instruction count
• Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance
• You should not always believe everything you read! Read carefully!
CS141-L2-31 Tarun Soni, Summer ‘03
Computer Arithmetic
What do all those bits mean now?

bits (011011011100010...01) can be:
• instructions – R-format, I-format, ...
• data – numbers, text, chars, ...
  – integers: signed or unsigned
  – floating point: single precision or double precision
CS141-L2-32 Tarun Soni, Summer ‘03
Computer Arithmetic
• How do you represent
  – negative numbers?
  – fractions?
  – really large numbers?
  – really small numbers?
• How do you
  – do arithmetic?
  – identify errors (e.g. overflow)?
• What is an ALU and what does it look like?
  – ALU = arithmetic logic unit
CS141-L2-33 Tarun Soni, Summer ‘03
Big Endian vs. Little Endian
[Diagram: a 32-bit word (bit 31 down to bit 0, least-significant bit on the right) stored across byte addresses 0, 1, 2, 3, each byte holding 8 bits of data.]

• Big Endian (IBM, Motorola, HP, Sun): most-significant byte at the lowest address
• Little Endian (DEC, Intel): least-significant byte at the lowest address

Some processors (e.g. PowerPC) provide both
  – if you can figure out how to switch modes, or get the compiler to issue "byte-reversed loads and stores"
CS141-L2-34 Tarun Soni, Summer ‘03
Binary Numbers: An Introduction
Consider a 4-bit binary number:

Decimal  Binary     Decimal  Binary
0        0000       4        0100
1        0001       5        0101
2        0010       6        0110
3        0011       7        0111

Examples of binary arithmetic:

  3 + 2 = 5          3 + 3 = 6

    carry: 1           carries: 11
      0011               0011
    + 0010             + 0011
    ------             ------
      0101               0110
CS141-L2-35 Tarun Soni, Summer ‘03
Negative Numbers: Some options
• We would like a number system that provides
  – an obvious representation of 0, 1, 2, ...
  – uses the adder for addition
  – a single value of 0
  – equal coverage of positive and negative numbers
  – easy detection of sign
  – easy negation
• Sign magnitude – MSB is the sign bit, rest the same
  -1 == 1001
  -5 == 1101
• One's complement – flip all bits to negate
  -1 == 1111
  -5 == 1010
CS141-L2-36 Tarun Soni, Summer ‘03
Negative Numbers: two’s complement
– Positive numbers: normal binary representation
– Negative numbers: flip the bits (0 ↔ 1), then add 1

Decimal  Binary     Decimal  Binary
-8       1000       0        0000
-7       1001       1        0001
-6       1010       2        0010
-5       1011       3        0011
-4       1100       4        0100
-3       1101       5        0101
-2       1110       6        0110
-1       1111       7        0111

Smallest 4-bit number: -8. Biggest 4-bit number: 7.
CS141-L2-37 Tarun Soni, Summer ‘03
Two’s complement arithmetic
• Examples: 7 - 6 = 7 + (-6) = 1;  3 - 5 = 3 + (-5) = -2

Decimal  Binary     Decimal  Binary
0        0000       -1       1111
1        0001       -2       1110
2        0010       -3       1101
3        0011       -4       1100
4        0100       -5       1011
5        0101       -6       1010
6        0110       -7       1001
7        0111       -8       1000

      0111    7          0011    3
    + 1010   -6        + 1011   -5
    ------             ------
      0001    1          1110   -2

(In 7 + (-6), the carry out of the top bit is simply discarded.)
Uses simple adder for + and - numbers
CS141-L2-38 Tarun Soni, Summer ‘03
Things to keep in mind

• Negation: flip the bits and add 1 (works for + and -)
  – might cause overflow
• Extend the sign when loading into a larger register
  – +3 => 0011, 00000011, 0000000000000011
  – -3 => 1101, 11111101, 1111111111111101
• Overflow detection
  – need to raise an "exception" when the answer can't be represented:

      0101    5
    + 0110    6
    ------
      1011   -5 ??!!!
CS141-L2-39 Tarun Soni, Summer ‘03
Overflow detection again
Overflow:                      No overflow:

      0111    7                    0010    2
    + 0011    3                  + 0011    3
    ------                       ------
      1010   -6 ?!                 0101    5

      1100   -4                    1100   -4
    + 1011   -5                  + 1110   -2
    ------                       ------
      0111    7 ?!                 1010   -6
So how do we detect overflow?
Carry into the MSB != carry out of the MSB.
CS141-L2-40 Tarun Soni, Summer ‘03
Execution: the heart of it all
[Diagram: an ALU with two 32-bit inputs a and b, an operation control input, and a 32-bit result output.]

Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction
CS141-L2-41 Tarun Soni, Summer ‘03
A Basic ALU
• ALU Control Lines (ALUop)   Function
  – 000                        And
  – 001                        Or
  – 010                        Add
  – 110                        Subtract
  – 111                        Set-on-less-than

[Diagram: an N-bit ALU with inputs A and B, a 3-bit ALUop control, and outputs Result, Zero, Overflow, and CarryOut.]
General idea: Build for 1-bit numbers and then extend for n-bits!
CS141-L2-42 Tarun Soni, Summer ‘03
Some basics of digital logic
1. AND gate: c = a · b

   a b | c
   0 0 | 0
   0 1 | 0
   1 0 | 0
   1 1 | 1

2. OR gate: c = a + b

   a b | c
   0 0 | 0
   0 1 | 1
   1 0 | 1
   1 1 | 1

3. Inverter: c = !a

   a | c
   0 | 1
   1 | 0

4. Multiplexor: if (d == 0) c = a; else c = b
CS141-L2-43 Tarun Soni, Summer ‘03
1-bit ALU
[Diagram: (left) a 1-bit ALU in which a and b feed an AND gate and an OR gate, with a multiplexor driven by Operation selecting the Result. (right) the same ALU extended with a 1-bit adder having CarryIn and CarryOut, the multiplexor now selecting among And, Or, and Add.]

• ALU Control Lines (ALUop)   Function
  – 000   And
  – 001   Or
  – 010   Add
But how do we make the adder?
CS141-L2-44 Tarun Soni, Summer ‘03
1-bit Full Adder
• This is also called a (3, 2) adder
• Half adder: no CarryIn nor CarryOut
• Truth table:

  A  B  CarryIn | CarryOut  Sum | Comments
  0  0  0       | 0         0   | 0 + 0 + 0 = 00
  0  0  1       | 0         1   | 0 + 0 + 1 = 01
  0  1  0       | 0         1   | 0 + 1 + 0 = 01
  0  1  1       | 1         0   | 0 + 1 + 1 = 10
  1  0  0       | 0         1   | 1 + 0 + 0 = 01
  1  0  1       | 1         0   | 1 + 0 + 1 = 10
  1  1  0       | 1         0   | 1 + 1 + 0 = 10
  1  1  1       | 1         1   | 1 + 1 + 1 = 11

[Diagram: a 1-bit full adder block with inputs A, B, CarryIn and outputs Sum and CarryOut.]
CS141-L2-45 Tarun Soni, Summer ‘03
1-bit Full Adder: CarryOut
CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)

which simplifies to:

CarryOut = (B & CarryIn) | (A & CarryIn) | (A & B)

(Read off the CarryOut column of the truth table above.)

[Diagram: gate-level implementation of CarryOut from a, b, and CarryIn.]
CS141-L2-46 Tarun Soni, Summer ‘03
1-bit Full Adder: Sum
(Read off the Sum column of the same truth table.)
Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
CS141-L2-47 Tarun Soni, Summer ‘03
32-bit ALU
[Diagram: the 1-bit ALU (And/Or/Add, with CarryIn and CarryOut) replicated 32 times; the CarryOut of ALU0 feeds the CarryIn of ALU1, and so on up through ALU31, producing Result0..Result31.]

• (ALUop) Function: And, Or, Add

What about other operations?

  sub $s1, $s2, $s3   ; $s1 = $s2 - $s3                          (subtraction)
  slt $s1, $s2, $s3   ; if ($s2 < $s3) {$s1 = 1} else {$s1 = 0}  (set on less than, SLT)
CS141-L2-48 Tarun Soni, Summer ‘03
32-bit ALU
• Keep in mind the following:
  – (A - B) is the same as A + (-B)
  – 2's complement negate: take the inverse of every bit and add 1
• The bit-wise inverse of B is !B:
  – A - B = A + (-B) = A + (!B + 1) = A + !B + 1

[Diagram: the 1-bit ALU with a Binvert control; a multiplexor selects either b or !b as the adder's second input. Binvert provides the negation.]

What about the '+1'?
CS141-L2-49 Tarun Soni, Summer ‘03
32-bit ALU
[Diagram: the 32-bit ALU with a single Bnegate control wired to every Binvert input and to the CarryIn of ALU0; each bit has a Less input, ALU31 produces the Set output, and the ALU also produces Zero and Overflow.]
Setting CarryIn[0] = 1 provides the ‘+1’ for the 32-bit adder.
CS141-L2-50 Tarun Soni, Summer ‘03
32-bit ALU: slt
• The slt instruction:
  – if (a < b) result = 1, else result = 0
  – equivalently: if (a - b < 0) result = 1, else result = 0
• So: do a subtract and use the sign bit
  – route it to bit 0 of the result
  – all other result bits are zero

[Diagram: (a) a 1-bit ALU with a Less input as multiplexor input 3; (b) the MSB version, which also produces the Set output (the adder's sign bit) and contains the overflow-detection logic.]
CS141-L2-51 Tarun Soni, Summer ‘03
32-bit ALU: Special conditions
Overflow Detection Logic
• Carry into the MSB != carry out of the MSB
  – for an N-bit ALU: Overflow = CarryIn[N - 1] XOR CarryOut[N - 1]

[Diagram: four 1-bit ALUs in a ripple chain; CarryIn3 and CarryOut3 feed an XOR gate whose output is Overflow.]

  X Y | X XOR Y
  0 0 | 0
  0 1 | 1
  1 0 | 1
  1 1 | 0
CS141-L2-52 Tarun Soni, Summer ‘03
32-bit ALU: Special conditions
[Diagram repeated: the 1-bit ALU (a) and its MSB variant (b) from the slt slide.]

• Thus the MSB block has special logic to generate:
  • the Set line (sign bit)
  • the Overflow line
CS141-L2-53 Tarun Soni, Summer ‘03
32-bit ALU: Special conditions
Zero Detection Logic
• The zero-detection logic is just one big NOR gate
  – any non-zero result bit forces the NOR gate's output to zero

[Diagram: Result0..Result3 of a 4-bit ALU feed a single NOR gate whose output is Zero.]
CS141-L2-54 Tarun Soni, Summer ‘03
32 bit ALU
[Diagram: the complete 32-bit ALU again, with Bnegate, per-bit Less inputs, Set from ALU31, Zero, and Overflow.]

• Notice the control lines:
  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt
• Zero is 1 exactly when the result is zero!

But what about performance?
CS141-L2-55 Tarun Soni, Summer ‘03
32 bit ALU
• We can build an ALU to support the MIPS instruction set
– key idea: use multiplexor to select the output we want
– we can efficiently perform subtraction using two’s complement
– we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
– all of the gates are always working
– the speed of a gate is affected by the number of inputs to the gate
– the speed of a circuit is affected by the number of gates in series (on the "critical path" or the "deepest level of logic")
• Our primary focus is comprehension; however,
  – clever changes to organization can improve performance (similar to using better algorithms in software)
  – we'll look at two examples, for addition and multiplication
CS141-L2-56 Tarun Soni, Summer ‘03
Performance: Ideal (CS) versus Reality (EE)
• When input 0 -> 1, output 1 -> 0 but NOT instantly– Output goes 1 -> 0: output voltage goes from Vdd (5v) to 0v
• When input 1 -> 0, output 0 -> 1 but NOT instantly– Output goes 0 -> 1: output voltage goes from 0v to Vdd (5v)
• Voltage does not like to change instantaneously
[Diagram: an inverter (In → Out) with voltage-vs-time waveforms: Vin swings between Vdd (1) and GND (0), and Vout follows with a finite transition time.]
CS141-L2-57 Tarun Soni, Summer ‘03
Performance
Series Connection
• Total Propagation Delay = Sum of individual delays = d1 + d2
• Capacitance C1 has two components:
– Capacitance of the wire connecting the two gates
– Input capacitance of the second inverter
[Diagram: two inverters G1 and G2 in series; node V1 drives capacitance C1; the waveforms of Vin, V1, and Vout show the delays d1 and d2, measured at Vdd/2.]
CS141-L2-58 Tarun Soni, Summer ‘03
Performance: Calculating Delays
• Sum delays along serial paths
• Delay(Vin -> V2) != Delay(Vin -> V3)
  – Delay(Vin -> V2) = Delay(Vin -> V1) + Delay(V1 -> V2)
  – Delay(Vin -> V3) = Delay(Vin -> V1) + Delay(V1 -> V3)
• Critical path = the longest among the N parallel paths
• C1 = wire C + Cin of gate 2 + Cin of gate 3

[Diagram: G1 drives both G2 (output V2) and G3 (output V3); node V1 sees capacitance C1.]
CS141-L2-59 Tarun Soni, Summer ‘03
Performance: Storage elements
• Setup Time: Input must be stable BEFORE the trigger clock edge
• Hold Time: Input must REMAIN stable after the trigger clock edge
• Clock-to-Q time:
– Output cannot change instantaneously at the trigger clock edge
– Similar to delay in logic gates, two components:
• Internal Clock-to-Q
• Load dependent Clock-to-Q
[Timing diagram: D is "don't care" except around the triggering clock edge; it must be stable for the setup time before the edge and the hold time after it, and Q is unknown until the clock-to-Q delay has elapsed.]

• Storage element here: a negative-edge-triggered D flip-flop
CS141-L2-60 Tarun Soni, Summer ‘03
Performance: Synchronous logic
• All storage elements are clocked by the same clock edge
• The combinational logic block's:
  – inputs are updated at each clock tick
  – outputs MUST all be stable before the next clock tick

[Diagram: a register bank feeds a combinational logic block, which feeds another register bank; all registers share the same Clk.]
CS141-L2-61 Tarun Soni, Summer ‘03
Performance: Critical Path
• Critical path: the slowest path between any two storage devices
• Cycle time is a function of the critical path; it must be greater than:
  – Clock-to-Q + longest path through the combinational logic + Setup

[Diagram: the same register → logic → register structure, with the critical path highlighted.]
CS141-L2-62 Tarun Soni, Summer ‘03
Clock Skew
• The worst-case scenario for cycle-time consideration:
  – the input register sees CLK1
  – the output register sees CLK2
• Cycle Time >= CLK-to-Q + Longest Delay + Setup + Clock Skew

[Diagram: Clk1 and Clk2 offset by the clock skew; the input register is clocked by Clk1 and the output register by Clk2.]
CS141-L2-63 Tarun Soni, Summer ‘03
Cycle Time: Thumb rules
• Reduce the number of gate levels
° Pay attention to loading:
  ° one gate driving many gates is a bad idea
  ° avoid using a small gate to drive a long wire
  ° use multiple stages to drive a large load

[Diagram: a gate network for inputs A, B, C, D restructured into fewer logic levels; a large load Clarge driven through staged INV4x buffers.]
CS141-L2-64 Tarun Soni, Summer ‘03
Back to ALUs
[Diagram: the 32-bit ripple-carry ALU again, with the carry chain running from ALU0 through ALU31.]

• The adder we just built is called a "Ripple Carry Adder"
  – the carry bit may have to propagate from LSB to MSB
  – worst-case delay for an N-bit ripple-carry adder: 2N gate delays
• E.g. (back-of-the-envelope approximation):
  • single gate delay = 0.02 ns (inverter "speed" of 50 GHz)
  • 32-bit adder => 64 gate delays => 1.28 ns delay => maximum clock of about 781 MHz
CS141-L2-65 Tarun Soni, Summer ‘03
Ripple Carry Adders
• Is there more than one way to do addition?
  – two extremes: ripple carry and sum-of-products
Can you see the ripple? How could you get rid of it?
c1 = b0c0 + a0c0 + a0b0
c2 = b1c1 + a1c1 + a1b1
   = b1(b0c0 + a0c0 + a0b0) + a1(b0c0 + a0c0 + a0b0) + a1b1
c3 = b2c2 + a2c2 + a2b2 = ...
c4 = b3c3 + a3c3 + a3b3 = ...
Not feasible! Why?
CS141-L2-66 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
• An approach in between our two extremes
• Motivation:
  – if we didn't know the value of carry-in, what could we do?
  – when would we always generate a carry?   gi = ai · bi
  – when would we propagate the carry?       pi = ai + bi

(Generate carry: CarryOut = 1, independent of CarryIn. Propagate carry: CarryOut = CarryIn. Both can be read off the full-adder truth table above.)

c1 = g0 + p0c0
c2 = g1 + p1c1
c3 = g2 + p2c2
c4 = g3 + p3c3
CS141-L2-67 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Propagate and Generate machinery.
Worst-case delay: one gate.
CS141-L2-68 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Generation of the CarryOut.
The delay (and size) still does grow with number of bits.
CS141-L2-69 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Generation of the Result.
Sum_i = P_i XOR C_(i-1), with p_i = a_i + b_i
CS141-L2-70 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
• It is very expensive to build a "full" carry-lookahead adder
  – just imagine the length of the equation for Cin31
• Common practice:
  – connect several N-bit lookahead adders to form a bigger adder
  – example: connect four 8-bit carry-lookahead adders to form a 32-bit partial carry-lookahead adder
[Diagram: four 8-bit carry-lookahead adders, chained through carries C0, C8, C16, C24, add A[31:0] and B[31:0] to produce Result[31:0], 8 bits per block.]
CS141-L2-71 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
[Diagram: four 4-bit ALUs each export block propagate/generate signals (P0G0 .. P3G3) to a carry-lookahead unit, which computes C1..C4 directly; the blocks produce Result0-3, Result4-7, Result8-11, and Result12-15.]

• Can't build a 16-bit adder this way... (too big)
• Could use ripple carry of 4-bit CLA adders
• Better: use the CLA principle again!
CS141-L2-72 Tarun Soni, Summer ‘03
What did we cover today?
• Last pieces of the ISA material
• Performance: how to quantify it
• Binary representation: integers, positive and negative
• Basic ALU design
  • 1-bit addition
  • handling the carry
  • carry lookahead
  • subtraction
  • set on less than
  • condition codes such as overflow, zero
• Performance: cycle time, number of gates, etc.

Next class:
• Multiplication, division, floating-point numbers
• Rest of Chapter 4 from the text

Remember: quizzes are surprises, and based on the homeworks.