CS141-L2-1 Tarun Soni, Summer ‘03
Performance, ALUs and such like
The good news: no quiz today !
Homework #1 is on the net now, so are the slides from the previous class. Home page is www.cs.ucsd.edu/~tsoni/cse141
Finals will be on the last day of class, no special time slot.
Add-drops will be handled at the break.
Today: Chap 2 and 4 of the text.
CS141-L2-2 Tarun Soni, Summer ‘03
The Story so far:
Basically we learnt about Instruction Set Architectures:
- Computer organization: the concept of abstraction
- Instruction Set Architectures: definition, types, examples
- Instruction formats: operands, addressing modes
- Operations: load, store, arithmetic, logical
- Control instructions: branch, jump, procedures
- Stacks
CS141-L2-3 Tarun Soni, Summer ‘03
MIPS Software Register Conventions
 0      zero     constant 0
 1      at       reserved for assembler
 2-3    v0-v1    expression evaluation & function results
 4-7    a0-a3    arguments
 8-15   t0-t7    temporaries: caller saves (callee can clobber)
16-23   s0-s7    callee saves (caller can clobber)
24-25   t8-t9    temporaries (cont'd)
26-27   k0-k1    reserved for OS kernel
28      gp       pointer to global area
29      sp       stack pointer
30      fp       frame pointer
31      ra       return address (HW)
CS141-L2-4 Tarun Soni, Summer ‘03
Example: Swap()
• Can we figure out the code?
void swap(int v[], int k)
{
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

swap:                    // $4 = v, $5 = k
    muli $2, $5, 4       // $2 = k*4
    add  $2, $4, $2      // $2 = v + 4*k
    lw   $15, 0($2)      // $15 = temp = *($2+0) = *(v+k)
    lw   $16, 4($2)      // $16 = *($2+4) = *(v+k+1)
    sw   $16, 0($2)      // *(v+k) = $16 = *(v+k+1)
    sw   $15, 4($2)      // *(v+k+1) = $15 = temp
    jr   $31             // return
CS141-L2-5 Tarun Soni, Summer ‘03
Example: Leaf_procedure()
• Procedures?

int PairDiff(int a, int b, int c, int d)
{
    int temp;
    temp = (a + b) - (c + d);
    return temp;
}

Assume the caller puts a, b, c, d in $a0-$a3 and wants the result in $v0.

PairDiff:
    sub $sp, $sp, 12    // make space for 3 temp locations
    sw  $t1, 8($sp)     // save $t1 (optional under the MIPS convention)
    sw  $t0, 4($sp)     // save $t0 (optional under the MIPS convention)
    sw  $s0, 0($sp)     // save $s0
    add $t0, $a0, $a1   // t0 = a+b
    add $t1, $a2, $a3   // t1 = c+d
    sub $s0, $t0, $t1   // s0 = t0-t1
    add $v0, $s0, $zero // store return value in $v0
    lw  $s0, 0($sp)     // restore registers
    lw  $t0, 4($sp)     // (optional under the MIPS convention)
    lw  $t1, 8($sp)     // (optional under the MIPS convention)
    add $sp, $sp, 12    // 'pop' the stack
    jr  $ra             // the actual return to the calling routine
CS141-L2-6 Tarun Soni, Summer ‘03
Example: Nested_procedure()
• What about nested procedures? $ra ??
• Recursive procedures?
int fact(int n)
{
    if (n < 1) return 1;
    else return n * fact(n-1);
}

Assume $a0 = n.

fact:
    sub  $sp, $sp, 8     // make space for 2 temp locations
    sw   $ra, 4($sp)     // save return address
    sw   $a0, 0($sp)     // save argument n
    slti $t0, $a0, 1     // test for n < 1
    beq  $t0, $zero, L1  // if (n >= 1) goto L1
    add  $v0, $zero, 1   // $v0 = 1              (n < 1 case)
    add  $sp, $sp, 8     // 'pop' the stack
    jr   $ra             // return
L1: sub  $a0, $a0, 1     // n--                  (n >= 1 case)
    jal  fact            // call fact again
    lw   $a0, 0($sp)     // fact() returns here; restore n
    lw   $ra, 4($sp)     // restore return address
    add  $sp, $sp, 8     // 'pop' the stack
    mul  $v0, $a0, $v0   // $v0 = n * fact(n-1)
    jr   $ra             // return to caller
CS141-L2-7 Tarun Soni, Summer ‘03
Comparing Instruction Set Architectures
Design-time metrics:
° Can it be implemented, in how long, at what cost?
° Can it be programmed? Ease of compilation?
Static Metrics:
° How many bytes does the program occupy in memory?
Dynamic Metrics:
° How many instructions are executed?
° How many bytes does the processor fetch to execute the program?
° How many clocks are required per instruction?
° How "lean" a clock is practical?
Best Metric: Time to execute the program!
This depends on instruction set, processor organization, and compilation techniques.
Time = Instruction Count × CPI × Cycle Time
CS141-L2-8 Tarun Soni, Summer ‘03
Computer Performance
Measuring and discussing computer system performance, or:
"My computer is faster than your computer"
CS141-L2-9 Tarun Soni, Summer ‘03
SPEC Performance
[Graph: performance (0-350) vs. year (1982-1995). After the RISC introduction, RISC processors pull away from Intel x86; the earlier trend was about 35%/yr.]
performance now improves 50% per year (2x every 1.5 years)
But what is performance ??
CS141-L2-10 Tarun Soni, Summer ‘03
Performance depends on the eyes of the beholder?
• Purchasing perspective: given a collection of machines, which has the
  • best performance?
  • least cost?
  • best performance / cost?
• Design perspective: faced with design options, which has the
  • best performance improvement?
  • least cost?
  • best performance / cost?
• Both require
  – a basis for comparison
  – a metric for evaluation
• Our goal is to understand the cost & performance implications of architectural choices
CS141-L2-11 Tarun Soni, Summer ‘03
Two ideas
° Time to do the task (Execution Time)
– execution time, response time, latency
° Tasks per day, hour, week, sec, ns. .. (Performance)
– throughput, bandwidth
Response time and throughput often are in opposition
Plane        Speed      DC to Paris  Passengers  Throughput (pmph)
Boeing 747   610 mph    6.5 hours    470         286,700
Concorde     1350 mph   3 hours      132         178,200
Which has higher performance?
• How much faster is the Concorde compared to the 747?
• How much bigger is the 747 than the Douglas DC-8?
CS141-L2-12 Tarun Soni, Summer ‘03
° Time to do the task from start to finish
– execution time, response time, latency
° Tasks per unit time
– throughput, bandwidth (the term "bandwidth" is used mostly for data movement)
Two mechanisms of getting to the Bay Area:

Vehicle     Speed    Time to Bay Area  Passengers  Throughput (pm/h)
Ferrari     160 mph  3.1 hours         2           320
Greyhound   65 mph   7.7 hours         60          3900

Response time and throughput often are in opposition.
CS141-L2-13 Tarun Soni, Summer ‘03
Relative performance ?
• can be confusing
A runs in 12 seconds, B runs in 20 seconds
– A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
– B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower
• needs a precise definition
CS141-L2-14 Tarun Soni, Summer ‘03
Relative performance ?
• Performance is in units of things-per-second – bigger is better
• If we are primarily concerned with response time:

    performance(X) = 1 / execution_time(X)

"X is n times faster than Y" means:

    n = Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X)
CS141-L2-15 Tarun Soni, Summer ‘03
How many times ?
• Time: Concorde vs. Boeing 747?
  – Concorde is 1350 mph / 610 mph = 2.2 times faster (= 6.5 hours / 3 hours)
• Throughput: Concorde vs. Boeing 747?
  – Concorde is 178,200 pmph / 286,700 pmph = 0.62 "times faster"
  – Boeing is 286,700 pmph / 178,200 pmph = 1.6 "times faster"
• Boeing is 1.6 times ("60%") faster in terms of throughput
• Concorde is 2.2 times ("120%") faster in terms of flying time
We will focus primarily on execution time for a single job
CS141-L2-16 Tarun Soni, Summer ‘03
Some grammar?

• "times faster than" (or "times as fast as") – a multiplicative factor relating the quantities
  – "X was 3 times faster than Y": speed(X) = 3 × speed(Y)
• "percent faster than" – implies an additive relationship
  – "X was 25% faster than Y": speed(X) = (1 + 25/100) × speed(Y)
• "percent slower than" – implies subtraction
  – "X was 5% slower than Y": speed(X) = (1 - 5/100) × speed(Y)
  – "100% slower" means it doesn't move at all!
• "times slower than" or "times as slow as" – is awkward
  – "X was 3 times slower than Y" means speed(X) = (1/3) × speed(Y)
CS141-L2-17 Tarun Soni, Summer ‘03
Avoid Linguistic Confusion

X is r times faster than Y:  speed(X) = r × speed(Y)
Y is r times slower than X:  speed(Y) = (1/r) × speed(X)

X is r times faster than Y, and Y is s times faster than Z:
  speed(X) = r × speed(Y) = rs × speed(Z), so X is rs times faster than Z.
(You cannot chain "%" numbers this way!)

Easiest way to avoid confusion:
- Convert "% faster" to "times faster", then do the calculation and convert back if needed.
- Example: change "25% faster" to "5/4 times faster".
CS141-L2-18 Tarun Soni, Summer ‘03
Which time anyway?

• User CPU time? (time the CPU spends running your code)
• Total CPU time (user + kernel)? (includes operating-system code)
• Wallclock time? (total elapsed time)
  – includes time spent waiting for I/O, other users, ...
• The answer depends...
  – For measuring processor speed, we can use total CPU time.
  – If there is no I/O or interrupts, wallclock may be better:
    • more precise (microseconds rather than 1/100 sec)
    • can measure individual sections of code

> time foo
... foo's results ...
90.7u 12.9s 2:39 65%
>

(90.7u 12.9s is user + kernel time; 2:39 is wallclock.)
CS141-L2-19 Tarun Soni, Summer ‘03
Metrics of Performance
Each level of the system stack has its own natural metric:
- Application: answers per month, useful operations per second
- Programming language / compiler / ISA: (millions of) instructions per second – MIPS; (millions of) floating-point operations per second – MFLOP/s
- Datapath & control: megabytes per second
- Function units, transistors, wires, pins: cycles per second (clock rate)

Each metric has a place and a purpose, and each can be misused.
CS141-L2-20 Tarun Soni, Summer ‘03
Levels of benchmarking
• Actual target workload
  + representative
  – very specific, non-portable, difficult to run or measure, hard to identify cause
• Full application benchmarks
  + portable, widely used, improvements useful in reality
  – less representative
• Small "kernel" benchmarks
  + easy to run, early in the design cycle
  – easy to "fool"
• Microbenchmarks
  + identify peak capability and potential bottlenecks
  – "peak" may be a long way from application performance
CS141-L2-21 Tarun Soni, Summer ‘03
Cycle Time
• Instead of reporting execution time in seconds, we often use cycles
• Clock “ticks” indicate when to start activities (one abstraction):
• cycle time = time between ticks = seconds per cycle
• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)
    seconds/program = (cycles/program) × (seconds/cycle)

A 200 MHz clock has a cycle time of:

    1 / (200 × 10^6) seconds = 5 × 10^-9 seconds = 5 nanoseconds
CS141-L2-22 Tarun Soni, Summer ‘03
Cycle Time
CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

(seconds = instructions × cycles/instruction × seconds/cycle)

• Improve performance => reduce execution time
  – reduce instruction count (programmer, compiler)
  – reduce cycles per instruction (ISA, machine designer)
  – reduce clock cycle time (hardware designer, physicist)
CS141-L2-23 Tarun Soni, Summer ‘03
Performance Variation
CPU Execution Time = Instruction Count × CPI × Clock Cycle Time

Case                                          Number of instructions  CPI        Clock cycle time
Same machine, different programs              different               similar    same
Same programs, different machines, same ISA   same                    different  different
Same programs, different machines             somewhat different      different  different
CS141-L2-24 Tarun Soni, Summer ‘03
Amdahl’s Law
Execution Time After Improvement =
Execution Time Unaffected +
( Execution Time Affected / Amount of Improvement )
• Example:
"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
How about making it 5 times faster?
• Principle: Make the common case fast
CS141-L2-25 Tarun Soni, Summer ‘03
MIPS, MFLOPS etc.
• MIPS – million instructions per second

    MIPS = (number of instructions executed in program) / (execution time in seconds × 10^6)
         = clock rate / (CPI × 10^6)

• MFLOPS – million floating-point operations per second

    MFLOPS = (number of floating-point operations executed in program) / (execution time in seconds × 10^6)

• program-independent? • deceptive!
CS141-L2-26 Tarun Soni, Summer ‘03
Example RISC Processor
Base Machine (Reg / Reg), typical mix:

Op      Freq   Cycles   CPI(i)   % Time
ALU     50%    1        0.5      23%
Load    20%    5        1.0      45%
Store   10%    3        0.3      14%
Branch  20%    2        0.4      18%
Total CPI:              2.2

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
How does this compare with using branch prediction to shave a cycle off the branch time?
What if two ALU instructions could be executed at once?
CS141-L2-27 Tarun Soni, Summer ‘03
SPEC
Which Programs?
• peak throughput measures (simple programs)
• synthetic benchmarks (Whetstone, Dhrystone, ...)
• real applications
• SPEC (best of both worlds, but with problems of their own)
  – System Performance Evaluation Cooperative
  – provides a common set of real applications along with strict guidelines for how to run them
  – provides a relatively unbiased means to compare machines
CS141-L2-28 Tarun Soni, Summer ‘03
SPEC89
• Compiler “enhancements” and performance
[Graph: SPEC performance ratios (0-800) for the SPEC89 benchmarks tomcatv, fpppp, matrix300, eqntott, li, nasa7, doduc, spice, espresso, and gcc, comparing the base compiler against an "enhanced" compiler.]
CS141-L2-29 Tarun Soni, Summer ‘03
SPEC CPU2000 Suite

• SPECint2000
  • gzip and bzip2 – compression
  • gcc – compiler; 205K lines of messy code!
  • crafty – chess program
  • parser – word processing
  • vortex – object-oriented database
  • perlbmk – PERL interpreter
  • eon – computer visualization
  • vpr, twolf – CAD tools for VLSI
  • mcf, gap – "combinatorial" programs
• SPECfp2000 – 10 Fortran, 3 C programs
  – scientific application programs (physics, chemistry, image processing, number theory, ...)
CS141-L2-30 Tarun Soni, Summer ‘03
Performance is always misleading
• Performance is specific to a particular program or programs
  – total execution time is a consistent summary of performance
• For a given architecture, performance increases come from:
  – increases in clock rate (without adverse CPI effects)
  – improvements in processor organization that lower CPI
  – compiler enhancements that lower CPI and/or instruction count
• Pitfall: expecting improvement in one aspect of a machine's performance to affect the total performance
• You should not always believe everything you read! Read carefully!
CS141-L2-31 Tarun Soni, Summer ‘03
Computer Arithmetic
What do all those bits mean now?

bits (011011011100010...01) can be:
• instructions – R-format, I-format, ...
• data – numbers, text, chars, ...
  – integers: signed or unsigned
  – floating point: single precision or double precision
CS141-L2-32 Tarun Soni, Summer ‘03
Computer Arithmetic
• How do you represent
  – negative numbers?
  – fractions?
  – really large numbers?
  – really small numbers?
• How do you
  – do arithmetic?
  – identify errors (e.g. overflow)?
• What is an ALU and what does it look like?
  – ALU = arithmetic logic unit
CS141-L2-33 Tarun Soni, Summer ‘03
Big Endian vs. Little Endian
[Diagram: a 32-bit word (bit 31 down to bit 0, least-significant bit on the right) stored across byte addresses 0, 1, 2, 3, each byte holding 8 bits of data.]

• Big Endian (IBM, Motorola, HP, Sun): most-significant byte at the lowest address
• Little Endian (DEC, Intel): least-significant byte at the lowest address

Some processors (e.g. PowerPC) provide both
  – if you can figure out how to switch modes, or get the compiler to issue "byte-reversed loads and stores"
CS141-L2-34 Tarun Soni, Summer ‘03
Binary Numbers: An Introduction
Consider a 4-bit binary number:

Decimal  Binary     Decimal  Binary
0        0000       4        0100
1        0001       5        0101
2        0010       6        0110
3        0011       7        0111

Examples of binary arithmetic:

  3 + 2 = 5          3 + 3 = 6

    carry: 1           carries: 11
      0011               0011
    + 0010             + 0011
    ------             ------
      0101               0110
CS141-L2-35 Tarun Soni, Summer ‘03
Negative Numbers: Some options
• We would like a number system that provides
  – an obvious representation of 0, 1, 2, ...
  – uses the adder for addition
  – a single value of 0
  – equal coverage of positive and negative numbers
  – easy detection of sign
  – easy negation
• Sign magnitude – MSB is the sign bit, rest the same
  -1 == 1001
  -5 == 1101
• One's complement – flip all bits to negate
  -1 == 1111
  -5 == 1010
CS141-L2-36 Tarun Soni, Summer ‘03
Negative Numbers: two’s complement
– Positive numbers: normal binary representation
– Negative numbers: flip the bits (0 ↔ 1), then add 1

Decimal  Binary     Decimal  Binary
-8       1000       0        0000
-7       1001       1        0001
-6       1010       2        0010
-5       1011       3        0011
-4       1100       4        0100
-3       1101       5        0101
-2       1110       6        0110
-1       1111       7        0111

Smallest 4-bit number: -8. Biggest 4-bit number: 7.
CS141-L2-37 Tarun Soni, Summer ‘03
Two’s complement arithmetic
• Examples: 7 - 6 = 7 + (-6) = 1;  3 - 5 = 3 + (-5) = -2

Decimal  Binary     Decimal  Binary
0        0000       -1       1111
1        0001       -2       1110
2        0010       -3       1101
3        0011       -4       1100
4        0100       -5       1011
5        0101       -6       1010
6        0110       -7       1001
7        0111       -8       1000

      0111    7          0011    3
    + 1010   -6        + 1011   -5
    ------             ------
      0001    1          1110   -2

(In 7 + (-6), the carry out of the top bit is simply discarded.)
Uses simple adder for + and - numbers
CS141-L2-38 Tarun Soni, Summer ‘03
Things to keep in mind

• Negation: flip the bits and add 1 (works for + and -)
  – might cause overflow
• Extend the sign when loading into a larger register
  – +3 => 0011, 00000011, 0000000000000011
  – -3 => 1101, 11111101, 1111111111111101
• Overflow detection
  – need to raise an "exception" when the answer can't be represented:

      0101    5
    + 0110    6
    ------
      1011   -5 ??!!!
CS141-L2-39 Tarun Soni, Summer ‘03
Overflow detection again
Overflow:                      No overflow:

      0111    7                    0010    2
    + 0011    3                  + 0011    3
    ------                       ------
      1010   -6 ?!                 0101    5

      1100   -4                    1100   -4
    + 1011   -5                  + 1110   -2
    ------                       ------
      0111    7 ?!                 1010   -6
So how do we detect overflow?
Carry into the MSB != carry out of the MSB.
CS141-L2-40 Tarun Soni, Summer ‘03
Execution: the heart of it all
[Diagram: an ALU with two 32-bit inputs a and b, an operation control input, and a 32-bit result output.]

Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction
CS141-L2-41 Tarun Soni, Summer ‘03
A Basic ALU
• ALU Control Lines (ALUop)   Function
  – 000                        And
  – 001                        Or
  – 010                        Add
  – 110                        Subtract
  – 111                        Set-on-less-than

[Diagram: an N-bit ALU with inputs A and B, a 3-bit ALUop control, and outputs Result, Zero, Overflow, and CarryOut.]
General idea: Build for 1-bit numbers and then extend for n-bits!
CS141-L2-42 Tarun Soni, Summer ‘03
Some basics of digital logic
1. AND gate: c = a · b

   a b | c
   0 0 | 0
   0 1 | 0
   1 0 | 0
   1 1 | 1

2. OR gate: c = a + b

   a b | c
   0 0 | 0
   0 1 | 1
   1 0 | 1
   1 1 | 1

3. Inverter: c = !a

   a | c
   0 | 1
   1 | 0

4. Multiplexor: if (d == 0) c = a; else c = b
CS141-L2-43 Tarun Soni, Summer ‘03
1-bit ALU
[Diagram: (left) a 1-bit ALU in which a and b feed an AND gate and an OR gate, with a multiplexor driven by Operation selecting the Result. (right) the same ALU extended with a 1-bit adder having CarryIn and CarryOut, the multiplexor now selecting among And, Or, and Add.]

• ALU Control Lines (ALUop)   Function
  – 000   And
  – 001   Or
  – 010   Add
But how do we make the adder?
CS141-L2-44 Tarun Soni, Summer ‘03
1-bit Full Adder
• This is also called a (3, 2) adder
• Half adder: no CarryIn nor CarryOut
• Truth table:

  A  B  CarryIn | CarryOut  Sum | Comments
  0  0  0       | 0         0   | 0 + 0 + 0 = 00
  0  0  1       | 0         1   | 0 + 0 + 1 = 01
  0  1  0       | 0         1   | 0 + 1 + 0 = 01
  0  1  1       | 1         0   | 0 + 1 + 1 = 10
  1  0  0       | 0         1   | 1 + 0 + 0 = 01
  1  0  1       | 1         0   | 1 + 0 + 1 = 10
  1  1  0       | 1         0   | 1 + 1 + 0 = 10
  1  1  1       | 1         1   | 1 + 1 + 1 = 11

[Diagram: a 1-bit full adder block with inputs A, B, CarryIn and outputs Sum and CarryOut.]
CS141-L2-45 Tarun Soni, Summer ‘03
1-bit Full Adder: CarryOut
CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)

which simplifies to:

CarryOut = (B & CarryIn) | (A & CarryIn) | (A & B)

(Read off the CarryOut column of the truth table above.)

[Diagram: gate-level implementation of CarryOut from a, b, and CarryIn.]
CS141-L2-46 Tarun Soni, Summer ‘03
1-bit Full Adder: Sum
(Read off the Sum column of the same truth table.)
Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
CS141-L2-47 Tarun Soni, Summer ‘03
32-bit ALU
[Diagram: the 1-bit ALU (And/Or/Add, with CarryIn and CarryOut) replicated 32 times; the CarryOut of ALU0 feeds the CarryIn of ALU1, and so on up through ALU31, producing Result0..Result31.]

• (ALUop) Function: And, Or, Add

What about other operations?

  sub $s1, $s2, $s3   ; $s1 = $s2 - $s3                          (subtraction)
  slt $s1, $s2, $s3   ; if ($s2 < $s3) {$s1 = 1} else {$s1 = 0}  (set on less than, SLT)
CS141-L2-48 Tarun Soni, Summer ‘03
32-bit ALU
• Keep in mind the following:
  – (A - B) is the same as A + (-B)
  – 2's complement negate: take the inverse of every bit and add 1
• The bit-wise inverse of B is !B:
  – A - B = A + (-B) = A + (!B + 1) = A + !B + 1

[Diagram: the 1-bit ALU with a Binvert control; a multiplexor selects either b or !b as the adder's second input. Binvert provides the negation.]

What about the '+1'?
CS141-L2-49 Tarun Soni, Summer ‘03
32-bit ALU
[Diagram: the 32-bit ALU with a single Bnegate control wired to every Binvert input and to the CarryIn of ALU0; each bit has a Less input, ALU31 produces the Set output, and the ALU also produces Zero and Overflow.]
Setting CarryIn[0] = 1 provides the ‘+1’ for the 32-bit adder.
CS141-L2-50 Tarun Soni, Summer ‘03
32-bit ALU: slt
• The slt instruction:
  – if (a < b) result = 1, else result = 0
  – equivalently: if (a - b < 0) result = 1, else result = 0
• So: do a subtract and use the sign bit
  – route it to bit 0 of the result
  – all other result bits are zero

[Diagram: (a) a 1-bit ALU with a Less input as multiplexor input 3; (b) the MSB version, which also produces the Set output (the adder's sign bit) and contains the overflow-detection logic.]
CS141-L2-51 Tarun Soni, Summer ‘03
32-bit ALU: Special conditions
Overflow Detection Logic
• Carry into the MSB != carry out of the MSB
  – for an N-bit ALU: Overflow = CarryIn[N - 1] XOR CarryOut[N - 1]

[Diagram: four 1-bit ALUs in a ripple chain; CarryIn3 and CarryOut3 feed an XOR gate whose output is Overflow.]

  X Y | X XOR Y
  0 0 | 0
  0 1 | 1
  1 0 | 1
  1 1 | 0
CS141-L2-52 Tarun Soni, Summer ‘03
32-bit ALU: Special conditions
[Diagram repeated: the 1-bit ALU (a) and its MSB variant (b) from the slt slide.]

• Thus the MSB block has special logic to generate:
  • the Set line (sign bit)
  • the Overflow line
CS141-L2-53 Tarun Soni, Summer ‘03
32-bit ALU: Special conditions
Zero Detection Logic
• The zero-detection logic is just one big NOR gate
  – any non-zero result bit forces the NOR gate's output to zero

[Diagram: Result0..Result3 of a 4-bit ALU feed a single NOR gate whose output is Zero.]
CS141-L2-54 Tarun Soni, Summer ‘03
32 bit ALU
[Diagram: the complete 32-bit ALU again, with Bnegate, per-bit Less inputs, Set from ALU31, Zero, and Overflow.]

• Notice the control lines:
  000 = and
  001 = or
  010 = add
  110 = subtract
  111 = slt
• Zero is 1 exactly when the result is zero!

But what about performance?
CS141-L2-55 Tarun Soni, Summer ‘03
32 bit ALU
• We can build an ALU to support the MIPS instruction set
– key idea: use multiplexor to select the output we want
– we can efficiently perform subtraction using two’s complement
– we can replicate a 1-bit ALU to produce a 32-bit ALU
• Important points about hardware
– all of the gates are always working
– the speed of a gate is affected by the number of inputs to the gate
– the speed of a circuit is affected by the number of gates in series (on the "critical path" or the "deepest level of logic")
• Our primary focus is comprehension; however,
  – clever changes to organization can improve performance (similar to using better algorithms in software)
  – we'll look at two examples, for addition and multiplication
CS141-L2-56 Tarun Soni, Summer ‘03
Performance: Ideal (CS) versus Reality (EE)
• When input 0 -> 1, output 1 -> 0 but NOT instantly– Output goes 1 -> 0: output voltage goes from Vdd (5v) to 0v
• When input 1 -> 0, output 0 -> 1 but NOT instantly– Output goes 0 -> 1: output voltage goes from 0v to Vdd (5v)
• Voltage does not like to change instantaneously
[Diagram: an inverter (In → Out) with voltage-vs-time waveforms: Vin swings between Vdd (1) and GND (0), and Vout follows with a finite transition time.]
CS141-L2-57 Tarun Soni, Summer ‘03
Performance
Series Connection
• Total Propagation Delay = Sum of individual delays = d1 + d2
• Capacitance C1 has two components:
– Capacitance of the wire connecting the two gates
– Input capacitance of the second inverter
[Diagram: two inverters G1 and G2 in series; node V1 drives capacitance C1; the waveforms of Vin, V1, and Vout show the delays d1 and d2, measured at Vdd/2.]
CS141-L2-58 Tarun Soni, Summer ‘03
Performance: Calculating Delays
• Sum delays along serial paths
• Delay(Vin -> V2) != Delay(Vin -> V3)
  – Delay(Vin -> V2) = Delay(Vin -> V1) + Delay(V1 -> V2)
  – Delay(Vin -> V3) = Delay(Vin -> V1) + Delay(V1 -> V3)
• Critical path = the longest among the N parallel paths
• C1 = wire C + Cin of gate 2 + Cin of gate 3

[Diagram: G1 drives both G2 (output V2) and G3 (output V3); node V1 sees capacitance C1.]
CS141-L2-59 Tarun Soni, Summer ‘03
Performance: Storage elements
• Setup Time: Input must be stable BEFORE the trigger clock edge
• Hold Time: Input must REMAIN stable after the trigger clock edge
• Clock-to-Q time:
– Output cannot change instantaneously at the trigger clock edge
– Similar to delay in logic gates, two components:
• Internal Clock-to-Q
• Load dependent Clock-to-Q
[Timing diagram: D is "don't care" except around the triggering clock edge; it must be stable for the setup time before the edge and the hold time after it, and Q is unknown until the clock-to-Q delay has elapsed.]

• Storage element here: a negative-edge-triggered D flip-flop
CS141-L2-60 Tarun Soni, Summer ‘03
Performance: Synchronous logic
• All storage elements are clocked by the same clock edge
• The combinational logic block's:
  – inputs are updated at each clock tick
  – outputs MUST all be stable before the next clock tick

[Diagram: a register bank feeds a combinational logic block, which feeds another register bank; all registers share the same Clk.]
CS141-L2-61 Tarun Soni, Summer ‘03
Performance: Critical Path
• Critical path: the slowest path between any two storage devices
• Cycle time is a function of the critical path; it must be greater than:
  – Clock-to-Q + longest path through the combinational logic + Setup

[Diagram: the same register → logic → register structure, with the critical path highlighted.]
CS141-L2-62 Tarun Soni, Summer ‘03
Clock Skew
• The worst-case scenario for cycle-time consideration:
  – the input register sees CLK1
  – the output register sees CLK2
• Cycle Time >= CLK-to-Q + Longest Delay + Setup + Clock Skew

[Diagram: Clk1 and Clk2 offset by the clock skew; the input register is clocked by Clk1 and the output register by Clk2.]
CS141-L2-63 Tarun Soni, Summer ‘03
Cycle Time: Thumb rules
• Reduce the number of gate levels
° Pay attention to loading:
  ° one gate driving many gates is a bad idea
  ° avoid using a small gate to drive a long wire
  ° use multiple stages to drive a large load

[Diagram: a gate network for inputs A, B, C, D restructured into fewer logic levels; a large load Clarge driven through staged INV4x buffers.]
CS141-L2-64 Tarun Soni, Summer ‘03
Back to ALUs
[Diagram: the 32-bit ripple-carry ALU again, with the carry chain running from ALU0 through ALU31.]

• The adder we just built is called a "Ripple Carry Adder"
  – the carry bit may have to propagate from LSB to MSB
  – worst-case delay for an N-bit ripple-carry adder: 2N gate delays
• E.g. (back-of-the-envelope approximation):
  • single gate delay = 0.02 ns (inverter "speed" of 50 GHz)
  • 32-bit adder => 64 gate delays => 1.28 ns delay => maximum clock of about 781 MHz
CS141-L2-65 Tarun Soni, Summer ‘03
Ripple Carry Adders
• Is there more than one way to do addition?
  – two extremes: ripple carry and sum-of-products
Can you see the ripple? How could you get rid of it?
c1 = b0c0 + a0c0 + a0b0
c2 = b1c1 + a1c1 + a1b1
   = b1(b0c0 + a0c0 + a0b0) + a1(b0c0 + a0c0 + a0b0) + a1b1
c3 = b2c2 + a2c2 + a2b2 = ...
c4 = b3c3 + a3c3 + a3b3 = ...
Not feasible! Why?
CS141-L2-66 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
• An approach in between our two extremes
• Motivation:
  – if we didn't know the value of carry-in, what could we do?
  – when would we always generate a carry?   gi = ai · bi
  – when would we propagate the carry?       pi = ai + bi

(Generate carry: CarryOut = 1, independent of CarryIn. Propagate carry: CarryOut = CarryIn. Both can be read off the full-adder truth table above.)

c1 = g0 + p0c0
c2 = g1 + p1c1
c3 = g2 + p2c2
c4 = g3 + p3c3
CS141-L2-67 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Propagate and Generate machinery.
Worst-case delay: one gate.
CS141-L2-68 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Generation of the CarryOut.
The delay (and size) still does grow with number of bits.
CS141-L2-69 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
The Generation of the Result.
Sum_i = P_i XOR C_(i-1), with p_i = a_i + b_i
CS141-L2-70 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
• It is very expensive to build a "full" carry-lookahead adder
  – just imagine the length of the equation for Cin31
• Common practice:
  – connect several N-bit lookahead adders to form a bigger adder
  – example: connect four 8-bit carry-lookahead adders to form a 32-bit partial carry-lookahead adder
[Diagram: four 8-bit carry-lookahead adders, chained through carries C0, C8, C16, C24, add A[31:0] and B[31:0] to produce Result[31:0], 8 bits per block.]
CS141-L2-71 Tarun Soni, Summer ‘03
Carry Look Ahead Adders
[Diagram: four 4-bit ALUs each export block propagate/generate signals (P0G0 .. P3G3) to a carry-lookahead unit, which computes C1..C4 directly; the blocks produce Result0-3, Result4-7, Result8-11, and Result12-15.]

• Can't build a 16-bit adder this way... (too big)
• Could use ripple carry of 4-bit CLA adders
• Better: use the CLA principle again!
CS141-L2-72 Tarun Soni, Summer ‘03
What did we cover today?
• Last pieces of the ISA material
• Performance: how to quantify it
• Binary representation: integers, positive and negative
• Basic ALU design
  • 1-bit addition
  • handling the carry
  • carry lookahead
  • subtraction
  • set on less than
  • condition codes such as overflow, zero
• Performance: cycle time, number of gates, etc.

Next class:
• Multiplication, division, floating-point numbers
• Rest of Chapter 4 from the text

Remember: quizzes are surprises, and based on the homeworks.