ECE 563 Advanced Computer Architecture - Rutgers...
Transcript of ECE 563 Advanced Computer Architecture - Rutgers...
563 L02.1 Fall 2010
ECE 563Advanced Computer Architecture
Fall 2010
Lecture 2: Review of Metrics and Pipelining
563 L02.2 Fall 2010
Review from Last Time Computer Architecture >> instruction sets Computer Architecture skill sets are different
5 Quantitative principles of design Quantitative approach to design Solid interfaces that really work Technology tracking and anticipation
563 L02.3 Fall 2010
Review (continued) Other fields often borrow ideas from architecture Quantitative Principles of Design
1. Take Advantage of Parallelism2. Principle of Locality3. Focus on the Common Case4. Amdahl’s Law5. The Processor Performance Equation
Careful, quantitative comparisons Define, quantity, and summarize relative performance Define and quantify relative cost Define and quantify dependability Define and quantify power
Culture of anticipating and exploiting advances in technology Culture of well-defined interfaces that are carefully
implemented and thoroughly checked
563 L02.4 Fall 2010
Metrics used to Compare Designs Cost
Die cost and system cost
Execution Time average and worst-case Latency vs. Throughput
Energy and Power Also peak power and peak switching current
Reliability Resiliency to electrical noise, part failure Robustness to bad software, operator error
Maintainability System administration costs
Compatibility Software costs dominate
563 L02.5 Fall 2010
Cost of Processor Design cost (Non-recurring Engineering Costs, NRE)
dominated by engineer-years (~$200K per engineer year) also mask costs (exceeding $1M per spin)
Cost of die die area die yield (maturity of manufacturing process, redundancy features) cost/size of wafers die cost ~= f(die area^4) with no redundancy
Cost of packaging number of pins (signal + power/ground pins) power dissipation
Cost of testing built-in test features? logical complexity of design choice of circuits (minimum clock rates, leakage currents, I/O
drivers)
Architect affects all of these
563 L02.6 Fall 2010
System-Level Cost Impacts Power supply and cooling
Support chipset
Off-chip SRAM/DRAM/ROM
Off-chip peripherals
563 L02.7 Fall 2010
What is Performance?
Latency (or response time or execution time) time to complete one task
Bandwidth (or throughput) tasks completed per unit time
What does pipelining improve?
563 L02.8 Fall 2010
Performance(X) Execution_time(Y)
n = =
Performance(Y) Execution_time(X)
Definition: Performance Performance is in units of things per sec
bigger is betterCan you think of a lower-is-better metric?
If we are primarily concerned with response timeperformance(x) = 1
execution_time(x)
" X is n times faster than Y" means
563 L02.9 Fall 2010
Performance Guarantees
A
Average Rate: A > B > C
Worst-case Rate: A < B < C
BC
Execution Rate
Inputs
563 L02.10 Fall 2010
Types of Benchmark
Synthetic Benchmarks Fake programs invented to try to match the profile and
behavior of real applications, e.g., Dhrystone, Whetstone
Toy Programs 100-line programs from beginning programming
assignments, e.g., Nqueens, quicksort, Towers of Hanoi
Kernels small, key pieces of real applications, e.g., matrix multiply,
FFT, sorting, Livermore Loops, Linpack
Simplified Applications Extract main computational skeleton of real application to
simplify porting, e.g., NAS parallel benchmarks, TPC
Real Applications Things people actually use their computers for, e.g., car
crash simulations, relational databases, Photoshop, Quake
563 L02.11 Fall 2010
Performance: What to measure Usually rely on benchmarks vs. real workloads
To increase predictability, collections of benchmark applications-- benchmark suites -- are popular
SPECCPU: popular desktop benchmark suite CPU only, split between integer and floating point programs SPECint2000 has 12 integer, SPECfp2000 has 14 integer pgms SPECCPU2006 to be announced Spring 2006 SPECSFS (NFS file server) and SPECWeb (WebServer) added as
server benchmarks
Transaction Processing Council measures server performance and cost-performance for databases TPC-C Complex query for Online Transaction Processing TPC-H models ad hoc decision support TPC-W a transactional web benchmark TPC-App application server and web services benchmark
563 L02.12 Fall 2010
Summarizing Performance
Which system is faster?
System Rate (Task 1) Rate (Task 2)
A 10 20
B 20 10
563 L02.13 Fall 2010
… depends who’s sellingSystem Rate (Task 1) Rate (Task 2)
A 10 20B 20 10
Average1515
Average throughput
System Rate (Task 1) Rate (Task 2)A 0.50 2.00B 1.00 1.00
Average1.251.00
Throughput relative to B
System Rate (Task 1) Rate (Task 2)A 1.00 1.00B 2.00 0.50
Average1.001.25
Throughput relative to A
563 L02.14 Fall 2010
Summarizing Performance over Set of Benchmark Programs
Arithmetic mean of execution times ti (in seconds)
1/n i ti
Harmonic mean of execution rates ri (MIPS/MFLOPS)
n/ [i (1/ri)] Both equivalent to workload where each program
is run the same number of times
Can add weighting factors to model other workload distributions
563 L02.15 Fall 2010
Normalized Execution Time
Measure speedup relative to reference machine
ratio = tRef / tA
563 L02.16 Fall 2010
How to Mislead with Performance ReportsSelect pieces of workload that work well on your design, ignore othersUse unrealistic data set sizes for application (too big or too small)Report throughput numbers for a latency benchmarkReport latency numbers for a throughput benchmarkReport performance on a kernel and claim it represents an entire
applicationUse 16-bit fixed-point arithmetic (because it’s fastest on your system)
even though application requires 64-bit floating-point arithmeticUse a less efficient algorithm on the competing machineReport speedup for an inefficient algorithm (bubblesort)Compare hand-optimized assembly code with unoptimized C codeCompare your design using next year’s technology against
competitor’s year old design (1% performance improvement per week)Ignore the relative cost of the systems being comparedReport averages and not individual resultsReport speedup over unspecified base system, not absolute timesReport efficiency not absolute timesReport MFLOPS not absolute times (use inefficient algorithm)[ David Bailey “Twelve ways to fool the masses when giving performance
results for parallel supercomputers” ]
563 L02.17 Fall 2010
Benchmarking for Future Machines Variance in performance for parallel architectures
is going to be much worse than for serial processors SPECcpu means only really work across very similar
machine configurations
What is a good benchmarking methodology?
563 L02.18 Fall 2010
Why Power Matters
Packaging costs
Power supply rail design
Chip and system cooling costs
Noise immunity and system reliability
Battery life (in portable systems)
Environmental concerns Office equipment accounted for 5% of total US commercial
energy usage in 1993 Energy Star compliant systems
563 L02.19 Fall 2010
Power and Energy Figures of Merit
Power consumption in Watts determines battery life in hours Rate at which energy is delivered
Peak power determines power ground wiring designs sets packaging limits impacts signal noise margin and reliability analysis
Energy efficiency in Joules rate at which power is consumed over time
Energy = power * delay Joules = Watts * seconds lower energy number means less power to perform a
computation at the same frequency
563 L02.20 Fall 2010
Power versus Energy
Watts
time
Power is height of curve
Watts
time
Approach 1
Approach 2
Approach 2
Approach 1
Energy is area under curve
Lower power design could simply be slower
Two approaches require the same energy
563 L02.21 Fall 2010
Peak Power versus Lower Energy
System A has higher peak power, but lower total energy
System B has lower peak power, but higher total energy
Power
Time
Peak A
Peak BIntegrate power
curve to get energy
563 L02.22 Fall 2010
A "Typical" RISC ISA
32-bit fixed format instruction (3 formats)
32 32-bit GPR (R0 contains zero)
3-address, reg-reg arithmetic instruction
Single address mode for load/store: base + displacement no indirection
Simple branch conditions
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
563 L02.23 Fall 2010
Example: MIPS
Op31 26 01516202125
Rs1 Rd immediate
Op31 26 025
Op31 26 01516202125
Rs1 Rs2
target
Rd Opx
Register-Register561011
Register-Immediate
Op31 26 01516202125
Rs1 Rs2 immediate
Branch
Jump / Call
563 L02.24 Fall 2010
5 Steps of MIPS DatapathFigure A.3, Page A-9
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
Address
RS1
RS2
Imm
MU
X
IR <= mem[PC];
PC <= PC + 4
A <= Reg[IRrs];
B <= Reg[IRrt]
rslt <= A opIRop B Reg[IRrd] <= WB
WB <= rslt
563 L02.25 Fall 2010
Inst. Set Processor Controller
IR <= mem[PC];
PC <= PC + 4
A <= Reg[IRrs];
B <= Reg[IRrt]
r <= A opIRop B
Reg[IRrd] <= WB
WB <= r
Ifetch
opFetch-DCD
PC <= IRjaddrif bop(A,b)
PC <= PC+IRim
br jmpRR
r <= A opIRop IRim
Reg[IRrd] <= WB
WB <= r
RIr <= A + IRim
WB <= Mem[r]
Reg[IRrd] <= WB
LD
563 L02.26 Fall 2010
5 Steps of MIPS DatapathFigure A.3, Page A-9
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
• Data stationary control– local decode for each instruction phase / pipeline stage
Next PC
Address
RS1
RS2
Imm
MU
X
563 L02.27 Fall 2010
Visualizing PipeliningFigure A.2, Page A-8
Instr.
Order
Time (clock cycles)
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
563 L02.28 Fall 2010
Pipelining is not quite that easy!
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of
instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior
instruction still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches and jumps).
563 L02.29 Fall 2010
One Memory Port/Structural HazardsFigure A.4, Page A-14
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg ALU DMemIfetch Reg
563 L02.30 Fall 2010
One Memory Port/Structural Hazards(Similar to Figure A.5, Page A-15)
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg ALU DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
How do you “bubble” the pipe?
563 L02.31 Fall 2010
Speed Up Equation for Pipelining
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline CPI Ideal
depth Pipeline CPI Ideal Speedup
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline 1
depth Pipeline Speedup
Instper cycles Stall Average CPI Ideal CPIpipelined
For simple RISC pipeline, CPI = 1:
563 L02.32 Fall 2010
Example: Dual-port vs. Single-port
Machine A: Dual ported memory (“Harvard Architecture”)
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
Machine A is 1.33 times faster
563 L02.33 Fall 2010
Data Hazard on R1 Figure A.6, Page A-17
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Time (clock cycles)
IF ID/RF EX MEM WB
Among these 4 arrows, which are hazards and which are not?
563 L02.34 Fall 2010
Read After Write (RAW)InstrJ tries to read operand before InstrI writes it
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Three Generic Data Hazards
I: add r1,r2,r3J: sub r4,r1,r3
563 L02.35 Fall 2010
Write After Read (WAR)InstrJ writes operand before InstrI reads it
Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
Three Generic Data Hazards
563 L02.36 Fall 2010
Three Generic Data Hazards
Write After Write (WAW)InstrJ writes operand before InstrI writes it.
Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5
Will see WAR and WAW in more complicated pipes
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
563 L02.37 Fall 2010
Time (clock cycles)
Forwarding to Avoid Data HazardFigure A.7, Page A-19
Inst
r.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
563 L02.38 Fall 2010
HW Change for ForwardingFigure A.23, Page A-37
MEM
/WR
ID/EX
EX/M
EM
DataMemory
ALU
mux
mux
Registers
NextPC
Immediate
mux
What circuit detects and resolves this hazard?
ALU output at the end of EX
ALU output at the end of MEM
memory output at the end of MEM
563 L02.39 Fall 2010
Time (clock cycles)
Forwarding to Avoid LW-SW Data HazardFigure A.8, Page A-20
Inst
r.
Order
add r1,r2,r3
lw r4, 0(r1)
sw r4,12(r1)
or r8,r6,r9
xor r10,r9,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
563 L02.40 Fall 2010
Time (clock cycles)
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Data Hazard Even with ForwardingFigure A.9, Page A-21
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
563 L02.41 Fall 2010
Data Hazard Even with Forwarding(Similar to Figure A.10, Page A-21)
Time (clock cycles)
or r8,r1,r9
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg ALU DMemIfetch Reg
RegIfetch ALU DMem RegBubble
Ifetch ALU DMem RegBubble Reg
Ifetch ALU DMemBubble Reg
Is this the same bubble?
563 L02.42 Fall 2010
Try producing fast code fora = b + c;d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,fSUB Rd,Re,RfSW d,Rd
Software Scheduling to Avoid Load Hazards
Compiler optimizes for performance. Hardware checks for safety.
563 L02.43 Fall 2010
Control Hazard on BranchesThree Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
How to deal with the 3 instructions in between?
563 L02.44 Fall 2010
Branch Stall Impact
If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier
MIPS branch tests if register = 0 or 0
MIPS Solution: Move Zero test to ID/RF stage (use xor) Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3
563 L02.45 Fall 2010
Adder
IF/ID
Pipelined MIPS DatapathFigure A.24, page A-38
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
SignExtend
Zero?
MEM
/WB
EX/M
EM4
Adder
Next SEQ PC
RD RD RD WB
Dat
a
• Interplay of instruction set design and cycle time.
Next PC
Address
RS1
RS2
ImmM
UX
ID/EX
563 L02.46 Fall 2010
Four Branch Hazard Alternatives#1: Stall until branch direction is clear#2: Predict Branch Not Taken
Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS
- MIPS still incurs 1 cycle branch penalty- Other machines: branch target known before outcome
563 L02.47 Fall 2010
Four Branch Hazard Alternatives
#4: Delayed Branch Define branch to take place AFTER a following instruction
branch instructionsequential successor1sequential successor2........sequential successorn
branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this
Branch delay of length n
563 L02.48 Fall 2010
Scheduling Branch Delay Slots (Fig A.14)
A is the best choice, fills delay slot & reduces instruction count (IC)
In B, the sub instruction may need to be copied, increasing IC
add $1,$2,$3if $2=0 then
delay slot
A. From before branch B. From branch target C. From fall through
add $1,$2,$3if $1=0 thendelay slot
add $1,$2,$3if $1=0 then
delay slot
sub $4,$5,$6
sub $4,$5,$6
becomes becomes becomes
if $2=0 then
add $1,$2,$3add $1,$2,$3if $1=0 thensub $4,$5,$6
add $1,$2,$3if $1=0 then
or $7 $8 $9
or $7 $8 $9
563 L02.49 Fall 2010
Delayed Branch
Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots
useful in computation About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches Growth in available transistors has made dynamic
approaches relatively cheaper
563 L02.50 Fall 2010
Evaluating Branch Alternatives
Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken
Scheduling Branch CPI speedup v. speedup v.scheme penalty unpipelined stall
Stall pipeline 3 1.60 3.1 1.0
Predict taken 1 1.20 4.2 1.33
Predict not taken 1 1.14 4.4 1.40
Delayed branch 0.5 1.10 4.5 1.45
Pipeline speedup = Pipeline depth1 +Branch frequency Branch penalty
563 L02.51 Fall 2010
Problems with Pipelining Exception: An unusual event happens to an instruction
during its execution Examples: divide by zero, undefined opcode
Interrupt: Hardware signal to switch the processor to a new instruction stream Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting)
Problem: the exception or interrupt must appear between 2 instructions (Ii and Ii+1) The effect of all instructions up to and including Ii is totalling
complete No effect of any instruction after Ii can take place
The interrupt (exception) handler either aborts program or restarts at instruction Ii+1
563 L02.52 Fall 2010
Precise Exceptions In-Order Pipelines
Key observation: architected state only change in memory and register write stages.
563 L02.53 Fall 2010
Summary: Metrics and Pipelining
Machines compared over many metrics Cost, performance, power, reliability, compatibility, …
Difficult to compare widely differing machines on benchmark suite
Control VIA State Machines and Microprogramming Just overlap tasks; easy if tasks are independent Speed Up Pipeline Depth; if ideal CPI is 1, then:
Hazards limit performance on computers: Structural: need more HW resources Data (RAW,WAR,WAW): need forwarding, compiler scheduling Control: delayed branch, prediction
Exceptions, Interrupts add complexity Next time: Read Appendix C!
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline 1
depth Pipeline Speedup