Computer Organization Lecture Set – 06 Chapter 6 Huei-Yung Lin.
Date posted: 19-Dec-2015
Computer Organization
Lecture Set – 06
Chapter 6
Huei-Yung Lin
H.Y. Lin, CCUEE Computer Organization 2
Roadmap for the Term: Major Topics
Overview / Abstractions and Technology
Performance
Instruction sets
Logic & arithmetic
Processor Implementation
  Single-cycle implementation
  Multicycle implementation
  Pipelined implementation
Memory systems
Input/Output
Pipelining Outline
Introduction
  Defining Pipelining
  Pipelining Instructions
  Hazards
Pipelined Processor Design
  Datapath
  Control
Advanced Pipelining
  Superscalar
  Dynamic Pipelining
  Examples
What is Pipelining?
A way of speeding up execution of instructions
Key idea: overlap execution of multiple instructions
Analogy: doing your laundry
1. Run load through washer
2. Run load through dryer
3. Fold clothes
4. Put away clothes
5. Go to 1
Observation: we can start another load as soon as we finish step 1!
The Laundry Analogy
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers
If we do laundry sequentially...
[Figure: timeline from 6 PM to 2 AM, task order A, B, C, D; each load runs its four 30-minute steps before the next load starts]
Time Required: 8 hours for 4 loads
To Pipeline, We Overlap Tasks
[Figure: timeline starting at 6 PM, task order A, B, C, D; each load starts 30 minutes after the previous one]
Time Required: 3.5 hours for 4 loads
Latency remains 2 hours
Throughput improves by a factor of 2.3 (and approaches 4 as the number of loads grows)
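The laundry numbers can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up here, and it assumes every step takes exactly 30 minutes and that a new load can start as soon as the washer is free.

```python
STEP_MIN = 30   # each step takes 30 minutes (from the slide)
STEPS = 4       # wash, dry, fold, stash

def sequential_minutes(loads):
    # Every load runs all four steps before the next load starts.
    return loads * STEPS * STEP_MIN

def pipelined_minutes(loads):
    # The first load needs all four steps; each later load finishes one step later.
    return (STEPS + loads - 1) * STEP_MIN

def speedup(loads):
    return sequential_minutes(loads) / pipelined_minutes(loads)

for n in (4, 40, 400):
    print(n, sequential_minutes(n) / 60, pipelined_minutes(n) / 60, round(speedup(n), 2))
```

For 4 loads this gives 8 hours against 3.5 hours. Running it for larger load counts shows how the ratio changes as the pipeline stays full for longer.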
Pipelining a Digital System
Key idea: break big computation up into pieces
Separate each piece with a pipeline register
[Figure: a 1 ns block of logic split into five 200 ps pieces, with a pipeline register between pieces]
Pipelining a Digital System
Why do this? Because it's faster for repeated computations
Non-pipelined: 1 operation finishes every 1 ns
Pipelined: 1 operation finishes every 200 ps
Comments about pipelining
Pipelining increases throughput, but does not improve latency
  Answer available every 200 ps, BUT
  A single computation still takes 1 ns
Limitations:
  Computations must be divisible into stage size
  Pipeline registers add overhead
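A small sketch of the throughput/latency trade-off, assuming the logic divides into equal stages. The register overhead parameter is hypothetical (the slides give no figure for it) and defaults to zero, which reproduces the 200 ps / 1 ns numbers above.

```python
def pipeline_timing(total_ps, stages, reg_overhead_ps=0):
    """Return (clock_period_ps, latency_ps) for logic split into equal stages.

    reg_overhead_ps is an assumed per-stage cost of the pipeline register.
    """
    period = total_ps / stages + reg_overhead_ps
    latency = period * stages
    return period, latency

def time_for(n_ops, total_ps, stages, reg_overhead_ps=0):
    """Total time to finish n_ops back-to-back operations."""
    period, latency = pipeline_timing(total_ps, stages, reg_overhead_ps)
    # The first result appears after the full latency, then one per clock period.
    return latency + (n_ops - 1) * period

print(pipeline_timing(1000, 5))        # ideal 5-stage split of 1 ns of logic
print(pipeline_timing(1000, 5, 20))    # same split with an assumed 20 ps per register
print(time_for(100, 1000, 1), time_for(100, 1000, 5))
```

With any nonzero overhead the clock period shrinks by less than a factor of 5 and the latency of a single computation grows, which is the limitation the slide points at.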
Pipelining a Processor
Recall the 5 steps in instruction execution:
1. Instruction Fetch
2. Instruction Decode and Register Read
3. Execute operation or calculate address
4. Memory access
5. Write result into register
Review: Single-Cycle Processor
  All 5 steps done in a single clock cycle
  Dedicated hardware required for each step
What happens if we break execution into multiple cycles, but keep the extra hardware?
Review - Single-Cycle Processor
[Figure: single-cycle datapath (PC, instruction memory, register file, sign extend, ALU, data memory, adders and multiplexers) divided into five steps]
IF: Instruction Fetch
ID: Instruction Decode
EX: Execute / Address Calc.
MEM: Memory Access
WB: Write Back
Pipelining - Key Idea
Question: What happens if we break execution into multiple cycles, but keep the extra hardware?
Answer: In the best case, we can start executing a new instruction on each clock cycle. This is pipelining.
Pipelining stages:
  IF - Instruction Fetch
  ID - Instruction Decode
  EX - Execute / Address Calculation
  MEM - Memory Access (read / write)
  WB - Write Back (results into register file)
Basic Pipelined Processor
[Figure: the single-cycle datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages]
Single-Cycle vs. Pipelined Execution
[Figure: timing of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0), each passing through Instruction Fetch, REG RD, ALU, MEM, REG WR]
Non-pipelined: a new instruction starts every 800 ps
Pipelined: a new instruction starts every 200 ps
Comments about Pipelining
The good news
  Multiple instructions are being processed at the same time
  This works because stages are isolated by registers
  Best case speedup of N (for an N-stage pipeline)
The bad news
  Instructions interfere with each other – hazards
    Example: different instructions may need the same piece of hardware (e.g., memory) in the same clock cycle
    Example: an instruction may require a result produced by an earlier instruction that is not yet complete
  Worst case: must suspend execution – stall
Pipelined Example – Executing Multiple Instructions
Consider the following instruction sequence:
  lw $r0, 10($r1)
  sw $r3, 20($r4)
  add $r5, $r6, $r7
  sub $r8, $r9, $r10
Executing Multiple Instructions, Clock Cycles 1 to 8
[Figures: the pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB registers) drawn once per clock cycle, labeled with the instructions in flight]
Clock Cycle 1: LW
Clock Cycle 2: LW, SW
Clock Cycle 3: LW, SW, ADD
Clock Cycle 4: LW, SW, ADD, SUB
Clock Cycle 5: LW, SW, ADD, SUB
Clock Cycle 6: SW, ADD, SUB
Clock Cycle 7: ADD, SUB
Clock Cycle 8: SUB
Alternative View – Multicycle Diagram
                     CC 1  CC 2  CC 3  CC 4  CC 5  CC 6  CC 7  CC 8
lw  $r0, 10($r1)     IM    REG   ALU   DM    REG
sw  $r3, 20($r4)           IM    REG   ALU   DM    REG
add $r5, $r6, $r7                IM    REG   ALU   DM    REG
sub $r8, $r9, $r10                     IM    REG   ALU   DM    REG
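A diagram of this kind can be generated mechanically. The following is a minimal sketch, assuming one instruction enters the pipeline per clock cycle and no hazards occur; `schedule` and `render` are names invented for this example.

```python
STAGES = ["IM", "REG", "ALU", "DM", "REG"]  # stage labels as used in the diagram

def schedule(instructions):
    """Map each instruction to {clock_cycle: stage}, issuing one instruction per cycle."""
    table = []
    for i, instr in enumerate(instructions):
        # Instruction i (0-based) enters the first stage in clock cycle i + 1.
        table.append((instr, {i + 1 + s: name for s, name in enumerate(STAGES)}))
    return table

def render(table):
    """Format the schedule as a plain-text multicycle diagram."""
    last_cc = max(max(cycles) for _, cycles in table)
    width = max(len(instr) for instr, _ in table)
    lines = [" " * width + "".join(f"  CC{c:<2}" for c in range(1, last_cc + 1))]
    for instr, cycles in table:
        cells = "".join(f"  {cycles.get(c, ''):<4}" for c in range(1, last_cc + 1))
        lines.append(f"{instr:<{width}}{cells}".rstrip())
    return "\n".join(lines)

program = ["lw $r0, 10($r1)", "sw $r3, 20($r4)",
           "add $r5, $r6, $r7", "sub $r8, $r9, $r10"]
print(render(schedule(program)))
```

With four instructions and five stages the last instruction leaves the pipeline in clock cycle 8, matching the eight cycles traced on the preceding slides.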
Pipeline Hazards
Where one instruction cannot immediately follow another
Types of hazards
  Structural hazards – attempt to use same resource twice
  Control hazards – attempt to make decision before condition is evaluated
  Data hazards – attempt to use data before it is ready
Can always resolve hazards by waiting
Structural Hazards
Attempt to use same resource twice at same time
Example: Single Memory for instructions, data
  Accessed by IF stage
  Accessed at same time by MEM stage
Solutions
  Delay second access by one clock cycle, OR
  Provide separate memories for instructions, data
    This is what the book does
    This is called a “Harvard Architecture”
    Real pipelined processors have separate caches
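Example Structural Hazard – Single Memory
[Figure: four instructions, each starting one cycle after the previous, passing through IF ID EX MEM WB; the MEM stage of the first instruction and the IF stage of the fourth fall in the same cycle, marked "Memory Conflict"]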
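The conflict can be located by comparing which cycle each instruction spends in IF and in MEM. The sketch below assumes a five-stage pipeline with one instruction issued per cycle and a single memory shared by fetches and data accesses; the function name and its boolean input are illustrative only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(n_instructions, uses_data_memory):
    """Return (cycle, mem_instr, fetch_instr) for each clash on a single shared memory.

    uses_data_memory[i] is True when instruction i (0-based) is a load or store.
    """
    conflicts = []
    for i in range(n_instructions):
        if not uses_data_memory[i]:
            continue
        # Instruction i is in IF during cycle i + 1, so it reaches MEM in cycle i + 4.
        mem_cycle = i + 1 + STAGES.index("MEM")
        # The instruction being fetched in that same cycle has index mem_cycle - 1.
        fetching = mem_cycle - 1
        if fetching < n_instructions:
            conflicts.append((mem_cycle, i, fetching))
    return conflicts

# A load followed by three instructions that do not touch data memory.
print(memory_conflicts(4, [True, False, False, False]))
```

With separate instruction and data memories the two accesses go to different resources, so this check would no longer describe a hazard.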
Control Hazards
Attempt to make a decision before condition is evaluated
Example: beq $s0, $s1, offset
Assume we add hardware to second stage to:
  Compare fetched registers for equality
  Compute branch target
This allows branch to be taken at end of second clock cycle
But, this still means result is not ready when we want to load the next instruction!
Control Hazard Solutions
Stall – stop loading instructions until result is available
Predict – assume an outcome and continue fetching (undo if prediction is wrong)
Delayed branch – specify in architecture that following instruction is always executed
Control Hazard – Stall
[Figure: pipeline timing diagram with stages IF ID EX MEM WB. add $r4,$r5,$r6 is followed by beq $r0,$r1,tgt; a STALL (a row of bubbles) is inserted before sw $s4,200($t5) is fetched. Labels: "beq writes PC here", "new PC used here"]
Control Hazard – Correct Prediction
[Figure: pipeline timing diagram. add $r4,$r5,$r6; beq $r0,$r1,tgt; and tgt: sw $s4,200($t5) each pass through IF ID EX MEM WB with no bubble between them. Label: "Fetch assuming branch taken"]
Control Hazard – Incorrect Prediction
[Figure: pipeline timing diagram. After add $r4,$r5,$r6 and beq $r0,$r1,tgt, the instruction tgt: sw $s4,200($t5) is fetched (IF) on the incorrect prediction and then turned into bubbles (STALL); or $r8,$r8,$r9 follows through IF ID EX MEM WB. Label: "Squashed" instruction]
Control Hazard – Delayed Branch
[Figure: pipeline timing diagram. add $r4,$r5,$r6; beq $r0,$r1,tgt; branch slot: and $r6,$r6,$r7 (always executes); tgt: sw $s4,200($t5). Each passes through IF ID EX MEM WB. Label: "correct PC avail. here"]
Summary – Control Hazard Solutions
Stall – stop fetching instructions until result is available
  Significant performance penalty
  Hardware required to stall
Predict – assume an outcome and continue fetching (undo if prediction is wrong)
  Performance penalty only when guess is wrong
  Hardware required to "squash" instructions
Delayed branch – specify in architecture that following instruction is always executed
  Compiler re-orders instructions into delay slot
  Insert "NOP" (no-op) operations when can't use (~50%)
  This is how original MIPS worked
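To compare the three approaches, a rough cycles-per-instruction estimate can help. This is a sketch under stated assumptions, not a measurement: the 20% branch fraction and the 40% misprediction rate are made-up inputs, the one-cycle penalty follows from resolving the branch in the second stage, and the 50% figure is the NOP rate quoted on the slide.

```python
def effective_cpi(branch_fraction, penalty_cycles, penalized_rate=1.0, base_cpi=1.0):
    """Average cycles per instruction when some fraction of branches loses cycles.

    penalized_rate is the share of branches that actually pay the penalty.
    """
    return base_cpi + branch_fraction * penalized_rate * penalty_cycles

BRANCH_FRACTION = 0.2   # hypothetical: one instruction in five is a branch
PENALTY = 1             # one lost cycle when the branch resolves in the second stage

stall_cpi   = effective_cpi(BRANCH_FRACTION, PENALTY)                      # every branch stalls
predict_cpi = effective_cpi(BRANCH_FRACTION, PENALTY, penalized_rate=0.4)  # hypothetical 40% wrong guesses
delayed_cpi = effective_cpi(BRANCH_FRACTION, PENALTY, penalized_rate=0.5)  # ~50% of slots hold a NOP

print(stall_cpi, predict_cpi, delayed_cpi)
```

The model treats a NOP in the delay slot as one wasted cycle, which is a simplification; it ignores any other source of stalls.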
Data Hazards
Attempt to use data before it is ready
Solutions
  Stalling – wait until result is available
  Forwarding – make data available inside datapath
  Reordering instructions – use compiler to avoid hazards
Examples:
  add $s0, $t0, $t1 ; $s0 = $t0+$t1
  sub $t2, $s0, $t3 ; $t2 = $s0-$t3

  lw $s0, 0($t0) ; $s0 = MEM[$t0]
  sub $t2, $s0, $t3 ; $t2 = $s0-$t3
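Both examples are read-after-write dependences between adjacent instructions, which can be detected by comparing register names. The sketch below is a simplified illustration: the parser is hypothetical, handles only add, sub, lw, and sw, and checks adjacent pairs only.

```python
import re

def parse(instr):
    """Split an instruction into (opcode, destination, sources). Handles add/sub/lw/sw only."""
    op, rest = instr.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op == "sw":
        # sw rt, offset(rs) reads both registers and writes none.
        return op, None, regs
    # add/sub/lw: the first register is written, the rest are read.
    return op, regs[0], regs[1:]

def raw_hazards(program):
    """Return (producer_index, consumer_index, register) for adjacent dependences."""
    found = []
    for i in range(len(program) - 1):
        _, dest, _ = parse(program[i])
        _, _, sources = parse(program[i + 1])
        if dest is not None and dest in sources:
            found.append((i, i + 1, dest))
    return found

print(raw_hazards(["add $s0, $t0, $t1", "sub $t2, $s0, $t3"]))
print(raw_hazards(["lw $s0, 0($t0)", "sub $t2, $s0, $t3"]))
```

Whether a detected dependence costs any cycles depends on the hardware: with stalling only it always does, while forwarding removes the cost in some cases.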
Data Hazard – Stalling
[Figure: pipeline timing diagram. add $s0,$t0,$t1 writes $s0 in its WB stage ("$s0 written here"); sub $t2,$s0,$t3 is held back by two STALLs (rows of bubbles) so that its ID stage reads $s0 ("$s0 read here") no earlier than the write]
Data Hazards – Forwarding
Key idea: connect new value directly to next stage
Still read s0, but ignore in favor of new result
Problem: what about load instructions?
[Figure: pipeline timing diagram. add $s0,$t0,$t1 is followed by sub $t2,$s0,$t3 with no stall; the new value of s0 is passed from the add directly to the EX stage of the sub]
Data Hazards – Forwarding
STALL still required for load: data available only after MEM
MIPS architecture calls this delayed load; initial implementations required the compiler to deal with this
[Figure: pipeline timing diagram. lw $s0,20($t1) is followed by sub $t2,$s0,$t3; one STALL (a row of bubbles) is inserted, after which the new value of s0 is forwarded to the sub]
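The stall counts in the last three diagrams follow from when a value is produced and when it is needed. The function below is a sketch for the five-stage pipeline, assuming the register file is written in the first half of a cycle and read in the second half, and that forwarding feeds the EX stage only; the names are invented for this example.

```python
def stall_cycles(producer, distance, forwarding):
    """Bubbles needed before a consumer that follows `producer` by `distance` instructions.

    producer is 'alu' (result ready after EX) or 'load' (result ready after MEM).
    """
    if forwarding:
        # An ALU result can be forwarded straight to the next EX stage.
        # A load result exists only after MEM, one cycle too late for the next instruction.
        needed = 1 if producer == "load" else 0
        return max(0, needed - (distance - 1))
    # Without forwarding, the consumer's ID stage must not come before the producer's WB stage.
    return max(0, 3 - distance)

print(stall_cycles("alu", 1, forwarding=False))
print(stall_cycles("alu", 1, forwarding=True))
print(stall_cycles("load", 1, forwarding=True))
```

Under these assumptions the adjacent add/sub pair needs two bubbles without forwarding and none with it, while the lw/sub pair still needs one.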
Data Hazards – Reordering Instructions
Assuming we have data forwarding, what are the hazards in this code?
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t2, 0($t1)
  sw $t0, 4($t1)
Reorder instructions to remove hazard:
  lw $t0, 0($t1)
  lw $t2, 4($t1)
  sw $t0, 4($t1)
  sw $t2, 0($t1)
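The effect of the reordering can be checked by counting load-use pairs in each version. This sketch assumes forwarding into the EX stage only, so any instruction that reads a register loaded by the immediately preceding lw costs one stall; the helper names are hypothetical and the register matching is purely textual.

```python
import re

def regs(instr):
    """All register names in an instruction, in order of appearance."""
    return re.findall(r"\$\w+", instr)

def load_use_stalls(program):
    """Count lw instructions whose destination is read by the very next instruction."""
    stalls = 0
    for prev, nxt in zip(program, program[1:]):
        if prev.split()[0] != "lw":
            continue
        dest = regs(prev)[0]
        # sw reads every register it names; other instructions read all but the first.
        sources = regs(nxt) if nxt.split()[0] == "sw" else regs(nxt)[1:]
        if dest in sources:
            stalls += 1
    return stalls

original = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t2, 0($t1)", "sw $t0, 4($t1)"]
reordered = ["lw $t0, 0($t1)", "lw $t2, 4($t1)", "sw $t0, 4($t1)", "sw $t2, 0($t1)"]
print(load_use_stalls(original), load_use_stalls(reordered))
```

In the original order sw $t2 directly follows the lw that produces $t2; swapping the two stores puts one instruction between them, and both stores still write the same values to the same addresses.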
Summary - Pipelining Overview
Pipelining increases throughput (but does not improve latency)
Hazards limit performance
  Structural hazards
  Control hazards
  Data hazards
Pipelining Outline
Introduction
Pipelined Processor Design
  Datapath
  Control
  Dealing with Hazards & Forwarding
  Branch Prediction
  Exceptions
  Performance
Advanced Pipelining
  Superscalar
  Dynamic Pipelining
  Examples
Pipelining in MIPS
MIPS architecture was designed to be pipelined
  Simple instruction format (makes IF, ID easy)
    Single-word instructions
    Small number of instruction formats
    Common fields in same place (e.g., rs, rt) in different formats
  Memory operations only in lw, sw instructions (simplifies EX)
  Memory operands aligned in memory (simplifies MEM)
  Single value for writeback (limits forwarding)
Pipelining is harder in CISC architectures
Pipelined Datapath with Control Signals
[Figure: pipelined datapath with registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control signals PCSrc, Branch, RegWrite, RegDst, ALUSrc, ALUOp, ALUControl, MemRead, MemWrite, MemtoReg; the rs, rt, rd, and immed fields are carried through ID/EX]
Next Step: Adding Control
Basic approach: build on single-cycle control
  Place control unit in ID stage
  Pass control signals to following stages
Later: extra features to deal with:
  Data forwarding
  Stalls
  Exceptions
Control for Pipelined Datapath
Source: Book Fig. 6.29, p 469
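[Figure: the Control unit produces three groups of signals (EX, M, WB) that are carried through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; each group is dropped after the stage that uses it]
EX: RegDst, ALUOp[1:0], ALUSrc
M: MemRead, MemWrite, Branch
WB: RegWrite, MemtoReg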
Control for Pipelined Datapath
             Execution/Address Calculation   Memory access             Write-back
             stage control lines             stage control lines       stage control lines
Instruction  RegDst ALUOp1 ALUOp0 ALUSrc     Branch MemRead MemWrite   RegWrite MemtoReg
R-format     1      1      0      0          0      0       0          1        0
lw           0      0      0      1          0      1       0          1        1
sw           X      0      0      1          0      0       1          0        X
beq          X      0      1      0          1      0       0          0        X
[Figure: Control unit feeding the EX, M, and WB signal groups through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB]
Source: Book Fig. 6.25, p 401
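One way to picture how these control lines travel is to store them per instruction and shift them through the pipeline registers. The sketch below is an illustration only: the dictionary layout and the `advance` function are invented here, the bit values are taken from the table, and `None` stands for a don't-care (X).

```python
CONTROL = {
    # EX = (RegDst, ALUOp1, ALUOp0, ALUSrc), M = (Branch, MemRead, MemWrite), WB = (RegWrite, MemtoReg)
    "R-format": {"EX": (1, 1, 0, 0),    "M": (0, 0, 0), "WB": (1, 0)},
    "lw":       {"EX": (0, 0, 0, 1),    "M": (0, 1, 0), "WB": (1, 1)},
    "sw":       {"EX": (None, 0, 0, 1), "M": (0, 0, 1), "WB": (0, None)},
    "beq":      {"EX": (None, 0, 1, 0), "M": (1, 0, 0), "WB": (0, None)},
}

def advance(pipe, decoded):
    """Move every control bundle one stage forward, dropping groups that were already used."""
    id_ex, ex_mem = pipe["ID/EX"], pipe["EX/MEM"]
    return {
        "ID/EX": dict(CONTROL[decoded]) if decoded else None,        # all three groups
        "EX/MEM": {"M": id_ex["M"], "WB": id_ex["WB"]} if id_ex else None,  # EX group consumed
        "MEM/WB": {"WB": ex_mem["WB"]} if ex_mem else None,           # M group consumed
    }

pipe = {"ID/EX": None, "EX/MEM": None, "MEM/WB": None}
for instr in ["lw", "sw", "R-format"]:
    pipe = advance(pipe, instr)
print(pipe)
```

After three steps the lw has only its write-back bits left in MEM/WB, the sw carries its memory and write-back bits in EX/MEM, and the R-format instruction still holds all three groups in ID/EX.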
Datapath and Control Unit
[Figure: pipelined datapath with the Control unit in the ID stage; the EX, M, and W control groups are stored in ID/EX, EX/MEM, and MEM/WB and drive RegDst, ALUSrc, ALUOp, Branch, MemRead, MemtoReg, RegWrite, and PCSrc]
Tracking Control Signals - Cycle 1
[Figure: datapath and control unit with LW in IF]
Tracking Control Signals - Cycle 2
[Figure: datapath and control unit with SW in IF, LW in ID]
Tracking Control Signals - Cycle 3
[Figure: datapath and control unit with ADD in IF, SW in ID, LW in EX; the control values for LW are shown on the signal lines]
Tracking Control Signals - Cycle 4
[Figure: datapath and control unit with SUB in IF, ADD in ID, SW in EX, LW in MEM; control values are shown on the signal lines]
Tracking Control Signals - Cycle 5
[Figure: datapath and control unit with SUB in ID, ADD in EX, SW in MEM, LW in WB; control values are shown on the signal lines]
References
Portions of these slides are derived from:
  Textbook figures © 1998 Morgan Kaufmann Publishers, all rights reserved
  Tod Amon's COD2e Slides © 1998 Morgan Kaufmann Publishers, all rights reserved
  Dave Patterson’s CS 152 Slides – Fall 1997 © UCB
  Rob Rutenbar’s 18-347 Slides – Fall 1999 CMU
  John Nestor’s ECE 313 Slides – Fall 2004 LC
  T.S. Chang’s DEE 1050 Slides – Fall 2004 NCTU
  Other sources as noted