ENGN1640: Design of Computing Systems (scale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf)
![Page 1: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/1.jpg)
[Material from Patterson & Hennessy and Harris]
ENGN1640: Design of Computing Systems
Topic 05: Pipeline Processor Design
Professor Sherief Reda
http://scale.engin.brown.edu
Electrical Sciences and Computer Engineering
School of Engineering, Brown University
Spring 2016
![Page 2: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/2.jpg)
Pipelining analogy
• Pipelined laundry: overlapping execution
• Parallelism improves performance
• Four loads: Speedup = 16/7 ≈ 2.3
• Non-stop (n loads): Ideal speedup = 4n/(n + 3) ≈ 4 = number of stages
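The two speedup figures on this slide follow from counting time units. A minimal sketch, assuming four laundry stages that each take one time unit (the function name `laundry_speedup` is illustrative and not from the slides):

```python
from fractions import Fraction

def laundry_speedup(loads: int, stages: int = 4) -> Fraction:
    """Speedup of pipelined over sequential laundry, one time unit per stage."""
    sequential = loads * stages        # every load runs all stages on its own
    pipelined = stages + (loads - 1)   # first load fills the pipe, then one load finishes per unit
    return Fraction(sequential, pipelined)

if __name__ == "__main__":
    for n in (4, 100, 10_000):
        s = laundry_speedup(n)
        print(f"{n} loads: speedup = {s} = {float(s):.2f}")
```

For four loads this evaluates to 16/7; as the number of loads grows, the value approaches the number of stages.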
![Page 3: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/3.jpg)
Pipelined ARM processor
• Temporal parallelism
• Divide single-cycle processor into 5 stages:
– Fetch
– Decode
– Execute
– Memory
– Writeback
• Add pipeline registers between stages
![Page 4: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/4.jpg)
Single-Cycle vs Pipelined
[Figure: timing diagrams over 0 to 1500 ps for Single-Cycle (instructions 1, 2) and Pipelined (instructions 1, 2, 3) execution; each instruction passes through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read/Write, Wr Reg.]
![Page 5: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/5.jpg)
Pipeline datapath abstraction
[Figure: pipeline diagram over cycles 1 to 10; each instruction passes through IM, RF (read), ALU operation, DM, and RF (write), starting one cycle after the previous instruction.]
LDR R2, [R0, #40]
ADD R3, R9, R10
SUB R4, R1, R5
AND R5, R12, R13
STR R6, [R1, #20]
ORR R7, R11, #42
![Page 6: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/6.jpg)
Single-cycle & pipelined datapath
[Figure: Single-Cycle and Pipelined datapaths. The pipelined version adds pipeline registers (CLK) between the Fetch, Decode, Execute, Memory, and Writeback stages; signal names carry a stage suffix (PCF, InstrD, SrcAE, ALUOutM, ResultW), and the write address is carried along as WA3D, WA3E, WA3M, WA3W.]
• WA3 must arrive at same time as Result
• Register file written on falling edge of CLK
![Page 7: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/7.jpg)
Optimized pipeline datapath
• Remove adder by using PCPlus4F after PC has been updated to PC+4
• Assumes writing happens (e.g., in first half of clock cycle) before reading
[Figure: pipelined datapath without and with the extra PC+8 adder; in the optimized version PCPlus8D is taken from PCPlus4F.]
![Page 8: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/8.jpg)
Tracing LDR in its journey: 1st cycle
[Figure: pipelined datapath annotated "LDR fetch".]
![Page 9: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/9.jpg)
Tracing LDR in its journey: 2nd cycle
[Figure: pipelined datapath annotated "LDR decode".]
![Page 10: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/10.jpg)
Tracing LDR in its journey: 3rd cycle
[Figure: pipelined datapath annotated "LDR EXE".]
![Page 11: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/11.jpg)
Tracing LDR in its journey: 4th cycle
[Figure: pipelined datapath annotated "LDR mem".]
![Page 12: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/12.jpg)
Tracing LDR in its journey: 5th cycle
[Figure: pipelined datapath annotated "LDR WB".]
![Page 13: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/13.jpg)
Pipeline performance
• Assume time for stages is
– 100 ps for register read or write
– 200 ps for other stages
• Compare pipelined datapath with single-cycle datapath

| Instr | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|---|---|---|---|---|---|---|
| LDR | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps | 800 ps |
| STR | 200 ps | 100 ps | 200 ps | 200 ps | | 700 ps |
| data op | 200 ps | 100 ps | 200 ps | | 100 ps | 600 ps |
| Branch | 200 ps | 100 ps | 200 ps | | | 500 ps |
![Page 14: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/14.jpg)
Single-cycle versus pipeline performance
Single-cycle (Tc = 800 ps)
LDR R1, [R5]
LDR R2, [R6]
LDR R3, [R7]
Pipelined (Tc = 200 ps)
LDR R1, [R5]
LDR R2, [R6]
LDR R3, [R7]
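As a rough check of the two clock periods, the sketch below recomputes them from the stage delays on the "Pipeline performance" slide and times the three-load sequence. It assumes an ideal pipeline with no hazards; the dictionary and function names are made up for illustration.

```python
# Stage delays in picoseconds, taken from the "Pipeline performance" slide.
STAGE_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Stages each instruction class uses, per the same slide.
USES = {
    "LDR": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "STR": ["fetch", "reg_read", "alu", "mem"],
    "data op": ["fetch", "reg_read", "alu", "reg_write"],
    "Branch": ["fetch", "reg_read", "alu"],
}

def single_cycle_tc() -> int:
    """The single-cycle clock must fit the slowest instruction."""
    return max(sum(STAGE_PS[s] for s in stages) for stages in USES.values())

def pipelined_tc() -> int:
    """The pipelined clock must fit the slowest stage."""
    return max(STAGE_PS.values())

def single_cycle_time(n: int) -> int:
    """Total time for n instructions, one per long clock cycle."""
    return n * single_cycle_tc()

def pipelined_time(n: int) -> int:
    """First instruction takes one cycle per stage; each later one finishes a cycle after."""
    return (len(STAGE_PS) + n - 1) * pipelined_tc()

if __name__ == "__main__":
    print("Tc:", single_cycle_tc(), "ps vs", pipelined_tc(), "ps")
    print("3 LDRs:", single_cycle_time(3), "ps vs", pipelined_time(3), "ps")
```

For the three loads this gives 2400 ps single-cycle versus 1400 ps pipelined; the gap widens as more instructions are issued.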
![Page 15: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/15.jpg)
Pipeline speedup
• If all stages are balanced (i.e., all take the same time):
$\text{Time between instructions}_{\text{pipelined}} = \dfrac{\text{Time between instructions}_{\text{nonpipelined}}}{\text{Number of stages}}$
• Ideal speedup (n instructions and s stages): $\dfrac{n s}{n + s - 1}$, which approaches s for large n
• If not balanced, speedup is less
• Speedup is due to increased throughput
– Latency (time for each instruction) does not decrease
• Branches will also reduce the speedup
• Added pipeline registers reduce the speedup
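The ideal-speedup expression can be derived by counting cycles. The following is a sketch under the assumption of $s$ perfectly balanced stages of delay $t$ each, no hazards, and no pipeline-register overhead; the symbols $T_{\text{nonpipelined}}$ and $T_{\text{pipelined}}$ are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{nonpipelined}}(n) &= n \cdot s \cdot t \\
T_{\text{pipelined}}(n)    &= (s + n - 1)\, t \\
\text{Speedup}(n)          &= \frac{T_{\text{nonpipelined}}(n)}{T_{\text{pipelined}}(n)}
                            = \frac{n s}{n + s - 1}
                            \;\xrightarrow{\; n \to \infty \;}\; s
\end{align*}
```

With $s = 4$ this reduces to $4n/(n+3)$, the laundry figure. With the unbalanced delays from the performance table (800 ps single-cycle clock, 200 ps pipelined clock), the steady-state speedup is $800/200 = 4$ rather than 5, which is one instance of "if not balanced, speedup is less".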
![Page 16: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/16.jpg)
Reminder of single-cycle control
[Figure: single-cycle ARM datapath with Control Unit. Control Unit inputs: Cond (31:28), Op (27:26), Funct (25:20), Rd (15:12), ALUFlags; outputs: PCSrc, MemtoReg, MemWrite, ALUControl, ALUSrc, ImmSrc, RegWrite, RegSrc.]
![Page 17: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/17.jpg)
Modifications to pipeline control
• Control signals derived from instruction
– Same as in single-cycle implementation
• Control delayed to proper pipeline stage
![Page 18: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/18.jpg)
Pipelined datapath + control
[Figure: pipelined datapath with control. The Control Unit outputs produced in Decode (PCSrcD, BranchD, RegWriteD, MemtoRegD, MemWriteD, ALUControlD, ALUSrcD, FlagWriteD, ImmSrcD, RegSrcD) are carried through the pipeline registers to Execute (suffix E), Memory (PCSrcM, RegWriteM, MemtoRegM, MemWriteM), and Writeback (PCSrcW, RegWriteW, MemtoRegW). The Cond Unit in Execute uses CondE, FlagsE, and ALUFlags to produce CondExE.]
• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage
![Page 19: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/19.jpg)
Pipelining hazards
Situations that prevent starting the next instruction in the next cycle
1. Structural hazards
• A required resource is busy
2. Data hazard
• Need to wait for previous instruction to complete its data read/write
3. Control hazard
• Deciding on control action depends on previous instruction
![Page 20: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/20.jpg)
1. Structural hazards
• Conflict for use of a resource
• In an ARM pipeline with a single memory
– Load/store requires data access
– Instruction fetch would have to stall for that cycle
– Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
– Or separate instruction/data caches
![Page 21: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/21.jpg)
2. Data Hazards: compute-use
[Figure: pipeline diagram over cycles 1 to 8; ADD writes R1 in its Writeback stage while the following AND, ORR, and SUB read R1 in their Decode stages.]
ADD R1, R4, R5
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
![Page 22: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/22.jpg)
Data Hazard: load-use
LDR R1, [R4, #40]
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
![Page 23: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/23.jpg)
Handling data hazards
A. Compile-time techniques
B. Forward data at run time
C. Stall the processor at run time
![Page 24: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/24.jpg)
A. Data hazard elimination using compile-time techniques (nop)
• Insert enough nops until result is ready (wastes cycles)
[Figure: pipeline diagram over cycles 1 to 10 with two NOPs inserted between ADD and AND.]
ADD R1, R4, R5
NOP
NOP
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
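The nop-insertion idea can be sketched as a small compile-time pass. This is an illustration under stated assumptions, not the course's tool: a five-stage pipeline with no forwarding, a register file that is written before it is read within a cycle (so a consumer must issue at least three slots after its producer), and a toy parser that only understands register operands written as `Rn`. The names `MIN_DISTANCE`, `parse`, and `insert_nops` are hypothetical.

```python
import re

# Assumption: without forwarding, the consumer's Decode may share a cycle with the
# producer's Writeback, so the consumer must issue at least 3 slots after it.
MIN_DISTANCE = 3

def parse(instr):
    """Split e.g. 'ADD R1, R4, R5' into (dest, {sources}); data-processing forms only."""
    _, operands = instr.split(None, 1)
    regs = re.findall(r"R\d+", operands)
    return regs[0], set(regs[1:])

def insert_nops(program):
    """Return the program with just enough NOPs to separate dependent instructions."""
    scheduled = []     # output instruction stream
    last_writer = {}   # register -> slot index of its most recent writer
    for instr in program:
        dest, sources = parse(instr)
        # Earliest slot at which every source register is readable.
        earliest = max(
            (last_writer[r] + MIN_DISTANCE for r in sources if r in last_writer),
            default=0,
        )
        while len(scheduled) < earliest:
            scheduled.append("NOP")
        last_writer[dest] = len(scheduled)
        scheduled.append(instr)
    return scheduled

if __name__ == "__main__":
    code = ["ADD R1, R4, R5", "AND R8, R1, R3", "ORR R9, R6, R1", "SUB R10, R1, R7"]
    print("\n".join(insert_nops(code)))
```

On the four-instruction sequence from the slide, the pass places two NOPs between ADD and AND and none elsewhere.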
![Page 25: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/25.jpg)
A. Data hazard elimination using compile-time techniques (code rescheduling)
• Reorder code to avoid use of load result in the next instruction
• Compiler must be aware of pipeline structure
Original code:
ADD R1, R4, R5
ADD R8, R1, R3
AND R9, R6, R1
SUB R10, R2, R7
Rescheduled code:
ADD R1, R4, R5
SUB R10, R2, R7
NOP
ADD R8, R1, R3
AND R9, R6, R1
Rescheduling saved one cycle (one NOP instead of two)!
![Page 26: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/26.jpg)
B. Data hazard elimination using data forwarding/bypassing during runtime
• Don’t wait for result to be stored in a register: forward the results whenever they are available
• Requires extra connections in the datapath
[Figure: pipeline diagram over cycles 1 to 8 for the sequence below, with the R1 result of ADD forwarded to the instructions that follow.]
ADD R1, R4, R5
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
• Check if register read in Execute stage matches register written in Memory or Writeback stage
• If so, forward result
![Page 27: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/27.jpg)
Circuitry for forwarding
[Figure: pipelined datapath with a Hazard Unit. The Hazard Unit takes the Match signals, RegWriteM, and RegWriteW, and drives ForwardAE and ForwardBE, which control three-input multiplexers (00, 01, 10) on the ALU operands in the Execute stage.]
![Page 28: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/28.jpg)
To forward or not to forward!
• Execute stage register matches Memory stage register?
Match_1E_M = (RA1E == WA3M)
Match_2E_M = (RA2E == WA3M)
• Execute stage register matches Writeback stage register?
Match_1E_W = (RA1E == WA3W)
Match_2E_W = (RA2E == WA3W)
• If it matches, forward result:
if (Match_1E_M • RegWriteM) ForwardAE = 10;
else if (Match_1E_W • RegWriteW) ForwardAE = 01;
else ForwardAE = 00;
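The forwarding decision above can be modeled in software as a quick sanity check. This is a behavioral sketch, not the hardware description used in the course: register numbers are plain integers, and applying the same rule to the second operand (ForwardBE, using RA2E) is an assumption, since the slide only spells out ForwardAE. The function name `forward_select` is hypothetical.

```python
def forward_select(ra_e: int, wa3_m: int, reg_write_m: bool,
                   wa3_w: int, reg_write_w: bool) -> int:
    """Return the 2-bit forwarding select for one Execute-stage source register."""
    if ra_e == wa3_m and reg_write_m:
        return 0b10   # take the result sitting in the Memory stage (most recent)
    if ra_e == wa3_w and reg_write_w:
        return 0b01   # take the result sitting in the Writeback stage
    return 0b00       # no hazard: use the value read from the register file

if __name__ == "__main__":
    # AND R8, R1, R3 in Execute while ADD R1, ... is in Memory.
    forward_ae = forward_select(ra_e=1, wa3_m=1, reg_write_m=True, wa3_w=0, reg_write_w=False)
    forward_be = forward_select(ra_e=3, wa3_m=1, reg_write_m=True, wa3_w=0, reg_write_w=False)
    print(format(forward_ae, "02b"), format(forward_be, "02b"))
```

Because the Memory-stage test comes first, a register that matches in both stages is taken from the Memory stage, which holds the newer value.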
![Page 29: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/29.jpg)
Double data hazard
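• Consider the sequence:
ADD R1, R1, R2
ADD R1, R1, R3
ADD R1, R1, R4
• Both hazards occur
– Want to use the most recent
• Revise MEM hazard condition
– Give priority to EX results. That is, only fwd from MEM if EX hazard condition isn’t true
– In this deck's signal names: forward from the Memory stage in preference to the Writeback stage, which is what the if/else order for ForwardAE does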
![Page 30: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/30.jpg)
Pipelining hazards
1. Structural hazards
2. Data hazard
A. Compile-time techniques
B. Forward data at run time
C. Stall the processor at run time
3. Control hazard
![Page 31: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/31.jpg)
Stalling
[Figure: pipeline diagram over cycles 1 to 8 for the sequence below; AND needs R1 in its Execute stage before LDR has read it from data memory. Trouble!]
LDR R1, [R4, #40]
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
![Page 32: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/32.jpg)
Forwarding is not going to eliminate all hazards
[Figure: pipeline diagram over cycles 1 to 9 for the sequence below; a one-cycle stall makes AND repeat its Decode stage and ORR repeat its Fetch stage.]
LDR R1, [R4, #40]
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7
![Page 33: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/33.jpg)
FIX C. Data hazard elimination by stalling
[Figure annotations: stall inserted here; clock cycle wasted, but necessary for correctness.]
![Page 34: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/34.jpg)
Stalling HW
[Figure: pipelined datapath with the Hazard Unit extended for stalling. Additional Hazard Unit input: MemtoRegE. Additional outputs: StallF and StallD, connected to register enables (EN), and FlushD and FlushE, connected to register clears (CLR).]
![Page 35: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/35.jpg)
To stall or not to stall!
• Is either source register in the Decode stage the same as the one being written in the Execute stage?
Match_12D_E = (RA1D == WA3E) | (RA2D == WA3E)
• Is a LDR in the Execute stage AND Match_12D_E?
ldrstall = Match_12D_E • MemtoRegE
StallF = StallD = FlushE = ldrstall
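The load-use stall condition translates directly into a boolean function. The following is a behavioral sketch of the equations above, with register numbers as plain integers; the function name `ldr_stall` is hypothetical.

```python
def ldr_stall(ra1_d: int, ra2_d: int, wa3_e: int, memtoreg_e: bool) -> bool:
    """True when the instruction in Decode reads a register that an LDR in Execute will write."""
    match_12d_e = (ra1_d == wa3_e) or (ra2_d == wa3_e)
    return match_12d_e and memtoreg_e

if __name__ == "__main__":
    # LDR R1, [R4, #40] is in Execute; AND R8, R1, R3 is in Decode.
    stall = ldr_stall(ra1_d=1, ra2_d=3, wa3_e=1, memtoreg_e=True)
    stall_f = stall_d = flush_e = stall   # all three signals follow ldrstall
    print(stall_f, stall_d, flush_e)
```

MemtoRegE identifies a load in Execute, so an ADD writing the same register does not stall the pipeline; forwarding covers that case.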
![Page 36: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/36.jpg)
Data hazard summary
• Compiler can arrange code to avoid hazards and stalls
– Requires knowledge of the pipeline structure
• Forwarding can sometimes avoid stalls at the expense of extra hardware complexity
• Stalls reduce performance by increasing the average cycles per instruction (CPI)
– But they are sometimes absolutely necessary to get correct results
![Page 37: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/37.jpg)
3. Control hazards
• B:
– Branch not determined until the Writeback stage of pipeline
– Instructions after branch fetched before branch occurs
– These 4 instructions must be flushed if branch happens
• Writes to PC (R15) similar
![Page 38: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/38.jpg)
Control hazards
[Figure: pipeline diagram over cycles 1 to 10; the four instructions fetched after the branch are marked "Flush these instructions", and ADD at address 64 is fetched afterwards.]
20 B 3C
24 AND R8, R1, R3
28 ORR R9, R6, R1
2C SUB R10, R1, R7
30 SUB R11, R1, R8
34 ...
64 ADD R12, R3, R4
Branch misprediction penalty
• Number of instructions flushed when branch is taken (4)
• May be reduced by determining BTA earlier
![Page 39: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/39.jpg)
Early branch resolution
• Determine BTA in Execute stage
– Branch misprediction penalty = 2 cycles
• Hardware changes
– Add a branch multiplexer before PC register to select BTA from ALUResultE
– Add BranchTakenE select signal for this multiplexer (only asserted if branch condition satisfied)
– PCSrcW now only asserted for writes to PC
![Page 40: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/40.jpg)
Pipelined processor with early BTA
[Figure: pipelined datapath with Hazard Unit and early BTA; a branch multiplexer before the PC register, selected by BranchTakenE, takes the branch target from ALUResultE.]
![Page 41: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/41.jpg)
Control hazards with early BTA
[Figure: pipeline diagram over cycles 1 to 10; with the branch resolved in Execute, the two instructions after the branch (AND, ORR) are marked "Flush these instructions" before ADD at address 64 is fetched.]
20 B 3C
24 AND R8, R1, R3
28 ORR R9, R6, R1
2C SUB R10, R1, R7
30 SUB R11, R1, R8
34 ...
64 ADD R12, R3, R4
![Page 42: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/42.jpg)
Control stalling logic
• PCWrPendingF = 1 if write to PC in Decode, Execute or Memory
PCWrPendingF = PCSrcD + PCSrcE + PCSrcM
• Stall Fetch if PCWrPendingF
StallF = ldrStallD + PCWrPendingF
• Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken
FlushD = PCWrPendingF + PCSrcW + BranchTakenE
• Flush Execute if branch is taken
FlushE = ldrStallD + BranchTakenE
• Stall Decode if ldrStallD (as before)
StallD = ldrStallD
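The stall and flush equations above combine into one function, reading "+" as logical OR. This is a behavioral sketch for checking individual cases by hand, not the hardware description; the `HazardControl` tuple and the function name are hypothetical.

```python
from typing import NamedTuple

class HazardControl(NamedTuple):
    stall_f: bool
    stall_d: bool
    flush_d: bool
    flush_e: bool

def hazard_control(ldr_stall_d: bool, pcsrc_d: bool, pcsrc_e: bool,
                   pcsrc_m: bool, pcsrc_w: bool, branch_taken_e: bool) -> HazardControl:
    """Evaluate the stall/flush outputs of the hazard unit for one cycle."""
    # A write to the PC is in flight in Decode, Execute, or Memory.
    pc_wr_pending_f = pcsrc_d or pcsrc_e or pcsrc_m
    return HazardControl(
        stall_f=ldr_stall_d or pc_wr_pending_f,
        stall_d=ldr_stall_d,
        flush_d=pc_wr_pending_f or pcsrc_w or branch_taken_e,
        flush_e=ldr_stall_d or branch_taken_e,
    )

if __name__ == "__main__":
    # A taken branch resolved in Execute: flush Decode and Execute, no stall.
    print(hazard_control(False, False, False, False, False, True))
```

A taken branch flushes the two younger instructions without stalling Fetch, which matches the 2-cycle penalty of early branch resolution.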
![Page 43: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/43.jpg)
ARM Pipelined Processor with Hazard Unit
[Figure: complete pipelined ARM datapath with Control Unit, Cond Unit, and Hazard Unit (StallF, StallD, FlushD, FlushE, ForwardAE, ForwardBE, BranchTakenE).]
![Page 44: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/44.jpg)
Branch prediction
• Ideal pipelined processor: CPI = 1
• Branch misprediction increases CPI
• Static branch prediction:
– Always not taken
– Always taken
– Check direction of branch (forward or backward): if backward, predict taken; else, predict not taken
• Dynamic branch prediction:
– Make a dynamic prediction based on history of branches
In all cases, the branch must be executed to see if the prediction was correct; if not, flush instructions and resume from the correct direction!
![Page 45: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/45.jpg)
Eliminating 2-cycle stall for taken-prediction policy with branch target buffer
• Even with predictor, still need to calculate the target address
– 2-cycle penalty for a taken branch
• Branch target buffer
– Cache of target addresses
– Indexed by PC when instruction fetched
– If hit and instruction is branch predicted taken, can fetch target immediately, with no 2-cycle penalty
![Page 46: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/46.jpg)
Branch target buffer
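[Figure: branch target buffer. The PC goes to the IM and, in parallel, to a table of (Branch PC, Branch target address) entries; a comparator (=) checks the stored Branch PC against the PC and drives a MUX ahead of the pipeline register and the rest of the pipeline.]
• No: it is not a branch, Next PC = PC+4
• Yes: it is a branch and PC = branch target address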
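The lookup in the figure can be sketched as a small direct-mapped table. This is an illustrative model only: the table size, the word-aligned indexing, and the class and method names are assumptions, and a real BTB would also consult the prediction bits before redirecting the fetch.

```python
class BranchTargetBuffer:
    """Direct-mapped cache of branch targets, indexed by the fetch PC (sketch)."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table = [None] * entries   # each slot holds (branch_pc, target) or None

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries   # assumes 4-byte, word-aligned instructions

    def next_pc(self, pc: int) -> int:
        """Hit (stored branch PC equals pc): fetch the target. Miss: fall through to PC+4."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return pc + 4

    def update(self, branch_pc: int, target: int) -> None:
        """Record a taken branch once its target has been computed."""
        self.table[self._index(branch_pc)] = (branch_pc, target)

if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.next_pc(0x20)))   # miss before the branch has been seen
    btb.update(0x20, 0x64)          # the branch at 0x20 was taken to 0x64
    print(hex(btb.next_pc(0x20)))   # hit on the next fetch of 0x20
```

Storing the full branch PC alongside the target is what lets the comparator reject a different instruction that happens to map to the same slot.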
![Page 47: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/47.jpg)
2. Dynamic branch prediction
• In deeper and superscalar pipelines, branch penalty is more significant
• Use dynamic prediction:
– Branch prediction buffer (aka branch history table)
– Indexed by recent branch instruction addresses; stores outcome (taken/not taken)
• To execute a branch:
– Check table, expect the same outcome
– Start fetching from fall-through or target
– If wrong, flush pipeline and flip prediction
[Figure: floorplan of Pentium processor]
![Page 48: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/48.jpg)
1-bit branch predictor
• Remembers whether branch was taken the last time and does the same thing
• Mispredicts first and last branch of loop
• Prediction bits are added to BTB entries
MOV R1, #0 ; R1 = sum
MOV R0, #0 ; R0 = i
FOR ; for (i=0; i<10; i=i+1)
CMP R0, #10
BGE DONE
ADD R1, R1, R0 ; sum = sum + i
ADD R0, R0, #1
B FOR
DONE
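The behavior on this loop can be checked with a short simulation of the conditional branch `BGE DONE`, which is not taken for i = 0 to 9 and taken once when i = 10. The sketch assumes the predictor initially says "taken", for example because the loop's last exit left it in that state; the function name is hypothetical.

```python
def one_bit_mispredictions(outcomes, initial_taken=True):
    """Count mispredictions of a 1-bit predictor that repeats the last outcome."""
    prediction = initial_taken
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken   # remember only what the branch did last time
    return misses

if __name__ == "__main__":
    # BGE DONE: not taken while i < 10, then taken when i == 10.
    bge_outcomes = [False] * 10 + [True]
    print(one_bit_mispredictions(bge_outcomes))
```

Under that initial-state assumption the predictor misses exactly twice, on the first and the last execution of the branch.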
![Page 49: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/49.jpg)
Problem with 1-bit predictor
• Inner loop branches mispredicted twice!
outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer
• Mispredict as taken on last iteration of inner loop
• Then mispredict as not taken on first iteration of inner loop next time around
![Page 50: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/50.jpg)
2-bit predictor
Only change prediction on two successive mispredictions
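One common way to realize this rule is a 2-bit saturating counter, which is an assumption here since the slide does not show the state diagram. The sketch compares it with a 1-bit predictor on an inner-loop branch that is taken nine times and then falls through, repeated for five passes of an outer loop; the pattern and all names are illustrative.

```python
def one_bit_mispredictions(outcomes, prediction=True):
    """1-bit predictor: always predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        if taken != prediction:
            misses += 1
        prediction = taken
    return misses

def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

if __name__ == "__main__":
    # Inner-loop branch: taken 9 times, then not taken; 5 passes of the outer loop.
    inner = ([True] * 9 + [False]) * 5
    print("1-bit:", one_bit_mispredictions(inner))
    print("2-bit:", two_bit_mispredictions(inner))
```

With both predictors starting in a "taken" state, the 1-bit scheme misses twice per pass after the first, while the 2-bit counter misses only on each loop exit.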
![Page 51: ENGN1640: Design of Computing Systemsscale.engin.brown.edu/classes/EN164S16/topic05 - pipe.pdf= 16/7 = 2.3 Non-stop: Ideal ... LDR fetch. Tracing LDR in its journey: 2nd cycle. 9 ExtImmE](https://reader034.fdocuments.us/reader034/viewer/2022050602/5fa9331e41e05055392ce1e0/html5/thumbnails/51.jpg)
Summary
• Pipelining for speedup
• Ideal speedup = number of stages, but actual speedup depends on delay balance between stages (clock frequency), delays introduced by pipeline registers, and number of stalls (CPI).
• Hazards (structural, data, and control) can increase CPI
• Hazards can be eliminated or mitigated using code reorganization, stalling, flushing, forwarding / bypassing
• Branch prediction, branch prediction buffer, branch target buffer can reduce stalls arising from control hazards