CDA 5155
Computer Architecture
Week 1.5
Start with the materials: Conductors and Insulators
• Conductor: a material that permits electrical current to flow easily (low resistance to current flow). A lattice of atoms with free electrons.
• Insulator: a material that is a poor conductor of electrical current (high resistance to current flow). A lattice of atoms with strongly held electrons.
• Semiconductor: a material that can act like a conductor or an insulator depending on conditions (variable resistance to current flow).
Making a semiconductor using silicon
[Figure: a pure silicon lattice, each atom sharing its valence electrons with its neighbors]
What is a pure silicon lattice? A. Conductor B. Insulator C. Semiconductor
N-type Doping
We can increase the conductivity by adding atoms of phosphorus or arsenic to the silicon lattice. They have one more electron, which is free to wander…
This is called n-type doping since we add some free (negatively charged) electrons.
Making a semiconductor using silicon
[Figure: the silicon lattice with a phosphorus (P) atom substituted in; its extra electron is easily moved from its site]
What is an n-doped silicon lattice? A. Conductor B. Insulator C. Semiconductor
P-type Doping
Interestingly, we can also improve the conductivity by adding atoms of gallium or boron to the silicon lattice. They have one fewer electron, which creates a hole. Holes also conduct current by stealing electrons from their neighbors (thus moving the hole).
This is called p-type doping since we have fewer (negatively charged) electrons in the bonds holding the atoms together.
Making a semiconductor using silicon
[Figure: the silicon lattice with a gallium (Ga) atom substituted in, leaving a hole (?) in one bond]
This atom will accept an electron even though it is one too many, since it fills the eighth electron position in this shell. Again this lets current flow, since the electron must come from somewhere to fill this position.
Using doped silicon to make a junction diode
A junction diode allows current to flow in one direction and blocks it in the other.
[Figure: a p-n junction between GND and Vcc; electrons like to move toward Vcc, and electrons move from GND to fill holes]
[Figure: the same junction with Vcc and GND swapped; current flows as electrons cross the junction]
Making a transistor
Our first level of abstraction is the transistor (basically 2 diodes sitting back-to-back).
[Figure: transistor cross-section showing the gate over P-type material]
Transistors are electronic switches connecting the source to the drain if the gate is “on”.
http://www.intel.com/education/transworks/INDEX.HTM
[Figure: transistor switch states with Vcc]
Review of basic pipelining
• 5 stage “RISC” load-store architecture – about as simple as things get
• Instruction fetch: get instruction from memory/cache
• Instruction decode: translate opcode into control signals and read regs
• Execute: perform ALU operation
• Memory: access memory if load/store
• Writeback/retire: update register file
Pipelined implementation
• Break the execution of the instruction into cycles (5 in this case).
• Design a separate datapath stage for the execution performed during each cycle.
• Build pipeline registers to communicate between the stages.
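The three steps above can be sketched as a toy simulator (illustrative Python with hypothetical field names, not the actual LC3101 datapath): each pipeline register is read at the clock edge and then rewritten from the stage in front of it.

```python
# A minimal sketch of the five-stage idea: four pipeline registers
# (IF/ID, ID/EX, EX/Mem, Mem/WB) that all advance together each cycle.

NOP = {"op": "noop"}

def run_pipeline(program, cycles):
    pc = 0
    if_id = id_ex = ex_mem = mem_wb = NOP
    retired = []
    for _ in range(cycles):
        retired.append(mem_wb["op"])                 # stage 5: writeback
        # One simultaneous shift models edge-triggered latching.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, if_id  # stages 4, 3, 2
        if_id = program[pc] if pc < len(program) else NOP  # stage 1: fetch
        pc += 1
    return retired

prog = [{"op": "add"}, {"op": "nand"}, {"op": "lw"}]
print(run_pipeline(prog, 7))
# ['noop', 'noop', 'noop', 'noop', 'add', 'nand', 'lw']
```

Note the four-cycle latency between an instruction entering fetch and reaching writeback; that gap is exactly where the hazards discussed later come from.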
Stage 1: Fetch
Design a datapath that can fetch an instruction from memory every cycle. Use the PC to index memory to read the instruction.
Increment the PC (assume no branches for now).
Write everything needed to complete execution to the pipeline register (IF/ID). The next stage will read this pipeline register.
Note that the pipeline register must be edge triggered.
[Figure: fetch datapath – the PC (behind a MUX) indexes instruction memory; a +1 incrementer computes PC+1; the instruction bits and PC+1 are written to the IF/ID pipeline register for the rest of the pipelined datapath]
Stage 2: Decode
Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by regA and regB of the instruction bits). Decode is easy: just pass on the opcode and let later stages figure out their own control signals for the instruction.
Write everything needed to complete execution to the pipeline register (ID/EX). Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!). Include PC+1 even though decode didn’t use it.
[Figure: decode datapath – the Stage 1 fetch datapath feeds the IF/ID register; instruction bits select regA and regB in the register file; contents of regA, contents of regB, the instruction bits, and PC+1 are written to the ID/EX pipeline register]
Stage 3: Execute
Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register. The inputs are the contents of regA and either the contents of regB or the offset field of the instruction. Also calculate PC+1+offset in case this is a branch.
Write everything needed to complete execution to the pipeline register (EX/Mem): the ALU result, the contents of regB, and PC+1+offset; the instruction bits for the opcode and destReg specifiers; and the result from comparing the regA and regB contents.
[Figure: execute datapath – the Stage 2 decode datapath feeds the ID/EX register; a MUX selects the contents of regB or the offset as the second ALU input; an adder computes PC+1+offset; the ALU result, contents of regB, PC+1+offset, and instruction bits are written to the EX/Mem pipeline register]
Stage 4: Memory Operation
Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register. The ALU result contains the address for ld and st instructions.
Opcode bits control the memory R/W and enable signals.
Write everything needed to complete execution to the pipeline register (Mem/WB): the ALU result and MemData, plus the instruction bits for the opcode and destReg specifiers.
[Figure: memory datapath – the Stage 3 execute datapath feeds the EX/Mem register; the ALU result addresses data memory (en and R/W controlled by the opcode); PC+1+offset goes back to the MUX before the PC in stage 1 (MUX control for the PC input); the ALU result, memory read data, and instruction bits are written to the Mem/WB pipeline register]
Stage 5: Write back
Design a datapath that completes the execution of this instruction, writing to the register file if required. Write MemData to destReg for the ld instruction; write the ALU result to destReg for add or nand instructions.
Opcode bits also control the register write enable signal.
[Figure: writeback datapath – the Stage 4 memory datapath feeds the Mem/WB register; a MUX selects between the ALU result and the memory read data, and the result goes back to the data input of the register file; instruction bits 0-2 or 16-18, selected by a MUX, go back as the destination register specifier; the opcode drives the register write enable]
[Figure: the complete pipelined datapath – PC, instruction memory, register file, sign extend, ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers; instruction bits 0-2 and 16-18 feed the destination-register MUX, and MUXes steer the ALU inputs and the register write data]
Sample Test Question (Easy)
• Which item does not need to be included in the Mem/WB pipeline register for the LC3101 pipelined implementation discussed in class?
A. ALU result
B. Memory read data
C. PC+1+offset
D. Destination register specifier
E. Instruction opcode
Answer: C. PC+1+offset
Sample Test Question (Hard?)
• What items need to be added to one of the pipeline registers (discussed in class) to support the
<insert nasty instruction description here> ?
A. IF/ID: PC
B. ID/EX: PC+offset
C. EX/Mem: Contents of regA
D. EX/Mem: ALU2 result
E. Mem/WB: Contents of regA
Things to think about…
1. How would you modify the pipeline datapath if you wanted to double the clock frequency?
2. Would it actually double?
3. How do you determine the frequency?
Sample Code (Simple)
Run the following code on the pipelined LC3101:
add 1 2 3   ; reg 3 = reg 1 + reg 2
nand 4 5 6  ; reg 6 = reg 4 & reg 5
lw 2 4 20   ; reg 4 = Mem[reg 2 + 20]
add 2 5 5   ; reg 5 = reg 2 + reg 5
sw 3 7 10   ; Mem[reg 3 + 10] = reg 7
[Figure: the pipelined datapath annotated with pipeline-register fields – IF/ID: instruction, PC+1; ID/EX: op, dest, offset, valB, valA, PC+1; EX/Mem: op, dest, valB, ALU result, target, eq?; Mem/WB: op, dest, ALU result, mdata – plus the register file R0-R7, with instruction bits 22-24 and 16-18 selecting regA and regB and bits 0-2 feeding the dest MUX]

[Figure sequence: cycle-by-cycle snapshots of this datapath running the sample code. Initial registers: R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22]
Time 0: initial state – the pipeline holds noops
Time 1: fetch add 1 2 3
Time 2: fetch nand 4 5 6; add in decode
Time 3: fetch lw 2 4 20; nand in decode, add in execute
Time 4: fetch add 2 5 5; lw in decode, nand in execute, add in memory
Time 5: fetch sw 3 7 10; add in decode, lw in execute, nand in memory, add writes back (R3 = 45)
Time 6: no more instructions; sw in decode, add in execute, lw in memory, nand writes back (R6 = -3)
Time 7: sw in execute, add in memory, lw writes back (R4 = 99)
Time 8: sw in memory (Mem[55] = 22), add writes back (R5 = 16)
Time 9: sw completes writeback (no register write)
Time graphs

Time:  1      2      3       4       5         6         7         8         9
add    fetch  decode execute memory  writeback
nand          fetch  decode  execute memory    writeback
lw                   fetch   decode  execute   memory    writeback
add                          fetch   decode    execute   memory    writeback
sw                                   fetch     decode    execute   memory    writeback
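The time graph above is mechanical: with one instruction issued per cycle and no hazards, instruction i occupies stage s during cycle i + s. A small Python sketch (illustrative, not course material) that prints such a chart:

```python
# Prints a pipeline occupancy chart: F D E M W mark the five stages.

STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def chart(instrs):
    width = len(instrs) + len(STAGES) - 1          # total cycles needed
    rows = []
    for i, name in enumerate(instrs):
        cells = ["."] * width
        for s in range(len(STAGES)):
            cells[i + s] = STAGES[s][0].upper()     # instruction i in stage s at cycle i+s
        rows.append(f"{name:<5} " + " ".join(cells))
    return rows

for row in chart(["add", "nand", "lw", "add", "sw"]):
    print(row)
```

For the five sample instructions this gives a 9-cycle chart, matching the Time 1 through Time 9 snapshots.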
What can go wrong?
• Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written.
• Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that?
• Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?
Data Hazards
Data hazards:
What are they?
How do you detect them?
How do you deal with them?
Pipeline function for ADD
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate sum
Memory: Pass results to next stage
Writeback: write sum into register file
Data Hazards
add 1 2 3
nand 3 4 5

time →
add    fetch  decode  execute  memory   writeback
nand          fetch   decode   execute  memory   writeback

If not careful, nand will read the wrong value of R3.
[Figure: the pipelined datapath with its pipeline-register fields (op, dest, offset, valA, valB, PC+1, target, eq?, ALU result, mdata), repeated to highlight where regA/regB values are read (decode) and where results live (EX/Mem, Mem/WB)]
[Figure: the same datapath with three forwarding (fwd) paths added, routing in-flight results back toward the ALU inputs]
Three approaches to handling data hazards
Avoid: make sure there are no hazards in the code.
Detect and Stall: if hazards exist, stall the processor until they go away.
Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible).
Handling data hazards I: Avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation. Make sure no hazards exist.
• Put noops between any dependent instructions.

add 1 2 3
noop
noop
nand 3 4 5

(add writes R3 in cycle 5; nand reads R3 in cycle 5)
Problems with this solution
Old programs (legacy code) may not run correctly on new implementations. Longer pipelines need more noops.
Programs get larger as noops are included. This is especially a problem for machines that try to execute more than one instruction every cycle.
Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI is 1, but some instructions are noops.
Handling data hazards II: Detect and stall until ready
Detect:
Compare regA with previous DestRegs (3-bit operand fields)
Compare regB with previous DestRegs (3-bit operand fields)
Stall:
Keep current instructions in fetch and decode
Pass a noop to execute
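The detect step is just a handful of 3-bit comparators. A sketch in Python (illustrative field names; assumes, as on a later slide, that the register file forwards internally when a write and a read land in the same cycle, so only the instructions in execute and memory matter):

```python
# Detect-and-stall comparators: the instruction sitting in decode must
# stall if either source register matches the destination of an
# instruction still in execute or memory.

def writes_reg(instr):
    # Assumption for this sketch: only add, nand and lw write a register.
    return instr is not None and instr["op"] in ("add", "nand", "lw")

def must_stall(decode, in_flight):
    """decode: {'op','regA','regB'}; in_flight: instrs in EX and Mem."""
    for older in in_flight:
        if writes_reg(older) and older["dest"] in (decode["regA"], decode["regB"]):
            return True
    return False

add_instr  = {"op": "add",  "dest": 3}               # add 1 2 3
nand_instr = {"op": "nand", "regA": 3, "regB": 4}    # nand 3 4 5
print(must_stall(nand_instr, [add_instr, None]))     # True: RAW hazard on R3
```

When `must_stall` is true, the stall itself is implemented by disabling the PC and IF/ID write enables and muxing a noop into ID/EX, exactly as the next figures show.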
Hazard detection
[Figure sequence: detect-and-stall on add 1 2 3 followed by nand 3 4 5]
First half of cycle 3: nand 3 4 5 is in decode while add (dest R3) is ahead of it; comparators match the nand’s regA=3 and regB=3 against the dest fields in the downstream pipeline registers, and a hazard is detected
First half of cycle 3 (stall): the PC and IF/ID write enables (en) are turned off, so fetch and decode keep their current instructions
End of cycle 3: a noop has been passed into execute; nand stays in decode
First half of cycle 4: the hazard is still detected (add has only reached memory), so the pipeline stalls again
End of cycle 4: another noop enters execute
First half of cycle 5: no hazard; add is in writeback, so nand can read R3
End of cycle 5: nand moves into execute with the correct contents of R3
No more stalling
add 1 2 3
nand 3 4 5

time →
add    fetch  decode  execute  memory  writeback
nand          fetch   decode   decode  decode    execute  memory  writeback
                      hazard   hazard

Assume the Register File gives the right value of R3 when read/written during the same cycle.
Problems with detect and stall
CPI increases every time a hazard is detected!
Is that necessary? Not always! Re-route the result of the add to the nand:
• nand no longer needs to read R3 from the reg file
• It can get the data later (when it is ready)
• This lets us complete the decode this cycle
– But we need more control to remember that the data we aren’t getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.
Handling data hazards III: Detect and forward
Detect: same as detect and stall, except that all 4 hazards are treated differently
• i.e., you can’t logical-OR the 4 hazard signals
Forward:
New bypass datapaths route computed data to where it is needed
A new MUX and control to pick the right data
• Beware: stalling may still be required even in the presence of forwarding
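The control for one of those new MUXes can be sketched as follows (illustrative Python with hypothetical field names, not the LC3101 control logic). The priority matters: take the value from the youngest producer still in flight, so EX/Mem beats Mem/WB, and the register file is only the fallback.

```python
# Forwarding MUX for one ALU input: pick the most recent value of src_reg.

def alu_input(src_reg, regfile, ex_mem, mem_wb):
    if ex_mem is not None and ex_mem["dest"] == src_reg:
        return ex_mem["alu_result"]          # bypass from EX/Mem
    if mem_wb is not None and mem_wb["dest"] == src_reg:
        return mem_wb["value"]               # bypass from Mem/WB
    return regfile[src_reg]                  # no hazard: normal register read

regs = [0, 36, 9, 12, 18, 7, 41, 22]         # R0..R7, as in the sample run
ex_mem = {"dest": 3, "alu_result": 45}       # add 1 2 3 just produced R3 = 45
print(alu_input(3, regs, ex_mem, None))      # 45, not the stale 12
print(alu_input(5, regs, ex_mem, None))      # 7, straight from the reg file
```

A load is the exception: its value only exists after the memory stage, which is why the slide warns that stalling may still be required.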
Sample Code
Which hazards do you see?
add 1 2 3
nand 3 4 5
add 4 3 7
add 6 3 7
lw 3 6 10
sw 6 2 12
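One way to check your answer mechanically: a toy RAW-hazard finder (illustrative encoding: add/nand are "op srcA srcB dest", lw is "op srcA dest offset", sw is "op srcA srcB offset"; assuming same-cycle register-file write/read works, only the two preceding instructions can conflict).

```python
# Lists (producer index, consumer index, register) RAW-hazard triples.

def reads_of(op, a, b):
    if op in ("add", "nand", "sw"):
        return {a, b}
    if op == "lw":
        return {a}
    return set()

def dest_of(op, a, b, c):
    if op in ("add", "nand"):
        return c
    if op == "lw":
        return b
    return None            # sw writes memory, not a register

def raw_hazards(prog):
    found = []
    for i, (op, a, b, c) in enumerate(prog):
        for j in range(max(0, i - 2), i):   # only 2 older instrs can conflict
            dest = dest_of(*prog[j])
            if dest is not None and dest in reads_of(op, a, b):
                found.append((j, i, dest))
    return found

prog = [("add", 1, 2, 3), ("nand", 3, 4, 5), ("add", 4, 3, 7),
        ("add", 6, 3, 7), ("lw", 3, 6, 10), ("sw", 6, 2, 12)]
print(raw_hazards(prog))   # [(0, 1, 3), (0, 2, 3), (4, 5, 6)]
```

Note the add 6 3 7 / add 1 2 3 pair is three instructions apart, so it is covered by the same-cycle register-file behavior; and the last hazard is a load feeding a store (R6), the case forwarding alone cannot fully hide.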
[Figure sequence: cycle-by-cycle snapshots (first halves and ends of cycles 3 through 8) of detect-and-forward running the sample code. Forwarding MUXes and their controls (H1, H2, H3) steer the in-flight ALU result or memory read data to the ALU inputs, so dependent instructions like nand 3 4 5 and add 4 3 7 keep moving through decode instead of stalling. The lw 3 6 10 / sw 6 2 12 dependence on R6 still forces a stall (a noop is injected), because the loaded value is only available after lw’s memory stage.]
Control hazards
How can the pipeline handle branch and jump instructions?
Pipeline function for BEQ
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate target address and test for equality
Memory: Send target to PC if test is equal
Writeback: Nothing left to do
Control Hazards
beq 1 1 10
sub 3 4 5

time →
beq    fetch  decode  execute  memory  writeback
sub           fetch   decode   execute
Approaches to handling control hazards
Avoid: make sure there are no hazards in the code.
Detect and Stall: delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case the prediction is correct, but stop them if they shouldn’t have been executed.
Handling control hazards I: Avoid all hazards
Don’t have branch instructions! (Maybe a little impractical.)
Delay taking the branch: dbeq r1 r2 offset
Instructions at PC+1, PC+2, etc. will execute before deciding whether to fetch from PC+1+offset. (If no useful instructions can be placed after dbeq, noops must be inserted.)
Problems with this solution
Old programs (legacy code) may not run correctly on new implementations. Longer pipelines need more instructions/noops after a delayed beq.
Programs get larger as noops are included. This is especially a problem for machines that try to execute more than one instruction every cycle. Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI equals 1, but some instructions are noops.
Handling control hazards II: Detect and stall
Detection:
Must wait until decode
Compare opcode to beq or jalr
Alternately, this is just another control signal
Stall:
Keep current instructions in fetch
Pass a noop to the decode stage (not execute!)
[Figure: the pipelined datapath with control logic – when decode recognizes a branch, a MUX feeds a noop into the ID/EX register while the PC and IF/ID hold their values]
Control Hazards
beq 1 1 10
sub 3 4 5

time →
beq      fetch  decode  execute  memory  writeback
sub             fetch   fetch    fetch   fetch
or
Target:                                  fetch
Problems with detect and stall
CPI increases every time a branch is detected!
Is that necessary? Not always! Only about ½ of the time is the branch taken.
• Let’s assume that it is NOT taken…
– In this case, we can ignore the beq (treat it like a noop)
– Keep fetching PC + 1
• What if we are wrong?
– OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)
Handling control hazards III: Speculate and squash
Speculate: assume not equal. Keep fetching from PC+1 until we know that the branch is really taken.
Squash: stop bad instructions if taken. Send a noop to decode, execute and memory, and send the target address to the PC.
[Figure: squashing – a taken beq reaches memory while the add, sub, and nand fetched behind it are still in flight; noops replace them in decode, execute, and memory, and the target address is sent through the MUX to the PC]
Problems with fetching PC+1
CPI increases every time a branch is taken! (About ½ of the time.)
Is that necessary?
No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken???
[Figure: the datapath with a branch predictor (bpc) beside the PC – the predicted target is fetched immediately, and the target and eq? results computed later confirm or correct the prediction]
Branch prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~97% accurate
Role of the Compiler
• The primary user of the instruction set
– Exceptions: getting less common
• Some device drivers; specialized library routines
• Some small embedded systems (synthesized arch)
• Compilers must: generate a correct translation into machine code
• Compilers should: compile fast; generate fast code
• While we are at it: generate reasonable code size; good debug support
Structure of Compilers
• Front-end: translates high-level semantics to some generic intermediate form
– The intermediate form does not have any resource constraints, but uses simple instructions.
• Back-end: translates the intermediate form into assembly/machine code for the target architecture
– Resource allocation; code optimization under resource constraints
Architects are mostly concerned with optimization.
Typical optimizations: CSE
• Common sub-expression elimination
c = array1[d+e] / array2[d+e];
becomes
i = d+e;
c = array1[i] / array2[i];
• Purpose: reduce instructions / faster code
• Architectural issues: more register pressure
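The transformation itself can be sketched as a tiny pass over three-address code (purely illustrative Python; a real pass must also invalidate an expression when one of its operands is redefined):

```python
# Toy common-sub-expression elimination over (dest, op, src1, src2) tuples.

def cse(code):
    seen, out = {}, []
    for dest, op, a, b in code:
        key = (op, a, b)
        if key in seen:
            out.append((dest, "copy", seen[key], None))  # reuse earlier result
        else:
            seen[key] = dest
            out.append((dest, op, a, b))
    return out

before = [("t1", "add", "d", "e"),    # d+e for the array1 index
          ("t2", "add", "d", "e")]    # d+e recomputed for array2
print(cse(before))                    # second add becomes a copy of t1
```

The “copy” that replaces the second add is exactly where the register pressure comes from: t1 must stay live until its last reuse.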
Typical optimization: LICM
• Loop invariant code motion
for (i=0; i<100; i++) { t = 5; array1[i] = t; }
The assignment t = 5 can be hoisted above the loop.
• Purpose: remove statements or expressions from loops that need only be executed once (idempotent)
• Architectural issues: more register pressure
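The hoisting decision reduces to a dependence test. A toy sketch in Python (illustrative only; a real pass must also check for side effects and that the hoisted definition still dominates all of its uses):

```python
# Toy loop-invariant code motion: a statement whose operands do not
# depend on anything that changes inside the loop can run once, outside.

def licm(loop_body, loop_vars):
    hoisted, kept = [], []
    for stmt, used_vars in loop_body:
        if set(used_vars) & loop_vars:
            kept.append(stmt)         # depends on the loop: stays inside
        else:
            hoisted.append(stmt)      # execute once, before the loop
    return hoisted, kept

body = [("t = 5", []), ("array1[i] = t", ["i", "t"])]
print(licm(body, {"i"}))   # (['t = 5'], ['array1[i] = t'])
```

As with CSE, the hoisted value (t here) now lives in a register across the whole loop, which is the register-pressure cost the slide mentions.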
Other transformations
• Procedure inlining: better instruction schedule
– greater code size, more register pressure
• Loop unrolling: better loop schedule
– greater code size, more register pressure
• Software pipelining: better loop schedule
– greater code size, more register pressure
• In general, “global” optimization: faster code
– greater code size, more register pressure
Compiled code characteristics
• Optimized code has different characteristics than unoptimized code.
– Fewer memory references, but it is generally the “easy ones” that are eliminated
• Example: better register allocation retains active data in the register file; these would be cache hits in unoptimized code.
– Removing redundant memory and ALU operations leaves a higher ratio of branches in the code
• Branch prediction becomes more important
Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure.
What do compiler writers want in an instruction set architecture?
• More resources: better optimization tradeoffs
• Regularity: same behaviour in all contexts
– no special cases (flags set differently for immediates)
• Orthogonality:
– data type independent of addressing mode
– addressing mode independent of operation performed
• Primitives, not solutions:
– keep instructions simple
– it is easier to compose than to fit (e.g. MMX operations)
What do architects want in an instruction set architecture?
• Simple instruction decode:
– tends to increase orthogonality
• Small structures:
– more resource constraints
• Small data bus fanout:
– tends to reduce orthogonality and regularity
• Small instructions:
– make things implicit
– non-regular, non-orthogonal, non-primitive
To make faster processors
• Make the compiler team unhappy– More aggressive optimization over the entire program
– More resource constraints; caches; HW schedulers
– Higher expectations: increase IPC
• Make the hardware design team unhappy
– Tighter design constraints (clock)
– Execute optimized code with more complex execution characteristics
– Make all stages bottlenecks (Amdahl’s law)