CSC 2405 Computer Systems II Advanced Topics. Instruction Set Architecture.
CSC 2405 Computer Systems II
Advanced Topics
Instruction Set Architecture
Chapter 4
Instruction Set Architecture
Assembly Language View
– Processor state: registers, memory, …
– Instructions: addl, movl, leal, …
  How instructions are encoded as bytes
Layer of Abstraction
– Above: how to program machine
  Processor executes instructions in a sequence
– Below: what needs to be built
  Use variety of tricks to make it run fast
  E.g., execute multiple instructions simultaneously
[Figure: abstraction layers, top to bottom: Application Program; Compiler, OS; ISA; CPU Design; Circuit Design; Chip Layout]
Instruction Set Architectures
Basic ISA Classes

Stack     Accumulator   Register            Register
                        (register-memory)   (load-store)
Push A    Load A        Load R1, A          Load R1, A
Push B    Add B         Add R1, B           Load R2, B
Add       Store C       Store C, R1         Add R3, R1, R2
Pop C                                       Store C, R3

The differences among the ISA classes are easiest to see in the examples here, all of which implement C = A + B.
Register architectures are the class that won out. In general, more registers on the CPU let more operands stay out of memory.
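The four sequences for C = A + B can be traced in a few lines of Python. This is only a sketch: memory is modeled as a dictionary keyed by variable name, and the function names are made up for illustration.

```python
def run_stack(mem):
    """Stack machine: Push A, Push B, Add, Pop C."""
    stack = []
    stack.append(mem["A"])                   # Push A
    stack.append(mem["B"])                   # Push B
    stack.append(stack.pop() + stack.pop())  # Add (operands come from the stack)
    mem["C"] = stack.pop()                   # Pop C
    return mem["C"]

def run_accumulator(mem):
    """Accumulator machine: Load A, Add B, Store C."""
    acc = mem["A"]       # Load A
    acc += mem["B"]      # Add B (second operand read from memory)
    mem["C"] = acc       # Store C
    return mem["C"]

def run_register_memory(mem):
    """Register-memory machine: Load R1, A; Add R1, B; Store C, R1."""
    r1 = mem["A"]        # Load R1, A
    r1 += mem["B"]       # Add R1, B (one operand still comes from memory)
    mem["C"] = r1        # Store C, R1
    return mem["C"]

def run_load_store(mem):
    """Load-store machine: only loads and stores touch memory."""
    r1 = mem["A"]        # Load R1, A
    r2 = mem["B"]        # Load R2, B
    r3 = r1 + r2         # Add R3, R1, R2 (registers only)
    mem["C"] = r3        # Store C, R3
    return mem["C"]
```

All four leave the same value in C; what differs is where the operands of the add live.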
80x86 Instruction Frequency
Relative Frequency of Control Instructions

Operation     Integer   Floating Pt
Call/Return   19%       8%
Jumps         6%        10%
Branches      75%       82%

Design hardware to handle branches quickly, since these occur most frequently
CISC Instruction Sets
– Complex Instruction Set Computer
– Dominant style through mid-80’s
Stack-oriented instruction set
– Use stack to pass arguments, save program counter
– Explicit push and pop instructions
Arithmetic instructions can access memory
– addl %eax, 12(%ebx,%ecx,4)
  Requires memory read and write
  Complex address calculation
Condition codes
– Set as side effect of arithmetic and logical instructions
Philosophy
– Add instructions to perform “typical” programming tasks
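To make the "complex address calculation" in addl %eax, 12(%ebx,%ecx,4) concrete, here is a minimal sketch. It assumes the usual x86 rule that the operand address is displacement + base + index × scale; the register and memory dictionaries and the function names are illustrative only.

```python
def effective_address(disp, base, index, scale):
    """Address of an operand written as disp(base, index, scale)."""
    return disp + base + index * scale

def addl_eax_to_memory(regs, memory):
    """Model of: addl %eax, 12(%ebx,%ecx,4)."""
    addr = effective_address(12, regs["ebx"], regs["ecx"], 4)
    old = memory[addr]                               # memory read
    memory[addr] = (old + regs["eax"]) & 0xFFFFFFFF  # memory write, wrapped to 32 bits
    return addr
```

With %ebx = 100 and %ecx = 3 the instruction touches address 12 + 100 + 3 × 4 = 124, reading it once and writing it once.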
RISC Instruction Sets
– Reduced Instruction Set Computer
– Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
Fewer, simpler instructions
– Might take more to get given task done
– Can execute them with small and fast hardware
Register-oriented instruction set
– Many more (typically 32) registers
– Use for arguments, return pointer, temporaries
Only load and store instructions can access memory
– Similar to Y86 mrmovl and rmmovl
No condition codes
– Test instructions return 0/1 in register
Example RISC Instruction Formats

Register-Register (R-type): ADD R1, R2, R3
(ALU reg. operations, read/write special registers and moves)
  Bits:   31-26   25-21   20-16   15-11   10-6   5-0
  Field:  Op      rs1     rs2     rd             func

Register-Immediate (I-type): SUB R1, R2, #3
(ALU imm. operations, loads and stores, conditional branch, jump (and link))
  Bits:   31-26   25-21   20-16   15-0
  Field:  Op      rs1     rd      immediate

Jump / Call (J-type): JUMP end
(jump, jump and link, trap and return from exception)
  Bits:   31-26   25-0
  Field:  Op      offset added to PC
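As an illustration of how fixed fields make encoding and decoding simple, the sketch below packs and unpacks the I-type layout from the slide (Op in bits 31-26, rs1 in 25-21, rd in 20-16, immediate in 15-0). The function names are hypothetical, and any opcode value passed in is made up; the slide does not give opcode numbers.

```python
def encode_i_type(op, rs1, rd, imm):
    """Pack an I-type instruction into a 32-bit word."""
    assert 0 <= op < 64 and 0 <= rs1 < 32 and 0 <= rd < 32
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)

def decode_i_type(word):
    """Extract the I-type fields with shifts and masks."""
    return {
        "op": (word >> 26) & 0x3F,   # 6 bits
        "rs1": (word >> 21) & 0x1F,  # 5 bits
        "rd": (word >> 16) & 0x1F,   # 5 bits
        "imm": word & 0xFFFF,        # 16 bits, left unsigned here
    }
```

Because every field sits at a fixed position, the hardware can pull out the register numbers without first working out which instruction it is looking at.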
CISC vs. RISC
Original Debate
– Strong opinions!
– CISC proponents: easy for compiler, fewer code bytes
– RISC proponents: better for optimizing compilers, can make run fast with simple chip design
Current Status
– For desktop processors, choice of ISA not a technical issue
  With enough hardware, can make anything run fast
  Code compatibility more important
– For embedded processors, RISC makes sense
  Smaller, cheaper, less power
Logic Design
Overview of Logic Design
Fundamental Hardware Requirements
– Communication
  How to get values from one place to another
– Computation
– Storage
Bits are Our Friends
– Everything expressed in terms of values 0 and 1
– Communication
  Low or high voltage on wire
– Computation
  Compute Boolean functions
– Storage
  Store bits of information
Digital Signals
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
  Either high range (1) or low range (0)
  With guard range between them
– Not strongly affected by noise or low quality circuit elements
  Can make circuits simple, small, and fast
[Figure: voltage vs. time waveform read as 0, 1, 0]
Computing with Logic Gates
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
  With some small delay
And: out = a && b
Or: out = a || b
Not: out = !a
[Figure: voltage vs. time for a, b, and a && b, showing the rising delay and falling delay of the output]
Combinational Circuits
Acyclic Network of Logic Gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) Boolean functions of primary inputs
[Figure: acyclic network with primary inputs on one side and primary outputs on the other]
Bit Equality
– Generate 1 if a and b are equal
HCL Expression
bool eq = (a&&b)||(!a&&!b)
Hardware Control Language (HCL)
– Very simple hardware description language
  Boolean operations have syntax similar to C logical operations
– We’ll use it to describe control logic for processors
[Figure: "Bit equal" circuit with inputs a and b and output eq]
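The HCL expression for bit equality translates directly into Python's logical operators. The short sketch below (the function name is illustrative) can be checked against all four input combinations.

```python
def bit_eq(a, b):
    """Python rendering of the HCL expression (a && b) || (!a && !b)."""
    return (a and b) or (not a and not b)

def bit_eq_truth_table():
    """All four input combinations with the resulting eq value."""
    return [(a, b, bit_eq(a, b)) for a in (False, True) for b in (False, True)]
```

The output is true exactly when both inputs are 1 or both are 0, which is the complement of exclusive-or.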
Word Equality
– 32-bit word size
– HCL representation
  Equality operation
  Generates Boolean value
[Figure: 32 "Bit equal" blocks compare a31/b31, a30/b30, ..., a1/b1, a0/b0 to produce eq31 ... eq0, which are combined into Eq]
Word-Level Representation
[Figure: block "==" with inputs A and B and output Eq]
HCL Representation
bool Eq = (A == B)
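A word-level comparison can be built from per-bit comparisons. The sketch below assumes the individual eq bits are combined with AND, which is the natural reading of the figure; the function name and the `width` parameter are illustrative.

```python
def word_eq(a, b, width=32):
    """Compare two words bit by bit and AND the per-bit results."""
    result = True
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        eq_i = (a_i and b_i) or (not a_i and not b_i)  # bit-equal block
        result = result and bool(eq_i)                 # combine into Eq
    return result
```

In HCL the whole structure is written simply as A == B; the loop only spells out what that operator stands for in hardware.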
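1-Bit Latch
D Latch
[Figure: D latch with inputs D (Data) and C (Clock), internal R and S signals, and outputs Q+ and Q–]
Latching
[Figure: with C = 1 and D = d, the outputs become Q+ = d and Q– = !d]
Storing
[Figure: with C = 0, R and S are both 0 and the latch holds Q+ = q and Q– = !q whatever the value of d]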
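The two modes of the D latch can be mimicked with a small behavioral model. This is a sketch of the input/output behavior only, not of the gate-level R/S circuit, and the class and method names are made up.

```python
class DLatch:
    """Behavioral model: transparent while C is 1, holds while C is 0."""

    def __init__(self, q=False):
        self.q = q  # stored value, i.e. Q+

    def update(self, d, c):
        if c:           # latching: output follows the data input
            self.q = d
        return self.q   # storing: previous value is kept when c is 0
```

Changing D while the clock is low has no effect on the output, which is what makes the latch a storage element.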
Registers
– Stores word of data
  Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock
Structure
[Figure: eight edge-triggered latches (D, C, Q+) side by side with inputs i7 ... i0 and outputs o7 ... o0, all driven by one Clock; drawn as a single block with input I, output O, and Clock]
Random-Access Memory
– Stores multiple words of memory
  Address input specifies which word to read or write
– Register file
  Holds values of program registers %eax, %esp, etc.
  Register identifier serves as address
  ID 8 implies no read or write performed
– Multiple Ports
  Can read and/or write multiple words in one cycle
  Each has separate address and data input/output
[Figure: register file with read ports A (srcA, valA) and B (srcB, valB), write port W (dstW, valW), and Clock]
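A register file with two read ports and one write port might be modeled as below. The sketch assumes eight program registers with IDs 0-7, reads that return 0 when the ID is 8, and a write that happens when `clock()` is called; those choices, and all names, are assumptions rather than something the slide specifies.

```python
RNONE = 8  # register ID meaning "no read or write performed"

class RegisterFile:
    def __init__(self):
        self.regs = [0] * 8  # assumption: eight program registers, IDs 0-7

    def read(self, src_a, src_b):
        """Read ports A and B: the register ID acts as the address."""
        val_a = self.regs[src_a] if src_a != RNONE else 0
        val_b = self.regs[src_b] if src_b != RNONE else 0
        return val_a, val_b

    def clock(self, dst_w, val_w):
        """Write port W: update the register on the clock edge."""
        if dst_w != RNONE:
            self.regs[dst_w] = val_w
```

Having separate ports is what lets one instruction read two source registers and write a destination in the same cycle.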
Basic Logic Gates
NOTE: okay to use just a circle for NOT:
More than 2 Inputs?
AND/OR can take any number of inputs.
– AND = 1 if all inputs are 1.
– OR = 1 if any input is 1.
– Similar for NAND/NOR.
Can implement with multiple two-input gates
Logical Completeness
Can implement ANY truth table with AND, OR, NOT.

A B C D
0 0 0 0
0 0 1 0
0 1 0 1
0 1 1 0
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 0

1. AND combinations that yield a "1" in the truth table.
2. OR the results of the AND gates.
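The two-step recipe can be applied mechanically to the truth table above. The sketch below builds one AND term per row whose output is 1 (inverting the inputs that are 0 in that row) and then ORs the terms; the names `TRUTH_TABLE` and `sum_of_products` are illustrative.

```python
TRUTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 0,
}

def sum_of_products(table):
    """Build a circuit function from a truth table using AND, OR, NOT only."""
    rows_with_one = [inputs for inputs, out in table.items() if out == 1]

    def circuit(*bits):
        # Step 2: OR together the outputs of the AND gates.
        return int(any(
            # Step 1: one AND gate per row; NOT the inputs that are 0 in that row.
            all(bit if want else not bit for bit, want in zip(bits, row))
            for row in rows_with_one
        ))

    return circuit

d = sum_of_products(TRUTH_TABLE)
```

For this table the result corresponds to D = (!A && B && !C) || (A && !B && C).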
DeMorgan's Law
Converting AND to OR (with some help from NOT)
Consider the following gate: [Figure: AND gate with inverted inputs and inverted output]

A B   !A  !B  !A && !B  !(!A && !B)
0 0   1   1   1         0
0 1   1   0   0         1
1 0   0   1   0         1
1 1   0   0   0         1

To convert AND to OR (or vice versa), invert inputs and output.
Decoder
n inputs, 2^n outputs
– exactly one output is 1 for each possible input pattern
[Figure: 2-bit decoder]
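A decoder can be sketched as a function from n input bits to a list of 2^n outputs. The ordering convention here (most significant input first, output i corresponding to the binary value i) is an assumption, as is the function name.

```python
def decoder(bits):
    """bits: list of 0/1 inputs, most significant first.

    Returns 2**n outputs with exactly one of them set to 1.
    """
    index = 0
    for b in bits:
        index = (index << 1) | b  # interpret the inputs as a binary number
    return [1 if i == index else 0 for i in range(2 ** len(bits))]
```

For the 2-bit case, the input pattern 1 0 selects output 2 and leaves the other three outputs at 0.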
Sequential Processors
Sequential HW Structure
State
– Program counter register (PC)
– Condition code register (CC)
– Register File
– Memories
  Access same memory space
  Data: for reading/writing program data
  Instruction: for reading instructions
Instruction Flow
– Read instruction at address specified by PC
– Process through stages
– Update program counter
[Figure: SEQ hardware structure. Stages: Fetch, Decode, Execute, Memory, Write back, PC. Blocks: instruction memory, PC increment, register file (ports A, B, M, E), ALU, CC, data memory. Signals: icode, ifun, rA, rB, valC, valP, srcA, srcB, dstA, dstB, valA, valB, aluA, aluB, Bch, valE, Addr, Data, valM, newPC]
Sequential Stages
Fetch
– Read instruction from instruction memory
Decode
– Read program registers
Execute
– Compute value or address
Memory
– Read or write data
Write Back
– Write program registers
PC
– Update program counter
[Figure: SEQ hardware structure diagram with the stages Fetch, Decode, Execute, Memory, Write back, and PC]
Instruction Decoding
Instruction Format
– Instruction byte icode:ifun
– Optional register byte rA:rB
– Optional constant word valC
[Figure: example encoding "5 0 | rA rB | D", split into icode:ifun, rA:rB (optional), and valC (optional)]
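Splitting an instruction into these fields comes down to taking nibbles from the first one or two bytes and then reading the constant. The sketch below assumes a 4-byte little-endian constant and leaves it to the caller to say whether the register byte and constant are present; the function and parameter names are hypothetical.

```python
def decode(code, need_regids, need_valc):
    """Split raw instruction bytes into icode:ifun, rA:rB, and valC."""
    icode = code[0] >> 4   # high nibble of the instruction byte
    ifun = code[0] & 0xF   # low nibble of the instruction byte
    pos = 1
    ra = rb = None
    if need_regids:        # optional register byte
        ra = code[pos] >> 4
        rb = code[pos] & 0xF
        pos += 1
    valc = None
    if need_valc:          # optional constant word (assumed 4 bytes, little-endian)
        valc = int.from_bytes(code[pos:pos + 4], "little")
        pos += 4
    return {"icode": icode, "ifun": ifun, "rA": ra, "rB": rb,
            "valC": valc, "length": pos}
```

For the bytes 50 15 0C 00 00 00 this gives icode 5, ifun 0, rA 1, rB 5, and a constant of 12, with a total length of 6 bytes.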
Sequential Summary
Implementation
– Express every instruction as series of simple steps
– Follow same general flow for each instruction type
– Assemble registers, memories, predesigned combinational blocks
– Connect with control logic
Limitations
– Too slow to be practical
– In one cycle, must propagate through instruction memory, register file, ALU, and data memory
– Would need to run clock very slowly
– Hardware units only active for fraction of clock cycle
Pipelined Processors
What is Pipelining
Computers execute billions of instructions, so instruction throughput is what matters
IDEA: Divide instruction execution up into several pipeline stages. For example
  IF ID EX MEM WB
Simultaneously have different instructions in different pipeline stages
The length of the longest pipeline stage determines the cycle time
Desirable pipeline features (e.g., RISC):
– all instructions same length
– registers located in same place in instruction format
– memory operands only in loads or stores
What Is Pipelining
Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
[Figure: four loads of laundry labeled A, B, C, D]
What Is Pipelining
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: task order A, B, C, D vs. time from 6 PM to midnight; each load takes 30 + 40 + 20 minutes and starts only after the previous load finishes]
What Is Pipelining
Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
[Figure: task order A, B, C, D vs. time starting at 6 PM; with the loads overlapped, the timeline segments are 30, 40, 40, 40, 40, 20 minutes]
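The 6-hour and 3.5-hour figures can be reproduced with a small calculation. The sketch below assumes each machine handles one load at a time and that a finished load can wait for the next machine; the function names are illustrative.

```python
def sequential_time(stages, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_time(stages, loads):
    """Each load starts a stage as soon as it and the machine are both ready."""
    free_at = [0] * len(stages)  # time at which each machine is next free
    finish = 0
    for _ in range(loads):
        ready = 0                # time this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, free_at[s])
            ready = start + duration
            free_at[s] = ready
        finish = ready
    return finish

LAUNDRY = [30, 40, 20]  # washer, dryer, folder (minutes)
```

With four loads, the sequential schedule takes 4 × 90 = 360 minutes, while the pipelined one takes 30 + 4 × 40 + 20 = 210 minutes, because the 40-minute dryer is the stage that sets the pace.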
Pipelining Lessons
Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline for A, B, C, D starting at 6 PM, with segments of 30, 40, 40, 40, 40, 20 minutes]
Real-World Pipelines: Car Washes
Idea
– Divide process into independent stages
– Move objects through stages in sequence
– At any given time, multiple objects being processed
[Figure: three car-wash arrangements: Sequential, Parallel, Pipelined]
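Pipeline Diagrams
Unpipelined
– Cannot start new operation until previous one completes
[Figure: OP1, OP2, OP3 run one after another along the time axis]
3-Way Pipelined
– Up to 3 operations in process simultaneously
[Figure: OP1, OP2, OP3 each pass through stages A, B, C, each operation starting one stage after the previous one]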
Data Dependencies
System
– Each operation depends on result from preceding one
[Figure: combinational logic feeding a register (Reg) driven by Clock; timeline shows OP1, OP2, OP3 executing one after another]
Data Hazards
– Result does not feed back around in time for next operation
– Pipelining has changed behavior of system
[Figure: three combinational logic blocks A, B, C, each followed by a register (Reg) on a shared Clock; timeline shows OP1 through OP4 overlapped, each passing through A, B, C]
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7. Instruction order: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous one]
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7. Instruction order: Load, Instr 1, Instr 2, Stall, Instr 3. Each instruction passes through Ifetch, Reg, ALU, DMem, Reg; the Stall row is a chain of five bubbles]
How do you “bubble” the pipe?
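One way to picture the stall is to schedule instruction fetches around the cycles in which the single memory port is already taken. The sketch below assumes the five stages from the diagram (Ifetch, Reg, ALU, DMem, Reg), so an instruction fetched in cycle c uses data memory in cycle c + 3, and it assumes only loads and stores use the port in that stage. The function name is made up.

```python
def schedule_fetches(uses_data_memory):
    """Return the fetch cycle of each instruction with one shared memory port.

    uses_data_memory: one boolean per instruction, True for loads/stores.
    """
    port_busy = set()  # cycles in which an earlier instruction is in DMem
    fetch_cycles = []
    cycle = 1
    for uses_mem in uses_data_memory:
        while cycle in port_busy:
            cycle += 1                 # bubble: no instruction fetched this cycle
        fetch_cycles.append(cycle)
        if uses_mem:
            port_busy.add(cycle + 3)   # DMem stage is three cycles after Ifetch
        cycle += 1
    return fetch_cycles
```

For a load followed by four ordinary instructions, the fetches land in cycles 1, 2, 3, 5, and 6: the fourth instruction cannot be fetched in cycle 4 because the load is using the memory port then.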
Data Hazard on R1
Time (clock cycles), stages: IF, ID/RF, EX, MEM, WB
Instr. Order:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
[Figure: each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous one]
Three Generic Data Hazards

Read After Write (RAW)
Instr J tries to read operand before Instr I writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Write After Read (WAR)
Instr J writes operand before Instr I reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

Write After Write (WAW)
Instr J writes operand before Instr I writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
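The three hazard types can be told apart by comparing the destination and source registers of an earlier instruction I and a later instruction J. In this sketch an instruction is a hypothetical (dest, src1, src2) tuple of register names, matching the three-operand form used in the examples.

```python
def classify_hazards(instr_i, instr_j):
    """Return the set of potential hazards between I (earlier) and J (later)."""
    dest_i, *srcs_i = instr_i
    dest_j, *srcs_j = instr_j
    hazards = set()
    if dest_i in srcs_j:
        hazards.add("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        hazards.add("WAR")  # J writes what I reads
    if dest_i == dest_j:
        hazards.add("WAW")  # J writes what I writes
    return hazards
```

Applied to the I/J pairs above, the function reports RAW for add r1,r2,r3 followed by sub r4,r1,r3, WAR for sub r4,r1,r3 followed by add r1,r2,r3, and WAW for sub r1,r4,r3 followed by add r1,r2,r3.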
Data Forwarding
Naïve Pipeline
– Register isn’t written until completion of write-back stage
– Source operands read from register file in decode stage
  Needs to be in register file at start of stage
Observation
– Value generated in execute or memory stage
Trick
– Pass value directly from generating instruction to decode stage
– Needs to be available at end of decode stage
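The trick amounts to a selection step in the decode stage: prefer a value still travelling through the pipeline over the (stale) one in the register file. The sketch below assumes the in-flight results are presented as (destination, value) pairs ordered from youngest to oldest instruction; the function name and data layout are illustrative, not a description of any particular processor's logic.

```python
def select_operand(src, reg_file, in_flight):
    """Pick the value the decode stage should use for register `src`.

    in_flight: (dst, value) pairs for results not yet written back,
    ordered youngest instruction first.
    """
    for dst, value in in_flight:
        if dst == src:
            return value       # forwarded from a later pipeline stage
    return reg_file[src]       # no pending write: register file is current
```

Checking the youngest instruction first matters when two in-flight instructions write the same register, since the reader should see the most recent one.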
48Chapter 4
Time (clock cycles)
Forwarding to Avoid Data Hazard
Inst
r.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg