CSC 2405 Computer Systems II Advanced Topics. Instruction Set Architecture.

Instruction Set Architecture

Chapter 4

Instruction Set Architecture

Assembly Language View
– Processor state: registers, memory, …
– Instructions: addl, movl, leal, …
  How instructions are encoded as bytes

Layer of Abstraction
– Above: how to program the machine
  Processor executes instructions in a sequence
– Below: what needs to be built
  Use a variety of tricks to make it run fast
  E.g., execute multiple instructions simultaneously

[Figure: layers of abstraction with the ISA in the middle. Above: Application Program, Compiler, OS. Below: CPU Design, Circuit Design, Chip Layout.]

Instruction Set Architectures: Basic ISA Classes

Stack     Accumulator   Register (register-memory)   Register (load-store)
Push A    Load A        Load R1, A                   Load R1, A
Push B    Add B         Add R1, B                    Load R2, B
Add       Store C       Store C, R1                  Add R3, R1, R2
Pop C                                                Store C, R3

The effect of the different addressing classes is easiest to see in the examples above, each of which implements C = A + B.

Register architectures are the class that won out. The more registers on the CPU, the better.
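The stack and load-store sequences above can be mimicked with two toy interpreters. This is only a sketch: the tuple encoding of instructions and the function names run_stack and run_load_store are illustrative assumptions, not part of any real ISA.

```python
# Toy interpreters for two of the ISA classes, both computing C = A + B.
# The instruction tuples are a made-up encoding used only for illustration.

def run_stack(program, memory):
    """Stack machine: arithmetic operands are implicit (top of stack)."""
    stack = []
    for op, *args in program:
        if op == "push":            # push the value stored at a memory name
            stack.append(memory[args[0]])
        elif op == "add":           # pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "pop":           # pop the top of stack into memory
            memory[args[0]] = stack.pop()
    return memory


def run_load_store(program, memory):
    """Load-store machine: only load and store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "load":            # load rd, addr
            regs[args[0]] = memory[args[1]]
        elif op == "add":           # add rd, rs1, rs2 (registers only)
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":         # store addr, rs
            memory[args[0]] = regs[args[1]]
    return memory


stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
ls_prog = [("load", "R1", "A"), ("load", "R2", "B"),
           ("add", "R3", "R1", "R2"), ("store", "C", "R3")]

print(run_stack(stack_prog, {"A": 2, "B": 3})["C"])
print(run_load_store(ls_prog, {"A": 2, "B": 3})["C"])
```

Both programs leave the same value in C; the difference is where the operands live while the addition happens.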

80x86 Instruction Frequency

Relative Frequency of Control Instructions

Operation     Integer   Floating Pt
Call/Return   19%       8%
Jumps         6%        10%
Branches      75%       82%

Design hardware to handle branches quickly, since these occur most frequently.

CISC Instruction Sets
– Complex Instruction Set Computer
– Dominant style through mid-80's

Stack-oriented instruction set
– Use stack to pass arguments, save program counter
– Explicit push and pop instructions

Arithmetic instructions can access memory
– addl %eax, 12(%ebx,%ecx,4)
  Requires memory read and write
  Complex address calculation

Condition codes
– Set as side effect of arithmetic and logical instructions

Philosophy
– Add instructions to perform "typical" programming tasks
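The address calculation and the read-modify-write behind addl %eax, 12(%ebx,%ecx,4) can be sketched as follows. The dictionary-based registers and memory, and the function names, are assumptions made for illustration; memory is modeled as a map from address to a 32-bit word.

```python
MASK32 = 0xFFFFFFFF  # keep values in 32 bits, as on IA32


def effective_address(disp, base, index, scale):
    """Address of disp(base, index, scale): disp + base + index * scale."""
    return (disp + base + index * scale) & MASK32


def addl_eax_to_memory(regs, memory):
    """Model of: addl %eax, 12(%ebx,%ecx,4)."""
    addr = effective_address(12, regs["ebx"], regs["ecx"], 4)
    old = memory.get(addr, 0)                       # memory read
    memory[addr] = (old + regs["eax"]) & MASK32     # memory write
    return addr


regs = {"eax": 5, "ebx": 0x1000, "ecx": 3}
memory = {0x1018: 10}
addr = addl_eax_to_memory(regs, memory)
print(hex(addr), memory[addr])
```

A single instruction therefore performs an address computation, a load, an add, and a store, which is the kind of work a load-store ISA splits across several instructions.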

RISC Instruction Sets
– Reduced Instruction Set Computer
– Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)

Fewer, simpler instructions
– Might take more to get a given task done
– Can execute them with small and fast hardware

Register-oriented instruction set
– Many more (typically 32) registers
– Use for arguments, return pointer, temporaries

Only load and store instructions can access memory
– Similar to Y86 mrmovl and rmmovl

No condition codes
– Test instructions return 0/1 in register

Example RISC Instruction Formats

Register-Register (R-type): ADD R1, R2, R3
  Fields, high to low: Op, rs1, rs2, rd, func
  Bit boundaries: 31-26, 25-21, 20-16, 15-11, 10-6, 5-0
  (ALU reg. operations, read/write special registers and moves)

Register-Immediate (I-type): SUB R1, R2, #3
  Fields, high to low: Op (31-26), rs1 (25-21), rd (20-16), immediate (15-0)
  (ALU imm. operations, loads and stores, conditional branch, jump (and link))

Jump / Call (J-type): JUMP end
  Fields, high to low: Op (31-26), offset added to PC (25-0)
  (jump, jump and link, trap and return from exception)
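Packing and unpacking the I-type layout (Op in bits 31-26, rs1 in 25-21, rd in 20-16, immediate in 15-0) is a matter of shifts and masks. The sketch below assumes a sign-extended 16-bit immediate and uses a made-up opcode value; neither is stated on the slide.

```python
def encode_itype(op, rs1, rd, imm):
    """Pack an I-type instruction into a 32-bit word."""
    if not (0 <= op < 64 and 0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("field out of range")
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)


def decode_itype(word):
    """Unpack a 32-bit I-type word into its fields."""
    imm = word & 0xFFFF
    if imm & 0x8000:            # assumption: immediate is sign-extended
        imm -= 0x10000
    return {
        "op": (word >> 26) & 0x3F,
        "rs1": (word >> 21) & 0x1F,
        "rd": (word >> 16) & 0x1F,
        "imm": imm,
    }


HYPOTHETICAL_SUB_IMM = 0x0A   # opcode chosen for illustration only
word = encode_itype(HYPOTHETICAL_SUB_IMM, rs1=2, rd=1, imm=3)   # SUB R1, R2, #3
print(hex(word), decode_itype(word))
```

Because every field sits at a fixed position, the decoder can extract the register numbers before it knows which instruction it is looking at.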

CISC vs. RISC

Original Debate
– Strong opinions!
– CISC proponents: easy for compiler, fewer code bytes
– RISC proponents: better for optimizing compilers, can make run fast with simple chip design

Current Status
– For desktop processors, choice of ISA not a technical issue
  With enough hardware, can make anything run fast
  Code compatibility more important
– For embedded processors, RISC makes sense
  Smaller, cheaper, less power

Logic Design

Overview of Logic Design

Fundamental Hardware Requirements
– Communication
  How to get values from one place to another
– Computation
– Storage

Bits are Our Friends
– Everything expressed in terms of values 0 and 1
– Communication
  Low or high voltage on wire
– Computation
  Compute Boolean functions
– Storage
  Store bits of information

Digital Signals
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
  Either high range (1) or low range (0)
  With guard range between them
– Not strongly affected by noise or low-quality circuit elements
  Can make circuits simple, small, and fast

[Figure: voltage versus time for a signal carrying the bit values 0, 1, 0.]

Computing with Logic Gates
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
  With some small delay

And: out = a && b
Or:  out = a || b
Not: out = !a

[Figure: gate symbols for And, Or, and Not, and a voltage-versus-time diagram of a, b, and a && b showing the rising delay and falling delay.]

Combinational Circuits

Acyclic Network of Logic Gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) Boolean functions of primary inputs

[Figure: acyclic network with primary inputs on one side and primary outputs on the other.]

Bit Equality
– Generate 1 if a and b are equal

Hardware Control Language (HCL)
– Very simple hardware description language
  Boolean operations have syntax similar to C logical operations
– We'll use it to describe control logic for processors

HCL Expression
  bool eq = (a&&b)||(!a&&!b)

[Figure: "Bit equal" circuit with inputs a and b and output eq.]
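The HCL expression for bit equality can be checked against its truth table. This is a direct Python transcription of the expression; the function name bit_eq is an assumption.

```python
def bit_eq(a, b):
    """Transcription of the HCL expression (a && b) || (!a && !b)."""
    return (a and b) or ((not a) and (not b))


# Print the full truth table: a, b, eq
for a in (False, True):
    for b in (False, True):
        print(int(a), int(b), int(bit_eq(a, b)))
```

The output column is 1 exactly on the rows where a and b match.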

Word Equality
– 32-bit word size
– HCL representation
  Equality operation
  Generates Boolean value

HCL Representation
  bool Eq = (A == B)

[Figure: bit-level structure with 32 "Bit equal" blocks comparing a31/b31 down to a0/b0, producing eq31 … eq0, which are combined into Eq. Word-level representation: a single "==" block with inputs A and B and output Eq.]
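The word-level equality can be built from 32 bit-level comparisons whose results are ANDed together, as in the structure described above. A minimal sketch, with the names word_eq and WORD_BITS chosen for illustration:

```python
WORD_BITS = 32


def bit_eq(a, b):
    """Bit equality on 0/1 integers: (a AND b) OR (NOT a AND NOT b)."""
    return (a & b) | ((a ^ 1) & (b ^ 1))


def word_eq(A, B):
    """AND together the bit-equal outputs eq0 .. eq31."""
    eq = 1
    for i in range(WORD_BITS):
        eq &= bit_eq((A >> i) & 1, (B >> i) & 1)
    return bool(eq)


print(word_eq(0xDEADBEEF, 0xDEADBEEF))
print(word_eq(0x80000000, 0x00000000))
```

In HCL this whole structure is written as the single expression A == B.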

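1-Bit Latch: D Latch

[Figure: D latch with inputs D (data) and C (clock), internal signals R and S, and outputs Q+ and Q–. Latching: with C = 1, the outputs follow the data input (Q+ = d, Q– = !d). Storing: with C = 0, the outputs hold the previously latched value (Q+ = q, Q– = !q).]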

Registers
– Stores word of data
  Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock

[Figure: structure of an 8-bit register. Eight edge-triggered latches (inputs D and C, output Q+) share one Clock, with inputs i7 … i0 and outputs o7 … o0. Block symbol: input I, output O, and Clock.]
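The "loads input on rising edge of clock" behavior can be sketched with a small class that remembers the previous clock level. The class name, the tick method, and the 8-bit default width are assumptions for illustration.

```python
class Register:
    """Behavioral model of an edge-triggered register."""

    def __init__(self, width=8):
        self.mask = (1 << width) - 1
        self.output = 0
        self._prev_clock = 0

    def tick(self, clock, data_in):
        """Present clock level and input; the output changes only on 0 -> 1."""
        if self._prev_clock == 0 and clock == 1:
            self.output = data_in & self.mask
        self._prev_clock = clock
        return self.output


reg = Register()
print(reg.tick(0, 0xAB))   # clock low: output keeps its old value
print(reg.tick(1, 0xAB))   # rising edge: input is loaded
print(reg.tick(1, 0xCD))   # clock still high: input change is ignored
```

Between rising edges the output is stable no matter what the input does, which is what lets registers act as barriers between blocks of combinational logic.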

Random-Access Memory
– Stores multiple words of memory
  Address input specifies which word to read or write
– Register file
  Holds values of program registers: %eax, %esp, etc.
  Register identifier serves as address
  – ID 8 implies no read or write performed
– Multiple ports
  Can read and/or write multiple words in one cycle
  – Each has separate address and data input/output

[Figure: register file with read ports A (srcA, valA) and B (srcB, valB), write port W (dstW, valW), and Clock.]
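A register file with two read ports and one write port might be modeled as below. Several details are assumptions rather than facts from the slide: eight registers with IDs 0 to 7, reads that return 0 when the ID is 8, and a write that takes effect when clock_edge is called.

```python
RNONE = 8  # register ID meaning "no read or write performed"


class RegisterFile:
    """Sketch of a register file: read ports A and B, write port W."""

    def __init__(self, num_regs=8):
        self.regs = [0] * num_regs

    def read(self, srcA, srcB):
        """Both read ports are served in the same cycle."""
        valA = self.regs[srcA] if srcA != RNONE else 0
        valB = self.regs[srcB] if srcB != RNONE else 0
        return valA, valB

    def clock_edge(self, dstW, valW):
        """The write port updates a register only on the clock edge."""
        if dstW != RNONE:
            self.regs[dstW] = valW & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(dstW=3, valW=42)
rf.clock_edge(dstW=RNONE, valW=99)   # ID 8: nothing is written
print(rf.read(srcA=3, srcB=RNONE))
```

The register identifier plays the role of the address, exactly as in a small memory.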

Basic Logic Gates

[Figure: symbols for the basic logic gates.]

NOTE: it is okay to use just a circle for NOT.

More than 2 Inputs?

AND/OR can take any number of inputs.
– AND = 1 if all inputs are 1.
– OR = 1 if any input is 1.
– Similar for NAND/NOR.

Can implement with multiple two-input gates.

Logical Completeness

Can implement ANY truth table with AND, OR, NOT.

A  B  C  |  D
0  0  0  |  0
0  0  1  |  0
0  1  0  |  1
0  1  1  |  0
1  0  0  |  0
1  0  1  |  1
1  1  0  |  0
1  1  1  |  0

1. AND combinations that yield a "1" in the truth table.

2. OR the results of the AND gates.
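The two steps above (AND the input combinations that yield 1, then OR the results) can be sketched for the truth table on this slide. The helper name sum_of_products is an assumption.

```python
# Truth table from the slide: (A, B, C) -> D
TRUTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 0,
}


def sum_of_products(table):
    """Build a circuit function from a truth table using AND, OR, NOT."""
    # Step 1: one AND term per row whose output is 1.
    terms = [inputs for inputs, out in table.items() if out == 1]

    def circuit(*inputs):
        # An AND term is 1 when every input matches the row: the input
        # itself where the row has 1, its NOT where the row has 0.
        and_outputs = [
            all(bit == want for bit, want in zip(inputs, term))
            for term in terms
        ]
        # Step 2: OR the outputs of the AND terms.
        return int(any(and_outputs))

    return circuit


D = sum_of_products(TRUTH_TABLE)
print(D(0, 1, 0), D(1, 0, 1), D(1, 1, 1))
```

For this table the construction yields two AND terms, one for row 0 1 0 and one for row 1 0 1.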

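DeMorgan's Law

Converting AND to OR (with some help from NOT). Consider the following gate:

[Figure: AND gate with inverted inputs and inverted output.]

A  B  |  NOT A  NOT B  |  (NOT A) AND (NOT B)  |  NOT((NOT A) AND (NOT B))
0  0  |    1      1    |           1           |             0
0  1  |    1      0    |           0           |             1
1  0  |    0      1    |           0           |             1
1  1  |    0      0    |           0           |             1

The last column is the same as A OR B.

To convert AND to OR (or vice versa), invert inputs and output.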
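DeMorgan's law is small enough to verify exhaustively. The sketch below checks both directions over all four input combinations; the function names are illustrative.

```python
def or_from_and(a, b):
    """OR built from AND: invert the inputs and the output."""
    return not ((not a) and (not b))


def and_from_or(a, b):
    """AND built from OR: invert the inputs and the output."""
    return not ((not a) or (not b))


for a in (False, True):
    for b in (False, True):
        print(int(a), int(b), int(or_from_and(a, b)), int(and_from_or(a, b)))
```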

Decoder
– n inputs, 2^n outputs
– Exactly one output is 1 for each possible input pattern

[Figure: 2-bit decoder.]
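A decoder can be sketched as a function from an n-bit input value to a list of 2^n outputs, exactly one of which is 1. The function name and the list representation are assumptions.

```python
def decoder(value, n):
    """n-input decoder: output i is 1 only when the input pattern equals i."""
    if not 0 <= value < 2 ** n:
        raise ValueError("input does not fit in n bits")
    return [1 if i == value else 0 for i in range(2 ** n)]


# 2-bit decoder: four input patterns, four outputs
for value in range(4):
    print(format(value, "02b"), decoder(value, 2))
```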

Sequential Processors

Sequential HW Structure

State
– Program counter register (PC)
– Condition code register (CC)
– Register file
– Memories
  Access same memory space
  Data: for reading/writing program data
  Instruction: for reading instructions

Instruction Flow
– Read instruction at address specified by PC
– Process through stages
– Update program counter

[Figure: sequential hardware structure. Stages: Fetch, Decode, Execute, Memory, Write back, PC. Blocks: instruction memory, PC increment, register file (ports A, B, M, E), ALU, CC, data memory. Signals: icode, ifun, rA, rB, valC, valP, srcA, srcB, dstA, dstB, valA, valB, aluA, aluB, Bch, valE, Addr, Data, valM, newPC.]

Sequential Stages

Fetch
– Read instruction from instruction memory

Decode
– Read program registers

Execute
– Compute value or address

Memory
– Read or write data

Write Back
– Write program registers

PC
– Update program counter

[Figure: the sequential hardware structure diagram, repeated from the previous slide.]

Instruction Decoding

Instruction Format
– Instruction byte: icode:ifun
– Optional register byte: rA:rB
– Optional constant word: valC

[Figure: example instruction bytes "5 0 | rA rB | D", split into icode and ifun, rA and rB (optional), and valC (optional).]
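Splitting an instruction into icode:ifun, rA:rB, and valC can be sketched as below. The caller states whether the register byte and the constant word are present, since the slide does not list which instructions need them; the 4-byte little-endian constant is an assumption consistent with the 32-bit word size used elsewhere in these slides.

```python
def decode(instr_bytes, has_regs, has_valc):
    """Split raw instruction bytes into fields; returns (fields, length)."""
    icode = instr_bytes[0] >> 4       # high 4 bits of the instruction byte
    ifun = instr_bytes[0] & 0xF       # low 4 bits of the instruction byte
    pos = 1
    rA = rB = valC = None
    if has_regs:                      # optional register byte rA:rB
        rA = instr_bytes[pos] >> 4
        rB = instr_bytes[pos] & 0xF
        pos += 1
    if has_valc:                      # optional constant word (assumed 4 bytes)
        valC = int.from_bytes(instr_bytes[pos:pos + 4], "little")
        pos += 4
    fields = {"icode": icode, "ifun": ifun, "rA": rA, "rB": rB, "valC": valC}
    return fields, pos


# Bytes in the shape of the slide's example: 5 0 | rA rB | D
raw = bytes([0x50, 0x23, 0x0C, 0x00, 0x00, 0x00])
print(decode(raw, has_regs=True, has_valc=True))
```

The returned length is what the fetch stage would add to the PC to form valP.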

Sequential Summary

Implementation
– Express every instruction as series of simple steps
– Follow same general flow for each instruction type
– Assemble registers, memories, predesigned combinational blocks
– Connect with control logic

Limitations
– Too slow to be practical
– In one cycle, must propagate through instruction memory, register file, ALU, and data memory
– Would need to run clock very slowly
– Hardware units only active for fraction of clock cycle

Pipelined Processors

What is Pipelining

– Computers execute billions of instructions, so instruction throughput is what matters
– IDEA: divide instruction execution up into several pipeline stages. For example:
  IF ID EX MEM WB
– Simultaneously have different instructions in different pipeline stages
– The length of the longest pipeline stage determines the cycle time
– Desirable pipeline features (e.g., RISC):
  – all instructions same length
  – registers located in same place in instruction format
  – memory operands only in loads or stores

What Is Pipelining

Laundry Example

Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
– Washer takes 30 minutes
– Dryer takes 40 minutes
– "Folder" takes 20 minutes

What Is Pipelining

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?

[Figure: task order (A, B, C, D) versus time, 6 PM to midnight. The loads run one after another, each taking 30 + 40 + 20 minutes.]

What Is Pipelining

Start work ASAP

Pipelined laundry takes 3.5 hours for 4 loads

[Figure: task order (A, B, C, D) versus time, starting at 6 PM. The loads overlap; the timeline is divided into segments of 30, 40, 40, 40, 40, and 20 minutes.]
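The 6-hour and 3.5-hour figures can be reproduced with a short calculation. The sketch assumes each machine handles one load at a time and that a load starts a stage as soon as both the load and the machine are free; the function names are illustrative.

```python
STAGES = [30, 40, 20]   # washer, dryer, "folder" (minutes)


def sequential_time(stages, loads):
    """Each load finishes completely before the next one starts."""
    return sum(stages) * loads


def pipelined_time(stages, loads):
    """Each load enters a stage as soon as the load and the stage are free."""
    stage_free_at = [0] * len(stages)    # when each machine becomes free
    for _ in range(loads):
        t = 0                            # when this load is ready to move on
        for s, duration in enumerate(stages):
            start = max(t, stage_free_at[s])
            t = start + duration
            stage_free_at[s] = t
    return stage_free_at[-1]


print(sequential_time(STAGES, 4) / 60, "hours")
print(pipelined_time(STAGES, 4) / 60, "hours")
```

The pipelined total is 30 + 4 × 40 + 20 minutes: after the first wash, the 40-minute dryer sets the pace.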

What Is Pipelining

Pipelining Lessons
– Pipelining doesn't help latency of single task, it helps throughput of entire workload
– Pipeline rate limited by slowest pipeline stage
– Multiple tasks operating simultaneously
– Potential speedup = number of pipe stages
– Unbalanced lengths of pipe stages reduce speedup
– Time to "fill" pipeline and time to "drain" it reduces speedup

[Figure: the pipelined laundry timeline (loads A, B, C, D; segments of 30, 40, 40, 40, 40, and 20 minutes), repeated.]

Real-World Pipelines: Car Washes

Idea
– Divide process into independent stages
– Move objects through stages in sequence
– At any given time, multiple objects being processed

[Figure: three car-wash arrangements: Sequential, Parallel, and Pipelined.]

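Pipeline Diagrams

Unpipelined
– Cannot start new operation until previous one completes

[Figure: OP1, OP2, OP3 executed back to back along the time axis.]

3-Way Pipelined
– Up to 3 operations in process simultaneously

[Figure: OP1, OP2, OP3 each pass through stages A, B, C, with each operation starting one stage after the previous one.]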

Data Dependencies

System
– Each operation depends on result from preceding one

[Figure: combinational logic feeding a clocked register whose output feeds back to the logic; OP1, OP2, OP3 execute one after another along the time axis.]

Data Hazards
– Result does not feed back around in time for next operation
– Pipelining has changed behavior of system

[Figure: three-stage pipeline (comb. logic A, B, C, each followed by a clocked register) with feedback from the end to the start; OP1 to OP4 each pass through stages A, B, C, each starting one stage after the previous one.]

One Memory Port/Structural Hazards

[Figure: pipeline diagram, instruction order versus time (clock cycles 1 to 7). Load, Instr 1, Instr 2, Instr 3, and Instr 4 each pass through Ifetch, Reg, ALU, DMem, Reg, one cycle apart. With a single memory port, the Load's DMem access and Instr 3's Ifetch both need memory in cycle 4.]

One Memory Port/Structural Hazards

[Figure: pipeline diagram, instruction order versus time (clock cycles 1 to 7). Load, Instr 1, and Instr 2 proceed through Ifetch, Reg, ALU, DMem, Reg; a Stall row of five bubbles follows Instr 2, and Instr 3 starts one cycle later.]

How do you "bubble" the pipe?

Data Hazard on R1

Instruction order:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

[Figure: pipeline diagram, instruction order versus time (clock cycles), with stages IF, ID/RF, EX, MEM, WB; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, one cycle after the previous one.]

Three Generic Data Hazards

Read After Write (RAW): Instr J tries to read operand before Instr I writes it

  I: add r1,r2,r3
  J: sub r4,r1,r3

Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Write After Read (WAR): Instr J writes operand before Instr I reads it

  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".


Write After Write (WAW): Instr J writes operand before Instr I writes it

  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
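The three hazard types reduce to comparing the register names of an earlier instruction I and a later instruction J. The sketch below reports which dependences exist between a pair; whether one becomes an actual hazard depends on the pipeline. The Instr tuple and the function name are assumptions, with the destination register listed first as in the examples on these slides.

```python
from collections import namedtuple

# op dest, src1, src2  (destination first, as in: add r1,r2,r3)
Instr = namedtuple("Instr", "op dest src1 src2")


def dependences(i, j):
    """Dependences from earlier instruction i to later instruction j."""
    found = set()
    if i.dest in (j.src1, j.src2):
        found.add("RAW")    # j reads what i writes
    if j.dest in (i.src1, i.src2):
        found.add("WAR")    # j writes what i reads
    if i.dest == j.dest:
        found.add("WAW")    # both write the same register
    return found


print(dependences(Instr("add", "r1", "r2", "r3"), Instr("sub", "r4", "r1", "r3")))
print(dependences(Instr("sub", "r4", "r1", "r3"), Instr("add", "r1", "r2", "r3")))
print(dependences(Instr("sub", "r1", "r4", "r3"), Instr("add", "r1", "r2", "r3")))
```

The three calls use the I and J pairs from the RAW, WAR, and WAW slides, in that order.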

Data Forwarding

Naïve Pipeline
– Register isn't written until completion of write-back stage
– Source operands read from register file in decode stage
  Needs to be in register file at start of stage

Observation
– Value generated in execute or memory stage

Trick
– Pass value directly from generating instruction to decode stage
– Needs to be available at end of decode stage

Forwarding to Avoid Data Hazard

Instruction order:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

[Figure: pipeline diagram, instruction order versus time (clock cycles); each instruction passes through Ifetch, Reg, ALU, DMem, Reg, one cycle after the previous one.]
