CDA 5106 Advanced Computer Architecture I
Instruction Set Architecture Design
Computer Science Department, University of Central Florida
What is an ISA?
• Hardware-software interface
• An Instruction Set Architecture (ISA) defines:
  – STATE OF THE PROGRAM: processor registers, memory
  – WHAT INSTRUCTIONS DO: semantics of instructions, how they update state
  – HOW INSTRUCTIONS ARE REPRESENTED: syntax (bit encodings)
• …selected so that the implications of the above on hardware design/compiler design are optimal
  – Example: if the register specifier moves around between different instructions, we need multiple lines and a mux before the register file
Why is the ISA important?
• Fixed h/w-s/w interface for a generation of processors
  – IBM realized early the value of a fixed ISA
  – But: “stuck” with bad decisions for a long time
  – Recent developments mitigate ISA problems (e.g., x86 micro-ops, Transmeta, JIT compilers, virtual machines)
• ISA decisions affect: (revisit RISC vs. CISC…)
  1. Memory cost of the machine
     – Short vs. long bit encodings
     – High vs. low semantic meaning per instruction
  2. Hardware design
     – Simple, uniform-complexity ops => efficient pipeline
     – Don’t build hardware for instructions that never get used
  3. Compiler and programming language issues
     – How much can the compiler exploit the ISA to optimize performance
     – How well does the ISA support high-level language constructs
     – Choice for hand-coded vs. compiler-generated code: semantics that are easy to use vs. easy to generate code for
ISA Design Decisions & Outline
• Style of operand specification: stack, accumulator, registers, etc.
• Operand access limitations
• Addressing modes for operands
• Semantics:
  – Mix of operations supported
  – Control transfers
• Encoding tradeoffs
• Compiler influence
• Example: MIPS
Styles of ISAs
Each column computes C = A + B in a different ISA style:

Stack    Accumulator   Register-Memory   Load-Store
Push A   Load A        Load R1, A        Load R1, A
Push B   Add B         Add R1, B         Load R2, B
Add      Store C       Store C, R1       Add R3, R1, R2
Pop C                                    Store C, R3
Why stacks, accumulatorsWhy stacks, accumulatorsy ,y ,
• Stacks:– Very compact formatVery compact format
• All calculation operations take zero operands• Example use: Java bytecode (low network b/w)
Th ti ll h t t d f i l ti ith ti – Theoretically shortest code for implementing arithmetic expressions
• All HP calculator fanatics know this• Accumulator:
– Also a very compact format– Less dependence on memory than stack-basedp y
• For both:– Compact implies memory efficient
G d if i i
6
– Good if memory is expensive
Why registers?
1. Faster than memory
   – Latency: raw access time (once address is known)
     • Cache access: 2 cycles (typical)
     • Register access: 1 cycle
     • Register file typically smaller than data cache
     • Register file doesn’t need tag-check logic
   – Bandwidth: more practical to multiport a register file
     • ILP requires a large number of operand ports
   – ILP requirements
     • High-performance scheduling (ILP) requires detecting data-dependent/independent operations early in the pipeline
     • Register “addresses” are known at instruction decode time
     • Memory addresses are known quite late due to address computation
Why Registers? (cont.)
2. Less memory traffic if values are in registers
   – Program runs faster if variables are inside registers (compiler does “register allocation”)
   – Bus can be used for other things (e.g., I/O)
3. More flexible for compiler/hardware scheduling
   – (A*B) - (C*D) - (E*F)
   – A*B in R1, -C*D in R2, -E*F in R3: can easily rearrange the ADD instructions
   – A to F on the stack: less flexible
     • Need to add swaps/rotates or completely rewrite the code
How many registers?
• Depends on:
  – Compiler ability
  – Program characteristics
  – Implementation impact such as cycle time & cost
• Number of physical registers >= number of architectural registers
• Lots of registers enable two important optimizations:
  – Register allocation (more variables can be in registers)
  – Limiting reuse of registers improves parallelism
• Reuse example:
    Load R2, A; Load R3, B; Load R4, C; Load R5, D
    Add R1, R2, R3
    Add R2, R5, R4   (reuse of R2)
  vs.
    Add R1, R2, R3
    Add R6, R4, R5   (no reuse: had R6)
  – The conflict artificially serializes the two instructions
  – Without reuse, the Adds are “parallelizable” if there are two adders
• Instruction-level parallelism (ILP)
  – ILP ~ average (CPI)^-1 ~ number of registers
Operand access limitations

# mem. operands   # total operands   Type                Examples
0                 3                  "load/store"        Most RISCs
1                 2                  "register/memory"   x86, 68000
2, 3              2, 3               "memory/memory"     VAX

• Load/store (0,3)
  – (+) Fixed-length instructions possible: easy fetch/decode
  – (+) Simpler h/w: efficient pipeline & potentially lower CT
  – (-) Higher instruction count (IC)
  – (-) Fixed-length instructions are wasteful
• Register/memory (1,2)
  – (+) No need for extra loads
  – (+) “A few lengths” better uses bits: good code density
  – (-) Destroys source operand (e.g., Add R1, R2)
  – (-) May impact CPI
• Memory/memory
  – (+) Most compact (code density)
  – (-) High memory traffic (memory bottleneck)
Alignment
• Byte alignment
  – Any access is accommodated
• Word alignment
  – Only accesses aligned at natural word boundaries are accommodated, due to DRAM/SRAM organization
  – Reduces number of reads/writes to memory
  – Eliminates hardware for alignment (typically expensive)
  – Often handle misalignment via software:
    • Compiler detects & generates appropriate instructions
    • …or O/S detects and runs a “fixit” routine
• [Figure: memory drawn as bytes 0–7, word size = 4 bytes]
  – Asking for words beginning at 0 or 4 is OK
  – Asking for other words requires two reads: e.g., the word starting at 2 needs read #1 (bytes 0–3) and read #2 (bytes 4–7), then a reorder step to extract bytes 2, 3, 4, 5
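The two-read-plus-reorder sequence in the figure can be sketched in C. This is an illustrative model only (not any particular compiler's fix-up code); it assumes a little-endian host, and `memory` is a stand-in for main memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit word from a possibly misaligned address using only
   naturally aligned word reads, then merge the bytes -- roughly what
   software fix-up code has to do on a word-aligned-only machine. */
static uint32_t load_word_aligned_only(const uint8_t *memory, size_t addr)
{
    size_t base = addr & ~(size_t)3;       /* aligned address at or below addr */
    uint32_t lo, hi;
    memcpy(&lo, memory + base, 4);         /* read #1: aligned word            */
    if ((addr & 3) == 0)
        return lo;                         /* already aligned: one read only   */
    memcpy(&hi, memory + base + 4, 4);     /* read #2: next aligned word       */
    unsigned shift = (unsigned)(addr & 3) * 8;
    return (lo >> shift) | (hi << (32 - shift));  /* reorder/merge bytes       */
}
```

Asking for the word at address 2 of bytes 0..7 yields bytes 2, 3, 4, 5, exactly as in the figure.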
Endian-ness
• Where is the most-significant byte (MSB) in a word?
  – Little-endian (e.g., x86)
    Byte address:  0 ........ 3
                   LSB ...... MSB
    • “Little”-endian comes from interpreting byte address 0 as the “least”-significant byte
  – Big-endian (e.g., IBM PowerPC)
    Byte address:  0 ........ 3
                   MSB ...... LSB
    • “Big”-endian comes from interpreting byte address 0 as the “most”-significant byte
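A minimal sketch of how software can discover the host's byte order at runtime: store a known 32-bit pattern and inspect the byte that lands at the lowest address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian puts the LSB (0x44) at byte address 0;
   big-endian puts the MSB (0x11) there. */
static const char *host_endianness(void)
{
    uint32_t word = 0x11223344u;
    uint8_t first_byte;
    memcpy(&first_byte, &word, 1);   /* the byte at the lowest address */
    return (first_byte == 0x44) ? "little-endian" : "big-endian";
}
```

Portable code (e.g., network protocols, file formats) uses this kind of check, or explicit byte-by-byte packing, rather than assuming one ordering.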
Common addressing modes
• Register
  – Add R4, R3
  – R4 = R4 + R3
  – Used when value is in a register
• Immediate
  – Add R4, #3
  – R4 = R4 + 3
  – Useful for small constants, which occur frequently
• Displacement
  – Add R4, 100(R1)
  – R4 = R4 + Mem[100+R1]
  – Accesses the frame (arguments, local variables)
  – Accesses the global data segment
  – Accesses fields of a data struct
Addressing modes (cont.)
• Register deferred/Register indirect
  – Add R3, (R1)
  – R3 = R3 + Mem[R1]
  – Access using a computed address
• Indexed
  – Add R3, (R1 + R2)
  – R3 = R3 + Mem[R1 + R2]
  – Array accesses
    • R1 = base, R2 = index
• Direct/Absolute
  – Add R1, (1001)
  – R1 = R1 + Mem[1001]
  – Accessing global (“static”) data
Addressing modes (cont.)
• Memory indirect/Memory deferred
  – Add R1, @(R3)
  – R1 = R1 + Mem[Mem[R3]]
  – Pointer dereferencing: x = *p; (if p is not register-allocated)
• Autoincrement/Postincrement
  – Add R1, (R2)+
  – R1 = R1 + Mem[R2]; R2 = R2 + d (d is size of operation)
  – Looping through arrays, stack pop
• Autodecrement/Predecrement
  – Add R1, -(R2)
  – R2 = R2 - d; R1 = R1 + Mem[R2] (d is size of operation)
  – Same uses as autoincrement, stack push
• Scaled
  – Add R1, 100(R2)[R3]
  – R1 = R1 + Mem[100+R2+R3*d] (d is size of operation)
  – Array accesses for non-byte-sized elements
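As a rough illustration of which C constructs motivate which modes (all names and values here are made up for the example; the comments show the mode a compiler would typically pick):

```c
#include <assert.h>

struct point { int x, y; };
int global_count = 7;          /* direct/absolute: Mem[static address]      */

int addressing_demo(void)
{
    int a[4] = {10, 20, 30, 40};
    int i = 2;
    int *p = &a[1];
    struct point pt = {1, 2};

    int s = 0;
    s += 3;                    /* immediate:         Add R4, #3             */
    s += a[i];                 /* indexed:           Mem[base + index]      */
    s += *p;                   /* register deferred: Mem[R1]                */
    s += pt.y;                 /* displacement:      Mem[offset + frame]    */
    s += global_count;         /* direct/absolute:   Mem[static address]    */
    return s;                  /* 3 + 30 + 20 + 2 + 7 = 62                  */
}
```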
Wisdom about modes
• Need:
  – Register, Displacement, Immediate, and optionally Indexed (indexed simplifies array accesses)
  – Displacement size: 12-16 bits (empirical)
  – Immediate: 8 to 16 bits (empirical)
  – Can synthesize the rest from simpler instructions
  – Example: MIPS architecture
    • Register, Displacement, and Immediate modes only
    • Both immediate and displacement: 16 bits
• Choice depends on workload!
  – For example, floating-point codes might require larger immediates, or 64-bit wordsize machines might also require larger immediates
Control transfer semantics
• Types of branches
  – Conditional
  – Unconditional
    • Normal
    • Call
    • Return
• Addressing mode in control transfer instructions
  – Branch
    • Branch allows relocatable (“position-independent”) code
    • Fewer bits in encoding when the target is close
  – Indirect jump
    • Switch/case statements: jump r1
  – Jump
    • Jump allows branching further than branch
Parts of a control transfer
• WHERE
  – Determine target address
• WHETHER
  – Determine if transfer should occur or not
• WHEN
  – Determine when in time the transfer should occur
• Each of the three decisions can be decoupled
Types of control transfer (cont.)
• All three together: compare-and-branch instruction
  – Br (R1 = R2), destination
  – (+) A single instruction
  – (-) Heavy hardware requirement, inflexible scheduling
• WHETHER separate from WHERE/WHEN:
  – Condition code register (CMP R1,R2 … BEQ dest)
    • (+) Sometimes the test happens “for free”
    • (-) Hard for the compiler to figure out which instructions depend on the CC register, limiting parallelism (two CMPs will share the CC register implicitly)
  – Condition register (SUB R1,R2 … BEQ R1, dest)
    • (+) Simple to implement; dependencies between instructions are obvious to the compiler
    • (-) Uses a register (“register pressure”)
Prepare-to-branch
• Decouple all three of WHERE / WHETHER / WHEN
• WHERE: PBR BTR1 = destination
  – BTR1 = “branch target register #1”
• WHETHER: CMP PR2 = (R1 = R2)
  – PR2 = “predicate register #2”
• WHEN: BR BTR1 if PR2
• (+) Schedule each instruction so it happens during “free time” when hardware is idle
• (-) Three instructions: higher IC
• From the HP Labs PlayDoh architecture
Instruction encoding tradeoffs
• Variable width
  – Common instructions are short (1-2 bytes); less common or more complex instructions are long (>2 bytes)
  – (+) Very versatile, uses memory efficiently
  – (-) Instruction words must be decoded before the number of instructions is known
• Fixed width
  – Typically 1 instruction per 32-bit word (Alpha is 2 instructions per word)
  – (+) Every instruction word is an instruction; easier to fetch/decode
  – (-) Uses memory inefficiently
Addressing mode encoding
• Each operand has a “mode” field
  – Also called “address specifiers”
  – VAX, 68000
  – (+) Very versatile
  – (-) Encourages variable-width instructions (hard decode)
• Opcode specifies addressing mode
  – Most RISCs
  – (+) Encourages fixed-width instructions (easy decode)
  – (+) “Natural” for a load/store ISA
  – (-) Limits what every instruction can do
    • But only matters for loads and stores
Compiler impact
• High-level opt:
  – Use a “virtual source level” representation
  – Loop interchange, etc.
• Low-level opt:
  – Clean up parser refuse
  – Each “optimization pass” runs as a filter
  – Enhance parallelism
• Code generation:
  – Allocate registers
  – Schedule code for high performance
• [Figure: compiler pipeline]
  Parse -> high-level intermediate language -> high-level optimize ->
  low-level intermediate language -> low-level optimize ->
  code generation (allocate, schedule) -> translate -> assembly code
Example: MIPS
• A load/store, fixed-encoding architecture that uses condition registers
• The opcode is in the same place for every instruction
• For semantics, see the H&P textbook, Fig. B.26 (p. B-40)

I-type instruction:  Opcode(6) | rs1(5) | rd(5) | Immediate(16)
  – Load, store, all immediate operations, conditional branches (rd unused)
  – Jump through register, call through register (“jump and link register”)

R-type instruction:  Opcode(6) | rs1(5) | rs2(5) | rd(5) | shamt(5) | Func(6)
  – Register-register ALU operations
  – “Func” is an opcode extension

J-type instruction:  Opcode(6) | Offset(26)
  – Offset added to PC
  – Jump, call (“jump and link”), trap and return from exception
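The fixed field layout can be modeled directly in C. A small sketch of packing and unpacking R-type fields (field order per the diagram above; not a full assembler):

```c
#include <assert.h>
#include <stdint.h>

/* R-type layout: opcode(6) rs1(5) rs2(5) rd(5) shamt(5) func(6),
   packed most-significant field first in a 32-bit word. */
static uint32_t encode_rtype(unsigned opcode, unsigned rs1, unsigned rs2,
                             unsigned rd, unsigned shamt, unsigned func)
{
    return (opcode & 0x3Fu) << 26 | (rs1 & 0x1Fu) << 21 |
           (rs2 & 0x1Fu) << 16 | (rd & 0x1Fu) << 11 |
           (shamt & 0x1Fu) << 6 | (func & 0x3Fu);
}

/* The opcode sits in the same place for every format, so decode
   hardware can extract it before knowing which format it has. */
static unsigned decode_opcode(uint32_t insn)
{
    return insn >> 26;
}
```

Because register specifiers also sit in fixed positions, the register file can be read speculatively at decode time, before the opcode is fully interpreted.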
MIPS-64 ISA
• Integer registers R0-R31, each 64 bits; R0 is permanently 0
• PC
• Floating-point registers F0-F31; each Fi holds either a single-precision (32-bit) value or a double-precision (64-bit) value
• Load/store architecture
• Transfer sizes: B (byte), H (halfword), W (word), D (double word)
• No unaligned accesses allowed
MIPS example code

      DADDI R1,R0,#10     ; put 10 into R1 (R0 = 0)
      LW    R2,A(R0)      ; put A in R2
Loop: L.D   F0, 0(R2)     ; load double FP value into F0
      ADD.D F4, F0, F2    ; add F2 to F0
      S.D   F4, 0(R2)     ; store result back to memory
      DSUB  R1,R1,#1      ; decrement i
      DADDI R2,R2,#8      ; increment loop pointer
      BNEZ  R1,Loop
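In C terms, the loop above computes roughly the following (the array name A and the scalar x are illustrative stand-ins for whatever memory A labels and whatever value F2 holds):

```c
#include <assert.h>

/* C sketch of the MIPS loop: walk 10 doubles, adding the value held
   in F2 (here x) to each element in place. */
void add_scalar(double *A, double x)   /* R2 holds the base, F2 holds x  */
{
    for (int i = 10; i != 0; i--) {    /* DADDI R1,R0,#10 ... DSUB, BNEZ */
        *A = *A + x;                   /* L.D / ADD.D / S.D              */
        A++;                           /* DADDI R2,R2,#8 (8 = sizeof(double)) */
    }
}
```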
Application-Specific ISA Extension
• Example: converting endian format

  Byte 3 | Byte 2 | Byte 1 | Byte 0   ->   Byte 0 | Byte 1 | Byte 2 | Byte 3

  unsigned ss = (s<<24) | ((s<<8)&0xff0000) | ((s>>8)&0xff00) | (s>>24);

• On a typical general-purpose processor this takes 9 instructions:

  slli  a9, a14, 24
  slli  a8, a14, 8
  srli  a10, a14, 8
  and   a10, a10, a11
  and   a8, a8, a13
  or    a8, a8, a9
  extui a9, a14, 24, 8
  or    a10, a10, a9
  or    a10, a10, a8
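The shift-and-mask expression from the slide as a self-contained C function; this is exactly what the 9-instruction sequence computes, and what a single custom instruction would replace:

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bytes of a 32-bit word. */
static uint32_t byteswap32(uint32_t s)
{
    return (s << 24) | ((s << 8) & 0x00ff0000u) |
           ((s >> 8) & 0x0000ff00u) | (s >> 24);
}
```

Note that swapping twice returns the original word, which makes the function easy to test.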
Application-Specific ISA Extension
• New instruction based on a hardware extension for the byte-swapping operation
• Using the Tensilica Instruction Extension (TIE) language:

  operation BYTESWAP {out AR outR, in AR inpR} {}
  {
    wire [31:0] reg_swapped = {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]};
    assign outR = reg_swapped;
  }

• The new instruction: BYTESWAP a10, a9
• A speedup of 9x!
Overhead with Instruction Extension
• To support new instructions defined in the ISA, we need
  – Hardware function units to implement the operations
  – The compiler (or programmer) to generate code that takes advantage of the new instructions
• Design flow of the Xtensa microprocessor core:
  – Compile and profile the application code; develop a processor configuration and extensions
  – Feed the processor specification to the processor generator
  – The generator produces the configured processor RTL and tailored software development tools (compiler, debugger, assembler, etc.)
Common ISA Extension Techniques
• Fusion
  – Turn a data-dependence subgraph into a single instruction
• SIMD/Vector transformation
  – Data-level parallelism
  – Width of the vector register file
• VLIW (called FLIX in Tensilica)
  – Parallel operations determined by the compiler
  – Instruction-level parallelism
Operation Fusion

  unsigned short *a, *b, *c;
  for (i = 0; i < n; i++)
    c[i] = (a[i] + b[i]) >> 1;

• A new operation results from fusing the add and shift operations:

  operation AVERAGE {out AR res, in AR input0, in AR input1} {}
  {
    wire [16:0] tmp = input0[15:0] + input1[15:0];
    assign res = tmp[16:1];
  }

• The C/C++ code or assembly code can use the new AVERAGE instruction as follows:

  for (i = 0; i < n; i++)
    c[i] = AVERAGE(a[i], b[i]);
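A scalar C model of the fused operation may help: the 17-bit `tmp` wire keeps the carry out of the 16-bit add, so the average stays exact even when a[i] + b[i] overflows 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the AVERAGE TIE operation: form the 17-bit sum, then take
   bits [16:1], i.e., drop the low bit and keep the carry. */
static uint16_t average_u16(uint16_t a, uint16_t b)
{
    uint32_t tmp = (uint32_t)a + (uint32_t)b;   /* wire [16:0] tmp */
    return (uint16_t)(tmp >> 1);                /* res = tmp[16:1] */
}
```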
SIMD/Vector transformation

  regfile VEC 64 8 v
  operation VAVERAGE {out VEC res, in VEC input0, in VEC input1} {}
  {
    wire [67:0] tmp = {input0[63:48] + input1[63:48],   // a 17-bit result
                       input0[47:32] + input1[47:32],   // a 17-bit result
                       input0[31:16] + input1[31:16],   // a 17-bit result
                       input0[15:0]  + input1[15:0]};   // a 17-bit result
    assign res = {tmp[67:52], tmp[50:35], tmp[33:18], tmp[16:1]};
  }

• The C/C++ code or assembly code can use the new VAVERAGE instruction as follows:

  for (i = 0; i < n; i += 4)
    c[i] = VAVERAGE(a[i], b[i]);
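A C model of the vector version, with four 16-bit lanes packed low-lane-first into a 64-bit value (illustrative only; real vector registers and intrinsics differ):

```c
#include <assert.h>
#include <stdint.h>

/* Model of VAVERAGE: four independent 16-bit averages on the lanes of
   a 64-bit vector register, mirroring the per-lane 17-bit adds in the
   TIE description above. */
static uint64_t vaverage(uint64_t x, uint64_t y)
{
    uint64_t res = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (uint32_t)(x >> (lane * 16)) & 0xFFFFu;
        uint32_t b = (uint32_t)(y >> (lane * 16)) & 0xFFFFu;
        uint64_t avg = (a + b) >> 1;        /* tmp[16:1]: fits in 16 bits */
        res |= avg << (lane * 16);
    }
    return res;
}
```

The data-level parallelism is explicit: each lane's result depends only on the matching lanes of the inputs, so the hardware can compute all four at once.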