CDA 5106 Advanced Computer Architecture I
Instruction Set Architecture Design
Computer Science Department, University of Central Florida
What is an ISA?
• Hardware-software interface
• An Instruction Set Architecture (ISA) defines:
  – STATE OF THE PROGRAM: processor registers, memory
  – WHAT INSTRUCTIONS DO: semantics of instructions, how they update state
  – HOW INSTRUCTIONS ARE REPRESENTED: syntax (bit encodings)
• …selected so that the implications of the above on hardware design/compiler design are optimal
  – Example: if the register specifier moves around between different instructions, we need multiple lines and a mux before the register file
Why is the ISA important?
• Fixed h/w-s/w interface for a generation of processors
  – IBM realized early the value of a fixed ISA
  – But: “stuck” with bad decisions for a long time
  – Recent developments mitigate ISA problems (e.g., x86 micro-ops, Transmeta, JIT compilers, virtual machines)
• ISA decisions affect: (revisit RISC vs. CISC…)
  1. Memory cost of the machine
     – Short vs. long bit encodings
     – High vs. low semantic meaning per instruction
  2. Hardware design
     – Simple, uniform-complexity ops => efficient pipeline
     – Don’t build hardware for instructions that never get used
  3. Compiler and programming language issues
     – How much can the compiler exploit the ISA to optimize performance
     – How well does the ISA support high-level language constructs
     – Choice for hand-coded vs. compiler-generated code: semantics that are easy to use vs. easy to generate code for
ISA Design Decisions & Outline
• Style of operand specification: stack, accumulator, registers, etc.
• Operand access limitations
• Addressing modes for operands
• Semantics:
  – Mix of operations supported
  – Control transfers
• Encoding tradeoffs
• Compiler influence
• Example: MIPS
Styles of ISAs
Each column computes C = A + B in a different ISA style:

Stack    Accumulator   Register-Memory   Load-Store
Push A   Load A        Load R1, A        Load R1, A
Push B   Add B         Add R1, B         Load R2, B
Add      Store C       Store C, R1       Add R3, R1, R2
Pop C                                    Store C, R3
Why stacks, accumulatorsWhy stacks, accumulatorsy ,y ,
• Stacks:– Very compact formatVery compact format
• All calculation operations take zero operands• Example use: Java bytecode (low network b/w)
Th ti ll h t t d f i l ti ith ti – Theoretically shortest code for implementing arithmetic expressions
• All HP calculator fanatics know this• Accumulator:
– Also a very compact format– Less dependence on memory than stack-basedp y
• For both:– Compact implies memory efficient
G d if i i
6
– Good if memory is expensive
Why registers?
1. Faster than memory
   – Latency: raw access time (once address is known)
     • Cache access: 2 cycles (typical)
     • Register access: 1 cycle
     • Register file typically smaller than data cache
     • Register file doesn’t need tag-check logic
   – Bandwidth: more practical to multiport a register file
     • ILP requires a large number of operand ports
   – ILP requirements
     • High-performance scheduling (ILP) requires detecting data-dependent/independent operations early in the pipeline
     • Register “addresses” are known at instruction decode time
     • Memory addresses are known quite late due to address computation
Why Registers? (cont.)
2. Less memory traffic if values are in registers
   – Program runs faster if variables are inside registers (compiler does “register allocation”)
   – Bus can be used for other things (e.g., I/O)
3. More flexible for compiler/hardware scheduling
   – (A*B) - (C*D) - (E*F)
   – A*B in R1, -C*D in R2, -E*F in R3: can easily rearrange the ADD instructions
   – A to F on the stack: less flexible
     • Need to add swaps/rotates or completely rewrite the code
How many registers?
• Depends on:
  – Compiler ability
  – Program characteristics
  – Implementation impact such as cycle time & cost
• Number of physical registers >= number of architectural registers
• Lots of registers enable two important optimizations:
  – Register allocation (more variables can be in registers)
  – Limiting reuse of registers improves parallelism
• Reuse example:
    Load R2, A; Load R3, B; Load R4, C; Load R5, D
    Add R1, R2, R3
    Add R2, R5, R4   (reuse of R2)
  vs.
    Add R1, R2, R3
    Add R6, R4, R5   (no reuse: had R6)
  – The conflict artificially serializes the two instructions
  – Without reuse, the Adds are “parallelizable” if there are two adders
• Instruction-level parallelism (ILP)
  – ILP ~ average (CPI)^-1 ~ number of registers
Operand access limitations

# mem. operands   # total operands   Type                Examples
0                 3                  "load/store"        Most RISCs
1                 2                  "register/memory"   x86, 68000
2, 3              2, 3               "memory/memory"     VAX

• Load/store (0,3)
  – (+) Fixed-length instructions possible: easy fetch/decode
  – (+) Simpler h/w: efficient pipeline & potentially lower CT
  – (-) Higher instruction count (IC)
  – (-) Fixed-length instructions are wasteful
• Register/memory (1,2)
  – (+) No need for extra loads
  – (+) “A few lengths” better uses bits: good code density
  – (-) Destroys source operand (e.g., Add R1, R2)
  – (-) May impact CPI
• Memory/memory
  – (+) Most compact (code density)
  – (-) High memory traffic (memory bottleneck)
Alignment
• Byte alignment
  – Any access is accommodated
• Word alignment
  – Only accesses aligned at natural word boundaries are accommodated, due to DRAM/SRAM organization
  – Reduces number of reads/writes to memory
  – Eliminates hardware for alignment (typically expensive)
  – Often handle misalignment via software:
    • Compiler detects & generates appropriate instructions
    • …or O/S detects and runs a “fixit” routine
• [Figure: memory drawn as bytes 0–7, word size = 4 bytes]
  – Asking for words beginning at 0 or 4 is OK
  – Asking for other words requires two reads: e.g., the word starting at 2 needs read #1 (bytes 0–3) and read #2 (bytes 4–7), then a reorder step to extract bytes 2, 3, 4, 5
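The two-read-plus-reorder sequence in the figure can be sketched in C. This is an illustrative model only (not any particular compiler's fix-up code); it assumes a little-endian host, and `memory` is a stand-in for main memory:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit word from a possibly misaligned address using only
   naturally aligned word reads, then merge the bytes -- roughly what
   software fix-up code has to do on a word-aligned-only machine. */
static uint32_t load_word_aligned_only(const uint8_t *memory, size_t addr)
{
    size_t base = addr & ~(size_t)3;       /* aligned address at or below addr */
    uint32_t lo, hi;
    memcpy(&lo, memory + base, 4);         /* read #1: aligned word            */
    if ((addr & 3) == 0)
        return lo;                         /* already aligned: one read only   */
    memcpy(&hi, memory + base + 4, 4);     /* read #2: next aligned word       */
    unsigned shift = (unsigned)(addr & 3) * 8;
    return (lo >> shift) | (hi << (32 - shift));  /* reorder/merge bytes       */
}
```

Asking for the word at address 2 of bytes 0..7 yields bytes 2, 3, 4, 5, exactly as in the figure.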
Endian-ness
• Where is the most-significant byte (MSB) in a word?
  – Little-endian (e.g., x86)
    Byte address:  0 ........ 3
                   LSB ...... MSB
    • “Little”-endian comes from interpreting byte address 0 as the “least”-significant byte
  – Big-endian (e.g., IBM PowerPC)
    Byte address:  0 ........ 3
                   MSB ...... LSB
    • “Big”-endian comes from interpreting byte address 0 as the “most”-significant byte
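A minimal sketch of how software can discover the host's byte order at runtime: store a known 32-bit pattern and inspect the byte that lands at the lowest address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Little-endian puts the LSB (0x44) at byte address 0;
   big-endian puts the MSB (0x11) there. */
static const char *host_endianness(void)
{
    uint32_t word = 0x11223344u;
    uint8_t first_byte;
    memcpy(&first_byte, &word, 1);   /* the byte at the lowest address */
    return (first_byte == 0x44) ? "little-endian" : "big-endian";
}
```

Portable code (e.g., network protocols, file formats) uses this kind of check, or explicit byte-by-byte packing, rather than assuming one ordering.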
Common addressing modes
• Register
  – Add R4, R3
  – R4 = R4 + R3
  – Used when value is in a register
• Immediate
  – Add R4, #3
  – R4 = R4 + 3
  – Useful for small constants, which occur frequently
• Displacement
  – Add R4, 100(R1)
  – R4 = R4 + Mem[100+R1]
  – Accesses the frame (arguments, local variables)
  – Accesses the global data segment
  – Accesses fields of a data struct
Addressing modes (cont.)
• Register deferred/Register indirect
  – Add R3, (R1)
  – R3 = R3 + Mem[R1]
  – Access using a computed address
• Indexed
  – Add R3, (R1 + R2)
  – R3 = R3 + Mem[R1 + R2]
  – Array accesses
    • R1 = base, R2 = index
• Direct/Absolute
  – Add R1, (1001)
  – R1 = R1 + Mem[1001]
  – Accessing global (“static”) data
Addressing modes (cont.)
• Memory indirect/Memory deferred
  – Add R1, @(R3)
  – R1 = R1 + Mem[Mem[R3]]
  – Pointer dereferencing: x = *p; (if p is not register-allocated)
• Autoincrement/Postincrement
  – Add R1, (R2)+
  – R1 = R1 + Mem[R2]; R2 = R2 + d (d is size of operation)
  – Looping through arrays, stack pop
• Autodecrement/Predecrement
  – Add R1, -(R2)
  – R2 = R2 - d; R1 = R1 + Mem[R2] (d is size of operation)
  – Same uses as autoincrement, stack push
• Scaled
  – Add R1, 100(R2)[R3]
  – R1 = R1 + Mem[100+R2+R3*d] (d is size of operation)
  – Array accesses for non-byte-sized elements
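As a rough illustration of which C constructs motivate which modes (all names and values here are made up for the example; the comments show the mode a compiler would typically pick):

```c
#include <assert.h>

struct point { int x, y; };
int global_count = 7;          /* direct/absolute: Mem[static address]      */

int addressing_demo(void)
{
    int a[4] = {10, 20, 30, 40};
    int i = 2;
    int *p = &a[1];
    struct point pt = {1, 2};

    int s = 0;
    s += 3;                    /* immediate:         Add R4, #3             */
    s += a[i];                 /* indexed:           Mem[base + index]      */
    s += *p;                   /* register deferred: Mem[R1]                */
    s += pt.y;                 /* displacement:      Mem[offset + frame]    */
    s += global_count;         /* direct/absolute:   Mem[static address]    */
    return s;                  /* 3 + 30 + 20 + 2 + 7 = 62                  */
}
```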
Wisdom about modes
• Need:
  – Register, Displacement, Immediate, and optionally Indexed (indexed simplifies array accesses)
  – Displacement size: 12-16 bits (empirical)
  – Immediate: 8 to 16 bits (empirical)
  – Can synthesize the rest from simpler instructions
  – Example: MIPS architecture
    • Register, Displacement, and Immediate modes only
    • Both immediate and displacement: 16 bits
• Choice depends on workload!
  – For example, floating-point codes might require larger immediates, or 64-bit wordsize machines might also require larger immediates
Control transfer semantics
• Types of branches
  – Conditional
  – Unconditional
    • Normal
    • Call
    • Return
• Addressing mode in control transfer instructions
  – Branch
    • Branch allows relocatable (“position-independent”) code
    • Fewer bits in encoding when the target is close
  – Indirect jump
    • Switch/case statements: jump r1
  – Jump
    • Jump allows branching further than branch
Parts of a control transfer
• WHERE
  – Determine target address
• WHETHER
  – Determine if transfer should occur or not
• WHEN
  – Determine when in time the transfer should occur
• Each of the three decisions can be decoupled
Types of control transfer (cont.)
• All three together: compare-and-branch instruction
  – Br (R1 = R2), destination
  – (+) A single instruction
  – (-) Heavy hardware requirement, inflexible scheduling
• WHETHER separate from WHERE/WHEN:
  – Condition code register (CMP R1,R2 … BEQ dest)
    • (+) Sometimes the test happens “for free”
    • (-) Hard for the compiler to figure out which instructions depend on the CC register, limiting parallelism (two CMPs will share the CC register implicitly)
  – Condition register (SUB R1,R2 … BEQ R1, dest)
    • (+) Simple to implement; dependencies between instructions are obvious to the compiler
    • (-) Uses a register (“register pressure”)
Prepare-to-branch
• Decouple all three of WHERE / WHETHER / WHEN
• WHERE: PBR BTR1 = destination
  – BTR1 = “branch target register #1”
• WHETHER: CMP PR2 = (R1 = R2)
  – PR2 = “predicate register #2”
• WHEN: BR BTR1 if PR2
• (+) Schedule each instruction so it happens during “free time” when hardware is idle
• (-) Three instructions: higher IC
• From the HP Labs PlayDoh architecture
Instruction encoding tradeoffs
• Variable width
  – Common instructions are short (1-2 bytes); less common or more complex instructions are long (>2 bytes)
  – (+) Very versatile, uses memory efficiently
  – (-) Instruction words must be decoded before the number of instructions is known
• Fixed width
  – Typically 1 instruction per 32-bit word (Alpha is 2 instructions per word)
  – (+) Every instruction word is an instruction; easier to fetch/decode
  – (-) Uses memory inefficiently
Addressing mode encoding
• Each operand has a “mode” field
  – Also called “address specifiers”
  – VAX, 68000
  – (+) Very versatile
  – (-) Encourages variable-width instructions (hard decode)
• Opcode specifies addressing mode
  – Most RISCs
  – (+) Encourages fixed-width instructions (easy decode)
  – (+) “Natural” for a load/store ISA
  – (-) Limits what every instruction can do
    • But only matters for loads and stores
Compiler impact
• High-level opt:
  – Use a “virtual source level” representation
  – Loop interchange, etc.
• Low-level opt:
  – Clean up parser refuse
  – Each “optimization pass” runs as a filter
  – Enhance parallelism
• Code generation:
  – Allocate registers
  – Schedule code for high performance
• [Figure: compiler pipeline]
  Parse -> high-level intermediate language -> high-level optimize ->
  low-level intermediate language -> low-level optimize ->
  code generation (allocate, schedule) -> translate -> assembly code
Example: MIPS
• A load/store, fixed-encoding architecture that uses condition registers
• The opcode is in the same place for every instruction
• For semantics, see the H&P textbook, Fig. B.26 (p. B-40)

I-type instruction:  Opcode(6) | rs1(5) | rd(5) | Immediate(16)
  – Load, store, all immediate operations, conditional branches (rd unused)
  – Jump through register, call through register (“jump and link register”)

R-type instruction:  Opcode(6) | rs1(5) | rs2(5) | rd(5) | shamt(5) | Func(6)
  – Register-register ALU operations
  – “Func” is an opcode extension

J-type instruction:  Opcode(6) | Offset(26)
  – Offset added to PC
  – Jump, call (“jump and link”), trap and return from exception
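The fixed field layout can be modeled directly in C. A small sketch of packing and unpacking R-type fields (field order per the diagram above; not a full assembler):

```c
#include <assert.h>
#include <stdint.h>

/* R-type layout: opcode(6) rs1(5) rs2(5) rd(5) shamt(5) func(6),
   packed most-significant field first in a 32-bit word. */
static uint32_t encode_rtype(unsigned opcode, unsigned rs1, unsigned rs2,
                             unsigned rd, unsigned shamt, unsigned func)
{
    return (opcode & 0x3Fu) << 26 | (rs1 & 0x1Fu) << 21 |
           (rs2 & 0x1Fu) << 16 | (rd & 0x1Fu) << 11 |
           (shamt & 0x1Fu) << 6 | (func & 0x3Fu);
}

/* The opcode sits in the same place for every format, so decode
   hardware can extract it before knowing which format it has. */
static unsigned decode_opcode(uint32_t insn)
{
    return insn >> 26;
}
```

Because register specifiers also sit in fixed positions, the register file can be read speculatively at decode time, before the opcode is fully interpreted.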
MIPS-64 ISA
• Integer registers R0-R31, each 64 bits; R0 is permanently 0
• PC
• Floating-point registers F0-F31; each Fi holds either a single-precision (32-bit) value or a double-precision (64-bit) value
• Load/store architecture
• Transfer sizes: B (byte), H (halfword), W (word), D (double word)
• No unaligned accesses allowed
MIPS example code

      DADDI R1,R0,#10     ; put 10 into R1 (R0 = 0)
      LW    R2,A(R0)      ; put A in R2
Loop: L.D   F0, 0(R2)     ; load double FP value into F0
      ADD.D F4, F0, F2    ; add F2 to F0
      S.D   F4, 0(R2)     ; store result back to memory
      DSUB  R1,R1,#1      ; decrement i
      DADDI R2,R2,#8      ; increment loop pointer
      BNEZ  R1,Loop
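In C terms, the loop above computes roughly the following (the array name A and the scalar x are illustrative stand-ins for whatever memory A labels and whatever value F2 holds):

```c
#include <assert.h>

/* C sketch of the MIPS loop: walk 10 doubles, adding the value held
   in F2 (here x) to each element in place. */
void add_scalar(double *A, double x)   /* R2 holds the base, F2 holds x  */
{
    for (int i = 10; i != 0; i--) {    /* DADDI R1,R0,#10 ... DSUB, BNEZ */
        *A = *A + x;                   /* L.D / ADD.D / S.D              */
        A++;                           /* DADDI R2,R2,#8 (8 = sizeof(double)) */
    }
}
```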
Application-Specific ISA Extension
• Example: converting endian format

  Byte 3 | Byte 2 | Byte 1 | Byte 0   ->   Byte 0 | Byte 1 | Byte 2 | Byte 3

  unsigned ss = (s<<24) | ((s<<8)&0xff0000) | ((s>>8)&0xff00) | (s>>24);

• On a typical general-purpose processor this takes 9 instructions:

  slli  a9, a14, 24
  slli  a8, a14, 8
  srli  a10, a14, 8
  and   a10, a10, a11
  and   a8, a8, a13
  or    a8, a8, a9
  extui a9, a14, 24, 8
  or    a10, a10, a9
  or    a10, a10, a8
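The shift-and-mask expression from the slide as a self-contained C function; this is exactly what the 9-instruction sequence computes, and what a single custom instruction would replace:

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the bytes of a 32-bit word. */
static uint32_t byteswap32(uint32_t s)
{
    return (s << 24) | ((s << 8) & 0x00ff0000u) |
           ((s >> 8) & 0x0000ff00u) | (s >> 24);
}
```

Note that swapping twice returns the original word, which makes the function easy to test.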
Application-Specific ISA Extension
• New instruction based on a hardware extension for the byte-swapping operation
• Using the Tensilica Instruction Extension (TIE) language:

  operation BYTESWAP {out AR outR, in AR inpR} {}
  {
    wire [31:0] reg_swapped = {inpR[7:0], inpR[15:8], inpR[23:16], inpR[31:24]};
    assign outR = reg_swapped;
  }

• The new instruction: BYTESWAP a10, a9
• A speedup of 9x!
Overhead with Instruction Extension
• To support new instructions defined in the ISA, we need
  – Hardware function units to implement the operations
  – The compiler (or programmer) to generate code that takes advantage of the new instructions
• Design flow of the Xtensa microprocessor core:
  – Compile and profile the application code; develop a processor configuration and extensions
  – Feed the processor specification to the processor generator
  – The generator produces the configured processor RTL and tailored software development tools (compiler, debugger, assembler, etc.)
Common ISA Extension Techniques
• Fusion
  – Turn a data-dependence subgraph into a single instruction
• SIMD/Vector transformation
  – Data-level parallelism
  – Width of the vector register file
• VLIW (called FLIX in Tensilica)
  – Parallel operations determined by the compiler
  – Instruction-level parallelism
Operation Fusion

  unsigned short *a, *b, *c;
  for (i = 0; i < n; i++)
    c[i] = (a[i] + b[i]) >> 1;

• A new operation results from fusing the add and shift operations:

  operation AVERAGE {out AR res, in AR input0, in AR input1} {}
  {
    wire [16:0] tmp = input0[15:0] + input1[15:0];
    assign res = tmp[16:1];
  }

• The C/C++ code or assembly code can use the new AVERAGE instruction as follows:

  for (i = 0; i < n; i++)
    c[i] = AVERAGE(a[i], b[i]);
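A scalar C model of the fused operation may help: the 17-bit `tmp` wire keeps the carry out of the 16-bit add, so the average stays exact even when a[i] + b[i] overflows 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the AVERAGE TIE operation: form the 17-bit sum, then take
   bits [16:1], i.e., drop the low bit and keep the carry. */
static uint16_t average_u16(uint16_t a, uint16_t b)
{
    uint32_t tmp = (uint32_t)a + (uint32_t)b;   /* wire [16:0] tmp */
    return (uint16_t)(tmp >> 1);                /* res = tmp[16:1] */
}
```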
SIMD/Vector transformation

  regfile VEC 64 8 v
  operation VAVERAGE {out VEC res, in VEC input0, in VEC input1} {}
  {
    wire [67:0] tmp = {input0[63:48] + input1[63:48],   // a 17-bit result
                       input0[47:32] + input1[47:32],   // a 17-bit result
                       input0[31:16] + input1[31:16],   // a 17-bit result
                       input0[15:0]  + input1[15:0]};   // a 17-bit result
    assign res = {tmp[67:52], tmp[50:35], tmp[33:18], tmp[16:1]};
  }

• The C/C++ code or assembly code can use the new VAVERAGE instruction as follows:

  for (i = 0; i < n; i += 4)
    c[i] = VAVERAGE(a[i], b[i]);
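A C model of the vector version, with four 16-bit lanes packed low-lane-first into a 64-bit value (illustrative only; real vector registers and intrinsics differ):

```c
#include <assert.h>
#include <stdint.h>

/* Model of VAVERAGE: four independent 16-bit averages on the lanes of
   a 64-bit vector register, mirroring the per-lane 17-bit adds in the
   TIE description above. */
static uint64_t vaverage(uint64_t x, uint64_t y)
{
    uint64_t res = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (uint32_t)(x >> (lane * 16)) & 0xFFFFu;
        uint32_t b = (uint32_t)(y >> (lane * 16)) & 0xFFFFu;
        uint64_t avg = (a + b) >> 1;        /* tmp[16:1]: fits in 16 bits */
        res |= avg << (lane * 16);
    }
    return res;
}
```

The data-level parallelism is explicit: each lane's result depends only on the matching lanes of the inputs, so the hardware can compute all four at once.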