Transcript of CENG 450 Computer Systems and Architecture, Lecture 12. Amirali Baniasadi, [email protected].
2
Multiple Issue
• Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
• Superscalar processors
• Very Long Instruction Word (VLIW) processors
3
1990’s: Superscalar Processors
Bottleneck: CPI >= 1 is the limit on scalar (single-instruction-issue) performance.
Hazards. Superpipelining? Diminishing returns (hazards + overhead).
How can we make CPI = 0.5? Issue multiple instructions in every pipeline stage (superscalar).
Cycle:   1   2   3   4   5   6   7
Inst0:   IF  ID  EX  MEM WB
Inst1:   IF  ID  EX  MEM WB
Inst2:       IF  ID  EX  MEM WB
Inst3:       IF  ID  EX  MEM WB
Inst4:           IF  ID  EX  MEM WB
Inst5:           IF  ID  EX  MEM WB
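The timing table above can be sketched in a few lines. This is my own illustrative model (the function and names are not from the slides): an ideal 2-wide in-order pipeline with no hazards, where each pair of instructions moves through the five stages in lockstep.

```python
# Sketch (assumption, not from the slides): stage timing of an ideal
# width-way in-order pipeline with no hazards.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_insts, width=2):
    """Return {inst: {stage: cycle}} for an ideal width-way pipeline."""
    table = {}
    for i in range(n_insts):
        issue_cycle = i // width + 1          # instructions issue in pairs
        table[i] = {s: issue_cycle + k for k, s in enumerate(STAGES)}
    return table

t = schedule(6)
# Inst0 and Inst1 share every stage; Inst2/Inst3 trail them by one cycle.
assert t[0]["IF"] == t[1]["IF"] == 1
assert t[2]["IF"] == 2
assert t[5]["WB"] == 7     # matches the diagram: 6 instructions done by cycle 7
```

Six instructions retire by cycle 7, so for long runs the CPI approaches 0.5, which is the point of the slide.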
4
Superscalar Vs. VLIW
A religious debate, similar to RISC vs. CISC:
Wisconsin + Michigan (superscalar) vs. Illinois (VLIW).
Q: Who can schedule code better, hardware or software?
5
Hardware Scheduling
Pros:
• High branch prediction accuracy
• Dynamic information on latencies (cache misses)
• Dynamic information on memory dependences
• Easy to speculate (and recover from mis-speculation)
• Works for generic, non-loop, irregular code (e.g., databases, desktop applications, compilers)
Cons:
• Limited reorder buffer size limits "lookahead"
• High cost/complexity
• Slow clock
6
Software Scheduling
Pros:
• Large scheduling scope (full program), large "lookahead"; can handle very long latencies
• Simple hardware with fast clock
Cons:
• Only works well for "regular" codes (scientific, FORTRAN)
• Low branch prediction accuracy (can improve by profiling)
• No information on latencies like cache misses (can improve by profiling)
• Pain to speculate and recover from mis-speculation (can improve with hardware support)
7
Superscalar Processors
Pioneer: IBM (America => RIOS, RS/6000, Power-1)
Superscalar instruction combinations:
• 1 ALU or memory or branch + 1 FP (RS/6000)
• Any 1 + 1 ALU (Pentium)
• Any 1 ALU or FP + 1 ALU + 1 load + 1 store + 1 branch (Pentium II)
Impact of superscalar:
• More opportunity for hazards (why?)
• More performance loss due to hazards (why?)
8
Superscalar Processors
• Issues a varying number of instructions per clock
• Scheduling: static (by the compiler) or dynamic (by the hardware)
• Superscalar issues a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo)
• Examples: IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
9
Elements of Advanced Superscalars
• High-performance instruction fetching: good dynamic branch and jump prediction; multiple instructions per cycle, multiple branches per cycle?
• Scheduling and hazard elimination: dynamic scheduling (not necessarily: the Alpha 21064 and the Pentium were statically scheduled); register renaming to eliminate WAR and WAW hazards
• Parallel functional units; paths/buses/multiple register ports
• High-performance memory systems
• Speculative execution
10
SS + DS + Speculation
Superscalar + dynamic scheduling + speculation: three great tastes that taste great together.
• CPI >= 1? Overcome with superscalar.
• Superscalar increases hazards? Overcome with dynamic scheduling.
• RAW dependences still a problem? Overcome with a large window.
• Branches a problem for filling a large window? Overcome with speculation.
11
The Big Picture
[Diagram: static program -> instruction fetch & branch prediction -> instruction issue -> instruction execution -> instruction reorder & commit]
12
Superscalar Microarchitecture
[Block diagram: pre-decode -> instruction cache -> instruction buffer -> decode/rename/dispatch -> integer and floating-point instruction buffers, backed by integer and floating-point register files -> functional units (with data cache) -> memory interface -> reorder and commit]
13
Register renaming methods
First method: a physical register file separate from the logical (architectural) register file. A mapping table associates each logical register with the physical register holding its current value, and a free list tracks unused physical registers. The physical register file is bigger than the logical register file.
Second method: the physical register file is the same size as the logical one; in addition, a buffer with one entry per in-flight instruction is used: the reorder buffer.
14
Register renaming: first method
Example: add r3, r3, 4
Before: the mapping table maps r0..r4 to physical registers (r3 -> R9); the free list holds R2, R6, R13.
After: r3 is remapped to R2, taken from the free list, so the free list now holds R6, R13. The old mapping (R9) is still read as the instruction's source.
[Diagram: mapping table and free list before and after renaming]
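The mapping-table-plus-free-list method can be sketched directly. This is a hedged illustration (class and method names are mine, values follow the slide's add r3, r3, 4 example):

```python
# Hypothetical sketch of the first renaming method: a mapping table from
# logical to physical registers plus a free list of physical registers.
class Renamer:
    def __init__(self, mapping, free_list):
        self.map = dict(mapping)        # logical -> physical
        self.free = list(free_list)     # unused physical registers

    def rename(self, dst, srcs):
        """Rename sources via the table; give dst a fresh physical reg."""
        phys_srcs = [self.map[s] for s in srcs]   # read current mappings
        new_dst = self.free.pop(0)                # allocate from free list
        self.map[dst] = new_dst                   # later readers see new reg
        return new_dst, phys_srcs

# Slide example: add r3, r3, 4 with r3 currently mapped to R9.
r = Renamer({"r0": "R8", "r1": "R7", "r2": "R5", "r3": "R9", "r4": "R1"},
            ["R2", "R6", "R13"])
dst, srcs = r.rename("r3", ["r3"])
assert srcs == ["R9"]            # old mapping used for the source
assert dst == "R2"               # fresh register from the free list
assert r.free == ["R6", "R13"]
```

Because the destination gets a brand-new physical register, any younger instruction that writes r3 can no longer create a WAW or WAR hazard with this one.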
15
More Realistic HW: Register Impact
[Chart: instruction issues per cycle (IPC, 0-60) for gcc, espresso, li, fpppp, doducd, and tomcatv, with Infinite / 256 / 128 / 64 / 32 / None renaming registers]
Effect of limiting the number of renaming registers:
• Integer programs: IPC 5 - 15
• FP programs: IPC 11 - 45
16
Reorder Buffer
• Reserve an entry at the tail when the instruction is dispatched
• Place data in the entry when execution finishes
• Bypass results to other instructions when needed
• Remove the entry from the head when the instruction completes
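The four bullet points above amount to a FIFO with out-of-order fill and in-order drain. A minimal sketch, with illustrative names of my own choosing:

```python
from collections import deque

# Minimal reorder-buffer sketch (field names are illustrative assumptions).
class ROB:
    def __init__(self):
        self.entries = deque()                 # head = oldest instruction

    def dispatch(self, inst):
        entry = {"inst": inst, "value": None, "done": False}
        self.entries.append(entry)             # reserve entry at the tail
        return entry

    def finish(self, entry, value):
        entry["value"] = value                 # place data when exec finishes
        entry["done"] = True                   # value can now be bypassed

    def retire(self):
        """Remove from the head only when the oldest entry is complete."""
        if self.entries and self.entries[0]["done"]:
            return self.entries.popleft()
        return None                            # in-order retirement stalls

rob = ROB()
a = rob.dispatch("add r1,r2,r3")
b = rob.dispatch("mul r4,r1,r5")
rob.finish(b, 42)                              # younger finishes first...
assert rob.retire() is None                    # ...but cannot retire early
rob.finish(a, 7)
assert rob.retire()["inst"] == "add r1,r2,r3"  # in-order commit
```

The asymmetry is the point: entries fill out of order as execution completes, but leave strictly in program order, which is what makes precise exceptions and speculation recovery possible.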
17
Register renaming: reorder buffer
Example: add r3, r3, 4
Before the instruction, the latest writer of r3 is ROB entry rob6, so the source r3 is renamed to rob6: add r3, rob6, 4.
The destination is then allocated a new ROB entry, rob8: add rob8, rob6, 4.
[Diagram: mapping table and reorder buffer before and after renaming]
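The second renaming method, where ROB entry numbers serve as the rename tags, can be sketched the same way. Class and field names here are my own; the values follow the slide's rob6/rob8 example:

```python
# Hedged sketch of the second renaming method: the reorder buffer itself
# supplies the rename tags (entry numbers rob0, rob1, ...).
class ROBRenamer:
    def __init__(self, next_entry, producers):
        self.next = next_entry        # next free ROB entry number
        self.producers = producers    # reg -> ROB entry of its latest writer

    def rename(self, dst, srcs):
        # A source reads either the register file (no in-flight writer)
        # or the ROB entry of its latest in-flight producer.
        ops = [self.producers.get(s, s) for s in srcs]
        tag = f"rob{self.next}"       # destination gets the new ROB entry
        self.producers[dst] = tag
        self.next += 1
        return tag, ops

# Slide example: before "add r3, r3, 4", rob6 is the latest writer of r3.
ren = ROBRenamer(next_entry=8, producers={"r3": "rob6"})
dst, ops = ren.rename("r3", ["r3"])
assert (dst, ops) == ("rob8", ["rob6"])   # i.e., add rob8, rob6, 4
```

Unlike the first method, no separate physical register file is needed; the ROB entry doubles as the storage for the not-yet-committed value.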
18
Instruction Buffers
[Same block diagram as slide 12: pre-decode -> instruction cache -> instruction buffer -> decode/rename/dispatch -> integer and floating-point instruction buffers and register files -> functional units (with data cache) -> memory interface -> reorder and commit; this slide highlights the instruction buffers]
19
Issue Buffer Organization
a) Single, shared queue: no out-of-order issue, no renaming.
b) Multiple queues, one per instruction type: no out-of-order issue inside a queue, but the queues issue out of order with respect to each other.
20
Issue Buffer Organization
c) Multiple reservation stations (one per instruction type, or one big pool):
• No FIFO ordering: execution starts as soon as operands are ready and a functional unit is available
• Proposed by Tomasulo
[Diagram: reservation stations fed from instruction dispatch]
21
Typical reservation station
Fields: operation | source 1, data 1, valid 1 | source 2, data 2, valid 2 | destination
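A reservation-station entry with exactly these fields can be sketched as follows. This is an illustration with hypothetical names, showing how an entry snoops broadcast results until both operands are valid:

```python
# Illustrative reservation-station entry with the fields from the slide:
# operation, (source tag, data, valid) x 2, and a destination tag.
class RSEntry:
    def __init__(self, op, src1, src2, dest):
        self.op, self.dest = op, dest
        # valid=False means the operand is still being produced; the
        # "source" field then names the producer tag to snoop for.
        self.src = [src1, src2]
        self.data = [None, None]
        self.valid = [False, False]

    def capture(self, tag, value):
        """Snoop a broadcast result; latch it if a waiting source matches."""
        for i in (0, 1):
            if not self.valid[i] and self.src[i] == tag:
                self.data[i], self.valid[i] = value, True

    def ready(self):
        return all(self.valid)        # both operands present -> can issue

e = RSEntry("add", src1="rob3", src2="rob5", dest="rob7")
e.capture("rob3", 10)
assert not e.ready()                  # still waiting on rob5
e.capture("rob5", 32)
assert e.ready() and e.data == [10, 32]
```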
22
Memory Hazard Detection Logic
[Diagram: load and store addresses from instruction issue pass through address add & translation into a load address buffer and a store address buffer; an address compare between the buffers drives the hazard control logic, which gates requests to memory]
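The address-compare idea behind the hazard control logic can be sketched in one function. This is my own simplified formulation, not the actual circuit: a load may go to memory only if it conflicts with no buffered older store.

```python
# Sketch of the address-compare check: a load may bypass earlier stores
# only if its address matches none of the buffered store addresses.
def load_may_issue(load_addr, store_addr_buffer):
    """Conservative check against all older, un-retired stores."""
    for st_addr in store_addr_buffer:
        if st_addr is None:          # store address not computed yet:
            return False             # must wait (possible dependence)
        if st_addr == load_addr:     # true memory dependence
            return False
    return True

stores = [0x1000, 0x2040]
assert load_may_issue(0x3000, stores)        # disjoint address: OK
assert not load_may_issue(0x2040, stores)    # conflicts with a store
assert not load_may_issue(0x3000, [None])    # unknown store address: stall
```

Real hardware typically compares at cache-line or word granularity and may forward the store's data instead of stalling, but the basic load/store address comparison is the same.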
23
Example
MIPS R10000, Alpha 21264, AMD K5: self-study. READ THE PAPER.
24
VLIW
VLIW: Very Long Instruction Word. An in-order pipe, but each "instruction" is N operations (a VLIW).
Typically "slotted" (i.e., the 1st must be an ALU op, the 2nd must be a load, etc.).
A VLIW travels down the pipe as a unit; the compiler packs independent instructions into each VLIW.
[Diagram: IF -> ID -> parallel slots (ALU, ALU, address/MEM, FP Add) -> WB for each slot]
25
Very Long Instruction Word
VLIW issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions.
• A fixed number of instructions (4-16) scheduled by the compiler; operations are placed into wide templates
• Started with microcode ("horizontal microcode")
• Joint HP/Intel agreement in 1999/2000: Intel Architecture-64 (IA-64), a 64-bit address space / Itanium; Explicitly Parallel Instruction Computing (EPIC)
• Transmeta: translates x86 to VLIW
• Many embedded controllers (TI, Motorola) are VLIW
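The compiler's packing job, placing independent operations into fixed slots, can be sketched as a toy greedy scheduler. This is my own simplified formulation (slot names and the dependence check are assumptions, not any real VLIW compiler's algorithm):

```python
# Toy sketch of compiler slot packing: greedily place independent
# instructions into fixed slots (two ALU, one MEM, one FP) per long word.
def pack_vliw(insts, slots=("alu0", "alu1", "mem", "fp")):
    """insts: list of (kind, dst, srcs). Returns a list of VLIW words."""
    words, word, written = [], {}, set()
    for kind, dst, srcs in insts:
        free = next((s for s in slots
                     if s.startswith(kind) and s not in word), None)
        # Start a new word if no matching slot is free or a RAW
        # dependence exists on something already in this word.
        if free is None or any(s in written for s in srcs):
            words.append(word)
            word, written = {}, set()
            free = next(s for s in slots if s.startswith(kind))
        word[free] = (kind, dst, srcs)
        written.add(dst)
    if word:
        words.append(word)
    return words

prog = [("alu", "r1", ["r2"]), ("mem", "r3", ["r4"]),
        ("alu", "r5", ["r1"])]            # depends on r1 -> forced to word 2
ws = pack_vliw(prog)
assert len(ws) == 2 and "mem" in ws[0]
```

Unfilled slots in a word are wasted bits, which is exactly the code-size problem raised later in these slides.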
26
Pure VLIW: What Does VLIW Mean?
• All latencies fixed
• All instructions in a VLIW issue at once
• No hardware interlocks at all
• The compiler is responsible for scheduling the entire pipeline, including stall cycles; possible only if you know the structure of the pipeline and the latencies exactly
27
Problems with Pure VLIW
Latencies are not fixed (e.g., caches):
• Option I: don't use caches (forget it)
• Option II: stall the whole pipeline on a miss?
• Option III: stall only the instructions waiting for memory? (needs out-of-order)
Different implementations have different pipe depths and different latencies, so a new pipeline may produce wrong results (the code stalls in the wrong place). Recompile for every new implementation?
Code compatibility is very important; it made Intel what it is.
28
Key: Static Scheduling
VLIW relies on the fact that software can schedule code well.
Loop unrolling (we have seen this one already):
• Code growth
• Poor scheduling across unrolled copies
29
Limits to Multi-Issue Machines
Inherent limitations of ILP:
• 1 branch in every 5 instructions => how do we keep a 5-way VLIW busy?
• Latencies of units => many operations must be scheduled
• Need about (pipeline depth x number of functional units) independent operations to keep the machine busy
Difficulties in building the HW:
• Duplicate functional units to get parallel execution
• More ports on the register file
• More ports to memory
• Superscalar decoding and its impact on clock rate and pipeline depth: complexity-effective designs
30
Limitations specific to either the superscalar or the VLIW implementation:
• Decode/issue complexity in superscalar
• VLIW code size: unrolled loops + wasted fields in the long word
• VLIW lock step => 1 hazard stalls all instructions
• VLIW & binary compatibility
31
Multiple Issue Challenges
While an integer/FP split is simple for the HW, you get a CPI of 0.5 only for programs with exactly 50% FP operations and no hazards.
If more instructions issue at the same time, decode and issue get harder: even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue.
VLIW trades instruction space for simple decoding:
• The long instruction word has room for many operations
• By definition, all the operations the compiler puts in the long instruction word are independent => they execute in parallel
• E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
• At 16 to 24 bits per field, that is 7*16 = 112 to 7*24 = 168 bits wide
Needs a compiling technique that schedules across several branches.
32
HW Support for More ILP
How is this used in practice? Rather than predicting the direction of a branch, execute the instructions on both sides!
We know the target of a branch early, long before we know whether it will be taken. So begin fetching/executing at that new target PC, but also continue fetching/executing as if the branch were NOT taken.
33
Studies of ILP
Conflicting studies of the amount of improvement available:
• Benchmarks (vectorized FP Fortran vs. integer C programs)
• Hardware sophistication
• Compiler sophistication
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
34
Summary
Static ILP: simple hardware (fast clock), advanced loads, predication; complex compilers.
VLIW
35
Summary
Dynamic ILP:
• Instruction buffer: split ID into two stages, one for in-order and the other for out-of-order issue
• Scoreboard: out-of-order, but doesn't deal with WAR/WAW hazards
• Tomasulo's algorithm: uses register renaming to eliminate WAR/WAW hazards
• Dynamic scheduling + speculation
• Superscalar