University of Amsterdam
Computer Architecture: A bottom-up perspective
Andy Pimentel
Computer Architecture Modeling & Simulation group
Course material
Book: J.L. Hennessy and D.A. Patterson, "Computer Architecture, A Quantitative Approach", 3rd ed.
Other nice book: D. Sima, T. Fountain and P. Kacsuk, "Advanced Computer Architecture, A Design Space Approach"
Sheets available at website (http://www.science.uva.nl/˜andy/aca.html)
Idem for schedule, practical assignments, deadlines, etc.
Outline
Memory hierarchy
  DRAM
  Caches: from concept to implementation
Pipelined processors
  Pipeline hazards
  Some design space issues
Modern superscalar processors
  Decoding, dispatching, issuing and execution of instructions
  Register renaming
  Sequential consistency, exception handling
  Branch prediction
Outline (cont’d)
Application specific optimizations
  SIMD instructions
  Data prefetching
Case studies
  Compaq Alpha 21264, HP PA-8700, IBM POWER 4, Intel Pentium 4
VLIW processors
  Philips TriMedia
  Intel/HP IA64 (Itanium 2)
  Transmeta Crusoe
Embedded processors
Outline (cont’d)
Parallel computers
  Interconnection networks
    Topology, switching, routing, etc.
  Memory hierarchy
    Shared/distributed memory, cache coherency, etc.
  Case studies
Future directions
  Super-speculative processors
  Trace/Multiscalar processors
  Simultaneous multithreading
  I(ntelligent)RAMs
  ...
Memory hierarchy: DRAM
8 to 16 times slower than SRAM
More dense than SRAM (e.g. SRAM needs about 6 transistors/cell)
RAS/CAS addressing using time multiplexing
Needs refreshing
Cycle time roughly 2 times the access time
Processor–Memory speed gap is widening
  Processors: 50% to 100% faster/year (Moore's Law)
  DRAM cycle time improves 7%/year
RAMs
Figure: a DRAM cell (one transistor plus a storage capacitor on an address line and bit line B) versus an SRAM cell (six transistors T1-T6 on an address line and complementary bit lines).
DRAM (cont’d)
RAS/CAS addressing
Figure: a memory array of 1-transistor capacitor cells addressed via RAS and CAS.
Step 1: Row Address Select (RAS)
Step 2: Column Address Select (CAS) — selects the bit
Refresh: read and write back a whole row
DRAM (cont’d)
Refresh time typically in the tens of milliseconds
Number of refresh cycles dependent on number of rows
Two types of refreshing
Figure: burst refresh (all refresh cycles performed back-to-back within the refresh time) versus distributed refresh (refresh cycles spread evenly over the refresh time).
DRAM (cont’d)
Improving bandwidth (not latency!) by exploiting spatial locality
One RAS, multiple CAS addresses
  Fast page mode DRAMs
  E(xtended) D(ata) O(utput) RAM
Burst mode DRAMs: one RAS and CAS address for a whole burst
  Burst EDO RAM
  SDRAM
Or by improving the interface: SDRAM and Rambus
DRAM (cont’d)
Figure: address/data timing of EDO RAM (one RAS plus a CAS per data word) versus 2-bit Burst EDO RAM (one ROW address and column addresses COL1-COL3, with DATA1-DATA3 returned in a burst).
DRAM (cont’d)
SDRAM changed the DRAM interface from asynchronous to synchronous
Figure: timing of synchronous DRAM versus standard DRAM — driven by a clock, the address latch, decode, read/write and output phases of successive addresses (Addr1, Addr2, ...) overlap, whereas a standard DRAM handles one address at a time.
Brought (a sort of) pipelining to DRAMs
DDR-SDRAM (Double Data Rate) transfers data on bothrising and falling clock edges
DRAM (cont’d)
Rambus (RDRAM)
Interface using a split-transaction (= packet-switched) bus (pipelining!)
Separate row, column address/control and (18-bit) data lines
So, three transactions can be active at the same time
High clock rate (400 MHz), but long latency
Data can be transferred on both clock edges
DRAM (cont’d)
Interleaved memory: multiple banks
Figure: consecutive memory words 0-11 interleaved across Bank 0 - Bank 3 (word i lives in bank i mod 4).
Optimizes sequential accesses and can hide refresh cycles
Problem: aliased accesses (strides that keep hitting the same bank, see the sketch below)
  Remedies: a large number of banks (NEC SX/3: 128 banks!), or a prime number of banks
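A small C illustration (not from the slides) of the aliasing problem: with a power-of-two bank count, a stride equal to that count maps every access to the same bank, while a prime bank count spreads the accesses.

    #include <stdio.h>

    /* Illustrative only: which bank does the i-th access of a strided stream hit? */
    static void show_banks(int num_banks, int stride) {
        printf("%d banks, stride %d:", num_banks, stride);
        for (int i = 0; i < 8; i++)
            printf(" %d", (i * stride) % num_banks);   /* bank of the i-th access */
        printf("\n");
    }

    int main(void) {
        show_banks(4, 4);   /* aliased: every access hits bank 0          */
        show_banks(5, 4);   /* prime bank count: accesses spread out      */
        return 0;
    }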
Memory hierarchy: caches
Performance gap between processor and main memory → apply caching (basically a poor man's solution)
Caches are "small" and fast memories (close to the processor, typically SRAM)
Nowadays, 2 (or 3) levels of cache between processor and main memory
Caches are transparent to the user (important!...however...)
Caches (cont’d)
Cache exploits locality in software
Temporal locality: a referenced item tends to be referenced again soon
  Instructions
  Data??
Spatial locality: items close to a referenced item tend to be referenced soon
  Instructions + data
Caches (cont’d)
Instruction, data or unified caches
Address cache (TLB – Translation Lookaside Buffer)
  Caches VA → PA translations
  Split I + D TLBs or unified, sometimes 2 levels
Three common implementations
  Direct mapped
  Fully associative
  Set-associative
Caches (cont’d)
Instruction and data caches store cache blocks (also called cache lines)
Figure: a cache block entry consists of a Tag (the higher-order address bits), a Valid bit (V), a Dirty bit (D) and the data (typically 16 - 128 bytes).
Cache implementations
Direct mapped cache (often 2nd-level cache)
Figure: a direct mapped cache with 8 cache blocks in front of a main memory of 16 blocks of 16 bytes each. The 16-bit memory address is split into a 9-bit tag, a 3-bit block index and a 4-bit byte offset; the block index selects one cache block, whose stored tag is compared with the address tag to detect a hit.
Simple hardware & high speed access
Rigid mapping: many memory blocks map onto one cache block → large cache size required
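As a concrete illustration of the address split in the figure above, a minimal C sketch of a direct-mapped lookup; the 9/3/4-bit split follows the 16-bit example on this slide, and the structure and names are illustrative rather than a real implementation:

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BITS 4              /* 16-byte blocks -> 4 offset bits */
    #define INDEX_BITS 3              /* 8 cache blocks -> 3 index bits  */

    struct cache_block {
        bool     valid;
        uint16_t tag;                 /* the remaining 9 higher-order bits */
        uint8_t  data[1 << BLOCK_BITS];
    };

    static struct cache_block cache[1 << INDEX_BITS];

    /* Returns true on a hit for the given 16-bit address. */
    static bool lookup(uint16_t addr) {
        uint16_t offset = addr & ((1u << BLOCK_BITS) - 1);            /* byte within block */
        uint16_t index  = (addr >> BLOCK_BITS) & ((1u << INDEX_BITS) - 1);
        uint16_t tag    = addr >> (BLOCK_BITS + INDEX_BITS);
        (void)offset;                 /* the offset would select the byte within the block */
        return cache[index].valid && cache[index].tag == tag;
    }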
Cache implementations (cont’d)
Fully associative cache (e.g. TLB, branch history table)
Figure: a fully associative cache with 8 blocks in front of the same 16-block main memory. The 16-bit memory address is split into a 12-bit tag and a 4-bit byte offset; the tag is compared against all stored tags in parallel to detect a hit.
Very flexible mapping (few conflicts)
CAMs (Content Addressable Memory) are expensive → small caches → multimedia applications are often a killer for TLBs
Cache implementations (cont’d)
Set-associative cache (often 1st-level cache)
Figure: a 2-way set-associative cache with 4 sets of 2 blocks in front of the same 16-block main memory. The 16-bit memory address is split into a 10-bit tag, a 2-bit set index and a 4-bit byte offset; the set index selects a set, and the tag is compared against the tags of the blocks in that set.
Performance similar to a fully associative cache, but less expensive
Virtually vs physically addressed cache
Virtually addressed
Figure: the CPU indexes the cache with the virtual address (VA); the MMU translates VA → PA only on the way to memory.
Parallel VA translation and cache lookup
Aliasing problem
Virtually vs physically addressed cache (cont’d)
Physically addressed
Figure: the MMU translates the VA to a PA before the cache is accessed; the cache is indexed with the physical address.
Slowdown on address translation
No aliasing problem
Virtually vs physically addressed cache (cont’d)
Virtually-indexed, physically-tagged cache
  Cache indexing during translation
  Page offset bits in the VA are used as cache index
Number of sets in the cache is limited (dependent on the page size)!
  Solutions: large cache sets or page colouring (OS support)
Figure: the page offset bits are identical in the VA and the PA and can therefore be used as cache index while the remaining bits are translated.
Cache strategies
Replacement strategies in (set-)associative caches
  Random, FIFO, Least Recently Used (LRU)
Write strategies
  Write-through
  Write-back
Write-miss strategies
  Allocate on write (write-back cache): with/without fetch
  No allocate on write (write-through cache)
Cache misses
Three types of cache misses
Compulsory (cold-start) cache miss
  The data block is read for the first time
Capacity cache miss
  The data block has been replaced (cache too small)
Conflict cache miss
  The data block has been replaced (associativity too low)
Reducing the miss rate & penalty
More levels of cache
Critical-word-first read strategy
Lockup-free cache
  Multiple outstanding requests (reads/writes), write buffers
Sub-blocks
  Figure: cache block with sub-blocks — Tag | D V Subblock0 | D V Subblock1 (D = dirty bit, V = valid bit)
Prefetching (explained later in detail)
Victim cache (basically increases associativity)
Pipelined processors
RISC(y) processors dominate the microprocessor market
A few addressing modes, fixed instruction format, load-store architecture
Take advantage of pipelining
Figure: successive instructions flow through the IF, ID, EX and WB stages in an overlapped fashion.
IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, WB = Write Back
Traditionally no microcode...not true anymore (are we returning to CISC?)
Caching essential due to larger code size and pipelining
Pipelined processors
Functionality of our simple pipeline:
IF : Fetch instruction from Icache, update PC
ID : Decode instruction, fetch operands from registers
EX : Execute instruction
  Use ALU for arithmetic instructions
  Access memory for load/store
  Determine branch taken or not
WB : Write back result (from ALU, memory or PC) to register
Pipeline hazards
There are three types of pipeline hazards:
Structural hazard due to resource conflicts. For example, two instructions using the ALU at the same clock cycle (in this example the MUL takes 2 cycles) → the pipeline needs to be stalled, causing a bubble
Figure: A = B*C (MUL B,C) occupies the EX stage for two cycles, so the following E = E + 1 (INC E) must wait one cycle in ID, creating a pipeline bubble before its EX and WB stages.
Pipeline hazards (cont’d)
Control hazard : Due to branching
Figure: when a branch (BRA) is encountered, fetching of the next instruction is delayed (the branch delay) until the branch outcome — taken or not taken — is known in EX.
In this example, the pipeline is stalled when a branch is encountered
Pipeline hazards (cont’d)
Reducing control stalls
Branch delay slot: delayed branching
Figure: the instruction sequence ld r1,a; add r2,1,r3; bra; add r4,r2,r4; sub r0,r7,r6; st 0(r0),r3 (addresses 100-106) shown in three versions: the original code with a hardware-interlocked branch, a software-interlocked version with a no-op in the branch delay slot, and the delayed-branch optimization in which a useful instruction is moved into the branch delay slot.
Assume branch delay of 1 cycle
Nullifying: also schedule instructions from the taken/untaken path
Pipeline hazards (cont’d)
Reducing control stalls (cont’d)
Branch prediction
  Predict always-taken/always-untaken
  Static prediction (compiler)
    Extra bit in the branch instruction to guide the IF unit or to optimize the branch delay slot technique
    Based on heuristics or profiling
Dynamic prediction by hardware: discussed later on
Pipeline hazards (cont’d)
Reducing control stalls (cont’d)
Predication
Figure: a conditional branch on C that skips between instructions i+1..i+6 (an if-then-else with a jump to i+7) versus predicated execution, where the instructions of both paths are guarded with predicates (C) and (!C) and executed in a straight line up to instruction i+7.
Pipeline hazards (cont’d)
Data hazard : Due to data dependencies between source and destination operands
Figure: A := B + C followed by A := A + 1; the second instruction must wait two cycles (two bubbles) until the first has produced A.
Pipeline hazards (cont’d)
Three types of data hazards:
RAW : Read After Write (true dependency)
  An operand is read before it is updated by a previous instruction
WAR : Write After Read (anti dependency)
  An operand is updated before it is read by a previous instruction
WAW : Write After Write (output dependency)
  An operand is written before it is written by a previous instruction
WAR and WAW dependencies are false dependencies
Pipeline hazards (cont’d)
Data hazards (cont’d)
ADD R3, R1, 6   # R3 = R1 + 6
ADD R4, R3, R2  # RAW hazard

ADD R4, R3, R2
ADD R3, R1, 6   # WAR hazard

ADD R4, R1, R2
ADD R4, R3, 7   # WAW hazard
In a simple pipeline, only RAW hazards can occur
Pipeline hazards (cont’d)
Avoiding RAW data hazards
Data forwarding: an execution unit (ALU) bypass
Figure: with forwarding, the result B + C of A := B + C is bypassed from the EX stage directly into the EX stage of the following A := A + 1, so the second instruction no longer has to wait for the write-back of A.
Pipeline hazards (cont’d)
Avoiding RAW data hazards (cont’d)
Instruction scheduling
Figure: instruction scheduling example — ld r1,a followed by add r1,r1,1 stalls because the loaded value is available only after the MEM stage; a software-interlocked version inserts no-ops, while the scheduled version moves the independent sub r2,r4,r7 and and r3,r3,r5 between the load and the add to hide the load delay.
Finding the optimal schedule is an NP-hard problem
Static scheduling (compiler), dynamic (hardware) or hybrid
Pipeline hazards (cont’d)
Avoiding WAR and WAW hazards (false dependencies): register renaming
Old code:
ST 0(R5), R4
ADD R4, R3, 7   # WAR hazard
New code:
ST 0(R5), R4
ADD R6, R3, 7   # R4 renamed to R6
Some basic pipeline design space issues
Depth: number of stages
  Superpipelined: large number of stages, high clock frequency, but more hazards
Number of execution units
  Scalar ILP (Instruction Level Parallelism) processors: sequential instruction issue to execution units, parallel execution
Number of pipelines
  Superscalar ILP processors: parallel instruction issue, parallel execution
Pipelined processors (cont’d)
Scalar versus Superscalar
Figure: a scalar ILP pipeline issues one instruction at a time from IF/ID into one of several execution units (EX1-EX3, then WB), whereas a superscalar ILP pipeline issues several instructions in parallel.
Need to preserve sequential consistency (WAR & WAW hazards, exceptions)!
Today’s superscalar processors
Some of the issues that will be touched
Parallel decoding, multi-way issuing and out-of-order execution
Preserving sequential consistency
Exceptions
Branch prediction
Application specific optimizations (SIMD instructions, data prefetching)
Instruction decoding
Figure: the instruction fetch stage reads from the instruction cache into an instruction buffer; the decode/issue stage picks instructions from an issue window and issues up to the issue width per cycle.
To speed up decoding, instructions are predecoded in the I-cache
The I-cache often prefetches instructions (on a miss on block i, fetch block i and prefetch block i + 1 when possible)
Superscalar instruction issue
Blocking (direct) versus non-blocking (shelved) issue
Issue blockages
  Dependencies in the window of fetched instructions (older processors)
  Resource contention
Handling issue blockages
  In-order versus out-of-order issue
  Aligned versus unaligned issue
Superscalar instruction issue (cont’d)
Figure: issue of a window of fetched instructions a, b, c, ... over successive cycles (c marked as a dependent instruction, the others independent), comparing in-order versus out-of-order issue, and aligned issue (fixed window, advanced only when emptied) versus unaligned issue (gliding window); with in-order issue the dependent instruction also blocks the issue of later instructions.
Non-blocking issue (shelving)
1. Dispatch instructions to a buffer (check for structural hazards)
2. Issue instructions from the buffer when operands are available
Usually in-order, aligned dispatch
Note: throughout the literature, the terms instruction issue and dispatch are ambiguous! (my usage differs from H&P!)
Figure: the decode/dispatch stage dispatches instructions from the instruction buffer into shelving buffer(s), checking structural hazards; instructions are issued from the shelving buffers to the EX units when their operands are available, resolving data hazards.
Shelving (cont’d)
Types of shelving buffers
  Scoreboard buffers
  Reservation stations associated with EX units (individual, grouped or central)
  Combined buffer for register renaming, shelving and instruction reordering: ROB (ReOrder Buffer)
Number of shelves nowadays > 30, e.g.
  AMD Athlon (K7) : 72
  HP PA-8x00 : 56
  DEC/Compaq Alpha 21264 : 35
Shelving (cont’d)
Issue order
Figure: from the shelving buffer, instructions can be checked and issued either strictly in-order or out-of-order.
Nowadays mostly out-of-order issue
Issue rate: how many instructions can be issued per cycle from the shelving buffer(s)
Out-of-order execution
An instruction is issued to an execution unit when its operands are available (no RAW hazards): this allows for out-of-order execution (dynamic scheduling)
In general, there are two schemes to control out-of-order execution
  Scoreboarding
  Tomasulo scheduling
Scoreboarding
Introduced in the CDC 6600 (1964!) to strive for 1 IPC
Scoreboard keeps track of the state of instructions and registers
  Entries in the shelving buffer store operand locations and bits indicating their availability
  Registers include an extra bit indicating their validity
    At issue, if the destination register is valid, then mark it invalid; otherwise block (WAW hazard). Validate the bit at WB while checking for WAR hazards
    If a source register is invalid, then block (RAW hazard)
Explicit register renaming to avoid WAW and WAR hazards
Traditional scoreboarding (without renaming) implements in-order issue (see the sketch below)
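A minimal C sketch of the issue rules listed above — register valid bits checked at issue, the destination marked invalid, and revalidation at write-back; names and structure are illustrative, not the CDC 6600 design:

    #include <stdbool.h>

    #define NREGS 32
    static bool reg_valid[NREGS];     /* extra validity bit per register */

    /* Issue check: returns true if the instruction may be issued now. */
    static bool try_issue(int src1, int src2, int dest) {
        if (!reg_valid[src1] || !reg_valid[src2])  /* RAW hazard: operand not yet produced */
            return false;
        if (!reg_valid[dest])                      /* WAW hazard: a previous write is pending */
            return false;
        reg_valid[dest] = false;                   /* mark destination invalid until WB */
        return true;
    }

    /* Write-back: the result is available, the destination becomes valid again
       (a real scoreboard would also check for WAR hazards here). */
    static void write_back(int dest) {
        reg_valid[dest] = true;
    }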
Scoreboarding (cont’d)
An example
Figure: scoreboarding example — mul r3,r1,r2 is dispatched; its source registers r1 (10) and r2 (20) are valid, so their availability is recorded in the entry, and destination register r3 is marked invalid in the register file.
Scoreboarding example (cont’d)
Figure: continuation — add r5,r2,r3 is dispatched while r3 is still invalid; its entry records that the second operand is not yet available (RAW hazard), and destination r5 is marked invalid in turn.
Scoreboarding example (cont’d)
Figure: continuation — when the multiply writes back, r3 becomes valid (value 200) so the waiting add can obtain its operand; a subsequent add r0,r1,r3 is dispatched with both sources valid and r0 marked invalid.
Tomasulo scheduling
Introduced in the IBM 360/91 (1967) by Robert Tomasulo
Usually implements out-of-order issue
Dispatched instructions are kept in reservation stations, explicitly storing operand values
Reservation stations are often individually associated with an execution unit
Tomasulo scheduling (cont’d)
Dependency analysis based on the dataflow principle
  Unavailable operands are tagged with the ID of the reservation station producing the value (registers also store this tag)
  Generated results are immediately sent to the reservation stations over a Common Data Bus
  This scheme automatically implements register renaming! (so no WAW and WAR hazards)
Tomasulo scheduling (cont’d)
The basic architecture
Figure: the decode/dispatch stage places instructions into reservation stations (RS1-RS6) in front of the EX units; the register file holds either values or reservation-station tags, and results are broadcast on the CDB (Common Data Bus) to the register file and to waiting reservation stations. In the example, an add with operand values 10 and 20 is ready to execute, while a mul still waits for the result tagged RS1.
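A minimal C sketch of the tagging mechanism described above: each reservation station holds either an operand value or the tag of the producing station, and a CDB broadcast fills in the waiting operands (sizes and names are illustrative assumptions):

    #include <stdbool.h>

    #define NSTATIONS 6

    struct operand {
        bool available;   /* true: value holds the operand           */
        int  value;       /* operand value when available            */
        int  tag;         /* producing reservation station otherwise */
    };

    struct rstation {
        bool busy;
        struct operand src1, src2;
    } rs[NSTATIONS];

    /* A result broadcast on the Common Data Bus: every waiting operand
       tagged with the producing station picks up the value. */
    static void cdb_broadcast(int producer_tag, int result) {
        for (int i = 0; i < NSTATIONS; i++) {
            if (!rs[i].busy) continue;
            if (!rs[i].src1.available && rs[i].src1.tag == producer_tag) {
                rs[i].src1.value = result;
                rs[i].src1.available = true;
            }
            if (!rs[i].src2.available && rs[i].src2.tag == producer_tag) {
                rs[i].src2.value = result;
                rs[i].src2.available = true;
            }
        }
    }

    /* An instruction may start executing once both operands are available. */
    static bool ready(int i) {
        return rs[i].busy && rs[i].src1.available && rs[i].src2.available;
    }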
Register renaming revisited
Hardware implementation of register renaming is common in current superscalar processors
Three possible locations of renaming buffers
  Merged architectural and rename register file
  Separate architectural and rename register files
  Renaming in the ROB
At operand fetch, check both the architectural and the renaming register files
HW register renaming (cont’d)
Two basic buffer architectures
Figure: two rename-buffer organisations — associative rename buffers, where each entry records the destination register, its value, value-valid and latest bits and is found by an associative search on the register number, versus indexed rename buffers, which are accessed directly through an index stored in the register map.
HW register renaming (cont’d)
Number of renaming buffers
  Compaq Alpha 21264: 41 (int) + 41 (fp)
  PowerPC 750: 6 (int) + 6 (fp)
  AMD Athlon: 72 (in ROB)
  HP PA-RISC 8x00: 56 (in ROB)
Rename rate: renames per cycle
An example of HW register renaming
Figure: renaming example for the sequence mul r2,r0,r1; sub r2,r0,r1; add r3,r1,r2 using a circular rename buffer with head and tail pointers; each entry records the destination register, its value, a value-valid bit and a latest bit, and the source operands (r0 = 10, r1 = 0) are fetched as valid values.
HW register renaming example (cont’d)
Figure: continuation — the mul allocates a new rename-buffer entry (entry 3) for r2 at the head; its result is not yet available, so the entry holds a tag and is marked not valid.
HW register renaming example (cont’d)
Figure: continuation — the sub allocates a further entry for r2, which becomes the latest mapping, and the add obtains its operands (e.g. r1 = 10, valid) from the register file and the latest rename-buffer entry for r2.
Preserving sequential consistency
Instructions in superscalars may finish out-of-order
Sequential consistency must be preserved: out-of-order instructions might have to complete (also called retire and commit) in-order
Two issues in sequential consistency:
  Processor consistency (sequence of instruction completions)
  Memory consistency (sequence of memory accesses)
Processor consistency
Weak processor consistency
  Instructions may complete out-of-order when dependencies allow it
  Problems with precise exceptions (discussed later on)
Strong processor consistency
  Instructions always complete in-order
  Easy to implement and no exception problems → common in modern superscalar processors
Memory consistency
Strong memory consistency
  Memory accesses occur in strict program order
Weak memory consistency
  Load/store reordering (not violating data dependencies)
  Increases processor performance
Load/store reordering
Some processors allow loads to bypass stores when the target addresses are different
If the target address of the store is not known
  Non-speculative bypassing: do not bypass the load
  Speculative bypassing (common in modern superscalars): bypass the load and restore the state when the bypass was invalid
Loads bypassing loads in case of Dcache misses: lockup-free caches
Exception processing
Out-of-order execution may cause problems with exceptions. An example:
divf f0, f1, f2 % causes an exception
add r1, r3, r4  % commits earlier than the divf
If nothing is done (weak consistency of exceptions), then exceptions are imprecise
  Undesirable in modern processors (e.g. paging, IEEE FP standard)
Exception processing (cont’d)
Modern processors typically feature precise exceptions
  Previous instructions have committed
  No following instruction has modified architectural state
  PC points to the interrupting instruction
Solutions to implement precise exceptions
  Only issue an instruction when the previous instructions are known not to cause an exception
  History buffer
  ReOrder Buffer (ROB)
History buffer
Store the sequential machine state and restore the machine state after an exception
Figure: instruction results go to the architectural registers while the superseded (old) values are pushed into a history buffer (queue); on an exception, the sequential state is restored from the buffer.
Expensive path to the history buffer (for each simultaneously written operand) and expensive reload after an exception
The ReOrder Buffer (ROB)
Increasingly popular in modern superscalar processors
The ROB is a circular buffer for reordering instructions (to establish sequential processor consistency) and may also implement register renaming and shelving
It effectively supports speculative execution and precise exceptions
Figure: the ROB as a circular buffer with a head (first free entry) and a tail (next instruction to be committed); entries are marked d = dispatched, x = in execution, f = finished.
The ReOrder Buffer (cont’d)
Instructions inserted in program order (dispatch order)
When an instruction may commit, it writes its result to an architectural register/memory
Commit rate: the number of instructions that can commit in 1 cycle
Typical commit rate of 1-4
Status bit for speculative execution (a finished speculative instruction may not commit)
When the ROB implements register renaming, the renaming buffers are integrated with the ROB structure
Control hazards revisited
In general, one out of five instructions is a branch
Figure: Grohoski's estimate of branch statistics — branches split into unconditional branches (jumps, branches to subroutine, returns from subroutine) and conditional branches (loop-closing branches, which are taken for n−1 out of n iterations, and other conditional branches), with their approximate frequencies and taken/untaken ratios.
Control hazards (cont’d)
As we know, branches may cause the pipeline to stall
  Resolving a branch may take a while (e.g. a branch on the result of an FP operation)
  Especially taken branches should be handled efficiently
The branch problem gets worse
  Deeper pipelines and superscalar processors suffer from more branch penalties (multiple branches in the pipeline(s))
Avoiding/reducing branch delays
Branch delay slots (delayed branching)
Multiway branching: follow both paths
Predication
Branch prediction
  Static (compiler)
  Dynamic (hardware)
Dynamic branch prediction
Keep branch history bits (a 2-bit history is common)
  Stored in the Icache, a Branch History Table (BHT) or Branch Target Buffer (discussed later on)
Figure: the 2-bit saturating-counter state machine — states 11 and 10 predict taken, states 01 and 00 predict not taken; a taken branch moves the counter towards 11 and a not-taken branch towards 00, so two consecutive mispredictions are needed to flip the prediction.
Accuracy ≈ 90%
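A minimal C sketch of the 2-bit saturating-counter predictor in the state diagram above; the table size and the PC hashing are illustrative assumptions, not from the slides:

    #include <stdbool.h>
    #include <stdint.h>

    #define BHT_SIZE 1024                      /* illustrative table size */
    static uint8_t bht[BHT_SIZE];              /* 2-bit counters, 0..3    */

    /* Counters 2 and 3 predict taken; 0 and 1 predict not taken. */
    static bool predict(uint32_t pc) {
        return bht[(pc >> 2) % BHT_SIZE] >= 2;
    }

    /* Saturating update: move towards 3 on taken, towards 0 on not taken. */
    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[(pc >> 2) % BHT_SIZE];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }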
Two-level (correlating) predictor
Include behaviour of other branches
Figure: the branch address is combined with a 2-bit global branch history register to index a table of 2-bit per-branch prediction counters.
Accuracy ≈ 95%
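A small C sketch of how the global history can be combined with the branch address to index the prediction table, as in the figure above; the concatenation scheme and table sizes are illustrative assumptions:

    #include <stdint.h>

    static uint8_t  ghr;                        /* 2-bit global branch history  */
    static uint8_t  pht[1 << 12];               /* 2-bit counters, indexed below */

    /* Concatenate low branch-address bits with the 2-bit global history. */
    static unsigned pht_index(uint32_t pc) {
        return (((pc >> 2) & 0x3FF) << 2) | (ghr & 0x3);
    }

    static int predict(uint32_t pc) {
        return pht[pht_index(pc)] >= 2;         /* 2-bit counter as before */
    }

    /* After each branch, shift its outcome into the global history. */
    static void update_history(int taken) {
        ghr = (uint8_t)(((ghr << 1) | (taken & 1)) & 0x3);
    }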
Two-level (correlating) predictor (cont’d)
Two-level predictors come in various sorts and complexities (Yeh & Patt, 1993)
Figure: variants using a global branch history register, combined with a single global pattern history table, with per-set pattern history tables selected by Set(branch), or with per-address pattern history tables selected by Addr(branch).
Two-level Branch Predictors (cont’d)
Figure: variants using per-address branch history tables (indexed by Addr(branch)), combined with a global, per-set or per-address pattern history table.
Two-level Branch Predictors (cont’d)
Figure: variants using per-set branch history tables (indexed by Set(branch)), combined with a global, per-set or per-address pattern history table.
Branch Target Buffer (BTB)
Cache the branch target address besides the predicted branch direction
At each cycle, search the PC (instruction to fetch in the IF stage) in the BTB
  If the PC is found, then start fetching from the cached target address (predicted taken)
  No branch delay when the prediction is correct
Branch folding optimization
  Store the target instruction rather than the address
  Allows for zero-cycle branches!
Branch Target Buffer (cont’d)
Figure: BTB lookup — the PC of the instruction to fetch is compared against the BTB entries; on a match, the predicted PC and the prediction bits are read out and the predicted PC is used as the next PC if the branch is predicted taken; on a miss, fetching proceeds normally.
Speculative execution
Instructions after a predicted branch are executed speculatively
How deep can the level of speculation be?
  Typically between 1 and 20 branches
How far are speculative instructions processed?
  Typically up to the execution stage; speculative instructions are committed only after the branch is resolved (e.g. using a ROB)
Increasing the degree of speculative execution: value prediction → we'll return to this
App. dependent optimizations: SIMD instructions
Multimedia applications increasingly popular: SIMD parallelism can be exploited in many multimedia algorithms
  Many small integer data types
  Frequent multiplies and accumulates in repetitive loops
  Highly parallel operations
ISAs extended with SIMD instructions (e.g. MMX)
  Pack multiple small data items in a register (typically 64 bits or larger)
  Perform the same instruction on multiple data items in parallel
SIMD instructions (cont’d)
Figure: Packed Add — four 16-bit elements a0..a3 and b0..b3, packed in 64-bit registers, are added element-wise into a0+b0 ... a3+b3; Packed Multiply Add — the element-wise products are summed pairwise into a0*b0+a1*b1 and a2*b2+a3*b3.
SIMD instructions (cont’d)
Use the INT/FP register file or a separate register file (needs OS support!)
Provide both modulo and saturated arithmetic (see the sketch below)
  Saturated arithmetic is very useful for pixel operations
Pose a new problem for compiler writers: automatic vectorization
  The large variety of SIMD instruction sets doesn't help here
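A scalar C sketch of the difference between modulo and saturated arithmetic for 16-bit values, mimicking what a single lane of a packed add does; it is illustrative and not tied to any specific SIMD ISA:

    #include <stdint.h>

    /* Modulo (wrap-around) 16-bit add: 30000 + 30000 wraps to -5536. */
    static int16_t add_modulo(int16_t a, int16_t b) {
        return (int16_t)(a + b);
    }

    /* Saturated 16-bit add: the result is clamped to [-32768, 32767],
       which is what you want for pixel or sample values. */
    static int16_t add_saturated(int16_t a, int16_t b) {
        int32_t s = (int32_t)a + (int32_t)b;
        if (s >  32767) return  32767;
        if (s < -32768) return -32768;
        return (int16_t)s;
    }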
App. dependent optimizations: data prefetching
Problem
Multimedia applications suffer from compulsory cache misses
Calculations are applied on streams (no or little re-use)
Observation
Regularity in stream processing: data address Si = A0 + i × offset
Solution: Stream prefetching
Bring data elements to the cache before they are really needed
Potential problem: thrashing
Stream prefetching: a classification
Stream prefetching composed of two actions:
Stream detection: detect when an application is performing operations on data streams
Issuing prefetches: request the data cache to (regularly) prefetch a certain amount of data

                            Issuing prefetches
                            Static         Dynamic
Stream       Static         (SW)           (hybrid SW/HW)
detection    Dynamic        —              (HW)
Static detection, static issuing
Original loop:
    for (i = 0; i < N; i++)
        sum = a[i] + sum;

With software prefetching:
    for (i = 0; i < 3; i++)
        prefetch(&a[i]);
    for (i = 0; i < N - 3; i++) {
        prefetch(&a[i+3]);
        sum = a[i] + sum;
    }
    for (; i < N; i++)
        sum = a[i] + sum;

Cheap implementation
Instruction overhead in loop body
Code rewriting and fine tuning required (e.g. affects compiler optimizations)
Dynamic detection, dynamic issuing
Detection
Instruction address of loads/stores identifies stream
A large table records the instruction + data addresses of all possible candidates (Stride Prediction Table)
Issuing of prefetch requests (sketched below)
  Instruction address hits in the SPT:
    Compute the stride (data-addr_current − data-addr_table)
    Issue a prefetch request for addr_prefetch = data-addr_current + stride
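A minimal C sketch of the stride prediction table logic described above; the table size, the indexing by instruction address and the issue_prefetch hook are illustrative assumptions:

    #include <stdint.h>

    #define SPT_SIZE 256                        /* illustrative */

    struct spt_entry {
        uint32_t instr_addr;                    /* load/store PC identifying the stream */
        uint32_t prev_addr;                     /* previous data address                */
        int32_t  stride;                        /* last observed stride                 */
    };
    static struct spt_entry spt[SPT_SIZE];

    extern void issue_prefetch(uint32_t addr);  /* assumed cache hook */

    static void spt_access(uint32_t pc, uint32_t data_addr) {
        struct spt_entry *e = &spt[(pc >> 2) % SPT_SIZE];
        if (e->instr_addr == pc) {              /* hit: compute stride and prefetch */
            e->stride = (int32_t)(data_addr - e->prev_addr);
            issue_prefetch(data_addr + (uint32_t)e->stride);
        } else {                                /* miss: allocate a new candidate   */
            e->instr_addr = pc;
            e->stride = 0;
        }
        e->prev_addr = data_addr;
    }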
Dynamic detection, dynamic issuing (cont’d)
Figure: SPT example for the loop nest for (i = 0; i < 100; i++) for (j = 0; j < 100; j++) A[i][j] += B[j][i]. (a) The SPT records, per load instruction, the previous data addresses (100000 for A, 200000 for B); (b) at i = 0, j = 1 the loads hit addresses 100004 and 200400, giving strides of 4 and 400; (c) at i = 0, j = 2 prefetch requests are issued for addresses 100008 and 200800.
Dynamic detection, dynamic issuing (cont’d)
Transparent: no programmer action required
Run-in effect
Thrashing
Large SPT required
Even larger SPT needed when loop-unrolling
PC needed
Dynamic detection, dynamic issuing (cont’d)
The effect of loop-unrolling
Original loop:
    i0: ......
    i1: load R1 R3
    i2: ......
    i3: incr R1
    i4: jump i0

After loop unrolling:
    i0: ......
    i1: load R1 R3
    i2: ......
    i3: incr R1
    i4: ......
    i5: load R1 R3
    i6: ......
    i7: incr R1
    i8: jump i0
Dynamic detection, dynamic issuing (cont’d)
A few approaches to reduce thrashing
A more complex state-machine in the SPT: e.g. only prefetch when measuring a constant stride
Introduce separate stream caches
Figure: two placements of a separate stream cache (with MRU replacement) next to the regular cache (with LRU replacement); the SPT directs prefetched stream data into the stream cache so that it does not thrash the main cache.
Static detection, dynamic issuing
Detection
A stream prefetch instruction signals a prefetch engine to start prefetching the elements of a stream
    prefetch(&a[0], N, 4, 3);
    for (i = 0; i < N; i++)
        sum = a[i] + sum;
Programmer selects which streams to prefetch
No run-in effect
Small prefetch table (streams only, no candidates)
No rewriting of inner loop
Static detection, dynamic issuing (cont’d)
Issuing of prefetch requests
Like HW prefetching: use the instruction address of loads/stores
Or use the data addresses of the stream elements
  PC not needed
  Not affected by loop-unrolling: smaller table!
  More robust prefetching (e.g. when a stream is accessed with multiple strides)
  More expensive implementation of the prefetch table
Alpha 21264
7-stage pipeline with clustered EX units
OOO implementation using scoreboarding
Figure: Alpha 21264 block diagram — instructions are fetched from a 64 KB 2-way Icache with line/set prediction and branch prediction, pass through integer and FP register rename stages, and enter a 20-entry integer issue queue (feeding four integer execution units organised as two clusters, each with an 80-entry integer register file and address units) and a 15-entry FP issue queue (feeding FP add and FP multiply units with a 72-entry FP register file); loads/stores access a 64 KB 2-way Dcache and the L2 cache/system interface. Pipeline stages: Fetch, Rename, Issue, Reg. Read, Execute, Memory.
Alpha 21264 (cont’d)
Hybrid tournament branch predictor
Load/store reordering, software data prefetching
Figure: the 21264 branch predictor — a local history table (1024 x 10) indexed by the PC feeds local predictions (1024 x 3), a global prediction table (4096 x 2) is indexed by the path history, and a choice prediction table (4096 x 2) selects between the local and global predictions.
HP PA-8700
OOO implementation using ROBs (4-way issue)
Hybrid static and dynamic (BHT+BTB) branch prediction, simple (i+1) HW data prefetching
Figure: HP PA-8700 block diagram — the IF unit fetches 4 instructions per cycle from a 0.75 MB 4-way Icache into a sort unit; an ALU ROB and a MEM ROB (28 entries each), with rename registers, feed 2 64-bit integer ALUs, 2 shift/merge units, 2 FP multiply/add units, 2 FP div/sqrt units and the load/store address units; 4 instructions per cycle retire to the architectural registers; a 1.5 MB 4-way Dcache and a system bus interface complete the chip.
AMD Athlon
Figure: AMD Athlon block diagram — a 64 KB 2-way (dual-ported) Icache with pre-decode information and a 2K-entry BHT+BTB feed IF/ID control and 3-way x86 instruction decoders; an 18-entry integer scheduler issues to three integer execution/address units, and a 36-entry FPU scheduler with FPU stack map/rename and an 88-entry FPU register file issues to FADD, FMUL and FSTORE units (MMX/3DNow!); a 72-entry ROB plus a 24-entry INT register file, a 44-entry load/store queue unit, a 64 KB 2-way 8-bank Dcache, an L2 cache controller and the bus interface unit complete the design.
IBM POWER 4
2 processor cores on-chip (share on-chip 1.5MB 8-way L2 cache)
IBM POWER 4 (cont’d)
8-way issue OOO engine
Tournament branch prediction + static prediction
Group Completion Table (GCT) is a sort of ROB
64 KB DM/2-way Icache/Dcache, hardware data prefetch (L2 and L3 caches)
Intel Pentium 4
Decode stage translates 1 IA32 instruction per cycle into uops
A trace cache stores traces of uops
BHT + BTB + static branch prediction
"Double pumped" ALUs
Now also with hyper-threading (simple SMT)
Trace caches
Assume a branch predictor throughput of m; then traces are identified by their starting address and m − 1 branch outcomes
Figure: at time t, a lookup of address A with predictions (taken, taken) misses the trace cache, so a new trace is filled from the I$; later the same lookup hits, and trace(A,taken,taken) is delivered directly to the decoder.
Intel Pentium 4 (cont’d)
20-stage pipeline, 6-way issue OOO execution
Intel Pentium 4 (cont’d)
126-entry ROB, 128 physical registers (8 architectural)
An alternative to superscalar RISC: VLIW
Principle derived from horizontal µ-programming
Very Large Instruction Words containing multiple operation slots (operations drive execution units)
Figure: a VLIW instruction consists of several RISC-like operation slots, one per execution unit.
Compiler schedules operations within instruction slots
  No hardware scheduling, the compiler must find the ILP
  The compiler should know everything about the architecture (e.g. timing) to schedule operations
VLIW processors (cont’d)
Figure: division of work between compiler and hardware for the steps "determine dependencies", "determine independencies", "bind resources" and "execute" (after the frontend & optimizer): a superscalar leaves all of them to hardware, dataflow machines determine dependencies in the compiler, "Horizon" also determines independencies statically, and VLIW (and IA64) additionally bind resources in the compiler, leaving only execution to the hardware.
VLIW processors (cont’d)
The ideal architecture of a VLIW processor
Figure: ideal VLIW datapath — a single shared register file directly connected to an INT ALU, an FP ALU, a load/store unit (to main memory) and a branch unit.
All execution units have direct access to the register file
  Typically, this is infeasible (too many read/write ports for the register file): clustered architecture
  Clustering complicates the compiling (e.g. inter-cluster data movements)
VLIW processors (cont’d)
Require less complex hardware than (superscalar) RISCs
Generally perform well on scientific and multimedia code (predictable)
  In theory, compilers should be able to find more ILP than superscalars (they have a wider scope)
  Extra room on chip can be used for application-specific HW optimizations
Compiler requires static branch prediction to schedule code
Code is less compact due to NO-OPs
VLIW processors (cont’d)
Code compaction techniques to reduce the impact of NO-OPs
  Where to decompress instructions?
    At Icache refill: not in the critical path, but needs a larger Icache
    At instruction fetch: smaller Icache, but in the critical path
Object code compatibility hard to obtain
  Possible solution: static/dynamic instruction rescheduling
The Philips TriMedia TM1000 processor
32-bit high-performance media processor with VLIW core (currently 6.5 BOPS)
5 operations/instruction (2 memory operations)
Guarded operations (predication)
SIMD operations
Co-processors for common media algorithms
Figure: TM1000 block diagram — the VLIW CPU (32K I$, 16K D$) is surrounded on one chip by a memory interface to SDRAM (32 bits data, 400 MB/s), video in/out (CCIR601/656 YUV 4:2:2), stereo and multi-channel digital audio in/out, an I2C interface (to camera etc.), a VLD co-processor (slice-at-a-time Huffman decoding for MPEG-1&2), an image co-processor (down & upscaling, YUV to RGB, 50 Mpix/s), a synchronous serial interface (V.34 or ISDN front end), timers and a PCI interface (32 bits, 33 MHz).
The TriMedia TM1000 processor (cont’d)
27 execution units
128-entry register file
32 KB, 8-way set-associative Icache (compressed code)
16 KB, 8-way set-associative Dcache
  8 banks, pseudo-dual ported
  Non-blocking, hierarchical LRU
  Streamed, critical-word-first fetching
Figure: TM1000 CPU datapath — instruction cache (32 KB) → instruction fetch buffer → instruction decompression hardware → issue register (5 operation slots) → operation routing network → 27 execution units and the register file (128 x 32), connected by a register routing and forwarding network, next to the data cache (16 KB).
The TriMedia CPU64
Target: 6x to 8x performance increase over the TM1000, while the transistor count may not be more than doubled
64-bit registers and data paths (e.g. 64-bit SIMD instructions)
New, extensive, media instruction set
Improved cache control (SW controlled prefetch and allocation), Dcache truly dual ported
Super-Ops: double-slot operations allowing multi-argument, multi-result operations
The TriMedia CPU64 (cont’d)
The TriMedia CPU64 (cont’d)
Inclusion of an MMU (separate I-MMU and dual-ported D-MMU)
The D-MMU is/has
  a 64-entry fully-associative D-TLB, software managed
  indexed with a 32-bit VA and an 8-bit process ID
  variable page sizes of 4 KB to 16 MB: practical for media applications with large data streams
Precise exceptions
The TriMedia CPU64 (cont’d)
The (Intel/HP) Itanium 2 (McKinley)
IA64 processor using EPIC: mixing RISC and VLIW
Move complexity back to the compiler:
  Exploiting explicit parallelism: schedule operations in bundles
  Figure: an instruction bundle — three instructions (40 bits each in the figure) plus an 8-bit template field.
  The template provides information on inter- and intra-bundle dependencies
  Branch + memory hints (ld.s + check.s instructions)
  Predication to reduce branches
The Itanium 2 (cont’d)
8-stage in-order pipeline
2-level BHT + BTB, Target Address Registers for compiler hints + Loop Count register
6 instructions issued per cycle
Register rotating for loop-unrolling support
3 branches may be executed in parallel
Load/store reordering allowed
IA32 instructions translated to IA64 instructions
The Itanium 2 (cont’d)
The Transmeta Crusoe
Simple 4-way VLIW CPU core
  5 functional units
  64 registers
Reduced power by replacing a large number of transistors with software
x86-compatible through Code Morphing: translates x86 to VLIW instructions (did microcode return?)
The Transmeta Crusoe (cont’d)
Code Morphing
  Translates + schedules a whole group of instructions at once (includes register renaming)
  Caches translations
  Analyses program behaviour → gradual optimization of translations
  Alleviates the compatibility problem of VLIWs
  Applies a history-buffer approach for implementing precise exceptions
  Can control the processor's clock speed
The Transmeta Crusoe (cont’d)
The Transmeta Crusoe TM5600
Figure: TM5600 block diagram — the CPU core (integer unit, FP unit, multimedia instructions, MMU) with a 64 KB 8-way L1 Icache, a 64 KB 16-way L1 Dcache, a 256-entry 4-way unified TLB and a 512 KB 4-way write-back L2 cache, plus an on-chip bus interface, PCI controller, serial ROM interface, SDR and DDR SDRAM controllers and DMA.
Embedded processors
To the question "what's the most popular microprocessor around?", you probably answered Intel Pentium
Well...thanks for playing, but
Intel Pentium has almost 0% market share. Zip. Zilch.
Pentium is a statistically insignificant chip with tiny sales!
Embedded processors (cont’d)
Relating microprocessors to life on earth...are Pentiums the viruses of the chip market? ;-)
Embedded processors (cont’d)
Embedded processors (cont’d)
In the embedded processor market, there’s no big leader
Embedded processors (cont’d)
Types of embedded microprocessors
  GP processing cores, such as ARM, MIPS, 68000, PowerPC, etc.
  Digital Signal Processors (DSPs)
  Media processors, such as TriMedia, Emotion Engine (PS2), Equator's MAP, etc.
DSPs (cont’d)
Processing continuous data streams (sequences of samples)
Often (hard) real-time applications: these call for predictable behavior!
In-order issue/execution/completion with CPI = 1 (often a VLIW core)
Mostly fixed-point (FP is slow and expensive)
Still lots of assembly coding
DSPs (cont’d)
Harvard architecture
  Separate data memory/bus and instruction memory/bus (multiple ports, high bandwidth)
Multiply-accumulate (sum = sum + k*x[i]) in a single instruction (common in filters)
Special addressing modes (see the sketch below)
  Modulo addressing for circular buffers, bit-reversed addressing (for FFTs)
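A C sketch of the two features mentioned above — a multiply-accumulate inner loop and modulo addressing for a circular buffer — written out explicitly where a DSP would do the modulo addressing in hardware (the filter setup is purely illustrative):

    #define NTAPS 16

    static int coeff[NTAPS];                 /* filter coefficients k        */
    static int delay[NTAPS];                 /* circular buffer of samples x */
    static int head;                         /* index of the newest sample   */

    /* One FIR output: insert the new sample, then multiply-accumulate.
       The "% NTAPS" is the modulo addressing a DSP provides for free. */
    static int fir_step(int sample) {
        int sum = 0;
        delay[head] = sample;
        for (int i = 0; i < NTAPS; i++)
            sum = sum + coeff[i] * delay[(head + NTAPS - i) % NTAPS];  /* sum += k*x[i] */
        head = (head + 1) % NTAPS;
        return sum;
    }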
DSPs (cont’d)
Both modulo and saturated arithmetic
Zero-overhead loops (loop an instruction (sequence) a number of times)
Predictable interrupt latencies
No caches, or caches with locking (→ predictability)
Embedded processors (cont’d)
Other concerns
Low costs
  Lowest possible area
  (Some of the) technology behind the leading edge
Code density (small memory footprint)
  ISA methods
  Compression
Fast time to market
  Compatible architectures (e.g., ARM, MIPS) allow reuse of code
  Customizable core
Low power if the application requires portability
Power Intermezzo
Power equations for CMOS logic circuits
Power consumption: P = A C V² f + τ A V Ishort f + V Ileak
First component measures the dynamic power consumption of (dis-)charging the capacitive load on the output of each gate
  Proportional to the frequency (f), the activity of the gates (A, some gates may not switch every clock), the total capacitance seen by the gate outputs (C) and the square of the supply voltage (V)
  This term dominates
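A purely illustrative plug-in of numbers (not from the slides) to get a feel for the dominant term: with A = 0.1, C = 1 nF, V = 1.2 V and f = 1 GHz, the dynamic term is P_dyn = A C V² f = 0.1 × 10⁻⁹ × (1.2)² × 10⁹ ≈ 0.14 W; lowering the supply voltage to 0.9 V scales this quadratically, by (0.9/1.2)² ≈ 0.56, to roughly 0.08 W.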
Power Intermezzo (cont’d)
Second term captures the power consumption due to the short-circuit current (Ishort) that momentarily (τ) flows between ground and supply voltage when the output of a gate switches
Third term measures the power lost due to the leakage current (Ileak), regardless of the state of the gate
Power Intermezzo (cont’d)
Reducing V effectively reduces power consumption (quadratic relationship!)
But reducing V also limits the maximum frequency: fmax ∝ (V − Vthreshold)² / V → fmax is roughly linear in V
Lessen the effect of reducing V by reducing Vthreshold
  Unfortunately, this increases the leakage current: Ileak ∝ exp(−Vthreshold / 35mV)
Techniques for power reduction
Logic level
  Clock gating: turn off parts of the clock tree (it may consume up to 30% of the power of a processor) → reduces parameter A
  Half-frequency clock (use both edges)
  Asynchronous logic
Exploit parallelism (allows for reducing V)
  This does not include pipeline parallelism! (this requires an increase of f)
Techniques for power reduction (cont’d)
Organisation of memory (e.g., multiple smaller banks, code compression)
Buses: reduce the number of swings on address lines by exploiting locality (e.g., using Gray code, illustrated below)
OS: dynamically control f (frequency) dependent on the application
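A small C illustration of the Gray-code idea mentioned above: consecutive values differ in exactly one bit, so a counter-like address sequence causes only one line swing per step (the conversion itself is standard; its use here is only illustrative):

    #include <stdint.h>

    /* Convert a binary value to its Gray code: adjacent values differ in one bit. */
    static uint32_t to_gray(uint32_t b) {
        return b ^ (b >> 1);
    }
    /* e.g. going from 3 to 4 flips three bits in binary (011 -> 100),
       but only one bit in Gray code (010 -> 110). */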
Playstation 2 architecture
Playstation 2 architecture (cont’d)
Playstation 2 architecture (cont’d)
Playstation 2 architecture (cont’d)
Parallel systems
Amdahl's law → do not forget the "sequential" processor (see the formula below)
In the past, many special-purpose processors were used in parallel systems (e.g. Transputer, CM-2)
Nowadays, mostly RISC(y) commodity microprocessors are used
  Cray → DEC Alpha
  SGI → MIPS
  IBM → POWER2
Parallelism exploited at multiple levels
  Task-level, thread-level and instruction-level
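For reference (the slide only names it), Amdahl's law: if a fraction p of the work can be parallelized over N processors, the speedup is S(N) = 1 / ((1 − p) + p/N). Even with p = 0.95 the speedup is bounded by 1/(1 − p) = 20 no matter how many processors are used — hence "do not forget the sequential processor".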
Parallel systems (cont’d)
Some design issues
Processors: number and power of processors, organization (connectivity)
Memory organization: location, caches, etc.
Type of network: direct or indirect
Synchronization: SIMD → synchronous, MIMD → asynchronous
Parallel systems (cont’d)
Figure: taxonomy of parallel paradigms — synchronous paradigms (Vector/Array, SIMD, Systolic (MISD)) versus asynchronous MIMD; both SIMD and MIMD machines come in shared memory and distributed memory variants.
MIMD vs SIMD
Figure: (shared memory) MIMD — several instruction units (IU), each with its own data unit (DU), connected through a network to memory (Multiple Instructions, Multiple Data; medium/coarse grain parallelism) — versus (shared memory) SIMD — a single instruction unit driving many data units (Single Instruction, Multiple Data; fine grain parallelism).
SIMD has evolved into the SPMD paradigm for MIMD machines
→ I will focus on MIMD parallel architectures
Interconnection networks
Two types of networks
Direct (or point-to-point) connection networks
Indirect connection networks
Network properties/definitions
Network switching : the transportation of data from one processor to the other
  Circuit-switching (the connection stays up during the communication)
  Packet-switching (a connection is only made for a single packet)
Network properties (cont’d)
Topology : the lay-out of the network
  Static (point-to-point networks)
  Dynamic (indirect networks)
Node degree : number of channels connected to one node
Diameter of network : maximum shortest path between two nodes
Bisection width : when the network is cut into two equal halves, the minimum number of channels along the cut
Network redundancy (fault tolerance) : number of alternative paths between two nodes
Network scalability : measure for the expandability of the network
Network properties (cont’d)
Network routing : the process of steering data (messages) through the network
Routing and redundancy are coupled : high redundancy → many routing possibilities
Network functionality : measure for the support of routing, fault tolerance, synchronization, message combining, etc.
Network throughput : amount of transferred data per time unit
Network latency : worst-case delay for the transfer of a unit (empty) message through the network
Hot-spots : nodes that account for a disproportionate amount of network traffic
Direct (point-to-point ) connection networks
Figure: examples of point-to-point topologies — linear array, ring, chordal ring of degree 3, completely connected, mesh, torus, systolic array, 3-hypercube and 4-hypercube.
Direct connection networks (cont’d)
Network             Node degree   Diameter          Bisection width
Linear array        2             N − 1             1
Ring                2             ⌊N/2⌋             2
Completely conn.    N − 1         1                 (N/2)²
Binary tree         3             2(log2 N − 1)     1
2D-mesh             4             2(√N − 1)         √N
2D-torus            4             2⌊√N/2⌋           2√N
Hypercube           log2 N        log2 N            N/2
N equals the number of nodes
Indirect connection networks
Dynamic networks: no (fixed) neighbours (the communication topology changes based on application demands)
Bus networks
Multistage networks (blocking and non-blocking)
  Omega networks
  Baseline networks
  Clos networks
Crossbar switches
Busses
Generic bus structure in parallel machines
Figure: generic bus structure — processors P1..Pn, memories M1..Mn and I/O devices I/O1..I/On share address, data and control lines, governed by a bus arbiter and control unit.
In traditional busses, address and data lines may be time-multiplexed
When there are more bus-masters, arbitration is required
Busses (cont’d)
Synchronous bus
Figure: a typical synchronous bus read transaction — the address, read and data signals are driven relative to the bus clock.
Asynchronous bus
  More complex/expensive (and possibly slower) but also more flexible than a synchronous bus
Busses (cont’d)
Split-transaction busses: higher throughput by pipelining (but possibly higher latency)
Need extra bus lines to signal the "owner" of the data (using tags)
Figure: split-transaction (pipelined) bus — the address phases of new transactions (addr 1, addr 2, addr 3) overlap with the data phases of earlier ones (data 0, data 1, data 2), with wait/OK responses signalled per transaction.
Busses (cont’d)
Traditional versus split-transaction busses
Figure: bus usage over time for processors P1-P3 — on a traditional bus each transaction occupies the bus from its address phase until its data returns, whereas on a split-transaction bus the address-bus and data-bus phases of different processors are interleaved, leaving far fewer idle bus cycles.
Busses (cont’d)
Arbitration: two examples of centralized schemes
Independent request/grant lines: flexible and efficient, but expensive
Figure: central bus arbiter with per-master request (R1..Rn) and grant (G1..Gn) lines plus a shared bus-busy line
Busses (cont’d)
Arbitration: two examples of centralized schemes (cont’d)
Daisy-chaining: less expensive, but slow propagation of the grant and less fairness
Figure: daisy-chained arbitration; a single grant line propagates from master 1 to master n, while the bus-request and bus-busy lines are shared
Multistage networks
Figure: a generalized Multistage Interconnection Network (MIN), with n inputs and n outputs connected through stages of a x b switches and a specific InterStage Connection (ISC) pattern between the stages
Multistage networks (cont’d)
Switch networks often use 2x2 switches
N inputs require log2N stages of 2x2 switches
Each stage requires N/2 switch modules
The number of stages determines the delay of the network
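To make the self-routing property concrete, here is a minimal C sketch of destination-tag routing in an Omega network of 2x2 switches; the function name and the printed port names are illustrative assumptions, not taken from the slides.

#include <stdio.h>

/* Destination-tag (self-routing) in an Omega network of 2x2 switches.
   At stage s (0 = first stage), the switch looks at bit (stages-1-s) of the
   destination address: 0 selects the upper output, 1 the lower output. */
void omega_route(unsigned dest, int stages)
{
    for (int s = 0; s < stages; s++) {
        int bit = (dest >> (stages - 1 - s)) & 1;
        printf("stage %d: %s output\n", s, bit ? "lower" : "upper");
    }
}

int main(void)
{
    omega_route(5, 3);   /* route to output 5 in an 8x8 (3-stage) network */
    return 0;
}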
Multistage networks (cont’d)
Figure: a 16x16 Omega network of 2x2 switches (inputs and outputs numbered 0-15); each switch can be set to straight, crossover, upper broadcast or lower broadcast
Multistage networks (cont’d)
Routing in an Omega network: not all permutations can be routed without blocking
Permutation (0,7,6,4,2)(1,3)(5) without blocking
Permutation (0,6,4,7,3)(1,5)(2) blocked at switches F,G and H
Crossbar switches
Figure: crossbar switch, with processors P1..Pn on the rows and memories M1..Mn on the columns; each crosspoint can be switched on or off to connect a processor to a memory
Crossbar switches (cont’d)
Possible implementations of a crossbar
Figure: two realizations connecting inputs I0-I3 to outputs O0-O3, an array of crosspoint switches and a RAM-based (memory-phased) implementation
Indirect connection networks
Assume n processors on a bus of width w, an n x n MIN using k x k switches with line width w, and an n x n crossbar with line width w.
Network characteristics   Bus                Multistage network             Crossbar switch
Min. latency              constant           O(log_k n)                     constant
Bandwidth                 O(w)               O(w) to O(nw)                  O(w) to O(nw)
Wiring complexity         O(w)               O(nw log_k n)                  O(n^2 w)
Switching complexity      O(n)               O(n log_k n)                   O(n^2)
Connectivity and          only one-to-one    some permutations and          all permutations,
routing capability        at a time          broadcast, if unblocked        one at a time
Packet switching
Divide a message into packets and route them through the network
Common packet-switching techniques:
Store & forward switching (rather obsolete)
The packet is the smallest entity
Packet buffers at intermediate nodes are required
Figure: store & forward timing; the whole packet is received and buffered at each intermediate node (I1, I2) before being forwarded to the next node
Packet switching (cont’d)
Wormhole "routing"
The flit is the smallest entity
Only small flit buffers are required
One packet can occupy multiple channels
Figure: wormhole timing; flits pipeline through the intermediate nodes (I1, I2)
Nearly distance independent (low latency)
Virtual cut-through switching
Combination of the store & forward and wormhole techniques
Store & forward vs Wormhole
The communication latencies for store & forward switching and wormhole routing are expressed by:

T_s&f = (L / W) * D

T_wormhole = L / W + (F / W) * (D - 1)

where L is the packet length in bits, W the channel bandwidth in bits/s, D the distance (number of hops) and F the flit length in bits.
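A small C sketch (with made-up example values, not from the slides) that evaluates both formulas; it illustrates that wormhole latency grows by only one flit time per extra hop.

#include <stdio.h>

/* Illustrative only: compare store&forward and wormhole latency
   using the formulas above. The numbers below are example values. */
int main(void)
{
    double L = 1024.0 * 8;   /* packet length in bits     */
    double F = 32.0 * 8;     /* flit length in bits       */
    double W = 1e9;          /* channel bandwidth, bits/s */
    int    D = 4;            /* distance in hops          */

    double t_sf = (L / W) * D;                 /* store & forward */
    double t_wh = L / W + (F / W) * (D - 1);   /* wormhole        */

    printf("store&forward: %.2f us, wormhole: %.2f us\n",
           t_sf * 1e6, t_wh * 1e6);
    return 0;
}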
Flow control
Handshaking between switches, e.g. in wormhole routing:
Figure: request/acknowledge (R/A) handshake between switch S and switch D; each flit i is transferred when the R/A line toggles, so a stalled receiver holds back the sender
Tree saturation
Wormhole routing may suffer from tree saturation: messages are waiting for each other → this can lead to a snowball effect
Figure: messages A and B holding channels while waiting for each other
A generic switch architecture
Figure: a generic switch, with input ports (receivers and input buffers), a crossbar, output buffers and transmitters on the output ports, and control logic for routing and scheduling
Routing techniques
Location of routing “intelligence”:
Source-based routing (e.g. Myrinet)
Routers "eat" the head of a packet
Larger packets
No fault tolerance
Local routing
More complex routers but smaller packets
Routing may cause deadlocks
Buffer deadlock (store-and-forward switching)
Channel deadlock (wormhole routing)
Routing may be minimal or non-minimal
Non-minimal routing → potential starvation
Routing deadlocks
Figure: channel deadlock between four nodes A-D connected by channels C1-C4; the channel-dependence graph contains a cycle, and adding two virtual channels (V3, V4) yields a modified channel-dependence graph without the cycle
Local routing techniques (cont’d)
Determining the routing path
Deterministic (non-adaptive) routing: fixed path
Minimal and deadlock free
Adaptive routing: exploits alternative paths
Less prone to contention and more fault-tolerant
Potential deadlocks
Reassembly of messages (out-of-order arrival of packets)
Cannot be source-based routing
Minimal or non-minimal
Partially adaptive vs fully adaptive
Deterministic dimension order routing: X-Y
Figure: an X-Y route from source S to destination D in a 2D mesh; deadlock is not possible with X-Y routing
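A minimal C sketch of the X-Y decision a mesh router could make; the port names and the coordinate convention are assumptions for illustration, not taken from the slides.

/* X-Y (dimension-order) routing in a 2D mesh:
   first route along X until the column matches, then along Y. */
typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return EAST;
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return NORTH;
    if (dst_y < cur_y) return SOUTH;
    return LOCAL;   /* the packet has arrived */
}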
Deterministic routing (cont’d)
Interval labelling
Figure: interval labelling on a 16-node network (nodes 0-15); each output link is labelled with an interval of destination addresses, e.g. [0,4), [4,6), [7,8) and [8,16), and a packet follows the link whose interval contains its destination
Example: Inmos C104 switch
Adaptive routing (cont’d)
Figure: X-Y routing follows a single fixed path, while adaptive routing can exploit alternative paths between the same source and destination
Deadlock avoidance
Deterministic routing (e.g. X-Y)
Partially adaptive routing
For example, west-first routing for 2D meshes: route a packet first to the west (if required), then route the packet adaptively to the north, south or east
Figure: the turns allowed by X-Y routing versus west-first routing
West-first routing example
Deadlock avoidance (cont’d)
Virtual channels
Virtual channels are logical links between two nodes, using their own buffers and multiplexed over a single physical channel
Virtual channels "break" dependency cycles
Figure: virtual channels VL1-VL3 between node X and node Y, each with its own buffers, multiplexed (e.g. round-robin) over one physical channel
Virtual channels (cont’d)
Figure: a double Y-channel 2D mesh and its +X subnetwork
Virtual channels (cont’d)
Advantages
Increased network throughput
Deadlock avoidance
Virtual topologies
Dedicated channels (e.g. debugging, monitoring)
Disadvantages
Hardware cost
Higher latency
Incoming packets may be out-of-order
Distributed memory MIMDs: multicomputers
Message-passing machines (packet switched)
Often a point-to-point network
Memory
Locally addressable
No global address space
Communication & synchronization
Via message passing
Architecture is scalable
Communication is not transparent
Distributed memory MIMDs (cont’d)
Problem: intermediate processors must route messages when two communicating nodes are not neighbours
Solution: a separate communication processor on each node which performs routing and DMA transfers
Problem: no globally accessible memory available, e.g. sharing of data and code is difficult (not transparent)
Solution: Virtual Shared Memory (VSM) or Shared VirtualMemory (SVM)
VSM and SVM
Translate memory references into the message-passing paradigm
Virtual Shared Memory (VSM)
Hardware implementation
Virtual memory system transparently implemented on top of VSM
Unit of sharing typically small (e.g. a cache block)
Shared Virtual Memory (SVM)
Software implementation (OS) + hardware support (MMU)
Virtual memory system implements shared memory (OS not transparent)
Unit of sharing typically larger (e.g. pages)
VSM and SVM (cont’d)
If data has a fixed home node, then there are four approaches to VSM and SVM:
Central Server: no replication, no migration, no coherency problems
Full Migration: migration, no replication, no coherency problems
Read Replication: replication on reads, migration on writes, invalidations guarantee coherency
Full Replication: replication on reads and writes, no migration, a sequencer process (S) updates all replications when writing
More on VSM later on...
Real multicomputers: the IBM SP2
POWER2 processors
Shared memory MIMDs: multiprocessors
Network: typically indirect or hybrid (indirect + point-to-point)
Memory
Locally addressable
Globally addressable
Communication & synchronization
Via sharing of data (transparent)
Critical regions (locking)
Message passing can be emulated
Architecture is [not,..,reasonably] scalable
Cache coherency problem
Cache coherency in shared memory machines
Cache coherency problem :
In multiprocessor systems, data inconsistencies between different caches can easily occur
Three sources of the problem can be identified:
Sharing of writable data
Process migration
I/O activity
Cache coherency (cont’d)
Sharing of writable data
Figure: processors P1 and P2 both cache X; after P1 updates X to X', a write-through cache leaves P2's cached copy stale, and a write-back cache additionally leaves main memory stale
Cache coherency (cont’d)
Process migration
Figure: a process caches and updates X on P1 and then migrates to P2; with both write-through and write-back caches, stale copies of X can be observed after the migration
Cache coherency (cont’d)
I/O activity
Figure: I/O activity; a device doing DMA to or from shared memory bypasses the caches, so cached copies (write-through) or memory (write-back) become stale, and attaching I/O processors (IOP1, IOP2) to the caches/bus lets I/O take part in coherency
Cache coherency (cont’d)
In general, cache coherency protocols are based on a set of (cache block) states and state transitions
Two types of protocols: write-invalidate and write-update
Write-invalidate suffers from false sharing
False sharing
Some invalidations are not necessary for correct program execution:
Processor 1:               Processor 2:
while (true) do            while (true) do
    A = A + 1                  B = B + 1
If A and B are located in the same cache block, a cache miss occurs in each loop iteration due to a ping-pong of invalidations (see the padding sketch below)
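A hedged C sketch of the usual software remedy: padding each counter to its own cache block so that A and B no longer share a block. The 64-byte block size and the names are assumptions for illustration.

#define CACHE_BLOCK 64

struct padded_counter {
    volatile long value;
    char pad[CACHE_BLOCK - sizeof(long)];   /* keep A and B in different cache blocks */
};

struct padded_counter A, B;   /* updated by processor 1 and processor 2 respectively */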
Cache coherency (cont’d)
Figure: starting from X cached by P1..Pn, a write by one processor either invalidates all other cached copies (write-invalidate protocol) or updates them all to X' (write-update protocol)
Uniform Memory Access (UMA)
Figure: UMA organizations, with processors (possibly with caches) on one side of an interconnection network and memory modules on the other; every processor sees the same memory latency
Symmetric MultiProcessors (SMPs) are well-known UMA architectures
UMA architectures (cont’d)
Not/hardly scalable
Bus-based architectures → saturation
Crossbars → too expensive (wiring constraints)
Multistage networks → wiring constraints + possibly higher latency (more stages)
Possible solutions
Reduce network traffic by caching
Clustering → non-uniform memory latency behaviour (NUMA)
UMA architectures (cont’d)
Memory contention occurs when multiple processors address the memory at the same moment
Banked/multiple memories
Non-uniform network traffic in multistage networks may cause tree saturation
Use of message combining (e.g. in the atomic Fetch&Add operation)
Message combining
Message combining using the Fetch&Add operation
Figure: the requests Fetch&Add(X,e1) from P1 and Fetch&Add(X,e2) from P2 meet in a switch and are combined into a single Fetch&Add(X,e1+e2) to main memory; on the way back the switch returns X to P1 and X+e1 to P2, and memory ends up holding X+e1+e2
Non Uniform Memory Access (NUMA)
Multiple clusters of SMPs: VSM revisited
Figure: NUMA, with clusters of processors (with caches) and a local memory on a shared network, and the clusters connected by a message-passing network
Local memory references are fast, remote ones are slow (ratio 1:[2-15]) → latency hiding!
The cache controller/MMU determines whether a reference is local or remote
When caching is involved, it is called CC-NUMA (cache-coherent NUMA)
Typically Read Replication (write invalidation)
NUMA (cont’d)
Caches (CC-NUMA) reduce latency
Possibilities for latency hiding → overlap valuable computation with communication (i.e. the fetching of remote data)
Prefetching of data
Before remote data is actually required, fetch it from the remote node
Threading
When the processor threatens to be stalled by a remote data fetch, schedule a new thread of control (lightweight process)
Relaxed memory consistency models: how consistent should the view of memory be?
Sequential consistency (SC)
Processor 1: Processor 2:
A = 0; B = 0;
...... ......
A = 1; B = 1;
if (B == 0) ... if (A == 0) ..
SC model: atomic and strongly ordered memory accesses
e.g. delay a write until all invalidations have been acknowledged
Under SC, at most one of the two if-branches above can be taken; a relaxed model may allow both
Figure: conceptual SC model, with processors P1..Pn connected through a switch to a single-ported memory, performing one access at a time
Relaxed consistency (RC)
Processor consistency (Sparc): loads may bypass writes
Partial store order (Sparc): loads/writes may bypass writes
Weak consistency (PowerPC) and Release consistency (Alpha, MIPS): no ordering between references (synchronization operations act as memory fences)
Note that RC models always need synchronization such that their execution semantics are the same as under the SC model
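As an illustration (using standard C11 atomics, not anything from the slides), a memory fence between the write and the read restores the SC outcome of the A/B example on a relaxed-consistency machine:

#include <stdatomic.h>

atomic_int A, B;   /* both initially 0, as in the example above */

void processor1(void)
{
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);      /* memory fence */
    if (atomic_load_explicit(&B, memory_order_relaxed) == 0) {
        /* ... */
    }
}
/* processor2 is symmetric with A and B swapped */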
Disadvantages of CC-NUMA
Remote data only held in the small local cache
Performance is severely limited when many data references are remote (e.g. incorrect data partitioning, data does not fit in the cache, etc.)
Possible solutions
Increase cache size → expensive and may increase the latency of local references
Page migration/replication implemented by the OS
slow (OS-level) and complex
works only at page granularity: a problem when parallel accesses have a finer granularity (false sharing!)
Cache Only Memory Architecture (COMA)
Similar to NUMA, only the main memories act as direct-mapped or set-associative caches → addresses are hashed to a DRAM "cache line"
Fetched remote data are actually stored in the local main memory (replication)
Data elements do not have a fixed home location: they canmigrate
Figure: COMA, in which each cluster's DRAM acts as a large cache below the processor caches, with the clusters connected by a message-passing network
COMA (cont’d)
MP-network often a hierarchical (e.g. tree) network
A switch within the tree contains a directory with the data elements residing in its sub-tree
Remote access requires a tree traversal, as data has no home node
Switches support message combining
Write-invalidate coherency protocol
COMA (cont’d)
Requires extra memory-subsystem hardware
Tag memory to check whether a data element in the DRAM cache is the required element
Comparators to perform this check for multiple DRAM cache blocks (when the DRAM cache is set-associative)
State memory to keep the state of the DRAM cache elements
COMA versus CC-NUMA
COMA is more flexible than CC-NUMA
Replication not constrained by a small local cache (main memory is much larger!)
Dynamic migration/replication of data without the need for OS support, and at a fine granularity (less false sharing)
COMA needs non-standard memory management hardware (expensive and complex)
Remote accesses in COMA are slower due to the tree traversal
Coherency protocol harder to implement in COMA: take care that the last copy of a data element is not removed
COMA versus CC-NUMA (cont’d)
Performance difference: highly dependent on application
Low miss rates: performance of COMA and CC-NUMA is similar
Capacity misses dominate (a capacity miss occurs because the data does not fit in the cache): COMA outperforms CC-NUMA, as COMA's DRAM cache is usually large enough to store all required data (unlike CC-NUMA's small data caches)
Coherence misses dominate (a coherence miss is due to the invalidation of data): CC-NUMA outperforms COMA due to the higher latency of remote accesses in COMA
Simple COMA (S-COMA)
Data placement/allocation at page granularity: like (software) SVM, the MMU determines whether a page is in the local DRAM. So, tag memory and comparators are not needed
Coherency managed in hardware and at a fine granularity: the transferred data elements are cache blocks (minimizes false sharing)
Figure: S-COMA; pages are allocated in the local DRAM cache, but the individual cache blocks within a page are fetched over the network
S-COMA (cont’d)
A page can be partially filled with valid data (cache blocks)
OS-managed main memory can be fully associative (not feasible in normal COMA)
DRAM cache misses may be slower than COMA misses due to the OS support involved (e.g. page faults)
Probability of false replacement: a page fault (DRAM read miss), fetching a single remote cache block and allocating a new local page, may falsely replace an entire page
Cache coherency revisited
Cache coherency protocols are based on cache block states and state transitions → how to find the copies of a cache block?
Snoopy bus protocols: caches detect copies by monitoring the bus
Typically for broadcast-based architectures: UMA machines or within the SMP clusters of CC-NUMAs
Either write-invalidate or write-update
Directory based protocols: store the locations of copies in a directory
More scalable than snoopy protocols → used in non-broadcast networks (e.g. CC-NUMAs and COMAs)
Typically write-invalidate
Directory based protocols
Full map
Figure: full-map directory; the directory entry for X holds one presence bit per processor, reads of X by P1, P2 and Pn set their bits and copy the data into their caches, and a subsequent write by P2 invalidates the other copies so that only P2's bit remains set
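A hedged C sketch of what a full-map directory entry could look like: one presence bit per processor plus a dirty bit. The names and the 64-processor limit are illustrative assumptions, not a description of a real machine.

#include <stdint.h>
#include <stdbool.h>

#define NPROC 64

typedef struct {
    uint64_t presence;   /* bit i set => processor i holds a copy */
    bool     dirty;      /* true => exactly one (modified) copy exists */
} dir_entry_t;

/* record a read miss by processor p */
static void dir_read(dir_entry_t *e, int p)
{
    e->presence |= (uint64_t)1 << p;
}

/* record a write by processor p: all other copies must be invalidated */
static void dir_write(dir_entry_t *e, int p)
{
    e->presence = (uint64_t)1 << p;   /* invalidations go to all other set bits */
    e->dirty = true;
}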
Directory based protocols (cont’d)
Limited map
Figure: limited-map directory; the entry for X can record only a fixed number of copies (here two pointers), so a read by an additional processor forces one of the existing copies to be invalidated before its pointer is reused
Directory based protocols (cont’d)
Chained directory
Figure: chained directory; the directory points to one cached copy of X and each copy holds a pointer to the next, forming a linked list, and a read by Pn adds its cache to the head of the chain
Cache coherency (cont’d)
Figure: state-transition graphs for cache i; a write-through cache has the states Valid and Invalid, while a write-back (MSI) cache has the states M (Modified), S (Shared) and INV (Invalid or not in cache). Notation: R(i) = read of the block by cache i, W(i) = write to the block by cache i, Z(i) = replacement of the block in cache i, and j denotes any cache other than i
Cache coherency (cont’d)
A snoopy-bus system with 3 processors with MSI write-back caches

Proc. action   P1 state   P2 state   P3 state   Bus action   Data from
P1 read x      S          -          -          Rd           memory
P3 read x      S          -          S          Rd           memory
P3 write x     I          -          M          Inv          -
P2 write x     I          M          I          RdI          P3's cache/memory
P1 read x      S          S          I          Rd           P2's cache/memory
P3 read x      S          S          S          Rd           memory
Cache coherency (cont’d)
MESI protocol frequently used in commodity processors
M(odified): dirty, exclusive cache block
E(xclusive): clean, exclusive cache block
S(hared): clean, shared cache block
I(nvalid): block not resident in the cache
The Exclusive state reduces invalidation traffic: the cache can write to an exclusive block without sending invalidations
The bus needs an extra status line signalling whether or not data is shared
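A hedged C sketch of just the local-write transition in MESI (not a full protocol implementation): a write hit in E or M needs no bus traffic, while a write in S or I must first invalidate the other copies.

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

mesi_t on_local_write(mesi_t state, int *bus_invalidate_needed)
{
    *bus_invalidate_needed = 0;
    switch (state) {
    case MODIFIED:
    case EXCLUSIVE:                 /* silent upgrade: no invalidations sent */
        return MODIFIED;
    case SHARED:
        *bus_invalidate_needed = 1; /* other copies must be invalidated first */
        return MODIFIED;
    case INVALID:
        *bus_invalidate_needed = 1; /* read-for-ownership on the bus */
        return MODIFIED;
    }
    return state;
}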
Cache coherency (cont’d)
Cache coherency in a cache hierarchy
Snooping logic not at all levels
Solution: preserve the inclusion property
If a memory block is in the L1 cache, then it is also in the L2 cache
If a block is in the Modified state in the L1 cache, then it must also be marked as modified in the L2 cache
Requirements for inclusion are not trivial: different block sizes, associativities, etc.
Automatic inclusion: L1 direct-mapped and L2 direct-mapped/set-associative with identical block sizes and sets_L1 <= sets_L2
Synchronization
Hardware synchronization in multiprocessors: similar to software-based (OS-level) synchronization for critical sections (semaphores, monitors)
Atomic read-modify-write operations, such as the test-and-set operation, allow the implementation of synchronization primitives (locks)
int test_and_set(int *address) {
    int temp = *address;   /* read and set are performed as one atomic operation */
    *address = 1;
    return temp;
}
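For comparison, a sketch of the same idea with standard C11 atomics; atomic_flag_test_and_set is the portable counterpart of the hardware test-and-set above (the function names are illustrative).

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void acquire(void) {
    while (atomic_flag_test_and_set(&lock_flag))
        ;   /* spin until the previous value was clear (lock was free) */
}

void release(void) {
    atomic_flag_clear(&lock_flag);
}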
Synchronization (cont’d)
Spin lock

void lock(int *lock) {
    while (test_and_set(lock) == 1)
        ;   /* busy-wait until the lock was free */
}

void unlock(int *lock) {
    *lock = 0;
}

Suspend lock: instead of busy-waiting, the process is suspended (descheduled) until the lock becomes free
Synchronization (cont’d)
Spin locks may cause thrashing:
Figure: spin-lock thrashing; (a) P0 acquires the lock, and in (b)-(d) each test-and-set by P1 or P2 writes the lock variable, so the cache block holding the lock bounces between the caches (dirty in one, invalid in the others) even though the lock cannot be acquired
Synchronization (cont’d)
Possible solutions to avoid thrashing
Snooping lock

void lock(int *lock) {
    while (test_and_set(lock) == 1)
        while (*lock == 1)
            ;   /* spin on the locally cached copy until it is invalidated */
}

test-and-test-and-set lock

void lock(int *lock) {
    for (;;) {
        while (*lock == 1)
            ;                          /* read-only spin: no bus traffic */
        if (test_and_set(lock) == 0)   /* then try to grab the lock */
            break;
    }
}
Synchronization (cont’d)
Barrier synchronization
Shared counter counting the processes reaching the barrier (see the sketch below)
Hardwired barrier lines
Figure: hardwired barrier lines b1..bn from processors P1..Pn
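A hedged C sketch of the shared-counter barrier mentioned above, using a C11 atomic fetch-and-add with sense reversal; nproc and the variable names are illustrative, and a real implementation would add padding and backoff.

#include <stdatomic.h>

atomic_int count;          /* number of processes that have arrived */
atomic_int sense;          /* flips each time the barrier completes */

void barrier(int nproc)
{
    int my_sense = !atomic_load(&sense);
    if (atomic_fetch_add(&count, 1) == nproc - 1) {
        atomic_store(&count, 0);          /* last arrival resets the counter */
        atomic_store(&sense, my_sense);   /* and releases the others */
    } else {
        while (atomic_load(&sense) != my_sense)
            ;                             /* spin until released */
    }
}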
Disk storage considerations
To increase the fault tolerance and performance of disks: RAID (Redundant Array of Inexpensive Disks)
Data is striped over the disks
Parallel disk access is possible (important for SMPs)
7 RAID levels, each with a different scheme to provide redundancy
RAID levels 1-5 survive one disk crash, level 6 survives two
RAID (cont’d)
RAID level 0: no redundancy
RAID level 1: Mirroring
Requires twice the number of disks
Small recovery time
RAID level 3: Bit-interleaved Parity
One redundant disk containing parity information
All reads and writes go to all disks → no parallel disk access
RAID level 5: Block-interleaved Distributed Parity
The equivalent of one redundant disk of parity information, distributed over the disks
Reads go to one disk; writes also need reads of other disks to update the parity (a parity sketch follows below)
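A hedged C sketch of the parity used by RAID levels 3-5: the parity block is the bytewise XOR of the data blocks, so any single lost block can be rebuilt by XOR-ing the surviving blocks with the parity. The block size and disk count are illustrative assumptions.

#include <stddef.h>

#define BLOCK 512
#define NDATA 4            /* illustrative: 4 data disks + 1 parity */

void compute_parity(const unsigned char data[NDATA][BLOCK],
                    unsigned char parity[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++) {
        unsigned char p = 0;
        for (int d = 0; d < NDATA; d++)
            p ^= data[d][i];
        parity[i] = p;      /* reconstruction of a lost block works the same way */
    }
}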
RAID (cont’d)
Block-interleaved Parity (level 4) versus Block-interleaved Distributed Parity (level 5)
The dedicated parity disk in RAID level 4 forms a bottleneck
Figure: layout of data blocks 0-15 and parity blocks P0-P3; in RAID 4 all parity blocks reside on one dedicated parity disk, while in RAID 5 the parity blocks are rotated across all disks
Real multiprocessors: SGI Origin 2000
CC-NUMA architecture
The SGI Origin 2000 (cont’d)
The SGI Origin 2000 (cont’d)
The Cray T3D
NUMA architecture
The Cray MTA
A UMA MultiThreaded Architecture (MTA): 128 hardware threads per processor
Figure: up to 256 processors, 256 I/O processors, 256 I/O caches and 512 memories connected by a 3D toroidal mesh (16x16x16)
The Cray MTA (cont’d)
Figure: per-thread hardware context (program counter, status word SSW, target registers T0-T7 and general registers R0-R31), replicated 128 times per processor
Each thread has its own context: 128 * 32 = 4K GPRs
At every instruction a new thread may be scheduled
No data caches
Latency hiding by thread scheduling
No cache coherency problem!
Architecture fully pipelined → enough runnable threads avoid bubbles in the pipeline
The Cray MTA (cont’d)
Figure: MTA processor pipeline; instructions are fetched from the per-thread instruction pool and issued, memory operations go through a memory/retry pool and the interconnection network, and results are written back to the registers
The Cray MTA (cont’d)
Lookahead field in instructions indicating the number of succeeding, independent instructions
LIW (Large Instruction Word) instructions containing 3 operations (1 arithmetic, 1 memory and 1 branch/simple arithmetic)
Tagged memory
Setting traps on memory locations
Forwarding (invisible indirection)
Synchronization (full/empty bit), e.g. a read does not complete until the full bit is set
Future directions: what’s next?
Who can tell?
Super-speculative processors
Trace/Multiscalar processors
Simultaneous MultiThreaded processors (→ started: Pentium 4)
I(ntelligent)RAMs
Reconfigurable (co-)processors
Single-chip multiprocessors (started: POWER 4)
What’s next? (cont’d)
Instruction execution generations
First generation
Pipelining, second generation
Superscalar pipelining, third generation
Fourth generation (?)
Trace processors
Traces (consisting of multiple basic blocks) are the basic unit for fetching and execution
Traces are constructed dynamically
The processor contains multiple superscalar processing cores to execute traces in parallel
Trace processors (cont’d)
Rely heavily on speculative execution
Next-trace prediction
Branch prediction to construct traces
Data value prediction to "remove" RAW dependencies between traces
Mispredictions may be painful
The trace cache stores whole traces as its basic elements (located between the I-cache and the decoder)
Typical architecture of a trace processor
Figure: a trace processor; branch prediction, trace construction and instruction preprocessing fill a trace cache, and next-trace prediction plus data-value prediction dispatch traces to multiple superscalar processing elements, each with an instruction buffer, local registers and functional units, sharing a set of global registers
Value prediction
Classification of speculative execution techniques
Speculative execution
  Control speculation
    Branch direction (binary)
    Branch target (multi-valued)
  Data speculation
    Data location
      Aliased (binary)
      Address (multi-valued)
    Data value (multi-valued)
Value prediction (cont’d)
Exploit value locality to reduce data-flow restrictions
Value caching
Common subexpression elimination in hardware
Predicting values (needs verification at the commit stage)
Last-value predictors
Stride predictors (see the sketch below)
Context-based predictors: the next value is predicted from a number of preceding values (a sort of Markov chain)
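A hedged C sketch of a per-instruction stride predictor: predict last value + stride and retrain at commit. The table size and the indexing by PC are illustrative assumptions.

#include <stdint.h>

#define VPT_SIZE 1024

typedef struct {
    int64_t last;     /* last committed result           */
    int64_t stride;   /* difference of the last two results */
} vpt_entry_t;

static vpt_entry_t vpt[VPT_SIZE];

int64_t predict(uint64_t pc)
{
    vpt_entry_t *e = &vpt[pc % VPT_SIZE];
    return e->last + e->stride;          /* speculative value */
}

void train(uint64_t pc, int64_t actual)  /* called at the commit stage */
{
    vpt_entry_t *e = &vpt[pc % VPT_SIZE];
    e->stride = actual - e->last;
    e->last = actual;
}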
Multiscalar processors
The compiler recognizes tasks in the Control Flow Graph (CFG) of a program
Tasks consist of multiple basic blocks: similar to traces, but they may contain both the taken and untaken paths of internal branches
Sequential relationship between tasks
Tasks are executed in parallel while preserving a loose sequential order
Data dependencies between tasks (identified by the compiler) are explicitly communicated via a unidirectional communication ring network
A sequencer schedules tasks to the computing elements and performs next-task prediction
Multiscalar processors (cont’d)
Recognition of tasks in a CFG
Figure: a control-flow graph partitioned into tasks A-E; the tasks are assigned to processing elements PE0-PE3, which forward data values to their successors
Multiscalar processors (cont’d)
A possible multiscalar microarchitecture
Figure: a possible multiscalar microarchitecture; a sequencer (with head and tail pointers) dispatches tasks from the I-cache to a ring of processing units, each with its own processing element and register file, and the units access interleaved data banks through an interconnect
Simultaneous MultiThreading (SMT)
Multiple hardware contexts, one for each thread
Multiple program counters, register sets, etc.
Traditional fine-grained (vertical) multithreading allows scheduling a thread (issuing its instructions) each cycle (e.g. Cray MTA and Sun MAJC)
Provides latency hiding
Resources are wasted when a thread does not have a lot of ILP
The potential waste of resources also holds for on-chip multiprocessors
Simultaneous MultiThreading (SMT) (cont’d)
SMT allows issuing instructions from all threads at eachcycle
Provides latency hiding + improved ILP (better utilization of resources)
Shown to be a rather straightforward extension of normal superscalar architectures, but
The instruction fetch unit should fetch instructions from multiple PCs (calls for restrictions)
Requires a much larger register file (deeper pipeline/lower clock speed)
Simultaneous MultiThreading (SMT) (cont’d)
Figure: issue-slot usage over time; a superscalar leaves slots unused within its single thread, fine-grained (vertical) multithreading fills whole cycles from one thread at a time, and simultaneous multithreading fills slots within the same cycle from several threads
Simultaneous MultiThreading (SMT) (cont’d)
The sharing of resources may have some negative effects
Branch prediction interference
Interthread cache interference
Increased memory traffic
However, most of these negative effects are hidden because of the multithreading
Intelligent RAM
Addressing the CPU-memory performance gap (widening for increasingly aggressive superscalar-like architectures)
Currently, about 60 to 70% of the die area is used for caches and other memory latency hiding hardware
IRAM solution: integrate the processor logic with the DRAM
DRAMs become large enough to store programs and data sets on a single chip
IRAM (cont’d)
Some potential advantages
High internal memory bandwidth (a potential 50X to 100X increase)
Lower memory latency (a potential 5X to 10X decrease)
More energy efficient (fewer or no accesses to a high-capacitance off-chip bus, and DRAM consumes less energy than SRAM)
IRAM (cont’d)
Some potential disadvantages
Larger area and lower speed of logic in a DRAM process
Multiplexed I/O lines in DRAM should be avoided: increase of area, power and cost
Retention time of DRAM depends on temperature: refresh rates could rise dramatically