Welcome to AVDARK
Erik Hagersten, Uppsala University
Dept of Information Technology | www.it.uu.se | © Erik Hagersten | http://user.it.uu.se/~eh | Intro and Caches
AVDARK 2012
Ericsson, CPU designer 1982-84: APZ 212
MIT 1984-85: dataflow parallel architecture
Ericsson computer science lab 1985-1988: NetInsight, (Erlang)
SICS, parallel architectures 1988-1993: COMA, (Simics & Virtutech)
Sun Microsystems, chief architect servers 1993-1999: WildFire, E6000, E15000, E25000
Professor, Uppsala University 1999- : new modeling + Acumem
Startup Acumem 2006-2010: ThreadSpotter
Chief scientist, Rogue Wave Software 2011-
Goal for this course
Understand how and why modern computer systems are designed the way they are:
  pipelines, memory organization, virtual/physical memory, ...
Understand how and why multiprocessors are built:
  cache coherence, memory models, synchronization, ...
Understand how and why parallelism is created and leveraged:
  instruction-level parallelism, memory-level parallelism, thread-level parallelism, ...
Understand how and why multiprocessors of combined SIMD/MIMD type are built:
  GPUs, vector processing, ...
Understand how computer systems are adapted to different usage areas:
  general-purpose processors, embedded/network processors, ...
Understand the physical limitations of modern computers:
  bandwidth, energy, cooling, ...
Literature
Computer Architecture: A Quantitative Approach (4th/5th edition), John Hennessy & David Patterson

Lecturers
Erik Hagersten gives most lectures and is responsible for the course. Andreas Sembrant and Jonas Flodin are responsible for the labs and the hand-ins. Sverker Holmgren will teach parallel programming. David Black-Schaffer will teach about graphics processors.
Mandatory Assignments
There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point.

Optional Assignments
There are four (optional) hand-in assignments. Each can earn you a bonus point.

Examination
Written exam at the end of the course. No books are allowed.

Bonus system
64 p max / 32 p to pass. For each bonus point there is a corresponding 4 p bonus question; full bonus therefore equals a pass.
AVDARK in a nutshell
Exam and bonus structure
4 labs (mandatory)
4 hand-ins (optional)
Written exam

How to get a bonus point:
  Complete the extra bonus activity at a lab occasion
  Complete an optional bonus hand-in [with reasonable accuracy] before a hard deadline

32 p / 64 p at the exam = PASS
Schedule in a nutshell: 5 batches
1. Memory systems: caches, VM, DRAM, microbenchmarks, optimizing SW
2. Multiprocessors: TLP: coherence, memory models, interconnects, scalability, clusters, ...
3. Scalable multiprocessors: scalability, synchronization, clusters, ...
4. CPUs: ILP: pipelines, scheduling, superscalars, VLIWs, vector instructions, ...
5. Widening the view: technology impact, GPUs, multicores, future trends
AVDARK on the Web: www.it.uu.se/edu/course/homepage/avdark/ht12
Welcome! News FAQ Schedule Slides New Papers Assignments Reading instructions Exam
Crash Course in Computer Architecture
(covering the course in 45 min)
Erik Hagersten, Uppsala University
≈30 years ago: APZ 212 @ 5 MHz, "the AXE supercomputer"
APZ 212 marketing brochure quotes:
"Very compact": 6 times the performance, 1/6th the size, 1/5 the power consumption
"A breakthrough in computer science"
"Why more CPU power?"
"All the power needed for future development"
"...800,000 BHCA, should that ever be needed"
"SPC computer science at its most elegant"
"Using 64 kbit memory chips"
"1500 W power consumption"
CPU Improvements
[Chart: relative performance on a log scale (1 to 1000) vs. year, 1970-2000.]
How to get efficient architectures
Increase clock frequency (very hard today)
Create and exploit locality:
  a) spatial locality
  b) temporal locality
  c) geographical locality
Create and exploit parallelism:
  a) instruction-level parallelism (ILP)
  b) thread-level parallelism (TLP)
  c) memory-level parallelism (MLP)
Speculative execution:
  a) out-of-order execution
  b) branch prediction
  c) prefetching
Load/Store architecture (e.g., "RISC")
ALU ops: Reg --> Reg
Mem ops: Reg <--> Mem
Instructions name three explicit registers: Source1, Source2, Destination.
[Figure: the ALU and explicit registers inside the CPU; loads/stores are the only memory accesses.]

Example: the compiler turns C = A + B into
  LD  R1, [A]
  LD  R3, [B]
  ADD R2, R1, R3
  ST  R2, [C]
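The load/store sequence above can be sketched in a few lines (a hedged illustration, not the lecture's machine; the memory contents 3 and 4 are made-up values):

```python
# Sketch of a load/store machine executing C = A + B.
# ALU ops touch only registers; LD/ST are the only memory operations.
mem = {"A": 3, "B": 4, "C": 0}        # word memory, symbolic addresses (made-up values)
regs = {}                             # explicit register file

regs["R1"] = mem["A"]                 # LD  R1, [A]
regs["R3"] = mem["B"]                 # LD  R3, [B]
regs["R2"] = regs["R1"] + regs["R3"]  # ADD R2, R1, R3  (Reg --> Reg only)
mem["C"] = regs["R2"]                 # ST  R2, [C]
```

Note that the ADD never touches memory: on a load/store architecture the compiler must bracket every computation with explicit LD/ST instructions.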
Lifting the CPU hood (simplified...)
[Figure: a CPU next to memory; instructions D, C, B, A wait in memory.]

Pipeline
[Figure, repeated over several slides as an animation: the instructions flow, one per cycle, through a four-stage pipeline I R X W connected to the register file and to memory.]
Pipeline:
I = Instruction fetch
R = Read register
X = Execute
W = Write register/mem
The registers hold the processor "state"; the pipeline stages sit between the register file and the memory for data and instructions.
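A quick sanity check on the four-stage pipe, as a sketch under the slides' ideal no-stall assumption:

```python
STAGES = ("I", "R", "X", "W")   # fetch, read registers, execute, write back

def total_cycles(n_instructions, n_stages=len(STAGES)):
    # One instruction enters the pipe per cycle; after the pipe has
    # filled (n_stages - 1 cycles), one instruction completes per cycle.
    return n_instructions + n_stages - 1
```

With no stalls, a four-instruction sequence finishes in 4 + 3 = 7 cycles; the data and branch delays discussed below only add to this.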
Register Operation [a.k.a. ALU operation]
ADD R1, R2, R3   a.k.a. R1 := R2 op R3
[Figure: the instruction is fetched in I, operands 2 and 3 are read in R, OP (e.g., +, -, *, /) is applied in X, and result 1 is written back in W.]
Load Operation
LD R1, mem[cnst + R2]
[Figure: R2 is read (2), the address cnst + R2 is computed in X, memory is accessed, and the result is written to R1 (1).]
Store Operation
ST R2, mem[cnst + R1]
[Figure: R1 and R2 are read (1, 2), the address cnst + R1 is computed in X, and R2's value is written to memory.]
Branch Operation
if (R1 OP Const) GOTO mem[R2]
[Figure: R1 and R2 are read (1, 2), OP compares R1 with the constant, and on a taken branch the target is written into the PC.]
PC = Program Counter: a special register pointing to the next instruction to execute.
Pipeline walk-through: Initially, then Cycle 1 through Cycle 8
Program:
  LD RegA, (100 + RegC)
  IF RegC < 100 GOTO A
  RegB := RegA + 1
  RegC := RegC + 1
[Animation over nine slides: each cycle, one more instruction enters the I stage while the earlier ones advance through R, X and W. The additions execute in the X stage, the comparison RegC < 100 follows, and in cycle 7 the branch writes Addr(A) into the PC, so fetch restarts at A in cycle 8.]
Data dependency
The previous execution example is wrong!
[Figure: the same four-instruction program; an instruction reads a register before the earlier instruction producing it has written its result.]
Data dependency fix 1: pipeline delays
[Figure: the dependent instruction is held back ("Stall", "Stall") until the earlier instruction's result is available.]
Branch delays
[Figure: after the branch, fetch must stall ("Stall", "Stall", "Stall") until the next PC is known.]
9 cycles per iteration of 4 instructions. We need longer basic blocks with independent instructions.
Need to find Instruction-Level Parallelism (ILP) to avoid stalls!
It is actually a lot worse! Modern CPUs: "superscalars" with ~4 parallel pipelines
[Figure: issue logic feeding four parallel I R X W pipelines.]
+ Higher throughput
- More complicated architecture
- Branch delay more expensive (more instructions missed)
- Harder to find "enough" independent instructions (need 8 instructions between write and use)
Need to find 4x more ILP.
It is actually a lot worse! Modern CPUs: ~10-20 stages/pipe
[Figure: the four pipelines are now deeper, e.g. I R B M M W.]
+ Shorter cycle time (higher GHz)
- Branch delay even more expensive
- Even harder to find "enough" independent instructions
It is actually a lot worse! DRAM access: ~150 CPU cycles
[Figure: the pipelines now reach memory over a ~150-cycle path.]
Pipeline delays get worse
[Figure: the same four-instruction loop, but the load now stalls the pipeline for 151 cycles ("x151 Stall") and the branch for one more ("x1 Stall").]
Fix 1: Out-of-order execution (improving ILP)
  LD  R1, M(100)
  ADD R3, R2, R1
  SUB R5, R6, R7
  ST  R5, M(100)
The HW may execute instructions in a different order, but will make the "side effects" of the instructions appear in order.
Assume that the LD takes a long time. The ADD depends on the LD, so start the SUB and ST before the ADD, and update R5 and M(100) after R3.
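The dependence reasoning above can be sketched as a "taint" walk over the instruction window (illustrative only; the tuple encoding of reads/writes is my own):

```python
# The slide's window after the slow LD: each entry lists the registers
# an instruction reads and the register it writes.
window = [
    ("ADD R3, R2, R1", {"R2", "R1"}, "R3"),
    ("SUB R5, R6, R7", {"R6", "R7"}, "R5"),
    ("ST  R5, M(100)", {"R5"},       None),
]

waiting = {"R1"}          # the slow LD's result is not available yet
must_wait, may_run_early = [], []
for name, reads, writes in window:
    if reads & waiting:   # (transitively) depends on the LD
        must_wait.append(name)
        if writes:
            waiting.add(writes)
    else:                 # independent: may execute out of order,
        may_run_early.append(name)   # though side effects still retire in order
```

The SUB and the ST (which reads the SUB's result, not the LD's) come out as runnable early, exactly as the slide argues.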
Fix 2: Branch prediction
  LD  R1, M(100)
  ADD R3, R2, R1
  SUB R5, R6, R7
  ST  R5, M(100)
[Figure: a branch ">= 0?" with a taken (Y) and a not-taken (N) path; the Y path leads to another LD/ADD/SUB/ST block.]
The HW can guess whether the branch is taken or not, and avoid branch stalls if the guess is correct.
Assume the guess is "Y". The HW can start to execute these instructions before the outcome of the branch is known, but cannot allow any "side effect" to take place until the outcome is known.
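One common guessing mechanism (a sketch; the slides do not commit to a particular predictor) is a 2-bit saturating counter per branch:

```python
class TwoBitPredictor:
    """States 0-1 predict not taken, 2-3 predict taken; a single
    surprise does not flip a strongly biased counter."""
    def __init__(self, state=2):      # start weakly taken (arbitrary choice)
        self.state = state

    def predict(self):
        return self.state >= 2        # True = guess "taken"

    def update(self, taken):
        # Saturate at the ends of the 0..3 range.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]       # a loop branch: taken 9 times, then exit
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
```

For this loop the predictor guesses 9 of 10 outcomes right; only the loop exit is mispredicted, which is when the speculative work must be thrown away.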
Fix 3: Scheduling past branches (improving ILP)
[Figure: four basic blocks (LD, ADD, SUB, ST) separated by branches ">= 0?", "> 1?", "< 2?", "= 0?", each marked "predict taken" (Y); the predicted path runs through all four blocks.]
All instructions along the predicted path can be executed out of order.
Fix 4: Scheduling past branches (improving ILP), continued
[Figure: the same chain of basic blocks, but one prediction is wrong: the actual path diverges from the predicted path.]
Wrong prediction!!! Throw away the speculative work, i.e., no side effects!
Fix 5: Use a cache
[Figure: a 64 kB cache ($) with a 1-10 cycle access time sits between the pipelines and the 1 GB memory, which is 150 cycles away.]
How to get efficient architectures (recap)
Increase clock frequency
Create and exploit locality: a) spatial b) temporal c) geographical
Create and exploit parallelism: a) instruction-level (ILP) b) thread-level (TLP) c) memory-level (MLP)
Speculative execution: a) out-of-order execution b) branch prediction c) prefetching
Whoops, using too much power (2007)
Running at 2x the frequency will use much more than 2x the power. It is also really hard to find enough ILP, and speculation results in a fair amount of wasted work.
Fix: Multicore
[Figure: four CPUs, each with a private first-level cache ($1), share an L2$, a memory interface and an external interface.]
But now we also need to find Thread-Level Parallelism (TLP): threads.
Example: Intel i7 "Nehalem"
[Figure: chip diagram showing the DRAM interface and the coherence links.]
What is computer architecture?
"Bridging the gap between programs and transistors"
"Finding the best model to execute the programs"
best = {fast, cheap, energy-efficient, reliable, predictable, ...}

Caches and more caches
or: spam, spam, spam and spam
Erik Hagersten, Uppsala University, Sweden
Fix 5: Use a cache
[Figure: a ~32 kB cache ($) with ~1 cycle access time sits between the pipelines and the 1 GB memory, which is 200 cycles away.]
Webster about "cache": 1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT] 1a: a hiding place esp. for concealing and preserving provisions or implements 1b: a secure place of storage 2: something hidden or stored in a cache
Cache knowledge useful when...
Designing a new computer
Writing an optimized program, compiler, or operating system, ...
Implementing software caching: web caches, proxies, file systems
Memory/storage
[Figure: the hierarchy from SRAM through DRAM to disk.]
2000: 1 ns / 1 ns / 3 ns / 10 ns / 150 ns / 5,000,000 ns, at sizes 1 kB / 64 k / 4 MB / 1 GB / 1 TB
(1982: 200 ns / 200 ns / 200 ns / 10,000,000 ns)
Address Book Cache: looking for Tommy's telephone number
[Figure: an address book with one tab per letter (T, U, V, X, Y, Z, Å, Ä, Ö). The page for T holds "TOMMY 12345": the first letter is the indexing function, the rest of the name is the "address tag", and the number is the "data".]
One entry per page => a direct-mapped cache with 28 entries.
Address Book Cache: looking for Tommy's number
[Figure: index on the first letter "T", then compare the stored tag "OMMY" with the rest of the looked-up name (EQ?). Equal => hit: the stored number 12345 is Tommy's.]
Address Book Cache: looking for Tomas' number
[Figure: index on "T" again, but the stored tag "OMMY" does not match "OMAS" (EQ? fails).]
Miss! Look up Tomas' number in the telephone directory.
Address Book Cache: looking for Tomas' number
[Figure: replace TOMMY's data with TOMAS' data, "OMAS 23457". There is no other choice (direct mapped).]
Cache
[Figure: the CPU sends an address to the cache and gets back data (a word) plus a hit signal; on the far side, the cache exchanges address and data with memory.]
Cache Organization
[Figure: the address-book cache as hardware: 28 entries, each with a valid bit (1), an address tag (4 letters) and data (5 digits). The index picks one entry; the tag comparison (=) ANDed (&) with the valid bit produces Hit (1).]
Cache Organization (really): 4 kB, direct mapped
[Figure: a 32-bit address identifying a byte in memory, e.g. 00100110000101001010011010100011, is split (msb ... lsb) into an address tag, an index into 1k entries of 4 bytes each, and the byte within the word. Each entry holds a valid bit (1), a tag and 32 bits = 4 bytes of data; tag equality (=) ANDed (&) with valid gives Hit? (1). The field widths are left as (?) for the class.]
What is a good index function?
Cache Organization: 4 kB, direct mapped
[Figure: same as before with the widths filled in: 2 bits identify the byte within a word, 10 bits index the 1k entries, and the remaining 20 bits are the tag, stored (20) and compared (20) against the address.]
Memory overhead: 21/32 = 66% (20 tag bits + 1 valid bit per 32-bit entry)
Latency = SRAM + CMP + AND
Cache
[Figure: CPU, cache and memory as before.]
Hit: use the data provided by the cache.
Miss: use the data from memory and also store it in the cache.
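That hit/miss rule fits in a few lines; this sketches the 4 kB direct-mapped cache (1k one-word entries) from the earlier slide, storing only tags for clarity:

```python
WORD = 4                 # bytes per entry (one 32-bit word)
ENTRIES = 1024           # 4 kB / 4 B

tags = [None] * ENTRIES  # None doubles as a cleared valid bit

def access(addr):
    """Return True on a hit; on a miss, also store the line in the cache."""
    index = (addr // WORD) % ENTRIES   # the 10 index bits
    tag = addr // (WORD * ENTRIES)     # everything above index + offset
    if tags[index] == tag:
        return True
    tags[index] = tag                  # fill from memory
    return False
```

Two addresses exactly 4 kB apart map to the same entry, so they evict each other even when the rest of the cache is empty; that is the conflict-miss problem addressed later with associativity.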
Cache performance parameters
  cache "hit rate" [%]
  cache "miss rate" [%] (= 1 - hit_rate)
  hit time [CPU cycles]
  miss time [CPU cycles]
  hit bandwidth
  miss bandwidth
  write strategy
  ...
How to rate architecture performance?
Marketing: frequency and number of "cores"...
Architecture "goodness": CPI = cycles per instruction, IPC = instructions per cycle
Benchmarking: SPEC-fp, SPEC-int, ..., TPC-C, TPC-D, ...
Warning: using an unrepresentative benchmark can be misleading.
Cache performance example
Assumptions: infinite bandwidth, a perfect 1.0 cycles-per-instruction (CPI) CPU, 100% instruction cache hit rate.

Total number of cycles
  = #instr * ((1 - mem_ratio) * 1 + mem_ratio * avg_mem_access_time)
  = #instr * ((1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time))

CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time + (1 - hit_rate) * miss_time)
Example Numbers
CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time

mem_ratio = 0.25, hit_rate = 0.85, hit_time = 3, miss_time = 100

CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100
    = 0.75 + 0.64 + 3.75 = 5.14
      (CPU)  (HIT)  (MISS)
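The formula drops straight into a helper that reproduces the slide's numbers (the function name and keyword arguments are mine):

```python
def cpi(mem_ratio, hit_rate, hit_time, miss_time):
    # Non-memory instructions cost 1 cycle; memory instructions cost
    # the average memory access time of the cache/memory pair.
    return (1 - mem_ratio) + mem_ratio * (
        hit_rate * hit_time + (1 - hit_rate) * miss_time)

base = cpi(mem_ratio=0.25, hit_rate=0.85, hit_time=3, miss_time=100)        # 5.14
better_hit = cpi(mem_ratio=0.25, hit_rate=0.95, hit_time=3, miss_time=100)  # 2.71
```

Playing with the arguments shows the same effect as the "what if" slide: improving the hit rate attacks the dominant miss term, so it helps far more than speeding up the CPU.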
What if ...
CPI = (1 - mem_ratio) + mem_ratio * (hit_rate * hit_time) + mem_ratio * (1 - hit_rate) * miss_time
mem_ratio = 0.25, hit_rate = 0.85, hit_time = 3, miss_time = 100
                            (CPU)  (HIT)  (MISS)
Baseline:                ==> 0.75 + 0.64 + 3.75 = 5.14
Twice as fast CPU        ==> 0.37 + 0.64 + 3.75 = 4.77
Faster memory (70 c)     ==> 0.75 + 0.64 + 2.62 = 4.01
Improve hit_rate (0.95)  ==> 0.75 + 0.71 + 1.25 = 2.71
How to get more effective caches:
  larger cache (more capacity)
  cache block size (larger cache lines)
  more placement choice (more associativity)
  innovative caches (victim, skewed, ...)
  cache hierarchies (L1, L2, L3, CMR)
  latency hiding (weaker memory models)
  latency avoiding (prefetching)
  cache avoiding (cache bypass)
  optimized application/compiler
  ...
Why do you miss in a cache?
Mark Hill's three "Cs":
  Compulsory misses (touching data for the first time)
  Capacity misses (the cache is too small)
  Conflict misses (non-ideal cache implementation; too many names starting with "H")
Multiprocessors add:
  Communication misses (imposed by communication)
  False sharing (a side effect of large cache blocks)
Avoiding capacity misses: a huge address book
Lots of pages, one entry per page.
[Figure: the pages are now indexed by the first two letters (T, U, V, ... ÖX, ÖY, ÖZ, ÖÅ, ÖÄ, ÖÖ); the example entry is LING 12345. The remaining letters are the "address tag", the number is the "data", and the two-letter prefix is the new indexing function.]
One entry per page => a direct-mapped cache with 28^2 = 784 entries.
Cache Organization: 1 MB, direct mapped
[Figure: 2 bits identify the byte within a word, 18 bits index the 256k entries, and the remaining 12 bits are the tag, stored (12) and compared (12) for Hit? (1).]
Memory overhead: 13/32 = 40% (12 tag bits + 1 valid bit per 32-bit entry)
Latency = SRAM + CMP + AND
Pros/Cons Large Caches
++ the safest way to get improved hit rate
-- SRAMs are very expensive!!
-- larger size ==> slower speed (more load on "signals", longer distances)
-- (power consumption)
-- (reliability)
Why do you hit in a cache?
Temporal locality: likely to access the same data again soon.
Spatial locality: likely to access nearby data again soon.
Typical access pattern (inner loop stepping through an array):
  A, B, C, A+1, B, C, A+2, B, C, ...
(the repeated B and C accesses are temporal locality; the A, A+1, A+2 sequence is spatial locality)
Fetch more than a word: cache blocks (a.k.a. cache lines)
1 MB, direct mapped, cache line = 16 B
[Figure: 64k entries, each holding a 128-bit line as four 32-bit words. 16 bits index the entries; 2 bits select the word within the cache line through a 4:1 multiplexer; the lowest bits identify the byte within a word; the remaining 12 bits are the tag, stored (12) and compared (12) for Hit? (1).]
Memory overhead: 13/128 = 10%
Latency = SRAM + CMP + AND
Example in class
Direct-mapped cache: cache size = 64 kB, cache line = 16 B, word size = 4 B, 32-bit address (byte addressable).

"There are 10 kinds of people in the world... those who understand binary numbers and those who do not."
Pros/Cons Large Cache Lines
++ exploits spatial locality
++ fits well with modern DRAMs (first DRAM access slow, subsequent accesses fast: "page mode")
-- poor usage of SRAM & BW for some patterns
-- higher miss penalty (fix: critical word first)
-- (false sharing in multiprocessors)
[Graph: performance vs. cache line size rises, then falls.]
Thanks: Dr. Erik Berg
UART: StatCache graph, app = matrix multiply
[Graph comparing cache-line sizes across cache sizes.]
Small caches: short cache lines are better.
Large caches: longer cache lines are better.
Huge caches: everything fits regardless of cache-line size.
Note: this is just a single example, but the conclusion typically holds for most applications.
Cache Conflicts
Typical access pattern (inner loop stepping through an array):
  A, B, C, A+1, B, C, A+2, B, C, ...
What if B and C index to the same cache location?
Conflict misses -- big time! Potential performance loss 10-100x.
Address Book Cache: two names per page
Index first, then search both entries.
[Figure: the page for T now holds two entries, "OMMY 12345" and "OMAS 23457"; both tags are compared (EQ?) with the looked-up name in parallel.]
Avoiding conflicts: more associativity
1 MB, 2-way set-associative, CL = 4 B
[Figure: 128k "sets", each with two entries of valid bit (1), 13-bit tag and 32-bit data. 17 bits index the set; both tags are compared (=, &) in parallel, the two hit signals combine into Hit?, and "logic" drives the select signal (1) of a 2:1 multiplexer that picks the right data word.]
Latency = SRAM + CMP + AND + LOGIC + MUX
How should the select signal be produced?
Pros/Cons Associativity
++ avoids conflict misses
-- slower access time
-- more complex implementation (comparators, muxes, ...)
-- requires more pins (for external SRAM...)
Going all the way...!
1 MB, fully associative, CL = 16 B
[Figure: one single "set" of 64k entries. There is no index (0 bits), so the full 28-bit tag of every entry is compared in parallel (64k comparators), and a 256k:1 multiplexer selects the data. 2 bits identify the word within the cache line, the lowest bits the byte within a word.]
Fully associative
Very expensive; only used for small caches (and sometimes TLBs).
CAM = content-addressable memory: ~a fully-associative cache storing key+data. Provide a key to the CAM and get the associated data.
A combination thereof
1 MB, 2-way, CL = 16 B
[Figure: 32k "sets", indexed by 15 bits; each way stores a valid bit (1), a 13-bit tag and a 128-bit line. Both 13-bit tag comparisons (=, &) run in parallel; "logic" produces the select (1) for the way multiplexer, and 2 more bits pick the word within the line (8:1 mux over the set's 256 bits of data).]
Example in class
Cache size = 2 MB, cache line = 64 B, word size = 8 B (64 bits), 4-way set associative, 32-bit address (byte addressable).
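A worked answer for this kind of exercise can be computed directly (the helper and its name are mine, not from the slides): split the address into tag / set index / byte-offset fields.

```python
import math

def split_address(cache_bytes, line_bytes, ways, addr_bits=32):
    """Field widths for a set-associative cache; all sizes are powers of two."""
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = int(math.log2(line_bytes))   # byte within the cache line
    index_bits = int(math.log2(sets))          # which set
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# The in-class example: 2 MB, 64 B lines, 4-way.
fields = split_address(2 * 1024 * 1024, 64, 4)
```

Here `fields` comes out as 13 tag bits, 13 index bits (8k sets) and 6 offset bits; with `ways=1` the same helper reproduces the earlier 4 kB direct-mapped split of 20/10/2.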
Who to replace? Picking a "victim"
Least-recently used (a.k.a. LRU): considered the "best" algorithm (which is not always true...); only practical up to a limited number of ways.
Not-most-recently used: remember who used each entry last; 8-way -> 3 bits/CL.
Pseudo-LRU: e.g., based on coarse time stamps; used in the VM system.
Random replacement: you can't continuously have "bad luck"...
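True LRU within one set can be sketched with a recency list (a software illustration; real hardware keeps a few bits per line rather than a list, and the helper name is mine):

```python
def make_set(ways=4):
    order = []                       # least recently used first, most recent last

    def access(tag):
        """Return the evicted tag on a replacement, else None."""
        victim = None
        if tag in order:             # hit: just refresh the recency order
            order.remove(tag)
        elif len(order) == ways:     # miss in a full set: evict the LRU way
            victim = order.pop(0)
        order.append(tag)            # tag is now the most recently used
        return victim

    return access

acc = make_set(ways=4)
evictions = [acc(t) for t in ("A", "B", "C", "D", "A", "E")]
```

After touching A again, B becomes the least recently used line, so E's miss evicts B rather than A: exactly the behavior the "best algorithm" claim is about.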
Cache model: random vs. LRU
[Graphs for art (SPEC 2000) and equake (SPEC 2000), comparing random and LRU replacement across cache sizes; which policy wins differs between the two applications.]
4-way sub-blocked cache
1 MB, direct mapped, block = 64 B, sub-block = 16 B
[Figure: 16k entries, indexed by 14 bits; each entry holds a 12-bit tag, four valid bits (one per 16 B sub-block) and 512 bits of data. 2 bits select the sub-block within the block (4:1 logic), 2 bits the word within the sub-block (16:1 mux overall), and the lowest bits the byte within a word.]
Memory overhead: 16/512 = 3%
Pros/Cons Sub-blocking
++ lowers the memory overhead
++ (avoids problems with false sharing -- MP)
++ avoids problems with bandwidth waste
-- will not exploit as much spatial locality
-- still poor utilization of SRAM (fewer sparse "things" allocated)
Replacing dirty cache lines
Write-back: write dirty data back to memory (the next level) at replacement; a "dirty bit" indicates an altered cache line.
Write-through: always write through to the next level (as well); data will never be dirty, so there are no write-backs.
Write buffer / store buffer
You do not need the old value for a store.
[Figure: stores go from the CPU into a write buffer (WB) beside the cache; loads check the write buffer (address comparators, ===) as well as the cache.]
One option: write-around (no write-allocate in the cache), used for lower-level smaller caches.
Innovative cache: Victim cache
CPU
Cacheaddress
datahit Memory
address
data
Victim Cache (VC): a small, fairly associative cache (~10s of entries)Cache lookup: search cache and VC in parallelCache replacement: move victim to the VC and replace in VCVC hit: swap VC data with the corresponding data in Cache“A second life ”
VCaddress
data (a word)
hit
Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 94
AVDARK2012
Skewed associative cache
A, B and C have a three-way conflict.
[Figure: in a 2-way cache all three map to the same set; a 4-way cache resolves the conflict; a 2-way skewed cache also resolves it, because the two ways use different index functions.]
It has been shown that 2-way skewed performs roughly the same as a 4-way cache.
Skewed-associative cache: different indexing functions
[Figure: instead of one shared index field, the two ways are indexed by different hash functions f1 and f2 of the address bits (>18 bits in, 17 bits out). Tags are compared (=, &) as before, and a 2:1 mux driven by the hit "logic" selects the hitting way's 32-bit data.]
UART: Elbow cache
Increase "associativity" when needed

Performs roughly the same as an 8-way cache, is slightly faster, and uses much less power!!
[Figure: A and B fit in a 2-way skewed cache; when C causes a severe conflict, an existing entry is moved to its alternative location to make room.]
Topology of caches: Harvard Architecture

The CPU needs a new instruction each cycle; ~25% of instructions are LD/ST; data and instructions have different access patterns.
==> Separate D and I first-level caches
==> Unified 2nd- and 3rd-level caches
Cache Hierarchy of Today

CPU
L1 I$ and L1 D$ (on-chip): small enough to keep up with the CPU speed; a separate I cache allows instruction fetch and data fetch in parallel.
L2$ (on-chip)
L3$: built from whatever transistors are left for the Last-Level Cache (LLC); off-chip SRAM (today more commonly on-chip).
DRAM Memory
HW prefetching

...a little green man that anticipates your next memory access and prefetches the data to the cache. Improves MLP!

Sequential prefetching: sequential streams [to a page]. Some number of prefetch streams supported. Often only for L2 and L3.
PC-based prefetching: detects strides from the same PC. Often also for L1.
Adjacent prefetching: on a miss, also bring in the "next" cache line. Often only for L2 and L3.
Hardware prefetching

A hardware "monitor" looks for patterns in the memory accesses and brings data of anticipated future accesses into the cache prior to their usage.
Two major types:
  Sequential prefetching (typically page-based, 2nd-level cache and higher): detects sequential cache lines missing in the cache.
  PC-based prefetching, integrated with the pipeline: finds per-PC strides; can find more complicated patterns.
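A PC-based stride prefetcher can be sketched like this (illustrative Python; the table format and the "stride seen twice" confirmation rule are simplified assumptions, not a real design):

```python
class StridePrefetcher:
    def __init__(self):
        self.table = {}          # pc -> (last_addr, stride)
        self.prefetched = []     # addresses pushed toward the cache

    def access(self, pc, addr):
        if pc in self.table:
            last, stride = self.table[pc]
            new_stride = addr - last
            if new_stride == stride and stride != 0:
                # same stride seen twice in a row: fetch one line ahead
                self.prefetched.append(addr + stride)
            self.table[pc] = (addr, new_stride)
        else:
            self.table[pc] = (addr, 0)

p = StridePrefetcher()
for i in range(4):   # one load PC striding 64 bytes per iteration
    p.access(pc=0x400, addr=0x1000 + 64 * i)
```

Once the 64-byte stride is confirmed, the accesses at 0x1080 and 0x10c0 each trigger a prefetch of the following line.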
Data Cache Capacity/Latency/BW
Cache implementation

SRAM: caches at all levels roughly work like this:
[Figure: a generic cache. The address (Addr [63..0]) splits into a tag (MSB side) and an index (LSB side). The index selects a set; each entry in the set holds an address tag (AT), state bits (S) and a 64B cacheline of data. All ways' tags are compared (=) in parallel, and a hit drives the way select ("Sel way", here way 6) of a mux that delivers the 64B of data. Example hierarchy: I1 and D1 caches of 64kB, L2 of 256kB, L3 of 24MB.]
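The tag/index split and the parallel tag compare can be sketched in Python (sizes are illustrative: with 64B lines, 512 sets and 8 ways this models a 256kB cache; tags only, no data array):

```python
LINE_BITS = 6   # 64-byte cache line -> 6 offset bits
SET_BITS = 9    # 512 sets
WAYS = 8        # 512 sets * 8 ways * 64B = 256kB

def split(addr):
    offset = addr & ((1 << LINE_BITS) - 1)
    index = (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (LINE_BITS + SET_BITS)
    return tag, index, offset

def lookup(cache, addr):
    tag, index, _ = split(addr)
    for way, (valid, way_tag) in enumerate(cache[index]):
        if valid and way_tag == tag:   # the parallel '=' comparators
            return way                 # hit: this way's data is selected
    return None                        # miss

cache = [[(False, 0)] * WAYS for _ in range(1 << SET_BITS)]
tag, index, _ = split(0xDEADBEEF)
cache[index][3] = (True, tag)          # pretend way 3 holds this line
```

A lookup of 0xDEADBEEF now returns way 3 (the mux would select that way's 64B of data), while any address with a different tag misses.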
Address Book Analogy
Two names per page: index first, then search.

[Figure: an address book with tabs T, U, V, X, Y, Z, Å, Ä, Ö. Looking up "TOMMY": the first letter (T) is the index that selects a page; the rest of the name is the tag, compared (EQ?) against the page's two entries, "OMAS" (12345) and "OMMY" (23457).]
Select the second entry!
Cache lingo

Cacheline: data chunk moved to/from a cache.
Cache set: fraction of the cache identified by the index.
Associativity: number of alternative storage places for a cacheline.
Replacement policy: picking the victim to throw out from a set (LRU/Random/Nehalem).
Temporal locality: likelihood of accessing the same data again soon.
Spatial locality: likelihood of accessing nearby data again soon.

Typical access pattern (inner loop stepping through an array):
A, B, C, A+4, B, C, A+8, B, C, ...
Here B and C show temporal locality, and A, A+4, A+8 show spatial locality.
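The pattern can be made concrete by counting the cache lines it touches (illustrative Python with made-up addresses):

```python
def touched_lines(byte_addrs, line_size=64):
    # distinct cache lines a sequence of byte addresses maps to
    return {addr // line_size for addr in byte_addrs}

A, B, C = 0x1000, 0x2000, 0x3000        # A is an array; B, C are scalars
trace = []
for i in range(16):                     # A, B, C, A+4, B, C, A+8, ...
    trace += [A + 4 * i, B, C]

lines = touched_lines(trace)            # only 3 distinct 64B lines
```

The 48 accesses touch only three cache lines: the 16 words of A share one line (spatial locality), while B and C hit the same two lines over and over (temporal locality).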
Cache Lingo Picture

[Figure: the generic cache again, annotated with the lingo: the address (Addr [63..0]) splits into tag and index; the row of entries (AT, S, 64B data) selected by the index is a cache set; eight such entries per set gives associativity = 8-way; each 64B data chunk is a cacheline.]
Nehalem i7 (one example)

L1: 32kB, 8-way, pLRU
L2: 256kB, 8-way, pLRU, non-inclusive
L3: 8MB, 16-way, Nehalem replacement, non-inclusive
Take-away message: Caches

Caches are fast but small.
Cache space is allocated in cache-line chunks (e.g., 64 bytes).
The LSB part of the address is used to find the "cache set" (aka indexing).
There is a limited number of cache lines per set (associativity).
There are typically several levels of caches.
Caches are the most important target for optimizations.
How are we doing?

Increase clock frequency
Create and exploit locality:
  a) Spatial locality
  b) Temporal locality
  c) Geographical locality
Create and exploit parallelism:
  a) Instruction-level parallelism (ILP)
  b) Thread-level parallelism (TLP)
  c) Memory-level parallelism (MLP)
Speculative execution:
  a) Out-of-order execution
  b) Branch prediction
  c) Prefetching
Goals for this course

Understand how and why modern computer systems are designed the way they are:
  pipelines, memory organization, virtual/physical memory, ...
Understand how and why multiprocessors are built:
  cache coherence, memory models, synchronization, ...
Understand how and why parallelism is created and leveraged:
  instruction-level parallelism, memory-level parallelism, thread-level parallelism, ...
Understand how and why multiprocessors of combined SIMD/MIMD type are built:
  GPUs, vector processing, ...
Understand how computer systems are adapted to different usage areas:
  general-purpose processors, embedded/network processors, ...
Understand the physical limitations of modern computers:
  bandwidth, energy, cooling, ...