Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years...

109
Welcome to AVDARK Erik Hagersten Uppsala University

Transcript of Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years...

Page 1: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Welcome to AVDARK

Erik HagerstenUppsala University

Page 2: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 2

AVDARK2012

Ericsson, CPU designer 1982-84: APZ212 MIT 1984-85: Dataflow parallel architecture Ericsson computer science lab 1985-1988 NetInsight, (Erlang) SICS, parallel architectures 1988-1993 COMA, (Simics & Virtutech) Sun Microsystems, chief architect servers 1993 – 1999

WildFire, E6000, E15000, E25000 Professor Uppsala University 1999 – New modeling + Acumem Startup Acumem 2006 – 2010 ThreadSpotter Chief scientist Rogue Wave Software 2011 –

Page 3: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 3

AVDARK2012

Goal for this course Understand how and why modern computer systems are designed the way the are:

pipelines memory organization virtual/physical memory ...

Understand how and why multiprocessors are built Cache coherence Memory models Synchronization…

Understand how and why parallelism is created and leveraged Instruction-level parallelism Memory-level parallelism Thread-level parallelism…

Understand how and why multiprocessors of combined SIMD/MIMD type are built GPU Vector processing…

Understand how computer systems are adopted to different usage areas General-purpose processors Embedded/network processors…

Understand the physical limitation of modern computers Bandwidth Energy Cooling…

Page 4: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 4

AVDARK2012

Literature Computer Architecture A Quantitative Approach (4th/5th edition)John Hennesey & David Pattersson

Lecturer Erik Hagersten gives most lectures and is responsible for the courseAndreas Sembrant and Jonas Flodin are responsible for the labs and the hand-ins.Sverker Holmgren will teach parallel programming.David Black-Schaffer will teach about graphics processors.

Mandatory Assignment

There are four lab assignments that all participants have to complete before a hard deadline. Each can earn you a bonus point

Optional Assignment

There are four (optional) hand-in assignments. Each can earn you a bonus point

Examination Written exam at the end of the course. No books are allowed.

Bonus system 64p max/32p to pass. For each bonus point, there is a corresponding question 4p bonus question. Full bonus Pass.

AVDARK in a nutshell

Page 5: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 5

AVDARK2012

Exam and bonus structure 4 Mandatory labs (mandatory) 4 Hand-in (optional) Written Exam

How to get a bonus point: Complete extra bonus activity at lab occation Complete optional bonus hand-in [with a

reasonable accuracy] before a hard deadline

32p/64p at the exam = PASS

Page 6: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 6

AVDARK2012

Schedule in a nutshell: 5 Batches1. Memory systems

Caches, VM, DRAM, microbenchmarks, optimizing SW

2. MultiprocessorsTLP: coherence, memory models, interconnects, scalability, clusters, …

3. Scalable multiprocessorsScalability, synchronization, clusters, …

4. CPUsILP: pipelines, scheduling, superscalars, VLIWs, Vector instructions…

5. Widening the viewTechnology impact, GPUs, multicores, future trends

Page 7: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 7

AVDARK2012

AVDARK on the Webwww.it.uu.se/edu/course/homepage/avdark/ht12

Welcome! News FAQ Schedule Slides New Papers Assignments Reading instructions Exam

Page 8: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Crash Course inComputer Architecture

(covering the course in 45 min)

Erik HagerstenUppsala University

Page 9: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 9

AVDARK2012

≈30 years ago: APZ 212 @ 5MHz”the AXE supercomputer”

Page 10: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 10

AVDARK2012

APZ 212marketing brochure quotes:

”Very compact” 6 times the performance 1/6:th the size 1/5 the power consumption

”A breakthrough in computer science” ”Why more CPU power?” ”All the power needed for future development” ”…800,000 BHCA, should that ever be needed” ”SPC computer science at its most elegance” ”Using 64 kbit memory chips” ”1500W power consumption

Page 11: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 11

AVDARK2012

CPU Improvements

1970 1980 1990 2000 Year

Relative Performance[log scale]

1000

100

10

1

Page 12: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 12

AVDARK2012

How to get efficient architectures…

Increase clock frequency Create and explore locality:

a) Spatial locality b) Temporal localityc) Geographical locality

Create and explore parallelisma) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)c) Memory level parallelism (MLP)

Speculative executiona) Out-of-order executionb) Branch predictionc) Prefetching

Very hard today

Page 13: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 13

AVDARK2012

MemoryAccesses

Three Regs:Source1Source2Destination

Load/Store architecture (e.g., ”RISC”)ALU ops: Reg -->RegMem ops: Reg <--> Mem

...54321

Mem

ExplicitRegisters

ALU

Load/Store

Example: C = A + BLD R1, [A]LD R3, [B]ADD R2, R1, R3ST R2, [C]

Compiler

Page 14: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 14

AVDARK2012

Lifting the CPU hood (simplified…)

DCBA

CPU

Mem

Instructions:

Page 15: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 15

AVDARK2012

Pipeline

DCBA

Mem

Instructions:

I R X WRegs

Page 16: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 16

AVDARK2012

Pipeline

A

Mem

I R X WRegs

Page 17: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 17

AVDARK2012

Pipeline

A

Mem

I R X WRegs

Page 18: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 18

AVDARK2012

Pipeline

A

Mem

I R X WRegs

Page 19: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 19

AVDARK2012

Pipeline:

A

Mem

I R X WRegs

I = Instruction fetchR = Read registerX = ExecuteW= Write register/mem

Processor ”state”

Pipeline stages

Memory fordata and instr.

Page 20: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 20

AVDARK2012

Register Operations [aka ALU operation]

ADD R1, R2, R3a.k.a. R1 := R2 op R3

A

Mem

I R X WRegs

23 1

OPe.g., +, -, *, /Ifetch

Page 21: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 21

AVDARK2012

Load Operation:

LD R1, mem[cnst+R2]

A

Mem

I R X WRegs

1

Ifetch

2

+

Page 22: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 22

AVDARK2012

Store Operation:

ST R2, mem[cnst+R1]

A

Mem

I R X WRegs

12

Ifetch+

Page 23: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 23

AVDARK2012

Branch Operations:

if (R1 Op Const) GOTO mem[R2]

A

Mem

I R X WRegs

12

Pc

OPIfetch

PC = Program Counter.A special register pointing to the next instruction to execute

Page 24: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 24

AVDARK2012

Initially

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

PC

Page 25: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 25

AVDARK2012

Cycle 1

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

PC

Page 26: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 26

AVDARK2012

Cycle 2

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

PC

Page 27: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 27

AVDARK2012

Cycle 3

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

+

PC

Page 28: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 28

AVDARK2012

Cycle 4

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

+

PC

Page 29: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 29

AVDARK2012

Cycle 5

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

+

PC

A

Page 30: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 30

AVDARK2012

Cycle 6

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

<

PC

A

Page 31: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 31

AVDARK2012

Cycle 7

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

PC

A

Branch: Addr(A) PC

A:

Page 32: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 32

AVDARK2012

Cycle 8

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

PC

Page 33: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 33

AVDARK2012

Data dependency Previous execution example wrong!

DCBA

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

Page 34: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 34

AVDARK2012

Data dependency fix 1: pipeline delays

D

CB

A

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

”Stall”

”Stall”

Page 35: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 35

AVDARK2012

Branch delays

D

BC

A

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

A

9 cycles per iteration of 4 instructions Need longer basic blocks with independent instr.

Branch Next PC

”Stall”

”Stall”

”Stall”

Next PC

”Stall”

”Stall”

PC

Need to find Instruction-Level Parallelism (ILP) to avoid stalls!

Page 36: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 36

AVDARK2012

It is actually a lot worse!Modern CPUs:”superscalars” with ~4 parallel pipelines

Mem

I R X WRegs

I R X W

I R X W

I R X W

+Higher throughput- More complicated architecture- Branch delay more expensive (more instr. missed)- Harder to find ”enough” independent instr. (need 8 instr. between write and use)

Issuelogic

Need to find 4x more ILP

Page 37: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 37

AVDARK2012

It is actually a lot worse! Modern CPUs: ~10-20 stages/pipe

Mem

I RRegs

+Shorter cycletime (higher GHz)- Branch delay even more expensive- Even harder to find ”enough” independent instr.

I R B M MW

I R B M MW

I R B M MW

I

I

I

I

Issuelogic

Page 38: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 38

AVDARK2012

It is actually a lot worse! DRAM access: ~150 CPU cycles

I RRegs

I R B M MW

I R B M MW

I R B M MW

I

I

I

I

Issuelogic

Mem

150 cycles

Page 39: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 39

AVDARK2012

Pipeline delays gets worseD

CB

A

Mem

I R X WRegs

LD RegA, (100 + RegC)

IF RegC < 100 GOTO A

RegB := RegA + 1

RegC := RegC + 1

x151 ”Stall”

x1 ”Stall”

Page 40: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 40

AVDARK2012

Fix 1: Out-of order execution:Improving ILP

LD R1, M(100)ADD R3, R2, R1SUB R5, R6, R7ST R5, M(100)

>=0?

LD ...ADD ...SUB ...ST ...

YN

The HW may execute instructions ina different order, but will make the ”side-effects” of the instructions appear in order.

Assume that LD takes a long time.The ADD is dependent on the LD Start the SUB and ST before the ADDUpdate R5 and M(100) after R3

Page 41: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 41

AVDARK2012

Fix 2: Branch prediction

LD R1, M(100)ADD R3, R2, R1SUB R5, R6, R7ST R5, M(100)

>=0?

LD ...ADD ...SUB ...ST ...

YN

The HW can guess if the branch istaken or not and avoid branch stallsif the guess is correct,

Assume the guess is ”Y”.

The HW can start to execute theseinstruction before the outcome the the branch is known, but cannot allow any ”side-effect” to take place until the outcome is known

Page 42: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 42

AVDARK2012

Fix 3: Scheduling Past BranchesImproving ILP

LDADDSUBST

>=0?LDADDSUBST

>1?

LDADDSUBST

<2?

LDADDSUBST

=0?

Y

Y

Y

”Predict taken”

”Predict taken”

”Predict taken”

All instructions along thepredicted path can be executedout-of-order

Predicted path

Page 43: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 43

AVDARK2012

LDADDSUBST

>=0?LDADDSUBST

>1?

LDADDSUBST

<2?

LDADDSUBST

=0?

Y

Y

Y

Wrong Prediction!!!

Throw awayi.e., no sideeffects!

Actual path!

Fix 4: Scheduling Past BranchesImproving ILP

Page 44: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 44

AVDARK2012

Fix 5: Use a cache

Mem

I RRegs

B M MW

I R B M MW

I R B M MW

I R B M MW

I

I

I

I

Issuelogic

150cycles 1GB

64kB$1-10 cycles

Page 45: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 45

AVDARK2012

How to get efficient architectures…

Increase clock frequency Create and explore locality:

a) Spatial locality b) Temporal localityc) Geographical locality

Create and explore parallelisma) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)c) Memory level parallelism (MLP)

Speculative executiona) Out-of-order executionb) Branch predictionc) Prefetching

Page 46: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 46

AVDARK2012

Woops, using too much power 2007

Running at 2x the frequency will use much more than 2x the power

It is also really hard to find enough ILP Speculation results in a fair amount of

wasted work

Page 47: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 47

AVDARK2012

Fix: Multicore

Mem

CPU

$1

CPU

$1

CPU

$1

CPU

$1

L2$

Mem I/F

ExternalI/F

t treads

But now we also need to find Thread-Level Parallelism (TLP)

Page 48: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 48

AVDARK2012

Example: Intel i7 ”Nehalem”

DRAM

Coherence

Page 49: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 49

AVDARK2012

What is computer architecture?

“Bridging the gap between programs and transistors”

“Finding the best model to execute the programs”

best={fast, cheap, energy-efficient, reliable, predictable, …}

Page 50: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Caches and more cachesor

spam, spam, spam and spam

Erik HagerstenUppsala University, Sweden

[email protected]

Page 51: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 51

AVDARK2012

Fix 5: Use a cache

Mem

I RRegs

B M MW

I R B M MW

I R B M MW

I R B M MW

I

I

I

I

Issuelogic

200cycles 1GB

~32kB$~1 cycles

Page 52: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 52

AVDARK2012

Webster about “cache”1. cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press] together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel - more at COGENT 1a: ahiding place esp. for concealing and preserving provisions or implements 1b: a secure place of storage 2: something hidden or stored in a cache

Page 53: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 53

AVDARK2012

Cache knowledge useful when...

Designing a new computer Writing an optimized program

or compileror operating system …

Implementing software cachingWeb cachesProxies File systems

Page 54: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 54

AVDARK2012

Memory/storage

SRAM DRAM disk

sram

2000: 1ns 1ns 3ns 10ns 150ns 5 000 000ns1kB 64k 4MB 1GB 1 TB

(1982: 200ns 200ns 200ns 10 000 000ns)

Page 55: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 55

AVDARK2012

Address Book CacheLooking for Tommy’s Telephone Number

ÖÄÅZYXVUT

TOMMY 12345

ÖÄÅZYXV

“Address Tag”

One entry per page =>Direct-mapped caches with 28 entries

“Data”

Indexingfunction

Page 56: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 56

AVDARK2012

Address Book CacheLooking for Tommy’s Number

ÖÄÅZYXVUT

OMMY 12345

TOMMY

EQ?

index

Page 57: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 57

AVDARK2012

Address Book CacheLooking for Tomas’ Number

ÖÄÅZYXVUT

OMMY 12345

TOMAS

EQ?

index

Miss!Lookup Tomas’ number inthe telephone directory

Page 58: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 58

AVDARK2012

Address Book CacheLooking for Tomas’ Number

ZYXVUT

OMMY 12345

TOMASindex

Replace TOMMY’s datawith TOMAS’ data. There is no other choice(direct mapped)

OMAS 23457

ÖÄÅ

Page 59: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 59

AVDARK2012

Cache

CPUCache

address

data (a word)hit

Memory

address

data

Page 60: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 60

AVDARK2012

Cache Organization

Cache

OMAS 23457

TOMAS

index

=

Hit (1)

(1)

(4) (4)

1

Addrtag

&

(1)

Data (5 digits)

Valid

28 entries

Page 61: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 61

AVDARK2012

Cache Organization (really)4kB, direct mapped

index

=

Hit?(1)(32bits = 4 bytes)

1

Addrtag

&

(1)

Data

(1)

Valid

00100110000101001010011010100011

1k entries of 4 bytes each

(?)

(?)(?)

0101001 0010011100101…

32 bit address identifying a byte in memory

OrdinaryMemory

msb lsb

”Byte”

What is a good index

function?

Page 62: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 62

AVDARK2012

Cache Organization4kB, direct mapped

index

=

Hit?(1)(32bits = 4 bytes)

1

Addrtag

&

(1)

Data

(1)

Valid

00100110000101001010011010100011

1k entries of 4 bytes each

(10)

(20) (20)

0101001 0010011100101…

32 bit address Identifies the byte within a word

msb lsb

Mem Overhead:21/32= 66%Latency =SRAM+CMP+AND

Page 63: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 63

AVDARK2012

Cache

CPUCache

address

data (a word)hit

Memory

address

data

Hit: Use the data provided from the cache~Hit: Use data from memory and also store it in

the cache

Page 64: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 64

AVDARK2012

Cache performance parameters Cache “hit rate” [%] Cache “miss rate” [%] (= 1 - hit_rate) Hit time [CPU cycles] Miss time [CPU cycles] Hit bandwidth Miss bandwidth Write strategy ….

Page 65: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 65

AVDARK2012

How to rate architecture performance?

Marketing: Frequency and Number of “cores”…

Architecture “goodness”: CPI = Cycles Per Instruction IPC = Instructions Per Cycle

Benchmarking: SPEC-fp, SPEC-int, … TPC-C, TPC-D, … Varning: Using a unrepresentative benchmark can

be missleading

Page 66: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 66

AVDARK2012

Cache performance exampleAssumption:Infinite bandwidthA perfect 1.0 CyclesPerInstruction (CPI) CPU100% instruction cache hit rate

Total number of cycles =#Instr. * ( (1 - mem_ratio) * 1 +

mem_ratio * avg_mem_accesstime) =

= #Instr * ( (1- mem_ratio) + mem_ratio * (hit_rate * hit_time +

(1 - hit_rate) * miss_time)

CPI = 1 -mem_ratio + mem_ratio * (hit_rate * hit_time +

(1 - hit_rate) * miss_time)

Page 67: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 67

AVDARK2012

Example Numbers

CPI = 1 - mem_ratio + mem_ratio * (hit_rate * hit_time) +mem_ratio * (1 - hit_rate) * miss_time)

mem_ratio = 0.25hit_rate = 0.85hit_time = 3miss_time = 100

CPI = 0.75 + 0.25 * 0.85 * 3 + 0.25 * 0.15 * 100 =

0.75 + 0.64 + 3.75 = 5.14CPU HIT MISS

Page 68: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 68

AVDARK2012

What if ...CPI = 1 -mem_ratio +

mem_ratio * (hit_rate * hit_time) +mem_ratio * (1 - hit_rate) * miss_time)

mem_ratio = 0.25hit_rate = 0.85hit_time = 3 CPU HIT MISSmiss_time = 100 == > 0.75 + 0.64 + 3.75 = 5.14

•Twice as fast CPU ==> 0.37 + 0.64 + 3.75 = 4.77

•Faster memory (70c) ==> 0.75 + 0.64 + 2.62 = 4.01

•Improve hit_rate (0.95) => 0.75 + 0.71 + 1.25 = 2.71

Page 69: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 69

AVDARK2012

How to get more effective caches:

Larger cache (more capacity) Cache block size (larger cache lines) More placement choice (more associativity) Innovative caches (victim, skewed, …) Cache hierarchies (L1, L2, L3, CMR) Latency-hiding (weaker memory models) Latency-avoiding (prefetching) Cache avoiding (cache bypass) Optimized application/compiler …

Page 70: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 70

AVDARK2012

Why do you miss in a cache

Mark Hill’s three “Cs” Compulsory miss (touching data for the first time) Capacity miss (the cache is too small) Conflict misses (non-ideal cache implementation)

(too many names starting with “H”) (Multiprocessors)

Communication (imposed by communication) False sharing (side-effect from large cache blocks)

Page 71: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 71

AVDARK2012

Avoiding Capacity Misses –a huge address bookLots of pages. One entry per page.

Ö

Ä

Å

Z

Y

X

ÖV

ÖU

ÖT

LING 12345

ÖÖ

ÖÄ

ÖÅ

ÖZ

ÖY

ÖX

“Address Tag”

One entry per page =>Direct-mapped caches with 282 entries 784 entries

“Data”

NewIndexingfunction

Page 72: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 72

AVDARK2012

Cache Organization1MB, direct mapped

index

=

Hit?(1)

(32)

1

Addrtag

&

(1)

Data

(1)

Valid

00100110000101001010011010100011

256k entries

(18)

(12) (12)

0101001 0010011100101

32 bit address Identifies the byte within a word

msb lsb

Mem Overhead:13/32= 40%

Latency =SRAM+CMP+AND

Page 73: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 73

AVDARK2012

Pros/Cons Large Caches++ The safest way to get improved hit rate-- SRAMs are very expensive!!-- Larger size ==> slower speed

more load on “signals”longer distances

-- (power consumption)-- (reliability)

Page 74: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 74

AVDARK2012

Why do you hit in a cache? Temporal locality

Likely to access the same data again soon Spatial locality

Likely to access nearby data again soonTypical access pattern:

(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, ...

temporal spatial

Page 75: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 75

AVDARK2012

(32)Data

Multiplexer(4:1 mux)

Identifies the word within a cache line

(2)Select

Identifies a byte within a word

Fetch more than a word: cache blocks(a.k.a cache line)1MB, direct mapped, CacheLine=16B

1

00100110000101001010011010100011

64k entries

0101001 0010011100101(16)index

0010011100101 0010011100101 0010011100101

=

Hit?(1)&

(1)

(12) (12)(32) (32) (32) (32)

128 bits

msb lsb

Mem Overhead:13/128= 10%

Latency =SRAM+CMP+AND

Page 76: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 76

AVDARK2012

Example in ClassDirect mapped cache: Cache size = 64 kB Cache line = 16 B Word size = 4B 32 bits address (byte addressable)

“There are 10 kinds of people in the world…Those who understand binary number and

those who do not.”

Page 77: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 77

AVDARK2012

Pros/Cons Large Cache Lines++ Explores spatial locality++ Fits well with modern DRAMs

* first DRAM access slow* subsequent accesses fast (“page mode”)

-- Poor usage of SRAM & BW for some patterns-- Higher miss penalty (fix: critical word first)-- (False sharing in multiprocessors)

Cache line size

Perf

Page 78: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 78

AVDARK2012 Thanks: Dr. Erik Berg

UART: StatCache Graph app=matrix multiply

Small cache:Short cache lines are better

Large cachesLonger cache lines are better

Huge caches:Everything fits regardless of CL size

Note: this is just a single example, butthe conclusion typically holds for mostapplications.

Page 79: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 79

AVDARK2012

Cache ConflictsTypical access pattern:

(inner loop stepping through an array) A, B, C, A+1, B, C, A+2, B, C, …

What if B and C index to the same cache locationConflict misses -- big time!Potential performance loss 10-100x

temporal spatial

Page 80: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 80

AVDARK2012

Address Book CacheTwo names per page: index first, then search.

ÖÄÅZYXVUT

OMAS

TOMMY

EQ?

index

12345

23457

OMMY

EQ?

Page 81: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 81

AVDARK2012

How should the select signal be

produced?

Avoiding conflict: More associativity1MB, 2-way set-associative, CL=4B

index

(32)

1

Data

00100110000101001010011010100011

128k “sets”

(17)0101001 0010011100101

Identifies a byte within a word

Multiplexer(2:1 mux)

(32) (32)

1 0101001 0010011100101

Hit?

=

&

(13)

(1)Select

(13)

=

&

“logic”

msb lsb

Latency =SRAM+CMP+AND+LOGIC+MUX

One “set”

Page 82: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 82

AVDARK2012

Pros/Cons Associativity

++ Avoids conflict misses-- Slower access time-- More complex implementation

comparators, muxes, ...-- Requires more pins (for external

SRAM…)

Page 83: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 83

AVDARK2012

Going all the way…!1MB, fully associative, CL=16B

index

=

Hit? 4B

1

&

Data

00100110000101001010011010100011One “set”

(0)

(28)0101001 0010011100101

Identifies a byte within a word

Multiplexer(256k:1 mux)

Identifies the word within a cache line

Select (16)

(13)

16B 16B

1 0101001 0010011100101

=

&

“logic”

(2)1 0101001 0010011100101 1

=

&

=

&

16B...

64kcomparators

Page 84: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 84

AVDARK2012

Fully Associative Very expensive Only used for small caches (and

sometimes TLBs)

CAM = Contents-addressable memory~Fully-associative cache storing key+dataProvide key to CAM and get the associated

data

Page 85: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 85

AVDARK2012

A combination thereof1MB, 2-way, CL=16B

index

=

Hit? (32)

1

&

Data

001001100001010010100110101000110

32k “sets”

(15)

(13)0101001 0010011100101

Identifies a byte within a word

Multiplexer(8:1 mux)

Identifies the word within a cache line

(1)Select

(13)

(128) (128)

1 0101001 0010011100101

=

&

“logic”(2)

(256)

msb lsb

Page 86: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 86

AVDARK2012

Example in Class Cache size = 2 MB Cache line = 64 B Word size = 8B (64 bits) 4-way set associative 32 bits address (byte addressable)

Page 87: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 87

AVDARK2012

Who to replace?Picking a “victim”

Least-recently used (aka LRU) Considered the “best” algorithm (which is not

always true…) Only practical up to limited number of ways

Not most recently used Remember who used it last: 8-way -> 3 bits/CL

Pseudo-LRU E.g., based on course time stamps. Used in the VM system

Random replacement Can’t continuously to have “bad luck...

Page 88: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 88

AVDARK2012

Cache Model: Random vs. LRU

Random

LRU

LRU

Random

art (SPEC 2000) equake (SPEC 2000)

Page 89: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 89

AVDARK2012

4-way sub-blocked cache1MB, direct mapped, Block=64B, sub-block=16B

index

=

Hit?

(4)

(32)

1

&

Data

00100110000101001010011010100011

16k

(14)

(12)

0101001 0010011100101

Identifies a byte within a word

0010011100101

16:1 mux

Identifies the word within a cache line

(2)

(12)(128) (128) (128) (128)

512 bits

0 1 0

& & &

logic4:1 mux(2)

Sub block within a block

msb lsb

Mem Overhead:16/512= 3%

Page 90: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 90

AVDARK2012

Pros/Cons Sub-blocking++ Lowers the memory overhead++ (Avoids problems with false sharing -- MP)++ Avoids problems with bandwidth waste -- Will not explore as much spatial locality-- Still poor utilization of SRAM -- Fewer sparse “things” allocated

Page 91: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 91

AVDARK2012

Replacing dirty cache lines Write-back

Write dirty data back to memory (next level) at replacement

A “dirty bit” indicates an altered cache line Write-through

Always write through to the next level (as well) data will never be dirty no write-backs

Page 92: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 92

AVDARK2012

Write Buffer/Store Buffer

Do not need the old value for a store

One option: Write around (no write allocate in caches) used for lower level smaller caches

CPU

cache

WB:

stores loads

===

Page 93: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 93

AVDARK2012

Innovative cache: Victim cache

CPU

Cacheaddress

datahit Memory

address

data

Victim Cache (VC): a small, fairly associative cache (~10s of entries)Cache lookup: search cache and VC in parallelCache replacement: move victim to the VC and replace in VCVC hit: swap VC data with the corresponding data in Cache“A second life ”

VCaddress

data (a word)

hit

Page 94: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 94

AVDARK2012

Skewed Associative CacheA, B and C have a three-way conflict

2-wayABC

4-wayABC

It has been shown that 2-way skewed performs roughly the same as 4-way caches

2-way skewedABC

Page 95: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 95

AVDARK2012

Skewed-associative cache:Different indexing functions

index

=

1

&

00100110000101001010011010100011

128k entries

(17)

(13)

0101001 0010011100101

32 bit address Identifies the byte within a word

1 0101001 0010011100101

f1

f2

(>18)

(>18) (17)

msb lsb

2:1mux

=

&

function

(32) (32)

(32)

Page 96: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 96

AVDARK2012

UART: Elbow cacheIncrease “associativity” when needed

Performs roughly the same as an 8-way cacheSlightly fasterUses much less power!!

AB

If severe conflict:make room

ABC

Conflict!!ABC

Page 97: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 97

AVDARK2012

Topology of caches: Harvard Arch

CPU needs a new instruction each cycle 25% of instruction LD/ST Data and Instr. have different access patterns==> Separate D and I first level cache==> Unified 2nd and 3rd level caches

Page 98: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 98

AVDARK2012

Cache Hierarchy of Today

CPU

L1 D$on-chip

L2$on-chip

L3$

DRAM Memory

L1 I$on-chip

Small enough to keep up with the CPU speed

Use whatever transistors there i left for the Last-Level Cache (LLC)

Off-chip SRAM (today more commonly on-chip)

Separate I cache to allow for instruction fetch and data fetch in parallel

Page 99: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 99

AVDARK2012

HW prefetching ...a little green man that anticipates your next

memory access and prefetches the data to the cache.

Improves MLP! Sequential prefetching: Sequential streams

[to a page]. Some number of prefetch streams supported. Often only for L2 and L3.

PC-based prefetching: Detects strides from the same PC. Often also for L1.

Adjacent prefetching: On a miss, also bring in the “next” cache line. Often only for L2 and L3.

Page 100: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 100

AVDARK2012

Hardware prefetching Hardware ”monitor” looking for patterns in

memory accesses Brings data of anticipated future accesses

into the cache prior to their usage Two major types:

Sequential prefetching (typically page-based, 2nd level cache and higher). Detects sequential cache lines missing in the cache.

PC-based prefetching, integrated with the pipeline. Finds per-PC strides. Can find more complicated patterns.

Page 101: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 101

AVDARK2012

Data Cache Capacity/Latency/BW

Page 102: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 102

AVDARK2012

Cache implementation

L2 $256kB

Generic Cache:

Addr [63..0]MSB LSB

AT S Data = 64B

=

D1 ¢64kB

L3 € 24MB

Cacheline, here 64B:

...

=======

muxHit

Sel way ”6”

Data = 64B

index

SRAM:Caches at all level roughly work like this:

I1 ¢64kB

Page 103: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 103

AVDARK2012

Address Book AnalogyTwo names per page: index first, then search.

ÖÄÅZYXVUT

OMAS

TOMMY

EQ?

index

12345

23457

OMMY

EQ?

Select the second entry!

Page 104: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 104

AVDARK2012

Cache lingo Cacheline: Data chunk move to/from a cache Cache set: Fraction of the cache identified by the index Associativity: Number of alternative storage places for

a cacheline Replacement policy: picking the victim to throw out

from a set (LRU/Random/Nehalem) Temporal locality: Likelihood to access the same data

again soon Spatial locality: Likelihood to access nearby data again

soonTypical access pattern:(inner loop stepping through an array) A, B, C, A+4, B, C, A+8, B, C, ...

temporal spatial

Page 105: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 105

AVDARK2012

Cache Lingo Picture

C

Generic Cache:

Addr [63..0]MSB LSB

AT S Data = 64B

Cacheline, here 64B:

...

muxHit

Sel way ”6”

Data = 64B

SRAM:

Cache Set:

tag index

Associativity= 8-way

Page 106: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 106

AVDARK2012

Nehalem i7 (one example)

32kB8-way pLRU

256kB8-way pLRUnon-incl

8MB16-way Neh.-replnon-incl

Page 107: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 107

AVDARK2012

Take-away message: Caches Caches are fast but small Cache space allocated in cache-line chunks

(e.g., 64bytes) LSB part of the address is used to find the

”cache set” (aka, indexing) There is a limited number of cache lines per

set (associativity) Typically, several levels of caches

The most important target for optimizations

Page 108: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 108

AVDARK2012

How are we doing?

Increase clock frequency Create and explore locality:

a) Spatial locality b) Temporal localityc) Geographical locality

Create and explore parallelisma) Instruction level parallelism (ILP)b) Thread level parallelism (TLP)c) Memory level parallelism (MLP)

Speculative executiona) Out-of-order executionb) Branch predictionc) Prefetching

Page 109: Welcome to AVDARK - Uppsala University computer science lab 1985-1988 NetInsight, ... ≈30 years ago: APZ 212 @ 5MHz ”the AXE supercomputer ... AVDARK 2012 APZ 212

Dept of Information Technology| www.it.uu.se © Erik Hagersten| http://user.it.uu.se/~ehIntro and Caches 109

AVDARK2012

Goals for this course Understand how and why modern computer systems are designed the way the are:

pipelines memory organization virtual/physical memory ...

Understand how and why multiprocessors are built Cache coherence Memory models Synchronization…

Understand how and why parallelism is created and leveraged Instruction-level parallelism Memory-level parallelism Thread-level parallelism…

Understand how and why multiprocessors of combined SIMD/MIMD type are built GPU Vector processing…

Understand how computer systems are adopted to different usage areas General-purpose processors Embedded/network processors…

Understand the physical limitation of modern computers Bandwidth Energy Cooling…