Distributed Operating Systems Cache Coherence - TU...

29
Faculty of Computer Science Institute for System Architecture, Operating Systems Group Distributed Operating Systems Cache Coherence SS2014 Marcus Völp (slides Julian Stecklina, Marcus Völp)

Transcript of Distributed Operating Systems Cache Coherence - TU...

Page 1: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Faculty of Computer Science Institute for System Architecture, Operating Systems Group

Distributed Operating Systems Cache Coherence

SS2014

Marcus Völp (slides Julian Stecklina, Marcus Völp)

Page 2: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

TU Dresden, 5.05.2014 Distributed Operating Systems Slide 2

Concurrent programs

int i; int k;

global variables:

i = 1; if (i > 1) k = 3;

i = i + 1; if (k == 0) k = 4;

||

mov $1, [%i] cmp [%i], $1 jgt end mov $3, [%k] end:

inc [%i] cmp [%k], $0 jne end mov $4, [%k] end:

||

lock;

Page 3: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 3

Symmetric Multiprocessor (SMP)

CPU0

L1

L2

CPU1

L1

L2

CPU2

L1

L2

CPU3

L1

L2

Memory

Processors

Local Caches

Bus or Crossbar

Shared Memory

TU Dresden, 5.05.2014

Page 4: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 4

Chip Multiprocessor (CMP), Multicore

CPU0

L1

L2

CPU1

L1

Memory

CPU2

L1

L2

CPU3

L1

Processors

Local Cache

Bus or Crossbar

Shared Memory

Shared LL-Cache

TU Dresden, 5.05.2014

Page 5: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 5

Symmetric Multithreading (SMT), Hyperthreading

L1

L2

L1

HT0 HT1

HT2 HT3

Memory

L1

L2

L1

HT4 HT5

HT6 HT7

Processors

Local Cache

Bus or Crossbar

Shared Memory

Shared LL-Cache

TU Dresden, 5.05.2014

Page 6: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 6

Non-Uniform Memory Access (NUMA)

CPU0 Memory CPU3 Memory

CPU1 Memory CPU2 Memory

General Interconnect

TU Dresden, 5.05.2014

Page 7: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 7

NUMA Example: Tilera Gx

Source: http://tilera.com

TU Dresden, 5.05.2014

Page 8: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 8

Summary: Memory Organization

• Multiple processors share memory

• Memory access paths through one or more controllers

– UMA (Uniform Memory Access)

– NUMA (Non-Uniform Memory Access)

• Caches / store buffers hold memory content near accessing CPUs.

TU Dresden, 5.05.2014

Page 9: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 9

Cache Coherency

L1

L2

L1

HT0 HT1

HT2 HT3

Memory = =

address tag idx ofs

set

tag RAM

data RAM

TU Dresden, 5.05.2014

Page 10: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 10

Cache Coherency

• Caches lead to multiple copies for the content of a single memory location

• Cache Coherency keeps copies “consistent” – locate all copies

– invalidate/update content

• Write Propagation

writes must eventually become visible to all processors.

• Write Serialization

every processor should see writes to the same location in the same order.

TU Dresden, 5.05.2014

Page 11: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 11

Alternative Definition: SWMR

Single-Writer, Multiple-Reader Invariant For any memory location A, at any given time,

either only a single core may write (or read-modify-write) the content of A

or any number of cores may read the content of A.

Data-Value Invariant The value of a memory location at the start of an

operation is the same as the value at the end of its

last write (read-modify-write) operation.

[based on Sorin et al., 2011]

TU Dresden, 5.05.2014

Page 12: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 12

Attempt 1: write through all caches

CPU0

WT Cache

CPU1

WT Cache

Memory x=0

x=0 x=0 x=1

x=1

CPU0: read x x=0 stored in cache

CPU1: read x x=0 stored in cache

CPU0: write x=1 x=1 stored in cache x=1 stored in memory

CPU1: read x x=0 retrieved from cache Write not visible to CPU1!

TU Dresden, 5.05.2014

Page 13: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 13

Attempt 2: write back

CPU0

WB Cache

CPU1

WB Cache

Memory x=0

x=0 x=0

CPU0: read x x=0 stored in cache

CPU1: read x x=0 stored in cache

CPU0: write x=1 x=1 stored in cache

CPU1: write x=2 x=2 stored in cache

CPU1: writeback x=2 stored in memory

CPU0: writeback x=1 stored in memory

Later store x=2 lost!

x=1 x=2

x=2 x=1

TU Dresden, 5.05.2014

Page 14: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 14

Coherency Problems & Solutions

Both examples violate SWMR!

Problem 1 CPU1 used stale value that had already been modified by CPU0.

– Solution: Invalidate copies before write proceeds!

Problem 2 Incorrect writeback order of modified cache lines.

– Solution: Disallow more than one modified copy!

TU Dresden, 5.05.2014

Page 15: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 15

Coherency Protocol Design Space

• Snooping-based – All coherency related traffic broadcasted to all CPUs

– Each processor snoops and acts accordingly:

• Invalidate lines written by other CPUs

• Signal sharing for lines currently in cache

– Straightforward for bus-based systems

– Suited for small-scale systems

• Directory-based – Uses central directory to track cache line owner

– Update copies in other caches

• Can update all CPUs at once (less traffic for alternating reads and writes)

• Multiple writes need multiple updates (more traffic for subsequent writes)

– Suited for large-scale systems TU Dresden, 5.05.2014

Page 16: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 16

Coherency Protocol Design Space

• Snooping-based vs. Directory-based

Memory

CPU0

L1

CPU1

L1

L2 Snoop Filter

Memory

CPU0

L1

CPU1

L1

L2 Snoop Filter

TU Dresden, 5.05.2014

Page 17: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 17

Invalidation vs. Update Protocols

• Invalidation-based – Only write misses hit bus (suited for WB caches)

– Subsequent writes are write hits

– Good for multiple writes to same cache line by same CPU

• Update-based – All shares of a cache line continue to hit in the cache after

a write by one CPU

– Otherwise lots of useless updates (wastes bandwidth) → Rarely used!

• Hybrid forms are possible!

TU Dresden, 5.05.2014

Page 18: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 18

A Basic Coherency Protocol: MSI

• Modified (M) – No copies on other caches; local copy modifed

– Memory is stale

• Shared (S) – Unmodified copies in one or more caches

– Memory is up-to-date

• Invalid (I) – Not in cache

• States tracked from the view of the cache controller. Sees events from: – Local processor → processor transactions

– Other processors → snoop transactions

TU Dresden, 5.05.2014

Page 19: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 19

MSI: Processor Transitions

• State is I, CPU reads (PrRd) – Generate bus read request (BusRd)

– Go to S

• State is S or M, CPU reads (PrRd) – No transition

• State is S, CPU writes (PrWr) – Upgrade cache line for exclusive ownership (BusRdX)

– Go to M

• State is M, CPU writes (PrWr) – No transition

TU Dresden, 5.05.2014

Page 20: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 20

MSI: Snoop Transitions

• Receiving a read snoop (BusRd) for a cache line – If M, write cache line back to memory (WB), transition to S

– If S, no transition

• Receiving a exclusive ownership snoop (BusRdX) – If M, write cache line back to memory (WB), discard it,

transition to I

– If S, discard cache line, transition to I

TU Dresden, 5.05.2014

Page 21: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 21

MSI State Transitions

I

S

M

Snoop Transitions

Processor Transitions

PrWr → BusRdX

PrWr PrRd

BusRd → WB

BusRd

PrRd

PrRd → BusRd

BusRdX → WB

BusRd BusRdX

BusRdX

PrWr → BusRdX

TU Dresden, 5.05.2014

Page 22: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 22

Problems in MSI

A common usecase is to: – read variable A: S

– Modify A: BusRdX sent, S → M

Invalidation message pointless, if no other cache holds A.

Solved by adding Exclusive (E) state: – No copies exist in other caches

– Memory is up-to-date

Variants of MESI are used by most popular microprocessors.

TU Dresden, 5.05.2014

Page 23: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 23

MESI State Transitions

E

I

S

M

Snoop Transitions

Processor Transitions

PrWr → BusRdX PrWr

PrWr

BusRd → WB

BusRd → HIT

PrRd

PrRd → BusRd (HIT) PrRd → BusRd (!HIT)

BusRdX → WB

BusRd BusRdX

BusRd → HIT

PrWr → BusRdX

BusRdX BusRdX

TU Dresden, 5.05.2014

Page 24: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 24

MOESI: Adding Owned to MESI

• Similar to MESI, with some extensions

• Cache-to-Cache transfers of modified cache lines – Modified cache lines not written back to memory, but

supplied by to other CPUs on BusRd

– CPU  that  had  initial  modified  copy  becomes  “owner” • Avoids writeback to memory when another CPU

accesses cache line – Beneficial when cache-to-cache latency/bandwidth is

better than cache-to-memory latency/bandwidth

• Used by AMD Opteron

TU Dresden, 5.05.2014

Page 25: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 25

MOESI State Transitions

E

I

S

M

Snoop Transitions

Processor Transitions

PrWr → BusRdX

PrWr

PrWr

BusRdX → WB

BusRd → HIT

PrRd

PrRd → BusRd (HIT) PrRd → BusRd (!HIT)

BusRdX → XFER

BusRd BusRdX

BusRd → HIT

PrWr → BusRdX

BusRdX BusRdX

O BusRd → HIT, XFER

PrWr → BusRdX

PrRd

BusRd → HIT, XFER

TU Dresden, 5.05.2014

Page 26: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 26

Coherency in Multi-Level Caches

• Bus only connected to last-level cache (e.g. L2) – Snoop requests are relevant to inner-level caches (e.g. L1)

– Modifications in L1 may not be visible to L2 (and the bus)

• Idea: L2 forwards filters transactions for L1: – On BusRd check if line is M/O in L1 (may be S or E in L2)

– On BusRdX, send invalidate to L1

• Only easy for inclusive caches!

• Inclusion property Outer cache contains a superset of the content of its inner caches.

TU Dresden, 5.05.2014

Page 27: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

Distributed Operating Systems Slide 105

References

� A Primer on Memory Consistency and Cache Coherence Sorin, Hill, Wood; 2011

� atomic<> Weapons: The C++ Memory Model and Modern Hardware (Video) Sutter; 2013

� Shared memory consistency models: a tutorial Adve, Gharachorloo; 1996

� IA Memory Model Richard Hudson; Google Tech Talk 2008

� Memory Ordering in Modern Microprocessors McKenney; Linux Journal 2005

� How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs Lamport, 1979

� PowerPC Storage Model TU Dresden, 5.05.2014

Page 28: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

References

Scheduler-Conscious Synchronization Leonidas Kontothanassis, Robert Wisniewski, Michael Scott Scalable Reader- Writer Synchronization for Shared-Memory Multiprocessors John M. Mellor-Crummey, Michael L. Scottt Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors John Mellor-Crummey, Michael Scott Concurrent Update on Multiprogrammed Shared Memory Multiprocessors Maged M. Michael, Michael L. Scott Scalable Queue-Based Spin Locks with Timeout Michael L. Scott and William N. Scherer III

Page 29: Distributed Operating Systems Cache Coherence - TU Dresdenos.inf.tu-dresden.de/Studium/DOS/SS2014/04-Cache-Coherence.pdf · TU Dresden, 5.05.2014 . Distributed Operating Systems Slide

References

Reactive Synchronization Algorithms for Multiprocessors B. Lim, A. Agarwal

Lock Free Data Structures John D. Valois (PhD Thesis)

Reduction: A Method for Proving Properties of Parallel Programs R. Lipton - Communications of the ACM 1975 Decoupling Contention Management from Scheduling (ASPLOS 2010) F.R. Johnson, R. Stoica, A. Ailamaki, T. Mowry Corey: An Operating System for Many Cores (OSDI 2008) Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, Zheng Zhang