Lecture 7. Multiprocessor and Memory Coherence

Lecture 7. Multiprocessor and Memory Coherence Prof. Taeweon Suh Computer Science Education Korea University COM515 Advanced Computer Architecture


Transcript of Lecture 7. Multiprocessor and Memory Coherence

Page 1: Lecture 7. Multiprocessor and Memory Coherence

Lecture 7. Multiprocessor and Memory Coherence

Prof. Taeweon Suh, Computer Science Education

Korea University

COM515 Advanced Computer Architecture

Page 2: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Memory Hierarchy in a Multiprocessor

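[Figure: three multiprocessor memory organizations. Every processor (P) has a private cache ($).]

• Bus-based shared memory: the caches share one bus to a single memory
• Fully-connected shared memory (Dancehall): the caches reach the memory modules through an interconnection network
• Distributed shared memory: each node has a processor, a cache and local memory, and the nodes are connected by an interconnection network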

Page 3: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Why Cache Coherency?

• The closest cache level is private
• Multiple copies of a cache line can be present across different processor nodes
• Local updates (writes) lead to an incoherent state
  - The problem appears in both write-through and writeback caches

Slide from Prof. H.H. Lee in Georgia Tech

Page 4: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Writeback Cache w/o Coherence

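[Figure: three processors with private writeback caches above a shared memory holding X = 100. Two caches hold X = 100. One processor writes X = 505 into its own cache only; memory and the other cache still hold X = 100. What do the other processors read?]

Slide from Prof. H.H. Lee in Georgia Tech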

Page 5: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Writethrough Cache w/o Coherence

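[Figure: three processors with private write-through caches above a shared memory. Two caches hold X = 100. One processor writes X = 505; its cache and memory both become X = 505, but the other cached copy is still X = 100. What does that processor read?]

Slide from Prof. H.H. Lee in Georgia Tech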
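The two "without coherence" slides can be replayed with a minimal sketch. The class `PrivateCache` and the function `run` are hypothetical names introduced here for illustration; the model has no snooping at all, which is exactly the assumption behind the stale reads.

```python
class PrivateCache:
    """A private cache with no coherence support (hypothetical model)."""

    def __init__(self, memory, write_through):
        self.memory = memory              # shared dict acting as main memory
        self.write_through = write_through
        self.lines = {}                   # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch the line from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]           # hit: the local copy may be stale

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:            # write-through also updates memory
            self.memory[addr] = value


def run(write_through):
    memory = {"X": 100}
    p1, p2, p3 = (PrivateCache(memory, write_through) for _ in range(3))
    p1.read("X")                          # P1 caches X = 100
    p3.read("X")                          # P3 caches X = 100
    p3.write("X", 505)                    # P3 updates X locally
    return p1.read("X"), p2.read("X"), memory["X"]


print("writeback   :", run(write_through=False))
print("writethrough:", run(write_through=True))
```

In the writeback run nobody but the writer sees 505. In the write-through run a processor that misses gets 505 from memory, but P1 still hits on its stale copy of 100.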

Page 6: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Definition of Coherence

• A multiprocessor memory system is coherent if the results of any execution of a program can be reconstructed by a hypothetical serial order

• Implicit definition of coherence
  - Write propagation
    • Writes are visible to other processes
  - Write serialization
    • All writes to the same location are seen in the same order by all processes
    • For example, if read operations by P1 to a location see the value produced by write w1 (say, from P2) before the value produced by write w2 (say, from P3), then reads by another process P4 (or P2 or P3) also should not be able to see w2 before w1

Slide from Prof. H.H. Lee in Georgia Tech

Page 7: Lecture 7. Multiprocessor and Memory Coherence

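Korea Univ

Sounds Easy?

[Figure: timeline for four processors, P0 to P3, with A=0 and B=0 initially. At T1, P0 writes A=1 and P1 writes B=2. The updates reach the other processors at different times (T2, T3), so one processor sees A’s update before B’s while another sees B’s update before A’s.]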

Page 8: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Cache Coherence Protocols According to Caching Policies

• Write-through cache
  - Update-based protocol
  - Invalidation-based protocol
• Writeback cache
  - Update-based protocol
  - Invalidation-based protocol

Page 9: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Bus Snooping based on Write-Through Cache

• All the writes will be shown as a transaction on the shared bus to memory
• Two protocols
  - Update-based Protocol
  - Invalidation-based Protocol

Slide from Prof. H.H. Lee in Georgia Tech

Page 10: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

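Bus Snooping
• Update-based Protocol on Write-Through cache

[Figure: memory and one other cache hold X = 100. A processor writes X = 505; the write appears as a bus transaction that updates memory to X = 505. The other cache snoops the bus and updates its copy from X = 100 to X = 505.]

Slide from Prof. H.H. Lee in Georgia Tech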

Page 11: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

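Bus Snooping
• Invalidation-based Protocol on Write-Through cache

[Figure: memory and one other cache hold X = 100. A processor writes X = 505; the write appears as a bus transaction that updates memory to X = 505. The other cache snoops the bus and invalidates its X = 100 copy. A later Load X by that processor fetches X = 505.]

Slide from Prof. H.H. Lee in Georgia Tech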

Page 12: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

A Simple Snoopy Coherence Protocol for a WT, No Write-Allocate Cache

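[State diagram: two states, Invalid and Valid. Edge labels read Observed / Transaction.]

Processor-initiated transactions
• Invalid to Valid: PrRd / BusRd
• Valid to Valid: PrRd / ---
• Valid to Valid: PrWr / BusWr
• Invalid to Invalid: PrWr / BusWr

Bus-snooper-initiated transactions
• Valid to Invalid: BusWr / ---

Slide from Prof. H.H. Lee in Georgia Tech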
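The two-state diagram above can be written out as a small simulation. This is a sketch under assumptions: `WTCache` and `Bus` are hypothetical names, the bus is modeled as direct method calls, and timing and arbitration are ignored.

```python
class Bus:
    """Shared bus in front of memory; every BusWr is snooped by the other caches."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.log = []

    def bus_rd(self, addr):
        self.log.append(("BusRd", addr))
        return self.memory[addr]

    def bus_wr(self, writer, addr, value):
        self.log.append(("BusWr", addr))
        self.memory[addr] = value
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_bus_wr(addr)


class WTCache:
    """Valid/Invalid snoopy cache: write-through, no write-allocate."""

    def __init__(self, bus):
        self.data = {}                        # present address == Valid state
        self.bus = bus
        bus.caches.append(self)

    def pr_rd(self, addr):
        if addr not in self.data:             # Invalid: PrRd / BusRd
            self.data[addr] = self.bus.bus_rd(addr)
        return self.data[addr]                # Valid: PrRd / ---

    def pr_wr(self, addr, value):
        if addr in self.data:                 # Valid: update the local copy too
            self.data[addr] = value
        self.bus.bus_wr(self, addr, value)    # PrWr / BusWr in both states

    def snoop_bus_wr(self, addr):
        self.data.pop(addr, None)             # BusWr / ---: Valid becomes Invalid


bus = Bus({"X": 100})
p1, p2 = WTCache(bus), WTCache(bus)
p1.pr_rd("X")                # P1: Invalid to Valid
p2.pr_wr("X", 505)           # P2 stays Invalid (no write-allocate); P1 invalidates
print(p1.pr_rd("X"))         # P1 misses again and reloads the new value
print(bus.log)
```

Because every write is visible on the bus, a single Valid bit per line is enough for the snooper to keep the copies coherent.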

Page 13: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

How about Writeback Cache?

• WB cache to reduce bandwidth requirement

• The majority of local writes are hidden behind the processor nodes

• How to snoop?

• Write Ordering

Slide from Prof. H.H. Lee in Georgia Tech

Page 14: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Cache Coherence Protocols for WB Caches

• A cache has an exclusive copy of a line if
  - It is the only cache having a valid copy
  - Memory may or may not have it
• Modified (dirty) cache line
  - The cache having the line is the owner of the line, because it must supply the block

Slide from Prof. H.H. Lee in Georgia Tech

Page 15: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Update-based Protocol on WB Cache

[Figure: all three caches hold X = 100. One processor executes Store X (X = 505); a bus transaction updates the other two cached copies to X = 505.]

• Update data for all processor nodes that share the same data
• Because a processor node keeps updating the memory location, a lot of traffic will be incurred

Slide from Prof. H.H. Lee in Georgia Tech

Page 16: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Update-based Protocol on WB Cache

[Figure: all three caches hold X = 505. A Load X by one processor hits in its cache. Another processor executes Store X (X = 333); a bus transaction updates the other two cached copies to X = 333.]

• Update data for all processor nodes that share the same data
• Because a processor node keeps updating the memory location, a lot of traffic will be incurred

Slide from Prof. H.H. Lee in Georgia Tech

Page 17: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Invalidation-based Protocol on WB Cache

• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location

[Figure: all three caches hold X = 100. One processor executes Store X (X = 505); a bus transaction invalidates the other two cached copies.]

Slide from Prof. H.H. Lee in Georgia Tech

Page 18: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Invalidation-based Protocol on WB Cache

[Figure: one cache holds X = 505. Another processor executes Load X and misses; the resulting bus transaction is snooped, hits in the cache holding X = 505, and the requester receives X = 505.]

• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location

Slide from Prof. H.H. Lee in Georgia Tech

Page 19: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Invalidation-based Protocol on WB Cache

[Figure: two caches hold X = 505. One processor executes Store X (X = 333); the bus transaction is snooped and the other copy is invalidated. The same processor then executes Store X twice more (X = 987, X = 444) on its own copy.]

• Invalidate the data copies for the sharing processor nodes
• Reduced traffic when a processor node keeps updating the same memory location

Slide from Prof. H.H. Lee in Georgia Tech

Page 20: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Writeback Invalidation Protocol

• Modified
  - Dirty
  - Only this cache has a valid copy
• Shared
  - Memory is consistent
  - One or more caches have a valid copy
• Invalid
• Writeback protocol: A cache line can be written multiple times before the memory is updated

Slide from Prof. H.H. Lee in Georgia Tech

Page 21: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Writeback Invalidation Protocol

• Two types of request from the processor
  - PrRd
  - PrWr
• Three types of bus transactions posted by the cache controller
  - BusRd
    • PrRd misses the cache
    • Memory or another cache supplies the line
  - BusRd eXclusive (Read-to-own)
    • PrWr is issued to a line which is not in the Modified state
  - BusWB
    • Writeback due to replacement
    • The processor is not directly involved in initiating this operation

Slide from Prof. H.H. Lee in Georgia Tech

Page 22: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

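MSI Writeback Invalidation Protocol (Processor Request)

[State diagram: states Modified, Shared and Invalid.]

Processor-initiated
• Invalid to Shared: PrRd / BusRd
• Invalid to Modified: PrWr / BusRdX
• Shared to Shared: PrRd / ---
• Shared to Modified: PrWr / BusRdX
• Modified to Modified: PrRd / ---
• Modified to Modified: PrWr / ---

Slide from Prof. H.H. Lee in Georgia Tech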

Page 23: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Writeback Invalidation Protocol (Bus Transaction)

• Flush data on the bus
• Both memory and requestor will grab the copy
• The requestor gets data by
  - Cache-to-cache transfer; or
  - Memory

[State diagram: states Modified, Shared and Invalid.]

Bus-snooper-initiated
• Shared to Shared: BusRd / ---
• Shared to Invalid: BusRdX / ---
• Modified to Shared: BusRd / Flush
• Modified to Invalid: BusRdX / Flush

Slide from Prof. H.H. Lee in Georgia Tech

Page 24: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Writeback Invalidation Protocol (Bus Transaction): Another Possible Implementation

[State diagram: states Modified, Shared and Invalid.]

Bus-snooper-initiated
• Shared to Shared: BusRd / ---
• Shared to Invalid: BusRdX / ---
• Modified to Invalid: BusRdX / Flush
• Modified to Invalid: BusRd / Flush (instead of Modified to Shared: BusRd / Flush)

• Anticipate no more reads from this processor
• A performance concern
• Save the “invalidation” trip if the requesting cache writes the shared line later

Slide from Prof. H.H. Lee in Georgia Tech

Page 25: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Writeback Invalidation Protocol

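[State diagram: states Modified, Shared and Invalid, with both kinds of transitions combined.]

Processor-initiated
• Invalid to Shared: PrRd / BusRd
• Invalid to Modified: PrWr / BusRdX
• Shared to Shared: PrRd / ---
• Shared to Modified: PrWr / BusRdX
• Modified to Modified: PrRd / ---
• Modified to Modified: PrWr / ---

Bus-snooper-initiated
• Shared to Shared: BusRd / ---
• Shared to Invalid: BusRdX / ---
• Modified to Shared: BusRd / Flush
• Modified to Invalid: BusRdX / Flush

Slide from Prof. H.H. Lee in Georgia Tech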

Page 26: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Example

[Figure: P1, P2 and P3 with private caches on a shared bus; memory holds X=10. P1 issues BusRd and caches X=10 in state S.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory

Slide from Prof. H.H. Lee in Georgia Tech

Page 27: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Example

[Figure: P1, P2 and P3 with private caches on a shared bus; memory holds X=10. P1 holds X=10 in state S. P3 issues BusRd and also caches X=10 in state S.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory

Slide from Prof. H.H. Lee in Georgia Tech

Page 28: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Example

[Figure: P1, P2 and P3 with private caches on a shared bus; memory holds X=10. P3 issues BusRdX and writes X=-25; P1's copy goes from S to I and P3's copy goes from S to M.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          |

Slide from Prof. H.H. Lee in Georgia Tech

Page 29: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Example

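[Figure: P1, P2 and P3 with private caches on a shared bus. P3 holds X=-25 in state M. P1 issues BusRd; P3 supplies X=-25 and moves from M to S, P1 caches X=-25 in state S, and memory is updated from X=10 to X=-25.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          |
P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache

Slide from Prof. H.H. Lee in Georgia Tech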

Page 30: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MSI Example

[Figure: P1, P2 and P3 with private caches on a shared bus; memory holds X=-25. P1 and P3 hold X=-25 in state S. P2 issues BusRd and caches X=-25 in state S.]

Processor Action | State in P1 | State in P2 | State in P3 | Bus Transaction | Data Supplier
P1 reads X       | S           | ---         | ---         | BusRd           | Memory
P3 reads X       | S           | ---         | S           | BusRd           | Memory
P3 writes X      | I           | ---         | M           | BusRdX          |
P1 reads X       | S           | ---         | S           | BusRd           | P3 Cache
P2 reads X       | S           | S           | S           | BusRd           | Memory

Slide from Prof. H.H. Lee in Georgia Tech
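The five steps of the MSI example can be replayed with a minimal sketch for a single line X. `MSIBus` and its method names are hypothetical. The model prints "I" where the slides print "---" for a cache that never held the line, and it leaves the data supplier empty (`None`) when the writer already holds a Shared copy, matching the blank cell in the table.

```python
class MSIBus:
    """Tiny MSI model for one line X (illustrative, not cycle-accurate)."""

    def __init__(self, names, value):
        self.memory = value
        self.state = {n: "I" for n in names}      # per-cache MSI state
        self.data = {n: None for n in names}
        self.trace = []

    def _snoop(self, requester, transaction):
        """Other caches react to BusRd/BusRdX; returns who supplies the data."""
        supplier = "Memory"
        for n in self.state:
            if n == requester:
                continue
            if self.state[n] == "M":              # BusRd / Flush, BusRdX / Flush
                self.memory = self.data[n]        # memory grabs the flushed copy
                supplier = n + " Cache"
                self.state[n] = "S"
            if transaction == "BusRdX":           # every other copy is invalidated
                self.state[n] = "I"
        return supplier

    def _record(self, action, transaction, supplier):
        self.trace.append((action, tuple(self.state.values()), transaction, supplier))

    def read(self, p):
        if self.state[p] == "I":                  # PrRd / BusRd
            supplier = self._snoop(p, "BusRd")
            self.data[p] = self.memory
            self.state[p] = "S"
            self._record(p + " reads X", "BusRd", supplier)
        return self.data[p]                       # S or M: PrRd / ---

    def write(self, p, value):
        if self.state[p] != "M":                  # PrWr / BusRdX
            had_copy = self.state[p] == "S"
            supplier = self._snoop(p, "BusRdX")
            self.state[p] = "M"
            self._record(p + " writes X", "BusRdX", None if had_copy else supplier)
        self.data[p] = value                      # M: PrWr / ---


bus = MSIBus(["P1", "P2", "P3"], 10)
bus.read("P1")
bus.read("P3")
bus.write("P3", -25)
bus.read("P1")
bus.read("P2")
for row in bus.trace:
    print(row)
```

Each printed row corresponds to one row of the table: the action, the states of P1, P2 and P3 after it, the bus transaction, and the data supplier.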

Page 31: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MESI Writeback Invalidation Protocol

• To reduce two types of unnecessary bus transactions
  - BusRdX that snoops and converts the block from S to M when you are the sole owner of the block
  - BusRd that gets the line in S state when there are no sharers (which leads to the overhead above)
• Introduce the Exclusive state
  - One can write to the copy without generating BusRdX
• Illinois Protocol: Proposed by Papamarcos and Patel in 1984
• Employed in Intel, PowerPC, MIPS

Slide from Prof. H.H. Lee in Georgia Tech

Page 32: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

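MESI Writeback Invalidation Protocol: Processor Request (Illinois Protocol)

[State diagram: states Invalid, Exclusive, Shared and Modified. S: Shared Signal]

Processor-initiated
• Invalid to Exclusive: PrRd / BusRd (not-S)
• Invalid to Shared: PrRd / BusRd (S)
• Invalid to Modified: PrWr / BusRdX
• Exclusive to Exclusive: PrRd / ---
• Exclusive to Modified: PrWr / ---
• Shared to Shared: PrRd / ---
• Shared to Modified: PrWr / BusRdX
• Modified to Modified: PrRd, PrWr / ---

Slide from Prof. H.H. Lee in Georgia Tech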

Page 33: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MESI Writeback Invalidation Protocol: Bus Transactions (Illinois Protocol)

[State diagram: states Invalid, Exclusive, Shared and Modified.]

Bus-snooper-initiated
• Modified to Shared: BusRd / Flush
• Modified to Invalid: BusRdX / Flush
• Exclusive to Shared: BusRd / Flush (or ---)
• Exclusive to Invalid: BusRdX / ---
• Shared to Shared: BusRd / Flush*
• Shared to Invalid: BusRdX / Flush*

Flush*: Flush for data supplier; no action for other sharers

• Whenever possible, the Illinois protocol performs $-to-$ transfer rather than having memory supply the data
• Use a selection algorithm if there are multiple suppliers (Alternative: add an O state or force update memory)
• Most of the MESI implementations simply write to memory

Slide from Prof. H.H. Lee in Georgia Tech

Page 34: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

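MESI Writeback Invalidation Protocol (Illinois Protocol)

[State diagram: states Invalid, Exclusive, Shared and Modified, with both kinds of transitions combined. S: Shared Signal]

Processor-initiated
• Invalid to Exclusive: PrRd / BusRd (not-S)
• Invalid to Shared: PrRd / BusRd (S)
• Invalid to Modified: PrWr / BusRdX
• Exclusive to Exclusive: PrRd / ---
• Exclusive to Modified: PrWr / ---
• Shared to Shared: PrRd / ---
• Shared to Modified: PrWr / BusRdX
• Modified to Modified: PrRd, PrWr / ---

Bus-snooper-initiated
• Modified to Shared: BusRd / Flush
• Modified to Invalid: BusRdX / Flush
• Exclusive to Shared: BusRd / Flush (or ---)
• Exclusive to Invalid: BusRdX / ---
• Shared to Shared: BusRd / Flush*
• Shared to Invalid: BusRdX / Flush*

Flush*: Flush for data supplier; no action for other sharers

Slide from Prof. H.H. Lee in Georgia Tech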
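The benefit of the Exclusive state shows up when one processor reads and then writes a line nobody else shares. The sketch below covers only the processor-initiated transitions. `processor_side` and `count_bus_transactions` are hypothetical helper names, and the shared signal is passed in as a plain boolean.

```python
def processor_side(protocol, state, op, shared_signal=False):
    """Return (next_state, bus_transaction) for one processor request.

    protocol is "MSI" or "MESI"; shared_signal is the S line sampled on BusRd.
    """
    if op == "PrRd":
        if state == "I":
            if protocol == "MESI" and not shared_signal:
                return "E", "BusRd"           # PrRd / BusRd (not-S)
            return "S", "BusRd"               # PrRd / BusRd (S), or any MSI read miss
        return state, None                    # hit in S, E or M: PrRd / ---
    if op == "PrWr":
        if state in ("M", "E"):
            return "M", None                  # PrWr / --- (E upgrades silently)
        return "M", "BusRdX"                  # from I or S: PrWr / BusRdX
    raise ValueError(op)


def count_bus_transactions(protocol, ops):
    """Replay ops on a line that starts Invalid and has no other sharers."""
    state, count = "I", 0
    for op in ops:
        state, transaction = processor_side(protocol, state, op)
        if transaction is not None:
            count += 1
    return state, count


ops = ["PrRd", "PrWr", "PrWr"]
print("MSI :", count_bus_transactions("MSI", ops))
print("MESI:", count_bus_transactions("MESI", ops))
```

Under MSI the sequence costs a BusRd plus a BusRdX. Under MESI the read lands in Exclusive, so the first write needs no bus transaction.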

Page 35: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

MOESI Protocol

• Add one additional state: the Owner state
• Similar to Shared state
• The O state processor will be responsible for supplying data (copy in memory may be stale)
• Employed by
  - Sun UltraSparc
  - AMD Opteron
• In dual-core Opteron, cache-to-cache transfer is done through a system request interface (SRI) running at full CPU speed

[Figure: dual-core Opteron. CPU0 and CPU1, each with an L2 cache, attach to the System Request Interface, which connects through a crossbar to the memory controller and Hyper-Transport.]

Slide from Prof. H.H. Lee in Georgia Tech

Page 36: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Implication on Multi-Level Caches

• How to guarantee coherence in a multi-level cache hierarchy
  - Snoop all cache levels?
  - Intel’s 8870 chipset has a “snoop filter” for quad-core
• Maintaining inclusion property
  - Ensure data in the inner level (e.g. L1) is also present in the outer level (e.g. L2)
  - Only snoop the outermost level (e.g. L2)
  - L2 needs to know L1 has write hits
    • Use Write-Through cache
    • Use Write-back but maintain another “modified-but-stale” bit in L2

Slide from Prof. H.H. Lee in Georgia Tech

Page 37: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Inclusion Property

• Not so easy …
  - Replacement
    • Each cache level observes different access activity (e.g. L2 may replace a line frequently accessed in L1)
    • L1 and L2 are 2-way set associative
    • Blocks m1 and m2 go to the same set in L1 and L2
    • A new block m3 mapped to the same entry replaces m1 in L1 and m2 in L2 due to the LRU scheme
  - Split L1 caches and Unified L2
    • Imagine all caches are direct-mapped.
    • m1 (instruction block) and m2 (data block) mapped to the same entry in L2
  - Different cache line sizes
    • What happens if L1’s block size is smaller than L2’s?

Modified Slide from Prof. H.H. Lee in Georgia Tech
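The LRU replacement case can be reproduced with one 2-way set per level. This is a sketch: `LRUSet` and the access sequence m2, m1, m2, m3 are assumptions chosen so that the L1 hit on m2 is hidden from L2, which is what lets the two LRU orders diverge.

```python
from collections import OrderedDict


class LRUSet:
    """One set of a set-associative cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()           # oldest (LRU) entry first

    def access(self, block):
        """Return True on a hit; on a miss, insert and evict the LRU block."""
        if block in self.blocks:
            self.blocks.move_to_end(block)    # mark as most recently used
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)   # evict the least recently used
        self.blocks[block] = None
        return False


def access(l1, l2, block):
    if not l1.access(block):                  # L2 only observes L1 misses
        l2.access(block)


l1, l2 = LRUSet(ways=2), LRUSet(ways=2)
for block in ["m2", "m1", "m2", "m3"]:
    access(l1, l2, block)

print("L1:", list(l1.blocks))
print("L2:", list(l2.blocks))
print("in L1 but not in L2:", set(l1.blocks) - set(l2.blocks))
```

m3 replaces m1 in L1 and m2 in L2, so m2 ends up cached in L1 without a copy in L2 and inclusion no longer holds.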

Page 38: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Inclusion Property

• Use specific cache configurations to maintain the inclusion property automatically
  - E.g., DM L1 + bigger DM or set-associative L2 with the same cache line size (#sets in L2 ≥ #sets in L1)
• Explicitly propagate L2 action to L1
  - L2 replacement will flush the corresponding L1 line
  - Observed BusRdX bus transaction will invalidate the corresponding L1 line
  - To avoid excess traffic, L2 maintains an Inclusion bit for filtering (to indicate in L1 or not)

Modified Slide from Prof. H.H. Lee in Georgia Tech

Page 39: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Directory-based Coherence Protocol

• Snooping-based protocol
  - N transactions for an N-node MP
  - All caches need to watch every memory request from each processor
  - Not a scalable solution for maintaining coherence in large shared memory systems
• Directory protocol
  - Directory-based control of who has what
  - HW overheads to keep the directory (~ # lines * # processors)

[Figure: four processors with caches and a memory connected by an interconnection network. The directory sits with memory; each entry has a modified bit and presence bits, one for each node.]

Slide from Prof. H.H. Lee in Georgia Tech

Page 40: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Directory-based Coherence Protocol

[Figure: processors with caches and a memory connected by an interconnection network. The directory holds one entry for each memory block C(k), C(k+1), ..., C(k+j).]

• 1 presence bit for each processor, each cache block in memory
• 1 modified bit for each cache block in memory

Slide from Prof. H.H. Lee in Georgia Tech

Page 41: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Directory-based Coherence Protocol (Limited Dir)

[Figure: processors P0, P1, ..., P13, P14, P15 with caches and a memory connected by an interconnection network. Each directory entry holds a modified bit and two encoded processor pointers, each with a flag saying whether the presence encoding is NULL or not.]

• Encoded presence bits (log2N); each cache line can reside in 2 processors in this example
• 1 modified bit for each cache block in memory

Slide from Prof. H.H. Lee in Georgia Tech
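The storage trade-off between the full presence-bit directory and the limited directory can be put into numbers. The per-entry layouts below are assumptions read off the two slides: N presence bits plus one modified bit for the full map, and for the limited directory a log2(N)-bit pointer plus one NULL flag per pointer, plus one modified bit. Function names are hypothetical.

```python
def full_map_bits(num_nodes):
    """Bits per directory entry: one presence bit per node plus one modified bit."""
    return num_nodes + 1


def limited_dir_bits(num_nodes, pointers):
    """Bits per entry: each pointer is an encoded node id plus a NULL flag, plus one modified bit."""
    pointer_bits = (num_nodes - 1).bit_length()   # ceil(log2(N)) for N >= 2
    return pointers * (pointer_bits + 1) + 1


for n in (16, 64, 1024):
    print(n, full_map_bits(n), limited_dir_bits(n, pointers=2))
```

The full map grows linearly with the number of nodes while the limited directory grows logarithmically, at the price of tracking only a fixed number of sharers per line.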

Page 42: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Distributed Directory Coherence Protocol

• Centralized directory is less scalable (contention)
• Distributed shared memory (DSM) for a large MP system
• Interconnection network is no longer a shared bus
• Maintain cache coherence (CC-NUMA)
• Each address has a “home”

[Figure: six nodes, each with a processor, cache, local memory and directory, connected by an interconnection network.]

Slide from Prof. H.H. Lee in Georgia Tech

Page 43: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Stanford DASH

• Stanford DASH: 4 CPUs in each cluster, total 16 clusters (1992)
  - Invalidation-based cache coherence
  - Directory keeps one of 3 states for a cache block at its home node
    • Uncached
    • Shared (unmodified state)
    • Dirty

[Figure: two clusters connected by an interconnection network. In each cluster, processors with caches share a snoop bus with the local memory and directory.]

Modified Slide from Prof. H.H. Lee in Georgia Tech

Page 44: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

DASH Memory Hierarchy (1992)

• Processor Level
• Local Cluster Level
• Home Cluster Level (address is at home)
  - If dirty, needs to get it from the remote node which owns it
• Remote Cluster Level

[Figure: two clusters connected by an interconnection network, each with processors and caches on a snoop bus plus memory and directory. The Processor Level and Local Cluster Level are marked.]

Modified Slide from Prof. H.H. Lee in Georgia Tech

Page 45: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Stanford DASH

• MIPS R3000
• 33MHz
• 64KB L1 I$
• 64KB L1 D$
• 256KB L2
• MESI

Page 46: Lecture 7. Multiprocessor and Memory Coherence

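Korea Univ

Directory Coherence Protocol: Read Miss

[Figure: nodes with processor, cache and memory on an interconnection network. A processor misses on Z (read) and the request goes to the home node of Z. Data Z is shared (clean); Z is returned to the requester and its presence bit is set in the directory entry at the home of Z.]

Modified from Prof. H.H. Lee’s slide in Georgia Tech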

Page 47: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Directory Coherence Protocol: Read Miss

[Figure: nodes with processor, cache and memory on an interconnection network. A processor misses on Z (read) while data Z is dirty in another node's cache.]

• Go to Home Node
• Respond with Owner Info
• Data Request
• Afterwards, data Z is Clean, Shared by 2 nodes

Modified from Prof. H.H. Lee’s slide in Georgia Tech

Page 48: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Directory Coherence Protocol: Write Miss

[Figure: nodes with processor, cache and memory on an interconnection network. A processor misses on Z (write) while two other caches hold Z.]

• Go to Home Node
• Respond w/ sharers
• Invalidate (sent to each sharer)
• ACK (returned by each sharer)
• Write Z can proceed in P0

Slide from Prof. H.H. Lee in Georgia Tech
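The read-miss and write-miss message sequences can be sketched from the home node's point of view. This is an illustrative model, not the DASH implementation: `Directory`, the message names and the full-map presence sets are assumptions, and the network is reduced to a list of messages returned by each call.

```python
class Directory:
    """Home-node directory with a presence set and a modified bit per block."""

    def __init__(self):
        self.presence = {}                    # block -> set of node ids holding it
        self.dirty = {}                       # block -> modified bit

    def read_miss(self, block, requester):
        """Return the messages triggered by a read miss that reached the home node."""
        sharers = self.presence.setdefault(block, set())
        messages = []
        if self.dirty.get(block, False):
            (owner,) = sharers                # a dirty block has exactly one holder
            messages.append(("reply_owner_info", owner))
            messages.append(("data_request", owner))
            self.dirty[block] = False         # block becomes clean and shared
        else:
            messages.append(("data_from_home", requester))
        sharers.add(requester)
        return messages

    def write_miss(self, block, requester):
        """Return the messages triggered by a write miss that reached the home node."""
        sharers = self.presence.setdefault(block, set())
        others = sorted(sharers - {requester})
        messages = [("reply_sharers", tuple(others))]
        messages += [("invalidate", node) for node in others]
        messages += [("ack", node) for node in others]
        self.presence[block] = {requester}    # the writer is now the only holder
        self.dirty[block] = True
        return messages


home = Directory()
print(home.read_miss("Z", 1))
print(home.read_miss("Z", 2))
print(home.write_miss("Z", 0))                # nodes 1 and 2 are invalidated
print(home.read_miss("Z", 3))                 # Z is dirty in node 0
print(sorted(home.presence["Z"]), home.dirty["Z"])
```

The last read miss leaves Z clean and shared by two nodes, as in the dirty read-miss slide.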

Page 49: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Memory Consistency Issue

• What do you expect for the following codes?

Initial values: A=0, B=0

Example 1
P1: A=1; Flag = 1;
P2: while (Flag==0) {}; print A;
Is it possible P2 prints A=0?

Example 2
P1: A=1; B=1;
P2: print B; print A;
Is it possible P2 prints B=1, A=0?

Slide from Prof. H.H. Lee in Georgia Tech
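For the second example, the outcomes that a single shared memory with program order would allow can be enumerated by brute force. This is a sketch: `interleavings` and `sc_outcomes` are hypothetical helpers, and the tuple encoding of the operations is an assumption made for the illustration.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def sc_outcomes(p1, p2):
    """Run each interleaving on one shared memory; collect what P2 prints."""
    outcomes = set()
    for schedule in interleavings(p1, p2):
        memory = {"A": 0, "B": 0}
        printed = []
        for op, var, *value in schedule:
            if op == "store":
                memory[var] = value[0]
            else:                             # "print"
                printed.append((var, memory[var]))
        outcomes.add(tuple(printed))
    return outcomes


p1 = [("store", "A", 1), ("store", "B", 1)]
p2 = [("print", "B"), ("print", "A")]
for outcome in sorted(sc_outcomes(p1, p2)):
    print(outcome)
```

B=1, A=0 never appears in this enumeration. Seeing it on real hardware therefore means the stores or the loads were reordered, or the writes were not propagated in program order.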

Page 50: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Memory Consistency Model

• Programmers anticipate certain memory ordering and program behavior
• Becomes very complex when
  - Running shared-memory programs
  - A processor supports out-of-order execution
• A memory consistency model specifies the legal ordering of memory events when several processors access the shared memory locations

Slide from Prof. H.H. Lee in Georgia Tech

Page 51: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Sequential Consistency (SC) [Leslie Lamport]

• An MP is Sequentially Consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
• Two properties
  - Program ordering
  - Write atomicity (All writes to any location should appear to all processors in the same order)
• Intuitive to programmers

[Figure: three processors connected to a single memory.]

Slide from Prof. H.H. Lee in Georgia Tech

Page 52: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

SC Example

P0: A=1
P1: A=2
P2: T=A; U=A
P3: Y=A; Z=A

• T=1, U=2, Y=1, Z=2: Sequentially Consistent
• T=1, U=2, Y=2, Z=1: Violating Sequential Consistency! (but possible in processor consistency model)

Slide from Prof. H.H. Lee in Georgia Tech
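The two outcomes can be checked mechanically by enumerating every sequentially consistent execution of the four programs. This is a sketch: the helper names are hypothetical, and the initial value A=0 is an assumption the slide does not state.

```python
def sc_executions(programs):
    """Yield every interleaving of the per-processor programs, keeping program order."""
    if all(not prog for prog in programs):
        yield []
        return
    for i, prog in enumerate(programs):
        if prog:
            rest = programs[:i] + [prog[1:]] + programs[i + 1:]
            for tail in sc_executions(rest):
                yield [prog[0]] + tail


def sc_results(programs):
    """Collect the final register values of every sequentially consistent execution."""
    results = set()
    for schedule in sc_executions(programs):
        memory = {"A": 0}                     # assumed initial value
        regs = {}
        for op, name, value in schedule:
            if op == "store":                 # ("store", location, value)
                memory[name] = value
            else:                             # ("load", register, location)
                regs[name] = memory[value]
        results.add(tuple(sorted(regs.items())))
    return results


programs = [
    [("store", "A", 1)],                          # P0
    [("store", "A", 2)],                          # P1
    [("load", "T", "A"), ("load", "U", "A")],     # P2
    [("load", "Y", "A"), ("load", "Z", "A")],     # P3
]
legal = sc_results(programs)
first = (("T", 1), ("U", 2), ("Y", 1), ("Z", 2))
second = (("T", 1), ("U", 2), ("Y", 2), ("Z", 1))
print(first in legal, second in legal)
```

The second outcome is rejected because P2 observes A=1 before A=2 while P3 observes A=2 before A=1, and no single order of the two writes satisfies both.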

Page 53: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Maintain Program Ordering (SC)

• Dekker’s algorithm
• Only one processor is allowed to enter the CS

Flag1 = Flag2 = 0
P1: Flag1 = 1; if (Flag2 == 0) enter Critical Section
P2: Flag2 = 1; if (Flag1 == 0) enter Critical Section

Caveat: an implementation that is fine on a uni-processor can violate the ordering of the above

[Figure: P0 and P1 each hold their store (Flag1=1, Flag2=1) in a write buffer while memory still holds Flag1: 0 and Flag2: 0. P0 reads Flag2=0 and P1 reads Flag1=0.]

INCORRECT!! BOTH ARE IN CRITICAL SECTION!

Slide from Prof. H.H. Lee in Georgia Tech
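The write-buffer caveat can be replayed step by step. This is a minimal sketch with one hand-picked schedule: `BufferedProcessor` is a hypothetical model in which stores wait in a FIFO buffer and loads to other addresses read memory directly.

```python
class BufferedProcessor:
    """Processor with a FIFO write buffer; loads bypass buffered stores to other addresses."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = []                # pending (address, value) stores

    def store(self, addr, value):
        self.write_buffer.append((addr, value))

    def load(self, addr):
        for pending_addr, value in reversed(self.write_buffer):
            if pending_addr == addr:          # forward the processor's own latest store
                return value
        return self.memory[addr]              # otherwise read memory, possibly stale

    def drain(self):
        for addr, value in self.write_buffer:
            self.memory[addr] = value
        self.write_buffer.clear()


memory = {"Flag1": 0, "Flag2": 0}
p1, p2 = BufferedProcessor(memory), BufferedProcessor(memory)

p1.store("Flag1", 1)                          # sits in P1's write buffer
p2.store("Flag2", 1)                          # sits in P2's write buffer
p1_enters = p1.load("Flag2") == 0             # the load bypasses the buffered store
p2_enters = p2.load("Flag1") == 0
p1.drain()
p2.drain()

print(p1_enters, p2_enters, memory)
```

Both processors enter the critical section even though each one set its own flag first in program order. A single processor never notices the buffer, because a load to the same address is forwarded from it.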

Page 54: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Atomic and Instantaneous Update (SC)

• Update (of A) must take place atomically to all processors
• A read cannot return the value of another processor’s write until the write is made visible to “all” processors

A = B = 0
P1: A = 1
P2: if (A==1) B = 1
P3: if (B==1) R1 = A

Slide from Prof. H.H. Lee in Georgia Tech

Page 55: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Atomic and Instantaneous Update (SC)

• Update (of A) must take place atomically to all processors
• A read cannot return the value of another processor’s write until the write is made visible to “all” processors

A = B = 0
P1: A = 1
P2: if (A==1) B = 1
P3: if (B==1) R1 = A

[Figure: processors P0 to P4 on an interconnect. The update A=1 has reached some processors but not all of them when B=1 arrives.]

Caveat when an update is not atomic to all … R1=0?

Slide from Prof. H.H. Lee in Georgia Tech
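The R1=0 outcome can be reproduced when every processor reads its own copy and updates are delivered one destination at a time. This is a sketch under assumptions: `Node` and `deliver` are hypothetical, and the delivery order is chosen by hand to show the problem.

```python
class Node:
    """Each node reads its own copy; updates arrive as separately delivered messages."""

    def __init__(self):
        self.copy = {"A": 0, "B": 0}


nodes = {"P1": Node(), "P2": Node(), "P3": Node()}


def deliver(target, var, value):
    """Make one write visible at one node only (non-atomic propagation)."""
    nodes[target].copy[var] = value


# P1 executes A = 1; the update reaches P2 quickly but is delayed on its way to P3.
deliver("P1", "A", 1)
deliver("P2", "A", 1)

# P2 executes: if (A == 1) B = 1; its update to B reaches P3 first.
if nodes["P2"].copy["A"] == 1:
    deliver("P2", "B", 1)
    deliver("P3", "B", 1)

# P3 executes: if (B == 1) R1 = A, before the delayed A = 1 arrives.
r1 = None
if nodes["P3"].copy["B"] == 1:
    r1 = nodes["P3"].copy["A"]

deliver("P3", "A", 1)                         # the delayed update finally arrives
print("R1 =", r1)
```

Every processor follows its own program order here. The result R1=0 comes only from the write to A becoming visible to P2 before it becomes visible to P3.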

Page 56: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Relaxed Memory Models

• Relax program order requirement (it applies only to operation pairs accessing different locations)
  - Load bypass store
  - Load bypass load
  - Store bypass store
  - Store bypass load
• Relax write atomicity requirement
  - Read others’ write early
    • Allow a read to return the value of another processor’s write before all cached copies of the accessed location receive the invalidation or update messages generated by the write
  - Read own write early
    • Allow a read to return the value of its own previous write before the write is made visible to other processors

Modified Slide from Prof. H.H. Lee in Georgia Tech

Page 57: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Relaxed Memory Models


Page 58: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Relaxed Consistency

• Processor Consistency
  - Used in P6
  - Write visibility could be in different orders for different processors (write atomicity is not guaranteed)
  - Allow loads to bypass independent stores in each individual processor
  - To achieve SC, explicit synchronization operations need to be substituted or inserted
    • Read-modify-write instructions
    • Memory fence instructions

Slide from Prof. H.H. Lee in Georgia Tech

Page 59: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Processor Consistency

Load bypassing Stores

F1=1      F2=1
A=1       A=2
R1=A      R3=A
R2=F2     R4=F1

R1=1; R3=2; R2=0; R4=0 is a possible outcome

Modified Slide from Prof. H.H. Lee in Georgia Tech

Page 60: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Processor Consistency

• Allow load bypassing store to a different address
• Unlike SC, cannot guarantee mutual exclusion in the critical section

Flag1 = Flag2 = 0
P1: Flag1 = 1; if (Flag2 == 0) enter Critical Section
P2: Flag2 = 1; if (Flag1 == 0) enter Critical Section

Slide from Prof. H.H. Lee in Georgia Tech

Page 61: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Processor Consistency

A = B = 0
P1: A = 1
P2: if (A==1) B = 1
P3: if (B==1) R1 = A

B=1; R1=0 is a possible outcome since PC allows A=1 to be visible in P2 prior to P3

Slide from Prof. H.H. Lee in Georgia Tech

Page 62: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Backup Slides


Page 63: Lecture 7. Multiprocessor and Memory Coherence

Korea Univ

Dekker’s Algorithm

• Dekker’s algorithm guarantees mutual exclusion, freedom from deadlock, and freedom from starvation

Source: Wikipedia