CS 213 Lecture 11: Multiprocessor 3: Directory Organization
Example Directory Protocol Contd.
– Write miss: the block gets a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is forwarded to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
Cache-to-cache transfer: can occur on a remote read or write miss. Idea: transfer the block directly from the cache holding the exclusive copy to the requesting cache. Why go through the directory? Rather, inform the directory after the block has been transferred => 3 transfers over the interconnection network (IN) instead of 4.
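A minimal sketch of that write-miss transition, assuming a simple home-directory model; the class, method, and message names (and the caches[...] interface) are made up for illustration:

```python
# Sketch: directory handling of a write miss when the block is held
# Exclusive (dirty) by another cache. Illustrative only.

class Directory:
    def __init__(self, num_blocks):
        self.state = ['Uncached'] * num_blocks             # per-block state
        self.sharers = [set() for _ in range(num_blocks)]  # per-block sharer set
        self.memory = [0] * num_blocks

    def write_miss(self, block, requestor, caches):
        if self.state[block] == 'Exclusive':
            # Message to the old owner: it sends the value to the directory,
            # which forwards it to the requestor (4 network transfers).
            old_owner = next(iter(self.sharers[block]))
            self.memory[block] = caches[old_owner].flush_and_invalidate(block)
            # Cache-to-cache optimization: the old owner could instead send the
            # block straight to the requestor and inform the directory
            # afterwards -- 3 transfers over the IN instead of 4.
        elif self.state[block] == 'Shared':
            for s in self.sharers[block] - {requestor}:
                caches[s].invalidate(block)
        # The requestor becomes the new (sole) owner.
        self.sharers[block] = {requestor}
        self.state[block] = 'Exclusive'
        return self.memory[block]   # data reply sent to the requestor
```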
Basic Directory Transactions
[Figure: two basic directory transactions. Each node is drawn with its processor (P), cache (C), and communication assist with memory/directory (A, M/D).]
(a) Read miss to a block in dirty state — requestor, directory node for the block, and the node with the dirty copy: 1. read request to directory; 2. reply with owner identity; 3. read request to owner; 4a. data reply to requestor; 4b. revision message to directory.
(b) Write miss to a block with two sharers — requestor, directory node, and two sharers: 1. RdEx request to directory; 2. reply with sharers' identity; 3a/3b. invalidation requests to sharers; 4a/4b. invalidation acks.
Protocol Enhancements for Latency
• Forwarding messages: memory-based protocols
[Figure: three ways to satisfy a request from local node L via home H when the block is dirty at remote node R.]
(a) Strict request-reply: 1. req (L->H); 2. reply with owner identity (H->L); 3. intervention (L->R); 4a. revise (R->H); 4b. response (R->L).
(b) Intervention forwarding: 1. req (L->H); 2. intervention (H->R); 3. response (R->H); 4. reply (H->L).
(c) Reply forwarding: 1. req (L->H); 2. intervention (H->R); 3a. revise (R->H); 3b. response (R->L).
Intervention is like a request, but issued in reaction to a request and sent to the cache rather than to memory.
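As a rough illustration of the benefit, the sketch below counts critical-path network hops for each scheme from the message sequences above and multiplies by a per-hop latency (25 cycles, the figure a later slide assumes); per-node overheads are ignored, so the numbers are only indicative:

```python
# Critical-path network hops for a miss to a remotely dirty block, taken
# from the message sequences above. 25 cycles per hop is assumed.
SCHEMES = {
    "strict request-reply":    ["req", "reply", "intervention", "response"],  # 4 hops
    "intervention forwarding": ["req", "intervention", "response", "reply"],  # 4 hops
    "reply forwarding":        ["req", "intervention", "response"],           # 3 hops
}

HOP_LATENCY = 25  # cycles per network traversal (assumed)

for name, hops in SCHEMES.items():
    print(f"{name:25s}: {len(hops)} hops on the critical path "
          f"-> {len(hops) * HOP_LATENCY} cycles")
```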
Example
The following five references are traced through the directory protocol. A1 and A2 map to the same cache block, and all memory locations start at 0. For each step, the slides track P1's and P2's cache state (State, Addr, Value), the interconnect traffic (Action, Proc., Addr, Value), the directory entry (Addr, State, {Procs}), and the memory value.
Step 1 — P1: Write 10 to A1 (write miss): interconnect WrMs P1 A1, then DaRp P1 A1 0; directory A1: Excl. {P1}; P1 cache: Excl. A1 10.
Step 2 — P1: Read A1: hit in P1's cache (Excl. A1 10); no interconnect traffic.
Step 3 — P2: Read A1 (read miss): RdMs P2 A1; Ftch P1 A1 10 (memory A1 updated to 10, P1 downgrades to Shar. A1 10); DaRp P2 A1 10; directory A1: Shar. {P1, P2}; P2 cache: Shar. A1 10.
Step 4 — P2: Write 20 to A1 (write miss to a shared block): WrMs P2 A1; Inval. P1 A1 (P1 cache: Inv.); directory A1: Excl. {P2}, memory still 10; P2 cache: Excl. A1 20.
Step 5 — P2: Write 40 to A2 (write miss; A2 replaces A1 in P2's cache): WrMs P2 A2; WrBk P2 A1 20 (directory A1: Unca. {}, memory A1 = 20); DaRp P2 A2 0; directory A2: Excl. {P2}; P2 cache: Excl. A2 40.
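A minimal executable sketch of this walkthrough, assuming one-word blocks, a one-block cache per processor (so A1 and A2 conflict), and the message names used above (WrMs, RdMs, Ftch, Inval, WrBk, DaRp); it is an illustration of the trace, not the full protocol, and omits the write-miss-to-a-remote-dirty-block case the trace never exercises:

```python
class DirectorySystem:
    def __init__(self, procs):
        self.dir = {}                              # addr -> [state, sharer set]
        self.mem = {}                              # addr -> value (default 0)
        self.cache = {p: None for p in procs}      # p -> [state, addr, value]

    def msg(self, *fields):
        print(' '.join(str(f) for f in fields))

    def evict(self, p):
        """Write a dirty block back to the home before replacing it."""
        line = self.cache[p]
        if line and line[0] == 'Excl':
            _, addr, value = line
            self.msg('WrBk', p, addr, value)
            self.mem[addr] = value
            self.dir[addr] = ['Unca', set()]
        self.cache[p] = None

    def read(self, p, addr):
        line = self.cache[p]
        if line and line[1] == addr:               # read hit (Shar or Excl)
            return line[2]
        self.evict(p)
        self.msg('RdMs', p, addr)
        state, sharers = self.dir.get(addr, ['Unca', set()])
        if state == 'Excl':                        # fetch the dirty copy
            owner = next(iter(sharers))
            self.mem[addr] = self.cache[owner][2]
            self.msg('Ftch', owner, addr, self.mem[addr])
            self.cache[owner][0] = 'Shar'          # owner downgrades
        self.dir[addr] = ['Shar', sharers | {p}]
        value = self.mem.get(addr, 0)
        self.msg('DaRp', p, addr, value)
        self.cache[p] = ['Shar', addr, value]
        return value

    def write(self, p, addr, value):
        line = self.cache[p]
        if line and line[1] == addr and line[0] == 'Excl':
            line[2] = value                        # write hit
            return
        has_shared_copy = bool(line and line[1] == addr)
        if not has_shared_copy:
            self.evict(p)
        self.msg('WrMs', p, addr)
        _, sharers = self.dir.get(addr, ['Unca', set()])
        for s in sharers - {p}:                    # invalidate other copies
            self.msg('Inval', s, addr)
            self.cache[s] = None
        self.dir[addr] = ['Excl', {p}]
        if not has_shared_copy:                    # data reply on a true miss
            self.msg('DaRp', p, addr, self.mem.get(addr, 0))
        self.cache[p] = ['Excl', addr, value]

s = DirectorySystem(['P1', 'P2'])
s.write('P1', 'A1', 10)
s.read('P1', 'A1')
s.read('P2', 'A1')
s.write('P2', 'A1', 20)
s.write('P2', 'A2', 40)
```

Replaying the five references prints the same interconnect messages as the trace above; the only cosmetic difference is that the sketch issues the write back before the WrMs in the final step, while the slide's table lists the WrMs first.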
Assume Network latency = 25 cycles
Reducing Storage Overhead
• Optimizations for full bit vector schemes
– increase cache block size (reduces storage overhead proportionally)
– use multiprocessor nodes (bit per MP node, not per processor)
– still scales as P*M, but reasonable for all but very large machines
» 256 procs, 4 per cluster, 128B line: 6.25% overhead (see the sketch below)
• Reducing “width”: addressing the P term?
• Reducing “height”: addressing the M term?
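A quick check of that 6.25% figure (worked arithmetic; the variable names are just for illustration):

```python
# Full-bit-vector directory overhead with multiprocessor (cluster) nodes:
# one presence bit per cluster, per memory block.
procs = 256
procs_per_cluster = 4
block_bytes = 128

bits_per_entry = procs // procs_per_cluster   # 64 presence bits
entry_bytes = bits_per_entry / 8              # 8 bytes of directory state
overhead = entry_bytes / block_bytes          # per 128-byte block
print(f"{overhead:.2%}")                      # -> 6.25%
```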
Storage Reductions
• Width observation:
– most blocks are cached by only a few nodes
– don't keep a bit per node; instead, the entry contains a few pointers to sharing nodes
– P = 1024 => 10-bit pointers; could use 100 pointers and still save space (see the sketch below)
– sharing patterns indicate a few pointers (five or so) should suffice
– need an overflow strategy when there are more sharers
• Height observation:
– number of memory blocks >> number of cache blocks
– most directory entries are useless at any given time
– organize the directory as a cache, rather than having one entry per memory block
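The width observation in numbers (a small sketch; no particular machine's entry format is implied):

```python
# Storage per directory entry for P = 1024 processors.
import math

P = 1024
full_bit_vector = P                      # 1024 bits per entry

ptr_width = math.ceil(math.log2(P))      # 10-bit pointers
for num_ptrs in (5, 100):
    limited = num_ptrs * ptr_width
    print(f"{num_ptrs:3d} pointers: {limited:5d} bits "
          f"(vs {full_bit_vector} bits for a full bit vector)")
# 5 pointers -> 50 bits; even 100 pointers -> 1000 bits, still below 1024.
```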
Insight into Directory Requirements
• If most misses involve O(P) transactions, might as well broadcast!
=> Study inherent program characteristics:
– frequency of write misses?
– how many sharers on a write miss?
– how do these scale?
• Also provides insight into how to organize and store directory information
Cache Invalidation Patterns
[Figure: LU invalidation patterns — histogram of % of shared writes vs. number of invalidations. 0 invalidations: 8.75%; 1 invalidation: 91.22%; bins 2 through 56–59: ~0%; 60–63: 0.22%.]
[Figure: Ocean invalidation patterns — 0 invalidations: 0%; 1: 80.98%; 2: 15.06%; 3: 3.04%; 4: 0.49%; 5: 0.34%; 6: 0.03%; 8–11: 0.03%; 60–63: 0.02%; all other bins ~0%.]
Cache Invalidation Patterns
[Figure: Barnes-Hut invalidation patterns — % of shared writes vs. number of invalidations: 0: 1.27%; 1: 48.35%; 2: 22.87%; 3: 10.56%; 4: 5.33%; 5: 2.87%; 6: 1.88%; 7: 1.4%; 8–11: 2.5%; a small tail out to 60–63 (0.33%).]
[Figure: Radiosity invalidation patterns — 0: 6.68%; 1: 58.35%; 2: 12.04%; 3: 4.16%; 4: 2.24%; 5: 1.59%; 6: 1.16%; 7: 0.97%; 8–11: 3.28%; a gradually decreasing tail out to 60–63 (0.91%).]
Sharing Patterns Summary
• Generally, few sharers at a write; scales slowly with P
– Code and read-only objects (e.g., scene data in Raytrace)
» no problems, as rarely written
– Migratory objects (e.g., cost array cells in LocusRoute)
» even as # of PEs scales, only 1–2 invalidations
– Mostly-read objects (e.g., root of tree in Barnes)
» invalidations are large but infrequent, so little impact on performance
– Frequently read/written objects (e.g., task queues)
» invalidations usually remain small, though frequent
– Synchronization objects
» low-contention locks result in small invalidations
» high-contention locks need special support (SW trees, queueing locks)
• Implies directories are very useful in containing traffic
– if organized properly, traffic and latency shouldn't scale too badly
• Suggests techniques to reduce storage overhead
Overflow Schemes for Limited Pointers
• Broadcast (DiriB): the directory entry holds i pointers. If more copies are needed (overflow), set the broadcast bit so that on a write the invalidation is broadcast to all processors
– bad for widely shared, frequently read data
• No-broadcast (DiriNB): never allow more than i copies to be present at any time; on overflow, a new sharer replaces one of the existing copies, which is invalidated
– bad for widely read data
• Coarse vector (DiriCV)
– change the representation to a coarse vector, 1 bit per k nodes
– on a write, invalidate all nodes that a set bit corresponds to (see the sketch below)
Ref: Chaiken et al., "Directory-Based Cache Coherence in Large-Scale Multiprocessors," IEEE Computer, June 1990.
[Figure: a 16-processor example (P0–P15) of DiriCV with 2 pointers. (a) No overflow: overflow bit = 0, the entry holds 2 sharer pointers. (b) Overflow: overflow bit = 1, the same bits are reinterpreted as an 8-bit coarse vector, one bit per group of 2 nodes.]
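A minimal sketch of a limited-pointer entry with coarse-vector overflow (DiriCV). The field widths match the 16-node figure above (2 pointers, 1 coarse bit per 2 nodes); the class and method names are made up:

```python
# Dir_i CV sketch: i = 2 four-bit pointers for a 16-node machine; on overflow
# the same 8 bits are reinterpreted as a coarse vector, 1 bit per 2 nodes.
NODES = 16
NUM_PTRS = 2
GROUP = 2          # nodes per coarse-vector bit (16 nodes / 8 bits)

class DirCVEntry:
    def __init__(self):
        self.overflow = False
        self.ptrs = []                          # up to NUM_PTRS sharer node ids
        self.coarse = [False] * (NODES // GROUP)

    def add_sharer(self, node):
        if not self.overflow and len(self.ptrs) < NUM_PTRS:
            self.ptrs.append(node)
            return
        if not self.overflow:                   # switch representations on overflow
            self.overflow = True
            for p in self.ptrs:
                self.coarse[p // GROUP] = True
        self.coarse[node // GROUP] = True

    def nodes_to_invalidate(self):
        """Who must be invalidated on a write (a superset of sharers once coarse)."""
        if not self.overflow:
            return sorted(self.ptrs)
        return [n for n in range(NODES) if self.coarse[n // GROUP]]

entry = DirCVEntry()
for sharer in (3, 9, 14):                       # the third sharer forces overflow
    entry.add_sharer(sharer)
print(entry.nodes_to_invalidate())              # -> [2, 3, 8, 9, 14, 15]
```

Once the entry overflows, precision is lost: a write invalidates every node in each marked group, whether or not it still caches the block.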
Overflow Schemes (contd.)
• Software (DiriSW)
– trap to software; use any number of pointers (no precision loss)
» MIT Alewife: 5 pointers, plus one bit for the local node
– but extra cost of interrupt processing in software
» processor overhead and occupancy
» latency
• 40 to 425 cycles for a remote read in Alewife
• 84 cycles for 5 invalidations, 707 for 6
• Dynamic pointers (DiriDP)
– use pointers from a hardware free list in a portion of memory (see the sketch below)
– manipulation done by a hardware assist, not software
– e.g., Stanford FLASH
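A sketch of the dynamic-pointer idea: directory entries head linked lists of sharer records drawn from a common free list in memory. Structure and names are illustrative, not the Stanford FLASH implementation:

```python
# Dynamic pointers: each directory entry heads a linked list of sharer
# records allocated from one free list (hardware-managed in a real machine).
class PtrRecord:
    def __init__(self, node):
        self.node = node
        self.next = None

class DynamicPtrDirectory:
    def __init__(self, pool_size):
        self.free_list = [PtrRecord(None) for _ in range(pool_size)]
        self.head = {}                       # block addr -> first PtrRecord

    def add_sharer(self, addr, node):
        rec = self.free_list.pop()           # allocate from the free list
        rec.node = node
        rec.next = self.head.get(addr)       # push onto the block's sharer list
        self.head[addr] = rec

    def invalidate_all(self, addr):
        """Walk the block's sharer list, returning records to the free list."""
        invalidated, rec = [], self.head.pop(addr, None)
        while rec is not None:
            invalidated.append(rec.node)
            nxt, rec.next, rec.node = rec.next, None, None
            self.free_list.append(rec)
            rec = nxt
        return invalidated

d = DynamicPtrDirectory(pool_size=8)
for node in (1, 5, 7):
    d.add_sharer('A1', node)
print(d.invalidate_all('A1'))                # -> [7, 5, 1]
```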
Some Data
– 64 procs, 4 pointers, normalized to full bit vector
– coarse vector quite robust
• General conclusions:
– full bit vector is simple and good for moderate scale
– several schemes should be fine for large scale
[Figure: normalized invalidations (0–800) for LocusRoute, Cholesky, and Barnes-Hut under the Broadcast (B), No-broadcast (NB), and Coarse-vector (CV) overflow schemes, normalized to the full-bit-vector scheme.]
Summary of Directory Organizations
• Flat schemes:
• Issue (a): finding the source of directory data
– go to the home node, based on address
• Issue (b): finding out where the copies are
– memory-based: all info is in the directory at the home
– cache-based: home has a pointer to the first element of a distributed linked list
• Issue (c): communicating with those copies (see the sketch below)
– memory-based: point-to-point messages (perhaps coarser on overflow)
» can be multicast or overlapped
– cache-based: part of the point-to-point linked-list traversal used to find them
» serialized
• Hierarchical schemes:
– all three issues handled by sending messages up and down a tree
– no single explicit list of sharers
– only direct communication is between parents and children
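A small sketch contrasting issue (c) for the two flat schemes: the memory-based home can send invalidations point-to-point and overlap them, while the cache-based scheme walks a distributed sharing list serially (an SCI-style illustration; the node names and message tuples are made up):

```python
# Compare how the two flat schemes reach the sharers of a block on a write.
sharers = ['N3', 'N7', 'N12']      # copies of the block

# Memory-based: the home directory knows every sharer and can send the
# invalidations in parallel -> latency of roughly one round trip.
memory_based_msgs = [('home', s, 'inval') for s in sharers]

# Cache-based: the home only points at the first sharer; each cache points
# at the next, so the traversal (and the invalidations) is serialized.
cache_based_msgs = [('home', sharers[0], 'inval')]
for prev, nxt in zip(sharers, sharers[1:]):
    cache_based_msgs.append((prev, nxt, 'inval'))

print(len(memory_based_msgs), 'overlappable messages vs',
      len(cache_based_msgs), 'serialized hops')
```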
Summary of Directory Approaches
• Directories offer scalable coherence on general networks
– no need for broadcast media
• Many possibilities for organizing directory and managing protocols
• Hierarchical directories not used much
– high latency, many network transactions, and a bandwidth bottleneck at the root
• Both memory-based and cache-based flat schemes are alive
– for memory-based, full bit vector suffices for moderate scale
» measured in nodes visible to directory protocol, not processors
– will examine case studies of each
Summary
• Caches contain all information on state of cached memory blocks
• Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping => uniform memory access)
• Directory has extra data structure to keep track of state of all cache blocks
• Distributing the directory => scalable shared-address multiprocessor => cache-coherent, non-uniform memory access (NUMA)