Scalable Shared-Memory Implementations
Erik Hagersten, Uppsala University, Sweden
Cache-to-cache in snoop-based
[Figure: three threads with private caches ($) on a shared bus. One thread runs "...Read A…Write A"; another runs "Read B…Read A". The later Read A issues a BusRTS; the cache owning A snoops it ("My RTS, wait for data", "Gotta answer") and supplies the data cache-to-cache.]
”Upgrade” in dir-based
[Figure: the same threads and accesses. The Write A goes to the directory ("Who has a copy?"), which sends an INV to each sharer; each sharer answers with an ACK before the write can complete.]
Cache-to-cache in dir-based
[Figure: the later Read A is sent to the directory ("Who has a copy?"), which forwards the ReadRequest to the owning cache; the owner supplies the data, and a ReadDemandAck completes the transaction.]
Directory-based coherence: per-cacheline info in the memory
[Figure: threads with caches, each cache line tagged with a State; the memory holding A: and B: keeps a Directory state per cache line, managed by the Directory Protocol engine.]
Directory-based snooping: NUMA. Per-cacheline info in the home node
[Figure: two NUMA nodes on an interconnect. Each node has threads with caches, a local memory (A: in one node, B: in the other), and a Directory Protocol engine holding the Directory state for its own lines.]
Why directory-based?
• P2P messages --> high bandwidth
• Suits out-of-the-box coherence
• Much more scalable!
Note:
• Dir-based can be used to build a uniform-memory architecture (UMA)
• Bandwidth will be great!!
• Memory latency will be OK
• Cache-to-cache latency will not!
• Memory overhead can be high (storing the directory...)
Fully mapped directory
• k nodes
• Each node is the "home" for 1/k of the memory
• Dir entry per cacheline in home memory: k presence-bits + 1 dirty-bit
• Requests are first sent to the home node's CA
[Figure: memory with a directory entry per cache line: presence bits + dirty bit.]
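The entry layout and its linear cost in node count can be sketched in C. A minimal sketch, assuming 64-byte (512-bit) cache lines and up to 64 nodes; names are illustrative, not any vendor's:

```c
#include <stdint.h>

enum { LINE_BITS = 512 };          /* 64-byte cache lines (assumed) */

/* One entry per cache line in home memory:
   k presence bits + 1 dirty bit (here k <= 64). */
typedef struct {
    uint64_t presence;             /* bit i set: node i may hold a copy */
    uint8_t  dirty;                /* set: exactly one node owns it     */
} dir_entry_t;

/* Directory overhead grows linearly with the node count k:
   (k + 1) directory bits per LINE_BITS bits of data. */
static double dir_overhead(int k)
{
    return (double)(k + 1) / LINE_BITS;
}
```

For k = 64 that is 65/512, roughly 13% of all memory spent on directory bits; this linear growth is the memory overhead flagged above and the motivation for the SCI and limited-pointer schemes on the next slides.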
Reducing the Memory Overhead: SCI
Scalable Coherent Interface (SCI)
• home only holds a pointer to the rest of the directory info [log(N) bits]
• distributed linked list of copies, weaves through the caches
• cache tag has a pointer to the next cache with a copy
• on read, add yourself to the head of the list (comm. needed)
• on write, propagate a chain of invalidations down the list
• on replacement, remove yourself from the list
[Figure: Node 0, Node 1, Node 2, each a P + Cache, plus Main Memory (Home). The home's Dir holds a log N-bit pointer to the first cache with a copy; each cache tag points to the next.]
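The list operations above can be sketched as follows. A C sketch of the idea only: plain pointers stand in for the log2(N)-bit node IDs real SCI stores in the cache tags, and real SCI links the list doubly, which the replacement case needs:

```c
#include <stddef.h>

/* Per-line sharing list woven through the caches. */
struct copy {
    int          node_id;
    struct copy *next;          /* next cache holding a copy */
};

struct home {                   /* home memory holds only the head */
    struct copy *head;          /* [log(N) bits] in hardware        */
};

/* On read: add yourself at the head of the list (one msg to home). */
static void sci_read(struct home *h, struct copy *me)
{
    me->next = h->head;
    h->head  = me;
}

/* On write: walk the chain, invalidating every other copy. */
static void sci_write(struct home *h, struct copy *writer)
{
    for (struct copy *c = h->head; c != NULL; c = c->next) {
        if (c != writer) {
            /* send invalidation to node c->node_id (one msg/sharer) */
        }
    }
    writer->next = NULL;        /* writer is now the only copy */
    h->head = writer;
}
```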
Cache Invalidation Patterns

Barnes-Hut and Radiosity invalidation patterns: percentage of invalidating writes causing a given number of invalidations.

# of invalidations   Barnes-Hut (%)   Radiosity (%)
0                         1.27             6.68
1                        48.35            58.35
2                        22.87            12.04
3                        10.56             4.16
4                         5.33             2.24
5                         2.87             1.59
6                         1.88             1.16
7                         1.40             0.97
8 to 11                   2.50             3.28
12 to 15                  1.06             2.20
16 to 19                  0.61             1.74
20 to 23                  0.24             1.46
24 to 27                  0.28             0.92
28 to 31                  0.20             0.45
32 to 35                  0.06             0.37
36 to 39                  0.10             0.31
40 to 43                  0.07             0.28
44 to 47                  0                0.26
48 to 51                  0                0.24
52 to 55                  0                0.19
56 to 59                  0                0.19
60 to 63                  0.33             0.91
Overflow Schemes for Limited Pointers
Broadcast (DiriB)
• broadcast bit turned on upon overflow
• bad for widely-shared invalidated data
No-broadcast (DiriNB)
• on overflow, a new sharer replaces one of the old ones (invalidated)
• bad for widely read data
Coarse vector (DiriCV)
• change representation to a coarse vector, 1 bit per k nodes
• on a write, invalidate all nodes that a set bit corresponds to
[Figure, Dir in memory, for 16 nodes P0-P15: (a) no overflow: overflow bit = 0 plus 2 exact pointers; (b) overflow: overflow bit = 1 plus an 8-bit coarse vector, one bit per group of two nodes.]
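A sketch of the DiriCV overflow path, assuming 16 nodes, 2 pointers, and 1 coarse bit per 2 nodes as in the figure; field names are invented:

```c
#include <stdint.h>
#include <stdbool.h>

enum { N_NODES = 16, N_PTRS = 2, NODES_PER_BIT = 2 };

/* Dir_i_CV entry: i exact pointers normally; once they overflow, the
   same storage is reread as a coarse bit vector (8 bits total). */
typedef struct {
    bool overflow;
    union {
        uint8_t ptr[N_PTRS];       /* exact sharer node IDs */
        uint8_t coarse;            /* 8-bit coarse vector   */
    } u;
    uint8_t nptrs;
} dir_cv_t;

static void add_sharer(dir_cv_t *e, int node)
{
    if (!e->overflow && e->nptrs < N_PTRS) {
        e->u.ptr[e->nptrs++] = (uint8_t)node;      /* still exact */
    } else if (!e->overflow) {
        uint8_t cv = 0;                            /* overflow now */
        for (int i = 0; i < e->nptrs; i++)
            cv |= 1u << (e->u.ptr[i] / NODES_PER_BIT);
        cv |= 1u << (node / NODES_PER_BIT);
        e->u.coarse = cv;
        e->overflow = true;
    } else {
        e->u.coarse |= 1u << (node / NODES_PER_BIT);
    }
}
/* On a write with overflow set, every node whose group bit is set is
   invalidated, i.e. up to NODES_PER_BIT nodes per set bit. */
```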
Directory cache
[Figure: the two-node NUMA picture again, with a directory cache (Dir$) added in front of each node's Directory state.]
cc-NUMA issues
• Memory placement is key!
• Gotta migrate data to where it's being used
• Gotta have cache affinity:
  - long time between process switches in the OS
  - reschedule a process on the CPU it ran on last
• SGI Origin 2000's migration was always turned off
[Figure: processors with caches ($P) and per-node memories (M) on an interconnect.]
Three options for shared memory
[Figure: three organizations of processors with caches ($P) and memory on an interconnect:]
• COMA: cache-only (@SICS); each node's memory acts as an attraction memory (AM)
• NUMA: non-uniform; memory (M) local to each node
• UMA: uniform (a.k.a. SMP); memory on the far side of the interconnect
Sun’s E6000 Server Family
Erik Hagersten, Uppsala University, Sweden
What Approach to Shared Memory
[Figure: four shared-memory organizations:
(a) Shared cache: P1…Pn behind a switch share a first-level cache and interleaved main memory (UMA).
(b) Bus-based shared memory: P1…Pn with private caches ($) on a bus with memory and I/O devices (UMA).
(c) Dancehall: P1…Pn with caches on one side of an interconnection network, interleaved memories on the other (UMA).
(d) Distributed memory: P1…Pn, each with a cache and local memory, on an interconnection network (NUMA).]
Looks like a NUMA but drives like a UMA
[Figure: Logical view: a dancehall, P1…Pn with caches on an interconnection network with the memories on the far side. Physical view: each board carries both CPUs and memory, but addresses are interleaved across all memories.]
• Memory bandwidth scales with the processor count
• One "interconnect load" per (2xCPU + 2xMem)
• Optimized for the dancehall case (no local-memory shortcut)
SUN Enterprise Overview
• 16 bus slots, each holding either a CPU/memory board or an I/O board
• Up to 30 UltraSPARC processors (peak 9 GFLOPS)
• GigaplaneTM bus has a peak bandwidth of 2.67 GB/s; up to 30 GB memory
[Figure: GigaplaneTM bus (256 data, 41 address+ctrl, 83-100 MHz) connecting CPU/Mem cards (each with two CPUs, their L2 caches ($2), and a memory controller behind a bus interface/switch) and I/O cards.]
Enterprise Server E6000
[Figure: 16 boards on the interconnect, each with two processors with caches (P, $), memory (MEM), and I/O.]
An E6000 Proc Board
[Figure: a processor board with two CPUs ($), memory, and an Address Controller holding duplicate tags (Proc tags and Snoop tags). Address bus: 80 signals = addr, uid, arb, ...; data bus: 288 signals = 256 data + ECC.]
An I/O Board
[Figure: a board with two I/O bridges behind caches ($) and an Address Controller; the same bus interface as a processor board: 80 address signals (addr, uid, arb, ...) and 288 data signals (256 data + ECC).]
Split-Transaction Bus
• Split each bus transaction into request and response sub-transactions
• Separate arbitration for each phase
• Other transactions may intervene; improves bandwidth dramatically
• Each response is matched to its request
• Buffering between the bus and the cache controllers
[Figure: Address/CMD and Data phases pipelined over time, with a memory-access delay between a request and its response; the address signals carry new requests (A A A) while earlier responses (D D) return on the data signals.]
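Matching responses to requests is the essence of the split-transaction design. A hedged sketch, with invented field and function names, of a controller holding outstanding requests keyed by a unique id (uid) so data packets may return out of order:

```c
#include <stdint.h>
#include <stdbool.h>

enum { MAX_OUTSTANDING = 8 };

typedef struct { bool valid; uint64_t addr; uint8_t uid; } pending_t;

static pending_t pending[MAX_OUTSTANDING];

/* Address phase: record the request under its uid. */
static bool issue_request(uint64_t addr, uint8_t uid)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (!pending[i].valid) {
            pending[i] = (pending_t){ true, addr, uid };
            return true;
        }
    return false;              /* no free slot: arbitration must stall */
}

/* Data phase: the data packet carries the same uid as its request. */
static bool match_response(uint8_t uid, uint64_t *addr_out)
{
    for (int i = 0; i < MAX_OUTSTANDING; i++)
        if (pending[i].valid && pending[i].uid == uid) {
            *addr_out = pending[i].addr;
            pending[i].valid = false;     /* transaction retired */
            return true;
        }
    return false;                         /* stray data packet */
}
```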
Gigaplane Bus Timing
[Figure: pipelined bus timing over cycles 0-14, alternating address (A) and data (D) slots. Rd A (uid1): cycle 1 arbitration, cycle 2 address, cycles 4-5 state lookup (Share, ~Own), cycle 6 tag, cycle 7 status (OK or Cancel), then data D0, D1 tagged uid1. A second transaction, Rd B (uid2), overlaps in the same pipeline.]
Electrical Characteristics of the Bus
• At most 16 electrical loads per signal
• 8 boards from each side (e.g., 15 CPU + 1 I/O)
• 20.5-inch "centerplane"
• Well-controlled impedance
• ~350-400 signals
• Runs at 90/100 MHz
Address Controller
[Figure: the Address Controller sits between the two CPUs ($) and the bus: duplicate Proc tags and Snoop tags, the coherence-protocol engine (CohProt), an output queue (OQ) and an input queue (IQ), and the bus signals addr, arb, aid, did.]
Dual State Tags
[Figure: the same Address Controller, with the duplicate tags relabeled: the processor-side copy holds the access-right state, the snoop-side copy holds the obligation state.]
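A sketch of how the two tag copies diverge, assuming the convention the following slides illustrate: the obligation is updated as soon as the address packet passes on the bus, while the access right is raised only when the data actually arrives. The transition helpers are illustrative, not Sun's exact state machine:

```c
/* Two tag copies per line, so CPU and snooper can look up in parallel. */
typedef enum { INV, SHA, OWN, MOD } state_t;

typedef struct {
    state_t access_right;   /* proc tags: what the CPU may do now       */
    state_t obligation;     /* snoop tags: what this cache must answer  */
} dual_tag_t;

/* Snoop side: address packets update the obligation immediately,
   possibly long before the corresponding data has arrived. */
static void on_own_rtw_on_bus(dual_tag_t *t)  { t->obligation = MOD; }

static void on_foreign_rts(dual_tag_t *t)
{
    if (t->obligation == MOD) t->obligation = OWN;  /* must supply data */
}

/* CPU side: the access right is raised only at data arrival. */
static void on_data_arrival(dual_tag_t *t, state_t granted)
{
    t->access_right = granted;     /* e.g. MOD after an own RTW */
}
```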
Timing of a single read transaction
Board 1 reading from memory on board 2
[Figure: the same Gigaplane timing diagram, now annotating which latencies are fixed by the pipeline and which are the shortest possible.]
Protocol tuned for timing
11 cycles = ~110 ns
[Figure: the timing diagram again, showing how the fixed pipeline slots line up with the address decode, the SRAM (tag) lookup, and the DRAM access, so the protocol adds no dead cycles.]
State Change on Address Packet
Foreign (f) and own (m) transactions queue in the IQ.
• Data "A" initially resides in CPU7's cache
• CPU1 issues a store request to "A"
• CPU1: Read-To-Write req, ID=d (i.e., "write request")
• CPU13: LD "A" -> Read-To-Share req, ID=e
• CPU15: ST "A" -> RTW req, ID=f
• The mRTW is stored in IQ_CPU1; the own read transaction retires when its data arrives
• Later requests for A queue in IQ_CPU1 behind the mRTW
• IQ_CPU1 will eventually hold: <mRTW_IDd, fRTS_IDe, fRTW_IDf>
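A sketch of this IQ discipline, with made-up types: own and foreign packets for a line are kept in bus order, and the arrival of the matching data retires the own transaction and releases the foreign requests queued behind it:

```c
#include <stdbool.h>

typedef enum { RTS, RTW } kind_t;
typedef struct { bool mine; kind_t kind; int id; } iq_entry_t;

typedef struct { iq_entry_t e[16]; int head, tail; } iq_t;

/* An address packet for this line is queued in bus order. */
static void iq_push(iq_t *q, iq_entry_t x) { q->e[q->tail++ % 16] = x; }

/* The data packet for `id` arrived: retire our own transaction at the
   head, then service the foreign requests queued behind it, in order. */
static void iq_on_data(iq_t *q, int id)
{
    iq_entry_t *h = &q->e[q->head % 16];
    if (q->head != q->tail && h->mine && h->id == id)
        q->head++;                                 /* mRTW retired */
    while (q->head != q->tail && !q->e[q->head % 16].mine) {
        /* supply data (fRTS) or hand over ownership (fRTW) here */
        q->head++;
    }
}
```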
[Figure, step 1: the Address Controller with Proc tags = I and Snoop tags = I; the CPU issues "Store A".]
[Figure, step 2: the resulting RTW waits in the OQ; both tags still I.]
[Figure, step 3: the own RTW (mRTW, ID=d) goes out on the bus and enters the IQ; Snoop tags go I => M.]
[Figure, step 4: a foreign RTS (fRTS, ID=e) queues behind the mRTW; Snoop tags go M => O.]
[Figure, step 5: a foreign RTW (fRTO, ID=f) queues as well; Snoop tags go O => I. IQ = <mRTW, fRTS, fRTW>.]
[Figure, step 6: the data tagged ID=d arrives at the board; Snoop tags = I, IQ still <mRTW, fRTS, fRTW>.]
[Figure, step 7: the data retires the mRTW at the head of the IQ; Proc tags go I => M. IQ = <fRTS, fRTW>.]
[Figure, step 8: the fRTS is serviced and data tagged ID=e is sent out; Proc tags go M => O. IQ = <fRTW>.]
[Figure, step 9: the fRTW is serviced and data tagged ID=f is sent out; Proc tags go O => I. The IQ is empty.]
A cascade of "write requests"
• A initially resides in CPU7's cache
• On the bus: CPU1: RTW, ID=a; CPU2: RTW, ID=b; …; CPU5: RTW, ID=f
• IQ1 = <mRTW_IDa, fRTW_IDb>
• IQ2 = <mRTW_IDb, fRTW_IDc>
• ...
• IQ5 = <mRTW_IDf>
• …
• IQ7 = <fRTW_IDa>
[Figure: the CPU-tag/Snoop-tag pairs of the CPUs involved during the cascade: I/I, I/I, I/M, S/I.]
Implementing Sun’s SunFire 6800
Erik Hagersten, Uppsala University, Sweden
FirePlane, 24 CPUs
[Figure: CPU boards, each with two CPUs and caches, memory (MEM), a data switch (DCDS), and an address repeater (Addr. Rep.).]
• CPU: 1+ GHz UltraSPARC III
• L2 cache = 8 MB, snoop tags on-chip
• Mem = 4+ GB/CPU
FirePlane, 24 CPUs
[Figure: an address packet (CMD, Addr, ID), with ID = <CPU#, Uid>, is broadcast to all boards via the address repeaters.]
FirePlane, 24 CPUs
[Figure: every board snoops the broadcast address; the cache owning the line answers: "Here it is!!"]
FirePlane, 24 CPUs
[Figure: the data response (Data, ID), with ID = <CPU#, Uid>, is routed point-to-point through the data repeaters (DR) rather than broadcast.]
Sun’s WildFire System
Erik Hagersten, Uppsala University, Sweden
Sun’s WildFire System
• Runs unmodified SMP apps in a more scalable way than the E6000
• Minor modifications to E6000 snooping required
• CPUs generate a local address OR a global address
• Global address --> no replication (NUMA)
• Coherent Memory Replication (~Simple COMA @ SICS)
• Hardware support for detecting migration/replication pages
• Directory cache + address translation cache backed by memory
• Deterministic directory implementation (easy to verify)
WildFire: One Solaris spanning four nodes
[Figure: ccNUMA: nodes of P + $ with per-node memory (M) behind switches. COMA: the same nodes, but all memory turned into attraction-memory caches ($).]
COMA: self-optimizing DSM
• Self-optimizing architecture
• Problem at high memory pressure
• Complex hardware and coherence protocol
Adaptive S-COMA of Large SMPs
• A page may have space allocated in many nodes
• HW maintains memory coherence per cache line
• Replication under SW control --> simple HW (S-COMA)
• Adaptive replication algorithm in the OS (R-NUMA)
• Coherent Memory Replication (CMR)
• Hierarchical affinity scheduler (HAS)
• Few large nodes --> simple interconnect and coherence protocol
[Figure: two large SMP nodes, each with switches, processors with caches, and memory, joined by an interconnect; annotated 28x within each node and 4x between nodes.]
A WildFire Node
• 16 slots with either CPUs, I/O, or the WildFire extension board
• Up to 28 UltraSPARC processors
• GigaplaneTM bus has a peak bandwidth of 2.67 GB/s
• Local access time of 330 ns (lmbench)
[Figure: the E6000 Gigaplane bus (256 data, 41 address+ctrl, 83-100 MHz) with CPU/Mem cards, I/O cards, and a WildFire card on its own bus interface.]
Sun WildFire Interface Board
[Figure: the board's address controller, SRAM, data buffers, and three links to the other nodes ("This space for rent").]
[Board photo: the data bus and address bus connections and the link connectors.]
WildFire as a vanilla "NUMA"
[Figure: two nodes on the interconnect; each node's interface (I/F) holds a directory cache (Dir$, 8 bits/line) and memory tags (Mtag, 2 bits/line) alongside the memory, with the processors and caches behind.]
NUMA -- local memory access
[Figure: a local access checks the Mtag: "Access right OK?" If so, local memory supplies the data with no global traffic.]
NUMA -- remote memory access
[Figure: a remote access consults the Dir$ (8 bits/line) at the home node: "Who has the data?" The Mtag (2 bits/line) guards the local memory copy.]
SRAM overhead = 10/512 = 2% (lower bound 2/512 = 0.4%)
Global Cache Coherence Prot.
[Figure: a global transaction: the home modifies the directory entry, access rights change in the nodes' Mtags, and a Reply (Data) is sent to the requester.]
NUMA -- local memory access
[Figure: a local access whose Mtag check fails ("Access right OK? NO!!") cannot be served by local memory and must go to the global protocol.]
Gigaplane Bus Timing
[Figure: the Gigaplane timing diagram with three overlapping transactions, Rd A (uidA), Rd B (uidB), and Rd C (uidC), pipelining through the arbitration, address, state (Share, ~Own), tag, status (OK/Cancel), and data phases.]
WildFire Bus Extensions
• An Ignore transaction squashes an ongoing transaction => it is not put in any IQ
• WildFire eventually reissues the same transaction
• RTSF -- a new transaction that sends data to both the CPU and memory
[Figure: the same timing diagram, but WildFire asserts Ignore in the state phase of Rd A (uidA); the squashed Rd A (uidA) is later resent by WildFire.]
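A sketch of the Ignore decision, using three illustrative Mtag encodings (the real encoding is not given on these slides):

```c
#include <stdbool.h>

typedef enum { MT_INVALID, MT_SHARED, MT_MODIFIABLE } mtag_t;

/* WildFire's snooper: if the local memory copy cannot satisfy the
   snooped transaction, assert Ignore so no board queues it in an IQ;
   the interface then runs the global protocol and reissues the very
   same transaction once the data/rights are local. */
static bool assert_ignore(mtag_t mtag, bool is_write)
{
    if (mtag == MT_INVALID)                return true;
    if (is_write && mtag != MT_MODIFIABLE) return true;
    return false;    /* local copy suffices: normal snoop proceeds */
}
```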
WildFire Directory -- only 4 nodes!!
• k nodes (each with one or more procs)
• With each cache block in memory: k presence bits, 1 dirty bit
• With each cache block in a cache: 1 valid bit and 1 dirty (owner) bit
[Figure: processors with caches on an interconnection network; memory holds the directory of presence bits + dirty bit per line.]
• ReadRequest from main memory by processor i:
  - if the dirty bit is OFF: read from main memory; turn p[i] ON
  - if the dirty bit is ON: recall the line from the dirty proc (its cache state goes to shared); update memory; turn the dirty bit OFF; turn p[i] ON; supply the recalled data to i
• ….
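The slide's ReadRequest rule, transcribed into C; the messages and the recall itself are elided:

```c
#include <stdbool.h>

enum { K = 4 };                     /* WildFire: only 4 nodes */

typedef struct {
    bool presence[K];               /* p[i]: node i may hold a copy */
    bool dirty;
} wf_dir_t;

/* Home node handles a ReadRequest from processor i. */
static void read_request(wf_dir_t *d, int i)
{
    if (!d->dirty) {
        /* read the line from main memory */
        d->presence[i] = true;
    } else {
        /* recall the line from the dirty proc (its cache -> shared),
           update memory, then supply the recalled data to node i */
        d->dirty = false;
        d->presence[i] = true;
    }
}
```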
NUMA "detecting excess misses"
[Figure: a node requests data the directory believed it already had ("I thought you had the data!!"), a sign of an excess miss on a remote line.]
Detecting a page for replication
[Figure: data replies carry an excess-miss bit (Data w/ E-miss bit); a small set of associative counters, one per hot page address, is bumped on each excess miss.]
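A sketch of the associative excess-miss counters, with invented sizes and threshold; crossing the threshold would trap to the OS, which decides whether to replicate (CMR) or migrate the page:

```c
#include <stdint.h>

enum { SLOTS = 64, THRESHOLD = 16 };   /* sizes invented */

typedef struct { uint64_t page; uint32_t count; } ctr_t;
static ctr_t ctr[SLOTS];

/* A data reply arrived with its excess-miss bit set. */
static void on_excess_miss(uint64_t page)
{
    int free_slot = -1;
    for (int i = 0; i < SLOTS; i++) {
        if (ctr[i].count != 0 && ctr[i].page == page) {
            if (++ctr[i].count >= THRESHOLD) {
                /* raise interrupt: candidate page for replication */
            }
            return;
        }
        if (ctr[i].count == 0) free_slot = i;
    }
    if (free_slot >= 0)
        ctr[free_slot] = (ctr_t){ page, 1 };
    /* else: evict some entry (replacement policy not shown) */
}
```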
OS Initializes a CMR page
[Figure: the OS sets up a new V->P mapping in this node (VA->PA, remote page <--> local shadow page, held in the address translation table and its AT$ cache) and initializes the access right of every cache line (/CL) in the local page copy to INV; the mapping itself is per page.]
An access to a CMR page
[Figure: the local access is translated through the AT$ and checked against the per-line access right: "Access right OK?"]
An access to a CMR page (miss)
[Figure: the access-right check fails ("NO!!"); the request goes through the interconnect to the home node.]
• Address Translation (AT) overhead = 8 B/8 kB = 0.1%
• No extra latency added
An access to a CMR page (miss)
[Figure: when the data returns, the line's MTAG is changed to "shared" so later accesses hit locally.]
An access to a CMR page (hit)
[Figure: the access-right check succeeds ("YES!"); the local replicated copy supplies the data.]
Deterministic Directory
• MOSI protocol, fully mapped directory (one bit/node)
• Directory blocking: one outstanding transaction per cache line
• The directory blocks new requests until a completion is received
• The directory state and cache state are always in agreement (except for silent replacement…)
[Figure: protocol message diagrams between local node L, home H, and remote node R:
(a) Write, remote shared: 1: req (L->H); 2a: demand (H->R), 2b: inv to sharers; 3a: reply(D), 3b: ack; 4: compl.
(b) Read, remote dirty: 1: req; 2: demand; 3: reply(D); 4: compl.
(c) Writeback: 1: req; 2: ack/nack; 3: compl(D).]
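The blocking rule is simple enough to sketch: one busy bit per directory line, set while a transaction is in flight and cleared by the completion message. Illustrative C, not the real hardware:

```c
#include <stdbool.h>

typedef struct {
    bool busy;            /* a transaction is in flight for this line */
    /* ... MOSI state, presence bits ... */
} dir_line_t;

/* Step 1: a new request reaches the home. */
static bool dir_accept(dir_line_t *l)
{
    if (l->busy) return false;   /* blocked: retried after completion */
    l->busy = true;
    return true;
}

/* Step 4: the completion message arrives; the line is released. */
static void dir_complete(dir_line_t *l)
{
    l->busy = false;
}
```

Because at most one transaction per line is ever in flight, the directory and cache states can never race ahead of each other, which is what makes the protocol deterministic and easy to verify.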
Replication Issues Revisited
[Figure: fraction of "physical" memory (mem tot) spent on replication: none in ccNUMA, all of it in COMA; WildFire sits in between, controlled by the OS.]
• Only "promising" pages are replicated
• The OS dynamically limits the amount of replication
• The Solaris CMR changes are in the hat_layer (= the port)
Advantages of Multiprocessor Nodes
Pros:
• amortization of fixed node costs over multiple processors
• can use commodity SMPs
• fewer nodes to keep track of in the directory
• much communication may stay within a node (NUCA)
• can share "node caches" (WildFire: Coherent Memory Replication)
Cons:
• bandwidth is shared among the processors and the interface
• the bus may increase latency to local memory
• the snoopy bus at the remote node increases delays there too
Memory cost of replication
Example: replicate 10% of the data in all nodes. Each of the other k-1 nodes then holds an extra 10% copy, i.e. overhead = (k-1) x 10%:
• 50 nodes, each with 2 CPUs ==> 49 x 10% = 490% overhead
• 4 nodes, each with 25 CPUs ==> 3 x 10% = 30% overhead
Does migration/replication help?
NAS Parallel Benchmark study (execution time in seconds) [M. Bull, EPCC 2002]

Shallow:         No Initial Plac.     Initial Placement
                 No migr   Migr       No migr   Migr
  No Repl        26        5.9        6.1       6.1
  Repl           7.2       6.2        6.1       6.1

BT:
  No Repl        960       610        620       600
  Repl           590       580        580       580

FT:
  No Repl        520       330        380       260
  Repl           250       260        190       200

SP:
  No Repl        1540      780        760       780
  Repl           670       680        670       670

MG:
  No Repl        230       230        240       230
  Repl           220       220        220       220

CG:
  No Repl        1060      700        940       700
  Repl           300       280        300       290

(Unopt. / SW opt. / HW opt.)
WildFire’s Technology Limits
[Figure: the WildFire node picture annotated with its scaling limits:]
• SRAM size = DRAM size/256
• Snoop frequency
• Dir$ reach >> sum(cache size)
• Slow interconnect
• Hard to make busses faster
Sun’s SunFire 15k/25k
Erik Hagersten, Uppsala University, Sweden
StarCat: Sun Fire 15k/25k (used at Lab 2)
Front Side
Back Side
StarCat, 72 CPUs
[Figure: CPU boards (two CPUs + caches, memory, data switch DCDS, address repeater) sit on expander boards (global coherence protocol, Dir$, data repeater); an active backplane with two 18x18 address crossbars connects 18 slots in total. Repeater snoop coherence runs within a board; P2P directory coherence runs between expander boards.]
StarCat Coherence Mechanism
[Figure: a remote request is snooped (Addr) on the target board; every data transfer carries DATA + MTAG + ECC = 576 bits, and the MTAG is checked when the data arrives.]
StarCat, 72 CPUs
[Figure: the same system picture.]
• Allocate a Dir$ entry only for write requests; speculate on clean data on a Dir$ miss
• WildCat coherence w/o CMR and w/ a faster interconnect
Directory cache, but no directory(broadcast on Dir$ miss)
[Figure: the two-node directory-protocol picture, but each node has only a Dir$; there is no full directory state in memory behind it.]
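A sketch of the miss-means-broadcast policy, with invented structure names; the MTAG check on arriving data (previous slides) catches the cases where the clean-data speculation was wrong:

```c
#include <stdint.h>
#include <stdbool.h>

enum { DIRC_ENTRIES = 1024 };    /* size invented */

typedef struct { bool valid; uint64_t tag; uint8_t owner; } dirc_t;
static dirc_t dirc[DIRC_ENTRIES];

/* Coherent read with a Dir$ but no backing directory. */
static void coherent_read(uint64_t line)
{
    dirc_t *e = &dirc[line % DIRC_ENTRIES];
    if (e->valid && e->tag == line) {
        /* Dir$ hit: send a point-to-point request to e->owner */
    } else {
        /* Dir$ miss: broadcast/snoop on all boards, speculatively
           using clean data from memory; the MTAG carried with the
           data catches the cases where the speculation was wrong. */
    }
    /* Per the slide, a Dir$ entry is allocated only on writes, so
       reads to clean lines never displace useful owner info. */
}
```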
StarCat Performance Data
[Figure: the StarCat system picture with its headline numbers:]
• Latency = 200-340 ns
• GBW = 43 GB/s; LBW = 86 GB/s
• Up to 104 CPUs (trading I/O slots for CPUs)