The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor

Computer Systems Laboratory, Stanford University

Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy

Goal: designing a low-cost, high-performance multiprocessor

Message-passing (multicomputer): distributed address space, local access only; more scalable, but more cumbersome to program

Shared-memory (multiprocessor): single address space with remote access; simpler to program (easier data partitioning and dynamic load distribution), but remote accesses consume bandwidth and caching raises the cache coherence problem

DASH (Directory Architecture for Shared memory)

Main memory is distributed among the processing nodes to provide scalable memory bandwidth

Distributed directory-based protocol to support cache coherence

DASH architecture

Processing node (cluster)
- bus-based multiprocessor with a snoopy protocol; amortizes the cost of the directory logic & network interface

Set of clusters
- connected by a mesh interconnection network
- a distributed directory-based protocol keeps summary information for each memory line, specifying the clusters that are caching it

Details

Cache -- private to each processor

Memory -- shared by the processors within the same cluster

Directory memory -- keeps track of all processors caching a memory block; sends point-to-point messages (invalidate/update) instead of broadcasts

Remote Access Cache (RAC) -- maintains the state of currently outstanding requests and buffers replies from the network, releasing the waiting processor for bus arbitration

Design of the distributed directory-based protocol

Correctness issues
- memory consistency model: strongly constrained or less constrained?
- deadlock: loops where the consumption of one request requires the generation of the next
- error handling: managing data integrity & fault tolerance

Performance issues
- latency
  - write misses: hidden by the write buffer and the release consistency model
  - read misses: minimize the number of inter-cluster messages and the delay of each message
- bandwidth: reduce serialization (queuing delays) and traffic (# of messages); the caches & distributed memory in DASH help here

Distributed control & complexity issues
- distribute control to the components, balancing system performance against the complexity of the components

DASH prototype

Cluster (node)
- Silicon Graphics PowerStation 4D/240, 4 processors (MIPS R3000/R3010)
- L1 (64-Kbyte instruction, 64-Kbyte write-through data)
- L2 (256-Kbyte write-back): converts the L1's write-through policy to write-back and provides the cache tags for snooping; consistency within the cluster is maintained using the Illinois MESI protocol

Memory bus
- separated into a 32-bit address bus & a 64-bit data bus
- supports memory-to-cache & cache-to-cache transfers: 16 bytes every 4 bus clocks with a latency of 6 bus clocks, maximum bandwidth 64 Mbytes/s
- retry mechanism: when a request requires service from a remote cluster, the request is signaled to retry; the requesting processor is masked from arbitration until the reply arrives, and then unmasked, avoiding unnecessary retries

Modifications

Directory controller board -- maintains inter-cluster cache coherence and interfaces to the interconnection network

Directory controller (DC) -- contains the directory memory corresponding to the cluster's portion of main memory; initiates out-bound network requests

Pseudo-CPU (PCPU) -- buffers incoming requests and issues them on the cluster bus

Reply controller (RC) -- tracks outstanding requests made by local processors, receives & buffers the corresponding replies from remote clusters, and acts as memory when the processor retries the request

Interconnection network -- 2 wormhole-routed meshes (request & reply)

HW monitoring logic, miscellaneous control and status registers -- logic samples directory-board and bus events to derive usage and performance statistics

Directory memory

- an array of directory entries

- one entry for each memory block

- a single state bit (shared/dirty)

- a bit vector with one presence bit for each of the 16 clusters

- directory information is combined with the bus operation, the address, and the result of snooping within the cluster

- the DC generates the network messages & bus controls

[Figure: processors with private caches on each node; memory with its directory (presence bits + dirty bit per block); nodes connected by the interconnection network]

• Assume "N" processors
• With each cache block in memory: N presence bits (a bit vector) and 1 dirty bit (a state bit)
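To make the entry layout concrete, here is a minimal C sketch of one directory entry for the prototype's 16-cluster configuration; the names (dir_entry_t, owner_cluster) are illustrative, not taken from the DASH hardware:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CLUSTERS 16

typedef struct {
    uint16_t presence;  /* one presence bit per cluster caching the block */
    bool     dirty;     /* clear: shared/uncached; set: one dirty owner   */
} dir_entry_t;

/* True if any remote cluster holds a copy of the block. */
bool is_cached(const dir_entry_t *e) {
    return e->presence != 0;
}

/* When dirty is set, exactly one presence bit identifies the owner. */
int owner_cluster(const dir_entry_t *e) {
    for (int c = 0; c < NUM_CLUSTERS; c++)
        if (e->presence & (1u << c))
            return c;
    return -1;  /* no owner recorded */
}
```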

Remote Access Cache (RAC)

- maintains the state of currently outstanding requests (managed by the RC)

- buffers replies from the network, so the waiting processor is released for bus arbitration

- supplements the functionality of the processors' caches

- supplies data cache-to-cache when the released processor retries the access

DASH cache coherence protocol -- terminology

Local cluster: the cluster that contains the processor originating a given request

Home cluster: the cluster that contains the main memory and directory for a given physical memory address

Remote cluster: any other cluster

Owning cluster: a cluster that owns a dirty memory block

Local memory: the main memory associated with the local cluster

Remote memory: any memory whose home is not the local cluster

DASH cache coherence protocol

Invalidation-based ownership protocol

Memory block states
- Uncached-remote: not cached by any remote cluster
- Shared-remote: cached in an unmodified state by one or more remote clusters
- Dirty-remote: cached in a modified state by a single remote cluster

Cache block states
- Invalid: the copy in the cache is stale
- Shared: other processors may also be caching that location
- Dirty: this cache contains an exclusive copy of the memory block, and the block has been modified
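The memory-block state falls directly out of the directory entry bits; a small sketch, reusing the hypothetical dir_entry_t layout above:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint16_t presence; bool dirty; } dir_entry_t;  /* as above */

typedef enum { UNCACHED_REMOTE, SHARED_REMOTE, DIRTY_REMOTE } mem_state_t;

/* Derive the memory-block state from the directory entry bits. */
mem_state_t block_state(const dir_entry_t *e) {
    if (e->presence == 0)
        return UNCACHED_REMOTE;       /* no remote copies          */
    return e->dirty ? DIRTY_REMOTE    /* one modified remote copy  */
                    : SHARED_REMOTE;  /* >= 1 unmodified copies    */
}
```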

3 primitive operations

Read request (load)
- hit in L1: L1 simply supplies the data
- hit in L2: a fill operation finds the required block and brings it into L1
- otherwise: a read request is sent on the bus
  • shared-local: the data is simply transferred over the bus
  • dirty-local: the RAC takes ownership of the cache line
  • uncached-remote / shared-remote: the home cluster sends the data over the reply network to the requesting cluster
  • dirty-remote: the request is forwarded to the owning cluster; the owning cluster sends the data to the requesting cluster and a sharing write-back request to the home cluster (the home-side handling is sketched after the forwarding notes below)

Forwarding strategy

- reduces latency through direct responses

- lets the directory process many requests simultaneously (multithreaded)

- reduces serialization

- adds latency when simultaneous accesses are made to the same block: the 1st request is satisfied and the dirty cluster loses ownership, so the 2nd request receives a negative acknowledge (NAK) that forces a retry of the access
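A hedged sketch of how the home cluster's directory controller might service a read, following the case analysis above; send_reply and forward_request are hypothetical stand-ins for the DC's network interface:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CLUSTERS 16

typedef struct { uint16_t presence; bool dirty; } dir_entry_t;

/* Hypothetical network-interface hooks for the DC. */
void send_reply(int cluster, const void *data);   /* reply mesh   */
void forward_request(int owner, int requester);   /* request mesh */

/* Home-cluster handling of an incoming read request. */
void home_read(dir_entry_t *e, int requester, const void *mem_data) {
    if (!e->dirty) {
        /* Uncached-remote or shared-remote: memory is up to date;
         * reply with the data and record the new sharer. */
        e->presence |= (uint16_t)(1u << requester);
        send_reply(requester, mem_data);
    } else {
        /* Dirty-remote: forward to the owner, which replies directly
         * to the requester and sends a sharing write-back to home.
         * If ownership has moved meanwhile, the requester gets a NAK
         * and must retry. */
        for (int c = 0; c < NUM_CLUSTERS; c++) {
            if (e->presence & (1u << c)) {
                forward_request(c, requester);
                break;
            }
        }
    }
}
```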

Read-exclusive request (store)
- block in local memory: write, and invalidate other cached copies
- dirty-remote: the owning processor invalidates the block in its cache, sends a reply granting ownership (with the data) to the requesting cluster, and sends an ownership-update message to the home cluster
- uncached-remote / shared-remote: the write can proceed, with invalidation requests sent for copies in the shared state

Acknowledgements

- needed for the requesting processor to know when the store has completed with respect to all processors

- maintain consistency by guaranteeing that the new owner will not lose ownership before the directory has been updated
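A sketch of the acknowledgement counting this implies, assuming a hypothetical RAC entry that records the number of outstanding invalidation acks; the store is complete with respect to all processors only when the count reaches zero:

```c
#include <stdbool.h>

typedef struct {
    int  pending_invals;  /* invalidation acks still expected            */
    bool owned;           /* ownership has been granted to this cluster  */
} rac_entry_t;

/* Read-exclusive reply arrives, carrying the count of sharers that
 * were sent invalidation requests. */
void on_rdex_reply(rac_entry_t *r, int sharers) {
    r->owned = true;
    r->pending_invals = sharers;
}

/* One invalidation acknowledgement arrives on the reply mesh.
 * Returns true once the store is complete w.r.t. all processors. */
bool on_inval_ack(rac_entry_t *r) {
    return --r->pending_invals == 0;
}
```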

Write-back request
- a dirty cache line that is replaced must be written back to memory
- home cluster is local: write back to main memory directly
- home cluster is remote: send a message to the remote home cluster, which updates its main memory and marks the block uncached-remote
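A sketch of the home-side handling of a remote write-back, reusing the hypothetical dir_entry_t layout from the earlier sketches:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

typedef struct { uint16_t presence; bool dirty; } dir_entry_t;

/* Home-side handling of a write-back from the remote owner. */
void home_writeback(dir_entry_t *e, void *mem, const void *data, size_t len) {
    memcpy(mem, data, len);  /* update main memory at the home cluster */
    e->presence = 0;         /* no remote cluster caches the block now */
    e->dirty    = false;     /* block becomes uncached-remote          */
}
```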

Bus-initiated cache transactions

- transactions observed by the caches snooping the bus

- read operation: a dirty cache supplies the data and changes to the shared state

- read-exclusive operation: all other cached copies are invalidated

- when a line in L2 is invalidated, L1 does the same

Exception conditions

A request forwarded to a dirty cluster may arrive there to find that the dirty cluster no longer owns the data
- a prior access changed ownership, or
- the owning cluster performed a write-back
Sol: the requesting cluster is sent a NAK response and is required to reissue the request (the arbitration mask is released, and the retry is treated as a new request)

Ownership bouncing between two remote clusters: the requesting cluster receives multiple NAKs, and a time-out eventually returns a bus error
Sol: add an additional directory state with an access queue; respond directly to all read-only requests, and grant ownership to each exclusive requester on a pseudo-random basis

With separate request and reply networks, some messages sent between 2 clusters can be received out of order
Sol: acknowledgement replies; out-of-order requests receive a NAK response

An invalidation request can overtake the read reply whose copy it is trying to purge
Sol: when the RAC detects an invalidation request for a pending read, the state of that RAC entry changes to invalidated-read-pending; the RC then assumes that any read reply is stale and treats the reply as a NAK response
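The invalidated-read-pending rule can be sketched as a small state transition; the state names mirror the text above, and the entry layout is hypothetical:

```c
typedef enum {
    RAC_READ_PENDING,        /* read issued, reply not yet received   */
    RAC_INVAL_READ_PENDING   /* invalidation arrived before the reply */
} rac_state_t;

typedef struct { rac_state_t state; } rac_entry_t;

/* Invalidation request snooped while a read is pending. */
void on_invalidate(rac_entry_t *r) {
    if (r->state == RAC_READ_PENDING)
        r->state = RAC_INVAL_READ_PENDING;
}

/* Returns 1 if the read reply may be used; 0 means the RC treats the
 * reply as a NAK and the processor must retry the read. */
int on_read_reply(const rac_entry_t *r) {
    return r->state != RAC_INVAL_READ_PENDING;
}
```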

Deadlock

HW: 2 mesh networks with point-to-point message passing; the consumption of an incoming message may require the generation of another outgoing message

Protocol:
- request messages: read, read-exclusive, and invalidation requests
- reply messages: read & read-exclusive replies, invalidation acknowledgements
- each message class travels on its own mesh; replies can always be consumed without generating further messages, which bounds the dependence chain

Error handling
- error-checking hardware: ECC on main memory, parity checking on directory memory, length checking of network messages, and checking for inconsistent bus and network messages
- errors are reported to the processor through bus errors and the associated error-capture registers
- the issuing processor times out the originating request or fencing operation
- the OS can clean up the state of a line by using back-door paths that allow direct addressing of the RAC and directory memory

Scalability of the DASH directory

Amount of directory memory = memory size × # of processors (quadratic growth if memory scales with the processor count)

Alternatives:
- limited pointers per entry: no space is dedicated to processors that are not caching the line
- allow pointers to be shared between directory entries
- use a cache of directory entries to supplement or replace the normal directory
- sparse directories with limited pointers and a coarse vector (sketched below)
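A sketch of the coarse-vector idea: past a threshold, each presence bit stands for a group of clusters rather than a single one, trading precise invalidations for a constant entry size; the sizes here are illustrative:

```c
#include <stdint.h>

#define CLUSTERS    256                      /* hypothetical large machine */
#define VECTOR_BITS 16                       /* bits available per entry   */
#define COARSENESS  (CLUSTERS / VECTOR_BITS) /* clusters covered per bit   */

/* Record cluster c as a (possible) sharer of the line. */
uint16_t coarse_set(uint16_t vec, int c) {
    return (uint16_t)(vec | (1u << (c / COARSENESS)));
}

/* An invalidation must now visit every cluster in each marked group,
 * including some that never cached the line. */
int must_invalidate(uint16_t vec, int c) {
    return (vec >> (c / COARSENESS)) & 1u;
}
```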

Validation of the protocol

2 SW-simulator-based testing methods
- a low-level DASH system simulator that incorporates the coherence protocol, caches, buses, and the interconnection network
- a high-level functional simulator that models the processors and executes parallel programs

2 schemes for testing the protocol
- running existing parallel programs and comparing the output
- test scripts

Hardware testing

Comparison with the Scalable Coherent Interface (SCI) protocol

Similarities
- both rely on coherent caches maintained by distributed directories
- both rely on distributed memories to provide scalable memory bandwidth

Differences
- in SCI, the directory is a distributed sharing list maintained by the caches
- in DASH, all the directory information is placed with the main memory

SCI advantages
- the number of directory pointers grows naturally with the # of processors
- the pointers employ the SRAM technology already used by the caches
- forward progress is guaranteed in all cases

SCI disadvantages
- the distributed directory entries increase the complexity and latency of the directory protocol; additional update messages must be sent between caches
- requires more inter-node communication