Distributed Memory and Cache Consistency (some slides courtesy of Alvin Lebeck)
The Stanford Directory Architecture for Shared Memory (DASH)*

Presented by: Michael Bauer
ECE 259/CPS 221
Spring Semester 2008
Dr. Lebeck

* Based on “The Stanford Dash Multiprocessor” in IEEE Computer, March 1992
Outline

1. Motivation
2. High Level System Overview
3. Cache Coherence Protocol
4. Memory Consistency Model: Release Consistency
5. Overcoming Long Latency Operations
6. Software Support
7. Performance Results
8. Conclusion: Where is it now?
Motivation

Goals:
1. Minimal impact on programming model
2. Cost efficiency
3. Scalability!!!

Design Decisions:
1. Shared address space (no MPI)
2. Parallel architecture instead of the next sequential processor (no clock issues yet!)
3. Hardware controlled, directory based cache coherency
High Level System Overview

[Figure: three clusters connected by an interconnect network; each cluster contains four processor/cache pairs plus a memory and its directory]

A shared address space without shared memory??*

* See http://www.uschess.org/beginners/read/ for the meaning of “??”
Cache Coherence Protocol
DASH’s Big Idea: Hierarchical Directory Protocol

- Processor Level: the processor’s own cache
- Local Cluster Level: other processor caches within the local cluster
- Remote Cluster Level: processor caches in remote clusters
- Home Cluster Level: directory and main memory associated with a given address
- Locate cache blocks using a hierarchy of directories
- Like NUCA except for directories (NUDA = Non-Uniform Directory Access?)
- Cache blocks in three possible states:
  - Dirty (M)
  - Shared (S)
  - Uncached (I)
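The bookkeeping behind these three states can be sketched in a few lines of Python. This is an illustration of a DASH-style directory entry, not the actual hardware encoding: each home directory keeps, per memory block, a state plus the set of clusters caching the block (a bit vector in the real machine). All names here are made up for readability.

```python
from dataclasses import dataclass, field

@dataclass
class DirectoryEntry:
    """Illustrative per-block entry at the block's home directory."""
    state: str = "Uncached"                     # "Uncached", "Shared", or "Dirty"
    sharers: set = field(default_factory=set)   # cluster ids caching the block

    def read_miss(self, cluster):
        """A cluster requests a read copy of the block."""
        # If the block was Dirty, the owner supplies the data and
        # downgrades; either way the block ends up Shared.
        self.state = "Shared"
        self.sharers.add(cluster)

    def write_miss(self, cluster):
        """A cluster requests exclusive ownership of the block."""
        to_invalidate = self.sharers - {cluster}
        self.state = "Dirty"
        self.sharers = {cluster}
        return to_invalidate    # invalidation messages go to these clusters
```

The returned invalidation set is what makes the directory approach scale better than snooping: only the clusters actually caching the block receive messages.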
Cache Coherency Example

[Figure: the three-cluster system again, with the requesting processor in one cluster, the home cluster holding the directory and memory for the address, and the processor holding the block in a third cluster]
1. Processor makes request on local bus
2. No response; directory broadcasts request on the network
3. Home directory sees the request, sends a message to the remote cluster
4. Remote directory puts the request on its bus
5. Remote processor responds with the data
6. Remote directory forwards the data, updates the home directory
7. Data delivered, home directory updated
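As a rough illustration, the walkthrough above can be written as a trace generator. This is only a narration of the slide’s seven steps; the cluster names and message wording are invented, not DASH’s actual network protocol.

```python
def remote_read_miss(requester, home, owner):
    """Trace the 7-step remote read miss from the example slide:
    a block whose home is one cluster, held by a processor in another."""
    return [
        f"1. {requester}: processor makes request on local bus",
        f"2. no local response; request broadcast toward home cluster {home}",
        f"3. {home}: home directory sees request, sends message to {owner}",
        f"4. {owner}: remote directory puts request on its bus",
        f"5. {owner}: remote processor responds with the data",
        f"6. {owner}: remote directory forwards data, updates {home}'s directory",
        f"7. {requester}: data delivered, home directory updated",
    ]

for step in remote_read_miss("cluster A", "cluster B", "cluster C"):
    print(step)
```

Notice that steps 2–6 each cross the interconnect: this worst case is exactly the long-latency event the later slides try to hide.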
Implications of Cache Coherence Protocol
- What do hierarchical directories get us?
  - Very fast access on the local cluster
  - Moderately fast access to the home cluster
  - Minimized data movement (assumes temporal and spatial locality?)
- What problems still exist?
  - Broadcast in some circumstances can be a bottleneck to scalability
  - Complexity of cache and directory controllers; they require many outstanding requests to hide latency -> power-hungry CAMs
  - Potential for long latency events as shown in the example (more on this later)
Memory Consistency Model: Release Consistency
Release Consistency Review*:
1. W->R reordering allowed (to different blocks only)
2. W->W reordering allowed (to different blocks only)
3. R->W (to different blocks only) and R->R reordering allowed
Why Release Consistency?
1. Provides an acceptable programming model
2. Reordering events is essential for performance on a variable-latency system
3. Relaxed requirements for the interconnect network: no need for in-order distribution of messages
* Taken from “Shared Memory Consistency Models: A Tutorial”, we’ll read this later
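The programming contract behind release consistency can be sketched with ordinary Python threads. This is only the software-visible pattern, not the hardware: CPython is far stricter than release consistency in practice. The idea is that ordinary accesses may be reordered freely, but everything written before a *release* (here, `Event.set`) must be visible after the matching *acquire* (`Event.wait`).

```python
import threading

data = []                      # ordinary shared data: no ordering guaranteed
ready = threading.Event()      # the synchronization variable

def producer():
    data.append(42)            # ordinary writes: hardware may reorder these
    data.append(43)            # among themselves under release consistency
    ready.set()                # release: publishes all writes above it

def consumer(out):
    ready.wait()               # acquire: all writes before the release
    out.extend(data)           # are guaranteed visible from here on

result = []
t_cons = threading.Thread(target=consumer, args=(result,))
t_prod = threading.Thread(target=producer)
t_cons.start(); t_prod.start()
t_prod.join(); t_cons.join()
print(result)                  # [42, 43]
```

Because correctness only depends on ordering at the release/acquire points, the hardware is free to overlap and reorder everything in between, which is exactly what DASH exploits on its variable-latency interconnect.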
Overcoming Long Latency Operations
Prefetching:
- How is this beneficial to execution?
- What can go wrong with prefetching?
- Does this scale?
Update and Deliver Operations:
- What if we know data is going to be needed by many threads?
- Tell the system to broadcast data to everyone using the Update-Write operation
- Does this scale well?
- What about embarrassingly parallel applications?
Software Support
- Parallel version of Unix OS
- Handle prefetching in software (will this scale?)
- Parallelizing compiler (how well do you think this works?)
- Parallel language Jade (how easy to rewrite applications?)
Performance Results
Do these look like they scale well?

What is going on here?!?
Conclusion: Where is it now?
- Novel architecture and cache coherence protocol
- Some level of scalability for diverse applications
- Why don’t we see DASH everywhere?
  - Parallel architectures not cost-effective for general purpose computing until recently
  - Requires adaptation of sequential code to parallel architecture
  - Power?
  - Any other reasons?
- For anyone interested: DASH -> FLASH -> SGI Origin (Server)
  http://www-flash.stanford.edu/architecture/papers/ISCA94/
  http://www.futuretech.blinkenlights.nl/origin/isca.pdf