Memory Sharing Predictor: The key to speculative Coherent DSM

An-Chow Lai

Babak Falsafi

Purdue University

Organization

• Introduction
• Directory based cache coherence
• Pattern Based Message Predictors
• Memory Sharing Predictors
• Vector Memory Sharing Predictors
• Speculative Coherent operations
• Performance Analysis
• Results
• Summary & conclusions

Introduction

• Distributed Shared Memory Multiprocessors:
– Provide a logical shared address space over physically distributed memory.
– Programming is easier compared to SMPs.
– Non-Uniform Memory Access is the bottleneck: remote access is far slower than local access.

DSM

• Efforts to eliminate this difference:
– Custom designed motherboards: cannot benefit from the excellent cost-performance of off-the-shelf motherboards.
– Reduce remote access frequency.
– Reduce coherence protocol overhead: this needs complex adaptive coherence protocols.
– Existing predictors: directed at specific sharing patterns known a priori.
– Pattern based predictors:
  • Dynamically adapt to an application’s sharing pattern at runtime
  • Do not modify the base coherence protocol
– Memory Sharing Predictors & Vector Memory Sharing Predictors:
  • Topic of this paper
  • Improvement on the general pattern based predictors proposed by Mukherjee & Hill

Directory based cache coherence

[Figure: four nodes, each with Processor & Caches, Memory, I/O, and a Directory, connected by an interconnection network]

Directory based cache coherence

• Directory Based Cache Coherence Protocols
– Each node maintains sharing information of all memory blocks
– Based on a finite state machine in which the states are directory states and the actions are messages
– This paper uses a half-migratory protocol
– A speculative coherent DSM must accurately predict remote accesses and perform the corresponding actions in time.

Directory protocol transitions

A remote read request

Pattern Based Message Predictors

• Predicts the sender and type of next incoming message for a particular block.

• Structure : Similar to a two level branch predictor

• History table: captures most recent sequence of incoming messages for every memory block

• Pattern table: records all observed sequences of coherence messages for every memory block
– Each entry maps a sequence of messages to a predicted message

A two level Message predictor
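
The two-level structure described on this slide can be sketched roughly as follows. This is only an illustrative model, not the hardware organization from the paper: the class name `MessagePredictor`, the method names, and the use of Python dictionaries for both tables are assumptions made for the example.

```python
from collections import defaultdict, deque


class MessagePredictor:
    """Illustrative two-level predictor: per-block history -> next message."""

    def __init__(self, depth=1):
        self.depth = depth
        # First level: the most recent `depth` messages seen for each block.
        self.history = defaultdict(lambda: deque(maxlen=depth))
        # Second level: per block, observed history -> message that followed.
        self.patterns = defaultdict(dict)

    def predict(self, block):
        """Return the predicted next (sender, type) for `block`, or None."""
        key = tuple(self.history[block])
        if len(key) < self.depth:
            return None  # not enough history recorded yet
        return self.patterns[block].get(key)

    def observe(self, block, message):
        """Record an incoming (sender, type) message and learn from it."""
        hist = self.history[block]
        if len(hist) == self.depth:
            # The current history was followed by `message`: remember that.
            self.patterns[block][tuple(hist)] = message
        hist.append(message)  # the oldest message falls out automatically


predictor = MessagePredictor(depth=1)
for msg in [("P2", "write"), ("P1", "read"), ("P2", "write")]:
    predictor.observe(0x40, msg)
print(predictor.predict(0x40))
```

After the write, read, write sequence the history holds the last write, so the sketch predicts that a read from P1 comes next.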

Pattern Based Message Predictors(contd.)

• Depth of the Message History Register = number of past messages it keeps track of.

• Deeper history => more accurate prediction, no race conditions.

• Deeper history => larger pattern history table => higher cost.

[Figure: Message History Table (MHT) and Message History Register (MHR), with entries of the form <sender, type>]

Memory Sharing Predictors

• Shortcomings of the general message predictor:
– Invalidation messages may arrive in any order, and thus may interfere with prediction of the more necessary request messages
– Recording acknowledgements increases the number of pattern table entries (almost doubles it)
– It also increases the number of bits needed to encode the messages (three requests & two acks).

• Observations:
– To eliminate the coherence overhead on remote accesses, it is only necessary to predict memory request messages (read, write, upgrade).
– Predicting coherence acknowledgement messages is extra overhead, as they are always expected to arrive in response to a coherence action.

Memory Sharing Predictors

• MSP addresses these issues:
– Predicts only the memory request messages
– Since the acknowledgements are eliminated, all effects of possible re-ordering of acknowledgements are eliminated.
– Only 2 bits are required to encode message types, compared to 3 for the general predictor
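
As a rough illustration of the two points above, the sketch below filters a message stream down to requests and computes how many bits are needed per message type. The request names come from the slides; the two acknowledgement names (`inv_ack`, `writeback`) and the function names are assumptions chosen for the example.

```python
REQUESTS = ("read", "write", "upgrade")
ACKS = ("inv_ack", "writeback")  # hypothetical names for the two acks


def type_bits(n_types):
    """Smallest number of bits that can encode n_types distinct values."""
    return (n_types - 1).bit_length()


def msp_filter(messages):
    """Keep only memory request messages, as MSP records them."""
    return [(sender, kind) for sender, kind in messages if kind in REQUESTS]


print(type_bits(len(REQUESTS)))              # requests only (MSP)
print(type_bits(len(REQUESTS) + len(ACKS)))  # requests and acks (general)
print(msp_filter([(2, "write"), (1, "inv_ack"), (1, "read")]))
```

With three request types the encoding fits in 2 bits, while five message types need 3, which matches the figures quoted on the slide.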

VMSP: A Vector MSP

• Observations:
– A full-map protocol allows multiple processors to simultaneously cache read-only copies of a memory block.
– A predictor must identify the sharers, but need not maintain the order in which they read.

• Optimizations to MSP to get VMSP:
– Rather than record and predict read requests as individual pattern table entries, encode a sequence of read requests as a bit vector, just as the directory maintains the list of sharers.

Vector Memory Sharing Predictor(contd.)

• Benefits:
– Reduces the number of pattern table entries
– Eliminates the effect of re-ordering of reads on size
– Effect on history depth: number of sharers
– Good when the number of readers is large (> (2+n)/2 + log(n)).
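
The vector encoding can be illustrated with a small sketch. It assumes requests arrive as `(processor_id, type)` pairs with integer processor ids; the function name `encode_vmsp` and the tuple layout of the output are choices made for this example only.

```python
def encode_vmsp(requests):
    """Collapse each run of consecutive reads into one ("read", bit_vector)."""
    entries = []
    for proc, kind in requests:
        if kind == "read":
            if entries and entries[-1][0] == "read":
                # Extend the current read vector with this reader's bit.
                entries[-1] = ("read", entries[-1][1] | (1 << proc))
            else:
                entries.append(("read", 1 << proc))
        else:
            # Writes and upgrades stay as individual entries.
            entries.append((kind, proc))
    return entries


a = encode_vmsp([(2, "write"), (1, "read"), (3, "read")])
b = encode_vmsp([(2, "write"), (3, "read"), (1, "read")])
print(a, a == b)
```

Both orderings of the two reads produce the same single vector entry, which is why read re-ordering no longer creates extra pattern table entries.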

Triggering Request Speculation

• Important considerations:
– Predict what remote memory requests arrive
– Predict when remote accesses arrive
– Execute necessary coherence actions

A speculative coherent DSM node and coherence hardware

Triggering Request Speculation

• A) What remote memory request arrives: relatively simple, taken from the pattern history table (which stores what memory accesses take place)

• B) When: harder

– Early speculation may take the block away from its readers

– Late speculation may incur additional delay and may limit the DSM’s ability to hide coherence overhead

– Timing was not a problem in COSMOS, because coherence messages were only predicted, not sent; they were sent only after the previous message arrived. Since the history table no longer holds coherence acknowledgement messages, timing is a problem now.

Triggering Request Speculation

• Two ways to overcome:

1) Speculative Write Invalidation (SWI):

• Based on a common memory access pattern, the producer-consumer scenario: the producer writes to a memory block and then no longer accesses it until it has been read by the consumers. Common in parallel commercial database servers.

• MSP predicts that a processor is done writing a block when the processor writes to some other memory location

• Maintain an early write-invalidate (EWI) table, which stores the last address written by each processor.

• If a processor's address in the EWI table changes, trigger a speculative write invalidate and the subsequent reads.
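
The early write-invalidate rule above can be sketched as a small table keyed by processor. The class name, method name, and return convention are assumptions for illustration; the slides do not specify an interface.

```python
class EarlyWriteInvalidateTable:
    """Illustrative EWI table: remembers the last block each processor wrote."""

    def __init__(self):
        self.last_written = {}  # processor id -> last block address written

    def on_write(self, proc, block):
        """Record a write; return the block to invalidate speculatively, if any."""
        previous = self.last_written.get(proc)
        self.last_written[proc] = block
        if previous is not None and previous != block:
            # The processor moved on to another block, so it is predicted
            # to be done writing the previous one.
            return previous
        return None


ewi = EarlyWriteInvalidateTable()
print(ewi.on_write(2, 0xA0))  # first write by P2: nothing to invalidate
print(ewi.on_write(2, 0xA0))  # same block again: still nothing
print(ewi.on_write(2, 0xB0))  # P2 moved to another block: invalidate 0xA0
```

In a full design the returned block would be handed to the protocol as a speculative write invalidation, followed by the predicted reads.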

Comparison with general Message predictor

[Figure: two message timelines over time across P1 (reader), P3 (directory), and P2 (writer). Labels in the first: Read, Writeback, Send block. Labels in the second: Write A, Write B, Invalidate, Writeback, Prefetching starts, Send block, Read hit, Invalidate.]

Question?

• What happens if P1 has already requested the data while the speculatively read data is on its way from P3 to P1?
– The DSM node that receives the speculative message drops it, which avoids modifying the protocol.

Question?

• What happens if P1 makes a read request before P2 does the second write?
– First Read protocol

2) First Read:
– If SWI fails, then on the first read request made, all subsequent reads are triggered.

Speculative Coherence Operations

Final Action:
– Execute a coherence action speculatively
– Verify the accuracy of the predictor

• Requirements:
– Co-exist with the base coherence protocol without any protocol modifications

• MSP simply advises the protocol to execute coherence operations. Any misspeculation results in additional coherence operations but does not interfere with protocol functionality.
– E.g., a premature write invalidation results in an additional read/write request by the producer.

• MSP will advise the protocol to send read-only block copies to requesters.

Verification of accuracy

• A reference bit in the remote cache marks every block placed speculatively

• On an actual reference, the remote cache clears the bit, verifying that the access occurred.

• On invalidation of this block, the reference bit is sent along with the invalidation message

• The MSP at the home node examines this bit and removes mispredicted messages.
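
A minimal sketch of the reference-bit check follows, assuming the bit is set when a block is placed speculatively and cleared on the first real access, as the bullets above describe. The class and method names are hypothetical, and the home-node side is reduced to inspecting the returned bit.

```python
class RemoteCache:
    """Illustrative remote cache tracking speculatively placed blocks."""

    def __init__(self):
        # block -> True while a speculatively placed copy is still untouched
        self.unreferenced = {}

    def place_speculative(self, block):
        self.unreferenced[block] = True

    def reference(self, block):
        """An actual access clears the bit, confirming the speculation."""
        if block in self.unreferenced:
            self.unreferenced[block] = False

    def invalidate(self, block):
        """Drop the block and return the bit that travels back to the home."""
        return self.unreferenced.pop(block, False)


cache = RemoteCache()
cache.place_speculative(0xA0)
cache.reference(0xA0)
cache.place_speculative(0xB0)
# A bit that is still set means the copy was never used (a misprediction).
print(cache.invalidate(0xA0), cache.invalidate(0xB0))
```

The home node would treat a returned bit that is still set as a misprediction and remove the corresponding entry from its pattern table.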

Performance Analysis

• Performance depends on:
– Speculation accuracy
– Reduction in latency on successful speculation
– Misspeculation penalty
– Speculation opportunity
– A computationally intensive application will benefit little from speculation.

• Assumptions:
– When a speculative memory request is successfully executed, the entire remote latency is hidden
– Misspeculation only slows the remote access; it does not increase the request frequency

Performance

• Performance Model:– c : Application’s communication ratio

– f : fraction of speculatively executed instructions over all received requests

– p : request prediction accuracy

– laccess : local access latency

– raccess: remote access latency

– rtl : raccess /laccess

– n: misspeculation penalty factor

– N: number of remote requests on the critical path

Performance

• Communication speedup = (comm time w/o speculation) / (comm time w/ speculation):

$$
\text{comm\_speedup}
= \frac{N \, r_{\mathrm{access}}}{(1-f)\, N \, r_{\mathrm{access}} + f N \bigl( p \, l_{\mathrm{access}} + (1-p)\, n \, r_{\mathrm{access}} \bigr)}
= \frac{1}{(1-f) + f \left( \dfrac{p}{rtl} + n(1-p) \right)}
$$

• Total speedup = (total execution time w/o speculation) / (total execution time w/ speculation):

$$
\text{total\_speedup} = \frac{1}{(1-c) + \dfrac{c}{\text{comm\_speedup}}}
$$
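
The speedup model above is easy to evaluate numerically. The sketch below is a direct transcription of the two formulas using the slide's symbols (c, f, p, rtl, n); the function names and the sample parameter values are illustrative assumptions, not measurements from the paper.

```python
def comm_speedup(f, p, rtl, n):
    """Communication speedup: 1 / ((1-f) + f * (p/rtl + n*(1-p)))."""
    return 1.0 / ((1.0 - f) + f * (p / rtl + n * (1.0 - p)))


def total_speedup(c, f, p, rtl, n):
    """Total speedup: 1 / ((1-c) + c / comm_speedup)."""
    return 1.0 / ((1.0 - c) + c / comm_speedup(f, p, rtl, n))


# Sample values chosen only to exercise the formulas.
print(comm_speedup(f=1.0, p=1.0, rtl=4.0, n=1.0))
print(total_speedup(c=0.5, f=1.0, p=1.0, rtl=4.0, n=1.0))
```

With perfect prediction and every request speculated, communication speeds up by exactly rtl, and the total speedup is then bounded by the communication ratio c in the usual Amdahl's-law fashion.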

Speedup vs various parameters

Potential Speedup in a speculative coherent DSM

Speedups

• Prediction accuracy plays a prominent role in speedup
– A low prediction accuracy of 10-50% results in slowdown due to high speculation overhead, while a high prediction accuracy (90%) increases speedup even for moderate communication ratios.
– At high prediction rates, the slowdown due to an increasing misspeculation penalty is not significant
– f, the fraction of speculated instructions, is a measure of the number of request messages it takes to learn and predict. For rapidly changing patterns, even at high prediction accuracy, the performance improvement will not be significant.
– A speculative coherent protocol impacts clusters most because of their high rtl ratio.

Simulation & results

• Wisconsin Wind Tunnel II is used to simulate a CC-NUMA with 16 nodes interconnected through hardware DSM boards to a low-latency switched network.

• Full-map write-invalidate protocol with 32-byte coherence blocks.

• Benchmarks: appbt, barnes, em3d, moldyn, ocean, tomcatv, unstructured.

Results

Base predictor accuracy comparison(history depth 1)

Results

• Em3d and Moldyn exhibit producer/consumer sharing with small read sharing => low impact of read ordering => high performance with MSP.

• Unstructured exhibits wide read-sharing in its producer/consumer phase, hence MSP gets a prediction accuracy of less than 65% while VMSP gets almost 85%.

Results

Prediction accuracy with varying history depths

Results

Messages predicted(correctly predicted) for a history depth of 1

Results

Predictor storage overhead

Results

• All predictors use 4 bits to encode a processor id
• Cosmos uses 3 bits to encode a message type => 7 bits per history table entry and 14 bits per pattern table entry (pte) => (7+14) bits per block
• MSP and VMSP use 2 bits to encode a message type
• MSP uses 12 bits per pte => (6+12) bits per block
• VMSP uses 18 bits per history table entry, but (18+6) bits per pte => (18+24) bits per block (in VMSP a read vector is always followed by a write/upgrade and vice versa). A pte will contain at most one entry.
• MSP and VMSP require less storage compared to Cosmos.
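
The per-block bit counts above can be reproduced with a little arithmetic. The sketch assumes that a pattern table entry holds one history entry plus one predicted message, and that VMSP's 18-bit history entry is a 16-bit sharer vector (one bit per node of the 16-node system) plus a 2-bit type; both are interpretations that match the quoted numbers rather than statements from the slides.

```python
PID_BITS = 4   # processor id width, from the slide
NODES = 16     # assumption: one vector bit per node of the 16-node system


def cosmos_bits():
    msg = PID_BITS + 3              # <sender, 3-bit type>
    history, pte = msg, msg + msg   # pte: history + predicted message
    return history, pte, history + pte


def msp_bits():
    msg = PID_BITS + 2              # <sender, 2-bit type>
    history, pte = msg, msg + msg
    return history, pte, history + pte


def vmsp_bits():
    vector = NODES + 2              # read vector plus 2-bit type
    single = PID_BITS + 2           # one write/upgrade message
    history, pte = vector, vector + single
    return history, pte, history + pte


for name, fn in [("Cosmos", cosmos_bits), ("MSP", msp_bits), ("VMSP", vmsp_bits)]:
    print(name, fn())  # (history bits, pte bits, bits per block)
```

These are per-entry sizes only; the overall storage also depends on how many pattern table entries each predictor ends up allocating per block.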

Summary and Conclusion

• Proposed the Memory Sharing Predictor to predict and execute coherence protocols speculatively.
• MSP eliminates acknowledgement messages in pattern tables and increases prediction accuracy from 81% to 86%.
• VMSP further improves accuracy up to 93% using compact vector representations and eliminating perturbations due to read request re-orderings.
• VMSP also reduces implementation storage.
• High-accuracy predictors are key to a high-performance speculative coherent DSM.

Discussions