Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

Post on 02-Feb-2016

25 views 0 download

Tags:

description

Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors. http://iacoma.cs.uiuc.edu. Motivation. CMPs are ubiquitous. Shared memory + caches = cache coherence. Traditional cache coherence solutions. shared bus-based: electrical, layout issues. - PowerPoint PPT Presentation

Transcript of Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

Uncorq:

Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

http://iacoma.cs.uiuc.edu

Karin Strauss AMD Advanced Architecture and Technology Lab

Xiaowei Shen IBM Research

Josep Torrellas University of Illinois at Urbana-Champaign

Karin Strauss - “Uncorq” 2QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Motivation

• CMPs are ubiquitous

• Shared memory + caches = cache coherence

• directory-based: indirection, storage

• shared bus-based: electrical, layout issues

• Traditional cache coherence solutions

Karin Strauss - “Uncorq” 3QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Embedded-ring cache coherence [ISCA 2006]

• logical ring is embedded in network

• control messages use ring

Simple and inexpensive to implement

• Novel snoopy cache coherence for mid-sized machines

• data messages use any path

Snoop requests can have long latencies

Karin Strauss - “Uncorq” 4QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Contributions

• Propose invariant for transaction serialization

• Propose performance enhancements

• reduces cache-to-cache transfer latency

• Uncorq: unconstrained snoop request delivery

• Simple hardware data prefetching technique

• reduces memory-to-cache transfer latency

Karin Strauss - “Uncorq” 5QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

logical ring

Embedded-ring terminology

• snoop request• snoop response• snoop request + response

• data

• Types of messages:

+requestresponse

snoop op. outcome

positive snoop op. outcome

+ positive response

data

A B

control messages

• Snoopy, invalidate protocol

responserequest

• Single supplier protocol

request

Ordering invariant

Karin Strauss - “Uncorq” 7QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Transaction serialization

inv

inv

ack

ack

read

data S

MSI

S A BS I S

timeinv

I

old value new value

Karin Strauss - “Uncorq” 8QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Serialization enforcement with embedded-ring

• Logical unidirectional ring provides partial ordering

• Distributed algorithm establishes global order for same-address transactions

• one is declared the “winner” (first to reach supplier)• others have to retry

• On simultaneous transactions to same address:

A

requestrequest requestrequestresponseresponse responseresponse

Karin Strauss - “Uncorq” 9QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

How to serialize transactions

A

B

S

A’s request and response

B’s request and response

No clear “first” transaction

B’s request reaches S first

A receives B’s positive response before its own

A retries: B A

Ring guarantees responsesare forwarded in the order Sperformed snoop operations

+

+

Karin Strauss - “Uncorq” 10QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Enforcing transaction serialization

Ordering Invariant: the order in which responses travel the ring after leaving the supplier must be the same as the order in which the supplier processed their corresponding requests.

• Node whose request arrives at supplier node first is the “winner”

• What we need to enforce transaction serialization:

loser node sees other node’s positive response before its own

+

S

+request

response

Uncorq:Unconstrained snoop

request delivery

Karin Strauss - “Uncorq” 12QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Uncorq idea

Baselinerequest

response

Idea: requests do not have to follow the ring(but responses do)

Uncorq

Karin Strauss - “Uncorq” 13QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Benefit of Uncorq

Baseline

Reduced cache-to-cache transfer latency

Uncorq

savings

request snoop data

request reachessupplier node

time

Karin Strauss - “Uncorq” 14QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Implications of Uncorq

• Uncorq no longer restricts order of requests

• Nodes may receive and process requests in any order

• Responses may also get reordered

Problem: distributed algorithm relies on the fact that response order reflects order of requests at supplier

Karin Strauss - “Uncorq” 15QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Example: incorrect transaction ordering

A

B

S

S

++ +

S

Ordering invariant

A node cannot forward any other response ifit has an outstanding positive snoop outcome

request

response

Karin Strauss - “Uncorq” 16QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

How Uncorq stalls responses

+ +

addr Crequestsresponses

+

• Local transaction table (per-node structure)

A B …C

• records messages that node is currently processing

+request

response

Karin Strauss - “Uncorq” 17QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Optimization: prefetching from memory

• Predict when no node will supply data

• Access memory in parallel with ring snoop

R

memory(1)

(2)

R

memory

(1)

(1)

• Goal: reduce latency of memory-to-cache transfers

unoptimized optimized

Evaluation

Karin Strauss - “Uncorq” 19QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

• 64 nodes in a single CMP

Experimental setup

• SESC simulator (sesc.sourceforge.net)

• SPLASH-2, SPECjbb and SPECweb workloads

• Interconnection network: 2D torus with embedded-ring

Karin Strauss - “Uncorq” 20QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Cache-to-cache transfer latency

substantial reduction in latencyUncorq

0

2

4

6

8

10

0 100 200 300 400 500 600

distribution (%)

0

20

40

60

80

100

cumulative

distribution (%)

cache-to-cache transfer latency

Baseline

0

2

4

6

8

10

0 100 200 300 400 500 600

distribution (%)

0

20

40

60

80

100

cumulative

distribution (%)

cache-to-cache transfer latency

Karin Strauss - “Uncorq” 21QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Execution Time

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

SPLASH-2 SPECjbb SPECweb

normalized

execution time

BaselineUncorqUncorq+Pref

• Uncorq + Pref performs the best (reduction: 13-26%)

• Uncorq significantly reduces execution time (reduction: 5-23%)

Karin Strauss - “Uncorq” 22QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Also in the paper

• Serialization mechanism for case with no supplier

• System and node forward progress

• Fences and memory consistency issues

• Characterization of prefetching mechanism

• Comparison against ccHyperTransport

Karin Strauss - “Uncorq” 23QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Conclusion

• Propose invariant for transaction serialization

• Propose performance enhancements• Uncorq: unconstrained snoop request delivery

• Reduce execution time by 13-26%

• Simple hardware data prefetching technique

Uncorq:

Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

http://iacoma.cs.uiuc.edu

Karin Strauss AMD Advanced Architecture and Technology Lab

Xiaowei Shen IBM Research

Josep Torrellas University of Illinois at Urbana-Champaign