
1

Hardware Transactional Memory (Herlihy, Moss, 1993)

Some slides are taken from a presentation by Royi Maimon & Merav Havuv, prepared for a seminar given by Prof. Yehuda Afek.

2

Outline

Hardware Transactional Memory (HTM)
– Transactions
– Caches and coherence protocols
– General Implementation
– Simulation

3

What is a transaction?

A transaction is a sequence of memory loads and stores executed by a single process that either commits or aborts

If a transaction commits, all the loads and stores appear to have executed atomically

If a transaction aborts, none of its stores take effect

Transaction operations aren't visible until they commit (if they do)

4

Transactional Memory

A new multiprocessor architecture.

The goal: implementing non-blocking synchronization that is
– efficient
– easy to use
compared with conventional techniques based on mutual exclusion.

Implemented by straightforward extensions to multiprocessor cache-coherence protocols and/or by software mechanisms.

5

Outline

Hardware Transactional Memory (HTM)
– Transactions
– Caches and coherence protocols
– General Implementation
– Simulation

6

A cache is an associative (a.k.a. content-addressable) memory

Conventional memory: given an address A, it returns the data stored at A (Data @A).

Associative memory: given a datum D, it returns an address A such that *A = D.
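To make the contrast concrete, here is a small C sketch (illustrative only, not from the slides): a conventional lookup selects a cell by address, while an associative lookup compares the requested address against every slot's tag.

#include <stdio.h>

#define LINES 4

/* A toy fully associative store: each slot remembers which address it holds. */
struct slot { int valid; unsigned addr; unsigned data; };
static struct slot assoc[LINES];

/* Conventional memory: the address directly selects the cell. */
static unsigned memory[16];
static unsigned conventional_read(unsigned addr) { return memory[addr]; }

/* Associative lookup: compare the requested address against every slot
 * (a sequential scan stands in for the hardware's parallel comparators). */
static int associative_read(unsigned addr, unsigned *out) {
    for (int i = 0; i < LINES; i++)
        if (assoc[i].valid && assoc[i].addr == addr) { *out = assoc[i].data; return 1; }  /* hit */
    return 0;                                                                             /* miss */
}

int main(void) {
    memory[3] = 42;
    assoc[1] = (struct slot){ 1, 3, 42 };
    unsigned v = 0;
    int hit = associative_read(3, &v);
    printf("conventional: %u, associative: hit=%d value=%u\n", conventional_read(3), hit, v);
    return 0;
}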

7

Cache Associativity

8

Fully associative cache

9

Cache tags and address structure

(Figure: main memory and cache.) Indexes and tags are typically high-order address bits.
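To make the address breakdown concrete, here is a small sketch; the field widths (32-byte lines, 128 sets) are made-up values for illustration, not taken from the figure.

#include <stdio.h>

/* Hypothetical geometry: 32-byte lines and 128 sets => 5 offset bits, 7 index bits,
 * and the remaining high-order bits form the tag. */
#define OFFSET_BITS 5
#define INDEX_BITS  7

unsigned line_offset(unsigned addr) { return addr & ((1u << OFFSET_BITS) - 1); }
unsigned set_index(unsigned addr)   { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
unsigned tag(unsigned addr)         { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    unsigned addr = 0x12345678;
    printf("addr=0x%x tag=0x%x index=%u offset=%u\n",
           addr, tag(addr), set_index(addr), line_offset(addr));
    return 0;
}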

10

Cache-Coherence Protocol

In multiprocessors, each processor typically has its own local cache memory

– Minimize average latency due to memory access
– Decrease bus traffic
– Maximize cache hit ratio

A cache-coherence protocol manages the consistency of caches and main memory:

– Shared-memory semantics are maintained
– Caches and main memory communicate to guarantee coherency

11

The need to maintain coherency

Figure taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson

12

Coherency requirements

Text taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson

13

Snoopy Cache

All caches monitor (snoop) the activity on a global bus/interconnect to determine if they have a copy of the block of data that is requested on the bus.

14

Coherence protocol types

Write-through: the information is written to both the cache block and to the block in the lower-level memory

Write-back: the information is written only to the cache block. The modified cache block is written to main memory only when it is replaced
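A minimal sketch of how the two policies differ on a store and on eviction, assuming a toy one-line cache whose line already holds the addressed block (names and structure are illustrative, not from the slides):

#include <stdbool.h>

enum policy { WRITE_THROUGH, WRITE_BACK };

static unsigned memory[256];
static struct { bool valid, dirty; unsigned addr, data; } line;

/* Store to an address that is assumed to already be cached in 'line'. */
void store(enum policy p, unsigned addr, unsigned value) {
    line.data = value;                 /* always update the cache block */
    if (p == WRITE_THROUGH)
        memory[addr] = value;          /* write-through: update memory immediately */
    else
        line.dirty = true;             /* write-back: defer until the line is replaced */
}

/* On replacement, a write-back cache must flush a dirty line to memory. */
void evict(enum policy p) {
    if (p == WRITE_BACK && line.valid && line.dirty)
        memory[line.addr] = line.data;
    line.valid = line.dirty = false;
}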

15

3-state Coherence protocol

Invalid: cache line/block does not contain legal information

Shared: cache line/block contains information that may be shared by other caches

Modified/exclusive: cache line/block was modified while in cache and is exclusively owned by current cache

16

Cache-coherency mechanism (write-back) – state transition diagram

Figure taken from the book “Computer Architecture: A Quantitative Approach” by Hennessy and Patterson

Transitions based on processor requests

Transitions based on bus requests
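The transition diagrams themselves are not reproduced here, but the sketch below conveys the structure they describe: one handler for transitions triggered by processor requests and one for transitions triggered by snooped bus requests, for a simple 3-state (Invalid/Shared/Modified) write-back protocol. The event names and exact transitions are a common textbook formulation, not a transcription of the figure.

typedef enum { INVALID, SHARED, MODIFIED } line_state;

typedef enum { PROC_READ, PROC_WRITE } proc_event;           /* processor-side events */
typedef enum { BUS_READ, BUS_READ_EXCLUSIVE } bus_event;     /* snooped bus events */

/* Transition on a request from the local processor.
 * A read miss fetches the line Shared; a write obtains exclusive ownership
 * (broadcasting a read-exclusive/invalidate on the bus) and moves to Modified. */
line_state on_processor_request(line_state s, proc_event e) {
    if (e == PROC_READ)
        return (s == INVALID) ? SHARED : s;
    return MODIFIED;                                         /* PROC_WRITE */
}

/* Transition on a request snooped from the bus.
 * Another reader forces a Modified line to be written back and downgraded to Shared;
 * another writer (read-exclusive) invalidates our copy. */
line_state on_bus_request(line_state s, bus_event e) {
    if (s == INVALID) return INVALID;
    if (e == BUS_READ)
        return SHARED;          /* if we were MODIFIED, write the block back first */
    return INVALID;             /* BUS_READ_EXCLUSIVE: another cache wants to write */
}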

18

MESI protocol (Goodman, 1983)

Cache line status:

                           M (Modified)  E (Exclusive)  S (Shared)  I (Invalid)
Is line valid?             Yes           Yes            Yes         No
Main memory updated?       No            Yes            Yes         __
Other cache copies exist?  No            No             Maybe       __
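Rendered as code, the table reads as follows (a direct transcription of the rows above, with an added comment, not from the slide, on why the Exclusive state is useful):

#include <stdbool.h>

typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state;

bool line_valid(mesi_state s)            { return s != MESI_I; }                /* row 1 */
bool memory_up_to_date(mesi_state s)     { return s == MESI_E || s == MESI_S; } /* row 2 */
bool other_copies_possible(mesi_state s) { return s == MESI_S; }                /* row 3: "maybe" only in S */

mesi_state on_local_write_hit(mesi_state s) {
    /* From E the upgrade to M is silent (no bus transaction, since no other cache
     * has a copy); from S it would additionally require a bus invalidate. */
    (void)s;
    return MESI_M;
}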

19

Outline

Hardware Transactional Memory (HTM)
– Transactions
– Caches and coherence protocols
– General Implementation
– Simulation

20

HTM-supported API

The following primitive instructions for accessing memory are provided:

Load-transactional (LT): reads value of a shared memory location into a private register.

Load-transactional-exclusive (LTX): Like LT, but “hinting” that the location is likely to be modified.

Store-transactional (ST): tentatively writes a value from a private register to a shared memory location.

Commit (COMMIT): attempts to make the transaction's tentative changes permanent.

Abort (ABORT): discards the transaction's tentative changes.

Validate (VALIDATE): tests the current transaction status.
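For the examples that follow, the primitives can be pictured with signatures like these. The concrete C signatures are an assumption for illustration; in the Herlihy–Moss proposal they are processor instructions, not library functions.

#include <stdbool.h>

typedef unsigned long Word;

Word LT(Word *addr);              /* Load-transactional */
Word LTX(Word *addr);             /* Load-transactional-exclusive */
void ST(Word *addr, Word value);  /* Store-transactional (tentative) */
bool COMMIT(void);                /* try to make the tentative writes permanent */
void ABORT(void);                 /* discard the tentative writes */
bool VALIDATE(void);              /* test the current transaction status */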

21

Some definitions

Read set: the set of locations read by LT instructions issued by a transaction

Write set: the set of locations accessed by LTX or ST issued by a transaction

Data set (footprint): the union of the read and write sets.

A set of values in memory is inconsistent if it couldn’t have been produced by any serial execution of transactions

22

Intended Use

Instead of acquiring a lock, executing the critical section, and releasing the lock, a process would:

1. use LT or LTX to read from a set of locations,
2. use VALIDATE to check that the values read are consistent,
3. use ST to modify a set of locations,
4. use COMMIT to make the changes permanent.

If either the VALIDATE or the COMMIT fails, the process returns to Step (1).
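A sketch of this pattern on a shared counter, treating the primitives as the hypothetical C functions declared above (the counter example and retry loop are illustrative, not from the slides):

/* Atomically increment a shared counter using the transactional primitives. */
void counter_inc(Word *counter) {
    while (1) {
        Word v = LTX(counter);        /* 1. read, hinting that we will write it */
        if (!VALIDATE())              /* 2. are the values read so far consistent? */
            continue;                 /*    no: start over */
        ST(counter, v + 1);           /* 3. tentatively write the new value */
        if (COMMIT())                 /* 4. try to make the change permanent */
            break;
        /* commit failed: fall through and retry (ideally with back-off) */
    }
}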

23

Implementation

Hardware transactional memory is implemented by modifying standard multiprocessor cache coherence protocols

Herlihy and Moss suggested extending the “snoopy” cache protocol for a shared bus to support transactional memory.

This supports short-lived transactions with relatively small data sets.

24

The basic idea

Any protocol capable of detecting memory access conflicts can also detect transaction conflicts at no extra cost

Once a transaction conflict is detected, it can be resolved in a variety of ways

25

Implementation

Each processor maintains two caches:
– A regular cache for non-transactional operations
– A transactional cache: a small, fully associative cache for transactional operations. It holds all the tentative writes, without propagating them to other processors or to main memory until commit.

An entry may reside in one cache or the other, but not in both

26

Cache line states

Each cache line (regular or transactional) has one of the coherence-protocol states seen earlier (e.g. Invalid, Shared, Exclusive, Modified).

Each transactional cache line has, in addition, one of these states:
– TC_INVALID: the entry is empty
– TC_NORMAL: ordinary committed data
– TC_COMMIT: holds the “old” value, discarded when the transaction commits
– TC_ABORT: holds the “new” (tentative) value, discarded when the transaction aborts
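As a sketch, an entry in the transactional cache can be pictured as carrying both kinds of state (field names are illustrative, not from the paper's hardware description):

/* Coherence state of the line, as in the protocols above. */
typedef enum { LINE_INVALID, LINE_SHARED, LINE_EXCLUSIVE, LINE_MODIFIED } line_state;

/* Additional transactional tag carried by each transactional-cache entry. */
typedef enum {
    TC_INVALID,   /* entry is free */
    TC_NORMAL,    /* ordinary committed data */
    TC_COMMIT,    /* old value: discarded when the transaction commits */
    TC_ABORT      /* tentative new value: discarded when the transaction aborts */
} tc_tag;

typedef struct {
    line_state state;     /* coherence-protocol state */
    tc_tag     tag;       /* transactional tag */
    unsigned   addr;      /* address (tag bits) of the cached block */
    unsigned   value;     /* the data itself (a single word, for simplicity) */
} tc_entry;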

27

Cleanup

When the transactional cache needs space for a new entry, it searches for:
– a TC_INVALID entry
– if none, a TC_NORMAL entry
– finally, a TC_COMMIT entry (why can such entries be replaced?)

28

Processor actions

Each processor maintains two flags:
– The transaction active (TACTIVE) flag: indicates whether a transaction is in progress
– The transaction status (TSTATUS) flag: indicates whether that transaction is active (True) or aborted (False)

Non-transactional operations behave exactly as in the original cache-coherence protocol

29

Example – LT operation:

1. Look for a TC_ABORT entry in the transactional cache. If found, return its value.

2. Not found? Look for a TC_NORMAL entry. If found, change it to TC_ABORT, allocate another TC_COMMIT entry with the same value, and return the value.

3. Not found (cache miss)? Ask to read the block from shared memory:
– On a successful read, create two entries, TC_ABORT and TC_COMMIT, and return the value.
– On a BUSY signal, abort the transaction: set TSTATUS = FALSE, drop all TC_ABORT entries, and set all TC_COMMIT entries to TC_NORMAL.
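The same flow, written as a compact software model (illustrative only: real HTM does this in the cache hardware, the bus is reduced to a stub that always succeeds, and NULL checks on allocation are omitted):

#include <stdbool.h>
#include <stddef.h>

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_tag;
typedef struct { tc_tag tag; unsigned addr; unsigned value; } tc_entry;

#define TC_LINES 64
static tc_entry tcache[TC_LINES];
static bool TSTATUS = true;                       /* true = active, false = aborted */

/* Stand-in for the bus/shared memory; a real model could also return BUSY. */
static unsigned shared_memory[1024];
static bool bus_read(unsigned addr, unsigned *value) {
    *value = shared_memory[addr % 1024];
    return true;                                  /* false would mean a BUSY signal */
}

static tc_entry *find(unsigned addr, tc_tag tag) {
    for (size_t i = 0; i < TC_LINES; i++)
        if (tcache[i].tag == tag && tcache[i].addr == addr) return &tcache[i];
    return NULL;
}

/* Grab a free entry (the full policy would also evict TC_NORMAL, then TC_COMMIT
 * entries when no TC_INVALID entry is available). */
static tc_entry *alloc_entry(void) {
    for (size_t i = 0; i < TC_LINES; i++)
        if (tcache[i].tag == TC_INVALID) return &tcache[i];
    return NULL;
}

static void abort_transaction(void) {
    TSTATUS = false;
    for (size_t i = 0; i < TC_LINES; i++) {
        if (tcache[i].tag == TC_ABORT)  tcache[i].tag = TC_INVALID;  /* drop tentative values */
        if (tcache[i].tag == TC_COMMIT) tcache[i].tag = TC_NORMAL;   /* keep the old values */
    }
}

unsigned LT(unsigned addr) {
    tc_entry *e;
    if ((e = find(addr, TC_ABORT)) != NULL)       /* 1. tentative copy already present */
        return e->value;
    if ((e = find(addr, TC_NORMAL)) != NULL) {    /* 2. committed copy present */
        tc_entry *old = alloc_entry();
        *old = (tc_entry){ TC_COMMIT, addr, e->value };  /* keep the old value */
        e->tag = TC_ABORT;                               /* this copy becomes tentative */
        return e->value;
    }
    unsigned v;                                   /* 3. cache miss: go to the bus */
    if (bus_read(addr, &v)) {
        tc_entry *a = alloc_entry();
        *a = (tc_entry){ TC_ABORT, addr, v };
        tc_entry *c = alloc_entry();
        *c = (tc_entry){ TC_COMMIT, addr, v };
        return v;
    }
    abort_transaction();                          /* BUSY signal: abort */
    return 0;                                     /* return value is meaningless once aborted */
}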

30

Snoopy cache actions:

Both the regular cache and the transactional cache snoop on the bus.

A cache ignores any bus cycles for lines not in that cache.

The transactional cache's behavior:
– If TSTATUS = False, or if the operation isn't transactional, the cache acts just like the regular cache, but ignores entries with a state other than TC_NORMAL
– Otherwise: on an LT by another CPU, if the state is TC_NORMAL or the line has not been written to, the cache returns the value; in all other cases it returns BUSY

31

Committing/aborting a transaction

Upon commit:
– Set all entries tagged TC_COMMIT to TC_INVALID
– Set all entries tagged TC_ABORT to TC_NORMAL

Upon abort:
– Set all entries tagged TC_ABORT to TC_INVALID
– Set all entries tagged TC_COMMIT to TC_NORMAL

Since the transactional cache is small, it is assumed that these operations can be performed in parallel.
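As code, commit and abort are just a re-tagging pass over the transactional cache's entries (a sketch; the slide's point is that hardware can change all the tags at once rather than looping):

typedef enum { TC_INVALID, TC_NORMAL, TC_COMMIT, TC_ABORT } tc_tag;

#define TC_LINES 64
static tc_tag tags[TC_LINES];    /* just the transactional tags, for brevity */

void commit_transaction(void) {
    for (int i = 0; i < TC_LINES; i++) {
        if (tags[i] == TC_COMMIT)     tags[i] = TC_INVALID;  /* old values no longer needed */
        else if (tags[i] == TC_ABORT) tags[i] = TC_NORMAL;   /* tentative values become real */
    }
}

void abort_transaction(void) {
    for (int i = 0; i < TC_LINES; i++) {
        if (tags[i] == TC_ABORT)       tags[i] = TC_INVALID; /* tentative values are dropped */
        else if (tags[i] == TC_COMMIT) tags[i] = TC_NORMAL;  /* old values are kept */
    }
}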

32

Outline

Lock-Free Hardware Transactional Memory (HTM)
– Transactions
– Caches and coherence protocols
– General Implementation
– Simulation

33

Simulation

We'll see example code for the producer/consumer algorithm using the transactional memory architecture.

The simulation runs on both cache-coherence protocols: snoopy cache and directory-based.

The simulation uses 32 processors. The simulation finishes when 2^16 operations have completed.

34

Part Of Producer/Consumer Code

typedef struct {
    Word deqs;                       /* holds the head's index */
    Word enqs;                       /* holds the tail's index */
    Word items[QUEUE_SIZE];
} queue;

unsigned queue_deq(queue *q) {
    unsigned head, tail, result;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;
    while (1) {
        result = QUEUE_EMPTY;
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (head != tail) {                              /* queue not empty? */
            result = LT(&q->items[head % QUEUE_SIZE]);
            ST(&q->deqs, head + 1);                      /* advance counter */
        }
        if (COMMIT()) break;
        /* abort => exponential backoff */
        wait = random() % (1 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}
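The matching enqueue is not shown on the slide; the sketch below assumes the same primitives and constants as queue_deq, plus hypothetical QUEUE_FULL and QUEUE_OK status codes.

unsigned queue_enq(queue *q, unsigned value) {
    unsigned head, tail, result;
    unsigned backoff = BACKOFF_MIN;
    unsigned wait;
    while (1) {
        result = QUEUE_FULL;                        /* assumed status code */
        tail = LTX(&q->enqs);
        head = LTX(&q->deqs);
        if (tail - head < QUEUE_SIZE) {             /* queue not full? */
            ST(&q->items[tail % QUEUE_SIZE], value);
            ST(&q->enqs, tail + 1);                 /* advance counter */
            result = QUEUE_OK;                      /* assumed status code */
        }
        if (COMMIT()) break;
        /* abort => exponential backoff, as in queue_deq */
        wait = random() % (1 << backoff);
        while (wait--);
        if (backoff < BACKOFF_MAX) backoff++;
    }
    return result;
}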

35

The results:

(Result plots: snoopy cache vs. directory-based coherence.)

36

Key Limitations:

– Transaction size is limited by the cache size

– Transaction length is effectively limited by the scheduling quantum

– Process migration is problematic

37

MSA: A few sample research directions

Theoretical

o Are there counters/stacks/queues with sub-linear write-contention?

o What is the space complexity of obstruction-free read/write consensus?

o What is the step-complexity of a 1-time read/write counter?

o ...

(More) practical

o The design of efficient lock-free/blocking concurrent objects

o Defining more realistic metrics for blocking synchronization, and designing algorithms that are efficient w.r.t. these metrics

o Improving the usability of transactional memory

o ...