Software Transactional Memory TiC 2010 Adam Welc Programming Systems Lab Intel Labs.

Post on 21-Jan-2016


Software Transactional Memory

TiC 2010

Adam Welc

Programming Systems Lab, Intel Labs

2

Agenda

Part 1: STM Overview
– Introduction
– Language Constructs and Semantics
– Design space

Part 2: STM Implementation
– Runtime
– Compiler
– Performance

3

Concurrent Programming Today

• Mutual exclusion locks (Java monitors, pthread locks, etc.) used for concurrency control
  – Coarse-grained locking limits concurrency
  – Fine-grained locking is hard: composability, possibility of deadlocks, etc.
• Transactional Memory (TM) offers an alternative

4

Designing Map Structure

• Operations
    get(Key k)           { seqGet(k); }
    put(Key k, Value v)  { seqPut(k, v); }
    remove(Key k)        { seqRemove(k); }

• Concurrent callers
    T1: m.get(k);
    T2: m.put(k, v);
    T3: m.remove(k);

• How to make it thread-safe?

5

ConcurrentMap Class

    synchronized Value get(Key k) {
        return seqGet(k);
    }

    synchronized void put(Key k, Value v) {
        seqPut(k, v);
    }

    synchronized void remove(Key k) {
        seqRemove(k);
    }

What if the workload is mostly read-only?

6

Refined ConcurrentMap Class

    Value get(Key k) {
        // try unsynchronized
        Value tmp = seqGet(k);
        if (tmp != null) return tmp;
        else synchronized(this) {
            // possible interference
            return seqGet(k);
        }
    }

    void put(Key k, Value v) {
        synchronized(this) {
            seqPut(k, v);
        }
    }

    void remove(Key k) {
        synchronized(this) {
            seqRemove(k);
        }
    }

7

Actual Code

    public Object get(Object key) {
        int hash = hash(key);
        // Try first without locking...
        Entry[] tab = table;
        int index = hash & (tab.length - 1);
        Entry first = tab[index];
        Entry e;

        for (e = first; e != null; e = e.next) {
            if (e.hash == hash && eq(key, e.key)) {
                Object value = e.value;
                if (value != null)
                    return value;
                else
                    break;
            }
        }

        // Recheck under synch if key not there or interference
        Segment seg = segments[hash & SEGMENT_MASK];
        synchronized(seg) {
            tab = table;
            index = hash & (tab.length - 1);
            Entry newFirst = tab[index];
            if (e != null || first != newFirst) {
                for (e = newFirst; e != null; e = e.next) {
                    if (e.hash == hash && eq(key, e.key))
                        return e.value;
                }
            }
            return null;
        }
    }

DO YOU REALLY WANT TO WRITE THIS KIND OF CODE?

8

Composition

• Simple concurrent accesses work
• Consider concurrent value deposit (initially map.get(k) == 100)

    T1:
        int v1 = map.get(k);    // == 100
        v1 += 10;               // == 110
        map.put(k, v1);         // == 110

    T2:
        int v2 = map.get(k);    // == 100
        v2 += 20;               // == 120
        map.put(k, v2);         // == 120

Both threads read 100, so one of the two deposits IS LOST.

Fix: wrap each sequence in synchronized(map) { }

Back to coarse-grained locking

9

TM Approach

    get(Key k)           { __tm_atomic { seqGet(k); } }
    put(Key k, Value v)  { __tm_atomic { seqPut(k, v); } }
    remove(Key k)        { __tm_atomic { seqRemove(k); } }

    __tm_atomic {
        int v = map.get(k);
        v += amount;
        map.put(k, v);
    }

Let TM system take care of the rest
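The lost deposit and its fix can be replayed deterministically. The sketch below is only an illustration in standard C++: `Account`, `deposit`, and the two helper functions are hypothetical names, and a `std::mutex` stands in for the `__tm_atomic` block, since standard C++ has no such keyword. The bad interleaving is replayed by hand on one thread rather than with real threads.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for the deck's map; not part of any real STM API.
struct Account {
    std::map<std::string, int> m;
    std::mutex lock;  // stands in for the atomic block

    int get(const std::string& k) {
        std::lock_guard<std::mutex> g(lock);
        return m[k];
    }
    void put(const std::string& k, int v) {
        std::lock_guard<std::mutex> g(lock);
        m[k] = v;
    }
    // Composed operation: the whole read-modify-write is one critical
    // section, which is what one enclosing __tm_atomic block expresses.
    void deposit(const std::string& k, int amount) {
        std::lock_guard<std::mutex> g(lock);
        m[k] += amount;
    }
};

// One possible bad interleaving of two get/put sequences, replayed by hand.
int lostUpdateBalance() {
    Account a;
    a.put("k", 100);
    int v1 = a.get("k");   // T1 reads 100
    int v2 = a.get("k");   // T2 reads 100
    a.put("k", v2 + 20);   // T2 writes 120
    a.put("k", v1 + 10);   // T1 writes 110 and overwrites T2's deposit
    return a.get("k");
}

// The same two deposits, each executed as one indivisible step.
int atomicDepositBalance() {
    Account a;
    a.put("k", 100);
    a.deposit("k", 10);
    a.deposit("k", 20);
    return a.get("k");
}
```

Calling `lostUpdateBalance()` yields 110 even though every individual `get` and `put` is thread-safe, while `atomicDepositBalance()` yields 130; the difference is purely the scope of the atomic region.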

10


11

Managed vs. Unmanaged STM

• Same core semantics and language constructs (and algorithms)
• Managed (e.g. Java, .NET)
  – Controlled execution of native code
  – Dynamic compilation
• Unmanaged (e.g. C, C++)
  – Problem with legacy binaries
  – Have to know upfront if code is executed transactionally

12

Atomic Blocks == Transactions

• Originally a database concept
• Transactional executions
  – Atomic
  – Consistent
  – Isolated
  – Durable
• Executions are serial or serializable
  – Serializable – appearance of serial

13

Serial Execution

    int x = 0; int y = 0;

    T1: __tm_atomic { int tmp1 = x; int tmp2 = y; }
    T2: __tm_atomic { x = 42; y = 42; }

T2 runs before T1: tmp1 == 42, tmp2 == 42
T1 runs before T2: tmp1 == 0, tmp2 == 0

BOTH RESULTS CORRECT

14

Serializable Execution

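    int x = 0; int y = 0;

    T1: __tm_atomic { int tmp1 = x; int tmp2 = y; }
    T2: __tm_atomic { x = 42; y = 42; }

The two blocks interleave (for example: T2 writes x, T1 reads x, T2 writes y, T1 reads y), yet tmp1 == 42 and tmp2 == 42, exactly as if T2 had run before T1.

BOTH RESULTS THE SAME DESPITE INTERLEAVING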

15

Non-Serializable Execution

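    int x = 0; int y = 0;

    T1: __tm_atomic { int tmp1 = x; int tmp2 = y; }
    T2: __tm_atomic { x = 42; y = 42; }

Interleaving:
    T1: int tmp1 = x;    // == 0
    T2: x = 42;
    T2: y = 42;
    T1: int tmp2 = y;    // == 42

tmp1 == 0 and tmp2 == 42: DIFFERENT FROM ANY SERIAL execution

TM’s role is to “fix” conflicting executions: ! CONFLICT ! -> ROLL BACK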

16

Transaction Nesting

• Required for composability
• Open nesting
  – Results exposed upon inner transaction commit
  – Compensating actions used upon outer transaction abort
  – May lead to serializability violations
• Closed nesting
  – Computation results exposed only upon outermost transaction commit
  – Transactions can be flattened - inner transaction is semantically a no-op

17

Open Nesting

• Conditional can be entered after inner commit

    int x = 0;
    void inc() { x++; }
    void dec() { x--; }

    T1:
        __tm_atomic {
            __tm_atomic { inc(); }    // register dec()
        }
        // if the outer transaction aborts, the compensating action runs:
        dec();

    T2:
        __tm_atomic { if (x == 1) { … } }

• Effect is undone but T2 has seen the result!

18

Closed Nesting

• Conditional can be entered only after outermost commit

    int x = 0;
    void inc() { x++; }
    void dec() { x--; }

    T1:
        __tm_atomic {
            __tm_atomic { inc(); }
        }

    T2:
        __tm_atomic { if (x == 1) { … } }

19

Flatten Or Not To Flatten?

    __tm_atomic {
        __tm_atomic {
            // potential conflict
        }
    }

ROLL BACK of the inner block only (nesting kept) vs. ROLL BACK of the whole outer block (transactions flattened)

More on Execution Semantics

• Transactions are serializable, but

• The notion comes from database world where all actions are transactional

• What about non-transactional code?

20

Problematic Behavior

    int * p = &x;

    T1: __tm_atomic { if (p != NULL) tmp = *p; }
    T2: p = null;

Interleaving:
    T1: if (p != NULL)    // true
    T2: p = null;
    T1: tmp = *p;         // p == null: NULL POINTER

Should this behavior be allowed?
– Yes: This program is buggy, p = null should be inside a transaction
– No: Transactions should be atomic no matter what

21

Two Points of View on Atomicity

• Weak atomicity
  – Transactions serializable with respect to other transactions
• Strong atomicity
  – Transactions serializable with respect to all memory accesses

WEAK ATOMICITY --- STRENGTH --> STRONG ATOMICITY

22

Weak Atomicity

• Non-transactional accesses bypass STM access protocol
  – Non-transactional code remains un-instrumented
  – Most STMs behave this way
• Requires segregation of transactional and non-transactional data
  – Hard to enforce
• Otherwise – behavior depends on implementation
  – Unexpected results can be observed

23

Non-Repeatable Read

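    int x = 0;

    T1: __tm_atomic { tmp1 = x; tmp2 = x; }
    T2: x = 42;

• Non-txn code can affect transactional computation

If T2's write does not fall between the two reads: tmp1 == tmp2 (both 0 or both 42)
If T2's write falls between the two reads: tmp1 == 0, tmp2 == 42, so tmp1 != tmp2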

24

Dirty Read

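    int x = 0;

    T1: __tm_atomic { x++; x++; }
    T2: tmp = x;

• Txn code can leak intermediate results to non-transactional computation

x goes 0 -> 1 -> 2 inside the transaction
Expected: tmp is even (0 or 2)
Observed if T2 reads between the two increments: tmp == 1, tmp is odd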

25

Strong Atomicity

• Non-transactional accesses turned into micro-transactions
  – Reads and writes block until write gets committed
  – Interleaved writes can invalidate a transaction
• Avoids all undesirable behaviors of weak atomicity, but
• All code needs to be instrumented

26

Non-Repeatable Read

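    int x = 0;

    T1: __tm_atomic { tmp1 = x; tmp2 = x; }
    T2: __tm_atomic { x = 42; }    // non-transactional write turned into a micro-transaction

• Write by T2 invalidates T1’s transaction

T1 reads tmp1 == 0, T2 writes x = 42, T1 must ROLL BACK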

27

Dirty Read

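    int x = 0;

    T1: __tm_atomic { x++; x++; }
    T2: __tm_atomic { tmp = x; }    // non-transactional read turned into a micro-transaction

• Blocking effectively reschedules and serializes non-transactional operations

After T1's first increment x == 1; T2's read must BLOCK until T1 commits with x == 2, so tmp == 2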

28

Are We Done?

•Overhead of strong atomicity can be huge (up to 10x slowdown)

•Non-txn code instrumentation may be problematic (precompiled libraries, system calls, etc.)

•Can we find an in-between solution?

WEAK ATOMICITY --- SGLA --- STRONG ATOMICITY (increasing STRENGTH)

29

Single Global Lock Atomicity

• Transactions execute as if protected by a single global lock

    __tm_atomic {            synchronized(m) {
        S;                       S;
    }                        }

• Matches intuition of weakly atomic STM
  – Transactions are serialized w.r.t. each other
  – And, no surprises compared to locks
• STM must provide additional guarantees
  – Consistency
  – Privatization safety

30

31

Consistency

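    int *ptr = NULL;
    int x = 0; int y = 1;

    T1 (transaction):
        __tm_atomic {
            int t1 = x;
            int t2 = x;
            if (t1 != t2)
                *ptr = x;    // cannot happen
        }

    T2 (transaction):
        __tm_atomic {
            x = y;
        }

    T1 (lock):
        lock(mutex);
        int t1 = x;
        int t2 = x;
        if (t1 != t2)
            *ptr = x;        // cannot happen
        unlock(mutex);

    T2 (lock):
        lock(mutex);
        x = y;
        unlock(mutex);

With locks the two reads of x always agree. An STM without the consistency guarantee could let T1 read t1 == 0, let T2 commit x == 1, and then let T1 read t2 == 1, so the branch is taken and *ptr dereferences a NULL POINTER.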

32

Privatization Safety

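Initially head points to a list node with fields x == 0, y == 0, and next.

    T1 (transaction):
        __tm_atomic {
            t1 = head;
            if (t1)
                t1->x = t1->y = 1;
        }

    T2 (transaction):
        __tm_atomic {
            t2 = head;
            head = t2->next;
            t2->next = NULL;
        }
        priv = t2->x;
        …
        assert (priv == t2->y);

    T1 (lock):
        lock(mutex);
        t1 = head;
        if (t1)
            t1->x = t1->y = 1;
        unlock(mutex);

    T2 (lock):
        lock(mutex);
        t2 = head;
        head = t2->next;
        t2->next = NULL;
        unlock(mutex);
        priv = t2->x;
        …
        assert (priv == t2->y);

With locks the assertion always holds: once T2 has unlinked the node, no other thread can reach it. An STM that is not privatization-safe can let T1 keep touching the node after T2 has privatized it, so T2's unprotected reads of t2->x and t2->y can disagree (the slide shows the values 1 and 0) and the assertion fails.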

33


34

Transactional Execution Modes

Optimistic
– Lock data on write (exclusive write locks)
– Record reads
– Release write locks and validate reads on commit
– Pros: cache effects; no read locking cost
– Cons: providing privatization and consistency incurs extra cost; no filtering

Pessimistic
– Lock data on write (exclusive write locks)
– Lock data on read (shared read locks)
– Release read and write locks on commit
– Pros: privatization-safety and consistency for free; filtering
– Cons: cache effects; additional read locking cost

• Obstinate – pessimistic transaction that wins all conflicts

35

Write Buffering vs. In-Place Update

Write Buffering (a.k.a. Lazy Versioning)
– Write to private buffer
– Copy to memory on commit
– Lazy Locking (acquire locks on commit) or Eager Locking (acquire locks on access)
– Pros: fast abort
– Cons: slow commit; reads have to search buffer

In-Place Update (a.k.a. Eager Versioning)
– Directly write shared memory
– Record old values in an undo log
– Eager Locking: acquire write-locks on write
– Pros: fast commit; direct reads
– Cons: slow abort
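The trade-off between the two versioning schemes can be seen in a small single-threaded sketch. This is an illustration only: `BufferedTxn` and `InPlaceTxn` are hypothetical names, all locking and conflict detection is deliberately left out, and the data is restricted to `int` for brevity.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Write buffering (lazy versioning): shared memory is untouched until commit.
struct BufferedTxn {
    std::unordered_map<int*, int> buffer;

    void write(int* addr, int val) { buffer[addr] = val; }
    int read(int* addr) const {            // reads have to search the buffer
        auto it = buffer.find(addr);
        return it != buffer.end() ? it->second : *addr;
    }
    void commit() {                        // slow commit: copy buffer to memory
        for (const auto& e : buffer) *e.first = e.second;
        buffer.clear();
    }
    void abort() { buffer.clear(); }       // fast abort: just drop the buffer
};

// In-place update (eager versioning): write memory directly, log old values.
struct InPlaceTxn {
    std::vector<std::pair<int*, int>> undoLog;

    void write(int* addr, int val) {
        undoLog.emplace_back(addr, *addr); // remember the old value
        *addr = val;
    }
    int read(int* addr) const { return *addr; }  // direct reads
    void commit() { undoLog.clear(); }           // fast commit
    void abort() {                               // slow abort: replay log backwards
        for (auto it = undoLog.rbegin(); it != undoLog.rend(); ++it)
            *it->first = it->second;
        undoLog.clear();
    }
};
```

The undo log is replayed in reverse so that, when one location is written several times, the oldest recorded value is the one that finally lands in memory.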

36

Conflict Detection Granularity

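• Object-based (Java/C#)
    class Foo { int x; int y; }
  – Metadata lives in the object itself, next to vtbl, x, and y
• Word-based (cacheline-based) (C/C++)
    struct Foo { int x; int y; }
  – Metadata lives in a separate Owner Table of metadata entries; x and y carry no header

[Figure: object layout with embedded metadata vs. a struct whose words map into the Owner Table]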

37


38

Intel C/C++ STM
http://whatif.intel.com (NEW RELEASE IN Q3 2010)

• Based on Intel’s product compiler
• Features
  – Consistency and privatization safety preserving closed-nested atomic blocks (__tm_atomic) to support SGLA semantics
  – User abort (__tm_abort) for failure atomicity
  – Transaction retry (__tm_retry) for condition synchronization
  – Multiple transactional execution modes: optimistic and pessimistic STM, obstinate
  – Serial execution mode (for I/O and calls to legacy binaries)
  – TM support for C++: virtual functions, (multiple) inheritance, function and class templates, exceptions

39

System Architecture

[Figure: layered architecture. Layers: APPLICATION, LANGUAGE SUPPORT, TM RUNTIME, HARDWARE. Components shown: transactional C/C++, Intel C/C++ compiler, C/C++ support, multicore system.]

Runtime Overview

• In-place updates

• Cacheline-level conflict detection granularity

• Information for rollback recorded in undo log

• Reads recorded in read set:
  – For validation (optimistic mode)
  – For locking/unlocking (pessimistic and obstinate modes)

• Writes recorded in write set for locking/unlocking (all transactional modes)

• Two-phase locking (2PL) protocol

40

Per thread metadata

•Transaction Descriptor

–Read set: validation or unlocking

–Write set: unlocking

–Undo log: rollback

–… local timestamp, execution mode …

•Transaction Memento

–Checkpoint of machine and transaction state

–For nesting & partial rollback

41

Transaction Record (TxnRec)

• Tracks transactional state of shared data
  – For optimistic transactions (OptTxnRec)
    • Unlocked – contains timestamp (more on this later!)
    • Write-locked – contains transaction descriptor of lock owner
  – For pessimistic transactions (PessTxnRec)
    • Unlocked – contains special mark
    • Read-locked – contains info about all readers
    • Write-locked – contains info about single writer
• Stored in the owner table mapping each memory word to a single transaction record

42

Optimistic STM Algorithm

• Timestamp-based
  – Global Timestamp (G_TS): incremented every time a writing transaction commits
  – Local Timestamp (L_TS): records last time transaction was valid
  – On transactional read of shared data, record timestamp associated with its OptTxnRec in the transaction’s read set
  – On transaction termination, update local timestamps and write them to OptTxnRec-s of all data updated by this transaction
• Validation for serializability and consistency
• Quiescence for privatization safety

43

44


Validation

•For every entry in read set, abort transaction if recorded timestamp greater than local timestamp

•Performed on commit to guarantee serializability

•Performed on read to guarantee consistency (when data’s OptTxnRec > local timestamp)
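The validation rule can be sketched as follows. This is a simplified illustration, not the Intel runtime: the type and function names are hypothetical, write locking is omitted, and advancing the local timestamp to the global one after a successful validation is an assumption about how "last time transaction was valid" is maintained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of an unlocked OptTxnRec: it only holds a timestamp.
struct OptTxnRecSketch {
    std::uint64_t timestamp = 0;
};

struct OptimisticTxn {
    std::uint64_t localTs;                         // L_TS
    std::vector<const OptTxnRecSketch*> readSet;   // records read so far

    explicit OptimisticTxn(std::uint64_t globalTs) : localTs(globalTs) {}

    // Rule from the slide: fail if any read-set entry carries a timestamp
    // greater than the local timestamp.
    bool validate() const {
        for (const OptTxnRecSketch* rec : readSet)
            if (rec->timestamp > localTs) return false;
        return true;
    }

    // Read barrier: returns false when the transaction has to abort.
    bool onRead(const OptTxnRecSketch* rec, std::uint64_t globalTs) {
        if (rec->timestamp > localTs) {     // data is newer than our snapshot
            if (!validate()) return false;  // an earlier read is now stale
            localTs = globalTs;             // assumption: valid as of now
        }
        readSet.push_back(rec);
        return true;
    }
};

// Commit of a writing transaction: bump G_TS and stamp every written record.
void commitWriter(std::uint64_t& globalTs,
                  const std::vector<OptTxnRecSketch*>& writeSet) {
    ++globalTs;
    for (OptTxnRecSketch* rec : writeSet) rec->timestamp = globalTs;
}
```

In the deck's consistency example, T1 reads x at timestamp 0, T2 then commits a write to x and stamps it with 1, and T1's second read of x fails validation, so the inconsistent pair of values is never handed to the program.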

45

Validation

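    T1:
        __tm_atomic {
            int t1 = x;
            int t2 = x;
            if (t1 != t2)
                *ptr = x;    // cannot happen
        }

    T2:
        __tm_atomic {
            x = y;
        }

[Figure: animation of the timestamps. G_TS starts at 0 and the OptTxnRec-s of x and y hold 0. T1 has L_TS = 0 and R_SET = <&x>; T2 has L_TS = 0, R_SET = <&y>, W_SET = <&x>. When T2 commits, G_TS becomes 1, T2's L_TS becomes 1, and the OptTxnRec of x is updated. T1's second read of x then fails validation and T1 takes the ABORT path instead of the NULL POINTER dereference.]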

46

47


Quiescence

•Maintain list of active transactions containing their current local timestamp

•Implicit infinite timestamp for pessimistic transactions

•Committing transaction waits for all active transactions whose timestamp is smaller than its own timestamp
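The waiting rule can be written down as a small predicate. The sketch below is illustrative only: `ActiveTxn`, `kInfiniteTs`, and `mustWaitFor` are hypothetical names, and a real runtime would spin or block on these transactions instead of returning a list of ids.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Pessimistic transactions are modelled with an implicit infinite timestamp,
// so a committer never has to wait for them.
constexpr std::uint64_t kInfiniteTs =
    std::numeric_limits<std::uint64_t>::max();

struct ActiveTxn {
    int id;
    std::uint64_t localTs;  // current L_TS, or kInfiniteTs if pessimistic
};

// Returns the ids of the active transactions a committing transaction has
// to wait for: those whose timestamp is smaller than the committer's own.
std::vector<int> mustWaitFor(int committerId, std::uint64_t committerTs,
                             const std::vector<ActiveTxn>& active) {
    std::vector<int> waitFor;
    for (const ActiveTxn& t : active)
        if (t.id != committerId && t.localTs < committerTs)
            waitFor.push_back(t.id);
    return waitFor;
}
```

For the privatization example, a committing T2 with timestamp 1 has to wait for an active T1 whose timestamp is still 0, which keeps T2 from touching the privatized node while T1 may still be writing to it.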

48

Quiescence

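    T1:
        __tm_atomic {
            t1 = head;
            if (t1)
                t1->x = t1->y = 1;
        }

    T2:
        __tm_atomic {
            t2 = head;
            head = t2->next;
            t2->next = NULL;
        }
        priv = t2->x;
        assert (priv == t2->y);

[Figure: timeline of G_TS and the per-transaction L_TS values. G_TS advances from 0 to 1 when T2 commits; T1's L_TS is still 0 while T2's is 1, so T2 must WAIT for T1 before it continues with priv = t2->x.]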

49

50

Unified STM

• Both optimistic and pessimistic readers can co-exist

• Owner table is shared and contains both OptTxnRec and PessTxnRec

• Read barriers:
  – Optimistic – reads only OptTxnRec
  – Pessimistic – reads only PessTxnRec

• Write barriers need to write both TxnRec-s

51

Owner Table for Unified STM

    typedef uintptr_t TxnRec;
    typedef struct OwnerTableEntryS {
        TxnRec optimistic;
        TxnRec pessimistic;
    } OwnerTableEntry;

[Figure: Owner Table whose entries each hold a PessTxnRec and an OptTxnRec]

52

OptTxnRec

Bit 0: Lock bit
– 0: Write-locked (Exclusive)
– 1: Unlocked (Shared)

Bits 31 … 1: Upper bits
– Owner TxnDesc upper bits
– Or timestamp upper bits

53

PessTxnRec

Bit 0: Lock bit
– 0: Write-locked (Exclusive)
– 1: Unlocked (Shared)

Bit 1: Upgrading bit
– 0: no upgrading request
– 1: upgrading requested

Bits 31 … 2: Owner bits
– Each bit represents a pessimistic transaction
– Locked if non-zero
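The two record layouts can be decoded with a few masks. The helpers below are a sketch with hypothetical names; in particular, storing the timestamp shifted left by one bit is an assumption made here so that bit 0 stays free for the lock bit, and the slot numbering for readers is illustrative.

```cpp
#include <cassert>
#include <cstdint>

using TxnRec = std::uintptr_t;  // same underlying type as in the deck's typedef

constexpr TxnRec kLockBit = TxnRec{1} << 0;     // 0: write-locked, 1: unlocked
constexpr TxnRec kUpgradeBit = TxnRec{1} << 1;  // PessTxnRec only
constexpr unsigned kOwnerShift = 2;             // PessTxnRec owner bits start here

// OptTxnRec helpers
inline bool optIsWriteLocked(TxnRec r) { return (r & kLockBit) == 0; }
inline TxnRec optMakeTimestamp(TxnRec ts) { return (ts << 1) | kLockBit; }
inline TxnRec optTimestamp(TxnRec r) { return r >> 1; }

// PessTxnRec helpers
inline bool pessIsWriteLocked(TxnRec r) { return (r & kLockBit) == 0; }
inline bool pessUpgradeRequested(TxnRec r) { return (r & kUpgradeBit) != 0; }
inline TxnRec pessOwnerBits(TxnRec r) { return r >> kOwnerShift; }
inline bool pessIsLocked(TxnRec r) { return pessOwnerBits(r) != 0; }

// Marks the pessimistic transaction occupying `slot` as an owner.
inline TxnRec pessAddOwner(TxnRec r, unsigned slot) {
    return r | (TxnRec{1} << (kOwnerShift + slot));
}
```

With one owner bit per pessimistic transaction, the number of concurrent pessimistic transactions is bounded by the number of owner bits in the record (30 in the 32-bit layout shown on the slide).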

54

Unified STM Algorithm

    T1 (PESS):
        __tm_atomic {
            r1 = x;
            r3 = x;
        }

    T2 (OPT):
        __tm_atomic {
            r2 = x;
            x = r2 + 1;
        }

[Figure: animation of the PessTxnRec and OptTxnRec bit patterns for x (initially 0) as T1 and T2 execute]


55

56

Compiler/Runtime Interaction

• Decouple compiler from the runtime
  – Enables use of different library implementations with the same compiler (e.g. in-place updates vs. write-buffering)
  – Enables use of different algorithms within the library itself (e.g. optimistic vs. pessimistic)
• Calls to the runtime realized through a vtable-like mechanism
• Compiler/runtime ABI:
  – General – same code used for different algorithms
  – Rich – to enable additional optimizations

57

ABI: Txn Begin and Commit

    _ITM_transaction * _ITM_getTransaction()
– Returns (creates if necessary) a transaction descriptor

    uint32 _ITM_beginTransaction(_ITM_transaction* td, uint32 props)
– Saves machine state
– Passes information to the runtime via props (e.g. pr_multiwayCode - both instrumented and uninstrumented code is available)
– Can return more than once (e.g. on abort); possible return values: a_saveLiveVariables, a_restoreLiveVariables

    void _ITM_commitTransaction(_ITM_transaction *td)

58

ABI: Read and Write Barriers

• Templates:
    void _ITM_Wtypesig(_ITM_transaction* td, type *addr, type val)
    type _ITM_Rtypesig(_ITM_transaction* td, type *addr)

  typesig:
  – U[1248] – unsigned int
  – [FDE] – float, double, long…

• Examples:
    _ITM_WF(_ITM_transaction *td, float *addr, float val);
    _ITM_RU4(_ITM_transaction *td, uint32 *addr);

59

Simple Atomic Block Translated

Source:

    __tm_atomic {
        uint32Val = 42;
    }

Translated:

    uint32 props = pr_multiwayCode;
    _ITM_transaction *td = _ITM_getTransaction();
    uint32 doWhat = _ITM_beginTransaction(td, props);
    if (doWhat & a_restoreLiveVariables) {
        /* code to restore live local variables */
    }
    if (doWhat & a_saveLiveVariables) {
        /* code to save live local variables */
    }
    _ITM_WU4(td, &uint32Val, 42);
    _ITM_commitTransaction(td);

! CONFLICT ! -> execution resumes from _ITM_beginTransaction, which returns again

60

User Abort and Retry Translated

Source:

    __tm_atomic {
        if (cond) __tm_retry;
        if (error) __tm_abort;
        uint32Val = 42;
    }

Translated:

    uint32 props = pr_multiwayCode;
    _ITM_transaction *td = _ITM_getTransaction();
    uint32 doWhat = _ITM_beginTransaction(td, props);
    if (doWhat & a_abortTransaction) goto ABORT_TXN;
    if (doWhat & a_restoreLiveVariables) {
        /* code to restore live local variables */
    }
    if (doWhat & a_saveLiveVariables) {
        /* code to save live local variables */
    }
    if (!_ITM_RU(td, &cond))
        _ITM_abortTransaction(td, userRetry);
    if (_ITM_RU(td, &error))
        _ITM_abortTransaction(td, userAbort);
    _ITM_WU4(td, &uint32Val, 42);
    _ITM_commitTransaction(td);
    ABORT_TXN:

61

Optimizations for Transactions

• Standard optimizations
  – Careful IR design enables existing optimizations
    • Partial redundancy elimination (PRE), dead code elimination, …
  – Subtle in presence of nesting
• STM-specific optimizations
  – No instrumentation when executing in serial mode
  – Conversion of generic STM read/write barriers to cheaper variants
  – Also:
    • Flattening nested transactions if no user abort is inside
    • Barrier elimination for __thread (thread local) or const data

Un-instrumented Serial Mode

Source:

    __tm_atomic {
        if (flag) {
            printf("Hello!");
        }
    }

Translated:

    uint32 props = pr_multiwayCode;
    _ITM_transaction *td = _ITM_getTransaction();
    uint32 doWhat = _ITM_beginTransaction(td, props);
    if (doWhat & a_restoreLiveVariables) {
        /* code to restore live local variables */
    }
    if (doWhat & a_saveLiveVariables) {
        /* code to save live local variables */
    }
    if (doWhat & a_instrumentedCode) {
        if (_ITM_RU4(td, &flag)) {
            _ITM_changeTransactionMode(td, modeSerialIrrevocable);
            printf("Hello!");
        }
    } else {
        if (flag) printf("Hello!");
    }
    _ITM_commitTransaction(td);

62

ABI: Optimized Barrier Templates

• After read or after write (e.g. eliminate redundant locking operations)
    void _ITM_W{aRW}typesig(_ITM_transaction* td, type *addr, type val)
    type _ITM_R{aRW}typesig(_ITM_transaction* td, type *addr)

• Read-for-write (e.g. acquire write lock early and eliminate read lock)
    type _ITM_RfWtypesig(_ITM_transaction* td, type *addr)

63

64

Barrier Optimization Example

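Source:

    __tm_atomic {
        if (x < N) {
            x++;
        }
    }

Step 1, generic barriers:

    …
    t1 = _ITM_RU4(td, &x);
    if (t1 < N) {
        t2 = _ITM_RU4(td, &x);
        _ITM_WU4(td, &x, t2+1);
    }
    …

Step 2, second read of x removed:

    …
    t1 = _ITM_RU4(td, &x);
    if (t1 < N) {
        _ITM_WU4(td, &x, t1+1);
    }
    …

Step 3, write-after-read barrier:

    …
    t1 = _ITM_RU4(td, &x);
    if (t1 < N) {
        _ITM_WaRU4(td, &x, t1+1);
    }
    …

Step 4, read-for-write followed by write-after-write barrier:

    …
    t1 = _ITM_RfWU4(td, &x);
    if (t1 < N) {
        _ITM_WaWU4(td, &x, t1+1);
    }
    …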

65

ABI: Undo and Commit Functions

• Programmers may register actions executed by the runtime on transaction termination

    void _ITM_addUserCommitAction(_ITM_transaction *td, _ITM_userCommitFunction fn, _ITM_transactionId tid, void *arg)

    void _ITM_addUserUndoAction(_ITM_transaction *td, _ITM_userUndoFunction fn, void *arg)

• Current transaction id
    _ITM_transactionId _ITM_getTransactionId(_ITM_transaction *td)
  (1: non-txn, 2: outer txn begin, ++: inner txn begin)

• Undo and commit actions can be used inside of function wrappers

Transactional Function Wrappers

• Transparently replace a call to a non-transactional function with a call to its transactional version
• Transactional wrapper’s code:
  – Un-instrumented
  – Can use explicit calls to the runtime
• Intended use - implementation of library functions (e.g. transaction-aware memory management)

__declspec (tm_wrap(foo)) void fooTxn();

66

Memory Management Risks

• Txn allocation, non-txn de-allocation
  – Re-executions leading to multiple allocations but only one de-allocation operation
• Non-txn allocation, txn de-allocation
  – Re-executions leading to the same region being de-allocated more than once
• Txn allocation, txn de-allocation
  – Combination of two previous cases depending on when re-execution gets triggered

67

Memory Management Algorithm

• Uses function wrappers mechanism to take advantage of the existing allocators
• Allocation and de-allocation sites marked with tid
• Allocation creates an allocation record
  – If allocation record exists on outer commit – remove it
  – On abort – de-allocate and remove allocation record
• De-allocation removes allocation record
  – De-allocate immediately if txn_id(de-alloc) <= txn_id(alloc)
  – Otherwise, de-allocate on commit at the nesting level where condition holds
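The de-allocation rule reduces to one comparison of transaction ids. The following sketch uses hypothetical names (`FreeAction`, `FreeDecision`, `decideFree`) and only models the decision, not the allocation records or the actual call to the underlying allocator.

```cpp
#include <cassert>

enum class FreeAction { ExecuteNow, DeferUntilCommit };

struct FreeDecision {
    FreeAction action;
    int deferUntilTxnId;  // nesting level at which the free may finally run
};

// Rule from the slide: de-allocate immediately if
// txn_id(de-alloc) <= txn_id(alloc); otherwise defer the free until commit
// reaches a nesting level whose txn_id satisfies that condition.
FreeDecision decideFree(int deallocTxnId, int allocTxnId) {
    if (deallocTxnId <= allocTxnId)
        return {FreeAction::ExecuteNow, deallocTxnId};
    return {FreeAction::DeferUntilCommit, allocTxnId};
}
```

For example, freeing inside transaction 3 a block that was allocated in transaction 2 gives "defer until txn_id <= 2", because an abort of transaction 3 must leave the block intact; freeing in transaction 2 a block allocated in transaction 3 can run at once.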

68

Safe Memory Management

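    p1 = malloc(size);           // allocation record: txn_id 1
    __tm_atomic {
        p2 = malloc(size);       // allocation record: txn_id 2
        __tm_atomic {
            free(p2);            // defer until txn_id <= 2
            p3 = malloc(size);   // allocation record: txn_id 3
            p4 = malloc(size);   // allocation record: txn_id 3
        }
        free(p1);                // defer until txn_id <= 1
        free(p3);                // execute
        __tm_atomic {
            free(p4);            // defer until txn_id <= 3
        }
    }

[Figure: Allocation Records table listing p1 to p4 with the txn_id of each allocation, compared against the txn_id of each free]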

69

70

Functions Code Generation

• tm_callable
  – Generate two copies, instrumented (transactional) and uninstrumented (non-transactional)
• tm_pure
  – Only generate uninstrumented code – does not cause transaction to go serial
• tm_unknown
  – Switch to serial mode before a call is made inside a transaction
  – May be promoted to tm_callable or tm_pure by compiler

71

Code Generation for tm_callable

    __declspec(tm_callable)
    int inc (int *p)
    {
        return ++(*p);
    }

    inc:
        jmp inc_$nontxn
        mov eax, MAGIC
        jmp inc_$txn
    inc_$nontxn:
        ; uninstrumented (non-transactional) copy
    inc_$txn:
        ; instrumented (transactional) copy

72

Code Generation for tm_pure

    __declspec(tm_pure)
    int peek(int *p)
    {
        return *p;
    }

    peek:
        jmp peek_$nontxn
        mov eax, MAGIC
        jmp peek_$nontxn
    peek_$nontxn:
        ; uninstrumented copy, used both inside and outside transactions

73

Indirect Calls

    if (*(fp + MAGIC_OFFSET) == MAGIC) {
        call fp + TXN_TWIN_OFFSET;
    } else {
        switchToSerialMode();
        call fp;
    }

• No overhead for indirect calls outside of transactions
• Same execution mode available across inheritance hierarchy thanks to virtual function overriding rules
• No annotation on function pointers
  – Indirect call to non-recompiled tm_pure function causes switch to serial mode

74


75

TM in Real World

• Realistic workloads: STAMP, SPLASH, and PARSEC benchmark suites (fluid dynamics, raytracing, etc.)
• Performance bottlenecks
  – Sometimes we use a single global lock (GLOCK) as a baseline
  – Bottleneck discovery performed on optimistic STM only

76

False Conflicts

•Poor scalability due to conflicts -- >90% false conflicts

•The same STM had no problems on SPLASH-2

[Figure: Execution Time (s) vs. # threads (1, 2, 4, 8) for Genome and Vacation; series: GLOCK, STM]

77

Mapping to TxnRec-s

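[Figure: a 32-bit Address (bits 31 … 0, with field boundaries marked at bits 5/6 and 19/20) is hashed into the Ownership Table, whose entries run from 0x0000 to 0x3FFF; each entry holds a Transaction Record plus space reserved to avoid cache line ping-ponging]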

•Addresses map to a transaction record via a hash function

• Different addresses can map to the same record
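One plausible reading of the figure is that the bits above the 64-byte cache-line offset select the table entry. The sketch below assumes exactly that, with bits 19 … 6 of a 32-bit address indexing a table of 0x4000 records; the function and constant names are hypothetical and the real hash may differ.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kCacheLineBits = 6;        // low 6 bits: offset in a 64-byte line
constexpr std::uint32_t kTableMask = 0x3FFF;  // 14 index bits: entries 0x0000..0x3FFF

// Maps an address to the index of its transaction record (assumed scheme).
inline std::uint32_t ownerIndex(std::uint32_t addr) {
    return (addr >> kCacheLineBits) & kTableMask;
}
```

Under this scheme two addresses in the same cache line always share a record, and so do addresses that differ only in bits 20 and above, for instance 0x00001000 and 0x00101000. The second case is a false conflict: unrelated data ends up guarded by the same record.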

78

Refined Hash Function

• 4 additional bits to index into transaction record
• Reduces false conflicts at the cost of potentially increasing cache ping-ponging

[Figure: 32-bit Address with field boundaries marked at bits 5/6, 19/20, and 23; Ownership Table entries 0x0000 to 0x3FFF, each holding Transaction Records]

79

False Conflicts Reduced

[Figure: Execution Time (s) vs. # threads (1, 2, 4, 8) for Genome and Vacation; series: GLOCK, STM (old hash), STM (new hash)]

80

Over-Instrumentation

• Compiler generates more barriers than necessary
  – Thread-local memory accesses
  – Objects alternating between modification and constant phase
  – Constant global objects

Transactional Barrier Counts for STAMP

    Benchmark  TxLD (optimal)  TxLD (compiler)  TxST (optimal)  TxST (compiler)  TxLD overhead  TxST overhead
    Genome     58,701,959      624,073,490      2,252,291       19,078,705       10.63x         8.60x
    Kmeans     86,666,710      255,662,754      86,666,710      86,666,711       2.95x          1.00x
    Vacation   785,775,435     925,584,125      26,300,714      122,543,905      1.18x          4.66x

81

__tm_waiver

•No instrumentation for a block or function marked with __tm_waiver

• Allows incremental optimizations but should be used with caution

    __tm_atomic {
        y = ++x;         // instrumented
        __tm_waiver {
            ++local;     // no instrumentation
        }
    }

82

Over-Instrumentation Reduced

• __tm_waiver used for
  – thread-local object allocation routines
  – quasi-static shared objects

[Figure: Execution Time (s) vs. # threads (1, 2, 4, 8) for Genome and Vacation; series: GLOCK, STM (new hash), STM (new hash + __tm_waiver)]

83

Quiescence Overhead

• Only some programs use privatization idiom
• Provide API to let programmer selectively disable privatization safety

[Figure: speedup (axis 0 to 2) for sphinx, genome, kmeans, vacation, and average; series: 2 threads, 4 threads, 8 threads]

84

Other Issues

• Small transactions overwhelmed by fixed costs
  – Fluidanimate: ~1 load and ~1 store per transaction
  – Different code for small transactions

•Atomic blocks make porting of some benchmarks (e.g., BerkeleyDB) difficult but are more amenable to compiler optimizations

•Annotating transactional functions can be a burden (40% of functions in vacation)

•Many workloads require condition synchronization

85

Finding the Bottlenecks

•Many workloads would not scale at first

•Cumulative stats would shed no light - low contention, no false conflicts, …

•And then we remembered … the devil is in the details …

86

Per Critical Section Statistics

Only critical section 601 suffers from high abort rate and prevents scaling

Transactional Statistics for Sphinx

    critical section  tx_begin  commit  abort  abort %  code size (lines)
    602               1314      1312    2      0.15%    O(1)
    542               222481    221043  1438   0.65%    O(1)
    559               220908    220908  0      0.00%    O(1)
    601               12306     6194    6112   49.67%   O(1000)
    571               42917     42889   28     0.07%    O(1)
    588               42770     42770   0      0.00%    O(1)
    301               1313      1312    1      0.08%    O(1)

87

Overall Performance

[Figure: STM vs. single-thread GLOCK. Speedup (axis 0 to 8) with 1, 2, 4, and 8 threads for genome, kmeans/low, kmeans/high, vacation/low, vacation/high, cholesky, fft, lu/cont., lu/non cont., radix, barnes, fmm, ocean/cont., ocean/non cont., radiosity, raytrace, volrend, water-nsquared, water-spatial, fluidanimate]