Software Transactional Memory
TiC 2010
Adam Welc
Programming Systems Lab, Intel Labs
Agenda
Part 1: STM Overview
– Introduction
– Language Constructs and Semantics
– Design space
Part 2: STM Implementation
– Runtime
– Compiler
– Performance
Concurrent Programming Today
•Mutual exclusion locks (Java monitors, pthread locks etc.) used for concurrency control
– Coarse-grained locking limits concurrency
– Fine-grained locking is hard: composability, possibility of deadlocks, etc.
•Transactional Memory (TM) offers an alternative
Designing Map Structure
•Operations
get (Key k)          { seqGet(k); }
put (Key k, Value v) { seqPut(k, v); }
remove (Key k)       { seqRemove(k); }
T1: m.get(k);
T2: m.put(k,v);
T3: m.remove(k);
• How to make it thread-safe?
ConcurrentMap Class
synchronized Value get(Key k) {
  return seqGet(k);
}
synchronized void put(Key k, Value v) {
  seqPut(k, v);
}
synchronized void remove(Key k) {
  seqRemove(k);
}
What if the workload is mostly read-only?
Refined ConcurrentMap Class
Value get(Key k) {
  // try unsynchronized
  Value tmp = seqGet(k);
  if (tmp != null) return tmp;
  else synchronized(this) {
    // possible interference
    return seqGet(k);
  }
}
void put(Key k, Value v) {
  synchronized(this) {
    seqPut(k, v);
  }
}
void remove(Key k) {
  synchronized(this) {
    seqRemove(k);
  }
}
Actual Code
public Object get(Object key) {
  int hash = hash(key);
  // Try first without locking...
  Entry[] tab = table;
  int index = hash & (tab.length - 1);
  Entry first = tab[index];
  Entry e;
  for (e = first; e != null; e = e.next) {
    if (e.hash == hash && eq(key, e.key)) {
      Object value = e.value;
      if (value != null) return value;
      else break;
    }
  }
  …
  // Recheck under synch if key not there or interference
  Segment seg = segments[hash & SEGMENT_MASK];
  synchronized(seg) {
    tab = table;
    index = hash & (tab.length - 1);
    Entry newFirst = tab[index];
    if (e != null || first != newFirst) {
      for (e = newFirst; e != null; e = e.next) {
        if (e.hash == hash && eq(key, e.key)) return e.value;
      }
    }
    return null;
  }
}
DO YOU REALLY WANT TO WRITE THIS KIND OF CODE?
Composition
•Simple concurrent accesses work
•Consider concurrent value deposit, starting from map.get(k) == 100
T1:
int v1 = map.get(k);   // == 100
v1 += 10;              // == 110
map.put(k, v1);        // == 110
T2:
int v2 = map.get(k);   // == 100
v2 += 20;              // == 120
map.put(k, v2);        // == 120
•Both threads read 100, so one deposit IS LOST
•Fix: wrap each sequence in synchronized(map) { … }
•Back to coarse-grained locking
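The lost update can be replayed deterministically without threads. The following is a minimal sketch in C (the slide itself uses Java); `balance`, `replay_unsynchronized` and `replay_atomic` are hypothetical names introduced only for this illustration. It executes the two deposits once in the interleaved order where both threads read before either writes, and once with each read-modify-write sequence kept together.

```c
#include <assert.h>

/* Hypothetical stand-in for the map entry stored under key k. */
int balance;

/* Interleaved schedule: both threads read before either one writes. */
int replay_unsynchronized(int start) {
    balance = start;
    int v1 = balance;   /* T1: map.get(k) */
    int v2 = balance;   /* T2: map.get(k) */
    v2 += 20;
    balance = v2;       /* T2: map.put(k, v2) */
    v1 += 10;
    balance = v1;       /* T1: map.put(k, v1) overwrites T2's deposit */
    return balance;
}

/* Serialized schedule: each get/add/put sequence runs as one unit. */
int replay_atomic(int start) {
    balance = start;
    { int v1 = balance; v1 += 10; balance = v1; }   /* T1 */
    { int v2 = balance; v2 += 20; balance = v2; }   /* T2 */
    return balance;
}
```

Starting from 100, the interleaved replay ends at 110 (the deposit of 20 is lost), while the serialized replay ends at 130.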
TM Approach
get (Key k)          { __tm_atomic { seqGet(k); } }
put (Key k, Value v) { __tm_atomic { seqPut(k, v); } }
remove (Key k)       { __tm_atomic { seqRemove(k); } }
__tm_atomic {
  int v = map.get(k);
  v += amount;
  map.put(k, v);
}
Let TM system take care of the rest
Managed vs. Unmanaged STM
• Same core semantics and language constructs (and algorithms)
• Managed (e.g. Java, .NET)
– Controlled execution of native code
– Dynamic compilation
• Unmanaged (e.g. C, C++)
– Problem with legacy binaries
– Have to know upfront whether code is executed transactionally
Atomic Blocks == Transactions
•Originally a database concept
•Transactional executions
– Atomic
– Consistent
– Isolated: serial or serializable
– Durable
•Serializable – appearance of serial execution
Serial Execution
int x = 0; int y = 0;
T1:
__tm_atomic {
  int tmp1 = x;
  int tmp2 = y;
}
T2:
__tm_atomic {
  x = 42;
  y = 42;
}
T1 before T2: tmp1 == 0, tmp2 == 0
T2 before T1: tmp1 == 42, tmp2 == 42
BOTH RESULTS CORRECT
Serializable Execution
int x = 0; int y = 0;
T1:
__tm_atomic {
  int tmp1 = x;   // == 42
  int tmp2 = y;   // == 42
}
T2:
__tm_atomic {
  x = 42;         // == 42
  y = 42;         // == 42
}
BOTH RESULTS THE SAME DESPITE INTERLEAVING
Non-Serializable Execution
int x = 0; int y = 0;
T1: __tm_atomic { int tmp1 = x;     // == 0
T2: __tm_atomic { x = 42;           // == 42
T2:               y = 42; }         // == 42
T1:               int tmp2 = y; }   // == 42
tmp1 == 0 and tmp2 == 42: DIFFERENT FROM ANY SERIAL execution
! CONFLICT ! → ROLL BACK
TM’s role is to “fix” conflicting executions
Transaction Nesting
•Required for composability
•Open nesting
– Results exposed upon inner transaction commit
– Compensating actions used upon outer transaction abort
– May lead to serializability violations
•Closed nesting
– Computation results exposed only upon outermost transaction commit
– Transactions can be flattened - inner transaction is semantically a no-op
Open Nesting
• Conditional can be entered after inner commit
int x = 0;
void inc() { x++; }
void dec() { x--; }
T1:
__tm_atomic {
  __tm_atomic { inc(); }   // register dec()
}
dec();                     // compensating action, run when the outer transaction aborts
T2:
__tm_atomic { if (x == 1) { … } }
• Effect is undone but T2 has seen the result!
Closed Nesting
• Conditional can be entered only after outermost commit
int x = 0;
void inc() { x++; }
void dec() { x--; }
T1:
__tm_atomic {
  __tm_atomic { inc(); }
}
T2:
__tm_atomic { if (x == 1) { … } }
Flatten Or Not To Flatten?
__tm_atomic {
…
…
}
__tm_atomic {
}
potential conflict
ROLL BACK
ROLL BACK
More on Execution Semantics
• Transactions are serializable, but
• The notion comes from the database world, where all actions are transactional
• What about non-transactional code?
Problematic Behavior
int * p = &x;
T1:
__tm_atomic {
  if (p != NULL)   // true
    tmp = *p;      // p == null by now: NULL POINTER
}
T2 (non-transactional):
p = null;
Should this behavior be allowed?
– Yes: This program is buggy, p = null should be inside a transaction
– No: Transactions should be atomic no matter what
Two Points of View on Atomicity
•Weak atomicity
– Transactions serializable with respect to other transactions
•Strong atomicity
– Transactions serializable with respect to all memory accesses
STRENGTH: WEAK ATOMICITY → STRONG ATOMICITY
Weak Atomicity
• Non-transactional accesses bypass STM access protocol
– Non-transactional code remains un-instrumented
– Most STMs behave this way
• Requires segregation of transactional and non-transactional data
– Hard to enforce
• Otherwise – behavior depends on implementation
– Unexpected results can be observed
Non-Repeatable Read
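int x = 0;
T1:
__tm_atomic {
  tmp1 = x;
  tmp2 = x;
}
T2 (non-transactional):
x = 42;
•Non-txn code can affect transactional computation
– x = 42 before or after both reads: tmp1 == tmp2 (both == 42 or both == 0)
– x = 42 between the reads: tmp1 != tmp2 (tmp1 == 0, tmp2 == 42)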
Dirty Read
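int x = 0;
T1:
__tm_atomic {
  x++;
  x++;
}
T2 (non-transactional):
tmp = x;
•Txn code can leak intermediate results to non-transactional computation
– tmp read before or after the transaction: tmp is even (== 0 or == 2)
– tmp read between the increments: tmp is odd (== 1)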
Strong Atomicity
•Non-transactional accesses turned into micro-transactions
– Reads and writes block until write gets committed
– Interleaved writes can invalidate a transaction
•Avoids all undesirable behaviors of weak atomicity, but
•All code needs to be instrumented
Non-Repeatable Read
int x = 0;
T1:
__tm_atomic {
  tmp1 = x;   // == 0
  tmp2 = x;
}
T2 (non-transactional write, executed as a micro-transaction):
__tm_atomic { x = 42; }
•Write by T2 invalidates T1’s transaction: ROLL BACK
Dirty Read
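int x = 0;
T1:
__tm_atomic {
  x++;   // == 1
  x++;   // == 2
}
T2 (non-transactional read, executed as a micro-transaction):
__tm_atomic { tmp = x; }   // BLOCK until T1 commits, then tmp == 2
•Blocking effectively reschedules and serializes non-transactional operations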
Are We Done?
•Overhead of strong atomicity can be huge (up to 10x slowdown)
•Non-txn code instrumentation may be problematic (precompiled libraries, system calls, etc.)
•Can we find an in-between solution?
STRENGTH: WEAK ATOMICITY → SGLA → STRONG ATOMICITY
Single Global Lock Atomicity
• Transactions execute as if protected by a single global lock
__tm_atomic { S; }   behaves like   synchronized(m) { S; }
•Matches intuition of weakly atomic STM
– Transactions are serialized w.r.t. each other
– And, no surprises compared to locks
• STM must provide additional guarantees
– Consistency
– Privatization safety
Consistency
int *ptr = NULL;
int x = 0; int y = 1;
With transactions:
T1:
__tm_atomic {
  int t1 = x;      // == 0
  …
  int t2 = x;      // == 1
  if (t1 != t2)
    *ptr = x;      // NULL POINTER
}
T2:
__tm_atomic {
  x=y;             // == 1
}
With locks:
T1:
lock(mutex);
int t1 = x;
…
int t2 = x;
if (t1 != t2)      // cannot happen
  *ptr = x;
unlock(mutex);
T2:
lock(mutex);
x=y;
unlock(mutex);
Privatization Safety
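With transactions:
T1:
__tm_atomic {
  t1 = head;
  if (t1)
    t1->x = t1->y = 1;
}
T2:
__tm_atomic {
  t2 = head;
  head = t2->next;
  t2->next = NULL;
}
priv = t2->x;
…
assert (priv == t2->y);
With locks:
T1:
lock(mutex);
t1 = head;
if (t1)
  t1->x = t1->y = 1;
unlock(mutex);
T2:
lock(mutex);
t2 = head;
head = t2->next;
t2->next = NULL;
unlock(mutex);
priv = t2->x;
…
assert (priv == t2->y);
[Figure: head, t1 and t2 point to one list node with fields x, y and next; T2 sets next = NULL to privatize the node while T1's writes change x and y from 0 to 1, so T2 can read x == 1 and then y == 0]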
Transactional Execution Modes
Optimistic
– Lock data on write (exclusive write locks)
– Record reads
– Release write locks and validate reads on commit
– Pros: cache effects, no read locking cost
– Cons: providing privatization and consistency incurs extra cost, no filtering
Pessimistic
– Lock data on write (exclusive write locks)
– Lock data on read (shared read locks)
– Release read and write locks on commit
– Pros: privatization-safety and consistency for free, filtering
– Cons: cache effects, additional read locking cost
•Obstinate – pessimistic transaction that wins all conflicts
Write Buffering vs. In-Place Update
Write Buffering (a.k.a. Lazy Versioning)
– Write to private buffer
– Copy to memory on commit
– Lazy Locking (acquire locks on commit) or Eager Locking (acquire locks on access)
– Pros: fast abort
– Cons: slow commit, reads have to search buffer
In-Place Update (a.k.a. Eager Versioning)
– Directly write shared memory
– Record old values in an undo log
– Eager Locking: acquire write-locks on write
– Pros: fast commit, direct reads
– Cons: slow abort
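To make the in-place update trade-off concrete, here is a minimal single-threaded sketch in C of an undo log. The types `UndoEntry` and `UndoLog`, the fixed capacity and the function names are assumptions made for this illustration and are not taken from any particular STM runtime; locking is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAPACITY 64   /* assumed fixed size, for illustration only */

typedef struct { int *addr; int old_value; } UndoEntry;
typedef struct { UndoEntry entries[UNDO_CAPACITY]; size_t count; } UndoLog;

/* In-place update: log the old value, then write shared memory directly. */
void txn_write(UndoLog *log, int *addr, int value) {
    assert(log->count < UNDO_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
    *addr = value;
}

/* Commit is cheap: the new values are already in place. */
void txn_commit(UndoLog *log) {
    log->count = 0;
}

/* Abort is the slow path: replay the log backwards to restore old values. */
void txn_abort(UndoLog *log) {
    while (log->count > 0) {
        log->count--;
        *log->entries[log->count].addr = log->entries[log->count].old_value;
    }
}
```

Reads inside the transaction simply load from memory, which is the "direct reads" advantage; a write-buffering design would instead have to search its private buffer first.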
Conflict Detection Granularity
•object-based (Java/C#)
class Foo { int x; int y; }
[Figure: object layout with vtbl, metadata, x, y; the metadata is stored inside the object]
•word-based, cacheline-based (C/C++)
struct Foo { int x; int y; }
[Figure: struct layout with x, y; metadata entries are held in a separate Owner Table]
Intel C/C++ STM
http://whatif.intel.com (NEW RELEASE IN Q3 2010)
•Based on Intel’s product compiler
•Features
– Consistency and privatization safety preserving closed-nested atomic blocks (__tm_atomic) to support SGLA semantics
– User abort (__tm_abort) for failure atomicity
– Transaction retry (__tm_retry) for condition synchronization
– Multiple transactional execution modes: optimistic and pessimistic STM, obstinate
– Serial execution mode (for I/O and calls to legacy binaries)
– TM support for C++: virtual functions, (multiple) inheritance, function and class templates, exceptions
System Architecture
transactional C/C++
Intel C/C++ compiler
multicore system
C/C++ support
APPLICATION
LANGUAGE SUPPORT
TM RUNTIME
HARDWARE
Runtime Overview
• In-place updates
• Cacheline-level conflict detection granularity
• Information for rollback recorded in undo log
• Reads recorded in read set:
– For validation (optimistic mode)
– For locking/unlocking (pessimistic and obstinate modes)
• Writes recorded in write set for locking/unlocking (all transactional modes)
• Two-phase locking (2PL) protocol
Per thread metadata
•Transaction Descriptor
–Read set: validation or unlocking
–Write set: unlocking
–Undo log: rollback
–… local timestamp, execution mode …
•Transaction Memento
–Checkpoint of machine and transaction state
–For nesting & partial rollback
Transaction Record (TxnRec)
•Tracks transactional state of shared data
–For optimistic transactions (OptTxnRec)
  • Unlocked – contains timestamp (more on this later!)
  • Write-locked – contains transaction descriptor of lock owner
–For pessimistic transactions (PessTxnRec)
  • Unlocked – contains special mark
  • Read-locked – contains info about all readers
  • Write-locked – contains info about single writer
•Stored in the owner table mapping each memory word to a single transaction record
Optimistic STM Algorithm
•Timestamp-based
–Global Timestamp (G_TS): incremented every time a writing transaction commits
– Local Timestamp (L_TS): records last time transaction was valid
–On transactional read of shared data record timestamp associated with its OptTxnRec in the transaction’s read set
–On transaction termination update local timestamps and write them to OptTxnRec-s of all data updated by this transaction
•Validation for serializability and consistency
•Quiescence for privatization safety
Validation
•For every entry in read set, abort transaction if recorded timestamp greater than local timestamp
•Performed on commit to guarantee serializability
•Performed on read to guarantee consistency (when data’s OptTxnRec > local timestamp)
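The validation rule can be sketched in a few lines of C. This sketch assumes that each read-set entry keeps a pointer to the OptTxnRec it read and that an unlocked record can be compared directly as a timestamp; `ReadEntry` and `validate` are hypothetical names, and the lock bit and concurrent updates are ignored for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One read-set entry: points at the OptTxnRec that guarded the read.
   (Assumed layout; a real runtime may also store the value seen.) */
typedef struct { const uintptr_t *txn_rec; } ReadEntry;

/* Returns true if no record in the read set carries a timestamp greater
   than the transaction's local timestamp; false means "abort". */
bool validate(const ReadEntry *read_set, size_t n, uintptr_t local_ts) {
    for (size_t i = 0; i < n; i++) {
        if (*read_set[i].txn_rec > local_ts)
            return false;   /* data was updated after we were last valid */
    }
    return true;
}
```

A transaction would call such a check at commit for serializability, and on a read whose record is newer than its local timestamp for consistency; after a successful check the local timestamp can be advanced.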
Validation
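T1:
__tm_atomic {
  int t1 = x;
  …
  int t2 = x;
  if (t1 != t2)
    *ptr = x;      // NULL POINTER: cannot happen
}
T2:
__tm_atomic {
  x=y;
}
Initially: G_TS = 0; OptTxnRec-s of x and y hold timestamp 0
T1 reads x: L_TS = 0, R_SET = <&x>
T2 reads y and writes x: L_TS = 0, R_SET = <&y>, W_SET = <&x>
T2 commits: G_TS = 1, L_TS = 1, OptTxnRec of x = 1
T1 re-reads x: OptTxnRec of x (1) > L_TS (0), validation fails: ABORT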
Quiescence
•Maintain list of active transactions containing their current local timestamp
•Implicit infinite timestamp for pessimistic transactions
•Committing transaction waits for all active transactions whose timestamp is smaller than its own timestamp
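The waiting condition can be expressed as a simple scan over the active-transaction list. The sketch below is an illustration under assumptions: `ActiveTxn`, `TS_INFINITE` and `must_wait` are hypothetical names, the list is a plain array, and a real runtime would read the timestamps atomically and spin or yield while the function returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pessimistic transactions carry an implicit infinite timestamp. */
#define TS_INFINITE UINT64_MAX

typedef struct { bool active; uint64_t local_ts; } ActiveTxn;

/* True while some other active transaction has a local timestamp smaller
   than the committer's, i.e. the committer must keep waiting. */
bool must_wait(const ActiveTxn *txns, size_t n, size_t self, uint64_t commit_ts) {
    for (size_t i = 0; i < n; i++) {
        if (i != self && txns[i].active && txns[i].local_ts < commit_ts)
            return true;
    }
    return false;
}
```

Because pessimistic transactions are treated as having an infinite timestamp, a committing transaction never waits for them under this rule.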
Quiescence
T1:
__tm_atomic {
  t1 = head;
  if (t1)
    t1->x = t1->y = 1;
}
T2:
__tm_atomic {
  t2 = head;
  head = t2->next;
  t2->next = NULL;
}                          // WAIT
priv = t2->x;
…
assert (priv == t2->y);
[Figure: timeline of G_TS and of the L_TS values of T1 and T2; T2's commit must WAIT for T1, whose L_TS is smaller]
Unified STM
• Both optimistic and pessimistic readers can co-exist
• Owner table is shared and contains both OptTxnRec and PessTxnRec
• Read barriers:
– Optimistic – reads only OptTxnRec
– Pessimistic – reads only PessTxnRec
• Write barriers need to write both TxnRec-s
Owner Table for Unified STM
typedef uintptr_t TxnRec;
typedef struct OwnerTableEntryS {
  TxnRec optimistic;
  TxnRec pessimistic;
} OwnerTableEntry;
[Figure: Owner Table, each entry holding a PessTxnRec and an OptTxnRec]
OptTxnRec
Bit layout: 31 … 1 | 0
•Lock bit (bit 0)
– 0: Write-Locked (Exclusive)
– 1: Unlocked (Shared)
•Upper bits (bits 31 … 1)
– Owner TxnDesc upper bits
– Or timestamp upper bits
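One way to read this layout is as a tagged word. The sketch below is an assumption-laden illustration in C: it stores the timestamp shifted left by one bit and relies on transaction descriptors being at least 2-byte aligned so that bit 0 of a descriptor address is free. The `opt_*` helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t TxnRec;

#define LOCK_BIT ((TxnRec)1)   /* bit 0: 1 = unlocked, 0 = write-locked */

/* Unlocked record: timestamp in the upper bits, lock bit set. */
TxnRec opt_make_unlocked(uintptr_t timestamp) {
    return (timestamp << 1) | LOCK_BIT;
}

/* Write-locked record: the owner's descriptor address, bit 0 clear.
   Assumes descriptors are at least 2-byte aligned. */
TxnRec opt_make_locked(const void *owner_desc) {
    assert(((uintptr_t)owner_desc & LOCK_BIT) == 0);
    return (TxnRec)(uintptr_t)owner_desc;
}

bool opt_is_locked(TxnRec rec)       { return (rec & LOCK_BIT) == 0; }
uintptr_t opt_timestamp(TxnRec rec)  { return rec >> 1; }      /* valid only when unlocked */
void *opt_owner(TxnRec rec)          { return (void *)rec; }   /* valid only when locked */
```

With this encoding a single word load tells a reader both whether the data is write-locked and, if not, which timestamp to record.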
PessTxnRec
Bit layout: 31 … 2 | 1 | 0
•Lock bit (bit 0)
– 0: Write-locked (Exclusive)
– 1: Unlocked (Shared)
•Upgrading bit (bit 1)
– 0: no upgrading request
– 1: upgrading requested
•Owner bits (bits 31 … 2)
– Each bit represents a pessimistic transaction
– Locked if non zero
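The reader bitmap can be manipulated with ordinary bit operations. The following C sketch assumes a 32-bit record, assigns each pessimistic transaction a hypothetical slot number 0..29, and takes the "special mark" of an unlocked record to be the lock bit set with no owner bits; all names are introduced here for illustration, and atomic compare-and-swap updates are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t PessTxnRec;            /* 32-bit layout as on the slide */

#define PESS_LOCK_BIT     0x1u          /* bit 0: 1 = unlocked/shared, 0 = write-locked */
#define PESS_UPGRADE_BIT  0x2u          /* bit 1: 1 = upgrade requested */
#define PESS_OWNER_SHIFT  2             /* bits 31..2: one bit per pessimistic txn */
#define PESS_MAX_OWNERS   30
#define PESS_UNLOCKED     PESS_LOCK_BIT /* assumed "special mark": no owners */

/* Register a reader by setting its owner bit. */
PessTxnRec pess_add_reader(PessTxnRec rec, unsigned slot) {
    assert(slot < PESS_MAX_OWNERS);
    assert(rec & PESS_LOCK_BIT);        /* a write-locked record cannot be shared */
    return rec | (1u << (slot + PESS_OWNER_SHIFT));
}

/* Release a read lock by clearing the reader's owner bit. */
PessTxnRec pess_remove_reader(PessTxnRec rec, unsigned slot) {
    assert(slot < PESS_MAX_OWNERS);
    return rec & ~(1u << (slot + PESS_OWNER_SHIFT));
}

bool pess_is_write_locked(PessTxnRec rec) { return (rec & PESS_LOCK_BIT) == 0; }

bool pess_is_read_locked(PessTxnRec rec) {
    return (rec & PESS_LOCK_BIT) != 0 && (rec >> PESS_OWNER_SHIFT) != 0;
}

bool pess_upgrade_requested(PessTxnRec rec) { return (rec & PESS_UPGRADE_BIT) != 0; }
```

Keeping one bit per reader lets several pessimistic transactions hold the shared lock at once and release it independently.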
Unified STM Algorithm
T1 (PESS):
__tm_atomic {
  r1 = x;
  r3 = x;
}
T2 (OPT):
__tm_atomic {
  r2 = x;
  x = r2 +1;
}
[Figure: owner table entry for x, showing the bit patterns of its PessTxnRec and OptTxnRec as T1 and T2 execute]
Compiler/Runtime Interaction
• Decouple compiler from the runtime
– Enables use of different library implementations with the same compiler (e.g. in-place updates vs. write-buffering)
– Enables use of different algorithms within the library itself (e.g. optimistic vs. pessimistic)
• Calls to the runtime realized through a vtable-like mechanism
• Compiler/runtime ABI:
– General – same code used for different algorithms
– Rich – to enable additional optimizations
ABI: Txn Begin and Commit
_ITM_transaction * _ITM_getTransaction()
– Returns (creates if necessary) a transaction descriptor
uint32 _ITM_beginTransaction(_ITM_transaction* td, uint32 props)
– Saves machine state
– Passes information to runtime via props (e.g. pr_multiwayCode - both instrumented and uninstrumented code is available)
– Can return more than once (e.g. on abort); possible return values: a_saveLiveVariables, a_restoreLiveVariables
void _ITM_commitTransaction(_ITM_transaction *td)
ABI: Read and Write Barriers
• Templates:
void _ITM_Wtypesig(_ITM_transaction* td, type *addr, type val)
type _ITM_Rtypesig(_ITM_transaction* td, type *addr)
typesig:
– U[1248] – unsigned int
– [FDE] – float, double, long…
•Examples:
_ITM_WF(_ITM_transaction *td, float *addr, float val);
_ITM_RU4(_ITM_transaction *td, uint32 *addr);
Simple Atomic Block Translated
__tm_atomic {
  uint32Val = 42;
}
is translated into:
uint32 props = pr_multiwayCode;
_ITM_transaction *td = _ITM_getTransaction();
uint32 doWhat = _ITM_beginTransaction(td, props);   // returns again on ! CONFLICT !
if (doWhat & a_restoreLiveVariables) {
  /* code to restore live local variables */
}
if (doWhat & a_saveLiveVariables) {
  /* code to save live local variables */
}
_ITM_WU4(td, &uint32Val, 42);
_ITM_commitTransaction(td);
User Abort and Retry Translated
__tm_atomic {
  if (cond) __tm_retry;
  if (error) __tm_abort;
  uint32Val = 42;
}
is translated into:
uint32 props = pr_multiwayCode;
_ITM_transaction *td = _ITM_getTransaction();
uint32 doWhat = _ITM_beginTransaction(td, props);
if (doWhat & a_restoreLiveVariables) {
  /* code to restore live local variables */
}
if (doWhat & a_abortTransaction) goto ABORT_TXN;
if (doWhat & a_saveLiveVariables) {
  /* code to save live local variables */
}
if (!_ITM_RU(td, &cond))
  _ITM_abortTransaction(td, userRetry);
if (_ITM_RU(td, &error))
  _ITM_abortTransaction(td, userAbort);
_ITM_WU4(td, &uint32Val, 42);
_ITM_commitTransaction(td);
ABORT_TXN:
Optimizations for Transactions
•Standard optimizations
– Careful IR design enables existing optimizations
  • Partial redundancy elimination (PRE), dead code elimination, …
– Subtle in presence of nesting
•STM-specific optimizations
– No instrumentation when executing in serial mode
– Conversion of generic STM read/write barriers to cheaper variants
– Also:
  • Flattening nested transactions if no user abort is inside
  • Barrier elimination for __thread (thread local) or const data
Un-instrumented Serial Mode
__tm_atomic {
  if (flag) {
    printf("Hello!");
  }
}
is translated into:
uint32 props = pr_multiwayCode;
_ITM_transaction *td = _ITM_getTransaction();
uint32 doWhat = _ITM_beginTransaction(td, props);
if (doWhat & a_restoreLiveVariables) {
  /* code to restore live local variables */
}
if (doWhat & a_saveLiveVariables) {
  /* code to save live local variables */
}
if (doWhat & a_instrumentedCode) {
  if (_ITM_RU4(td, &flag)) {
    _ITM_changeTransactionMode(td, modeSerialIrrevocable);
    printf("Hello!");
  }
} else {
  if (flag) printf("Hello!");
}
_ITM_commitTransaction(td);
ABI: Optimized Barrier Templates
•After read or after write (e.g. eliminate redundant locking operations)
void _ITM_W{aRW}typesig(_ITM_transaction* td, type *addr, type val)
type _ITM_R{aRW}typesig(_ITM_transaction* td, type *addr)
•Read-for-write (e.g. acquire write lock early and eliminate read lock)
type _ITM_RfWtypesig(_ITM_transaction* td, type *addr)
Barrier Optimization Example
Source:
__tm_atomic {
  if (x < N) {
    x++;
  }
}
Generic barriers:
…
t1 = _ITM_RU4(td, &x);
if (t1 < N) {
  t2 = _ITM_RU4(td, &x);
  _ITM_WU4(td, &x, t2+1);
}
…
Redundant read eliminated:
…
t1 = _ITM_RU4(td, &x);
if (t1 < N) {
  _ITM_WU4(td, &x, t1+1);
}
…
Write-after-read barrier:
…
t1 = _ITM_RU4(td, &x);
if (t1 < N) {
  _ITM_WaRU4(td, &x, t1+1);
}
…
Read-for-write and write-after-write barriers:
…
t1 = _ITM_RfWU4(td, &x);
if (t1 < N) {
  _ITM_WaWU4(td, &x, t1+1);
}
…
ABI: Undo and Commit Functions
• Programmers may register actions executed by the runtime on transaction termination
void _ITM_addUserCommitAction(_ITM_transaction *td, _ITM_userCommitFunction fn, _ITM_transactionId tid, void *arg)
void _ITM_addUserUndoAction(_ITM_transaction *td, _ITM_userUndoFunction, void *arg)
• Current transaction id
_ITM_transactionId _ITM_getTransactionId(_ITM_transaction *tid)
(1: non-txn, 2: outer txn begin, ++: inner txn begin)
• Undo and commit actions can be used inside of function wrappers
Transactional Function Wrappers
•Transparently replace a call to a non-transactional function with a call to its transactional version
•Transactional wrapper’s code:
– Un-instrumented
– Can use explicit calls to the runtime
•Intended use - implementation of library functions (e.g. transaction-aware memory management)
__declspec (tm_wrap(foo)) void fooTxn();
Memory Management Risks
•Txn allocation, non-txn de-allocation
– Re-executions leading to multiple allocations but only one de-allocation operation
•Non-txn allocation, txn de-allocation
– Re-executions leading to the same region being de-allocated more than once
•Txn allocation, txn de-allocation
– Combination of two previous cases depending on when re-execution gets triggered
Memory Management Algorithm
•Uses function wrappers mechanism to take advantage of the existing allocators
•Allocation and de-allocation sites marked with tid
•Allocation creates an allocation record
– If allocation record exists on outer commit – remove it
– On abort – de-allocate and remove allocation record
•De-allocation removes allocation record
– De-allocate immediately if txn_id(de-alloc) <= txn_id(alloc)
– Otherwise, de-allocate on commit at the nesting level where condition holds
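The de-allocation rule reduces to a comparison of two transaction ids. The sketch below is a hedged illustration in C; `can_free_now` and `defer_until` are hypothetical helpers, and the ids follow the convention stated for `_ITM_getTransactionId` (1 for non-transactional code, 2 for the outermost transaction, incremented at each inner transaction begin).

```c
#include <assert.h>
#include <stdbool.h>

/* De-allocate immediately if txn_id(de-alloc) <= txn_id(alloc). */
bool can_free_now(unsigned dealloc_txn_id, unsigned alloc_txn_id) {
    return dealloc_txn_id <= alloc_txn_id;
}

/* For a deferred free, returns the largest txn_id at which the condition
   holds: the free runs once commits have brought the current id down to
   this value. For an immediate free it returns the current id unchanged. */
unsigned defer_until(unsigned dealloc_txn_id, unsigned alloc_txn_id) {
    assert(dealloc_txn_id >= 1 && alloc_txn_id >= 1);
    return can_free_now(dealloc_txn_id, alloc_txn_id) ? dealloc_txn_id
                                                      : alloc_txn_id;
}
```

For example, freeing in transaction 3 a block allocated in transaction 2 is deferred until the id is back at 2, because an abort of transaction 3 must not leave the block already released.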
Safe Memory Management
p1 = malloc(size);          // txn_id 1 (non-txn)
__tm_atomic {               // txn_id 2
  p2 = malloc(size);
  __tm_atomic {             // txn_id 3
    free(p2);               // 3 > 2: defer until txn_id <= 2
    p3 = malloc(size);
    p4 = malloc(size);
  }
  free(p1);                 // 2 > 1: defer until txn_id <= 1
  free(p3);                 // 2 < 3: execute
  __tm_atomic {             // txn_id 4
    free(p4);               // 4 > 3: defer until txn_id <= 3
  }
}
[Figure: Allocation Records table listing p1, p2, p3, p4 with the txn_id of each allocation]
Functions Code Generation
•tm_callable
– Generate two copies, instrumented (transactional) and uninstrumented (non-transactional)
•tm_pure
– Only generate uninstrumented code – does not cause transaction to go serial
•tm_unknown
– Switch to serial mode before a call is made inside a transaction
– May be promoted to tm_callable or tm_pure by compiler
Code Generation for tm_callable
__declspec(tm_callable)
int inc (int *p)
{
  return ++(*p);
}
inc:
jmp inc_$nontxn
mov eax, MAGIC
jmp inc_$txn
inc_$nontxn:
…
inc_$txn:
…
Code Generation for tm_pure
__declspec(tm_pure)
int peek(int *p)
{
return *p;
}
peek:
jmp peek_$nontxn
mov eax, MAGIC
jmp peek_$nontxn
peek_$nontxn:
…
Indirect Calls
if (*(fp + MAGIC_OFFSET) == MAGIC) {
call fp + TXN_TWIN_OFFSET;
} else {
switchToSerialMode();
call fp;
}
•No overhead for indirect calls outside of transactions
•Same execution mode available across inheritance hierarchy thanks to virtual function overriding rules
•No annotation on function pointers
– Indirect call to non-recompiled tm_pure function causes switch to serial mode
TM in Real World
• Realistic workloads: STAMP, SPLASH, and PARSEC benchmark suites (fluid dynamics, raytracing, etc.)
• Performance bottlenecks
– Sometimes we use a single global lock (GLOCK) as a baseline
– Bottleneck discovery performed on optimistic STM only
False Conflicts
•Poor scalability due to conflicts -- >90% false conflicts
•The same STM had no problems on SPLASH-2
[Figure: execution time (s) vs. # threads (1, 2, 4, 8) for Genome and Vacation; series: GLOCK, STM]
Mapping to TxnRec-s
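[Figure: 32-bit address with field boundaries at bits 31, 20, 19, 6, 5, 0; the middle field indexes the Ownership Table (entries 0x0000 … 0x3FFF); each entry holds a Transaction Record, with the rest of the entry reserved to avoid cache line ping-ponging]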
•Addresses map to a transaction record via a hash function
• Different addresses can map to the same record
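A hash of this kind can be as simple as a shift and a mask. The C sketch below assumes the bit ranges suggested by the figure (address bits 19..6 as the index, giving 0x4000 entries) and 64-byte cache lines; `owner_index` and the constants are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SHIFT 6        /* assumed 64-byte lines: bits 5..0 are dropped */
#define TABLE_ENTRIES    0x4000u  /* entries 0x0000 .. 0x3FFF */

/* Map an address to an ownership-table index using address bits 19..6. */
uint32_t owner_index(uintptr_t addr) {
    return (uint32_t)((addr >> CACHE_LINE_SHIFT) & (TABLE_ENTRIES - 1));
}
```

Two addresses in the same cache line always share a record, and so do addresses that differ only above bit 19, which is where false conflicts come from: `owner_index(0x100000)` and `owner_index(0x0)` both select entry 0.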
Refined Hash Function
• 4 additional bits to index into transaction record
• Trade-off: fewer false conflicts vs. potentially more cache ping-ponging
[Figure: address with field boundaries at bits 31, 23, 20, 19, 6, 5, 0, mapped to the Ownership Table (entries 0x0000 … 0x3FFF) of Transaction Records]
False Conflicts Reduced
[Figure: execution time (s) vs. # threads (1, 2, 4, 8) for Genome and Vacation; series: GLOCK, STM (old hash), STM (new hash)]
Over-Instrumentation
•Compiler generates more barriers than necessary
– Thread-local memory accesses
– Objects alternating between modification and constant phase
– Constant global objects
Transactional Barrier Counts for STAMP
Benchmark  TxLD (optimal)  TxLD (compiler)  TxST (optimal)  TxST (compiler)  TxLD overhead  TxST overhead
Genome     58,701,959      624,073,490      2,252,291       19,078,705       10.63x         8.60x
Kmeans     86,666,710      255,662,754      86,666,710      86,666,711       2.95x          1.00x
Vacation   785,775,435     925,584,125      26,300,714      122,543,905      1.18x          4.66x
__tm_waiver
•No instrumentation for a block or function marked with __tm_waiver
• Allows incremental optimizations but should be used with caution
__tm_atomic {
  y= ++x;        // instrumented
  __tm_waiver {
    ++local;     // no instrumentation
  }
}
Over-Instrumentation Reduced
•__tm_waiver used for
– thread-local object allocation routines
– quasi-static shared objects
[Figure: execution time (s) vs. # threads (1, 2, 4, 8) for Genome and Vacation; series: GLOCK, STM (new hash), STM (new hash + __tm_waiver)]
Quiescence Overhead
•Only some programs use privatization idiom
•Provide API to let programmer selectively disable privatization safety
[Figure: speedup for sphinx, genome, kmeans, vacation and average; series: 2 threads, 4 threads, 8 threads]
Other Issues
•Small transactions overwhelmed by fixed costs
– Fluidanimate: ~1 load and ~1 store per transaction
– Different code for small transactions
•Atomic blocks make porting of some benchmarks (e.g., BerkeleyDB) difficult but are more amenable to compiler optimizations
•Annotating transactional functions can be a burden (40% of functions in vacation)
•Many workloads require condition synchronization
Finding the Bottlenecks
•Many workloads would not scale at first
•Cumulative stats would shed no light - low contention, no false conflicts, …
•And then we remembered … the devil is in the details …
Per Critical Section Statistics
Transactional Statistics for Sphinx
critical section  tx_begin  commit  abort  abort %  code size (lines)
602               1314      1312    2      0.15%    O(1)
542               222481    221043  1438   0.65%    O(1)
559               220908    220908  0      0.00%    O(1)
601               12306     6194    6112   49.67%   O(1000)
571               42917     42889   28     0.07%    O(1)
588               42770     42770   0      0.00%    O(1)
301               1313      1312    1      0.08%    O(1)
Only critical section 601 suffers from a high abort rate and prevents scaling
Overall Performance
[Figure: speedup of STM vs. single-thread GLOCK at 1, 2, 4 and 8 threads for genome, kmeans/low, kmeans/high, vacation/low, vacation/high, cholesky, fft, lu/cont., lu/non cont., radix, barnes, fmm, ocean/cont., ocean/non cont., radiosity, raytrace, volrend, water-nsquared, water-spatial, fluidanimate]