Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon
Programming Systems Lab
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
2
Motivation
Locks are hard to get right
• Programmability vs scalability
Transactional memory is an appealing alternative
• Simpler programming model
• Stronger guarantees
  – Atomicity, Consistency, Isolation
  – Deadlock avoidance
• Closer to programmer intent
• Scalable implementations
Questions
• How to lower TM overheads, particularly in software?
• How to balance granularity / scalability?
3
Our System
Java Software Transactional Memory (STM) System
– Pure software implementation (McRT-STM – PPoPP ’06)
– Language extensions in Java (Polyglot)
– Integrated with JVM & JIT (ORP & StarJIT)
Novel Features
– Rich transactional language constructs in Java
– Efficient, first class nested transactions
– Complete GC support
– RISC-like STM API / IR
– Compiler optimizations
– Per-type word and object level conflict detection
4
Transactional Java → Java
Transactional Java
atomic {
S;
}
Other Language Constructs
• Built on prior research
  – retry (STM Haskell, …)
  – orelse (STM Haskell)
  – tryatomic (Fortress)
  – when (X10, …)
Standard Java + STM API
while(true) {
  TxnHandle th = txnStart();
  try {
    S';
    break;
  } finally {
    if(!txnCommit(th))
      continue;
  }
}
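The control flow of this translation can be tried out against a toy runtime. The sketch below is only an illustration: the names txnStart, txnCommit and TxnHandle come from the slide, but the class ToyStm, the injected commit failures, and passing S' as a Runnable are assumptions made here. The toy does no logging or rollback, so the body's side effects are simply repeated on each retry.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy stand-in for the STM runtime. Only the names txnStart, txnCommit and
// TxnHandle come from the slide; the behavior below is an assumption.
final class ToyStm {
    static final class TxnHandle {
        final int attempt;
        TxnHandle(int attempt) { this.attempt = attempt; }
    }

    private final int failuresToInject;   // hypothetical: number of forced commit failures
    private final AtomicInteger attempts = new AtomicInteger();

    ToyStm(int failuresToInject) { this.failuresToInject = failuresToInject; }

    TxnHandle txnStart() { return new TxnHandle(attempts.incrementAndGet()); }

    // Pretend that validation fails for the first `failuresToInject` attempts.
    boolean txnCommit(TxnHandle th) { return th.attempt > failuresToInject; }

    int attempts() { return attempts.get(); }

    // The slide's translation of `atomic { S; }`, with S' passed in as a Runnable.
    void atomic(Runnable body) {
        while (true) {
            TxnHandle th = txnStart();
            try {
                body.run();
                break;            // takes effect only if the commit below succeeds
            } finally {
                if (!txnCommit(th))
                    continue;     // failed commit overrides the break and retries
            }
        }
    }
}
```

With `new ToyStm(2)`, a call to `atomic(...)` runs its body three times: two attempts fail to commit and the third succeeds.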
5
Tight integration with JVM & JIT
StarJIT & ORP
• On-demand cloning of methods (Harris ’03)
• Identifies transactional regions in Java+STM code
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in StarJIT IR (STIR)
Good compiler representation →
greater optimization opportunities
6
Representing Read/Write Barriers
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
…
stmWr(&a.x, t1)
stmWr(&a.y, t2)
if(stmRd(&a.z) == 0) {
  stmWr(&a.x, 0)
  stmWr(&a.z, t3)
}
Traditional barriers hide redundant locking/logging
7
An STM IR for Optimization
Redundancies exposed:
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnOpenForWrite(a)
txnLogObjectInt(&a.y, a)
a.y = t2
txnOpenForRead(a)
if(a.z == 0) {
  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = 0
  txnOpenForWrite(a)
  txnLogObjectInt(&a.z, a)
  a.z = t3
}
8
Optimized Code
atomic {
a.x = t1
a.y = t2
if(a.z == 0) {
a.x = 0
a.z = t3
}
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnLogObjectInt(&a.y, a)
a.y = t2
if(a.z == 0) {
  a.x = 0
  txnLogObjectInt(&a.z, a)
  a.z = t3
}
Fewer & cheaper STM operations
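One way to see the saving is to count operations for the atomic block on the slide. The sketch below is a rough illustration, not StarJIT output: the three txn* methods are hypothetical Java stand-ins for the IR opcodes (Java has no `&a.x`, so the field is named by a string), and each call only bumps a counter.

```java
final class BarrierCountDemo {
    static final class A { int x, y, z; }

    // Counters standing in for the real cost of each STM operation.
    static int opens, logs;

    // Hypothetical stand-ins for the STM IR opcodes named on the slides.
    static void txnOpenForWrite(Object o) { opens++; }
    static void txnOpenForRead(Object o)  { opens++; }
    static void txnLogObjectInt(Object o, String field) { logs++; }

    // Decomposed barriers before redundancy elimination.
    static int unoptimized(A a, int t1, int t2, int t3) {
        opens = 0; logs = 0;
        txnOpenForWrite(a); txnLogObjectInt(a, "x"); a.x = t1;
        txnOpenForWrite(a); txnLogObjectInt(a, "y"); a.y = t2;
        txnOpenForRead(a);
        if (a.z == 0) {
            txnOpenForWrite(a); txnLogObjectInt(a, "x"); a.x = 0;
            txnOpenForWrite(a); txnLogObjectInt(a, "z"); a.z = t3;
        }
        return opens + logs;
    }

    // After removing repeated opens of `a` and the second undo-log entry for a.x.
    static int optimized(A a, int t1, int t2, int t3) {
        opens = 0; logs = 0;
        txnOpenForWrite(a);
        txnLogObjectInt(a, "x"); a.x = t1;
        txnLogObjectInt(a, "y"); a.y = t2;
        if (a.z == 0) {
            a.x = 0;
            txnLogObjectInt(a, "z"); a.z = t3;
        }
        return opens + logs;
    }
}
```

When the branch is taken, the unoptimized sequence performs 9 STM operations and the optimized one 4, with the same final field values.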
9
Compiler Optimizations for Transactions
Standard optimizations
• CSE, Dead-code-elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in presence of nesting
STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
10
McRT-STM
PPoPP 2006 (Saha et al.)
• C / C++ STM
• Pessimistic Writes:
  – strict two-phase locking
  – update in place
  – undo on abort
• Optimistic Reads:
  – versioning
  – validation before commit
• Benefits
  – Fast memory accesses (no buffering / object wrapping)
  – Minimal copying (no cloning for large objects)
  – Compatible with existing types & libraries
Similar STMs: Ennals (FastSTM), Harris et al. (PLDI ’06)
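The "update in place, undo on abort" half of this design can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not McRT-STM code: the classes Cell and Txn are hypothetical, and locking and read versioning are left out entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class UndoLogDemo {
    // Hypothetical shared location; real STM data would be ordinary object fields.
    static final class Cell {
        int value;
        Cell(int value) { this.value = value; }
    }

    static final class Txn {
        // Each entry restores one overwritten value; newest entry is on top.
        private final Deque<Runnable> undoLog = new ArrayDeque<>();

        // Record the old value, then update in place (no write buffering).
        void write(Cell c, int newValue) {
            final int old = c.value;
            undoLog.push(() -> c.value = old);
            c.value = newValue;
        }

        // Roll back in reverse order so the oldest value wins.
        void abort() {
            while (!undoLog.isEmpty()) undoLog.pop().run();
        }

        // Writes are already in memory, so commit only discards the log.
        void commit() { undoLog.clear(); }
    }
}
```

Because writes land directly in memory, reads inside the transaction need no lookup in a write buffer, which is the "fast memory accesses" benefit listed above; the price is the work done in `abort`.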
11
STM Data Structures
Per-thread:
• Transaction Descriptor
  – Per-thread info for version validation, acquired locks, rollback
  – Maintained in Read / Write / Undo logs
• Transaction Memento
  – Checkpoint of logs for nesting / partial rollback
Per-data:
• Transaction Record
  – Pointer-sized field guarding a set of shared data
  – Transactional state of data
    • Shared: Version number (odd)
    • Exclusive: Owner’s transaction descriptor (even / aligned)
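The odd/even encoding of the transaction record can be illustrated with a single atomic word. The sketch below is an assumption-laden toy: the class TxnRecord and its method names are made up here, an even `long` token stands in for the aligned pointer to the owner's descriptor, and bumping the version by 2 on release is one plausible choice that keeps versions odd.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy transaction record: odd word = shared (version number),
// even word = exclusively owned (token standing in for a descriptor pointer).
final class TxnRecord {
    private final AtomicLong word = new AtomicLong(1L);   // starts shared at version 1

    static boolean isShared(long w) { return (w & 1L) == 1L; }

    long read() { return word.get(); }

    // Writer: swing the record from an observed version to an even owner token.
    boolean tryAcquire(long observedVersion, long ownerToken) {
        if (!isShared(observedVersion) || isShared(ownerToken))
            throw new IllegalArgumentException("version must be odd, owner token even");
        return word.compareAndSet(observedVersion, ownerToken);
    }

    // Writer: publish a new odd version when the lock is released.
    void release(long versionBeforeAcquire) { word.set(versionBeforeAcquire + 2); }

    // Reader: a logged version is still valid only if the record is unchanged.
    boolean validate(long loggedVersion) { return word.get() == loggedVersion; }
}
```

A reader that logged version 1 fails validation both while a writer owns the record and after the writer releases it at version 3, which is what forces the reading transaction to abort or retry.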
12
Mapping Data to Transaction Record
Every data item has an associated transaction record
Object granularity: transaction record embedded in object
class Foo { int x; int y; }
Object layout: vtbl | TxR | x | y
Word granularity: object words hash into table of TxRs
class Foo { int x; int y; }
Object layout: vtbl | hash | x | y
Table of TxRs: TxR1, TxR2, TxR3, …, TxRn
Hash is f(obj.hash, offset)
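The slide only states that the hash is some f(obj.hash, offset). A minimal sketch of such a mapping follows; the mixing function, the power-of-two table size, and the class name TxnRecordTable are all assumptions chosen for illustration, not the function used by the actual system.

```java
final class TxnRecordTable {
    private final int size;   // number of transaction records; must be a power of two

    TxnRecordTable(int size) {
        if (size <= 0 || (size & (size - 1)) != 0)
            throw new IllegalArgumentException("size must be a power of two");
        this.size = size;
    }

    // Hypothetical f(obj.hash, offset): mix the two inputs, then mask into the table.
    int indexFor(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        h ^= (h >>> 16);          // fold high bits into the low bits
        return h & (size - 1);    // always in [0, size)
    }
}
```

Two fields of the same object usually land on different records, so writers to `x` and `y` need not conflict, while unrelated words can still collide on one record because the table is finite.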
13
Granularity of Conflict Detection
Object-level
• Cheaper operation
• Exposes CSE opportunities
• Lower overhead on 1P
Word-level
• Reduces false sharing
• Better scalability
Mix & Match
• Per type basis
• E.g., word-level for arrays, object-level for non-arrays
// Thread 1
a.x = …
a.y = …
// Thread 2
… = … a.z …
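The two-thread example above can be turned into a small check of when a conflict is reported. This is a simplified model, assuming each access is named by an object label and a field label (both hypothetical) and ignoring hash collisions in the word-level table.

```java
import java.util.HashSet;
import java.util.Set;

final class GranularityDemo {
    enum Granularity { OBJECT, WORD }

    // The key identifies which transaction record guards an access:
    // the whole object at object level, a single field at word level.
    static String recordKey(String obj, String field, Granularity g) {
        return g == Granularity.OBJECT ? obj : obj + "." + field;
    }

    // True if any read by one thread is guarded by a record the other thread wrote.
    static boolean conflicts(String[][] writes, String[][] reads, Granularity g) {
        Set<String> written = new HashSet<>();
        for (String[] w : writes) written.add(recordKey(w[0], w[1], g));
        for (String[] r : reads)
            if (written.contains(recordKey(r[0], r[1], g))) return true;
        return false;
    }
}
```

With Thread 1 writing `a.x` and `a.y` and Thread 2 reading `a.z`, object-level detection reports a conflict even though no field is shared, while word-level detection does not.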
14
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four)
Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
  – Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  – Word / mixed where beneficial
15
Effectiveness of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
- < 40% over no concurrency control
- < 30% over synchronization
[Figure: bar chart. y-axis: % overhead on 1P (0% to 90%); groups: HashMap, TreeMap; series: Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining]
16
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe w/o concurrency control
Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 -> Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)
Atomic
• Transactional version via “AtomicMap” wrapper
Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap
Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
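The "tracks size per segment" idea behind Atomic Prime can be sketched as a striped counter. The class SegmentedSize and its method names are hypothetical, and the sketch is not thread-safe on its own: in the setting of these slides the updates would run inside atomic blocks, so two transactions touching different segments would not conflict on a single shared size field.

```java
// Hypothetical striped size counter: one count per segment instead of one global field.
final class SegmentedSize {
    private final int[] sizes;

    SegmentedSize(int segments) { sizes = new int[segments]; }

    // Pick a segment from the key's hash (sign bit cleared to keep the index non-negative).
    private int segmentFor(Object key) {
        return (key.hashCode() & 0x7fffffff) % sizes.length;
    }

    void onInsert(Object key) { sizes[segmentFor(key)]++; }

    void onRemove(Object key) { sizes[segmentFor(key)]--; }

    // The total is computed on demand by summing all segments.
    int size() {
        int total = 0;
        for (int n : sizes) total += n;
        return total;
    }
}
```

The trade-off is that `size()` now reads every segment, which is acceptable when inserts and removes are far more frequent than size queries.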
17
Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales
[Figure: line chart. x-axis: # of processors (0 to 16); y-axis: speedup over 1P Unsafe (0 to 16); series: Unsafe, Synchronized, Concurrent, Atomic (No Opt), Atomic]
18
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
[Figure: line chart. x-axis: # of processors (0 to 16); y-axis: speedup over 1P Unsafe (0 to 16); series: Synchronized, Concurrent, Atomic (No Opt), Atomic]
19
20% Inserts and Removes
Atomic conflicts on entire bucket array
- The array is an object
[Figure: line chart. x-axis: # of processors (0 to 16); y-axis: speedup over 1P Unsafe (0 to 3); series: Synchronized, Concurrent, Atomic]
20
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
[Figure: line chart. x-axis: # of processors (0 to 16); y-axis: speedup over 1P Unsafe (0 to 3); series: Synchronized, Concurrent, Object Atomic, Word Atomic]
21
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size / segment – lowering bottleneck
No degradation, modest performance gain
[Figure: line chart. x-axis: # of processors (0 to 16); y-axis: speedup over 1P Unsafe (0 to 3); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime]
22
20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads
- word-level for arrays
- object-level for non-arrays
[Figure: line chart. x-axis: # of processors (0 to 16); y-axis: speedup over 1P Unsafe (0 to 3); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, Mixed Atomic Prime]
23
Key Takeaways
Optimistic reads + pessimistic writes is a nice sweet spot
Compiler optimizations significantly reduce STM overhead
- 20-40% over thread-unsafe
- 10-30% over synchronized
Simple atomic wrappers are sometimes good enough
Minor modifications give performance competitive with complex fine-grain synchronization
Word-level conflict detection is crucial for large arrays
Mixed-level conflict detection provides the best of both
24
Novel Contributions
Rich transactional language constructs in Java
Efficient, first class nested transactions
Complete GC support
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection