Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
Multi-core architectures are mainstream
– Software concurrency needed for scalability
– Concurrent programming is hard
– Difficult to reason about shared data
Traditional mechanism: Lock-based Synchronization
– Hard to use
– Must be fine-grain for scalability
– Deadlocks
– Not easily composable
New Solution: Transactional Memory (TM)
– Simpler programming model: Atomicity, Consistency, Isolation
– No deadlocks
– Composability
– Optimistic concurrency
– Analogy
  • GC : Memory allocation ≈ TM : Mutual exclusion
Composability
class Bank {
  ConcurrentHashMap<String, Integer> accounts;
  …
  void deposit(String name, int amount) {
    synchronized (accounts) {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
Thread-safe – but no scaling
• ConcurrentHashMap (Java 5/JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution
class Bank {
  HashMap<String, Integer> accounts;
  …
  void deposit(String name, int amount) {
    atomic {
      int balance = accounts.get(name);  // Get the current balance
      balance = balance + amount;        // Increment it
      accounts.put(name, balance);       // Set the new balance
    }
  }
  …
}
Underlying system provides:
• isolation (thread safety)
• optimistic concurrency
Transactions are Composable
[Figure: Scalability for 10,000,000 operations. Scalability (0 to 4) vs. # of processors (0 to 16); series: Synchronized, Transactional]
Scalability on 16-way 2.2 GHz Xeon System
Our System
A Java Software Transactional Memory (STM) System
– Pure software implementation
– Language extensions in Java
– Integrated with JVM & JIT
Novel Features
– Rich transactional language constructs in Java
– Efficient, first class nested transactions
– RISC-like STM API
– Compiler optimizations
– Per-type word and object level conflict detection
– Complete GC support
System Overview
[Figure: compilation pipeline]
Transactional Java → (Polyglot) → Java + STM API → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code
Runtime components: ORP VM, McRT STM
Transactional Java
Java + new language constructs:
• Atomic: execute block atomically
  • atomic {S}
• Retry: block until alternate path possible
  • atomic {… retry;…}
• Orelse: compose alternate atomic blocks
  • atomic {S1} orelse{S2} … orelse{Sn}
• Tryatomic: atomic with escape hatch
  • tryatomic {S} catch(TxnFailed e) {…}
• When: conditionally atomic region
  • when (condition) {S}
Builds on prior research
• Concurrent Haskell, CAML, CILK, Java
• HPCS languages: Fortress, Chapel, X10
Transactional Java → Java
Transactional Java
atomic {
  S;
}
STM API
• txnStart[Nested]
• txnCommit[Nested]
• txnAbortNested
• txnUserRetry
• ...
Standard Java + STM API
while(true) {
  TxnHandle th = txnStart();
  try {
    S’;
    break;
  } finally {
    if(!txnCommit(th))
      continue;
  }
}
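The retry loop above can be exercised in plain Java with a stubbed runtime. This is a minimal sketch, not the McRT-STM API: `txnStart` and `txnCommit` are hypothetical stand-ins, and the stub is rigged so that the first commit fails. It also omits the rollback of the body's side effects that a real STM performs on abort.

```java
// Hypothetical stand-ins for the STM runtime; these stubs are NOT the McRT-STM API.
class RetryLoopSketch {
    static final class TxnHandle {
        final int attempt;
        TxnHandle(int attempt) { this.attempt = attempt; }
    }

    static int attempts = 0;

    static TxnHandle txnStart() {
        return new TxnHandle(++attempts);
    }

    // Assumption for the demo: commit fails on the first attempt, then succeeds.
    static boolean txnCommit(TxnHandle th) {
        return th.attempt > 1;
    }

    // Re-executes the body until a commit succeeds; returns the number of attempts.
    static int runAtomic(Runnable body) {
        attempts = 0;
        while (true) {
            TxnHandle th = txnStart();
            body.run();                 // stands in for S', the body with barriers
            if (txnCommit(th)) {
                return attempts;
            }
            // Commit failed: loop again, like the `continue` in the slide's loop.
        }
    }

    public static void main(String[] args) {
        int[] bodyRuns = {0};
        int n = runAtomic(() -> bodyRuns[0]++);
        System.out.println("attempts=" + n + ", body runs=" + bodyRuns[0]);
        // prints "attempts=2, body runs=2"
    }
}
```

The sketch uses an explicit `if` instead of `continue` inside `finally`; both re-execute the body after a failed commit.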
JVM STM support
On-demand cloning of methods called inside transactions
Garbage collection support
• Enumeration of refs in read set, write set & undo log
Extra transaction record field in each object
• Supports both word & object granularity
Native method invocation throws exception inside transaction
• Some intrinsic functions allowed
Runtime STM API
• Wrapper around McRT-STM API
• Polyglot / StarJIT automatically generates calls to API
Background: McRT-STM
STM for
• C / C++ (PPoPP 2006)
• Java (PLDI 2006)
• Writes:
  – strict two-phase locking
  – update in place
  – undo on abort
• Reads:
  – versioning
  – validation before commit
• Granularity per type
  – Object-level : small objects
  – Word-level : large arrays
• Benefits
  – Fast memory accesses (no buffering / object wrapping)
  – Minimal copying (no cloning for large objects)
  – Compatible with existing types & libraries
Ensuring Atomicity: Novel Combination
Mode ↓ / Memory Ops →   | Reads                                        | Writes
Pessimistic Concurrency |                                              | + In place updates, + Fast commits, + Fast reads
Optimistic Concurrency  | + Caching effects, + Avoids lock operations  |
Quantitative results in PPoPP’06
McRT-STM: Example
…
atomic {
  B = A + 5;
}
…
…
stmStart();
  temp = stmRd(A);
  stmWr(B, temp + 5);
stmCommit();
…
STM read & write barriers before accessing memory inside transactions
STM tracks accesses & detects data conflicts
Transaction Record
Pointer-sized record per object / word
Two states:
• Shared (low bit is 1)
  – Read-only / multiple readers
  – Value is version number (odd)
• Exclusive
  – Write-only / single owner
  – Value is thread transaction descriptor (4-byte aligned)
Mapping
• Object : slot in object
• Field : hashed index into global record table
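The low-bit encoding can be illustrated with a few lines of Java. This is a sketch under stated assumptions: a `long` stands in for the pointer-sized record, the descriptor address is made up, versions are assumed to advance by 2 so they stay odd (consistent with the 5 → 7 step in the later trace), and `recordIndex` is a hypothetical stand-in for the unspecified hash f(obj.hash, offset).

```java
class TxnRecordSketch {
    // Shared state: the record holds an odd version number (low bit is 1).
    static boolean isShared(long record) {
        return (record & 1L) == 1L;
    }

    // Exclusive state: the record holds an aligned descriptor address (low bit is 0).
    static boolean isExclusive(long record) {
        return !isShared(record);
    }

    // Assumption: versions advance by 2 so that they remain odd.
    static long nextVersion(long version) {
        return version + 2;
    }

    static final int TABLE_SIZE = 1 << 10;   // hypothetical size, a power of two

    // Hypothetical f(obj.hash, offset); the slides do not give the real function.
    static int recordIndex(int objHash, int fieldOffset) {
        int h = objHash * 31 + fieldOffset;
        return h & (TABLE_SIZE - 1);         // always in [0, TABLE_SIZE)
    }

    public static void main(String[] args) {
        long version = 5L;                   // odd, so shared
        long descriptorAddr = 0x1000L;       // made-up 4-byte-aligned address
        System.out.println(isShared(version));           // true
        System.out.println(isExclusive(descriptorAddr)); // true
        System.out.println(nextVersion(version));        // 7
        System.out.println(recordIndex(42, 8));          // some slot in the table
    }
}
```

Because descriptors are at least 4-byte aligned, a single test of the low bit is enough to tell a version number from an owner pointer.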
Transaction Record: Example
Every data item has an associated transaction record
[Figure: two layouts for class Foo { int x; int y; }]
Object granularity: object holds vtbl, an extra transaction record field (TxR), x, y
Word granularity: object holds vtbl, hash, x, y
• Object words hash into table of TxRs (TxR1, TxR2, TxR3, … TxRn)
• Hash is f(obj.hash, offset)
Transaction Descriptor
Descriptor per thread
– Info for version validation, lock release, undo on abort, …
Read and Write set : {<Ti, Ni>}
– Ti: transaction record
– Ni: version number
Undo log : {<Ai, Oi, Vi, Ki>}
– Ai: field / element address
– Oi: containing object (or null for static)
– Vi: original value
– Ki: type tag (for garbage collection)
In atomic region
– Read operation appends to read set
– Write operation appends to write set and undo log
– GC enumerates read/write/undo logs
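A toy, single-threaded model can make the three logs concrete. Everything here is hypothetical and simplified: `Cell` stands for an object with an object-level record, no locking or ownership is modeled, and the undo entry keeps only the target and original value (Oi and Ki are omitted). It is a sketch of the bookkeeping, not the McRT-STM implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Single-threaded toy model; all names are hypothetical and no locking is modeled.
class DescriptorSketch {
    static final class Cell {
        long version = 1;              // odd version number in the record
        int value;
    }

    static final class Entry {         // <Ti, Ni>
        final Cell rec;
        final long version;
        Entry(Cell rec, long version) { this.rec = rec; this.version = version; }
    }

    static final class Undo {          // <Ai, Vi>; Oi and Ki omitted here
        final Cell target;
        final int oldValue;
        Undo(Cell target, int oldValue) { this.target = target; this.oldValue = oldValue; }
    }

    static final class Descriptor {
        final List<Entry> readSet = new ArrayList<>();
        final List<Entry> writeSet = new ArrayList<>();
        final List<Undo> undoLog = new ArrayList<>();

        int read(Cell c) {
            readSet.add(new Entry(c, c.version));      // remember the version seen
            return c.value;
        }

        void write(Cell c, int v) {
            writeSet.add(new Entry(c, c.version));
            undoLog.add(new Undo(c, c.value));         // log the original value
            c.value = v;                               // update in place
        }

        boolean validate() {
            for (Entry e : readSet) {
                if (e.rec.version != e.version) return false;
            }
            return true;
        }

        void abort() {
            for (int i = undoLog.size() - 1; i >= 0; i--) {   // undo newest first
                Undo u = undoLog.get(i);
                u.target.value = u.oldValue;
            }
            clear();
        }

        boolean commit() {
            if (!validate()) { abort(); return false; }
            for (Entry e : writeSet) e.rec.version = e.version + 2;
            clear();
            return true;
        }

        private void clear() { readSet.clear(); writeSet.clear(); undoLog.clear(); }
    }

    static int valueAfterAbort() {
        Cell c = new Cell();
        Descriptor d = new Descriptor();
        d.write(c, 9);
        d.abort();
        return c.value;
    }

    static boolean commitWithStaleRead() {
        Cell c = new Cell();
        Descriptor d = new Descriptor();
        d.read(c);
        c.version += 2;    // simulate another transaction committing a write to c
        return d.commit();
    }

    public static void main(String[] args) {
        System.out.println(valueAfterAbort());      // 0
        System.out.println(commitWithStaleRead());  // false
    }
}
```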
McRT-STM: Example
class Foo { int x; int y; };
Foo bar, foo;
T1:
atomic {
  t = foo.x; bar.x = t;
  t = foo.y; bar.y = t;
}
T2:
atomic {
  t1 = bar.x;
  t2 = bar.y;
}
• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
T1:
stmStart();
  t = stmRd(foo.x); stmWr(bar.x,t);
  t = stmRd(foo.y); stmWr(bar.y,t);
stmCommit();
T2:
stmStart();
  t1 = stmRd(bar.x);
  t2 = stmRd(bar.y);
stmCommit();
• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
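T1:
stmStart();
  t = stmRd(foo.x); stmWr(bar.x,t);
  t = stmRd(foo.y); stmWr(bar.y,t);
stmCommit();
T2:
stmStart();
  t1 = stmRd(bar.x);
  t2 = stmRd(bar.y);
stmCommit();
[Figure: animated trace of T1 and T2]
Initial state: foo = {hdr: 3, x = 9, y = 7}, bar = {hdr: 5, x = 0, y = 0}
1. T1 reads foo.x: Reads <foo, 3>. T2 reads bar.x: Reads <bar, 5>.
2. T1 writes bar.x (x = 9): Writes <bar, 5>, Undo <bar.x, 0>. T2 waits.
3. T1 writes bar.y (y = 7): Undo <bar.y, 0>.
4. T1: Commit. bar's version becomes 7.
5. T2 reads bar.y: Reads <bar, 7>. Validation of <bar, 5> fails: Abort.
• T2 should read [0, 0] or should read [9, 7]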
Early Results: Overhead breakdown
[Figure: STM time breakdown (0% to 100%) for Binary tree, Hashtable, Linked list, Btree; components: Application, TLS access, STM write, STM commit, STM validate, STM read]
Time breakdown on single processor
STM read & validation overheads dominate
Good optimization targets
Leveraging the JIT
StarJIT: High-performance dynamic compiler
• Identifies transactional regions in Java+STM code
• Differentiates top-level and nested transactions
• Inserts read/write barriers in transactional code
• Maps STM API to first class opcodes in STIR
Good compiler representation → greater optimization opportunities
Representing Read/Write Barriers
atomic {
  a.x = t1
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    a.z = t3
  }
}
…
stmWr(&a.x, t1)
stmWr(&a.y, t2)
if(stmRd(&a.z) == 0) {
  stmWr(&a.x, 0)
  stmWr(&a.z, t3)
}
Traditional barriers hide redundant locking/logging
An STM IR for Optimization
Redundancies exposed:
atomic {
  a.x = t1
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    a.z = t3
  }
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnOpenForWrite(a)
txnLogObjectInt(&a.y, a)
a.y = t2
txnOpenForRead(a)
if(a.z == 0) {
  txnOpenForWrite(a)
  txnLogObjectInt(&a.x, a)
  a.x = 0
  txnOpenForWrite(a)
  txnLogObjectInt(&a.z, a)
  a.z = t3
}
Optimized Code
atomic {
  a.x = t1
  a.y = t2
  if(a.z == 0) {
    a.x = 0
    a.z = t3
  }
}
txnOpenForWrite(a)
txnLogObjectInt(&a.x, a)
a.x = t1
txnLogObjectInt(&a.y, a)
a.y = t2
if(a.z == 0) {
  a.x = 0
  txnLogObjectInt(&a.z, a)
  a.z = t3
}
Fewer & cheaper STM operations
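The effect of exposing open and log operations to the optimizer can be imitated with a toy pass over a flat list of operations. This is only a sketch: the `Op` representation and the `eliminate` pass are hypothetical, not StarJIT's IR or algorithm, and the pass assumes every operation is dominated by the ones before it (true for this example, where the `if` body follows the earlier statements). It drops an open-for-write on an already-open object, an open-for-read on an object already open for write, and a repeated log of the same address.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy redundancy elimination; the Op model and pass are hypothetical, not StarJIT's.
class BarrierElimSketch {
    enum Kind { OPEN_READ, OPEN_WRITE, LOG, OTHER }

    static final class Op {
        final Kind kind;
        final String target;   // object for opens, address for logs, unused otherwise
        final String text;
        Op(Kind kind, String target, String text) {
            this.kind = kind; this.target = target; this.text = text;
        }
    }

    // Assumes each op is dominated by all earlier ops in the list.
    static List<String> eliminate(List<Op> ops) {
        Set<String> openRead = new HashSet<>();
        Set<String> openWrite = new HashSet<>();
        Set<String> logged = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (Op op : ops) {
            boolean redundant;
            switch (op.kind) {
                case OPEN_WRITE: redundant = !openWrite.add(op.target); break;
                case OPEN_READ:
                    redundant = openWrite.contains(op.target) || !openRead.add(op.target);
                    break;
                case LOG: redundant = !logged.add(op.target); break;
                default: redundant = false;
            }
            if (!redundant) out.add(op.text);
        }
        return out;
    }

    // The unoptimized sequence from the "STM IR for Optimization" slide.
    static List<Op> slideExample() {
        List<Op> ops = new ArrayList<>();
        ops.add(new Op(Kind.OPEN_WRITE, "a", "txnOpenForWrite(a)"));
        ops.add(new Op(Kind.LOG, "a.x", "txnLogObjectInt(&a.x, a)"));
        ops.add(new Op(Kind.OTHER, "", "a.x = t1"));
        ops.add(new Op(Kind.OPEN_WRITE, "a", "txnOpenForWrite(a)"));
        ops.add(new Op(Kind.LOG, "a.y", "txnLogObjectInt(&a.y, a)"));
        ops.add(new Op(Kind.OTHER, "", "a.y = t2"));
        ops.add(new Op(Kind.OPEN_READ, "a", "txnOpenForRead(a)"));
        ops.add(new Op(Kind.OTHER, "", "if(a.z == 0) {"));
        ops.add(new Op(Kind.OPEN_WRITE, "a", "txnOpenForWrite(a)"));
        ops.add(new Op(Kind.LOG, "a.x", "txnLogObjectInt(&a.x, a)"));
        ops.add(new Op(Kind.OTHER, "", "a.x = 0"));
        ops.add(new Op(Kind.OPEN_WRITE, "a", "txnOpenForWrite(a)"));
        ops.add(new Op(Kind.LOG, "a.z", "txnLogObjectInt(&a.z, a)"));
        ops.add(new Op(Kind.OTHER, "", "a.z = t3"));
        ops.add(new Op(Kind.OTHER, "", "}"));
        return ops;
    }

    public static void main(String[] args) {
        for (String line : eliminate(slideExample())) System.out.println(line);
    }
}
```

On this input the pass keeps one open-for-write and three log operations, the same set of STM operations as the optimized code on the slide.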
Compiler Optimizations for Transactions
Standard optimizations
• CSE, Dead-code-elimination, …
• Careful IR representation exposes opportunities and enables optimizations with almost no modifications
• Subtle in presence of nesting
STM-specific optimizations
• Immutable field / class detection & barrier removal (vtable/String)
• Transaction-local object detection & barrier removal
• Partial inlining of STM fast paths to eliminate call overhead
Experiments
16-way 2.2 GHz Xeon with 16 GB shared memory
• L1: 8KB, L2: 512 KB, L3: 2MB, L4: 64MB (per four)
Workloads
• Hashtable, Binary tree, OO7 (OODBMS)
  – Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  – Word / mixed where beneficial
Effectiveness of Compiler Optimizations
1P overheads over thread-unsafe baseline
Prior STMs typically incur ~2x on 1P
With compiler optimizations:
- < 40% over no concurrency control
- < 30% over synchronization
[Figure: % overhead on 1P (0% to 90%) for HashMap and TreeMap; series: Synchronized, No STM Opt, +Base STM Opt, +Immutability, +Txn Local, +Fast Path Inlining]
Scalability: Java HashMap Shootout
Unsafe (java.util.HashMap)
• Thread-unsafe w/o Concurrency Control
Synchronized
• Coarse-grain synchronization via SynchronizedMap wrapper
Concurrent (java.util.concurrent.ConcurrentHashMap)
• Multi-year effort: JSR 166 -> Java 5
• Optimized for concurrent gets (no locking)
• For updates, divides bucket array into 16 segments (size / locking)
Atomic
• Transactional version via “AtomicMap” wrapper
Atomic Prime
• Transactional version with minor hand optimization
• Tracks size per segment à la ConcurrentHashMap
Execution
• 10,000,000 operations / 200,000 elements
• Defaults: load factor, threshold, concurrency level
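For orientation, the two thread-safe baselines can be set up with standard JDK classes. The harness below is a hypothetical sketch, not the benchmark used in the talk: `runMix` and its parameters are made up, the operation counts are scaled down, and the transactional "Atomic" variants are left out because they need the STM system described here.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical mini-harness; not the benchmark from the talk.
class MapShootoutSketch {
    // Pre-populates `keys` entries, then runs a get/update mix on several threads.
    // Returns the elapsed time in nanoseconds.
    static long runMix(Map<Integer, Integer> map, int threads, int opsPerThread,
                       int getPercent, int keys) {
        for (int k = 0; k < keys; k++) map.put(k, k);
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int i = 0; i < opsPerThread; i++) {
                    int key = rnd.nextInt(keys);
                    if (rnd.nextInt(100) < getPercent) {
                        map.get(key);
                    } else {
                        map.put(key, i);   // in-place update of an existing key
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Synchronized: coarse-grain lock around a plain HashMap.
        Map<Integer, Integer> sync = Collections.synchronizedMap(new HashMap<>());
        // Concurrent: the JSR 166 map.
        Map<Integer, Integer> conc = new ConcurrentHashMap<>();
        System.out.println("synchronized ns: " + runMix(sync, 4, 100_000, 80, 20_000));
        System.out.println("concurrent   ns: " + runMix(conc, 4, 100_000, 80, 20_000));
    }
}
```

A plain `java.util.HashMap` is deliberately not run with multiple threads here, since it is not safe for concurrent updates.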
Scalability: 100% Gets
Atomic wrapper is competitive with ConcurrentHashMap
Effect of compiler optimizations scales
[Figure: speedup over 1P Unsafe (0 to 16) vs. # of processors (0 to 16); series: Unsafe, Synchronized, Concurrent, Atomic (No Opt), Atomic]
Scalability: 20% Gets / 80% Updates
ConcurrentHashMap thrashes on 16 segments
Atomic still scales
[Figure: speedup over 1P Unsafe (0 to 16) vs. # of processors (0 to 16); series: Synchronized, Concurrent, Atomic (No Opt), Atomic]
20% Inserts and Removes
Atomic conflicts on entire bucket array
- The array is an object
[Figure: speedup over 1P Unsafe (0 to 3) vs. # of processors (0 to 16); series: Synchronized, Concurrent, Atomic]
20% Inserts and Removes: Word-Level
We still conflict on the single size field in java.util.HashMap
[Figure: speedup over 1P Unsafe (0 to 3) vs. # of processors (0 to 16); series: Synchronized, Concurrent, Object Atomic, Word Atomic]
20% Inserts and Removes: Atomic Prime
Atomic Prime tracks size / segment – lowering bottleneck
No degradation, modest performance gain
[Figure: speedup over 1P Unsafe (0 to 3) vs. # of processors (0 to 16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime]
20% Inserts and Removes: Mixed-Level
Mixed-level preserves wins & reduces overheads
- word-level for arrays
- object-level for non-arrays
[Figure: speedup over 1P Unsafe (0 to 3) vs. # of processors (0 to 16); series: Synchronized, Concurrent, Object Atomic, Word Atomic, Word Atomic Prime, Mixed Atomic Prime]
Scalability: java.util.TreeMap
[Figure, two panels: scalability vs. # of processors (0 to 16)]
• 100% Gets: scalability 0 to 16; series: Unsafe, Synchronized, Atomic
• 80% Gets: scalability 0 to 1.2; series: Synchronized, Atomic, Atomic Prime
Results similar to HashMap
Scalability: OO7 – 80% Reads
“Coarse” atomic is competitive with medium-grain synchronization
Operations & traversal over synthetic database
[Figure: scalability (0 to 6) vs. # of processors (0 to 16); series: Atomic, Synch (Coarse), Synch (Med.), Synch (Fine)]
Key Takeaways
Optimistic reads + pessimistic writes is nice sweet spot
Compiler optimizations significantly reduce STM overhead- 20-40% over thread-unsafe
- 10-30% over synchronized
Simple atomic wrappers sometimes good enough
Minor modifications give competitive performance to complex fine-grain synchronization
Word-level contention is crucial for large arrays
Mixed contention provides best of both
Research challenges
Performance
– Compiler optimizations
– Hardware support
– Dealing with contention
Semantics
– I/O & communication
– Strong atomicity
– Nested parallelism
– Open transactions
Debugging & performance analysis tools
System integration
Conclusions
Rich transactional language constructs in Java
Efficient, first class nested transactions
RISC-like STM API
Compiler optimizations
Per-type word and object level conflict detection
Complete GC support