Toward High Performance Nonblocking Software Transactional Memory
Virendra J. Marathe
University of Rochester
Mark Moir
Sun Microsystems Labs
2
Nonblocking Progress & Transactional Memory
Nonblocking Progress – arbitrary delays in some threads do not prevent others from making forward progress
TM research began for nonblocking concurrent algorithms [Herlihy & Moss, ISCA ’93]
Early software TMs (STMs) were nonblocking, but slow
Recent shift toward blocking STMs
Significant performance improvements
General argument – nonblocking STMs are fundamentally slow
We were not convinced
3
Agenda
Why is nonblocking progress important?
Background on STM Implementations
What makes nonblocking STMs slow?
Making nonblocking STMs fast
Experimental Results
Conclusions
4
The Virtues of Nonblocking Progress
Tolerance of arbitrary delays due to preemption, page faults, and thread faults
External scheduler support mitigates some problems, but it is not portable
Ideally, contain the problem within the STM
Environments where blocking is unacceptable, e.g. TxLinux interrupt handler transactions
6
STM Implementations
Transactions execute speculatively
Reads and writes use STM metadata
Speculative writes typically acquire ownership of locations (using atomic ops, e.g. CAS)
Reads are typically logged in a private read set for validation at commit time
Post-commit/abort cleanup:
Make speculative updates non-speculative, or roll back speculative updates
Release ownership of locations
This forces waiting in blocking STMs
7
STM Implementations
Two types of implementations for speculative writes:
Redo log: writes are made to a private buffer and flushed out on commit; ownership acquisition can be done at first write (eager acquire) or at commit time (lazy acquire)
Undo log: writes are made directly to memory (needs eager acquire), old values are logged in a private buffer, and old values are restored in case of an abort
Read set validation to ensure isolation
Several schemes (e.g. incremental, commit counter, timestamp)
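The two write-logging styles above can be sketched in single-threaded C. This is only an illustration: the names (log_entry, txn_log, redo_write, undo_write and so on) and the fixed-size log are assumptions made for the sketch, not the authors' code, and ownership acquisition and validation are left out.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 16  /* hypothetical fixed log capacity */

typedef struct { long *addr; long val; } log_entry;
typedef struct { log_entry e[LOG_CAP]; size_t n; } txn_log;

/* Redo log: buffer the new value privately; memory is untouched until commit. */
int redo_write(txn_log *log, long *addr, long val) {
    for (size_t i = 0; i < log->n; i++)
        if (log->e[i].addr == addr) { log->e[i].val = val; return 0; }
    if (log->n == LOG_CAP) return -1;
    log->e[log->n++] = (log_entry){ addr, val };
    return 0;
}

/* Redo log commit: flush the buffered values out to memory. */
void redo_commit(txn_log *log) {
    for (size_t i = 0; i < log->n; i++) *log->e[i].addr = log->e[i].val;
    log->n = 0;
}

/* Redo log abort: discard the buffer; memory was never modified. */
void redo_abort(txn_log *log) { log->n = 0; }

/* Undo log: save the old value, then write directly to memory. */
int undo_write(txn_log *log, long *addr, long val) {
    if (log->n == LOG_CAP) return -1;
    log->e[log->n++] = (log_entry){ addr, *addr };
    *addr = val;
    return 0;
}

/* Undo log abort: restore the old values, newest entry first. */
void undo_abort(txn_log *log) {
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].val;
    }
}

/* Undo log commit: memory already holds the new values. */
void undo_commit(txn_log *log) { log->n = 0; }

/* Small walk-through of both styles on one variable. */
void logs_demo(void) {
    long x = 1;
    txn_log log = { .n = 0 };
    redo_write(&log, &x, 5);
    assert(x == 1);          /* redo: write is still private */
    redo_commit(&log);
    assert(x == 5);          /* redo: visible after commit */
    undo_write(&log, &x, 9);
    assert(x == 9);          /* undo: write is visible immediately */
    undo_abort(&log);
    assert(x == 5);          /* undo: abort restores the old value */
}
```

The sketch shows why the undo log needs eager acquire: the new value is in shared memory as soon as undo_write returns, so the location must already be owned.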
9
What makes nonblocking STMs slow?
Blocking STMs: a transaction waits for a conflicting transaction that is in its post-commit/abort cleanup phase
Nonblocking STMs avoid waiting with:
Indirection (object-based STMs)
Copying and cloning
Helping
Stealing (Harris & Fraser; also our approach)
These techniques usually add overhead in the (contention-free) common case
10
What makes blocking STMs fast?
Significantly less overhead in the common case
Simple metadata structure
Streamlined fast path
Performance optimizations, e.g. timestamp-based validation
We need to incorporate all of these features in a nonblocking STM to make it competitive
12
Our Contributions
Keep the common case simple
Resort to the complicated case only when cleanup is delayed
More streamlined common case execution path
Incorporate recent optimizations (timestamp-based validation)
13
STM Data Structures
Word-based STM
Conflict detection at the granularity of contiguous blocks of memory
Appropriate for unmanaged languages such as C and C++
A table of ownership records (orecs)
Each heap location hashes into a single orec
Each orec indicates whether it is currently owned or free, and identifies the owner
Transaction descriptor
Read set
Write set (redo log): a 2D list in which each row corresponds to an acquired orec
Status: Active, Aborted, or Committed
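A minimal C sketch of the orec table and the address-to-orec mapping described above. The table size, the 8-byte block size, the hash function, and all identifiers are assumptions chosen for illustration; the slides do not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT  1024u  /* hypothetical table size (power of two) */
#define BLOCK_SHIFT 3u     /* hypothetical: 8-byte conflict-detection blocks */

typedef enum { TXN_ACTIVE, TXN_ABORTED, TXN_COMMITTED } txn_status;

/* Transaction descriptor; read set and write set are omitted in this sketch. */
typedef struct txn {
    txn_status status;
} txn;

/* An ownership record: a NULL owner means the orec is free. */
typedef struct { txn *owner; } orec;

orec orec_table[OREC_COUNT];

/* Map a heap address to its orec: drop the low bits so every address in the
   same block maps to the same orec, then wrap into the table. */
orec *orec_for(const void *addr) {
    uintptr_t block = (uintptr_t)addr >> BLOCK_SHIFT;
    return &orec_table[block & (OREC_COUNT - 1u)];
}

void orec_demo(void) {
    _Alignas(8) char buf[16];
    assert(orec_for(&buf[0]) == orec_for(&buf[7]));  /* same 8-byte block */
    assert(orec_for(&buf[0]) != orec_for(&buf[8]));  /* adjacent block */
}
```

Because many addresses share one orec, two transactions can conflict on an orec without touching the same word; that false sharing is the price of a fixed-size table.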
14
Common Case Execution
The algorithm behaves like a blocking STM in the absence of contention
Log the transaction's reads and writes
Acquire ownership of write set locations via their orecs
Ensure that reads are still consistent (read set validation)
Flush out updates after commit/abort
Release orecs
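The common-case steps above can be sketched with C11 atomics. This is a simplified illustration under stated assumptions: one orec, a single-entry write set, no read set (so validation is skipped), and hypothetical names (acquire, commit). It is not the authors' implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef enum { TXN_ACTIVE, TXN_ABORTED, TXN_COMMITTED } txn_status;

typedef struct txn {
    _Atomic txn_status status;
    long *addr;   /* single-entry write set (redo log), for brevity */
    long  val;
} txn;

/* One orec guarding one location; a NULL owner means free. */
typedef struct { _Atomic(txn *) owner; } orec;

/* Acquire: CAS the orec from free (NULL) to this transaction. */
int acquire(orec *o, txn *t) {
    txn *expected = NULL;
    return atomic_compare_exchange_strong(&o->owner, &expected, t);
}

/* Commit: switch status Active -> Committed, copy back, release the orec. */
int commit(orec *o, txn *t) {
    txn_status expected = TXN_ACTIVE;
    if (!atomic_compare_exchange_strong(&t->status, &expected, TXN_COMMITTED))
        return 0;                          /* another thread aborted us */
    *t->addr = t->val;                     /* flush the redo log entry */
    atomic_store(&o->owner, (txn *)NULL);  /* release with a store */
    return 1;
}

void common_case_demo(void) {
    long x = 10;
    orec o = { NULL };
    txn t1 = { TXN_ACTIVE, &x, 11 };
    txn t2 = { TXN_ACTIVE, &x, 99 };
    int got1 = acquire(&o, &t1);
    int got2 = acquire(&o, &t2);   /* conflict: orec already owned by t1 */
    assert(got1 && !got2);
    int done = commit(&o, &t1);
    assert(done && x == 11);
    int got3 = acquire(&o, &t2);   /* free again after the release */
    assert(got3);
}
```

In this sketch t2 simply fails to acquire; in a blocking STM it would wait for t1's cleanup, while the stealing path described later lets it proceed if t1 is delayed after committing.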
15
Uncommon Case: Stealing
Two flags in the orec support the stealing process:
stolen_orec: records whether the orec is in the stolen or unstolen state
copier_exists: indicates whether an owner is still in its cleanup phase
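One plausible way to carry two such flags is to pack them into the low bits of the orec word next to the owner pointer. The layout below is an assumption made for illustration (the slides only name the flags); it relies on descriptors being at least 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

#define STOLEN_OREC   ((uintptr_t)1)  /* orec is in the stolen state */
#define COPIER_EXISTS ((uintptr_t)2)  /* an owner is still in its cleanup phase */
#define FLAG_MASK     (STOLEN_OREC | COPIER_EXISTS)

/* Combine an aligned owner pointer with the two flag bits. */
uintptr_t orec_pack(void *owner, uintptr_t flags) {
    return ((uintptr_t)owner & ~FLAG_MASK) | (flags & FLAG_MASK);
}

void *orec_owner(uintptr_t w)      { return (void *)(w & ~FLAG_MASK); }
int orec_is_stolen(uintptr_t w)     { return (w & STOLEN_OREC) != 0; }
int orec_copier_exists(uintptr_t w) { return (w & COPIER_EXISTS) != 0; }

void flags_demo(void) {
    static long owner_desc;   /* stands in for a transaction descriptor */
    uintptr_t w = orec_pack(&owner_desc, STOLEN_OREC | COPIER_EXISTS);
    assert(orec_owner(w) == (void *)&owner_desc);
    assert(orec_is_stolen(w) && orec_copier_exists(w));
    w &= ~COPIER_EXISTS;      /* the delayed owner finished its cleanup */
    assert(orec_is_stolen(w) && !orec_copier_exists(w));
}
```

Keeping the flags in the same word as the owner lets a single CAS change owner and state together.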
16
Stealing Example
[Figure: animated stealing example. Shared heap locations hash into ownership records o1 to o5, each holding a ver#/ID field and the flags S and C. Labels: T1 (COMMITTED) as OWNER with copyback of locX in progress; T2 (ACTIVE, later COMMITTED) as STEALER 1; T3 (ACTIVE) as STEALER 2; write set entries locX:11, locX:11, and locX:12. Steps shown: copyback complete, redo copyback, clear C. A callout tracks locX's logical value.]
17
Stealing Complexity
The stealing mechanism is quite complex
Several corner-case race conditions need to be handled (see the paper for details)
The overhead of accessing stolen locations is quite high, requiring a lookup in the last stealer’s write set
However, we can throttle stealing and make it an uncommon case
18
Streamlining Common Case
To release acquired orecs, prior nonblocking STMs required:
Expensive synchronization instructions (e.g. CAS)
Indirection and garbage collection
Blocking STMs use a store instruction
So do we (details in the paper)
19
Timestamps and Validation
A significant optimization to read set validation (e.g. TL2)
Log time at which orec was modified (done when owner releases orec)
A reader checks if the orec was modified after it began execution, and if so, aborts conservatively
20
Adding Timestamps
Recall: the orec contains a pointer to the owner
Superimpose a timestamp on this pointer
A writer releases an orec by storing back the current global time
Timestamps lowered the cost of read set validation significantly
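The timestamp scheme can be sketched as follows. The encoding (low bit set means "free, holds a timestamp"; low bit clear means "owned, holds a pointer"), the choice to advance the global clock at release, and all names are assumptions in the style of TL2, not details taken from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

atomic_uintptr_t global_time = 0;   /* hypothetical global clock */

/* Assumed encoding: low bit 1 = free orec holding a timestamp,
   low bit 0 = owned orec holding an aligned owner pointer. */
uintptr_t ts_encode(uintptr_t t) { return (t << 1) | 1u; }
int       is_timestamp(uintptr_t w) { return (int)(w & 1u); }
uintptr_t ts_decode(uintptr_t w) { return w >> 1; }

/* Writer release: advance the clock and store the new time into the orec. */
void release_orec(atomic_uintptr_t *o) {
    uintptr_t now = atomic_fetch_add(&global_time, 1) + 1;
    atomic_store(o, ts_encode(now));
}

/* Reader check: fail conservatively if the orec is owned or was modified
   after the reader's start time. */
int read_is_valid(atomic_uintptr_t *o, uintptr_t start_time) {
    uintptr_t w = atomic_load(o);
    return is_timestamp(w) && ts_decode(w) <= start_time;
}

void timestamp_demo(void) {
    atomic_uintptr_t o = ts_encode(0);
    uintptr_t start = atomic_load(&global_time);   /* reader begins */
    assert(read_is_valid(&o, start));
    release_orec(&o);                              /* a writer commits */
    assert(!read_is_valid(&o, start));             /* modified after start */
    assert(read_is_valid(&o, atomic_load(&global_time)));
}
```

The check is a single load and compare per read, which is where the saving over re-validating the whole read set comes from.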
21
Undo Log Variant
We have developed the first nonblocking undo log STM through simple modifications to a redo log variant
In the redo log STM, stealing of orecs happens when a committed owner is delayed
In undo log STMs, stealing largely happens when an aborted owner is delayed
Logical values of locations are in the aborted owner’s undo log
23
Experimental Platform
All STMs implemented in C
Throughput tests conducted on microbenchmarks
Scalable workloads: hash table, binary search tree
Torture tests (no scaling): counter, array of counters
Tests conducted on a 16-processor Sun Fire machine
We compared the following STMs:
TL2
TL2 with schedctl calls to avoid preemption pathologies
Harris and Fraser’s word-based nonblocking STM
Our Base blocking and nonblocking variants (without store-based release and optimizations)
3 variants of our Optimized STM (eager redo log, lazy redo log, undo log)
24
Binary Search Tree
[Figure: throughput in Txns/sec (axis 0 to 4,500,000) versus thread count. Series: Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, Base NB. Curve annotations: Our Optimized STMs, TL2, HF-STM, Base NB.]
25
Hash Table
[Figure: throughput in Txns/sec (axis 0 to 8,000,000) versus thread count (1 to 64). Series: Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, Base NB. Curve annotations: TL2-Sched, TL2, Our Optimized STMs.]
26
Array of Counters
[Figure: throughput in Txns/sec (axis 0 to 300,000) versus thread count (1 to 64). Series: Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, Base NB. Curve annotations: TL2-Sched, TL2, Redo Log, Undo Log.]
27
Array of Counters – Stealing rate
[Figure: stealing rate as a percentage of transactions (axis 0 to 40) versus thread count (1 to 64). Series: Redo Log Eager, Redo Log Lazy, Undo Log. Curve annotations: Redo Log, Undo Log.]
28
Conclusion
We presented several variants of a new STM that:
Effectively decouples the common case from the nonblocking infrastructure
Enables a more streamlined fast path (comparable to state-of-the-art blocking STMs)
Enables integration of key optimizations such as timestamp-based transaction validation
We have shown that common case performance of nonblocking STMs can be made competitive with state-of-the-art blocking STMs
29
Thank You!
Questions?
30
Common Case Example
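[Figure: animated common case example. Shared heap locations hash into ownership records o1 to o5, each holding a ver#/ID field and the flags S and C (shown as 0 0). T1 (ACTIVE, then COMMITTED) holds locX:11 in its write set. Steps shown: copyback in progress, copyback complete, release with a store. A callout tracks locX's logical value.]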
31
Basic Idea
A transaction steals ownership of the location under conflict
Inspired by Harris & Fraser’s WSTM
Stealing:
Requires complex metadata management
Leads to high-latency reads and writes
Switch the stolen location back to the unstolen state as quickly as possible
32
Phase-I STM: Switching orec back to Unstolen state
If an orec is stolen, logical values of the locations that map to it may be in the last stealer’s write set (pointed to by the orec)
The stealer will reuse such a write set row (for a new transaction) only after it is reclaimed
A subsequent stealer that comes across a stolen orec with copier_exists == false switches the orec to the unstolen state
Stealing-releasing is a complex process
33
Phase-I STM: Illustration
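[Figure: shared heap locations hash into ownership records o1 to o5, each holding a ver#/ID field and the flags S and C. Labels: T1 (COMMITTED) as first owner; T2 (ACTIVE) as second owner (stealer 1); T3 (ACTIVE) as third owner (stealer 2). The S and C flag values change across the animation, ending with the step "Clear C".]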
34
STM API
stm_begin(my_txn): Initializes a transaction
stm_read(my_txn,loc): Speculative read of location loc
stm_write(my_txn,loc,val): Speculative write val to loc
stm_commit(my_txn): Attempt to commit transaction
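A hedged usage sketch of the four calls listed above. The slides give only the call names and arguments; the types (stm_txn, word), the return value of stm_commit, and the toy single-threaded bodies below are assumptions so that the example compiles on its own. The toy commit always succeeds, whereas a real STM may abort and force a retry.

```c
#include <assert.h>
#include <stddef.h>

typedef long word;          /* assumed word-sized transactional value */
#define WSET_CAP 8          /* hypothetical write set capacity */

typedef struct {
    word  *addr[WSET_CAP];
    word   val[WSET_CAP];
    size_t n;
} stm_txn;

void stm_begin(stm_txn *t) { t->n = 0; }

word stm_read(stm_txn *t, word *loc) {
    for (size_t i = 0; i < t->n; i++)
        if (t->addr[i] == loc) return t->val[i];  /* read our own write */
    return *loc;
}

void stm_write(stm_txn *t, word *loc, word val) {
    for (size_t i = 0; i < t->n; i++)
        if (t->addr[i] == loc) { t->val[i] = val; return; }
    assert(t->n < WSET_CAP);
    t->addr[t->n] = loc;
    t->val[t->n] = val;
    t->n++;
}

/* Returns 1 on commit, 0 on abort; this toy version never aborts. */
int stm_commit(stm_txn *t) {
    for (size_t i = 0; i < t->n; i++) *t->addr[i] = t->val[i];
    t->n = 0;
    return 1;
}

/* Typical usage: retry the transaction body until it commits. */
void increment(word *counter) {
    stm_txn t;
    do {
        stm_begin(&t);
        word v = stm_read(&t, counter);
        stm_write(&t, counter, v + 1);
    } while (!stm_commit(&t));
}
```

The retry loop is the part that carries over to a real STM: the body must be safe to re-execute, because an aborted attempt leaves no visible effects.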
35
Phase-I STM: Example
[Figure: animated Phase-I example. Shared heap locations hash into ownership records o1 to o5, each holding a ver#/ID field and the flags S and C. Labels: T1 (COMMITTED) as first owner with copyback of locX in progress; T2 (ACTIVE) as second owner (stealer 1); T3 (ACTIVE) as third owner (stealer 2); each of the three write sets holds locX:11. Steps shown: copyback complete, redo copyback, clear C. A callout tracks locX's logical value.]
36
Phase-I STM: Stealing Mechanism
Steal an orec when a transaction encounters an orec acquired by a committed transaction
The committed transaction is copying back its speculative updates
Stealing is done in two steps:
Merge the victim’s speculative updates to the orec’s locations into the stealer’s write set
Acquire the orec with an atomic operation
This involves setting some special flags that indicate to the system that the orec is stolen
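The two stealing steps can be sketched as below. This is a deliberately reduced illustration: the orec word layout (owner pointer plus a low STOLEN bit), the single write set row per transaction, and the names merge_row and steal are assumptions, and the corner-case races and the check that the victim has committed are not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define STOLEN  ((uintptr_t)1)  /* hypothetical flag bit in the orec word */
#define ROW_CAP 4               /* hypothetical write set row capacity */

typedef struct { long *addr; long val; } entry;
typedef struct txn { entry row[ROW_CAP]; size_t n; } txn;

/* Step 1: copy the victim's entries for this orec into the stealer's row,
   unless the stealer has already written that location itself. */
void merge_row(txn *stealer, const txn *victim) {
    for (size_t i = 0; i < victim->n; i++) {
        size_t j = 0;
        while (j < stealer->n && stealer->row[j].addr != victim->row[i].addr)
            j++;
        if (j == stealer->n && stealer->n < ROW_CAP)
            stealer->row[stealer->n++] = victim->row[i];
    }
}

/* Step 2: CAS the orec from the observed word to (stealer | STOLEN).
   Returns 1 on success, 0 if the orec was free or changed underneath us. */
int steal(atomic_uintptr_t *o, txn *stealer) {
    uintptr_t seen = atomic_load(o);
    const txn *victim = (const txn *)(seen & ~STOLEN);
    if (victim == NULL) return 0;
    merge_row(stealer, victim);
    return atomic_compare_exchange_strong(o, &seen, (uintptr_t)stealer | STOLEN);
}

void steal_demo(void) {
    long x = 10;
    txn t1 = { { { &x, 11 } }, 1 };   /* committed owner, copyback pending */
    txn t2 = { { { NULL, 0 } }, 0 };  /* stealer with an empty row */
    atomic_uintptr_t o = (uintptr_t)&t1;
    int ok = steal(&o, &t2);
    assert(ok);
    assert(atomic_load(&o) == ((uintptr_t)&t2 | STOLEN));
    assert(t2.n == 1 && t2.row[0].val == 11);  /* logical value now in t2's row */
}
```

Merging before the CAS matters: once the orec points at the stealer, other transactions will look for the logical values in the stealer's write set.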
37
Phase-I STM: Stolen orec state
Logical values of stolen locations are always in the stealer’s write set
Subsequent accesses to these locations must look up the stealer’s write set
This is quite expensive
We use some flags to indicate when it is safe for a new stealer to switch the orec back to the unstolen state