Toward High Performance Nonblocking Software Transactional Memory

Virendra J. Marathe

University of Rochester

Mark Moir

Sun Microsystems Labs


Nonblocking Progress & Transactional Memory

Nonblocking Progress – arbitrary delays in some threads do not prevent others from making forward progress

TM research began as a route to nonblocking concurrent algorithms [Herlihy & Moss, ISCA’93]

Early software TMs (STMs) were nonblocking, but slow

Recent shift toward blocking STMs

Significant performance improvements

General argument – nonblocking STMs are fundamentally slow

We were not convinced


Agenda

Why is nonblocking progress important?

Background on STM Implementations

What makes nonblocking STMs slow?

Making nonblocking STMs fast

Experimental Results

Conclusions


The Virtues of Nonblocking Progress

Tolerance of arbitrary delays due to preemption, page faults, and thread faults

External scheduler support mitigates some problems, but is not portable

Ideally, contain the problem within the STM

Environments where blocking is unacceptable, e.g. TxLinux interrupt handler transactions


Agenda

Why is nonblocking progress important?

Background on STM Implementations

What makes nonblocking STMs slow?

Making nonblocking STMs fast

Experimental Results

Conclusions


STM Implementations

Transactions execute speculatively

Reads and writes use STM metadata

Speculative writes typically acquire ownership of locations (using atomic ops, e.g. CAS)

Reads are typically logged in a private read set for validation at commit time

Post-commit/abort cleanup:

Make speculative updates non-speculative, or roll back speculative updates

Release ownership of locations

This forces waiting in blocking STMs


STM Implementations

Two types of implementations for speculative writes:

Redo Log – writes are made to a private buffer and flushed out on commit; ownership acquisition can be done at first write (eager acquire) or at commit time (lazy acquire)

Undo Log – writes are made directly to memory (requires eager acquire), old values are logged in a private buffer, and old values are restored in case of an abort

Read set validation to ensure isolation: several schemes (e.g. incremental, commit counter, timestamp)
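
The difference between the two write-logging schemes can be sketched in a few lines of C. This is a single-threaded illustration only: it omits ownership acquisition and validation entirely, and the names `write_log`, `redo_write`, `redo_read`, `redo_commit`, `undo_write`, and `undo_abort` are hypothetical, not taken from the talk.

```c
#include <stddef.h>

#define LOG_CAP 16   /* hypothetical fixed log capacity */

/* One logged write: the target address and a value. */
typedef struct { long *addr; long val; } log_entry;

typedef struct {
    log_entry entries[LOG_CAP];
    size_t    n;
} write_log;

/* Redo log: buffer the new value privately; memory is untouched until commit. */
static int redo_write(write_log *log, long *addr, long val) {
    for (size_t i = 0; i < log->n; i++) {
        if (log->entries[i].addr == addr) {   /* overwrite an earlier buffered write */
            log->entries[i].val = val;
            return 0;
        }
    }
    if (log->n == LOG_CAP) return -1;         /* log full */
    log->entries[log->n++] = (log_entry){ addr, val };
    return 0;
}

/* Redo log: a read must first consult the transaction's own buffered writes. */
static long redo_read(const write_log *log, long *addr) {
    for (size_t i = 0; i < log->n; i++)
        if (log->entries[i].addr == addr) return log->entries[i].val;
    return *addr;
}

/* Redo log commit: flush buffered values to memory. An abort just discards the log. */
static void redo_commit(write_log *log) {
    for (size_t i = 0; i < log->n; i++)
        *log->entries[i].addr = log->entries[i].val;
    log->n = 0;
}

/* Undo log: save the old value, then write directly to memory. */
static int undo_write(write_log *log, long *addr, long val) {
    if (log->n == LOG_CAP) return -1;         /* log full */
    log->entries[log->n++] = (log_entry){ addr, *addr };
    *addr = val;
    return 0;
}

/* Undo log abort: restore old values, newest first. A commit just discards the log. */
static void undo_abort(write_log *log) {
    while (log->n > 0) {
        log->n--;
        *log->entries[log->n].addr = log->entries[log->n].val;
    }
}
```

In the sketch, the redo variant pays on reads (a lookup in the private buffer) and on commit (the flush), while the undo variant pays on abort (the restore).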


Agenda

Why is nonblocking progress important?

Background on STM Implementations

What makes nonblocking STMs slow?

Making nonblocking STMs fast

Experimental Results

Conclusions


What makes nonblocking STMs slow?

In blocking STMs, a transaction waits for a conflicting transaction that is in its post-commit/abort cleanup phase

Nonblocking STMs avoid waiting with:

Indirection (object-based STMs)

Copying and cloning

Helping

Stealing (Harris & Fraser; also our approach)

These usually lead to overheads in the (contention-free) common case


What makes blocking STMs fast?

Significantly less overhead in the common case

Simple metadata structure

Streamlined fast path

Performance optimizations, e.g. timestamp-based validation

We need to incorporate all these features in a nonblocking STM to make it competitive


Agenda

Why is nonblocking progress important?

Background on STM Implementations

What makes nonblocking STMs slow?

Making nonblocking STMs fast

Experimental Results

Conclusions


Our Contributions

Keep the common case simple

Resort to the complicated case only when cleanup is delayed

More streamlined common case execution path

Incorporate recent optimizations (timestamp-based validation)


STM Data Structures

Word-based STM

Conflict detection at the granularity of contiguous blocks of memory

Appropriate for unmanaged languages – C, C++

A table of ownership records (orecs)

Each heap location hashes into a single orec

Each orec indicates whether it is currently owned or free, and identifies the owner

Transaction Descriptor

Read set

Write set (redo log) – a 2D list; each row corresponds to an acquired orec

Status – Active/Aborted/Committed
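
One possible C layout for these structures is sketched below. The table size, block size, capacities, and all type and function names (`orec_table`, `orec_for`, `tx_desc`, and so on) are assumptions made for illustration; the talk does not give them.

```c
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT  1024u   /* hypothetical table size (a power of two) */
#define BLOCK_SHIFT 3u      /* hypothetical: 8-byte blocks as the conflict unit */

typedef enum { TX_ACTIVE, TX_ABORTED, TX_COMMITTED } tx_status;

/* An orec is one word: it says whether the block is free or owned, and by whom. */
typedef struct { uintptr_t word; } orec;

static orec orec_table[OREC_COUNT];

/* Hash a heap address to its orec: drop the in-block offset bits, then mask. */
static orec *orec_for(const void *addr) {
    uintptr_t block = (uintptr_t)addr >> BLOCK_SHIFT;
    return &orec_table[block & (OREC_COUNT - 1u)];
}

/* Read set entry: which orec was read and what its word looked like then. */
typedef struct { orec *o; uintptr_t seen; } read_entry;

/* Write set entry: a buffered (redo log) write. */
typedef struct { long *addr; long val; } write_entry;

#define ROW_CAP 8
/* One write set row per acquired orec, holding all writes that hash to it. */
typedef struct {
    orec       *o;
    write_entry writes[ROW_CAP];
    size_t      n;
} write_row;

#define SET_CAP 32
typedef struct {
    tx_status  status;
    read_entry reads[SET_CAP];
    size_t     n_reads;
    write_row  rows[SET_CAP];   /* the "2D list": rows of per-orec writes */
    size_t     n_rows;
} tx_desc;
```

Because many addresses hash to one orec, two transactions that touch unrelated locations can still conflict; a larger table trades memory for fewer such false conflicts.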


Common Case Execution

Algorithm behaves like a blocking STM in the absence of contention

Log reads and writes of the transaction

Acquire ownership of write set locations via their orecs

Ensure that reads are still consistent (read set validation)

Flush out updates after commit/abort

Release orecs


Uncommon Case: Stealing

Two flags in the orec for the stealing process:

stolen_orec: indicates the orec’s stolen/unstolen state

copier_exists: indicates whether there exists an owner in its cleanup phase


Stealing Example

[Figure: animated diagram. Labels: Shared Heap; Ownership Records (orec) o1 to o5, reached by hashing, each holding a ver#/ID field and the flags S and C; OWNER T1 (COMMITTED), STEALER 1 T2 (ACTIVE, later COMMITTED), STEALER 2 T3 (ACTIVE); write set entries locX:11, locX:11, and locX:12; locX’s logical value; stages "Copyback in progress", "Copyback complete", "Redo Copyback", and "Clear C".]


Stealing Complexity

Stealing mechanism is quite complex

Several corner-case race conditions need to be handled (read the paper for further details)

Overhead of accessing stolen locations is quite high, requiring a lookup in the last stealer’s write set

However, we can throttle stealing and make it an uncommon case


Streamlining Common Case

To release acquired orecs, prior nonblocking STMs required:

Expensive synchronization instructions (e.g. CAS)

Indirection & garbage collection

Blocking STMs use a store instruction

So do we (details in the paper)
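
The cost difference between the two release styles can be seen in a small C11 sketch. It shows only the uncontended case: acquisition uses a CAS because several writers may race for a free orec, while release is a single store. How the store stays safe when a stealer is present is in the paper and is not modeled here. The word encoding (low bit set means owned) and the names `orec_acquire` and `orec_release` are assumptions for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { _Atomic uintptr_t word; } orec;

/* Hypothetical encoding: low bit set means "owned", upper bits identify the owner;
   low bit clear means "free", upper bits hold a version number. */
#define OWNED_BIT ((uintptr_t)1)

static bool orec_is_owned(uintptr_t w) { return (w & OWNED_BIT) != 0; }

/* Acquire needs an atomic read-modify-write, since writers may race for the orec. */
static bool orec_acquire(orec *o, uintptr_t owner_id) {
    uintptr_t seen = atomic_load_explicit(&o->word, memory_order_acquire);
    if (orec_is_owned(seen)) return false;          /* someone else owns it */
    uintptr_t mine = (owner_id << 1) | OWNED_BIT;
    return atomic_compare_exchange_strong_explicit(
        &o->word, &seen, mine, memory_order_acq_rel, memory_order_acquire);
}

/* Release with a plain store: no CAS, no retry loop. The release ordering makes
   the owner's earlier copyback writes visible before the orec appears free. */
static void orec_release(orec *o, uintptr_t new_version) {
    atomic_store_explicit(&o->word, new_version << 1, memory_order_release);
}
```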


Timestamps and Validation

A significant optimization to read set validation (e.g. TL2)

Log time at which orec was modified (done when owner releases orec)

A reader checks if the orec was modified after it began execution, and if so, aborts conservatively


Adding Timestamps

Recall: orec contains a pointer to the owner

Superimpose a timestamp on this pointer

A writer releases the orec by storing back the current global time

Timestamps lowered the cost of read set validation significantly
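
A minimal C11 sketch of this idea follows. It assumes one tag bit distinguishes an owner pointer from a timestamp, and that a releasing writer advances a shared clock in the style of TL2; both are illustrative assumptions, as are the names `global_clock`, `release_with_timestamp`, and `read_is_valid`.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical global clock, advanced by writers when they release. */
static _Atomic uintptr_t global_clock;

/* Hypothetical tag bit: set means the word is an owner pointer,
   clear means the word is a timestamp shifted left by one. */
#define OWNED_BIT ((uintptr_t)1)

static uintptr_t make_timestamp(uintptr_t t) { return t << 1; }

/* Assumes descriptors are at least 2-byte aligned, so bit 0 of the pointer is free. */
static uintptr_t make_owner(const void *desc) { return (uintptr_t)desc | OWNED_BIT; }

static bool is_owner(uintptr_t w) { return (w & OWNED_BIT) != 0; }

/* Writer release: overwrite the owner pointer with the new global time. */
static void release_with_timestamp(_Atomic uintptr_t *orec_word) {
    uintptr_t now = atomic_fetch_add(&global_clock, 1) + 1;
    atomic_store_explicit(orec_word, make_timestamp(now), memory_order_release);
}

/* Reader check: the read is consistent only if the orec is free and was last
   modified no later than the reader's start time. Otherwise abort conservatively. */
static bool read_is_valid(_Atomic uintptr_t *orec_word, uintptr_t start_time) {
    uintptr_t w = atomic_load_explicit(orec_word, memory_order_acquire);
    if (is_owner(w)) return false;
    return (w >> 1) <= start_time;
}
```

With this check, a reader whose orecs all carry timestamps no later than its start time needs no further validation of those reads, which is where the saving over incremental validation comes from.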


Undo Log Variant

We have developed the first nonblocking undo log STM through simple modifications to a redo log variant

Stealing of orecs happens in the redo log STM when a committed owner is delayed

In undo log STMs stealing largely happens when an aborted owner is delayed

Logical values of locations are in the aborted owner’s undo log


Agenda

Why is nonblocking progress important?

Background on STM Implementations

What makes nonblocking STMs slow?

Making nonblocking STMs fast

Experimental Results

Conclusions


Experimental Platform

Implementation of all STMs done in C

Throughput tests conducted on microbenchmarks

Scalable workloads: hash table, binary search tree

Torture tests (no scaling): counter, array of counters

Tests conducted on a 16 processor Sun Fire machine

We compared the following STMs:

TL2

TL2 with schedctl calls to avoid preemption pathologies

Harris and Fraser’s word-based nonblocking STM

Our Base blocking and nonblocking variants (do not contain store-based release and optimizations)

3 variants of our Optimized STM (eager redo log, lazy redo log, undo log)


Binary Search Tree

[Figure: throughput (Txns/sec, axis from 0 to 4,500,000) versus thread count for Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, and Base NB. Curve annotations: Our Optimized STMs, TL2, HF-STM, Base NB.]


Hash Table

[Figure: throughput (Txns/sec, axis from 0 to 8,000,000) versus thread count (1 to 64) for Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, and Base NB. Curve annotations: TL2-Sched, TL2, Our Optimized STMs.]


Array of Counters

[Figure: throughput (Txns/sec, axis from 0 to 300,000) versus thread count (1 to 64) for Redo Log, Undo Log, TL2 Schedctl, TL2, HF-STM, and Base NB. Curve annotations: TL2-Sched, TL2, Redo Log, Undo Log.]


Array of Counters – Stealing rate

[Figure: stealing rate (in % of txns, axis from 0 to 40) versus thread count (1 to 64) for Redo Log Eager, Redo Log Lazy, and Undo Log. Curve annotations: Redo Log, Undo Log.]


Conclusion

We presented several variants of a new STM that:

Effectively decouples the common case from nonblocking infrastructure

Enables a more streamlined fast path (comparable to state-of-the-art blocking STMs)

Enables integration of key optimizations such as timestamp-based transaction validation

We have shown that common case performance of nonblocking STMs can be made competitive with state-of-the-art blocking STMs


Thank You!

Questions?


Common Case Example

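[Figure: animated diagram of the common case. Labels: Shared Heap; Ownership Records (orec) o1 to o5, reached by hashing, each holding a ver#/ID field and the flags S and C; transaction T1 (ACTIVE, later COMMITTED); write set entry locX:11; locX’s logical value; stages "Copyback in progress", "Copyback complete", and "Release" (by store).]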


Basic Idea

Transaction steals ownership of the location under conflict

Inspired by Harris & Fraser’s WSTM

Stealing:

Requires complex metadata management

Leads to high latency reads and writes

Switch the stolen location back to unstolen state as quickly as possible


Phase-I STM: Switching orec back to Unstolen state

If an orec is stolen, logical values of the locations that map to it may be in the last stealer’s write set (pointed to by the orec)

Stealer will reuse such a write set row (for a new transaction) only after it is reclaimed

A subsequent stealer that comes across a stolen orec with (copier_exists == false) switches the orec to the unstolen state

Stealing-releasing is a complex process


Phase-I STM: Illustration

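[Figure: animated diagram. Labels: Shared Heap; Ownership Records (orec) o1 to o5, reached by hashing, each holding a ver#/ID field and the flags S and C; First owner T1 (COMMITTED); Second owner (stealer 1) T2 (ACTIVE); Third owner (stealer 2) T3 (ACTIVE); successive S and C flag values; stage "Clear C".]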


STM API

stm_begin(my_txn): Initializes a transaction

stm_read(my_txn,loc): Speculative read of location loc

stm_write(my_txn,loc,val): Speculative write val to loc

stm_commit(my_txn): Attempt to commit transaction
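
The four calls are typically combined in a retry loop. The sketch below pairs such a loop with a toy, single-threaded stand-in for the API so that it compiles on its own. The slide gives only the function names and argument lists; the `stm_txn` type, the argument and return types, and the `transfer` helper are assumptions, and the stand-in performs no ownership acquisition or validation.

```c
#include <stdbool.h>
#include <stddef.h>

#define WS_CAP 16   /* hypothetical write set capacity */

/* Hypothetical descriptor: a small redo log plus an overflow marker. */
typedef struct {
    long  *addr[WS_CAP];
    long   val[WS_CAP];
    size_t n;
    bool   overflow;
} stm_txn;

static void stm_begin(stm_txn *t) { t->n = 0; t->overflow = false; }

/* Speculative read: prefer the transaction's own buffered write, if any. */
static long stm_read(stm_txn *t, long *loc) {
    for (size_t i = 0; i < t->n; i++)
        if (t->addr[i] == loc) return t->val[i];
    return *loc;
}

/* Speculative write: buffer the value; memory is updated only at commit. */
static void stm_write(stm_txn *t, long *loc, long val) {
    for (size_t i = 0; i < t->n; i++)
        if (t->addr[i] == loc) { t->val[i] = val; return; }
    if (t->n == WS_CAP) { t->overflow = true; return; }  /* doom the transaction */
    t->addr[t->n] = loc;
    t->val[t->n] = val;
    t->n++;
}

/* Commit attempt: returns false if the transaction cannot commit. */
static bool stm_commit(stm_txn *t) {
    if (t->overflow) return false;
    for (size_t i = 0; i < t->n; i++) *t->addr[i] = t->val[i];
    return true;
}

/* Usage: move `amount` between two shared words, retrying until commit succeeds. */
static void transfer(long *from, long *to, long amount) {
    stm_txn t;
    do {
        stm_begin(&t);
        stm_write(&t, from, stm_read(&t, from) - amount);
        stm_write(&t, to,   stm_read(&t, to)   + amount);
    } while (!stm_commit(&t));
}
```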


Phase-I STM: Example

[Figure: animated diagram. Labels: Shared Heap; Ownership Records (orec) o1 to o5, reached by hashing, each holding a ver#/ID field and the flags S and C; First owner T1 (COMMITTED); Second owner (stealer 1) T2 (ACTIVE); Third owner (stealer 2) T3 (ACTIVE); three write set entries locX:11; locX’s logical value; stages "Copyback in progress", "Copyback complete", "Redo Copyback", and "Clear C".]


Phase-I STM: Stealing Mechanism

Steal the orec when a transaction encounters an orec acquired by a committed transaction

The committed transaction is copying back its speculative updates

Stealing is done in two steps:

Merge the victim’s speculative updates to the orec’s locations into the stealer’s write set

Acquire the orec with an atomic op

This involves setting some special flags that indicate to the system that the orec is stolen
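
The two steps can be sketched in C11 as below. This is a heavily simplified illustration, not the mechanism from the paper: it assumes the orec word holds a pointer to a write set row with the stolen flag in bit 0, ignores the copier flag and the corner-case races, and lets the stealer's own speculative value win when both rows contain the same location. All names (`write_row`, `merge_row`, `steal_orec`, `STOLEN_FLAG`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ROW_CAP 8   /* hypothetical row capacity */

typedef struct { long *addr; long val; } write_entry;
typedef struct { write_entry e[ROW_CAP]; size_t n; } write_row;

/* Hypothetical orec word layout: row pointer in the upper bits, stolen flag in bit 0. */
#define STOLEN_FLAG ((uintptr_t)1)

/* Step 1: copy into the stealer's row every victim entry the stealer has not
   written itself. The victim is committed, so its row is assumed immutable here.
   On failure the row may hold partial results and should be discarded. */
static bool merge_row(write_row *mine, const write_row *victim) {
    for (size_t i = 0; i < victim->n; i++) {
        bool present = false;
        for (size_t j = 0; j < mine->n; j++)
            if (mine->e[j].addr == victim->e[i].addr) { present = true; break; }
        if (present) continue;               /* stealer's own value wins (assumption) */
        if (mine->n == ROW_CAP) return false;
        mine->e[mine->n++] = victim->e[i];
    }
    return true;
}

/* Step 2: swing the orec from the expected word to the stealer's row with one CAS,
   setting the stolen flag. Fails if another transaction changed the orec first. */
static bool steal_orec(_Atomic uintptr_t *orec_word, uintptr_t expected,
                       write_row *mine, const write_row *victim) {
    if (!merge_row(mine, victim)) return false;
    uintptr_t desired = (uintptr_t)mine | STOLEN_FLAG;
    return atomic_compare_exchange_strong(orec_word, &expected, desired);
}
```

Merging before the CAS means that, from the moment the orec points at the stealer's row, that row already holds a logical value for every location the victim had updated.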


Phase-I STM: Stolen orec state

Logical values of stolen locations are always in the stealer’s write set

Subsequent accesses to these locations must look up the stealer’s write set

Quite expensive

We use some flags to indicate when it is safe for a new stealer to switch the orec back to the unstolen state