LECTURE 5 A Brief History of TM
Precursors of Computing: ENIAC
• 5000 ops/second
• $486k in 1946
• 19k vacuum tubes
• 200k watts
• 67 cubic meters
Latest trends: Intel Nehalem
• 1.9 billion transistors
• 12 billion ops per second
• 4 microprocessors
• 8 MB of on-chip memory
• 100 W
• 246 square millimeters
The Way: Not just Chip Frequency!
• 1970s: Programmable controllers, single chip microprocessors
• 1980s: Instruction pipelines, cache hierarchies
• 1990s: Speculative execution, superscalar processors
• 2000s: Multicore chips, embedded computing
Pipelining
• Split the processing of an instruction into a series of independent steps
• Classic pipeline
– Instruction Fetch (IF)
– Instruction Decode (ID)
– Execute (EX)
– Memory Access (MEM)
– Register Write Back (WB)
Pipelining
          Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Instr 1   IF       ID       EX       MEM      WB
Instr 2            IF       ID       EX       MEM
Instr 3                     IF       ID       EX
Instr 4                              IF       ID
Different parts of the CPU used for different stages of the pipeline
Pipelining
• Throughput: Determined by the speed of the slowest step instead of the whole instruction
• More expensive design
• Performance of a pipelined processor depends on the executing program, and is harder to predict than that of a non-pipelined processor
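The throughput claim above can be checked with a little arithmetic. The following is a minimal sketch, assuming an idealized pipeline in which every stage takes one cycle and there are no hazards or stalls; the class and method names are illustrative and not part of the lecture.

```java
// Idealized timing model: one cycle per stage, no hazards, no stalls (assumption).
final class PipelineTiming {
    // Without pipelining, each instruction occupies all stages before the next one starts.
    static long unpipelinedCycles(long instructions, int stages) {
        return instructions * stages;
    }

    // With pipelining, the first instruction needs `stages` cycles to fill the pipeline;
    // after that, one instruction completes per cycle.
    static long pipelinedCycles(long instructions, int stages) {
        if (instructions == 0) {
            return 0;
        }
        return stages + (instructions - 1);
    }

    public static void main(String[] args) {
        int stages = 5; // IF, ID, EX, MEM, WB
        for (long n : new long[] {1, 4, 1000}) {
            System.out.println(n + " instructions: "
                    + unpipelinedCycles(n, stages) + " cycles unpipelined, "
                    + pipelinedCycles(n, stages) + " cycles pipelined");
        }
    }
}
```

For long instruction streams the pipelined cost approaches one cycle per instruction, which is why real programs with stalls and branches make performance harder to predict.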
Superscalar
• Executes multiple instructions per clock cycle by simultaneously dispatching to redundant functional units
• Think of it as multiple parallel pipelines, each processing instructions from a single stream
• Limitation: Degree of intrinsic parallelism in the stream
Out of Order Execution (OOE)
• Multiple instructions fetched
• Instructions dispatched to an instruction queue (also called instruction buffer or reservation stations)
• Instruction waits in the queue until the input operands are available
• Note that the instruction may leave the queue before earlier instructions
• Results are queued
Speculation in ILP
• Pipelining, OOE, and superscalar execution all involve some form of “speculation”
– Example: Branch prediction
• There has always been some speculation “circuitry” in processors
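As one concrete illustration of speculation circuitry, here is a minimal sketch of a 2-bit saturating-counter branch predictor. This particular scheme is not described in the lecture; it is a well-known textbook technique chosen here as an example, and all names are illustrative.

```java
// 2-bit saturating counter: states 0 and 1 predict "not taken", 2 and 3 predict "taken".
final class TwoBitPredictor {
    private int counter = 1; // assumption: start in the "weakly not taken" state

    boolean predictTaken() {
        return counter >= 2;
    }

    // Move the counter toward the observed outcome, saturating at 0 and 3.
    void update(boolean taken) {
        if (taken) {
            if (counter < 3) counter++;
        } else {
            if (counter > 0) counter--;
        }
    }

    // Count how often the prediction disagrees with the actual branch outcomes.
    static int mispredictions(boolean[] outcomes) {
        TwoBitPredictor predictor = new TwoBitPredictor();
        int misses = 0;
        for (boolean taken : outcomes) {
            if (predictor.predictTaken() != taken) {
                misses++;
            }
            predictor.update(taken);
        }
        return misses;
    }
}
```

A mispredicted branch means the speculatively fetched instructions must be discarded, which is the same "execute, then validate or roll back" pattern that transactional memory later applies to memory accesses.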
Forms of parallelism
• Functional: Perform tasks that are functionally different in parallel, e.g. building a house – plumber, carpenter, electrician
• Pipeline: Perform tasks that are different in a particular order, e.g. lunch buffet
• Data: Perform the same task on different data, e.g. grading exams, MapReduce
Limitations of ILP
• Finite amount of ILP in any sequence of instructions
• Another possibility: Thread Level Parallelism (Functional parallelism)
• How to get multiple threads?
– Write parallel programs
– Thread level speculation
– Code parallelization
Thread Level Speculation
• Takes a sequence of instructions
• Arbitrarily breaks it into a sequenced group of threads that may run in parallel
• Allows for oblivious parallelization of sequential programs
• Parallelization by speculation dynamically finds parallelism at runtime, and thus is not conservative
Code parallelization
• Implemented in compilers, e.g. SUIF
• Problems: Hard to identify dependencies between pieces of code and data at compile time
CMP (Chip Multiprocessors)
• Forward data between parallel threads
• Detect when reads occur too early
• Safely discard speculative state after violations
• Retire speculative writes in correct order
• Examples: Stanford HYDRA, Wisconsin Multiscalar, CMU Stampede (1995-2000)
Cache Coherence
• Consistency of data stored in local caches of a shared resource (Wiki definition)
• Protocols
– MESI
– MOESI
– MOSI
– MSI
MAIN MEMORY
INTERCONNECTION NETWORK
CACHE CACHE CACHE CACHE
P1 P2 P3 P4
2-state Invalidation Cache Protocol
State diagram (states: VALID, INVALID):
  INVALID -> VALID:   PrRd / BusRd
  INVALID -> INVALID: PrWr / BusWr
  VALID -> VALID:     PrRd / --
  VALID -> VALID:     PrWr / BusWr
  VALID -> INVALID:   BusWr
Legend:
  X/Y: Action X / Reaction Y
  PrRd: Processor Read
  PrWr: Processor Write
  BusRd: Fetch a cache block
  BusWr: Write through one word
  --: No action
Write Through, No Allocation
Valid indicates cache presence
2-State Protocol
• Simple hardware and protocol
• Requires high bandwidth (every write goes on bus!)
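The 2-state protocol is small enough to write down as a state machine for a single cache line. The sketch below is a minimal model under the write-through, no-allocation assumption stated on the slide; the class, enum, and method names are illustrative.

```java
// One cache line under the 2-state write-through invalidation protocol.
final class WriteThroughLine {
    enum State { VALID, INVALID }
    enum BusAction { NONE, BUS_RD, BUS_WR }

    State state = State.INVALID;

    // PrRd: a miss fetches the block (BusRd) and makes the line VALID; a hit needs no bus traffic.
    BusAction processorRead() {
        if (state == State.INVALID) {
            state = State.VALID;
            return BusAction.BUS_RD;
        }
        return BusAction.NONE;
    }

    // PrWr: every write goes on the bus (write through). No allocation: the state is unchanged.
    BusAction processorWrite() {
        return BusAction.BUS_WR;
    }

    // BusWr observed from another cache: our copy is stale, so invalidate it.
    void snoopBusWrite() {
        state = State.INVALID;
    }
}
```

Note that `processorWrite` always returns `BUS_WR`, which is exactly the bandwidth problem named above.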
3-state Protocol (MSI)
• Modified
• Shared
• Invalid
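MSI State Diagram
States: M (Modified), S (Shared), I (Invalid)
  M -> M: PrRd / --, PrWr / --
  M -> S: BusRd / BusWB
  M -> I: BusRdX / BusWB
  S -> S: PrRd / --, BusRd / --
  S -> M: PrWr / BusRdX
  S -> I: BusRdX / --
  I -> S: PrRd / BusRd
  I -> M: PrWr / BusRdX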
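The MSI transitions can be sketched as a per-line state machine. This is a minimal model assuming the standard MSI transitions (processor reads and writes, plus snooped BusRd and BusRdX); the names are illustrative and the bus itself is not modeled.

```java
// One cache line under the 3-state MSI protocol.
final class MsiLine {
    enum State { MODIFIED, SHARED, INVALID }
    enum Bus { NONE, BUS_RD, BUS_RDX, BUS_WB }

    State state = State.INVALID;

    // PrRd: only a read in INVALID needs the bus.
    Bus processorRead() {
        if (state == State.INVALID) {
            state = State.SHARED;
            return Bus.BUS_RD;
        }
        return Bus.NONE;
    }

    // PrWr: gaining exclusive ownership requires BusRdX unless the line is already MODIFIED.
    Bus processorWrite() {
        if (state == State.MODIFIED) {
            return Bus.NONE;
        }
        state = State.MODIFIED;
        return Bus.BUS_RDX;
    }

    // React to a transaction issued by another cache; returns our reaction on the bus.
    Bus snoop(Bus observed) {
        if (observed == Bus.BUS_RD && state == State.MODIFIED) {
            state = State.SHARED;
            return Bus.BUS_WB; // write the dirty block back
        }
        if (observed == Bus.BUS_RDX && state != State.INVALID) {
            Bus reaction = (state == State.MODIFIED) ? Bus.BUS_WB : Bus.NONE;
            state = State.INVALID;
            return reaction;
        }
        return Bus.NONE;
    }
}
```

Unlike the 2-state protocol, repeated writes to a MODIFIED line generate no bus traffic.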
Further Improvements
• MESI: Illinois protocol
• MOESI
FIRST TRANSACTIONAL MEMORIES
Precursors: Knight (1986)
• Idea of TLS
• Two caches per processor
• The first proposal to use caches and cache coherence to maintain and enforce ordering among speculatively parallelized regions of a sequential code in the presence of unknown memory dependencies
The word “Transactional Memory”
• Introduced by Herlihy and Moss in 1991
• Idea: Adapt the cache coherence protocol so that transactional accesses are monitored
ISCA 93
• Six new instructions
– Load-transactional
– Load-transactional-exclusive
– Store-transactional
– Commit
– Abort
– Validate
• New processor flags
– Tactive: Is a transaction currently active?
– Tstatus: Is the active transaction in progress, or aborted?
Transactional Cache
• States: MESI
• Additional transactional tags: EMPTY, NORMAL, XCOMMIT, XABORT
• Transactional operations create two entries: one with XCOMMIT and one with XABORT
• Modifications made to XABORT on Store
Three extra bus cycles
• T_READ: On a transactional load
• T_RFO: On a transactional load exclusive, or a store
• BUSY: Full cache or other reasons (prevent deadlocks or mutual aborts)
Load_transactional
• LT:
– Search TxCache for an XABORT entry. Return its value if one exists
– If there is no XABORT entry, search for a NORMAL entry. Change its tag to XABORT, and allocate a second entry with tag XCOMMIT and the same data
– Else, issue a T_READ cycle. Behaves as Goodman’s read. Two entries are created, tagged XABORT and XCOMMIT
Load_transactional_exclusive
• Similar to LT
• Instead of T_READ, T_RFO used on a miss
Store
• Similar to LTX
• Changes the XABORT entry’s data too
Validate
• Returns the TSTATUS flag
• If the TSTATUS flag is FALSE
– Sets TSTATUS to TRUE
– Sets TACTIVE to FALSE
Abort
• Discards the XABORT entries (sets their tags to EMPTY)
• Sets the tags of XCOMMIT entries to NORMAL
• Sets TSTATUS to TRUE
• Sets TACTIVE to FALSE
Commit
• Discards the XCOMMIT entries (sets their tags to EMPTY)
• Sets the tags of XABORT entries to NORMAL
• Sets TSTATUS to TRUE
• Sets TACTIVE to FALSE
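The tag bookkeeping for Store, Commit, and Abort can be sketched in a few lines. This is a toy model of the transactional tags only, assuming stores always hit a NORMAL or XABORT entry; MESI states, bus cycles, and capacity limits are omitted, and the class and helper names (`loadNormal`, `normalValue`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the transactional cache tags; not a full cache.
final class TransactionalCacheSketch {
    enum Tag { EMPTY, NORMAL, XCOMMIT, XABORT }

    static final class Entry {
        final int address;
        int value;
        Tag tag;
        Entry(int address, int value, Tag tag) {
            this.address = address;
            this.value = value;
            this.tag = tag;
        }
    }

    final List<Entry> entries = new ArrayList<>();
    boolean tActive = false;
    boolean tStatus = true;

    private Entry find(int address, Tag tag) {
        for (Entry e : entries) {
            if (e.address == address && e.tag == tag) return e;
        }
        return null;
    }

    // Hypothetical helper: seed the cache with a committed (NORMAL) value.
    void loadNormal(int address, int value) {
        entries.add(new Entry(address, value, Tag.NORMAL));
    }

    // Transactional store: keep the old value under XCOMMIT, write the new one to XABORT.
    void storeTransactional(int address, int value) {
        tActive = true;
        Entry xabort = find(address, Tag.XABORT);
        if (xabort == null) {
            Entry normal = find(address, Tag.NORMAL);
            if (normal == null) throw new IllegalStateException("miss: not modeled in this sketch");
            entries.add(new Entry(address, normal.value, Tag.XCOMMIT));
            normal.tag = Tag.XABORT;
            xabort = normal;
        }
        xabort.value = value;
    }

    void commit() { finish(Tag.XCOMMIT, Tag.XABORT); }

    void abort() { finish(Tag.XABORT, Tag.XCOMMIT); }

    // Entries tagged `discard` become EMPTY; entries tagged `keep` become NORMAL.
    private void finish(Tag discard, Tag keep) {
        for (Entry e : entries) {
            if (e.tag == discard) e.tag = Tag.EMPTY;
            else if (e.tag == keep) e.tag = Tag.NORMAL;
        }
        tStatus = true;
        tActive = false;
    }

    // Value visible outside a transaction, or null if the address has no NORMAL entry.
    Integer normalValue(int address) {
        Entry e = find(address, Tag.NORMAL);
        if (e == null) return null;
        return e.value;
    }
}
```

The naming reads naturally once the tags are seen as "discard on commit" (XCOMMIT holds the old value) and "discard on abort" (XABORT holds the new value).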
Digression
• Why transactional memories instead of locks?
• Locks create several problems and require programmers to use them correctly
– Priority inversion: A lower-priority process that holds a lock is preempted while a higher-priority process needs that lock
– Convoying: A process holding a lock is descheduled, and no other process that needs the lock can progress
– Deadlock: Two or more processes attempt to lock the same set of objects in different orders
Digression
• Transactional memory was invented as a faster means of performing lock-free synchronization
• That is why the earliest TM implementations have no misspeculation. Their aborts are due to capacity constraints (HTM) or lock contention
Speculative Lock Elision (SLE)
• Another reason to use TM!
• Speculatively execute critical sections guarded by locks
• Use cache coherence and rollback for recovery from misspeculation
Hardware TMs in general
• Great idea, efficient implementations
• Limitations
– High cost of implementation
– Small transactional buffer sizes
– Context switches
• Solutions: Unbounded HTM
SOFTWARE TM
Advantage
• More flexible than hardware: allows experimenting with a variety of algorithms
• Fewer limitations imposed by fixed size hardware, like caches
Access Granularity
• Detects conflicting accesses on objects / words / regions
• Object: Easy implementation, but many false conflicts
• Word: Fewer false conflicts
• Region: Less overhead than words
Update
• How the global memory is updated: Direct / deferred
• Direct: The transaction directly modifies the object itself, logs the original value in order to restore in case of abort
• Deferred: The transaction makes local modifications, and changes global memory only on commit
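The two update policies can be contrasted with a small single-threaded sketch. It only illustrates where writes go and what abort and commit must do; conflict detection and synchronization are deliberately left out, and all class names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

final class UpdatePolicies {
    // Direct update: write in place and log the original value for rollback.
    static final class DirectTx {
        private final int[] memory;
        private final Map<Integer, Integer> undoLog = new HashMap<>();

        DirectTx(int[] memory) { this.memory = memory; }

        void write(int loc, int value) {
            undoLog.putIfAbsent(loc, memory[loc]); // remember only the first (original) value
            memory[loc] = value;
        }

        void commit() { undoLog.clear(); } // memory is already up to date

        void abort() {
            for (Map.Entry<Integer, Integer> e : undoLog.entrySet()) {
                memory[e.getKey()] = e.getValue();
            }
            undoLog.clear();
        }
    }

    // Deferred update: buffer writes locally and touch global memory only on commit.
    static final class DeferredTx {
        private final int[] memory;
        private final Map<Integer, Integer> writeBuffer = new HashMap<>();

        DeferredTx(int[] memory) { this.memory = memory; }

        void write(int loc, int value) { writeBuffer.put(loc, value); }

        // A transaction must see its own buffered writes.
        int read(int loc) {
            Integer buffered = writeBuffer.get(loc);
            return buffered != null ? buffered : memory[loc];
        }

        void commit() {
            for (Map.Entry<Integer, Integer> e : writeBuffer.entrySet()) {
                memory[e.getKey()] = e.getValue();
            }
            writeBuffer.clear();
        }

        void abort() { writeBuffer.clear(); } // nothing to undo
    }
}
```

The trade-off shows up in the method bodies: direct update makes commit cheap and abort expensive, while deferred update does the opposite and adds a lookup to every read.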
Conflict Detection
• When conflicts are detected: Eager / lazy / mixed
• What is a conflict: Multiple accesses to the same location, at least one of which is a write
• To commit, a transaction must acquire every location it updates. Eager if acquired at the first update operation, lazy if acquired at commit time.
• Mixed: Eagerly detects write/write conflicts, and lazily detects read/write conflicts
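The definition of a conflict translates directly into a predicate. The sketch below assumes that the accesses come from different transactions and touch the same location; the `Access` type and its field names are hypothetical.

```java
final class ConflictCheck {
    // One memory access performed by a transaction.
    static final class Access {
        final int txId;
        final int location;
        final boolean isWrite;

        Access(int txId, int location, boolean isWrite) {
            this.txId = txId;
            this.location = location;
            this.isWrite = isWrite;
        }
    }

    // Two accesses conflict if different transactions touch the same location
    // and at least one of the accesses is a write.
    static boolean conflicts(Access a, Access b) {
        return a.txId != b.txId
                && a.location == b.location
                && (a.isWrite || b.isWrite);
    }

    static boolean isWriteWrite(Access a, Access b) {
        return conflicts(a, b) && a.isWrite && b.isWrite;
    }
}
```

Under the mixed policy described above, a system would check `isWriteWrite` at access time and defer the remaining read/write cases to commit time.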
STM: 1995
• Memory to be accessed in a transaction known in advance
• Lock-free: Transactions help each other
• Motivation: Replace N-word CAS, implement lock-free data structures etc
The System Model
We assume that every shared memory location supports these 4 operations:
  Write_i(L,v) - thread i writes v to L
  Read_i(L,v) - thread i reads v from L
  LL_i(L,v) - thread i reads v from L and marks that L was read by i
  SC_i(L,v) - thread i writes v to L and returns success if L is marked as read by i. Otherwise it returns failure.
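The four operations can be emulated in software to make their semantics concrete. This is a minimal sketch that uses a lock to emulate what the model assumes the hardware provides; it also assumes, as is usual for LL/SC, that any successful write to the location clears every thread's mark. The class name and the integer thread ids are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Software emulation of one shared location L supporting Read, Write, LL and SC.
final class LlScCell<T> {
    private T value;
    private final Set<Integer> markedReaders = new HashSet<>(); // threads whose LL mark is still valid

    LlScCell(T initial) {
        this.value = initial;
    }

    synchronized T read() {
        return value;
    }

    // Assumption: a plain write also invalidates all outstanding LL marks.
    synchronized void write(T newValue) {
        value = newValue;
        markedReaders.clear();
    }

    // LL: read the value and mark the location as read by `thread`.
    synchronized T loadLinked(int thread) {
        markedReaders.add(thread);
        return value;
    }

    // SC: succeed only if the location is still marked as read by `thread`.
    synchronized boolean storeConditional(int thread, T newValue) {
        if (!markedReaders.contains(thread)) {
            return false;
        }
        value = newValue;
        markedReaders.clear(); // everyone else's mark is now stale
        return true;
    }
}
```

If two threads both LL the same location, only the first SC succeeds; the second thread learns that someone else got there first, which is the signal the STM code relies on.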
Thread
class Rec {
    boolean stable = false;
    boolean,int status = (false,0); //can have two values…
    boolean allWritten = false;
    int version = 0;
    int size = 0;
    int locs[] = {null};
    int oldValues[] = {null};
}
Each thread is defined by an instance of the Rec class (short for record).
The Rec instance defines the current transaction the thread is executing (only one transaction at a time).
The STM Object
[Figure: the STM object. Memory is the shared memory. Ownerships holds pointers to threads. Each thread record Rec1, Rec2, …, Recn holds status, version, size, locs[], oldValues[].]
Flow of a transaction
[Flow chart: Thread i calls startTransaction on the STM, which runs initialize and then transaction. transaction calls acquireOwnerships.
– On (Null, 0): agreeOldValues, calcNewValues, updateMemory, releaseOwnerships. Outcome: Success.
– On (Failure, failed loc): releaseOwnerships, then test isInitiator? If T, initiate a helping transaction to the failed loc (isInitiator := F). Outcome: Failure.]
The STM Object
public class STM {
    int memory[];
    Rec ownerships[];

    public boolean, int[] startTransaction(Rec rec, int[] dataSet) {...};

    private void initialize(Rec rec, int[] dataSet) {...};
    private void transaction(Rec rec, int version, boolean isInitiator) {...};
    private void acquireOwnerships(Rec rec, int version) {...};
    private void releaseOwnerships(Rec rec, int version) {...};
    private void agreeOldValues(Rec rec, int version) {...};
    private void updateMemory(Rec rec, int version, int[] newValues) {...};
}
Implementation
public boolean, int[] startTransaction(Rec rec, int[] dataSet) {
    initialize(rec, dataSet);
    rec.stable = true; // this notifies other threads that I can be helped
    transaction(rec, rec.version, true);
    rec.stable = false;
    rec.version++;
    if (rec.status) return (true, rec.oldValues);
    else return false;
}
rec – The thread that executes this transaction.
dataSet – The locations in memory it needs to own.
Implementation
private void transaction(Rec rec, int version, boolean isInitiator) {
    acquireOwnerships(rec, version); // try to own locations

    (status, failedLoc) = LL(rec.status);
    if (status == null) { // success in acquireOwnerships
        if (version != rec.version) return;
        SC(rec.status, (true,0));
    }

    (status, failedLoc) = LL(rec.status);
    if (status == true) { // execute the transaction
        agreeOldValues(rec, version);
        int[] newVals = calcNewVals(rec.oldValues);
        updateMemory(rec, version, newVals);
        releaseOwnerships(rec, version);
    }
    else { // failed in acquireOwnerships
        releaseOwnerships(rec, version);
        if (isInitiator) {
            Rec failedTrans = ownerships[failedLoc];
            if (failedTrans == null) return;
            else { // execute the transaction that owns the location you want
                int failedVer = failedTrans.version;
                if (failedTrans.stable) transaction(failedTrans, failedVer, false);
            }
        }
    }
}
rec – The thread that executes this transaction.
version – Serial number of the transaction.
isInitiator – Am I the initiating thread or the helper?
Another thread owns the locations I need and it hasn’t finished its transaction yet, so I go out and execute its transaction in order to help it.
Implementation
private void acquireOwnerships(Rec rec, int version) {
    for (int j=1; j<=rec.size; j++) {
        while (true) {
            int loc = rec.locs[j];
            if (LL(rec.status) != null) return; // transaction completed by some other thread
            Rec owner = LL(ownerships[loc]);
            if (rec.version != version) return;
            if (owner == rec) break; // location is already mine
            if (owner == null) { // acquire location
                if ( SC(rec.status, (null, 0)) ) {
                    if ( SC(ownerships[loc], rec) ) { break; }
                }
            }
            else { // location is taken by someone else
                if ( SC(rec.status, (false, j)) ) return;
            }
        }
    }
}
If I’m not the last one to read this field, it means that another thread is trying to execute this transaction. Loop until I succeed or until the other thread completes the transaction.
Implementation
// Copy the dataSet to my private space
private void agreeOldValues(Rec rec, int version) {
    for (int j=1; j<=rec.size; j++) {
        int loc = rec.locs[j];
        if ( LL(rec.oldValues[loc]) != null ) {
            if (rec.version != version) return;
            SC(rec.oldValues[loc], memory[loc]);
        }
    }
}

// Selectively update the shared memory
private void updateMemory(Rec rec, int version, int[] newValues) {
    for (int j=1; j<=rec.size; j++) {
        int loc = rec.locs[j];
        int oldValue = LL(memory[loc]);
        if (rec.allWritten) return; // work is done
        if (rec.version != version) return;
        if (oldValue != newValues[j]) SC(memory[loc], newValues[j]);
    }
    if (! LL(rec.allWritten) ) {
        if (rec.version != version) return; // a newer transaction now uses this record
        SC(rec.allWritten, true);
    }
}
DSTM: 2003
• Object granularity
• Deferred update
• Eager conflict detection
• Indirection
• Validation
TL2: 2006
• Lock based
• Smart idea: Keep validation fast
• Many recent STMs use TL2 as their base
Trends
• Initially: Lock-free
• Then: Obstruction-free
• Now: Mostly Lock based
• Reason: Simplicity pays off!
Homework 2
• Q1. Review a paper on HTM/STM
McRT STM
Bartok STM
NZ TM
Swiss TM
Tiny STM
Log TM
UTM
Question 2
• Understand the importance of different validation steps in DSTM and TL2
• Due date for Homework 2: 25 November
References
• Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model (Archibald and Baer, TOCS 1986)
• Transactional Memory: Architectural Support for Lock-Free Data Structures (Maurice Herlihy and J. Eliot B. Moss, ISCA 1993)
• Software Transactional Memory (Nir Shavit and Dan Touitou, PODC 1995)
• Software Transactional Memory for Dynamic-Sized Data Structures (Maurice Herlihy et al., PODC 2003)
• Transactional Locking II (Dave Dice et al., DISC 2006)
Next Lecture
• Correctness Properties in TM
• Formal Semantics of TM