LECTURE 5 A Brief History of TM


Page 1: LECTURE 5 A Brief History of TM

LECTURE 5 A Brief History of TM

Page 2: LECTURE 5 A Brief History of TM

Precursors of Computing: ENIAC

• 5000 ops/second
• $486k in 1946
• 19k vacuum tubes
• 200 kW
• 67 cubic meters

Page 3: LECTURE 5 A Brief History of TM

Latest trends: Intel Nehalem

• 1.9 billion transistors
• 12 billion ops per second
• 4 microprocessors
• 8 MB of on-chip memory
• 100 W
• 246 square millimeters

Page 4: LECTURE 5 A Brief History of TM

The Way: Not just Chip Frequency!

• 1970s: Programmable controllers, single-chip microprocessors
• 1980s: Instruction pipelines, cache hierarchies
• 1990s: Speculative execution, superscalar processors
• 2000s: Multicore chips, embedded computing

Page 5: LECTURE 5 A Brief History of TM

Pipelining

• Split the processing of an instruction into a series of independent steps

• Classic pipeline
– Instruction Fetch (IF)
– Instruction Decode (ID)
– Execute (EX)
– Memory Access (MEM)
– Register Write Back (WB)

Page 6: LECTURE 5 A Brief History of TM

Pipelining

          Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Instr 1   IF        ID        EX        MEM       WB
Instr 2             IF        ID        EX        MEM
Instr 3                       IF        ID        EX
Instr 4                                 IF        ID

Different parts of the CPU used for different stages of the pipeline

Page 7: LECTURE 5 A Brief History of TM

Pipelining

• Throughput: Set by the speed of the slowest stage instead of the latency of the whole instruction
• More expensive design
• Performance of a pipelined processor depends on the executing program, and is harder to predict than that of a non-pipelined processor
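The throughput claim can be made concrete with a small cycle count. The sketch below is illustrative only: it assumes every stage takes exactly one cycle and that there are no stalls or hazards, and the class and method names are made up for this example.

```java
// Minimal sketch: ideal cycle counts, assuming one cycle per stage and no stalls.
final class PipelineTiming {
    /** Cycles needed when each instruction runs through all stages before the next starts. */
    static long sequentialCycles(long instructions, long stages) {
        return instructions * stages;
    }

    /** Cycles with ideal pipelining: fill the pipeline once, then finish one instruction per cycle. */
    static long pipelinedCycles(long instructions, long stages) {
        if (instructions == 0) {
            return 0;
        }
        return stages + instructions - 1;
    }

    public static void main(String[] args) {
        long n = 4; // four instructions, as in the cycle table
        long k = 5; // IF, ID, EX, MEM, WB
        System.out.println("sequential: " + sequentialCycles(n, k)); // prints "sequential: 20"
        System.out.println("pipelined:  " + pipelinedCycles(n, k));  // prints "pipelined:  8"
    }
}
```

With four instructions and five stages the last instruction leaves the pipeline in cycle 8, which agrees with extending the cycle table by three more columns.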

Page 8: LECTURE 5 A Brief History of TM

Superscalar

• Executes multiple instructions per clock cycle by simultaneously dispatching to redundant functional units

• Think of it as multiple parallel pipelines, each processing instructions from a single stream

• Limitation: Degree of intrinsic parallelism in the stream

Page 9: LECTURE 5 A Brief History of TM

Out of Order Execution (OOE)

• Multiple instructions are fetched
• Instructions are dispatched to an instruction queue (also called instruction buffer or reservation stations)

• Instruction waits in the queue until the input operands are available

• Note that the instruction may leave the queue before earlier instructions

• Results are queued

Page 10: LECTURE 5 A Brief History of TM

Speculation in ILP

• Pipelining, OOE, and superscalar execution all involve some form of “speculation”
– Example: branch prediction
• There has always been some speculation “circuitry” in processors

Page 11: LECTURE 5 A Brief History of TM

Forms of parallelism

• Functional: Perform tasks that are functionally different in parallel, e.g. building a house – plumber, carpenter, electrician

• Pipeline: Perform tasks that are different in a particular order, e.g. lunch buffet

• Data: Perform the same task on different data, e.g. grading exams, MapReduce

Page 12: LECTURE 5 A Brief History of TM

Limitations of ILP

• Finite amount of ILP in any sequence of instructions

• Another possibility: Thread Level Parallelism (Functional parallelism)

• How to get multiple threads?
– Write parallel programs
– Thread level speculation
– Code parallelization

Page 13: LECTURE 5 A Brief History of TM

Thread Level Speculation

• Takes a sequence of instructions
• Arbitrarily breaks it into a sequenced group of threads that may run in parallel
• Allows for oblivious parallelization of sequential programs
• Parallelization by speculation dynamically finds parallelism at runtime, and thus is not conservative

Page 14: LECTURE 5 A Brief History of TM

Code parallelization

• Implemented in compilers, e.g. SUIF

• Problems: Hard to identify dependencies between pieces of code and data at compile time

Page 15: LECTURE 5 A Brief History of TM

CMP (Chip Multiprocessors)

• Forward data between parallel threads
• Detect when reads occur too early
• Safely discard speculative state after violations
• Retire speculative writes in correct order

• Examples: Stanford HYDRA, Wisconsin Multiscalar, CMU Stampede (1995-2000)

Page 16: LECTURE 5 A Brief History of TM

Cache Coherence

• Consistency of data stored in local caches of a shared resource (Wiki definition)

• Protocols
– MESI
– MOESI
– MOSI
– MSI

Page 17: LECTURE 5 A Brief History of TM

[Figure: processors P1, P2, P3 and P4, each with its own CACHE, connected through an INTERCONNECTION NETWORK to MAIN MEMORY]

Page 18: LECTURE 5 A Brief History of TM

2-state Invalidation Cache Protocol

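States: VALID, INVALID

Transitions (X / Y):
• INVALID to VALID: PrRd / BusRd
• INVALID to INVALID: PrWr / BusWr
• VALID to VALID: PrRd / --
• VALID to VALID: PrWr / BusWr
• VALID to INVALID: BusWr observed on the bus

Legend:
• X/Y: Action X / Reaction Y
• PrRd: Processor Read
• PrWr: Processor Write
• BusRd: Fetch a cache block
• BusWr: Write through one word
• --: No action

Write through, no allocation. VALID indicates cache presence.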
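The two-state diagram can be read as a tiny state machine per cache block. The following is a minimal sketch under that reading, assuming the standard write-through, no-allocate behaviour (a write miss does not bring the block into the cache); the class and method names are hypothetical.

```java
// Sketch of one cache block under the 2-state write-through invalidation protocol.
final class TwoStateBlock {
    enum State { VALID, INVALID }
    enum BusOp { NONE, BUS_RD, BUS_WR }

    private State state = State.INVALID;

    State state() {
        return state;
    }

    /** PrRd: a miss fetches the block (BusRd) and makes it VALID; a hit needs no bus action. */
    BusOp processorRead() {
        if (state == State.INVALID) {
            state = State.VALID;
            return BusOp.BUS_RD;
        }
        return BusOp.NONE;
    }

    /** PrWr: every write goes on the bus; the state is unchanged (no allocation on a miss). */
    BusOp processorWrite() {
        return BusOp.BUS_WR;
    }

    /** BusWr from another cache: our copy is stale, so drop it. */
    void snoopBusWrite() {
        state = State.INVALID;
    }
}
```

Because `processorWrite` always returns `BUS_WR`, the sketch also shows why the protocol needs high bus bandwidth.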

Page 19: LECTURE 5 A Brief History of TM

2-State Protocol

• Simple hardware and protocol

• Requires high bandwidth (every write goes on bus!)

Page 20: LECTURE 5 A Brief History of TM

3-state Protocol (MSI)

• Modified

• Shared

• Invalid

Page 21: LECTURE 5 A Brief History of TM

MSI State Diagram

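States: M (Modified), S (Shared), I (Invalid)

Transitions (X / Y):
• M to M: PrRd / --, PrWr / --
• M to S: BusRd / BusWB
• M to I: BusRdX / BusWB
• S to S: PrRd / --, BusRd / --
• S to M: PrWr / BusRdX
• S to I: BusRdX / --
• I to S: PrRd / BusRd
• I to M: PrWr / BusRdX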
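The MSI diagram can likewise be written as a transition function for one cache line. This is a sketch, not a full coherence controller: it assumes the standard MSI edges, it models neither data transfer nor the bus itself, and the type names are invented for the example.

```java
// Sketch of the MSI transitions for a single cache line.
final class MsiLine {
    enum State { M, S, I }
    enum Event { PR_RD, PR_WR, BUS_RD, BUS_RDX }
    enum Action { NONE, BUS_RD, BUS_RDX, BUS_WB }

    State state = State.I;

    /** Applies one event, updates the state, and returns the bus action it triggers. */
    Action on(Event e) {
        switch (state) {
            case M:
                switch (e) {
                    case BUS_RD:  state = State.S; return Action.BUS_WB; // supply dirty data
                    case BUS_RDX: state = State.I; return Action.BUS_WB; // give up ownership
                    default:      return Action.NONE;                    // PrRd / PrWr hit
                }
            case S:
                switch (e) {
                    case PR_WR:   state = State.M; return Action.BUS_RDX; // upgrade to exclusive
                    case BUS_RDX: state = State.I; return Action.NONE;    // another writer
                    default:      return Action.NONE;                     // PrRd hit, BusRd
                }
            default: // State.I
                switch (e) {
                    case PR_RD: state = State.S; return Action.BUS_RD;
                    case PR_WR: state = State.M; return Action.BUS_RDX;
                    default:    return Action.NONE; // bus traffic for a line we do not hold
                }
        }
    }
}
```

Unlike the two-state protocol, a write hit in M produces no bus traffic at all.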

Page 22: LECTURE 5 A Brief History of TM

Further Improvements

• MESI: Illinois protocol

• MOESI

Page 23: LECTURE 5 A Brief History of TM

FIRST TRANSACTIONAL MEMORIES

Page 24: LECTURE 5 A Brief History of TM

Precursors: Knight (1986)

• Idea of TLS
• Two caches per processor
• The first proposal to use caches and cache coherence to maintain and enforce ordering among speculatively parallelized regions of sequential code in the presence of unknown memory dependencies

Page 25: LECTURE 5 A Brief History of TM

The word “Transactional Memory”

• Introduced by Herlihy and Moss in 1991

• Idea: Adapt the cache coherence protocol so that transactional accesses are monitored

Page 26: LECTURE 5 A Brief History of TM

ISCA 93

• Six new instructions
– Load-transactional
– Load-transactional-exclusive
– Store-transactional
– Commit
– Abort
– Validate

• New processor flags
– TACTIVE: Is a transaction currently active?
– TSTATUS: Is the active transaction in progress, or aborted?

Page 27: LECTURE 5 A Brief History of TM

Transactional Cache

• States: MESI
• Additional transactional tags: EMPTY, NORMAL, XCOMMIT, XABORT
• Transactional operations create two entries: one with XCOMMIT and one with XABORT
• Modifications made to XABORT on Store

Page 28: LECTURE 5 A Brief History of TM

Three extra bus cycles

• T_READ: On a transactional load

• T_RFO: On a transactional load exclusive, or a store

• BUSY: Full cache or other reasons (prevent deadlocks or mutual aborts)

Page 29: LECTURE 5 A Brief History of TM

Load_transactional

• LT:
– Search TxCache for an XABORT entry. Return its value if one exists
– If there is no XABORT entry, search for a NORMAL entry. Change it to XABORT. Allocate a second entry with tag XCOMMIT and same data
– Else, issue a T_READ cycle. Behaves as Goodman’s read. Two entries created: tagged with XABORT and XCOMMIT.

Page 30: LECTURE 5 A Brief History of TM

Load_transactional_exclusive

• Similar to LT

• Instead of T_READ, T_RFO used on a miss

Page 31: LECTURE 5 A Brief History of TM

Store

• Similar to LTX

• Changes the XABORT entry’s data too

Page 32: LECTURE 5 A Brief History of TM

Validate

• Returns the TSTATUS flag

• If the TSTATUS flag is FALSE
– Sets TSTATUS to TRUE
– Sets TACTIVE to FALSE

Page 33: LECTURE 5 A Brief History of TM

Abort

• Discards the XABORT entries (sets their tags as EMPTY)

• Sets the tags of XCOMMIT entries as NORMAL
• Sets the TSTATUS to TRUE
• Sets the TACTIVE to FALSE

Page 34: LECTURE 5 A Brief History of TM

Commit

• Discards the XCOMMIT entries (sets their tags to EMPTY)

• Sets the tags of XABORT entries to NORMAL
• Sets TSTATUS to TRUE
• Sets TACTIVE to FALSE
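The tag handling on Store, Abort and Commit can be sketched in a few lines. This is only an illustration of the bookkeeping: it assumes the line is already cached as a NORMAL entry (the hit case), it models no bus cycles or MESI states, and `install`, `read` and the class name are hypothetical helpers, not part of the proposed instruction set.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: tag bookkeeping of a transactional cache, hit case, no bus cycles.
final class TxCacheSketch {
    enum Tag { EMPTY, NORMAL, XCOMMIT, XABORT }

    static final class Entry {
        final int address;
        int value;
        Tag tag;
        Entry(int address, int value, Tag tag) {
            this.address = address;
            this.value = value;
            this.tag = tag;
        }
    }

    final List<Entry> entries = new ArrayList<>();
    boolean tactive = false; // TACTIVE flag
    boolean tstatus = true;  // TSTATUS flag

    private Entry find(int address, Tag tag) {
        for (Entry e : entries) {
            if (e.address == address && e.tag == tag) return e;
        }
        return null;
    }

    /** Hypothetical helper: place committed data in the cache as a NORMAL entry. */
    void install(int address, int value) {
        entries.add(new Entry(address, value, Tag.NORMAL));
    }

    /** Store on a hit: the old value is kept under XCOMMIT, the new one goes under XABORT. */
    void storeTransactional(int address, int newValue) {
        tactive = true;
        Entry tentative = find(address, Tag.XABORT);
        if (tentative == null) {
            Entry normal = find(address, Tag.NORMAL);
            if (normal == null) throw new IllegalStateException("miss: T_RFO not modeled");
            normal.tag = Tag.XABORT;
            entries.add(new Entry(address, normal.value, Tag.XCOMMIT));
            tentative = normal;
        }
        tentative.value = newValue;
    }

    /** Value seen by the local processor: tentative data first, then committed data. */
    int read(int address) {
        Entry e = find(address, Tag.XABORT);
        if (e == null) e = find(address, Tag.NORMAL);
        if (e == null) throw new IllegalStateException("miss: bus cycles not modeled");
        return e.value;
    }

    void commit() { finish(Tag.XCOMMIT, Tag.XABORT); }

    void abort() { finish(Tag.XABORT, Tag.XCOMMIT); }

    /** Entries tagged 'discard' become EMPTY, entries tagged 'keep' become NORMAL. */
    private void finish(Tag discard, Tag keep) {
        for (Entry e : entries) {
            if (e.tag == discard) e.tag = Tag.EMPTY;
            else if (e.tag == keep) e.tag = Tag.NORMAL;
        }
        tstatus = true;
        tactive = false;
    }
}
```

Commit and Abort differ only in which of the two tagged copies survives, which is why neither needs to copy any data.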

Page 35: LECTURE 5 A Brief History of TM

Digression

• Why transactional memories instead of locks?

• Locks create several problems and require programmers to use them properly
– Priority inversion: A lower-priority process that holds a lock is preempted while a higher-priority process needs the lock
– Convoying: A process holding a lock is descheduled, and no other process waiting for that lock can progress
– Deadlock: Two or more processes attempt to lock the same set of objects in different orders

Page 36: LECTURE 5 A Brief History of TM

Digression

• Transactional memory was invented as a faster means of performing lock-free synchronization

• That is why the earliest TM implementations have no misspeculation: their aborts are due to capacity constraints (HTM) or lock contention

Page 37: LECTURE 5 A Brief History of TM

Speculative Lock Elision (SLE)

• Another reason to use TM!
• Speculatively execute critical sections guarded by locks
• Use cache coherence and rollback for recovery from misspeculation

Page 38: LECTURE 5 A Brief History of TM

Hardware TMs in general

• Great idea, efficient implementations
• Limitations
– High cost of implementation
– Small transactional buffer sizes
– Context switches

• Solutions: Unbounded HTM

Page 39: LECTURE 5 A Brief History of TM

SOFTWARE TM

Page 40: LECTURE 5 A Brief History of TM

Advantage

• More flexible than hardware, allows experimenting with a variety of algorithms

• Fewer limitations imposed by fixed size hardware, like caches

Page 41: LECTURE 5 A Brief History of TM

Access Granularity

• Detects conflicting accesses on objects / words / regions

• Object: Easy implementation, but a lot of false conflicts
• Word: Fewer false conflicts
• Region: Less overhead than words

Page 42: LECTURE 5 A Brief History of TM

Update

• How the global memory is updated: Direct / deferred

• Direct: The transaction directly modifies the object itself, logs the original value in order to restore in case of abort

• Deferred: The transaction makes local modifications, and changes global memory only on commit
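The two update policies can be contrasted with a pair of small classes. This is a single-threaded sketch that assumes memory is a plain `int[]` and leaves out conflict detection and synchronization entirely; the class names `DirectTx` and `DeferredTx` are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch: direct (undo log) versus deferred (write buffer) update, no concurrency control.
final class UpdatePolicies {
    /** Direct update: write in place and log the old value so an abort can restore it. */
    static final class DirectTx {
        private final int[] memory;
        private final Deque<int[]> undoLog = new ArrayDeque<>(); // entries are {index, oldValue}

        DirectTx(int[] memory) { this.memory = memory; }

        int read(int i) { return memory[i]; }

        void write(int i, int v) {
            undoLog.push(new int[] {i, memory[i]});
            memory[i] = v;
        }

        void commit() { undoLog.clear(); }

        void abort() {
            while (!undoLog.isEmpty()) { // undo in reverse order of the writes
                int[] u = undoLog.pop();
                memory[u[0]] = u[1];
            }
        }
    }

    /** Deferred update: keep writes private and publish them only on commit. */
    static final class DeferredTx {
        private final int[] memory;
        private final Map<Integer, Integer> writeBuffer = new HashMap<>();

        DeferredTx(int[] memory) { this.memory = memory; }

        int read(int i) {
            Integer buffered = writeBuffer.get(i); // a transaction must see its own writes
            return buffered != null ? buffered : memory[i];
        }

        void write(int i, int v) { writeBuffer.put(i, v); }

        void commit() {
            for (Map.Entry<Integer, Integer> e : writeBuffer.entrySet()) {
                memory[e.getKey()] = e.getValue();
            }
            writeBuffer.clear();
        }

        void abort() { writeBuffer.clear(); }
    }
}
```

The trade-off is visible in the method bodies: direct update makes commit trivial and abort expensive, while deferred update makes abort trivial and commit expensive.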

Page 43: LECTURE 5 A Brief History of TM

Conflict Detection

• When conflicts are detected: Eager / lazy / mixed
• What a conflict is: Multiple accesses to the same location, at least one of which is a write
• To commit, a transaction must acquire every location it updates. Eager if a location is acquired at the first update operation, lazy if acquired at the time of commit
• Mixed: Eagerly detects write/write conflicts, and lazily detects read/write conflicts
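The definition of a conflict translates directly into a check on read and write sets. The sketch below assumes locations are identified by integers and that each transaction's read and write sets are known; it says nothing about when the check runs, which is exactly the eager / lazy / mixed choice. All names are hypothetical.

```java
import java.util.Set;

// Sketch: two transactions conflict if they share a location and at least one access is a write.
final class ConflictCheck {
    static boolean intersects(Set<Integer> a, Set<Integer> b) {
        for (Integer x : a) {
            if (b.contains(x)) return true;
        }
        return false;
    }

    /** Write/write, write/read and read/write overlaps are conflicts; read/read is not. */
    static boolean conflicts(Set<Integer> readsA, Set<Integer> writesA,
                             Set<Integer> readsB, Set<Integer> writesB) {
        return intersects(writesA, writesB)
            || intersects(writesA, readsB)
            || intersects(readsA, writesB);
    }
}
```

A mixed scheme would evaluate the first clause (write/write) as soon as a write is attempted and postpone the other two until commit.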

Page 44: LECTURE 5 A Brief History of TM

STM: 1995

• Memory to be accessed in a transaction known in advance

• Lock-free: Transactions help each other

• Motivation: Replace N-word CAS, implement lock-free data structures etc

Page 45: LECTURE 5 A Brief History of TM

The System Model

We assume that every shared memory location supports these 4 operations:
• Write_i(L,v): thread i writes v to L
• Read_i(L,v): thread i reads v from L
• LL_i(L,v): thread i reads v from L and marks that L was read by i
• SC_i(L,v): thread i writes v to L and returns success if L is marked as read by i. Otherwise it returns failure.
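The four operations can be emulated in software to make their contract concrete. This is a sketch, not how hardware implements LL/SC: it uses a monitor for atomicity, identifies threads by an integer id passed explicitly, and assumes the usual rule that any write or successful SC clears all outstanding marks (the slide states only the success condition). The class name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: a shared location offering Read, Write, LL and SC, emulated with a monitor.
final class LlScCell<T> {
    private T value;
    private final Set<Integer> marks = new HashSet<>(); // threads whose LL is still valid

    LlScCell(T initial) {
        value = initial;
    }

    synchronized T read(int thread) {
        return value;
    }

    synchronized void write(int thread, T v) {
        value = v;
        marks.clear(); // assumption: a plain write invalidates every outstanding LL
    }

    /** Load-linked: return the value and remember that this thread has read it. */
    synchronized T ll(int thread) {
        marks.add(thread);
        return value;
    }

    /** Store-conditional: succeed only if this thread's mark from LL is still in place. */
    synchronized boolean sc(int thread, T v) {
        if (!marks.contains(thread)) {
            return false;
        }
        value = v;
        marks.clear(); // the location changed, so all other pending SCs must fail
        return true;
    }
}
```

If threads 1 and 2 both call `ll` and thread 1 then succeeds with `sc`, thread 2's `sc` fails, which is the property the STM algorithm relies on when several threads help the same transaction.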

Page 46: LECTURE 5 A Brief History of TM

Thread

class Rec {
    boolean stable = false;
    boolean,int status = (false,0); // can have two values…
    boolean allWritten = false;
    int version = 0;
    int size = 0;
    int locs[] = {null};
    int oldValues[] = {null};
}

Each thread is defined by an instance of the Rec class (short for record).

The Rec instance defines the current transaction the thread is executing (only one transaction at a time).

Page 47: LECTURE 5 A Brief History of TM

The STM Object

[Figure: the STM object. Memory is the shared memory. Ownerships holds pointers to threads. Each thread record Rec1, Rec2, …, Recn contains status, version, size, locs[] and oldValues[].]

Page 48: LECTURE 5 A Brief History of TM

Flow of a transaction

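[Figure: flowchart between the Threads and the STM. Thread i calls startTransaction, which runs initialize and then transaction. transaction calls acquireOwnerships. On (Null, 0) it continues with agreeOldValues, calcNewValues, updateMemory and releaseOwnerships, and the outcome is Success. On (Failure, failed loc) it calls releaseOwnerships and tests isInitiator?: on T it initiates a helping transaction to the failed loc (isInitiator := F); the outcome is Failure.]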

Page 49: LECTURE 5 A Brief History of TM

The STM Object

public class STM {
    int memory[];
    Rec ownerships[];

    public boolean, int[] startTransaction(Rec rec, int[] dataSet) {...};

    private void initialize(Rec rec, int[] dataSet) {...};
    private void transaction(Rec rec, int version, boolean isInitiator) {...};
    private void acquireOwnerships(Rec rec, int version) {...};
    private void releaseOwnerships(Rec rec, int version) {...};
    private void agreeOldValues(Rec rec, int version) {...};
    private void updateMemory(Rec rec, int version, int[] newValues) {...};
}

Page 50: LECTURE 5 A Brief History of TM

Implementation

// rec – The thread that executes this transaction.
// dataSet – The locations in memory it needs to own.
public boolean, int[] startTransaction(Rec rec, int[] dataSet) {
    initialize(rec, dataSet);
    rec.stable = true; // This notifies other threads that I can be helped
    transaction(rec, rec.version, true);
    rec.stable = false;
    rec.version++;
    if (rec.status) return (true, rec.oldValues);
    else return false;
}

Page 51: LECTURE 5 A Brief History of TM

Implementation

// rec – The thread that executes this transaction.
// version – Serial number of the transaction.
// isInitiator – Am I the initiating thread or the helper?
private void transaction(Rec rec, int version, boolean isInitiator) {
    acquireOwnerships(rec, version); // try to own locations

    (status, failedLoc) = LL(rec.status);
    if (status == null) { // success in acquireOwnerships
        if (version != rec.version) return;
        SC(rec.status, (true,0));
    }

    (status, failedLoc) = LL(rec.status);
    if (status == true) { // execute the transaction
        agreeOldValues(rec, version);
        int[] newVals = calcNewVals(rec.oldValues);
        updateMemory(rec, version, newVals);
        releaseOwnerships(rec, version);
    }
    else { // failed in acquireOwnerships
        releaseOwnerships(rec, version);
        if (isInitiator) {
            Rec failedTrans = ownerships[failedLoc];
            if (failedTrans == null) return;
            else { // execute the transaction that owns the location you want
                int failedVer = failedTrans.version;
                if (failedTrans.stable) transaction(failedTrans, failedVer, false);
            }
        }
    }
}

Another thread owns the locations I need and it hasn’t finished its transaction yet.

So I go out and execute its transaction in order to help it.

Page 52: LECTURE 5 A Brief History of TM

Implementation

private void acquireOwnerships(Rec rec, int version) {
    for (int j=1; j<=rec.size; j++) {
        while (true) {
            int loc = rec.locs[j];
            if (LL(rec.status) != null) return; // transaction completed by some other thread
            Rec owner = LL(ownerships[loc]);
            if (rec.version != version) return;
            if (owner == rec) break; // location is already mine
            if (owner == null) { // acquire location
                if ( SC(rec.status, (null, 0)) ) {
                    if ( SC(ownerships[loc], rec) ) { break; }
                }
            }
            else { // location is taken by someone else
                if ( SC(rec.status, (false, j)) ) return;
            }
        }
    }
}

If I’m not the last one to read this field, it means that another thread is trying to execute this transaction. Try to loop until I succeed or until the other thread completes the transaction.

Page 53: LECTURE 5 A Brief History of TM

Implementation

// Copy the dataSet to my private space
private void agreeOldValues(Rec rec, int version) {
    for (int j=1; j<=rec.size; j++) {
        int loc = rec.locs[j];
        if ( LL(rec.oldValues[loc]) != null ) {
            if (rec.version != version) return;
            SC(rec.oldValues[loc], memory[loc]);
        }
    }
}

// Selectively update the shared memory
private void updateMemory(Rec rec, int version, int[] newValues) {
    for (int j=1; j<=rec.size; j++) {
        int loc = rec.locs[j];
        int oldValue = LL(memory[loc]);
        if (rec.allWritten) return; // work is done
        if (rec.version != version) return;
        if (oldValue != newValues[j]) SC(memory[loc], newValues[j]);
    }
    if (! LL(rec.allWritten) ) {
        if (rec.version != version) return;
        SC(rec.allWritten, true);
    }
}

Page 54: LECTURE 5 A Brief History of TM

DSTM: 2003

• Object granularity

• Deferred update

• Eager conflict detection

• Indirection

• Validation

Page 55: LECTURE 5 A Brief History of TM

TL2: 2006

• Lock based

• Smart idea: Keep validation fast

• Many recent STMs use TL2 as their base

Page 56: LECTURE 5 A Brief History of TM

Trends

• Initially: Lock-free

• Then: Obstruction-free

• Now: Mostly Lock based

• Reason: Simplicity pays off!

Page 57: LECTURE 5 A Brief History of TM

Homework 2

• Q1. Review a paper on HTM/STM
– McRT STM
– Bartok STM
– NZ TM
– Swiss TM
– Tiny STM
– Log TM
– UTM

Page 58: LECTURE 5 A Brief History of TM

Question 2

• Understand the importance of different validation steps in DSTM and TL2

• Due date for Homework 2: 25 November

Page 59: LECTURE 5 A Brief History of TM

References

• Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model (Archibald and Baer, TOCS 1986)
• Transactional Memory: Architectural Support for Lock-Free Data Structures (Maurice Herlihy and J. Eliot B. Moss, ISCA 1993)
• Software Transactional Memory (Nir Shavit and Dan Touitou, PODC 1995)
• STM for Dynamic-sized Data Structures (Maurice Herlihy et al., PODC 2003)
• Transactional Locking II (Dave Dice et al., DISC 2006)

Page 60: LECTURE 5 A Brief History of TM

Next Lecture

• Correctness Properties in TM

• Formal Semantics of TM