CS4021 LOCKLESS ALGORITHMS - Trinity College Dublin


CS4021 LOCKLESS ALGORITHMS

© 2012 [email protected] 11-Dec-12

School of Computer Science and Statistics, Trinity College Dublin 1

CS4021 Advanced Computer Architecture
• concurrent programming with and without locks
• atomic instructions / updates
• lock implementations and performance
• lockless [non blocking] data structures and algorithms
    CAS based
    MCAS based
    memory management [e.g. hazard pointers]
• hardware transactional memory [HTM]
    Herlihy and Moss [1993]
    Intel Haswell CPU [2012]
• The Art of Multiprocessor Programming, Maurice Herlihy and Nir Shavit
• CS4021 website https://www.scss.tcd.ie/Jeremy.Jones/CS4021/CS4021.htm


Why be concerned?
• clock rate of a single CPU core appears to be limited to ≈ 4GHz
• single CPU core processing power very far short of doubling every 18 months
• Intel, AMD, Sun, IBM, … producing multicore CPUs instead
• typical desktop has 4 cores with each core capable of executing 2 threads [hyper-threading] giving a total of 8 concurrent threads
• typical desktop in 2014 16 threads, 2016 32 threads, … [Moore's Law and Joy's Law]
• need to be able to exploit cheap threads on multicore CPUs
• lock based solutions are simply not scalable as a lock inhibits parallelism
• need to explore lockless data structures and algorithms


Consider an ordered linked list [set]
• each node has a key and a next field

• NB: list doesn't contain duplicate keys

• add(25) adds a node containing 25 to list [does nothing if item already in list]


Consider an ordered linked list…
• remove(30) removes node containing 30 from list [does nothing if item NOT in list]


Concurrent Updates
• conventional approach is to protect list with a single lock [CriticalSection, Mutex, …] which prevents concurrent accesses by different threads
• if list is protected by a lock, it is clear that ONLY ONE operation can occur at a time [access to list serialised by lock]
• ALSO clear that if the list is long enough, multiple add and remove operations can occur concurrently as they will update pointers in disjoint parts of the list [disjoint access parallelism]

• lockless approach allows multiple add and remove operations to occur concurrently

• remedial action taken if a clash is detected [non disjoint updates]
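The conventional approach described above can be sketched as follows. This is a minimal illustration, assuming portable C++ with std::mutex in place of a Windows CriticalSection; the names LockedSet, Node and contains are hypothetical and do not come from the course code.

```cpp
#include <mutex>

// Ordered singly linked list [set] protected by ONE lock.
// LockedSet and Node are illustrative names only.
class LockedSet {
    struct Node {
        int key;
        Node* next;
    };
    Node* head = nullptr;
    std::mutex m;                               // the single lock protecting the list

public:
    ~LockedSet() {
        while (head) { Node* n = head; head = head->next; delete n; }
    }

    // add key; returns false [does nothing] if key already in list
    bool add(int key) {
        std::lock_guard<std::mutex> guard(m);   // access to list serialised by lock
        Node** pp = &head;
        while (*pp && (*pp)->key < key)         // find first node with key >= new key
            pp = &(*pp)->next;
        if (*pp && (*pp)->key == key)
            return false;                       // no duplicate keys
        *pp = new Node{key, *pp};               // link new node in order
        return true;
    }

    // remove key; returns false [does nothing] if key NOT in list
    bool remove(int key) {
        std::lock_guard<std::mutex> guard(m);
        Node** pp = &head;
        while (*pp && (*pp)->key < key)
            pp = &(*pp)->next;
        if (!*pp || (*pp)->key != key)
            return false;
        Node* victim = *pp;
        *pp = victim->next;                     // unlink node
        delete victim;
        return true;
    }

    bool contains(int key) {
        std::lock_guard<std::mutex> guard(m);
        for (Node* n = head; n && n->key <= key; n = n->next)
            if (n->key == key) return true;
        return false;
    }
};
```

Every operation holds the same lock for its whole duration, so two threads updating disjoint parts of a long list still execute one after the other.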


Spin Lock Implementations
• implementations should minimise bus traffic especially when a lock is heavily contested
• CPUs waiting for a lock are idle and shouldn't generate unnecessary bus traffic which slows the CPUs doing real work
• spin lock implementations usually rely on atomic instructions which comprise an indivisible read-modify-write [RMW] access to a shared memory location
• in a single CPU system, many instructions are effectively atomic because interrupts are ONLY recognised between instructions


Spin Lock Implementations…
• consider a spinlock implementation based on an IA32 logical shift right instruction [shr]

;
; simple spin lock (NB: 1 == free, 0 == taken)
;
wait    shr   lock, 1     ; lock in memory
        jnc   wait        ; jump no carry (retry if C == 0)
        ret               ; return
free    mov   lock, 1     ; lock = 1 (free)
        ret               ; return

• works in a single CPU system, but not in a multiprocessor
• why? determined by how CPU updates memory
• if lock free and “shr lock, 1” is executed, lock becomes taken and the carry flag is set
• atomically/simultaneously sets lock as taken and returns the fact that the lock has been acquired in the carry flag


Bus Arbiter

• if CPU wishes to access shared memory, it asserts its bus request signal [/BREQn]

• arbiter grants access to one CPU at a time by asserting its bus grant signal [/BGRNTn]

• arbiter normally grants bus to CPUs on a cycle by cycle basis in a fair manner [round robin]

• ONLY one CPU at a time can access shared memory

• CPUs given access to bus and shared memory, one at a time, by a bus arbiter


Atomic Instructions
• atomic RMW memory accesses [read cycle followed by a write cycle] must NOT be interleaved with memory accesses made by other CPUs
• CPUs generally have special atomic instructions which indicate externally that an atomic RMW memory access is being performed
• if bus cycles are arbitrated on a cycle by cycle basis [i.e. NON atomic] then a CPU could read a lock and find it free; on the next bus cycle another CPU could also read the lock and find it still free before the first CPU has been given a bus cycle to set the lock; this would result in the lock being allocated to both CPUs
• IA32/x64 CPUs assert a /LOCK signal [external pin on chip] to inform the bus arbiter that an atomic RMW memory access is being performed
• bus arbiter must simply lock CPU onto bus while the /LOCK signal is asserted
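The interleaving problem can be replayed deterministically in a few lines. The sketch below is a single threaded model, not real hardware behaviour: BusCpu, readCycle and writeCycle are hypothetical names, and the lock uses the convention 0 = free, 1 = taken.

```cpp
// Models one CPU that tests a lock with a SEPARATE read cycle and write cycle.
// BusCpu and its members are hypothetical names used for illustration.
struct BusCpu {
    int seen = 1;                               // value observed by the last read cycle

    void readCycle(const int& lock) { seen = lock; }

    // write cycle: claims the lock only if the earlier read saw it free
    bool writeCycle(int& lock) {
        if (seen != 0) return false;
        lock = 1;
        return true;
    }
};

// cycle by cycle arbitration: both reads are granted before either write
int ownersWhenInterleaved() {
    int lock = 0;
    BusCpu cpu0, cpu1;
    cpu0.readCycle(lock);
    cpu1.readCycle(lock);                       // still sees the lock free
    return cpu0.writeCycle(lock) + cpu1.writeCycle(lock);
}

// /LOCK asserted: each CPU keeps the bus for its whole read-modify-write
int ownersWhenLocked() {
    int lock = 0;
    BusCpu cpu0, cpu1;
    cpu0.readCycle(lock);
    bool a = cpu0.writeCycle(lock);
    cpu1.readCycle(lock);                       // now sees the lock taken
    bool b = cpu1.writeCycle(lock);
    return a + b;
}
```

With interleaved cycles both model CPUs believe they own the lock; when each read-modify-write is kept together only one does.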


IA32/x64 Atomic Instructions
• XCHG [exchange] instruction generates an atomic read-modify-write memory access
• use variant which exchanges [swaps] a register with a memory location

;
; testAndSet lock [NB: 0 = free, 1 = taken]
;
wait    mov   eax, 1      ; eax = 1
        xchg  eax, lock   ; exchange eax and lock in memory
        test  eax, eax    ; test eax [result of xchg]
        jne   wait        ; re-try if unsuccessful
        ret               ; return

free    mov   lock, 0     ; clear lock
        ret

• XCHG asserts /LOCK when executed, hence atomic
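The same testAndSet idea can be written in portable C++. This is a sketch under the assumption that std::atomic<int>::exchange stands in for the xchg instruction (x86 compilers typically emit xchg for it, but that is not guaranteed); tasLock, testAndSet, acquire and release are illustrative names.

```cpp
#include <atomic>

std::atomic<int> tasLock{0};                    // 0 = free, 1 = taken

// atomically writes 1 and returns the previous value;
// a result of 0 means the caller has just obtained the lock
int testAndSet() {
    return tasLock.exchange(1, std::memory_order_acquire);
}

void acquire() {
    while (testAndSet() != 0) {
        // re-try if unsuccessful
    }
}

void release() {
    tasLock.store(0, std::memory_order_release);    // clear lock
}
```

A thread would call acquire(), update the shared data, then call release().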


IA32/x64 Atomic Instructions
• a selection of other IA32/x64 instructions can perform atomic RMW cycles if preceded with a LOCK prefix
• bts, btr, btc, xadd, cmpxchg, cmpxchg8b, inc, dec, not, neg, add, adc, sub, sbb, and, or & xor [only valid if instruction performs a read-modify-write access to memory]
• consider the useful exchange and add instruction xadd

lock                      ; lock prefix
xadd  reg, mem            ; tmp = reg + mem, reg = mem, mem = tmp

• without a LOCK prefix, XADD is executed non atomically
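The effect of a locked xadd can be illustrated with std::atomic fetch_add, which returns the old memory value and stores the sum. This is a sketch only: xaddModel, nextTicket and takeTicket are hypothetical names, and the mapping of fetch_add onto lock xadd is what x86 compilers typically do rather than something the C++ standard promises.

```cpp
#include <atomic>

// Models "lock xadd reg, mem": the return value is the OLD contents of mem
// [what ends up in reg] and mem becomes old + reg.
int xaddModel(std::atomic<int>& mem, int reg) {
    return mem.fetch_add(reg);
}

// one common use: handing out unique, increasing ticket numbers to threads
std::atomic<int> nextTicket{0};

int takeTicket() {
    return xaddModel(nextTicket, 1);
}
```

Because the read and the add happen as one atomic step, two threads calling takeTicket() concurrently can never receive the same number.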


Windows Example
• can mix assembly language and C++, BUT…
• x64 VC++ compiler doesn't support an inline assembler, so for Win32/x64 portability can use the intrinsics defined in intrin.h instead

LONG __cdecl InterlockedExchange(
    _Inout_ LONG volatile *Target,
    _In_    LONG Value
);

• could be used as follows

volatile long lock = 0;                     // declare and initialise lock
while (InterlockedExchange(&lock, 1));      // acquire lock

• NB: with VC++, even though long and int are both 32 bit signed integers, the types are NOT equivalent


Volatile
• lock must be declared as volatile

• description of volatile from Visual Studio 2012 documentation

objects that are declared as volatile are not used in certain optimizations because their values can change at any time. The system always reads the current value of a volatile object when it is requested, even if a previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment.

• to declare object pointed to by a pointer as volatile use:

volatile int *p;              // what p points to is volatile

• to declare the pointer itself volatile use:

int * volatile p;             // contents of p is volatile

• both

volatile int * volatile p;    // p and what p points to are both volatile


Windows Example…
• x64 Release code for a call to InterlockedExchange() obtained using Visual Studio Debugger [VC++ compiler generates in line code rather than a function call]

000000013F3B1330  mov   eax,1
000000013F3B1335  xchg  eax,dword ptr [rsi+8]        ; rsi+8 is the address of lock
000000013F3B1338  test  eax,eax
000000013F3B133A  jne   worker+0D0h (013F3B1330h)    ; retry if unsuccessful


Serializing Instructions…
• consider the following testAndSet code for obtaining a lock

CPU0                          shared data           CPU1

wait  mov   eax, 1            obtain lock           wait  mov   eax, 1
      xchg  eax, lock                                     xchg  eax, lock
      test  eax, eax                                      test  eax, eax
      jne   wait                                          jne   wait
      <update shared data>    update shared data          <update shared data>
      mov   lock, 0           release lock                mov   lock, 0


Serializing Instructions…
• need to consider memory read and write ordering if locks are to work correctly
• CPU must NOT read ahead data in the shared data structure before it has obtained the lock [otherwise the CPU with the lock may not have finished updating the shared data structure and out of date data will be read]
• CPU must not release the lock until ALL its writes to the shared data structure have been completed [otherwise next lock holder could read out of date data]
• LOCKED instructions [e.g. xchg, lock xadd] act implicitly as a memory barrier or fence
• reads/writes cannot pass [be carried out ahead of] locked [serialising] instructions


Serializing Instructions…
• CPUs often have explicit memory barrier or fence instructions to flush the write buffer and to enforce ordering
• IA32/x64 have the following fence instructions

SFENCE   store fence    flush all writes before executing instruction
LFENCE   load fence     don't read ahead until instruction executed
MFENCE   memory fence   flush all writes before executing instruction and don't read ahead until instruction executed

• see section 8.2 on Memory Ordering in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1

• and also Intel® 64 Architecture Memory Ordering White Paper
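The two ordering requirements (writes completed before the release, no reading ahead before the acquire) map onto release and acquire orderings in portable C++. The following is a sketch, assuming std::atomic as a stand-in for explicit fence instructions; payload, ready, producer, consumer and runFenceDemo are illustrative names.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                        // ordinary shared data
std::atomic<bool> ready{false};         // flag published AFTER the data

void producer() {
    payload = 42;                                       // write to shared data
    ready.store(true, std::memory_order_release);       // earlier writes visible before flag
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin: the read of payload below may not be carried out ahead of this load
    }
    return payload;
}

// runs one producer and one consumer thread and returns what the consumer saw
int runFenceDemo() {
    payload = 0;
    ready = false;
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

On IA32/x64 these orderings usually compile to plain loads and stores, since the hardware already keeps writes in program order; on CPUs with weaker ordering the compiler has to insert barrier instructions.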


Serializing Instructions
• from a hardware perspective
• CPU has an internal write buffer which is used to buffer writes to the memory hierarchy [for speed]
• data in write buffer not visible externally until written to memory, in this case to the first level cache


Serializing Instructions…
• why does the previous testAndSet code work on an IA32/x64 CPU?

1) writes are made to memory in program order so that when the lock is cleared and visible [mov lock, 0] ALL previous writes to the shared data structure are also visible

2) lock obtained using a serialising instruction [xchg eax, lock] which prevents read ahead so that data in the shared data structure will not be read until the lock is obtained

3) executing serialising instructions reduces CPU performance as it prevents CPU from reading and writing ahead


Load Locked / Store Conditional Instructions
• alternative approach for performing atomic RMW accesses to memory
• executing a load locked [LL] followed by a store conditional [SC] instruction is used to perform an atomic RMW access to memory

• first used by MIPS CPU [ll/sc]

• also used by Alpha [ldq_l/stq_c], IBM Power PC [lwarx/stwcx] and ARM [ldrex/strex] CPUs


Alpha LL/SC Implementation
• each CPU has a lockFlag [LF] and a lockPhysicalAddressRegister [LPAR] used by the LL and SC instructions
• LDQ_L Ra, va ; load quadword locked

lockFlag = 1
lockPhysicalAddressRegister = physicalAddress(va)
Ra = [va]

• STQ_C Ra, va ; conditionally store quadword

if (lockFlag == 1)        ; check lock flag
    [va] = Ra             ; conditional store if lockFlag is set
Ra = lockFlag             ; used to test if store occurred
lockFlag = 0              ; clear lock flag


Alpha LL/SC Implementation…
• where is the magic?
• if the per CPU lockFlag is still set when an associated STQ_C is executed, the store occurs otherwise NO store takes place [conditional store]
• what clears the lockFlag?
• if any CPU does a store [write] to the physical memory address contained in a lockPhysicalAddressRegister, the associated CPU clears its lockFlag


Alpha LL/SC Hardware Perspective
• consider CPU0 executing a LDQ_L ra, lock instruction [NB: lock is the virtual address of the lock]

• state just after CPU0 has executed LDQ_L ra, lock

• NB: LF and LPAR values


Alpha LL/SC Hardware Perspective…
• if any other CPU writes to the lock variable, CPU0's lockFlag will be cleared
• CPU2 writes to the lock variable resulting in CPU0's lockFlag being cleared
• when CPU0 executes the store conditional associated with the LL, it will not write to memory
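The lockFlag / lockPhysicalAddressRegister behaviour can be replayed with a small software model. This is a simplified single threaded sketch of the rules on the slides, not a description of real Alpha hardware: LLSCModel and all of its members are hypothetical names, and addresses are treated as physical addresses directly.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Software model of per CPU lockFlag [LF] and lockPhysicalAddressRegister [LPAR].
struct LLSCModel {
    struct Cpu {
        bool lockFlag = false;
        std::uint64_t lpar = 0;
    };
    std::vector<Cpu> cpus;
    std::map<std::uint64_t, std::int64_t> memory;   // physical address -> quadword

    explicit LLSCModel(int nCpus) : cpus(nCpus) {}

    // a store to an address held in a CPU's LPAR clears that CPU's lockFlag
    void snoopStore(std::uint64_t pa) {
        for (Cpu& c : cpus)
            if (c.lockFlag && c.lpar == pa)
                c.lockFlag = false;
    }

    // LDQ_L: set lockFlag, remember the address, return the value
    std::int64_t ldq_l(int cpu, std::uint64_t pa) {
        cpus[cpu].lockFlag = true;
        cpus[cpu].lpar = pa;
        return memory[pa];
    }

    // STQ_C: store only if lockFlag is still set; returns 1 if the store occurred
    int stq_c(int cpu, std::uint64_t pa, std::int64_t value) {
        bool ok = cpus[cpu].lockFlag;
        if (ok) {
            snoopStore(pa);                     // other CPUs watching pa lose their flags
            memory[pa] = value;
        }
        cpus[cpu].lockFlag = false;             // flag always cleared afterwards
        return ok ? 1 : 0;
    }

    // ordinary [unconditional] store by any CPU
    void stq(std::uint64_t pa, std::int64_t value) {
        snoopStore(pa);
        memory[pa] = value;
    }
};
```

Replaying the slide scenario: CPU0 calls ldq_l on the lock, another CPU calls stq on the same address, and CPU0's following stq_c returns 0 without writing to memory.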


Using Alpha LL/SC to Perform an Atomic RMW
• if the following sequence of instructions is successfully executed on a given CPU [BEQ XXX doesn't branch back to XXX] …

XXX:  LDQ_L  ra, va       ; read
      <modify>            ; modify
      STQ_C  rb, va       ; conditional write
      BEQ    XXX          ; retry if unsuccessful

• it means that the CPU has performed an atomic RMW access to memory location va
• if the conditional store fails, must retry
• LL/SC implementation means that a write to va is ONLY performed if there's a guarantee of an atomic RMW access to va
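C++ does not expose LL/SC directly, but a compare_exchange_weak retry loop has the same read, modify, conditional write, retry shape. This is a sketch with a hypothetical helper name (atomicUpdate). Note one difference: compare-and-swap succeeds whenever the value still matches, whereas an LL/SC pair fails after any intervening write to the location.

```cpp
#include <atomic>

// Atomically replaces v with modify(v); returns the value that was replaced.
template <typename F>
long atomicUpdate(std::atomic<long>& v, F modify) {
    long old = v.load();                                    // read            [cf. LDQ_L]
    while (!v.compare_exchange_weak(old, modify(old))) {    // conditional write [cf. STQ_C]
        // on failure old has been reloaded with the current value; retry [cf. BEQ XXX]
    }
    return old;
}
```

compare_exchange_weak is allowed to fail spuriously, which is what lets a compiler implement it with a single LL/SC pair on CPUs that provide those instructions.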


Using Alpha LL/SC to Implement a TestAndSet Lock

ACQUIRE  LDQ_L  R1, lock       ; read lock
         BLBS   R1, ACQUIRE    ; retry if already set
         OR     R1, #1, R2     ; r2 = 1
         STQ_C  R2, lock       ; store conditional
         BEQ    R2, ACQUIRE    ; retry if unsuccessful
         MB                    ; memory barrier

         <update shared data structure>

         MB                    ; memory barrier
         STQ    R31, lock      ; clear lock [R31 always 0]

• BLBS: branch if register low bit set
• MB: memory barrier which is equivalent to an IA32/x64 memory fence; see section on serialising instructions [slide 21]
• means that a write to a lock is ONLY performed when the lock is obtained resulting in much less bus traffic when lock contested


Cost of Sharing Data Between Threads
• no problem if threads only make read accesses
• with the MESI cache coherency protocol, writes to shared data will cause copies in the other caches to be invalidated
• program read and write accesses are typically 32 or 64 bits, while the size of a cache line is typically 64 bytes [with latest Intel CPUs]
• means that data in a cache line can be invalidated by a write to another part of the cache line
• known as false sharing [cache line shared rather than the data]
• can be prevented by storing data in its own cache line

• sharing.cpp written to evaluate the cost of sharing
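Storing each variable in its own cache line can be sketched in portable C++ with alignas. This is an illustration rather than the contents of sharing.cpp: the 64 byte line size is an ASSUMPTION fixed at compile time [the course code queries it with CPUID], and PaddedCounter, kLineSz, counters and onDistinctLines are hypothetical names.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSz = 64;     // ASSUMED cache line size in bytes

// each counter is aligned to, and padded out to, a whole cache line
struct alignas(kLineSz) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) % kLineSz == 0,
              "each counter must occupy whole cache lines");

PaddedCounter counters[4];              // e.g. one counter per thread

// true if the two counters fall in different cache lines
bool onDistinctLines(const PaddedCounter* a, const PaddedCounter* b) {
    auto lineA = reinterpret_cast<std::uintptr_t>(a) / kLineSz;
    auto lineB = reinterpret_cast<std::uintptr_t>(b) / kLineSz;
    return lineA != lineB;
}
```

With this layout a write to counters[0] cannot invalidate the line holding counters[1], at the cost of unused padding bytes in every line.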


Cost of Sharing Data Between Threads…

• uses the following data structure

• each long variable stored in its own cache line

• code uses CPUID instruction to find CPU cache line size [see CPUID application Note]

• each thread repeatedly executes InterlockedExchangeAdd() to increment a thread specific or a shared variable for NSECONDS

• 0%, 25%, 50%, 75% and 100% sharing determined by how often the shared variable is incremented relative to the thread specific variable


Cost of Sharing Data Between Threads…
• for 25% sharing, for example, each thread executes

InterlockedExchangeAdd(GINDX(thread), 1);       // thread specific
InterlockedExchangeAdd(GINDX(thread), 1);       // thread specific
InterlockedExchangeAdd(GINDX(thread), 1);       // thread specific
InterlockedExchangeAdd(GINDX(maxThread), 1);    // shared

NB: threads numbered from 0 .. maxThread-1

• use _aligned_malloc to allocate data on a cache line boundary

volatile long *g;                                             // NB: position of volatile
g = (long*) _aligned_malloc((maxThread+1)*lineSz, lineSz);    // shared global variable

• GINDX macro defined as follows

#define GINDX(n) (g + (n)*lineSz/sizeof(long))  // index into g


Code to Create and Run N Threads for NSECONDS…

int _tmain(int argc, _TCHAR* argv[])
{
    …
    for (sharing = 0; sharing <= 100; sharing += 25) {              // sharing range
        for (int nt = 1; nt <= 2*ncpus; nt *= 2) {                  // thread range
            tstart = clock();
            for (int thread = 0; thread < nt; thread++)             // create and start nt threads
                threadH[thread] = CreateThread(NULL, 0, worker, (LPVOID) thread, 0, NULL);
            WaitForMultipleObjects(nt, threadH, true, INFINITE);    // wait for ALL threads to finish
            printResults();                                         // print results
            for (int thread = 0; thread < nt; thread++)             // delete thread handles
                CloseHandle(threadH[thread]);
        }
    }
    …
}


Worker Function

DWORD WINAPI worker(LPVOID thread)
{
    long long ops = 0;                                      // 64 bit local counter
    while (1) {
        for (int i = 0; i < NOPS / 4; i++) {                // NOPS/4 since work comprises...
            // do some work                                 // 4 InterlockedExchange operations
        }
        ops += NOPS;                                        // local to thread
        if (clock() - tstart > NSECONDS*CLOCKS_PER_SEC)     // NSECONDS of work?
            break;
    }
    cnt[(int) thread] = ops;                                // remember in global cnt array
    return 0;
}


Cost of Sharing Data Between Threads…
• NB: cache data retrieved from CPU using CPUID instruction


Cost of Sharing Data Between Threads…

[Figure: "Cost of CPUs Sharing Write Data"; x-axis: # threads (1, 2, 4, 8, 16); y-axis: relative to single thread and 0% sharing (0.00 to 4.00); one series each for 0%, 25%, 50%, 75% and 100% sharing. Data labels recovered from the figure, series attribution not recoverable: 1.00, 1.74, 2.88, 3.41, 3.40; 0.98, 0.87, 0.92, 1.05, 1.12; 1.00, 0.29, 0.28, 0.25, 0.25]


Cost of Sharing Data Between Threads…
• comments on graph…

• replacing InterlockedExchangeAdd(g, 1) with (*g)++


TestAndSet Lock
• declaration of lock

volatile long lock = 0;                     // lock stored in shared memory

• to acquire lock

while (InterlockedExchange(&lock, 1));      // wait for lock [0:free 1:taken]

• to release lock

lock = 0;                                   // clear lock

• if an xchg instruction [InterlockedExchange] is used to obtain a lock, performance is poor when there's contention for the lock

• need to remember how the MESI cache coherency protocol operates if considering IA32/x64 CPUs


TestAndSet Lock…
• ALL waiting CPUs repeatedly execute an xchg instruction trying to get hold of lock
• the memory accesses made by the xchg instruction don't benefit from having a cache since the shared cache lines are continually overwritten [even if the lock is a 1, it is overwritten with a 1] which invalidates the entries in the other caches which results in bus cycles for both the read and write parts of ALL xchg instructions [think MESI]
• ALL the xchg reads and writes will be to memory
• a write update cache coherency protocol would allow the reads to be local cache reads [Firefly]
• the lock is overwritten even if there is NO chance of obtaining the lock
• why is there not an instruction which conditionally writes a 1 if the value read is 0 [e.g. conditional testAndSet]?


TestAndTestAndSet Lock
• designed to take advantage of underlying cache behaviour
• to acquire lock [optimistic version]

while (InterlockedExchange(&lock, 1))       // try for lock
    while (lock == 1)                       // wait until lock free
        _mm_pause();                        // intrinsic see next slide

• to acquire lock [pessimistic version]

do {
    while (lock == 1)                       // wait until lock free
        _mm_pause();                        // intrinsic see next slide
} while (InterlockedExchange(&lock, 1));    // try for lock

• optimistic version assumes lock is going to be free
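A portable version of the pessimistic TestAndTestAndSet lock might look like the sketch below. It assumes std::atomic in place of InterlockedExchange and std::this_thread::yield() as a stand-in for _mm_pause() [the two are not equivalent: yield gives up the CPU, pause is only a hint]; TTASLock and runTTASDemo are hypothetical names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TTASLock {
    std::atomic<int> lock{0};                   // 0: free, 1: taken
public:
    void acquire() {                            // pessimistic version
        do {
            while (lock.load(std::memory_order_relaxed) == 1)   // wait until lock free
                std::this_thread::yield();      // portable stand-in for _mm_pause()
        } while (lock.exchange(1, std::memory_order_acquire) != 0);  // try for lock
    }
    void release() {
        lock.store(0, std::memory_order_release);               // clear lock
    }
};

// nThreads threads each increment a shared counter nIncrements times under the lock
long runTTASDemo(int nThreads, int nIncrements) {
    TTASLock lock;
    long counter = 0;                           // protected by lock
    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; t++)
        threads.emplace_back([&] {
            for (int i = 0; i < nIncrements; i++) {
                lock.acquire();
                counter++;
                lock.release();
            }
        });
    for (auto& th : threads)
        th.join();
    return counter;
}
```

The inner loop only reads the lock, so waiting threads do not write to the shared cache line; the exchange is attempted only after the lock has been seen free.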


IA32 TestAndTestAndSet Lock…
• 7.11.2 PAUSE Instruction

The PAUSE instruction improves the performance of processors supporting Hyper-Threading Technology when executing “spin-wait loops” and other routines where one thread is accessing a shared lock or semaphore in a tight polling loop. When executing a spin-wait loop, the processor can suffer a severe performance penalty when exiting the loop because it detects a possible memory order violation and flushes the core processor’s pipeline. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation and prevent the pipeline flush. In addition, the PAUSE instruction de-pipelines the spin-wait loop to prevent it from consuming execution resources excessively. (See Section 7.11.6.1, “Use the PAUSE Instruction in Spin-Wait Loops,” for more information about using the PAUSE instruction with IA-32 processors supporting Hyper-Threading Technology.)


TestAndTestAndSet Lock…
• the advantage is that the test of the lock [lock == 1] is executed entirely within the cache and the xchg instruction is only executed when the lock is known to be free and there is a chance of acquiring the lock
• the cached lock variable will be invalidated or updated when the lock is released and only then is an attempt made to obtain the lock by executing an xchg instruction
• if the release of the lock invalidates the other shared cache lines then O(n²) bus cycles will be generated [where n is the number of CPUs waiting for the lock]; this figure is quoted in the literature, but is it correct?
• ALL n waiting CPUs continuously read the lock [from their own local cache]; these cache lines will be invalidated when the lock is released; subsequent reads of the lock will appear on the bus, where they will be serialised by a typical round-robin bus arbiter, and each CPU, in turn, will see the lock free


TestAndTestAndSet Locks...
• an individual CPU executes its xchg instruction but then sees the remaining CPUs executing their xchg instructions, which will invalidate its cache line, so a bus cycle has to be performed to read the lock again, i.e. O(n²)
• however won't the bus cycles for the xchg be such that all CPUs will execute them one after another [thanks to the round robin arbiter] so that a CPU's cache line is effectively invalidated only once? i.e. O(n)
• if the release of the lock updates the other caches directly then the generated bus traffic will only be O(n)
• either way there will be enough bus activity to interfere with the process in the critical section as well as the other processes not involved with the lock
• if the lock is held for a long time the impact is unimportant, but for short critical sections the lock will be released before the last spurt of activity has subsided resulting in continued bus saturation
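The O(n²) versus O(n) question can be made concrete with a rough counting model. It is a sketch under assumptions that are not in the slides: the n xchg instructions are serialised, and the CPU that executes the i-th xchg is invalidated by, and re-reads the lock after, each of the n - i xchg instructions that follow it. This is an upper bound, since the winning CPU does not actually re-read.

```latex
% n CPUs are waiting; the release invalidates every cached copy of the lock
\begin{align*}
\text{reads after the release} &= n \\
\text{xchg bus cycles} &= n \\
\text{re-reads forced by later xchgs} &\le \sum_{i=1}^{n} (n - i) = \frac{n(n-1)}{2} \\
\text{total} &\le 2n + \frac{n(n-1)}{2} = O(n^2)
\end{align*}
```

For n = 8 this bound is 16 + 28 = 44 bus cycles. If the arbiter lets the xchg instructions run back to back so that each CPU re-reads only once, the third term drops to about n and the total is about 3n = 24, i.e. O(n).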


TestAndSet Lock with Exponential Back Off
• don't continuously try to acquire lock, delay between attempts

to acquire lock:

    d = 1;                                      // initialise back off delay
    while (InterlockedExchange(&lock, 1)) {     // if unsuccessful…
        delay(d);                               // delay d time units
        d *= 2;                                 // exponential back off
    }

• testAndTestAndSet lock NOT necessary when using a back off scheme
• the longer the CPU has been waiting for the lock, the longer it will have to wait before it attempts to acquire the lock again, so starvation is possible
• supposed to work well in practice
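A portable sketch of the back off loop follows. The assumptions are mine, not the slides': one "time unit" is one call to std::this_thread::yield(), the delay is capped at maxDelay so that it cannot grow without bound, and BackoffLock and backoff_demo are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct BackoffLock {
    std::atomic<int> lock{0};
    static constexpr int maxDelay = 1024;       // assumed cap, not in the slides

    static void delay(int d) {                  // one time unit = one yield [assumption]
        for (int i = 0; i < d; i++)
            std::this_thread::yield();
    }

    void acquire() {
        int d = 1;                                              // initialise back off delay
        while (lock.exchange(1, std::memory_order_acquire)) {   // if unsuccessful...
            delay(d);                                           // delay d time units
            d = std::min(2 * d, maxDelay);                      // exponential back off, capped
        }
    }

    void release() {
        assert(lock.load(std::memory_order_relaxed) == 1);      // must be held
        lock.store(0, std::memory_order_release);
    }
};

// Hypothetical driver: nthreads threads each add iters to a shared counter.
long backoff_demo(int nthreads, int iters) {
    BackoffLock l;
    long counter = 0;                           // protected by l
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; t++)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; i++) {
                l.acquire();
                counter++;
                l.release();
            }
        });
    for (auto &t : ts) t.join();
    return counter;
}
```

The cap limits how long an unlucky thread can be made to wait, but it does not remove the possibility of starvation.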


Ticket Lock with Proportional Back Off

    class TicketLock {
    public:
        volatile long ticket;           // initialise to 0
        volatile long nowServing;       // initialise to 0
    };

    inline void acquire(TicketLock *lock)                           // acquire lock
    {
        int myTicket = InterlockedExchangeAdd(&lock->ticket, 1);    // get ticket [atomic]
        while (myTicket != lock->nowServing)                        // if not our turn…
            delay(myTicket - lock->nowServing);                     // delay relative to position in Q
    }

    inline void release(TicketLock *lock)                           // release lock
    {
        lock->nowServing++;             // give lock to next CPU, NB: not atomic
    }


Ticket Lock with Proportional Back Off…
• think of waiting in a Q in the Andrews St. tourist office, ISS computer help desk, A&E, …
• deterministic
• ONLY 1 atomic instruction executed per lock acquisition
• FAIR, locks granted in order of request which eliminates the possibility of starvation
• back off proportional to position in Q
• if time in critical section is constant, the delay can be calculated such that the subsequent test of lock->nowServing will just succeed
• still polls a common location [lock->nowServing] which will cause some bus traffic with an invalidate protocol
• delay not necessary with a write-update protocol [Firefly]
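The ticket lock can also be sketched with standard atomics. This is an illustration only: fetch_add stands in for InterlockedExchangeAdd, the delay is counted in calls to std::this_thread::yield() [an assumption], and PortableTicketLock and ticket_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct PortableTicketLock {
    std::atomic<long> ticket{0};
    std::atomic<long> nowServing{0};

    void acquire() {
        long myTicket = ticket.fetch_add(1);            // get ticket [the only atomic RMW]
        for (;;) {
            long ahead = myTicket - nowServing.load();  // position in Q
            if (ahead == 0)
                return;                                 // our turn
            assert(ahead > 0);
            for (long i = 0; i < ahead; i++)            // back off proportional to position
                std::this_thread::yield();
        }
    }

    void release() {
        // only the lock holder writes nowServing, so a load followed by a
        // store is enough; no atomic read-modify-write is needed
        nowServing.store(nowServing.load() + 1);
    }
};

// Hypothetical driver: nthreads threads each add iters to a shared counter.
long ticket_demo(int nthreads, int iters) {
    PortableTicketLock l;
    long counter = 0;                                   // protected by l
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; t++)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; i++) {
                l.acquire();
                counter++;
                l.release();
            }
        });
    for (auto &t : ts) t.join();
    return counter;
}
```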


MCS Lock [Mellor-Crummey and Scott]
• lockless queue of waiting threads
• each thread has its own QNode which is linked into a Q of QNodes waiting for lock
• a global variable lock points to tail of Q
• acquire lock by adding a thread’s QNode [qn] to tail of Q and waiting until qn->waiting == 0
• release lock by setting qn->next->waiting = 0 [if qn not at the tail of Q]


MCS Lock…
• before looking at the code for the MCS lock need to discuss
    the Compare and Swap (CAS) instruction
    how to allocate objects on cache line boundaries
    thread local storage


Compare and Swap [CAS]

• pseudo C version of CAS

    atomic long CAS(long *a, long e, long n)    // memory address, expected value, new value
    {
        long r = *a;        // read contents of memory address
        if (r == e)         // compare with expected value and if equal…
            *a = n;         // update memory with new value
        return r;           // success if e returned
    }

• NB: returns expected value if exchange took place

• CAS can be mapped onto the IA32/x64 compare and exchange instruction

    cmpxchg reg, mem    // if (eax == mem)
                        //     ZF = 1, mem = reg
                        // else
                        //     ZF = 0, eax = mem


Compare and Swap…

• make use of following intrinsic defined in intrin.h

    long InterlockedCompareExchange(long volatile *a, long n, long e);

  NB: different parameter order than previous/normal definition of CAS

• for convenience can always define

    #define CAS(a, e, n) InterlockedCompareExchange(a, n, e)
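Outside VC++, the same "returns the old value" CAS can be sketched on top of std::atomic. compare_exchange_strong is the standard call; the wrapper below and the cas_increment helper are hypothetical additions for illustration.

```cpp
#include <atomic>
#include <cassert>

// Returns the value found at *a, which equals e exactly when the swap took place.
inline long CAS(std::atomic<long> *a, long e, long n) {
    a->compare_exchange_strong(e, n);   // on failure e is overwritten with the current value
    return e;                           // on success e is unchanged, i.e. the old value
}

// Hypothetical example: an atomic increment built from a CAS retry loop.
inline long cas_increment(std::atomic<long> *a) {
    long old;
    do {
        old = a->load();                        // read current value
    } while (CAS(a, old, old + 1) != old);      // retry if another thread got in first
    assert(a->load() > old);                    // the counter only ever increases here
    return old + 1;
}
```

The retry loop in cas_increment has the same shape as the list updates later in the slides: read, compute the new value, and repeat if the CAS reports that memory changed in between.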


How to allocate objects aligned on a cache line
• can allocate objects in their own cache line(s) to avoid false sharing
• one straightforward approach is to use a template class to override new and delete

    //
    // derive from ALIGNEDMA for aligned memory allocation
    //
    template <class T>
    class ALIGNEDMA {
    public:
        void* operator new(size_t);     // override new
        void operator delete(void*);    // override delete
    };

• C++ magic


How to allocate objects aligned to a cache line…

    //
    // new
    //
    template <class T>
    void* ALIGNEDMA<T>::operator new(size_t sz)
    {
        sz = (sz + lineSz - 1) / lineSz * lineSz;   // make sz a multiple of lineSz
        return _aligned_malloc(sz, lineSz);         // allocate on a lineSz boundary
    }

    //
    // delete
    //
    template <class T>
    void ALIGNEDMA<T>::operator delete(void *p)
    {
        _aligned_free(p);                           // free object
    }
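The same idea can be sketched without the Microsoft-specific allocator. The assumptions here are a 64 byte cache line and the C++17 functions std::aligned_alloc and std::free in place of _aligned_malloc and _aligned_free [std::aligned_alloc is not provided by VC++]; Padded and is_line_aligned are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

constexpr std::size_t lineSz = 64;      // assumed cache line size

template <class T>
class ALIGNEDMA {
public:
    static void* operator new(std::size_t sz) {
        sz = (sz + lineSz - 1) / lineSz * lineSz;   // make sz a multiple of lineSz
        assert(sz % lineSz == 0);                   // required by std::aligned_alloc
        void *p = std::aligned_alloc(lineSz, sz);   // allocate on a lineSz boundary
        if (!p)
            throw std::bad_alloc();
        return p;
    }
    static void operator delete(void *p) {
        std::free(p);                               // free object
    }
};

struct Padded : ALIGNEDMA<Padded> {     // hypothetical example type
    int value = 0;
};

inline bool is_line_aligned(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % lineSz == 0;
}
```

Rounding sz up means that two Padded objects never share a cache line, which is the point of the exercise: no false sharing between nodes.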


MCS Lock…
• derive QNode from ALIGNEDMA
• each QNode will be allocated its own cache line aligned on a cache line boundary

    class QNode : public ALIGNEDMA<QNode> {
    public:
        volatile int waiting;
        volatile QNode *next;
    };

• when a new QNode is created…

    QNode *qn = new QNode();

• the ALIGNEDMA new function is called to allocate space for the QNode


Thread Local Storage [Tls]

• allocate next available Tls index

    DWORD tlsIndex = TlsAlloc();    // get a Tls index which all threads can use

• set value stored at tlsIndex

    QNode *qn = new QNode();        // at start of worker function
    TlsSetValue(tlsIndex, qn);

• get value stored at tlsIndex

    volatile QNode *qn = (QNode*) TlsGetValue(tlsIndex);

• TlsGetValue used by acquire() and release() to get a pointer to thread’s local QNode
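In standard C++11 the thread_local keyword gives a similar per-thread object without the Win32 Tls calls. This is a sketch of an alternative, not the mechanism used in the slides; the QNode below is a minimal stand-in and myQNode and tls_demo are hypothetical names.

```cpp
#include <cassert>
#include <thread>

struct QNode {                  // minimal stand-in for the QNode class in the slides
    int waiting = 0;
    QNode *next = nullptr;
};

// Each thread gets its own QNode, constructed on that thread's first call.
inline QNode* myQNode() {
    thread_local QNode qn;
    return &qn;
}

// Hypothetical check: a worker thread sees a fresh QNode, not the caller's.
inline bool tls_demo() {
    QNode *mine = myQNode();
    mine->waiting = 1;                      // mark the calling thread's QNode
    bool distinct = false;
    std::thread worker([&] {
        QNode *theirs = myQNode();
        distinct = (theirs != mine) && (theirs->waiting == 0);
    });
    worker.join();
    assert(mine->waiting == 1);             // the worker did not touch our QNode
    mine->waiting = 0;
    return distinct;
}
```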


MCS Lock acquire

    inline void acquire(QNode **lock)
    {
        volatile QNode *qn = (QNode*) TlsGetValue(tlsIndex);    // this thread's QNode
        qn->next = NULL;
        volatile QNode *pred = (QNode*) InterlockedExchangePointer((PVOID*) lock, (PVOID) qn);
        if (pred == NULL)
            return;                 // Q was empty, so have lock
        qn->waiting = 1;
        pred->next = qn;            // link behind predecessor
        while (qn->waiting);        // spin until predecessor passes on lock
    }


MCS Lock…

    inline void release(QNode **lock)
    {
        volatile QNode *qn = (QNode*) TlsGetValue(tlsIndex);
        volatile QNode *succ;
        if (!(succ = qn->next)) {
            if (InterlockedCompareExchangePointer((PVOID*) lock, NULL, (PVOID) qn) == qn)
                return;
            do {
                succ = qn->next;
            } while (!succ);
        }
        succ->waiting = 0;
    }


MCS Lock acquire

• pred = InterlockedExchangePointer(lock, qn) performed atomically (1)
• think about what happens if two or more threads try to acquire lock simultaneously
• if pred is NULL [previous value of lock] then at head of Q so have lock otherwise…
• set qn->waiting = 1 and…
• link thread’s QNode to tail of existing Q by setting pred->next = qn (2)
• wait until qn->waiting == 0


MCS Lock release

• if (qn->next != NULL) set qn->next->waiting = 0 which passes lock to next thread in Q

• if (qn->next == NULL) use InterlockedCompareExchangePointer(lock, NULL, qn) to atomically set lock = NULL if lock == qn and return if successful [there are no more threads waiting for lock] otherwise…

• a call to acquire() by another thread must have added a QNode between qn and lock
• spin on qn->next until it is not NULL and assign it to succ, which then points to next QNode in Q
• set succ->waiting = 0 to pass lock to next thread [no explicit removal of QNodes from Q]
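Putting acquire and release together, the MCS lock can be sketched in portable C++. The assumptions: std::atomic with its default sequentially consistent ordering replaces the Interlocked intrinsics, a thread_local QNode replaces the Tls lookup, the cache line is 64 bytes, and the spin loops yield. MCSLock, myQNode and mcs_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct alignas(64) QNode {                  // 64 byte line size is an assumption
    std::atomic<int> waiting{0};
    std::atomic<QNode*> next{nullptr};
};

inline QNode* myQNode() { thread_local QNode qn; return &qn; }

using MCSLock = std::atomic<QNode*>;        // points to tail of Q, nullptr when free

inline void acquire(MCSLock &lock) {
    QNode *qn = myQNode();
    qn->next.store(nullptr);
    QNode *pred = lock.exchange(qn);        // (1) atomically become tail of Q
    if (pred == nullptr)
        return;                             // Q was empty, so have lock
    qn->waiting.store(1);
    pred->next.store(qn);                   // (2) link behind predecessor
    while (qn->waiting.load())              // spin on our own QNode only
        std::this_thread::yield();
}

inline void release(MCSLock &lock) {
    QNode *qn = myQNode();
    QNode *succ = qn->next.load();
    if (succ == nullptr) {
        QNode *expected = qn;
        if (lock.compare_exchange_strong(expected, nullptr))
            return;                         // no more threads waiting for lock
        // another thread has done (1) but not yet (2): wait for the link
        while ((succ = qn->next.load()) == nullptr)
            std::this_thread::yield();
    }
    succ->waiting.store(0);                 // pass lock to next thread in Q
}

// Hypothetical driver: nthreads threads each add iters to a shared counter.
long mcs_demo(int nthreads, int iters) {
    MCSLock lock{nullptr};
    long counter = 0;                       // protected by lock
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; t++)
        ts.emplace_back([&] {
            for (int i = 0; i < iters; i++) {
                acquire(lock);
                counter++;
                release(lock);
            }
        });
    for (auto &t : ts) t.join();
    assert(lock.load() == nullptr);         // Q is empty once all threads are done
    return counter;
}
```

Each waiter spins on its own QNode, so a release touches one remote cache line rather than a location polled by every waiting CPU.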


Testing Framework
• create a framework to compare the performance of locked and lockless lists
• will use VC++
• source code on CS4021 web site [single source for Win32 and x64]
• implement an ordered list with add(key) and remove(key) operations
• create n threads which pseudo randomly add or remove items from a list
• add and remove operations occur with equal probability
• generate keys pseudo randomly in range 0 .. maxkey-1
• changing key range controls the length of list and also the amount of contention between threads [less contention with longer lists]


Testing Framework…
• vary key range [16, 64, 256, …] and number of threads [1, 2, 4, 8, …]
• limit maximum number of threads to be twice the number of cores
• run each test for NSECONDS [e.g. 10 seconds] and report results
    test set up and configuration and date
    key range
    number of threads
    runtime
    operations per second
    performance relative to a single thread

• make sure tests are run with PC set to the high performance power plan

• results generated on a DELL M4600 precision laptop
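The shape of such a framework might look like the sketch below. It is not the CS4021 source: run_test and harness_demo are hypothetical names, the run time is given in milliseconds so that the example finishes quickly, and a std::set guarded by a std::mutex stands in for the list under test.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <random>
#include <set>
#include <thread>
#include <vector>

// Runs nthreads workers for ms milliseconds. Each worker repeatedly picks a
// pseudo random key in 0 .. maxkey-1 and calls add or remove with equal
// probability. Returns the total number of operations completed.
long run_test(int nthreads, int maxkey, int ms,
              const std::function<int(int)> &add,
              const std::function<int(int)> &remove) {
    std::atomic<bool> stop{false};
    std::atomic<long> total{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; t++)
        ts.emplace_back([&, t] {
            std::mt19937 rng(t + 1);        // per-thread generator, no shared state
            long ops = 0;
            while (!stop.load()) {
                int key = static_cast<int>(rng() % maxkey);
                if (rng() & 1) add(key); else remove(key);
                ops++;
            }
            total += ops;                   // one shared update per thread, at the end
        });
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    stop = true;
    for (auto &t : ts) t.join();
    return total.load();
}

// Hypothetical usage: a std::set guarded by a std::mutex stands in for the list.
long harness_demo() {
    std::mutex m;
    std::set<int> s;
    long ops = run_test(2, 16, 50,
        [&](int k) { std::lock_guard<std::mutex> g(m); return int(s.insert(k).second); },
        [&](int k) { std::lock_guard<std::mutex> g(m); return int(s.erase(k)); });
    assert(s.size() <= 16);                 // keys are limited to 0 .. maxkey-1
    return ops;
}
```

Dividing the returned count by the run time gives operations per second; dividing that by the single thread figure gives the relative performance reported in the results.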


C++ Node and List class definitions

• develop test framework for testing the performance of a list protected by different kinds of locks [CriticalSection, testAndSet, testAndTestAndSet,…]

    class Node : public ALIGNEDMA<Node> {       // derive from ALIGNEDMA
    public:
        int key;                                // key
        Node *next;                             // points to next node in list
    };

    class List : public ALIGNEDMA<List> {       // derive from ALIGNEDMA
    private:
        Node *head;                             // head of list
        DECLARE();                              // macro to declare CriticalSection, testAndSet lock, …
    public:
        List();                                 // constructor
        ~List();                                // destructor
        int add(int key);                       // return 1 if successful
        int remove(int key);                    // return 1 if successful
    };


Updating List

• List constructor

    List::List() {                  // constructor
        head = new Node(0, NULL);   // sentinel
        INIT();                     // macro to initialise lock
    }

• ONLY ONE thread can update list at a time, protect by acquiring lock

    int List::add(int key) {
        ACQUIRE();                  // macro to acquire CriticalSection, testAndSet lock, …
        …                           // update protected by lock
        RELEASE();                  // macro to release CriticalSection, testAndSet lock, …
    }
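One possible shape for the locked list is sketched below. The slides elide the update code, so the bodies of add and remove here are my own illustration, and a std::mutex with std::lock_guard stands in for the ACQUIRE()/RELEASE() macros. LockedList and LNode are hypothetical names.

```cpp
#include <cassert>
#include <mutex>

struct LNode {
    int key;
    LNode *next;
};

class LockedList {
    LNode *head;                // sentinel node, its key is never examined
    std::mutex lock;            // stand-in for the ACQUIRE()/RELEASE() macros
public:
    LockedList() : head(new LNode{0, nullptr}) {}
    ~LockedList() {
        while (head) { LNode *n = head; head = head->next; delete n; }
    }

    int add(int key) {          // return 1 if successful
        std::lock_guard<std::mutex> g(lock);            // ONLY ONE thread updates at a time
        LNode *p = head;
        while (p->next && p->next->key < key)           // search for insertion point
            p = p->next;
        if (p->next && p->next->key == key)
            return 0;                                   // key already in list
        p->next = new LNode{key, p->next};
        assert(p->next->key == key);
        return 1;
    }

    int remove(int key) {       // return 1 if successful
        std::lock_guard<std::mutex> g(lock);
        LNode *p = head;
        while (p->next && p->next->key < key)           // search for node
            p = p->next;
        if (!p->next || p->next->key != key)
            return 0;                                   // key not in list
        LNode *n = p->next;
        p->next = n->next;                              // unlink, then free
        delete n;
        return 1;
    }
};
```

Because the lock is held for the whole search, longer lists mean longer critical sections, which is one reason the key range matters in the results that follow.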


testAndSet Results


testAndSet Results…

[Figure: list protected by a testAndSet lock. x-axis: # threads (1, 2, 4, 8, 16); y-axis: MOps per second (0 to 18,000); series: key=16, key=64, key=256, key=1024, key=4096, key=16384]


Spreadsheet Model of testAndSet Lock
• work in progress, will put on CS4021 web site

[Figure: x-axis: # threads (1, 2, 4, 8, 16); y-axis: Ops relative to single thread key=16 (0.00 to 1.20); series: key=16, key=64, key=256, key=1024, key=4096, key=16384]


testAndSet Results…
• do the results make sense? and why are they so poor?
    one thread will be updating list while all others will be trying to obtain the lock
    each attempt to acquire the lock requires the execution of an xchg instruction
    each exchange instruction not only reads memory but also writes a 1 to the lock [even if it's already a 1] invalidating copies of the lock in other caches [MESI protocol]
    this greatly increases the bus traffic [reads and writing of the lock will be to/from memory] which significantly reduces the speed of the thread that has the lock
    if thread pre-empted holding lock, it will obstruct other threads from making progress [this effect is probably not too significant]
    significantly reduced performance due to increased bus traffic from (1) continuously executing the xchg instruction and (2) sharing modified list nodes


Ticket Lock Results

NB: unusual results when threads = 16


Ticket Lock Results…
• unusual results when threads = 16
• otherwise better performance than testAndSet lock

[Figure: ticket lock. x-axis: # threads (1, 2, 4, 8, 16); y-axis: relative to single thread maxKey=16 (0.00 to 1.20); series: key=16, key=64, key=256, key=1024, key=4096, key=16384]


Ticket Lock Results
• unusual results when threads > NCPUS [threads = 16, NCPUS = 8]
• assume the following code for acquire

    inline void acquire(TicketLock *lock)
    {
        int myTicket = InterlockedExchangeAdd(&lock->ticket, 1);
        while (myTicket != lock->nowServing)
            _mm_pause();
    }

• why? what's happening?


Ticket Lock Results…

• idealised diagram of what is happening
• to simplify diagram assume 4 cores and 8 threads
• threads run for an OS time quantum
• need to wait for quantum to end before tickets 4, 8, … start to run
• hence 4 tickets/updates per OS time quantum
• what is the OS time quantum?


Ticket Lock Results…
• exchangeTicketRate.cpp
• simply count how many ticket lock acquire and release operations can be performed per second

[Figure: Ticket Exchange Rate. x-axis: # threads; y-axis: Millions of ticket exchanges per second (0 to 45); NB: 8 cores]


Ticket Lock Results…
• graph in terms of how long it takes to exchange ticket lock between threads
• estimate of OS time quantum is 8 x 0.01ms = 0.08ms, i.e. roughly 0.1ms
• seems too fast [need to find another way to get this value]

[Figure: Time to exchange ticket between two threads. x-axis: # threads (1 to 32); y-axis: ms (0.00 to 0.09); legend: ticket exchanges per sec; NB: 8 cores]


Lockless List Implementation
• objective is to implement an ordered list where ops/s increases with number of threads

• need to consider calls to new and delete which are called inside add and remove

• new and delete need to be lockless otherwise they will become the bottleneck

• memory management is critical

• same argument for rand()

• quite a challenge ahead


Need to know meaning of the following terms
    deadlock
    livelock
    convoying
    priority inversion
    obstruction free
    lock free
    wait free
    linearisation point


Lockless List Implementation using CAS
• use CAS to add nodes 15 and 35
• search for insertion point and execute a CAS with correct parameters

    CAS(&a->next, b, c)
    CAS(&d->next, e, f)

• disjoint-access parallelism


Lockless List Implementation using CAS

• if 2 threads try to add nodes at the same position

    CAS(&a->next, b, c)     // assume this CAS executes first and succeeds…
    CAS(&a->next, b, d)     // consequently this CAS will fail

• first CAS executed succeeds, second fails as a->next != b
• on failure need to RETRY operation
• search AGAIN for insertion point and, if found, re-execute CAS [costly if list long]
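The search, CAS and retry pattern can be sketched for the add-only case. Because this sketch never removes or frees nodes while threads are running, the removal and ABA problems discussed on the following slides do not arise in it; it is a deliberately restricted illustration, and InsertOnlyList, INode and insert_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct INode {
    int key;
    std::atomic<INode*> next;
    INode(int k, INode *n) : key(k), next(n) {}
};

class InsertOnlyList {
    INode head{0, nullptr};                     // sentinel, key never examined
public:
    ~InsertOnlyList() {                         // single threaded clean up
        INode *p = head.next.load();
        while (p) { INode *n = p->next.load(); delete p; p = n; }
    }

    bool add(int key) {
        INode *node = new INode(key, nullptr);
        for (;;) {
            INode *left = &head;                // search for insertion point
            INode *right = left->next.load();
            while (right && right->key < key) {
                left = right;
                right = right->next.load();
            }
            if (right && right->key == key) {   // key already in list
                delete node;
                return false;
            }
            node->next.store(right);
            if (left->next.compare_exchange_strong(right, node))
                return true;                    // CAS(&left->next, right, node) succeeded
            // CAS failed: left->next changed, so search AGAIN and retry
        }
    }

    int size_if_sorted() {                      // node count, or -1 if keys out of order
        int n = 0;
        for (INode *p = head.next.load(); p; p = p->next.load(), n++) {
            INode *q = p->next.load();
            if (q && q->key <= p->key) return -1;
        }
        return n;
    }
};

// Hypothetical driver: every thread tries to add every key 0 .. nkeys-1.
int insert_demo(int nthreads, int nkeys) {
    InsertOnlyList list;
    std::atomic<int> added{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; t++)
        ts.emplace_back([&] {
            for (int k = 0; k < nkeys; k++)
                if (list.add(k)) added++;
        });
    for (auto &t : ts) t.join();
    assert(added.load() == nkeys);              // each key was added exactly once
    return list.size_if_sorted();
}
```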


What can go wrong with an add?
• imagine insertion point found, BUT before CAS(&a->next, b, c) is executed, thread is suspended
• another thread then removes b from list and frees the memory block used by b [free(b)]


What can go wrong with an add?
• another thread adds another key [12] at same position in list using the SAME memory block which just happens to be returned by the memory allocator [malloc()]
• if suspended thread now resumes and executes its

    CAS(&a->next, b, c)

• the CAS will succeed as a->next still equals b, BUT node NOT inserted at correct position
• known as the ABA problem
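The interleaving can be replayed deterministically in a single thread. In this sketch the keys 10, 20 and 15 are assumed for a, b and c [only the key 12 is given in the slides], and reusing the object b directly stands in for the allocator handing back the same memory block. ANode and aba_demo are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

struct ANode {
    int key;
    std::atomic<ANode*> next{nullptr};
    explicit ANode(int k) : key(k) {}
};

// Replays the ABA interleaving. Returns true if the stale CAS succeeds and
// leaves the list out of order.
inline bool aba_demo() {
    ANode a(10), b(20), c(15);          // list: a(10) -> b(20), c(15) is to be added
    a.next = &b;

    // thread 1: finds the insertion point between a and b, then is suspended
    ANode *expected = &b;
    c.next = &b;

    // thread 2: removes b from the list and frees its memory block
    a.next = nullptr;

    // thread 3: adds key 12; the allocator returns the SAME block b used
    b.key = 12;
    b.next = nullptr;
    a.next = &b;                        // list: a(10) -> b(12)

    // thread 1 resumes: a.next holds the address it expects, so the CAS succeeds
    bool ok = a.next.compare_exchange_strong(expected, &c);
    assert(ok);

    // list is now a(10) -> c(15) -> b(12), which is NOT ordered
    return a.next.load() == &c && c.next.load() == &b && c.key > b.key;
}
```

The CAS compares addresses only, so it cannot tell that the node at that address is no longer the node the suspended thread examined.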


Using CAS to remove nodes
• search for node and then execute CAS with correct parameters
• consider 2 threads removing non-adjacent nodes [disjoint-access parallelism]

    CAS(&a->next, b, c)     // both will succeed
    CAS(&c->next, d, 0)     // both will succeed


Using CAS to remove nodes
• if two threads try to remove the same node

    CAS(&a->next, b, c)
    CAS(&a->next, b, c)

• first CAS executed succeeds

• second CAS executed fails as a->next no longer equals b • retry on failure, which means searching AGAIN for node [NB. may now not be found!]


What can go wrong with remove?

• imagine adding a node and removing a node concurrently

    CAS(&a->next, b, c);    // delete 20
    CAS(&b->next, c, d);    // insert 25

• NOT what was intended!


What can go wrong with remove?
• consider deleting adjacent nodes

    CAS(&a->next, b, c)     // delete 20
    CAS(&b->next, c, d)     // delete 30

• AGAIN NOT what was intended


A Pragmatic Implementation of Non-Blocking Linked Lists, Tim Harris [2001]
• two step removal [consider remove(20)]
• node atomically marked [logically deleted] before updating pointer using CAS
• marked node indicated by an odd address in next field [possible as nodes normally aligned on 4 byte boundaries]

    is_marked_reference(r)      // returns 1 if marked
    get_marked_reference(r)     // convert to marked reference
    get_unmarked_reference(r)   // convert to unmarked reference

• tests, sets and clears LSB of address [which is stored in next field]
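The three helpers might be written as follows. The function names come from the slides, but the bodies are my own sketch; MNode is a hypothetical node type, and the static_assert records the alignment assumption that makes the LSB free for use as a mark.

```cpp
#include <cassert>
#include <cstdint>

struct MNode {
    int key;
    MNode *next;
};

// nodes must be at least 2 byte aligned so that the LSB of an address is always 0
static_assert(alignof(MNode) >= 2, "LSB of a node address must be free");

inline bool is_marked_reference(MNode *r) {         // tests LSB
    return (reinterpret_cast<std::uintptr_t>(r) & 1) != 0;
}

inline MNode* get_marked_reference(MNode *r) {      // sets LSB
    return reinterpret_cast<MNode*>(reinterpret_cast<std::uintptr_t>(r) | 1);
}

inline MNode* get_unmarked_reference(MNode *r) {    // clears LSB
    return reinterpret_cast<MNode*>(reinterpret_cast<std::uintptr_t>(r) & ~std::uintptr_t(1));
}
```

A marked reference must be passed through get_unmarked_reference before it is dereferenced, since the odd address does not point at a node.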


A Pragmatic Implementation of Non-Blocking Linked Lists…
• to atomically mark node [logically delete]

    CAS(&b->next, c, get_marked_reference(c));

• then use CAS to update pointer

    CAS(&a->next, b, c)

• can only update an unmarked pointer
• an intelligent find() removes ALL marked nodes to the immediate left of insertion point or node to be deleted


A Pragmatic Implementation of Non-Blocking Linked Lists…
• examine code taken directly from paper, but note that it…

"is intended merely as pseudo-code and does not reflect an optimised (or even necessarily correct) implementation"


List and Node Class Definitions

    class List<KeyType> {
        Node<KeyType> *head;
        Node<KeyType> *tail;
        List() {
            head = new Node<KeyType>();
            tail = new Node<KeyType>();
            head.next = tail;
        }
    }

    class Node<KeyType> {
        KeyType key;
        Node *next;
        Node (KeyType key) {
            this.key = key;
        }
    }

• how many mistakes can you spot in the code snippets above? will need to fix errors in order to get the code working


add [insert]

    public boolean List::insert(KeyType key) {
        Node *new_node = new Node(key);             // allocate new node to insert
        Node *right_node, *left_node;
        do {
            // search returns pointers to the unmarked nodes to the left and right of the insertion point
            right_node = search(key, &left_node);
            if ((right_node != tail) && (right_node.key == key)) // T1
                return false;                       // return false if key already in list
            new_node.next = right_node;
            // try to insert node by using CAS to update pointer
            if (CAS(&left_node.next, right_node, new_node)) // C2
                return true;
        } while (true); // B3: keep trying if the CAS was unsuccessful
    }

add [insert]…

• insertion is reasonably straightforward, but makes use of an intelligent find function

• find should return adjacent left and right pointers

• CAS(&left.next, right, new) will only succeed if there are no nodes [marked or unmarked] between left and right and if left is also unmarked

• of course, another thread could have inserted a node between left and right before the CAS is executed, in which case the CAS fails and the insert is retried

• a node cannot be inserted by linking to a marked [logically deleted] node thus avoiding one of the problems mentioned in the previous slides
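The effect of the insertion CAS can be sketched with std::atomic. The names try_link and demo_stale_insert_fails are hypothetical helpers, and the demo runs on one thread, replaying a stale view of the list rather than racing two real threads.

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int key;
    std::atomic<Node*> next;
    explicit Node(int k) : key(k), next(nullptr) {}
};

// Try to link new_node between left and right.
// Succeeds only if left->next still holds exactly the unmarked pointer to right.
inline bool try_link(Node* left, Node* right, Node* new_node) {
    new_node->next.store(right);
    Node* expected = right;
    return left->next.compare_exchange_strong(expected, new_node);
}

// Two inserts computed from the same (left, right) pair: only the first can win.
inline bool demo_stale_insert_fails() {
    Node a(10), c(30), b(20), d(25);
    a.next.store(&c);                       // a -> c
    bool first = try_link(&a, &c, &b);      // a -> b -> c
    bool second = try_link(&a, &c, &d);     // stale view: a.next is now b, not c
    return first && !second && a.next.load() == &b && b.next.load() == &c;
}
```

If left had been marked in the meantime, left->next would hold a marked pointer, which also differs from right, so the same CAS would fail.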

remove [delete]

    public boolean List::delete (KeyType search_key) {
        Node *right_node, *right_node_next, *left_node;
        do {
            // search returns pointer of unmarked node to delete [if not present then unmarked
            // node with next higher key] and address of unmarked node to its left
            right_node = search(search_key, &left_node);
            if (right_node == tail || right_node.key != search_key) // T1
                return false;                       // return if key not in list
            right_node_next = right_node.next;
            // try to mark unmarked node; once marked, node is logically deleted
            if (!is_marked_reference(right_node_next))
                if (CAS(&right_node.next, right_node_next, get_marked_reference(right_node_next)))
                    break;
        } while (true); // B4: keep trying
        // try to remove node from list by using CAS to update pointer
        if (!CAS(&left_node.next, right_node, right_node_next)) // C4
            // if CAS fails, use search to remove marked nodes from list
            // [can remove a number of adjacent nodes]
            right_node = search(right_node.key, &left_node);
        return true;
    }

remove [delete]…

• assume initial search has returned left and right and that the right node has been marked [logically deleted]

• imagine that before CAS(&left->next, right, right->next) is executed to remove the node from the list, another thread inserts a node between left and right

remove [delete]…

• CAS to remove node will fail

• since node is logically deleted there is no point in calling delete again…

• BUT calling search again will remove any marked node(s) immediately before key

• NOT calling search would simply mean that the marked node(s) would remain in the list until another node is inserted after 20 [in this example state]

• how could the list get into the following state?

find [search]

    private Node *List::search (KeyType search_key, Node **left_node) {
        Node *left_node_next, *right_node;
    search_again:
        do {
            Node *t = head;
            Node *t_next = head.next;
            // find left_node and right_node, ignore any marked nodes
            do {
                if (!is_marked_reference(t_next)) {
                    *left_node = t;
                    left_node_next = t_next;
                }
                t = get_unmarked_reference(t_next);
                if (t == tail)
                    break;
                t_next = t.next;
            } while (is_marked_reference(t_next) || (t.key < search_key)); // T1
            right_node = t;
            // check left_node and right_node are adjacent
            if (left_node_next == right_node)
                if ((right_node != tail) && is_marked_reference(right_node.next))
                    goto search_again;              // G1: optimisation
                else
                    return right_node;              // R1
            // remove one or more marked nodes
            if (CAS (&(left_node.next), left_node_next, right_node)) // C1
                if ((right_node != tail) && is_marked_reference(right_node.next))
                    goto search_again;              // G2: optimisation
                else
                    return right_node;              // R2
        } while (true); // B2
    }

find [search]

• step 1: iterates along list to find the first unmarked node >= key; this is the right node; the left node refers to the previous unmarked node found

• step 2: if the left node is the immediate predecessor of the right node, the search returns [returns with no marked nodes between left and right]

• step 3: use CAS to remove marked node(s) between the left and right nodes; on failure the search is retried

• the optimisation checks if the right node has become marked [logically deleted] and performs the search again rather than returning and then failing in add or remove
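The three steps, together with insert and remove, can be put into a compilable C++ sketch along the following lines. It is a reading of the pseudo-code under stated assumptions, not the paper's implementation: keys are ints, std::atomic with its default sequentially consistent ordering stands in for CAS, an unpublished node is freed when insert finds a duplicate, and unlinked nodes are never reclaimed.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct Node {
    int key;
    std::atomic<Node*> next;
    explicit Node(int k = 0) : key(k), next(nullptr) {}
};

static bool is_marked(Node* p) { return (reinterpret_cast<std::uintptr_t>(p) & 1u) != 0; }
static Node* marked(Node* p) { return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(p) | 1u); }
static Node* unmarked(Node* p) { return reinterpret_cast<Node*>(reinterpret_cast<std::uintptr_t>(p) & ~std::uintptr_t(1)); }

class List {
    Node* head = new Node();
    Node* tail = new Node();

    // Returns the first unmarked node with key >= search_key (or tail) and sets
    // *left to its unmarked predecessor, unlinking any marked nodes in between.
    Node* search(int search_key, Node** left) {
        for (;;) {
            Node* left_next = nullptr;
            Node* t = head;
            Node* t_next = head->next.load();
            do {                                            // step 1: find left and right
                if (!is_marked(t_next)) { *left = t; left_next = t_next; }
                t = unmarked(t_next);
                if (t == tail) break;
                t_next = t->next.load();
            } while (is_marked(t_next) || t->key < search_key);
            Node* right = t;
            if (left_next != right &&                       // step 3: not adjacent, unlink
                !(*left)->next.compare_exchange_strong(left_next, right))
                continue;                                   // CAS failed: search again
            if (right != tail && is_marked(right->next.load()))
                continue;                                   // optimisation: right got marked
            return right;                                   // step 2: adjacent and unmarked
        }
    }

public:
    List() { head->next.store(tail); }

    // Single-threaded teardown; nodes already unlinked by remove() are leaked.
    ~List() {
        for (Node* p = head; p != nullptr;) {
            Node* nx = unmarked(p->next.load());
            delete p;
            p = nx;
        }
    }

    bool insert(int key) {
        Node* n = new Node(key);
        for (;;) {
            Node* left;
            Node* right = search(key, &left);
            if (right != tail && right->key == key) { delete n; return false; }
            n->next.store(right);
            if (left->next.compare_exchange_strong(right, n)) return true;
        }
    }

    bool remove(int key) {
        Node *left, *right, *right_next;
        for (;;) {
            right = search(key, &left);
            if (right == tail || right->key != key) return false;
            right_next = right->next.load();
            if (!is_marked(right_next) &&                   // mark: logical delete
                right->next.compare_exchange_strong(right_next, marked(right_next)))
                break;
        }
        Node* expected = right;
        if (!left->next.compare_exchange_strong(expected, right_next))
            search(key, &left);                             // let search unlink it
        return true;
    }

    bool contains(int key) {
        Node* left;
        Node* right = search(key, &left);
        return right != tail && right->key == key;
    }
};
```

The contains method is an addition for testing; it simply reuses search, so it also helps to unlink marked nodes.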

A Pragmatic Implementation of Non-Blocking Linked Lists…

• what is NOT said!

• insert allocates a new node even if insertion fails

• NO code for freeing or re-using nodes

• nodes never become unmarked

• avoids ABA problem by not re-using nodes which also…

• avoids problem of threads traversing list using pointers to freed nodes

• assumes nodes are garbage collected in a safe way [not an easy problem to solve]

• ONLY a partial solution without memory management [perhaps the harder problem]

Memory Management

• use garbage collection [Java, but not yet in C++]

• reference counting

• deferred freeing of nodes [see end of section 6 in Harris paper]

    each node contains an additional link field so that it can be added to a per thread retireQ and reuseQ

    each thread takes a copy of a global timer [e.g. clock()] before starting an add or remove operation and saves it in a global startOp array [each thread's startOp is stored in its own cache line for speed]

    add and remove operations add any freed nodes to the retireQ and set the key field to the startOp of the thread

    add and remove operations, before they exit, can traverse the retireQ and transfer nodes to a reuseQ if their startOp is less than the minimum startOp of any thread, since no thread can still have a reference to the node

Memory Management

• add retired nodes to end of retireQ

• the minimum thread startOp time is 129

• can transfer all nodes in retireQ with startOp < 129 to reuseQ [first three nodes]

• allocate nodes from per thread reuseQ and only call new/malloc if empty

• why is a link field needed? why not use next?
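The retireQ to reuseQ transfer can be sketched as a single-threaded model. Everything here is an illustrative assumption: std::deque replaces the intrusive link field, the startOp array is a plain vector rather than cache-line-separated shared memory, and the stamps in the usage below are invented (only the minimum startOp of 129 comes from the slide).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Node {
    std::uint64_t key;   // overwritten with the retiring thread's startOp
};

// Per thread queues (a deque stands in for chaining through a link field).
struct ThreadQueues {
    std::deque<Node*> retireQ;   // oldest retired node at the front
    std::deque<Node*> reuseQ;
};

// Stamp the node with the retiring thread's startOp and append it to retireQ.
inline void retire(ThreadQueues& q, Node* n, std::uint64_t my_start_op) {
    n->key = my_start_op;
    q.retireQ.push_back(n);
}

// Move nodes whose stamp is below the minimum startOp of all threads to reuseQ.
// Stamps within one thread's retireQ never decrease, so scanning from the front suffices.
inline std::size_t transfer(ThreadQueues& q, const std::vector<std::uint64_t>& start_op) {
    const std::uint64_t min_start = *std::min_element(start_op.begin(), start_op.end());
    std::size_t moved = 0;
    while (!q.retireQ.empty() && q.retireQ.front()->key < min_start) {
        q.reuseQ.push_back(q.retireQ.front());
        q.retireQ.pop_front();
        ++moved;
    }
    return moved;
}
```

With hypothetical stamps 100, 110, 120, 130 and 140 in the retireQ and thread startOp values of 135, 129 and 150, transfer moves the first three nodes and leaves the other two retired.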

Hazard Pointers

• Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects Maged Michael (2004) IEEE Transactions on Parallel and Distributed Systems 15 (8): 491–504

• in terms of an ordered linked list, there are two active pointers as the list is traversed during a find operation [number will be different for other algorithms]

• these active pointers are called hazard pointers [used to save cur and next p499 Fig 9]

• idea is not to reuse/delete/free nodes if they have hazard pointers pointing to them

Hazard Pointers…

• maintain a global array of per thread hazard pointers [each thread saving its hazard pointers in its own cache line for speed]

• use a per thread retireQ and reuseQ as per previous example

• retire node by adding to retireQ and when length >= 2*nthreads*HAZARDSPERTHREAD

    make a local copy of all hazard pointers in global array [allocate a local array]

    sort hazard pointers in local array [optional]

    for each node on retireQ, if node address doesn’t match any hazard pointer in local array transfer to reuseQ

• again need to allocate nodes from per thread reuseQ and only call new/malloc if empty
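The scan step might look like the following sketch. The function names scan and should_scan are hypothetical, and the global hazard pointer array is modelled as a plain vector snapshot; a real implementation would read atomic per thread slots.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct Node { int key; };

// Threshold from the slide: scan once retireQ holds 2 * nthreads * HAZARDSPERTHREAD nodes.
inline bool should_scan(std::size_t retired, std::size_t nthreads, std::size_t hazards_per_thread) {
    return retired >= 2 * nthreads * hazards_per_thread;
}

// Moves every retired node that no hazard pointer refers to from retireQ to reuseQ.
// Returns the number of nodes moved.
inline std::size_t scan(const std::vector<const Node*>& global_hazards,
                        std::vector<Node*>& retireQ, std::vector<Node*>& reuseQ) {
    std::vector<const Node*> local(global_hazards);                    // local copy
    std::sort(local.begin(), local.end(), std::less<const Node*>());   // enables binary search
    std::vector<Node*> still_retired;
    std::size_t moved = 0;
    for (Node* n : retireQ) {
        const Node* addr = n;
        if (std::binary_search(local.begin(), local.end(), addr, std::less<const Node*>())) {
            still_retired.push_back(n);                                // still protected
        } else {
            reuseQ.push_back(n);                                       // safe to reuse
            ++moved;
        }
    }
    retireQ.swap(still_retired);
    return moved;
}
```

Sorting the local copy is what the slide marks as optional: it turns each lookup into a binary search instead of a linear pass over all hazard pointers.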

Some Results [NO memory management]

• NB: number of nodes allocated [nmalloc]

Some Results…

[Figure: lockless list [no memory management]; MOps per second against # threads (1, 2, 4, 8, 16), one curve for each of key = 16, 64, 256, 1024, 4096 and 16384]

Some Results…

[Figure: lockless list [no memory management]; speed-up relative to a single thread against # threads (1, 2, 4, 8, 16), one curve for each of key = 16, 64, 256, 1024, 4096 and 16384]

Transactional Memory

• locks hard to manage effectively

    pessimistic – inhibits parallelism

    priority inversion – lower priority thread pre-empted while holding a lock needed by a higher priority thread

    convoying – thread holding lock is descheduled and other threads queue up unable to progress

    deadlock – can be difficult to avoid in complex systems

• atomic primitives such as CAS operate on one word at a time resulting in complex algorithms

• MCAS [multiple compare and swap] some help

• no hardware implementation

• list of addresses, expected values and new values

• can be implemented using CAS
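What MCAS must appear to do in one atomic step can be written down as a sequential specification. The sketch below is deliberately NOT a lock-free implementation and is not thread-safe; McasEntry and mcas are hypothetical names, and a CAS-based implementation would have to make the same all-or-nothing effect appear atomic to other threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One entry per word: where it is, what it should contain, what to write.
struct McasEntry {
    std::intptr_t* addr;
    std::intptr_t expected;
    std::intptr_t desired;
};

// Sequential specification only: update all n words if every word holds its
// expected value, otherwise change nothing.
inline bool mcas(McasEntry* e, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        if (*e[i].addr != e[i].expected) return false;   // any mismatch: no update at all
    for (std::size_t i = 0; i < n; ++i)
        *e[i].addr = e[i].desired;
    return true;
}
```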

Transaction

• a sequence of steps executed by a single thread

• transactions must be serializable meaning that they appear to execute sequentially in a one-at-a-time order

• serializability is a kind of coarse-grained version of linearizability [atomic method calls on a given object appear to take effect instantaneously]

• correctly implemented transactions do not deadlock or livelock

• composing atomic method calls is straightforward from a programming perspective [the implementation, however, is far from straightforward]

    atomic {
        x = q0.remove();    // atomic remove
        q1.add(x);          // atomic add
    }

• atomic removal from one list and addition to another

Transaction…

• transactions are executed speculatively; as a transaction executes it makes tentative changes to objects [memory locations]

• if it completes without encountering a conflict it then commits [the tentative changes become permanent] OR

• it aborts [the tentative changes are discarded]

• the tentative changes are NOT visible to other transactions until the transaction commits

• each transaction maintains a read set and a write set

• each transactional load instruction adds the memory address to the read set and each transactional store adds the memory address and value to the write set
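The bookkeeping described above can be sketched as a toy, single-threaded model. The Memory alias, the Transaction struct and the integer addresses are assumptions for illustration; conflict detection is left out entirely, so the sketch only shows that tentative writes stay private until commit and vanish on abort.

```cpp
#include <cassert>
#include <map>
#include <set>

using Memory = std::map<int, int>;        // address -> committed value

struct Transaction {
    Memory& mem;
    std::set<int> read_set;               // addresses loaded so far
    std::map<int, int> write_set;         // address -> tentative value

    explicit Transaction(Memory& m) : mem(m) {}

    // Transactional load: record the address, return own tentative value if any.
    int load(int addr) {
        read_set.insert(addr);
        auto w = write_set.find(addr);
        return w != write_set.end() ? w->second : mem[addr];
    }

    // Transactional store: buffered in the write set, not yet visible in mem.
    void store(int addr, int value) { write_set[addr] = value; }

    // Commit: tentative changes become permanent.
    void commit() {
        for (const auto& w : write_set) mem[w.first] = w.second;
        write_set.clear();
        read_set.clear();
    }

    // Abort: tentative changes are discarded.
    void abort() {
        write_set.clear();
        read_set.clear();
    }
};
```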

Transaction…

• two transactions conflict if one writes to a location accessed [read or written] by another

• conflict detection can be eager or lazy

• eager detection checks every read or write to see if there is a conflicting operation in another transaction

    requires all read and write sets to be visible to other transactions

• lazy detection checks when a transaction commits
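The conflict definition reduces to a set intersection test. In the sketch below TxSets, writes_touch and conflict are hypothetical names and addresses are plain ints; an eager scheme would evaluate such a test on every transactional access, a lazy scheme only at commit.

```cpp
#include <cassert>
#include <set>

struct TxSets {
    std::set<int> reads;    // addresses in the read set
    std::set<int> writes;   // addresses in the write set
};

// True if w writes a location that other has read or written.
inline bool writes_touch(const TxSets& w, const TxSets& other) {
    for (int addr : w.writes)
        if (other.reads.count(addr) != 0 || other.writes.count(addr) != 0) return true;
    return false;
}

// Two transactions conflict if either one writes a location accessed by the other.
inline bool conflict(const TxSets& a, const TxSets& b) {
    return writes_touch(a, b) || writes_touch(b, a);
}
```

Two transactions that only read the same location never conflict under this definition.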

Conflict Detection Example

• in both sequences, eager detection would detect a conflict at Read X because the other transaction has already written to X

• (a) lazy conflict detection would detect a conflict in T1 because T2 commits first implying that T1 should have used the result of the T2 Write X operation

• (b) lazy conflict detection would allow both T1 and T2 to commit because T1 commits first and its Read X need not use the result of the T2 Write X

How to resolve conflicts?

• consider eager detection

• when T1 performs Read X, a conflict is detected with Write X in T2 and one transaction has to be aborted

• if T2 is aborted, T1 will conflict later with T3 [Read Y and Write Y] and another transaction will have to be aborted

• if, however, the policy had decided to abort T1, T2 and T3 could have finished with only one transaction aborted instead of two

Other issues

• transactions can be nested

    nested transactions are especially useful if a nested transaction can abort without aborting its parent

• zombies are transactions that are destined to abort, but are still running

• such transactions may have an inconsistent read set which could lead to erroneous behaviour [e.g. infinite loop, index out of bounds]

• zombies can be avoided by validating the entire read set after each transactional load [expensive]

• validating explicitly checks for conflicts and will abort the transaction immediately

Other issues…

• consider the code for adding to a bounded [fixed size] transactional Q – items stored in a “circular” array of length items

    void add(int key) {
        atomic {
            if (count == items.length)      // Q full
                retry;
            items[tail] = key;              // add item
            if (++tail == items.length)     // if necessary…
                tail = 0;                   // wrap tail index
            ++count;                        // increase count
        }
    }

Other issues…

• retry rolls back the enclosing transaction

• can be used to wait on multiple conditions

    atomic {
        x = q0.remove();
    } orElse {
        x = q1.remove();
    }

• if q0 is empty, retry is called, which is detected by orElse, so q1.remove() is tried instead

Hardware Transactional Memory

• Transactional Memory: Architectural Support for Lock-Free Data Structures, Maurice Herlihy and J. Eliot B. Moss, Proceedings of the 20th Annual International Symposium on Computer Architecture 1993

• motivations

    lock-free – operations on a data structure will not be prevented if one process/thread stalls mid execution

    avoids common problems with mutual exclusion

    outperforms best known locking techniques

Hardware Transactional Memory

• a transaction [as defined here] is a finite sequence of machine instructions, executed by a single process/thread, satisfying the following properties:

    serializability: transactions appear to execute serially, meaning that the steps of one transaction never appear to be interleaved with the steps of another

    committed transactions are never observed by different processes/threads to execute in different orders

    atomicity: each transaction makes a sequence of tentative changes to shared memory

    when a transaction completes it either commits making its changes visible to other processes/threads or it aborts causing its changes to be discarded

Hardware Transactional Memory

• implemented by modifying a multiprocessor cache coherency protocol [write-once in this example]

• tentative changes are made to a separate transaction cache

• ONLY when the transaction is committed do changes become visible atomically to other CPUs

Hardware Transactional Memory

• basic idea is that any cache coherency protocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost

• instructions added to CPU instruction set for handling transactions – would be automatically generated by a compiler

• Load transactional [LT] reads the value of a shared memory location into transaction cache [and CPU register]

• Load transactional exclusive [LTX] reads the value of a shared memory location into transaction cache and marks it as RESERVED [use LTX if location likely to be updated]

• Store transactional [ST] tentatively writes a value to a copy of the data in the transaction cache which does NOT become visible to other processors until the transaction successfully commits

Hardware Transactional Memory…

• commit [COMMIT] attempts to make a transaction’s tentative changes permanent and visible to other caches

    succeeds ONLY if no other transaction has written to any location in the transaction's read or write set [and no other transaction has read any location in this transaction’s write set]

    on failure all tentative changes to the write set are discarded

    returns success or failure

• Abort [ABORT] discards all updates to the write set

• Validate [VALIDATE] tests the current transaction’s status

    returns true if the transaction has not aborted [thus far]

    returns false if the current transaction has aborted, discards tentative updates

• CPU also keeps a TACTIVE flag indicating a transaction is in progress and a TSTATUS flag indicating if the transaction is active or aborted; VALIDATE returns TSTATUS

Transaction Cache States

• transaction cache lines have a write-once state AND a transaction state

• a memory location cannot be in a CPU’s normal cache and transaction cache simultaneously [exclusive caches]

• transactional cache states

    EMPTY contains no data [invalid]

    NORMAL contains committed data

    XCOMMIT [discard on commit] contains original value read from “memory”

    XABORT [discard on abort] holds the tentative writes made to cache line during a transaction [always paired with a XCOMMIT cache line]

• if a transaction commits successfully, the XCOMMIT lines are set to EMPTY and the XABORT lines switch to NORMAL

• must occur atomically using appropriate hardware support so ALL changes become visible “instantaneously”

Transaction Cache States…

• the snooping mechanism returns a BUSY status if CPUx tries to transactionally read a memory location that is in another CPU’s transactional cache in the RESERVED or DIRTY state [because other CPU must have written to it]

• CPUx’s TSTATUS is set false [aborted] if it receives a BUSY status when attempting to execute a LT, LTX or ST

• if the transaction aborts, the XABORT lines are set to EMPTY and the XCOMMIT lines are set to NORMAL
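The commit and abort transitions can be modelled in a few lines. This is a sequential toy model under stated assumptions: TransactionCache, TLine, load, store and lookup are hypothetical names, addresses are plain ints, the write-once state is omitted, and the hardware performs the state flip atomically whereas the loop below does not.

```cpp
#include <cassert>
#include <vector>

enum class TState { EMPTY, NORMAL, XCOMMIT, XABORT };

struct TLine {
    int addr;
    int value;
    TState state;
};

struct TransactionCache {
    std::vector<TLine> lines;

    // First transactional access: original value in XCOMMIT, working copy in XABORT.
    void load(int addr, int value) {
        lines.push_back({addr, value, TState::XCOMMIT});
        lines.push_back({addr, value, TState::XABORT});
    }

    // Tentative write goes to the XABORT copy only.
    void store(int addr, int value) {
        for (auto& l : lines)
            if (l.addr == addr && l.state == TState::XABORT) l.value = value;
    }

    void commit() { finish(TState::XCOMMIT, TState::XABORT); }   // keep tentative values
    void abort()  { finish(TState::XABORT, TState::XCOMMIT); }   // keep original values

    // Fetch the NORMAL (committed) value for addr, if present.
    bool lookup(int addr, int& out) const {
        for (const auto& l : lines)
            if (l.addr == addr && l.state == TState::NORMAL) { out = l.value; return true; }
        return false;
    }

private:
    // Lines in state drop become EMPTY, lines in state keep become NORMAL.
    void finish(TState drop, TState keep) {
        for (auto& l : lines) {
            if (l.state == drop) l.state = TState::EMPTY;
            else if (l.state == keep) l.state = TState::NORMAL;
        }
    }
};
```

The names XCOMMIT and XABORT say which event discards the line: commit throws away the saved originals, abort throws away the tentative copies.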

Intended use of transactional instructions

1. use LT or LTX to read a set of locations

2. use VALIDATE to check that the read set is consistent

on failure goto 1

3. use ST to modify a set of locations

4. use COMMIT to make changes permanent

on failure goto 1

Consider the following transaction

    atomic {
        a0 += 3;    // add 3
        a1 -= 3;    // subtract 3
    }

• and compiler generated code sequence for transaction

    tstart: ltx    r1, a0        // know a0 will be modified
            ltx    r2, a1        // know a1 will be modified
            add    r1, 3, r1     // add 3
            sub    r2, 3, r2     // sub 3
            st     r1, a0        // tentative store
            st     r2, a1        // tentative store
            commit               // commit
            jeq    tstart        // retry on failure

• could add validate instructions to test for abort status earlier

Example Transaction

[Figure: transactional cache contents; each line shows an address and value, its write-once state (I, V, R or D) and its transaction state]

• assume a0 = 0 and a1 = 3 initially

• consider transaction state when executed on a single CPU just before COMMIT

Example Transaction…

• memory locations a0 and a1 are read using LTX and so enter cache in the Reserved | XCOMMIT state

• a copy of a0 and a1 is also made in the Reserved | XABORT state

• memory locations are then written and the XABORT cache line is changed to state Dirty | XABORT

• note that per CPU TACTIVE and TSTATUS flags indicate that a transaction is active and that its status is also active [rather than aborted]

Example Transaction…

• state after COMMIT executed

• transaction successfully committed

• transactional cache lines of type XCOMMIT set to EMPTY

• transactional cache lines of type XABORT set to NORMAL

• TACTIVE set to false ready for next transaction

• COMMIT operation updates the state of ALL transaction cache lines atomically [needs appropriate hardware]

• other CPUs can now obtain updated contents of a0 and a1 from transactional cache

Example Transaction…

• how are conflicts detected between concurrent transactions?

• assume CPU 0 has executed its LTX r1, a0 and CPU 1 has been granted access to the bus to execute its LTX r1, a0

• CPU 0 has loaded a0 into its transactional cache in the Reserved state [LT would have loaded it in the Valid state]

Example Transaction…

• CPU 1 tries to read a0, but since CPU 0 has a copy in its transactional cache in the RESERVED state, the transactional cache will detect the conflict and assert BUSY

• when a BUSY response is received by CPU 1 it marks the transaction as being aborted by setting TACTIVE and TSTATUS to false

    eager conflict detection

    LTX will return arbitrary data

• when CPU 1 [eventually] VALIDATEs or COMMITs the transaction it will fail

    sets all XABORT entries to EMPTY and sets all XCOMMIT entries to NORMAL

Summary

• you are now able to: