Inside Synchronization. Jeff Chase, Duke University.

Transcript of the slide deck (47 slides).

Page 1.

Inside Synchronization

Jeff Chase, Duke University

Page 2.

Threads and blocking

Layers of thread support:
• thread API, e.g., pthreads or Java threads
• thread library: threads, mutexes, condition variables…
• kernel interface for thread libs (not for users)
• kernel thread support: raw “vessels”, e.g., Linux CLONE_THREAD + ”futex”

Threads can enter the kernel (fault or trap) and block.

[State diagram: a thread is active (ready or running) or blocked; wait moves it to blocked, wakeup/signal moves it back to active.]

Page 3.

Blocking

[State diagram: a thread (kernel TCB) is active (ready or running) or blocked; sleep/wait moves its TCB to a sleep queue, wakeup/signal moves it to the ready queue.]

When a thread is blocked on a synchronization object (a mutex or CV), its TCB is placed on a sleep queue of threads waiting for an event on that object.

This slide applies to the process abstraction too, or, more precisely, to the main thread of a process.

How do we synchronize thread queues and sleep/wakeup inside the kernel? Interrupts drive many wakeup events.

Page 4.

Overview

• Consider multicore synchronization (inside the kernel) from first principles.

• Details vary from system to system and machine to machine….

• I’m picking and choosing.

Page 5.

Spinlock: a first try

    int s = 0;

    lock() {
        while (s == 1) {};
        ASSERT(s == 0);
        s = 1;
    }

    unlock() {
        ASSERT(s == 1);
        s = 0;
    }

Busy-wait until the lock is free. s is a global spinlock variable.

Spinlocks provide mutual exclusion among cores without blocking.

Spinlocks are useful for lightly contended critical sections where there is no risk of preemption of a thread while it is holding the lock, i.e., in the lowest levels of the kernel.

Page 6.

Spinlock: what went wrong

    int s = 0;

    lock() {
        while (s == 1) {};
        s = 1;
    }

    unlock() {
        s = 0;
    }

Race to acquire: two (or more) cores see s == 0, exit the while loop together, and each sets s = 1, believing it holds the lock.

Page 7.

We need an atomic “toehold”

• To implement safe mutual exclusion, we need support for some sort of “magic toehold” for synchronization.
– The lock primitives themselves have critical sections to test and/or set the lock flags.

• Safe mutual exclusion on multicore systems requires some hardware support: atomic instructions.
– Examples: test-and-set, compare-and-swap, fetch-and-add.
– These instructions perform an atomic read-modify-write of a memory location. We use them to implement locks.
– If we have any of those, we can build higher-level synchronization objects like monitors or semaphores.
– Note: we also must be careful of interrupt handlers….
– They are expensive, but necessary.

Page 8.

Atomic instructions: Test-and-Set

    Spinlock::Acquire() {
        while (held);
        held = 1;
    }

Wrong:

        load 4(SP), R2       ; load “this”
    busywait:
        load 4(R2), R3       ; load “held” flag
        bnz R3, busywait     ; spin if held wasn’t zero
        store #1, 4(R2)      ; held = 1

Right:

        load 4(SP), R2       ; load “this”
    busywait:
        tsl 4(R2), R3        ; test-and-set this->held
        bnz R3, busywait     ; spin if held wasn’t zero

Problem: the interleaved load/test/store sequence is not atomic. Solution: TSL atomically sets the flag and leaves the old value in a register. One example: tsl, test-and-set-lock (from an old machine).

Page 9.

Spinlock: IA32

Spin_Lock:

CMP lockvar, 0 ;Check if lock is free

JE Get_Lock

PAUSE ; Short delay

JMP Spin_Lock

Get_Lock:

MOV EAX, 1

XCHG EAX, lockvar ; Try to get lock

CMP EAX, 0 ; Test if successful

JNE Spin_Lock

Atomic exchange to ensure safe acquire of an uncontended lock.

Idle the core for a contended lock.

XCHG is an unconditional atomic exchange of a register and a memory location. Its conditional cousin, compare-and-swap (IA32 CMPXCHG), compares x to the value in memory location y; if x == *y then it sets *y = z, and reports success/failure.

Page 10.

Synchronization accesses

• Atomic instructions also impose orderings on memory accesses.

• Their execution informs the machine that synchronization is occurring.

• Cores synchronize with one another by accessing a shared memory location with atomic instructions.

• When cores synchronize, they establish happens-before ordering relationships among their accesses to other shared memory locations.

• The machine must ensure a consistent view of memory that respects these happens-before orderings.

Page 11.

7.1. LOCKED ATOMIC OPERATIONS

The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag…. Note that the mechanisms for handling locked atomic operations have evolved as the complexity of IA-32 processors has evolved….

Synchronization mechanisms in multiple-processor systems may depend upon a strong memory-ordering model. Here, a program can use a locking instruction such as the XCHG instruction or the LOCK prefix to insure that a read-modify-write operation on memory is carried out atomically. Locking operations typically operate like I/O operations in that they wait for all previous instructions to complete and for all buffered writes to drain to memory….

This is just an example of a principle on a particular machine (IA32): these details aren’t important.

Page 12.

A peek at some deep tech

Two threads, each executing:

    mx->Acquire();
    x = x + 1;
    mx->Release();

An execution schedule defines a partial order of program events. The ordering relation (<) is called happens-before.

Just three rules govern happens-before order:
1. Events within a thread are ordered.
2. Mutex handoff orders events across threads: the release #N happens-before acquire #N+1.
3. Happens-before is transitive: if (A < B) and (B < C) then (A < C).

Two events are concurrent if neither happens-before the other. They might execute in some order, but only by luck: the next schedule may reorder them. Machines may reorder concurrent events, but they always respect happens-before ordering.

Page 13.

Happens-before and causality

• We humans have a natural notion of causality.

– Event A caused event B if B happened as a result of A, or A was a factor in B, or knowledge of A was necessary for B to occur….

• Naturally, event A can cause event B only if A < B!

– (A caused B) ⇒ (A happens-before B), i.e., A precedes B

– This is obvious: events cannot change the past.

• Of course, the converse is not always true.

– It is not true in general that (A < B) ⇒ (A caused B).

– Always be careful in inferring causality.

Page 14.

Causality and inconsistency

• If A caused B, and some thread T observes event B before event A, then T sees an “inconsistent” event timeline.

– Example: Facebook never shows you a reply to a post before showing you the post itself. Never happens. It would be too weird.

• That kind of inconsistency might cause a program to fail.

– We’re talking about events that matter for thread interactions at the machine level: load and store on the shared memory.

Page 15.

Memory ordering

• Shared memory is complex on multicore systems.

• Does a load from a memory location (address) return the latest value written to that memory location by a store?

• What does “latest” mean in a parallel system?

[Diagram: threads T1 and T2 share memory M. T1 performs W(x)=1 and then W(y)=1; each write is acknowledged (OK). T2 then reads R(y) = 1 and R(x) = 1.]

It is common to presume that load and store ops execute sequentially on a shared memory, and that a store is immediately and simultaneously visible to loads at all other threads. But not on real machines.

Page 16.

Memory ordering

• A load might fetch from the local cache and not from memory.

• A store may buffer a value in a local cache before draining the value to memory, where other cores can access it.

• Therefore, a load from one core does not necessarily return the “latest” value written by a store from another core.

[Diagram: as on the previous slide, T1 performs W(x)=1 and then W(y)=1, but now T2’s reads may return stale values: R(y) = 0?? and R(x) = 0??]

A trick called Dekker’s algorithm supports mutual exclusion on multicore without using atomic instructions. It assumes that load and store ops on a given location execute sequentially. But they don’t.

Page 17.

Memory ordering

[Diagram: T1 performs W(x)=1 and then W(y)=1; T2 may read R(y) = 0?? and R(x) = 0??]

Memory accesses from T1 have no happens-before ordering defined relative to the accesses from T2, unless the program uses synchronization (e.g., a mutex handoff) to impose an ordering.

• A load might fetch from the local cache and not from memory.

• A store may buffer a value in a local cache before draining the value to memory, where other cores can access it.

• Therefore, a load from one core does not necessarily return the “latest” value written by a store from another core.

Page 18.

Memory Models: A Case for Rethinking Parallel Languages and Hardware. Sarita Adve and Hans Boehm, Communications of the ACM, Aug 2010, Vol. 53 Issue 8.

Reordering any pair of accesses, reading values from write buffers, register promotion, common subexpression elimination, redundant read elimination: all may violate sequential consistency.

A compiler might reorder two independent assignments to hide the latency of loading Y or X.

Modern processors may use a store buffer to avoid waiting for stores to complete.

Page 19.

The point of happens-before

• For consistency, we want a load from a location to return the value written by the “latest” store to that location.

• But what does “latest” mean? It means the load returns the value from the last store that happens-before the load.

• Machines are free to reorder concurrent accesses.

– Concurrent events have no restriction on their ordering: no happens-before relation. Your program’s correctness cannot depend on the ordering the machine picks for concurrent events: if the interleaving matters to you, then you should have used a mutex.

– If there is no mutex, then the events are concurrent, and the machine is free to choose whatever order is convenient for speed, e.g., it may leave “old” data in caches and not propagate more “recent” data.

Page 20.

The first thing to understand about memory behavior on multi-core systems

• Cores must see a “consistent” view of shared memory for programs to work properly. But what does it mean?
– Answer: it depends. Machines vary.

– But they always respect causality: that is a minimal requirement.

– And since machines don’t know what events really cause others in a program, they play it safe and respect happens-before.

Page 21.

The first thing to understand about memory behavior on multi-core systems

• Cores must see a “consistent” view of shared memory for programs to work properly. But what does it mean?

• Synchronization accesses tell the machine that ordering matters: a happens-before relationship exists. Machines always respect that.

– Modern machines work for race-free programs.

– Otherwise, all bets are off. Synchronize!

[Diagram: T1 performs W(x)=1 and W(y)=1, then passes a lock to T2. T2’s R(y) returns 1, but R(x) may still return 0?? if that read is not ordered by the lock handoff.]

The most you should assume is that any memory store before a lock release is visible to a load on a core that has subsequently acquired the same lock.

Page 22.

The point of all that

• We use special atomic instructions to implement locks.

• E.g., a TSL or CMPXCHG on a lock variable lockvar is a synchronization access.

• Synchronization accesses also have special behavior with respect to the memory system.

– Suppose core C1 executes a synchronization access to lockvar at time t1, and then core C2 executes a synchronization access to lockvar at time t2.

– Then t1<t2: every memory store that happens-before t1 must be visible to any load on the same location after t2.

• If memory always had this expensive sequential behavior, i.e., every access is a synchronization access, then we would not need atomic instructions: we could use “Dekker’s algorithm”.

• We do not discuss Dekker’s algorithm because it is not applicable to modern machines. (Look it up on wikipedia if interested.)

Page 23.

Where are we

• We now have basic mutual exclusion that is very useful inside the kernel, e.g., for access to thread queues.
– Spinlocks based on atomic instructions.
– Can synchronize access to sleep/ready queues used to implement higher-level synchronization objects.

• Don’t use spinlocks from user space! A thread holding a spinlock could be preempted at any time.
– If a thread is preempted while holding a spinlock, then other threads/cores may waste many cycles spinning on the lock.
– That’s a kernel/thread library integration issue: fast spinlock synchronization in user space is a research topic.

• But spinlocks are very useful in the kernel, esp. for synchronizing with interrupt handlers!

Page 24.

Wakeup from interrupt handler

[Diagram: a thread enters the kernel by trap or fault, sleeps onto the sleep queue, and later switches back in and returns to user mode; an interrupt drives a wakeup that moves a thread from the sleep queue to the ready queue.]

Examples?

Note: interrupt handlers do not block: typically there is a single interrupt stack for each core that can take interrupts. If an interrupt arrived while another handler was sleeping, it would corrupt the interrupt stack.

Page 25.

Wakeup from interrupt handler

[Diagram repeated from the previous slide: an interrupt drives a wakeup that moves a thread from the sleep queue to the ready queue.]

How should an interrupt handler wake up a thread? Condition variable signal? Semaphore V?

Page 26.

Interrupts

An arriving interrupt transfers control immediately to the corresponding handler (Interrupt Service Routine).

ISR runs kernel code in kernel mode in kernel space.

Interrupts may be nested according to priority.

[Diagram: an executing thread is interrupted by a low-priority handler (ISR), which may in turn be interrupted by a high-priority ISR.]

Page 27.

Interrupt priority: rough sketch

• N interrupt priority classes.

• When an ISR at priority p runs, the CPU blocks interrupts of priority p or lower.

• Kernel software can query/raise/lower the CPU interrupt priority level (IPL).
– Defer or mask delivery of interrupts at that IPL or lower.
– Avoid races with a higher-priority ISR by raising the CPU IPL to that priority.
– e.g., BSD Unix spl*/splx primitives.

• Summary: kernel code can enable/disable interrupts as needed.

[Diagram: the BSD IPL ladder from low to high: spl0, splnet, splbio, splimp, clock; splx(s) restores a saved level.]

BSD example:

    int s;
    s = splhigh();
    /* all interrupts disabled */
    splx(s);
    /* IPL is restored to s */

Page 28.

What ISRs do

• Interrupt handlers:
– bump counters, set flags

– throw packets on queues

– …

– wakeup waiting threads

• Wakeup puts a thread on the ready queue.

• Use spinlocks for the queues

• But how do we synchronize with interrupt handlers?

Page 29.

Synchronizing with ISRs

• Interrupt delivery can cause a race if the ISR shares data (e.g., a thread queue) with the interrupted code.

• Example: Core at IPL=0 (thread context) holds spinlock, interrupt is raised, ISR attempts to acquire spinlock….

• That would be bad. Disable interrupts.

[Diagram: an executing thread (IPL 0) in kernel mode disables interrupts for the critical section.]

    int s;
    s = splhigh();
    /* critical section */
    splx(s);

Page 30.

Obviously this is just example detail from a particular machine (IA32): the details aren’t important.

Page 31.

Page 32.

Page 33.

Obviously this is just example detail from a particular OS (Windows): the details aren’t important.

Page 34.

Synchronizing with ISRs

[Diagram repeated: an executing thread (IPL 0) in kernel mode disables interrupts for the critical section.]

    int s;
    s = splhigh();
    /* critical section */
    splx(s);

Page 35.

A Rough Idea

    Yield() {
        next = FindNextToRun();
        ReadyToRun(this);
        Switch(this, next);
    }

    Sleep() {
        this->status = BLOCKED;
        next = FindNextToRun();
        Switch(this, next);
    }

Issues to resolve:
• What if there are no ready threads?
• How does a thread terminate?
• How does the first thread start?

Page 36.

A Rough Idea

    Thread.Sleep(SleepQueue q) {
        lock and disable interrupts;
        this.status = BLOCKED;
        q.AddToQ(this);
        next = sched.GetNextThreadToRun();
        unlock and enable;
        Switch(this, next);
    }

    Thread.Wakeup(SleepQueue q) {
        lock and disable;
        q.RemoveFromQ(this);
        this.status = READY;
        sched.AddToReadyQ(this);
        unlock and enable;
    }

This is pretty rough. The sleep and wakeup primitives must be used to implement synchronization objects like mutexes and CVs, and we are waving our hands at how that will work. Actually, P/V operations on a dedicated per-thread semaphore would be better than sleep/wakeup.

Page 37.

A Rough Idea

    Thread.Sleep(SleepQueue q) {
        lock and disable interrupts;
        this.status = BLOCKED;
        q.AddToQ(this);
        next = sched.GetNextThreadToRun();
        unlock and enable;
        Switch(this, next);
    }

    Thread.Wakeup(SleepQueue q) {
        lock and disable;
        q.RemoveFromQ(this);
        this.status = READY;
        sched.AddToReadyQ(this);
        unlock and enable;
    }

This is pretty rough. There is some hidden synchronization: as soon as sleep unlocks, another sleep (or yield) on another core may try to switch into the sleeping thread before it switches out. And we have to worry about interrupts during context switch.

Page 38.

Example: Unix Sleep (BSD)

    sleep(void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;

        s = splhigh();                    /* disable all interrupts */
        p->p_wchan = event;               /* what are we waiting for */
        p->p_priority = sleep_priority;   /* wakeup scheduler priority */
        p->p_stat = SSLEEP;               /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p); /* fiddle sleep queue */
        splx(s);                          /* enable interrupts */
        mi_switch();                      /* context switch */
        /* we’re back... */
    }

Illustration Only

Page 39.

    /*
     * Save context of the calling thread (old), restore registers of
     * the next thread to run (new), and return in context of new.
     */
    switch/MIPS (old, new) {
        old->stackTop = SP;
        save RA in old->MachineState[PC];
        save callee registers in old->MachineState;

        restore callee registers from new->MachineState;
        RA = new->MachineState[PC];
        SP = new->stackTop;

        return (to RA);
    }

This example (from the old MIPS ISA) illustrates how context switch saves/restores the user register context for a thread, efficiently and without assigning a value directly into the PC.

Page 40.

Example: Switch()

    switch/MIPS (old, new) {
        old->stackTop = SP;                 ; save current stack pointer
        save RA in old->MachineState[PC];   ; and caller’s return address in old thread object
        save callee registers in old->MachineState;

        restore callee registers from new->MachineState;
        RA = new->MachineState[PC];
        SP = new->stackTop;                 ; switch off of old stack and over to new stack

        return (to RA);                     ; return to the procedure that called switch in the new thread
    }

Caller-saved registers (if needed) are already saved on the caller’s stack, and restored automatically on return.

RA is the return address register. It contains the address that a procedure return instruction branches to.

Page 41.

What to know about context switch

• The Switch/MIPS example is an illustration for those of you who are interested. It is not required to study it. But you should understand how a thread system would use it (refer to the state transition diagram).

• Switch() is a procedure that returns immediately, but it returns onto the stack of the new thread, and not in the old thread that called it.

• Switch() is called from internal routines to sleep or yield (or exit).

• Therefore, every thread in the blocked or ready state has a frame for Switch() on top of its stack: it was the last frame pushed on the stack before the thread switched out. (We need per-thread stacks to block.)

• The thread create primitive seeds a Switch() frame manually on the stack of the new thread, since it is too young to have switched before.

• When a thread switches into the running state, it always returns immediately from Switch() back to the internal sleep or yield routine, and from there back on its way to wherever it goes next.

Page 42.

Implementing Sleep on a Multiprocessor

    sleep(void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;

        s = splhigh();                    /* disable all interrupts */
        p->p_wchan = event;               /* what are we waiting for */
        p->p_priority = sleep_priority;   /* wakeup scheduler priority */
        p->p_stat = SSLEEP;               /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p); /* fiddle sleep queue */
        splx(s);                          /* enable interrupts */
        mi_switch();                      /* context switch */
        /* we’re back... */
    }

What if another CPU takes an interrupt and calls wakeup?

What if another CPU is handling a syscall and calls sleep or wakeup?

What if another CPU tries to wakeup curproc before it has completed mi_switch?

Illustration Only

Page 43.

Using Spinlocks in Sleep: First Try

    sleep(void* event, int sleep_priority)
    {
        struct proc *p = curproc;

        lock spinlock;
        p->p_wchan = event;               /* what are we waiting for */
        p->p_priority = sleep_priority;   /* wakeup scheduler priority */
        p->p_stat = SSLEEP;               /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p); /* fiddle sleep queue */
        unlock spinlock;
        mi_switch();                      /* context switch */
        /* we’re back */
    }

Grab the spinlock to prevent another CPU from racing with us. Wakeup (or any other related critical section code) will use the same spinlock, guaranteeing mutual exclusion.

Illustration Only

Page 44.

Sleep with Spinlocks: What Went Wrong

    sleep(void* event, int sleep_priority)
    {
        struct proc *p = curproc;

        lock spinlock;
        p->p_wchan = event;               /* what are we waiting for */
        p->p_priority = sleep_priority;   /* wakeup scheduler priority */
        p->p_stat = SSLEEP;               /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p); /* fiddle sleep queue */
        unlock spinlock;
        mi_switch();                      /* context switch */
        /* we’re back */
    }

Potential deadlock: what if we take an interrupt on this processor, and call wakeup while the lock is held?

Potential doubly scheduled thread: what if another CPU calls wakeup to wake us up before we’re finished with mi_switch on this CPU?

Illustration Only

Page 45.

Using Spinlocks in Sleep: Second Try

    sleep(void* event, int sleep_priority)
    {
        struct proc *p = curproc;
        int s;

        s = splhigh();
        lock spinlock;
        p->p_wchan = event;               /* what are we waiting for */
        p->p_priority = sleep_priority;   /* wakeup scheduler priority */
        p->p_stat = SSLEEP;               /* transition curproc to sleep state */
        INSERTQ(&slpque[HASH(event)], p); /* fiddle sleep queue */
        unlock spinlock;
        splx(s);
        mi_switch();                      /* context switch */
        /* we’re back */
    }

Grab the spinlock and disable interrupts.

Illustration Only

Page 46.

Recap

• An OS implements synchronization objects using a combination of elements:
– Basic sleep/wakeup primitives of some form.
– Sleep places the thread TCB on a sleep queue and does a context switch to the next ready thread.
– Wakeup places each awakened thread on a ready queue, from which the ready thread is dispatched to a core.
– Synchronization for the thread queues uses spinlocks based on atomic instructions, together with interrupt enable/disable.
– The low-level details are tricky and machine-dependent.
– The atomic instructions (synchronization accesses) also drive memory consistency behaviors in the machine, e.g., a safe memory model for fully synchronized race-free programs.

Page 47.

CMPXCHG

If our CPU loses the ‘race’, because another CPU changed ‘cmos_lock’ to some non-zero value after we had fetched our copy of it, then the (now non-zero) value from the ‘cmos_lock’ destination-operand will have been copied into EAX, and so the final conditional-jump shown above will take our CPU back into the spin-loop, where it will resume busy-waiting until the ‘winner’ of the race clears ‘cmos_lock’.