Concurrency and Thread Management

Transcript of "Concurrency and Thread Management" - cs.rochester.edu

Page 1:

Concurrency and Thread Management

Page 2:

What is concurrency?

A concurrent program maintains more than one active execution context ("thread" of control). Concurrency can be implemented if a program contains operations that could, but need not, be executed in parallel.

Page 3:

... versus a sequential program, which has only one execution context, in which all code is traversed in order.

Page 4:

Why is concurrency important?

● Better fit for the problem at hand
● Better fit for the hardware being used
● Better use of resources

Some programs are very easily split up into independent parts

- keep track of tasks as independent threads of control

- servers and graphical applications

Hardware -- multiprocessors, independent physical devices, networks -- each piece of hardware can act as an independent thread of control

- interrupt-driven switches

Resources-- time, speed (running many things at the same time), memory (by sharing parts of their address space)

Page 5:

Processes v. Threads

To begin, we must define a process:
● single address space
● single thread of control for executing a program
● state information (per program, per process)

– page tables, swap images, file descriptors, queued I/O requests, saved registers

Processes are very expensive to create/maintain!

A process is sometimes called a "heavyweight" thread.

Page 6:

A thread is a single flow of execution within a program. Multiple threads:

● can be run at the same time within a single process (address space); reduce the overhead of multiple copies of common information

● maintain an active execution context:
– program counter
– stack
– control block (state info for thread management)
– ready queue
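A minimal Python sketch (Python is not the slides' notation; the worker function and names are illustrative) of threads running within one process and sharing its address space, so no copies of common data are made:

```python
import threading

# All threads below run inside one process and share this list
# (one address space), so no per-thread copies of the data exist.
shared = []
lock = threading.Lock()

def worker(tag):
    with lock:                     # serialize access to the shared state
        shared.append(tag)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # wait for every thread to finish

print(sorted(shared))              # every thread appended to the same list
```

Each `threading.Thread` here gets its own stack and saved registers, but all four see the one `shared` list, which is exactly the overhead reduction the slide describes.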

Page 7:

Broad Thread Types

A kernel thread, sometimes called a "lightweight process", is handled by the system scheduler. A kernel thread runs in user mode when executing user or library code, and in kernel mode when executing system calls.

A user-level thread is handled in user mode (in application code) in an implementation-dependent manner. User-level thread libraries provide a portable interface on top of kernel threads.

Kernel: "lightweight process", system scheduled

User-level: implementation dependent

Page 8:

Using Threads

● Context Switching/Scheduling
– The system methodically decides which thread it will work on at any given time
● Communication
– Between threads (broadcast, etc.), between processors (message passing, shared memory, etc.)
● Synchronization
– Locks on code or data to maintain dependencies

Page 9:

Scheduling

Transferring Context Blocks (Co-routines):

transfer(other):
    save all callee-saves registers on stack, including ra and fp
    *current := sp
    current := other
    sp := *current
    pop all callee-saves registers (including ra, but NOT sp!)
    return (into different coroutine!)

- Examples: Simula and Modula-2.
- Multiple execution contexts, only one of which is active.
- other and current are pointers to CONTEXT BLOCKs. (A context block contains sp; it may contain other information as well: priority, I/O status, accounting info, etc.)
- No need to change the PC; it always changes at the same place.
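The transfer primitive switches stacks directly; a rough model in Python (not the Simula/Modula-2 mechanism itself -- the generator names and schedule below are invented for illustration) uses generators as execution contexts, where resuming one acts like transfer():

```python
# Each generator is an execution context; next() acts like transfer()
# by suspending one context and resuming another at its saved "PC".
trace = []

def coroutine(name, n):
    for i in range(n):
        trace.append((name, i))   # do one unit of work
        yield                     # suspend: "save context, transfer out"

a, b = coroutine("A", 2), coroutine("B", 2)
for ctx in (a, b, a, b):          # an explicit transfer schedule
    next(ctx, None)               # resume ctx where it last yielded

print(trace)                      # A and B interleave one step at a time
```

As on the slide, there is no need to change the PC explicitly: each context always suspends and resumes at the same place (the yield).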

Page 10:

Scheduling...

Simple uniprocessor scheduling:

reschedule:
    t : cb := dequeue(ready_list)
    transfer(t)

yield:
    enqueue(ready_list, current)
    reschedule

sleep_on(q):
    enqueue(q, current)
    reschedule

A more sophisticated way of scheduling threads is through the use of queues, lists of context blocks.

The primary data structures are the ready list and condition lists.

Ready list: Holds context blocks for threads that are runnable but not currently running

- current: the running thread; yield gives control to another thread
- q: a condition list

Condition lists: separate lists for each condition, each containing context blocks for threads waiting for that condition to become true

- sleep_on synchronizes: it enqueues the current thread on a condition list, and threads are taken back off the ready queue when the processor can run them
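The reschedule/yield routines can be modeled in Python (a sketch, not the slides' implementation: generators stand in for thread contexts and a deque for the ready list, and the driver loop plays the role of transfer):

```python
from collections import deque

# Minimal model of reschedule / yield: generators are threads,
# the deque holds their "context blocks" (the suspended generators).
ready_list = deque()
log = []

def reschedule():
    return ready_list.popleft()    # t := dequeue(ready_list); transfer(t)

def make_thread(name):
    def body():
        log.append(name + ":start")
        yield "yield"              # voluntarily give up the processor
        log.append(name + ":end")
    return body()

ready_list.extend([make_thread("T1"), make_thread("T2")])
current = reschedule()
while True:
    try:
        op = next(current)         # run current until it yields or ends
    except StopIteration:
        if not ready_list:
            break
        current = reschedule()
        continue
    if op == "yield":              # enqueue(ready_list, current); reschedule
        ready_list.append(current)
        current = reschedule()

print(log)                         # the two threads interleave at yields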

Page 11:

Scheduling...

Preemption:

yield:
    disable_signals
    enqueue(ready_list, current)
    reschedule
    re-enable_signals

disable_signals
if not <desired condition>
    sleep_on <condition queue>
re-enable_signals

To keep this scheduling going, systems often use timed signals to move threads on and off the ready_list, so that a running thread cannot monopolize the processor while runnable threads wait.

BUT, these signals arrive at arbitrary points, which can cause threads to be misplaced or only partially placed on the ready_list (involuntary yields), so we have to protect our scheduler data structures (first routine above).

Solution: disable these signals before moving threads to/from the ready_list, and re-enable them afterwards.

**Note that reschedule takes us to a different thread, possibly in code other than yield.
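The disable_signals / re-enable_signals bracket can be demonstrated concretely with POSIX signal masking in Python (a sketch, POSIX-only; SIGALRM stands in for the preemption timer, and the list append stands in for the protected queue operation):

```python
import signal

# Model of "disable_signals ... re-enable_signals": block SIGALRM while
# touching scheduler data, so a preemption signal arriving mid-update is
# held pending instead of interrupting the critical section.
events = []
signal.signal(signal.SIGALRM, lambda s, f: events.append("preempt"))

signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGALRM})    # disable_signals
signal.raise_signal(signal.SIGALRM)    # "timer fires" during the update
events.append("enqueue")               # the protected queue operation
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGALRM})  # re-enable

print(events)    # the handler runs only after the critical section
```

Without the mask, the handler could run between any two steps of the queue update; with it, the signal is delivered only once the scheduler data is consistent again.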

Page 12:

Scheduling...

Multiprocessor scheduling: extends the others.

yield:
    disable_signals
    acquire(scheduler_lock)
    enqueue(ready_list, current)
    reschedule
    release(scheduler_lock)
    re-enable_signals

disable_signals
acquire(scheduler_lock)
if not <desired condition>
    sleep_on <condition queue>
release(scheduler_lock)
re-enable_signals

– Multiple processes share ready_list and related thread maintenance data structures

● have to put a lock on the scheduling operations, to account for multiple processors being able to access threads at the same time

● each processor has to keep track of which thread it is currently executing

– The processor design will determine how and how many threads can be run at a time

– Longest-waiting threads dequeued first (or those given priority by scheduler)
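A Python sketch of the shared-ready_list idea (worker threads stand in for processors; the task ids, counts, and names are illustrative, not the slides' implementation): one scheduler_lock makes dequeue and enqueue atomic across all "processors".

```python
import threading
from collections import deque

# Several "processors" (worker threads) share one ready_list; a single
# scheduler_lock serializes access, as in the multiprocessor yield above.
ready_list = deque(range(100))     # 100 runnable "threads" (task ids)
scheduler_lock = threading.Lock()
done = []

def processor():
    while True:
        with scheduler_lock:       # acquire(scheduler_lock)
            if not ready_list:
                return             # nothing left to run
            task = ready_list.popleft()   # longest-waiting dequeued first
            done.append(task)      # "run" the dequeued thread
        # release(scheduler_lock) happens on leaving the with-block

workers = [threading.Thread(target=processor) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(done))   # 100: every runnable thread ran exactly once
```

Without the lock, two processors could dequeue the same context block or corrupt the list; with it, each task is run exactly once.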

Page 13:

Thread Program Designs

- Co-begin, par loop, launch at elaboration
- Fork / Join
- Implicit Receipt, Early Reply
- Block

So, now with these scheduling capabilities, there are a number of ways to design programs and create threads that communicate with each other so that the computation turns out correctly.

These two charts summarize these options, which you have all probably seen before. (go through)

Threads can be started in more nested ways (the first chart) or in a more spontaneous way (the second).

Different designs can be chosen based on the problem and equipment at hand.

Most of these are thread creation mechanisms, but individual threads have a variety of functions.

Page 14:

Abstract Thread Package

● Thread Creation and Initialization
– Allocate and initialize the control block and stack
– Place the thread on the ready queue, then remove it when the thread is executed
● Thread Resume
– Remove thread from the ready queue and restore registers
– Continue at the saved program counter

An abstract thread package, a general model for thread implementation packages, has several functions to maintain computations and communication (go through)

Page 15:

● Thread Blocking/Waiting
– Update stack with current register values and program counter
– Place thread on condition queue to wait for a signal event
– Continue down the ready queue
● Signaling a Blocked Thread
– Remove thread from the condition queue, place onto the ready queue
● Thread Finish
– Free stack and control block, and continue down the ready queue

(finish) The blocking mechanisms are known as synchronization functions.

Page 16:

Synchronization

Generally, synchronization is achieved using:
● Mutual exclusion locks
● Semaphores
● Condition variables (busy-wait, barriers)

The amount of synchronization needed limits the speed-up of a program; certain types of synchronization take less time and resources.

As we have been covering in class, synchronization is protecting data from being erroneously read or written by multiple threads. We have talked about several ways to do this, (list, briefly describe).

The cost of synchronization is one of the principal elements of the article that was assigned for today. (read)

To measure the effects of different choices, we can measure two things: the minimum time thread operations take to run, and the rate at which threads can be created.

Page 17:

Measuring Thread Performance

● Latency: cost of thread management when there is no contention for locked (protected) data

● Throughput: rate at which threads can be created, maintained, and finished when there is contention for locked data

The article calls these Latency and Throughput. (read definitions)
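A rough measurement sketch in Python (not the article's benchmark; thread counts and the no-op workload are arbitrary choices for illustration):

```python
import threading
import time

# Latency: time to create, run, and join one thread with no contention.
# Throughput: threads completed per second when many run at once.
def noop():
    pass

t0 = time.perf_counter()
t = threading.Thread(target=noop)
t.start()
t.join()
latency = time.perf_counter() - t0       # seconds for one uncontended thread

n = 50
t0 = time.perf_counter()
threads = [threading.Thread(target=noop) for _ in range(n)]
for th in threads:
    th.start()
for th in threads:
    th.join()
throughput = n / (time.perf_counter() - t0)  # threads finished per second

print(latency, throughput)
```

The absolute numbers depend entirely on the machine and thread package; what the article compares is how these two figures change across the management alternatives on the next slide.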

Page 18:

Thread Management Alternatives

● Single Lock: all central data structures protected by a single lock
● Multiple Locks: central data structures locked separately
● Local freelist: single, shared, locked ready_list; local, balanced free lists (control blocks, stacks)
● Idle queue: idle processor pre-allocates a control block and stack, waits for work
● Local ready queue: split up ready list among processors with separate locks; random choice

The article presents several ways of managing threads so that the data are protected. Using different data structures and locking mechanisms, we can serialize data access so that reads and writes are consistent and correct.

(go through)
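A Python sketch of the "local ready queue" alternative (illustrative only: the article has each idle processor pick a victim queue at random, while this sketch scans deterministically for simplicity; queue counts and task ids are invented):

```python
import threading
from collections import deque

# Each "processor" owns a small locked queue instead of sharing one
# big one, so processors rarely contend for the same lock.
NPROC = 4
queues = [deque() for _ in range(NPROC)]
locks = [threading.Lock() for _ in range(NPROC)]
done = []
done_lock = threading.Lock()

for task in range(40):                 # distribute runnable threads
    queues[task % NPROC].append(task)

def processor(me):
    while True:
        task = None
        # Try own queue first, then the others (article: random choice).
        for i in [me] + [j for j in range(NPROC) if j != me]:
            with locks[i]:             # lock only one small queue
                if queues[i]:
                    task = queues[i].popleft()
                    break
        if task is None:
            return                     # every queue was empty
        with done_lock:
            done.append(task)          # "run" the task

workers = [threading.Thread(target=processor, args=(i,)) for i in range(NPROC)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(done) == list(range(40)))
```

The tradeoff from the next slide is visible in the code: searching other queues costs time, but each lock protects little and is rarely contended, which is what keeps throughput high.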

Page 19:

Alternatives' Tradeoffs

● Single Lock: only 1 lock accessed | limited throughput
● Multiple Locks: more locks accessed | better throughput
● Local free lists: needs balance | fewer locks, high throughput
● Idle queue: extra work with busy processors | head start with idle processors
● Local ready queues: increasing search time | increasing throughput

Each of these alternatives has tradeoffs. (go through)

Page 20:

Latency:
● increases with more locks
● increases with more complexity

Throughput:
● limited by contention for locks

These are the results for the different alternatives. In terms of both latency and throughput, local ready queues proved to be the most efficient.

Using local, or per-processor, data structures avoids locking, so it decreases latency while keeping throughput high. Also, keeping the mechanism as simple as possible, with less checks or searches, saves time and increases throughput.

Throughput is mainly limited by contention for locks, which is worst when all data is under a single lock. Local ready queues keep locking to a minimum, and are really the only mechanism that can support very high creation rates.

Page 21:

Synchronization Alternatives

● Spin on exchange byte instruction (“xchgb”): loop on function that reads and writes a specific byte in memory until it's successful

● Spin on memory read: each processor tries once to execute the xchgb instruction, then, if it fails, loops on the read from memory

● Ethernet-style backoff: if the lock fails, try at times when it's more likely to be successful

Similarly, conditional synchronization can be done in several ways, with different trade-offs in latency and throughput.
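The three spin alternatives can be modeled in Python (a sketch only: `Lock.acquire(blocking=False)` stands in for the atomic xchgb, `Lock.locked()` for the plain memory read, and `time.sleep` for backoff; real spin locks use hardware atomics, and the delays and counts here are arbitrary):

```python
import threading
import time

flag = threading.Lock()            # the byte the xchgb would exchange
counter = [0]

def spin_acquire():
    delay = 1e-6
    while True:
        if not flag.locked():              # spin on memory read first...
            if flag.acquire(blocking=False):  # ...then try the "exchange"
                return
        time.sleep(delay)                  # Ethernet-style backoff:
        delay = min(delay * 2, 1e-3)       # exponential, capped

def worker():
    for _ in range(1000):
        spin_acquire()
        counter[0] += 1                    # the protected update
        flag.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter[0])   # 4000: every increment happened under the lock
```

Reading before exchanging keeps the (modeled) bus traffic down, and backing off after a failed try reduces attempts on a busy lock, at the cost of fairness and a slower handoff.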

Page 22:

Alternatives' Tradeoffs

● Spin on exchange byte instruction (“xchgb”): simple | consumes bus resources regardless

● Spin on memory read: decreases bus traffic | 1 processor's miss can travel down the line

● Ethernet-style backoff: less tries on a lock | unfair and longer time to acquire a lock

(go through)

Page 23:

[Graph: with few processors, spin on the exchange byte is best; with many processors, backoff and spin on memory read are best.]

The most interesting results for these synchronization methods are shown in this graph.

When there are only a few processors in use, spinning on the exchange bit was shown to be fastest, because it takes less time to get working after being unlocked. A processor is found immediately.

As more processors are added, contention increases the time it takes to hand a released lock to the next waiting processor, and the other algorithms start to win (about equally): they generate less traffic while waiting, so a waiting processor can acquire the lock and get to work on the data more quickly.