User-level Techniques on uniprocessors and SMPs Focusing on Thread Scheduling CSM211 Kurt Debattista
User-level Techniques on uniprocessors and
SMPs
Focusing on Thread Scheduling
CSM211
Kurt Debattista
Literature
• Any good operating system book – dinosaur book
• Vahalia U. Unix Internals. Prentice Hall, 1996
• Moores J. CCSP - A portable CSP-based run-time system supporting C and occam. Volume 57, Concurrent Systems Engineering Series, pages 147-168, IOS Press, April 1999
• Systems Software Research Group http://www.cs.um.edu.mt/~ssrg: literature on thread scheduling on uniprocessors and SMPs, and on avoiding blocking system calls (Vella, Borg, Cordina, Debattista)
Overview
• Processes
• Threads
– Kernel threads
– User threads
• User-level Memory Management
• User-level Thread Scheduling
• Multiprocessor hardware
• SMP synchronisation
• SMP thread schedulers
Time
Second 1
Millisecond 0.001
Microsecond 0.000001
Nanosecond 0.000000001
Processes
• UNIX-like operating systems provide for concurrency by means of time-slicing for long-lived, infrequently communicating processes
– Concurrency is usually an illusion (uniprocessor)
– Parallelism (SMPs, distributed systems)
• Within a single application?
Processes – Vertical switch
• Every time we enter the kernel (system call) we incur a vertical switch
– Save current context
– Change to kernel stack
– Change back
Processes – Horizontal switch
• Context switch from one process to another – horizontal switch
– Enter kernel (vertical switch)
– Dispatch next process
– Change memory protection boundaries
– Restore new process context
Processes - Creation
• Creating a new process
– Enter kernel (vertical switch)
– Allocate memory for all process structures
– Init new tables
– Update file tables
– Copy parent process context
• All operations on processes are in the order of hundreds of microseconds, sometimes milliseconds
Multithreading
• Programming style used to represent concurrency within a single application
• A thread is an independent instance of execution of a program represented by a PC, a register set (context) and a stack
Multithreading (2)
• In multithreaded applications threads co-exist in the same address space
• A traditional process could be considered an application composed of a single thread
• Example – web server
– Share the same data between threads
– Spawn threads
– Communicate through shared memory
Multithreading (3)
• Two types of threads
1. Kernel-level threads
2. User-level threads
Kernel threads
• Kernel threads are threads that the kernel is aware of
• The kernel is responsible for the creation of the threads and schedules them just like any other process
Kernel threads (2)
Advantages
1. Concurrency within a single application
2. No memory boundaries
3. IPC through shared memory, avoiding kernel access
4. Kernel interaction

Disadvantages
1. Thread creation (vertical switch)
2. Context switch (horizontal switch)
3. Kernel interaction
Kernel threads (3)
• Thread management in the order of tens of microseconds (sometimes hundreds)
– Creation
– Context switch
– Kernel-based IPC
• Fast shared-memory IPC in the order of hundreds (even tens) of nanoseconds
User threads
• Kernel is unaware of threads
• All scheduling takes place at the user level
• All scheduler data structures exist in the user-level address space
User-level Thread Scheduling
User-level thread scheduling
• Library
• Pre-emptive and cooperative multithreading
– Pre-emptive (like UNIX)
– Cooperative (think Windows 3.1)
• Performance in the order of tens of nanoseconds
User-level thread library
• Application is linked with a thread library that manages threads at run-time
• Library launches the main() function and is responsible for all thread management
Fast multithreaded C library
• Scheduler initialisation and shutdown
• Structures for the scheduler
– Run queue(s)
– Thread descriptor
– Communication constructs
• Provide functions for thread creation, execution, IPC, yield, destroy
Scheduler initialisation
• The traditional main() function is part of the library; the application programmer uses cthread_main() as the main function
• Initialise all scheduler data structures
• cthread_main() is in itself a thread
Scheduler structures
• Thread run queue
– Fast FIFO queue
– Priority-based scheduling
• Multiple queues
• Priority queue
• Thread descriptor
– PC and set of registers (jmp_buf)
– Stack (pointer to a chunk of memory)
Communication structures
• Structures for communications
– Semaphores
– Channels
– Barriers
– others
Thread API
• Provide functions for threads
– Initialisation – cthread_init()
– Execution – cthread_run()
– Yield – cthread_yield()
– Barrier – cthread_join()
– Termination – automatic
Context switch
• Use functions setjmp() and longjmp() to save and restore context (jmp_buf)
• setjmp() saves the current context
• longjmp() restores the context
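As a minimal sketch of this mechanism (illustrative only, not the lecture's cthread scheduler code), a context can be saved with setjmp() and resumed with longjmp() within the same function:

```c
#include <setjmp.h>

/* Demonstrates saving a context with setjmp() and restoring it with
   longjmp(); an illustrative sketch, not the cthread implementation. */
int switch_once(void) {
    jmp_buf ctx;
    volatile int resumed = 0;   /* volatile so the value survives longjmp */
    if (setjmp(ctx) == 0) {     /* direct return: context saved, returns 0 */
        resumed = 1;
        longjmp(ctx, 1);        /* restore context: setjmp now returns 1 */
    }
    return resumed;             /* 1 proves execution resumed at setjmp() */
}
```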
User-level Thread Scheduling
• Thread scheduling is abstracted from the kernel
• Thread management occurs at the user level
– No expensive system calls
• Mini-kernel on top of the OS kernel
• Ideal for fine-grained multithreading
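A mini-kernel of this kind can be sketched with POSIX ucontext (an assumption made here for a self-contained example; the lecture's library uses setjmp()/longjmp()). A worker thread runs entirely at user level and cooperatively yields back to the dispatcher with swapcontext():

```c
#include <ucontext.h>

/* Minimal cooperative user-level threading sketch using POSIX ucontext.
   run_two_slices() dispatches one worker thread twice, with no kernel
   scheduling involved beyond the initial context calls. */
static ucontext_t main_ctx, th_ctx;
static char th_stack[64 * 1024];
static int steps = 0;

static void worker(void) {
    steps++;                          /* first time slice, in user space */
    swapcontext(&th_ctx, &main_ctx);  /* cooperative yield to dispatcher */
    steps++;                          /* second slice after re-dispatch */
}                                     /* returning resumes uc_link */

int run_two_slices(void) {
    getcontext(&th_ctx);              /* template context for the thread */
    th_ctx.uc_stack.ss_sp = th_stack; /* user-allocated stack */
    th_ctx.uc_stack.ss_size = sizeof th_stack;
    th_ctx.uc_link = &main_ctx;       /* where to go when worker returns */
    makecontext(&th_ctx, worker, 0);
    swapcontext(&main_ctx, &th_ctx);  /* dispatch first slice */
    swapcontext(&main_ctx, &th_ctx);  /* dispatch second slice */
    return steps;
}
```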
User-level Thread Schedulers
• Successful user-level thread schedulers exist in the form of– CCSP
– KRoC
– MESH
– Mach Cthreads
– smash
– Sun OS threads
Thread Management Issues
• Interaction with the operating system
– Blocking kernel threads
– Multiplexing kernel threads
– Scheduler activations (Anderson)
– System call wrappers
– See Borg’s thesis for more info. (SSRG homepage)
• Thread co-operation
– Automatic yield insertion (Barnes)
• Active context switching (Moores)
Blocking Kernel Threads
• In single kernel-threaded schedulers, when the kernel thread blocks on a blocking system call the entire scheduler blocks
• Multiplexing kernel threads
– Reduces the problem (though the problem is still there)
– Increases the amount of horizontal switching
Blocking Kernel Threads (2)
• Partial solutions
– Allocate a kernel thread for particular functions (e.g. keyboard I/O)
– Horizontal switching
• System call wrappers
– Wrap all (or required) blocking system calls with wrappers that launch a separate kernel thread
– Require a wrapper for each call
– Ideal when wrappers already exist (e.g. occam)
– Incur horizontal switching overhead
– A blocking system call might NOT block
Scheduler Activations
• Scheduler activations offer interaction between user-level space and kernel space
• Scheduler activations are the executing context on which threads run (like kernel threads)
• When an activation blocks the kernel creates a new activation and informs the user space that an activation is blocked (an upcall)
• Moreover a new activation is created to continue execution
• One of the most effective solutions, but it must be implemented at the kernel level
• Also useful for removing the extended spinning problem on multiprogrammed multiprocessor systems
Scheduler Activations (2)
Web Server Example
User-level Memory Management
• Memory management can also benefit from user-level techniques
• Replace malloc()/free() with faster user-level versions
– Remove sbrk() system calls
– Remove page faults
User-level Memory Management (2)
• Ideal for allocating/de-allocating memory for similar data structures
– No fragmentation
– Pre-allocate a large chunk using malloc() whenever required
– User-level data structures handle allocation/de-allocation
• Results
– malloc()/free() 323 ns
– ul_malloc()/ul_free() 16 ns
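A sketch of this fixed-size-block approach (the names ul_pool_init/ul_malloc/ul_free and the block and pool sizes are illustrative assumptions, not the measured implementation): one malloc() grabs a chunk up front, and allocation becomes an O(1) free-list pop.

```c
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 64    /* assumed object size served by the pool */
#define NBLOCKS 1024     /* assumed pool capacity */

typedef union block { union block *next; char payload[BLOCK_SIZE]; } block;

static block *free_list = NULL;
static block *pool = NULL;

int ul_pool_init(void) {
    pool = malloc(NBLOCKS * sizeof(block));   /* single malloc() up front */
    if (!pool) return -1;
    for (size_t i = 0; i < NBLOCKS - 1; i++)  /* thread blocks into a free list */
        pool[i].next = &pool[i + 1];
    pool[NBLOCKS - 1].next = NULL;
    free_list = &pool[0];
    return 0;
}

void *ul_malloc(void) {      /* O(1): pop the head of the free list */
    block *b = free_list;
    if (b) free_list = b->next;
    return b;
}

void ul_free(void *p) {      /* O(1): push the block back onto the list */
    block *b = p;
    b->next = free_list;
    free_list = b;
}
```

No system call and no page fault occurs on the ul_malloc()/ul_free() fast path, which is where the order-of-magnitude gap in the timings above comes from.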
User-level Memory Management (3)
• More complex allocation/de-allocation is possible by building a complete user-level memory manager
– Avoid page faults (physical memory allocation)
– Enable direct SMP support
• Results
– malloc()/free() 323 ns
– ul_malloc()/ul_free() ca. 100 ns
Multiprocessor Hardware (Flynn)
• Single Instruction Single Data (SISD)
– Uniprocessor machines
• Single Instruction Multiple Data (SIMD)
– Array computers
• Multiple Instruction Single Data (MISD)
– Pipelined vector processors (?)
• Multiple Instruction Multiple Data (MIMD)
– General-purpose parallel computers
Memory Models
• Uniform Memory Access (UMA)
– Each CPU has equal access to memory and I/O devices
• Non-Uniform Memory Access (NUMA)
– Each CPU has local memory and is capable of accessing memory local to other CPUs
• No Remote Memory Access (NORMA)
– CPUs with local memory connected over a high-speed network
UMA
NUMA
Hybrid NUMA
NORMA
Symmetric Multiprocessors
• The UMA memory model is probably the most common
• The tightly-coupled, shared memory, symmetric multiprocessor (SMP) (Schimmel)
• CPUs, I/O and memory are interconnected over a high speed bus
• All units are located at a close physical distance from each other
Symmetric Multiprocessors (2)
• Main memory consists of one single global memory module
• Each CPU usually has access to local memory in terms of a local cache
• Memory access is symmetric
– Fair access is ensured
• Cache
Caching
• Data and instruction buffer
• Diminish relatively slow speeds between memory and processor
• Cache consistency on multiprocessors
– Write-through protocol
– Snoopy caches
• False sharing
Synchronisation
• Synchronisation primitives serve to
– provide access control to shared resources
– order events
• Valois describes the relationship between synchronisation methods:
– Lock-free: non-blocking, wait-free
– Mutual exclusion: blocking, busy-waiting
Synchronisation Support
• Synchronisation on multiprocessors relies on hardware support
• Atomic read and write instructions
• “Read-modify-write” instructions
– swap
– test-and-set
– compare-and-swap
– load-linked / store-conditional
– double compare-and-swap
• Herlihy’s hierarchy
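As an illustration of building on such a primitive, here is a sketch of an atomic increment implemented with compare-and-swap, written with C11 atomics (an assumption; the slides do not fix a particular instruction set or language):

```c
#include <stdatomic.h>

/* Lock-free increment via compare-and-swap: retry until no other
   thread changed *v between our load and our CAS. */
void cas_increment(atomic_int *v) {
    int old = atomic_load(v);
    while (!atomic_compare_exchange_weak(v, &old, old + 1))
        ;   /* on failure, old is reloaded with the current value */
}
```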
Mutual Exclusion
• Blocking or busy-waiting
• Rule of thumb is to busy-wait if time expected to wait is less than the time required to block and resume a process
• In fine-grained multithreaded environments critical sections are small, so busy-waiting is usually preferred
Spin locks
• Spin locks are most likely the simplest locking primitives
• A spin lock is a variable that is in one of two states (usually 1 or 0)
• Two operations act on a spin lock
1. Spin / acquire lock
2. Release lock
Spin Lock Implementation
Acquire spin lockspin: lock ; lock bus for btsl
btsl lock, 0 ; bit test and setjnc cont ; continue if
carry is 0
jmp spin ; else go to spin
cont:
Release spin locklock ; lock bus for btrl
btrl lock, 0 ; release lock
Test and Test and Set Lock
• Segall and Rudolph present a spin lock that does not monopolise the bus

Acquire spin lock – Segall and Rudolph:

    spin: lock              ; lock bus for btsl
          btsl lock, 0      ; bit test and set
          jnc cont          ; continue if carry is 0
    loop: btl lock, 0       ; test only
          jc loop           ; loop if carry is set
          jmp spin          ; else go to spin
    cont:
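The same test-and-test-and-set idea can be sketched in portable C11 atomics (an assumption made here for a self-contained example, not the lecture's assembly): the inner loop spins on plain reads, so the bus-locking exchange is only retried once the lock looks free.

```c
#include <stdatomic.h>

typedef atomic_int ttas_lock;   /* 0 = free, 1 = held */

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        if (atomic_exchange(l, 1) == 0)   /* test-and-set: lock acquired */
            return;
        while (atomic_load(l) != 0)       /* spin on reads only; stays in */
            ;                             /* the local cache, no bus lock */
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store(l, 0);                   /* release the lock */
}
```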
Alternative Spin Lock Techniques
• Anderson’s exponential back-off algorithm
• Pre-emption-safe algorithms for multiprogrammed environments
– Elder
– Marsh
• Hybrid locking/blocking techniques
– Ousterhout
– Lim
Lock-free Synchronisation
• Alternative form of synchronisation which dispenses with serialising concurrent tasks
• Lock-free algorithms rely on – powerful atomic primitives
– careful ordering of instructions
Lock-free Properties
• Lock-free data structures may have two properties
1. They may be non-blocking, in which case some operation on the structure is guaranteed to complete in finite time
2. If all such operations are guaranteed to complete in finite time, the data structure is said to be wait-free
• Any structure that uses mutual exclusion cannot be non-blocking or wait-free
Scheduling Strategies
• Scheduling strategies on SMPs can be broadly divided into two categories
1. Shared run queue schedulers, where processors acquire threads by accessing a shared run queue (usually implemented as a FIFO queue)
2. Per-processor run queue schedulers, where each processor acquires threads from its own local run queue (and in some cases from other run queues)
Shared Run Queue Thread Scheduling
Shared Run Queue Thread Scheduling (2)
• Central run queues balance the work load equally amongst all processors
• Threads are obtained from the shared run queue and serviced
• Serviced threads are placed back onto the run queue (unless they terminate)
• Shared run queues favour the principle of load balancing
Per Processor Run Queue Thread Scheduling
Per Processor Run Queue Thread Scheduling (2)
• Per processor run queue schedulers maintain a local run queue for each processor
• The principle of locality is favoured
• Contention is reduced because each processor references data close to it, reducing communication (which can lead to contention)
![Page 59: User-level Techniques on uniprocessors and SMPs Focusing on Thread Scheduling CSM211 Kurt Debattista.](https://reader035.fdocuments.us/reader035/viewer/2022062714/56649d235503460f949f963a/html5/thumbnails/59.jpg)
Locality vs. Load Balancing
• Both locality and load balancing are intended to improve performance
• Both policies however interfere with each other
• Per processor run queues best represent locality based scheduling
• Shared run queues best represent load balancing
![Page 60: User-level Techniques on uniprocessors and SMPs Focusing on Thread Scheduling CSM211 Kurt Debattista.](https://reader035.fdocuments.us/reader035/viewer/2022062714/56649d235503460f949f963a/html5/thumbnails/60.jpg)
Shared Run Queue Scheduler Problems
• When using shared run queues, threads very often migrate across processors, losing out on the principle of locality
• Poor locality on shared run queue schedulers also means increased contention, since processors always access the shared run queue
– especially for fine-grained multithreading
• Existing results are disappointing (SMP KRoC and SMP MESH) for fine-grained applications
![Page 61: User-level Techniques on uniprocessors and SMPs Focusing on Thread Scheduling CSM211 Kurt Debattista.](https://reader035.fdocuments.us/reader035/viewer/2022062714/56649d235503460f949f963a/html5/thumbnails/61.jpg)
Per Processor Run Queue Scheduler Problems
• Per processor run queue schedulers can run into load imbalance problems
• Threads are usually placed close to their parent
• Diagrams show the possible load imbalances compared with a shared run queue
Shared Run Queue Load Balancing
Per Processor Run Queue Possible Load Imbalances
Locality vs. Load Balancing (2)
• On NORMAs, the high cost of migrating threads for load balancing resolves the conflict in favour of locality (Eager et al.)
• On NUMAs, Bellosa also rules in favour of locality
• On UMAs, due to the low cost of thread migration, load balancing is usually preferred
Locality vs. Load Balancing (3)
• The situation might indeed be true for coarse-grained threads (Gupta et al.)
– Less contention for shared resources
– Locality in terms of cache affinity scheduling is still catered for (since processes are long-lived and do not synchronise)
• For fine-grained multithreading the situation is different (Markatos and LeBlanc, Anderson)
– Contention is increased
– Caching is not used
Locality vs. Load Balancing (4)
• Results by Anderson indicate per processor scheduling has its advantages
• Results of shared run queue based fine grain thread schedulers are relatively disappointing
• Markatos and LeBlanc point out that the widening gap between CPU speeds and memory speeds will favour locality
Shared Run Queue Scheduler’s Run Queues
• When scheduling using a shared run queue, the choice of run queue implementation is a fundamental issue
• Michael and Scott survey the fastest concurrent queues
– Michael and Scott’s non-blocking queue
– Valois’ non-blocking queue
– Michael and Scott’s dual-lock queue (including pre-emption-safe locks)
– Mellor-Crummey’s lock-free queue
– Traditional single-lock queue
Michael and Scott’s Dual-Lock Concurrent Queue
• Head and tail are protected by means of separate locks
• Enqueue and dequeue operations can occur concurrently
• The algorithm is possible because a dummy head node serves as a vehicle to transport data (avoiding head and tail interaction)
Michael and Scott’s Dual-Lock Concurrent Queue (2)
typedef struct {
    node * head;
    node * tail;
    int taillock;
    int headlock;
} queue;

void enqueue(queue * q, data * d) {
    node * n = node_init(d);
    spinlock(&q->taillock);
    q->tail->next = n;
    q->tail = n;
    spinrelease(&q->taillock);
}
Michael and Scott’s Dual-Lock Concurrent Queue (3)
data * dequeue(queue * q) {
    node * n, * new_head;
    data * d;

    spinlock(&q->headlock);
    n = q->head;
    new_head = n->next;
    if (new_head == NULL) {          /* only the dummy node: queue empty */
        spinrelease(&q->headlock);
        return NULL;
    }
    d = new_head->data;              /* data travels in the new dummy */
    q->head = new_head;
    spinrelease(&q->headlock);
    return d;
}
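Putting the two halves together, here is a self-contained single-threaded sketch of the dummy-node logic (the spinlocks, node_init() and the data type from the slides are elided as assumptions so the queue behaviour itself can be exercised):

```c
#include <stdlib.h>

/* Michael & Scott two-lock queue structure, locks omitted: a dummy
   node keeps head and tail operations from touching the same node. */
typedef struct node { struct node *next; int data; } node;
typedef struct { node *head, *tail; } msqueue;

msqueue *queue_init(void) {
    msqueue *q = malloc(sizeof *q);
    node *dummy = malloc(sizeof *dummy);   /* dummy head node */
    dummy->next = NULL;
    q->head = q->tail = dummy;
    return q;
}

void enqueue(msqueue *q, int d) {
    node *n = malloc(sizeof *n);
    n->data = d;
    n->next = NULL;
    q->tail->next = n;     /* link the new node at the tail */
    q->tail = n;
}

/* returns 1 and stores the value in *out, or 0 if the queue is empty */
int dequeue(msqueue *q, int *out) {
    node *old = q->head;
    node *new_head = old->next;
    if (new_head == NULL) return 0;   /* only the dummy left: empty */
    *out = new_head->data;            /* new_head becomes the new dummy */
    q->head = new_head;
    free(old);
    return 1;
}
```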
Per Processor Run Queue Scheduler – Load Balancing
• Per processor run queue schedulers need methods of load balancing to avoid strong load imbalances
• Threads could be mapped statically to processors at compile-time (only useful if all threads have the same longevity)
• Threads can be migrated across run queues or across common pools of threads
Multiprocessor Batching