CS350: Operating Systems


Helena S. Ven

27 Apr. 2019


Language: C, MIPS

Instructor: Lesley Ann Istead

Textbook: Operating Systems: Three Easy Pieces

Topics:

1. Introduction

2. Threads and Concurrency

3. Synchronisation

4. Processes and the Kernel

5. Virtual Memory

6. Scheduling

7. Devices and Device Management

8. File Systems

9. Virtual Machines

Important things to note:

• See the appendix for path names.

• This document is based on the lecture slides of Prof. Lesley Ann Istead.

Page 3: CS350: Operating Systems

hsv experimental 05/01/20 13:08CS 350

Index

1 Introduction
  1.1 Theme: Virtualisation
  1.2 Theme: Concurrency
  1.3 Theme: Persistence
  1.4 G-Notation

2 Concurrency and Synchronisation
  2.1 Sequential Programs
  2.2 Spawning Threads
      2.2.1 Yield and Pre-emption
  2.3 Atomicity and Volatility
      2.3.1 Volatile
  2.4 Spinlocks
  2.5 Wait channels
      2.5.1 Locks
      2.5.2 Deadlocks
  2.6 Semaphores
      2.6.1 Producer-Consumer Queue
  2.7 Condition Variables
      2.7.1 Join and Exit
      2.7.2 Producer-Consumer Queue with Condition Variables

3 Processes and the Kernel
  3.1 Process Creation
  3.2 System Calls for Processes
      3.2.1 execv()
  3.3 Kernel Privilege
  3.4 Inter-Process Communications

4 Virtual Memory
  4.1 Physical Memory
  4.2 Address Translation
      4.2.1 Dynamic Relocation
      4.2.2 Segmentation
      4.2.3 Paging
  4.3 Translation Lookaside Buffer
  4.4 Address Space in OS/161
  4.5 Executable and Linkable Format
  4.6 Virtual Memory for the Kernel
  4.7 Exploiting Secondary Storage
      4.7.1 Optimising Page Faults

5 Scheduling
  5.1 Simple Scheduling Model
  5.2 Multi-Level Feedback Queues
  5.3 Linux Completely Fair Scheduler
  5.4 Scheduling on Multi-Core Processors
      5.4.1 Load-Balancing

6 Devices and Device Management
  6.1 Device I/O
      6.1.1 Device Drivers
  6.2 Hard Disks
      6.2.1 Disk Head Scheduling
      6.2.2 Solid State Drives
      6.2.3 Persistent RAM

7 File Systems
  7.1 File Interface
  7.2 Directories and File Names
      7.2.1 Links
      7.2.2 Virtual File Systems
  7.3 File System Implementation
      7.3.1 i-nodes
      7.3.2 Alternatives to Pointers
  7.4 File System Design

8 Virtual Machines
  8.1 Hypervisors

A OS/161
  A.1 Testing The OS
      A.1.1 Debugging
  A.2 Directory Abbreviations


Caput 1

Introduction

Generally, an operating system is a system that

• Manages resources

• Creates execution environments

• Loads programs

• Provide common services and utilities

The operating system is distinct from the desktop environment. The primary way the OS operates is by virtualisation: turning a physical resource into a more general virtual form, so that the von Neumann model of executing programs line by line remains valid in this virtual machine. The purpose of virtualisation is to make the system easy to use.

Three perspectives:

• (Application) What service does an OS provide?

The OS provides an execution environment that

– Provides processor time and memory space.

– Provides interfaces for network, storage, I/O devices, and other system hardware components.

– Isolates running programs from one another and prevents unauthorised interactions.

The execution environment includes abstractions such as

– Files and file systems → Secondary storage

– Address space → Primary Memory

– Processes, threads → Program execution

– Sockets, pipes → Network or other message channels.

• (System) What problem does an OS solve?

The OS

Descriptio 1.1: Schematic View of an Operating System. User programs sit above the OS kernel and request services through system calls; the kernel issues commands and data to the underlying resources (CPU, memory).


– Manages the hardware resources of a computer system, including processors, memory, disks and other storage devices, network interfaces, and I/O.

– Allocates resources among running programs

– Controls the sharing of resources.

The OS itself also must share resources with these programs.

• (Implementation) How is an OS built?

The OS is a concurrent (Multiple programs running at the same time), real-time program:

– Concurrency arises naturally in an OS when it supports concurrent applications.

– Hardware interactions impose timing constraints.

– Programs must respond to events within specific timing constraints.

An OS exports an API, the system calls, to applications. Since the OS provides these calls to running programs, the OS effectively provides a standard library to applications.

The kernel1 of the operating system is the part that responds to system calls, interrupts, and exceptions. The operating system as a whole includes the kernel, utilities, command interpreters, and programming libraries.

A real-time OS is an OS with stringent event response times and pre-emptive scheduling. Linux and Windows are not real-time OSes. Real-time OSes have very specific applications, such as automated driving.

1.1 Theme: Virtualisation

When the CPU is virtualised, many programs can be executed concurrently on a single core, but this is only an illusion: the operating system creates a very large number of virtual CPUs from a single physical CPU. The UNIX fork system call does exactly this.

[Diagram: Processes 1, 2, and 3 each run on their own virtual CPU, all multiplexed onto the single physical CPU.]

The OS likewise virtualises memory. Each process has its own virtual address space which starts at 0. This address is mapped via a table into physical memory.

The execution environment provided by the OS includes a variety of abstract entities.

Abstraction              Device
Files and File systems   Secondary Storage
Address spaces           Primary Memory (RAM)
Processes/Threads        Program Execution
Sockets, Pipes           Network or other message channels

1.2 Theme: Concurrency

The OS needs to handle concurrency: a process may spawn many threads2. A thread is a function running within the same memory space as other functions, with more than one of them active at a time. This generates problems such as what happens when two threads modify the same address.

The OS itself is a concurrent program. Why threads?

• Resource Utilisation: Blocked/Waiting threads give up resources e.g. CPU, to others

1This document will describe a monolithic kernel, one in which the file system and utilities are integrated, instead of a minimalistic microkernel.

Most consumer operating systems are monolithic. A microkernel OS has only the bare minimum, and everything else is delegated to user programs, e.g. GNU Hurd.

2e.g. Using the UNIX library pthread


• Parallelism: Multiple threads executing simultaneously

• Responsiveness: Dedicate threads to UI, others to loading or long tasks.

• Priority: Higher priority threads have more CPU time

• Modularisation: Organisation of execution tasks/responsibilities.

1.3 Theme: Persistence

In system memory, data stored in volatile devices such as DRAM may be lost. Thus we need hardware and software to be able to store data persistently. The hardware comes in the form of I/O devices; a hard drive is a common repository for long-lived information.

The software in the OS that manages the disk is the file system, which is responsible for storing any files the user creates.

1.4 G-Notation

To assist the reader's understanding, we use an intuitive graphical representation of code instead of writing the lines of code.

A function call is a box.

The no-op symbol represents no operation being done. The end of a function is represented by a terminator symbol. A condition is drawn as a branch, e.g. one arm for x = 0 and one for x ≠ 0.

Frequently a Boolean is represented by flags (one glyph for false, one for true). Triggering an assertion is represented by an explosion.


Caput 2

Concurrency and Synchronisation

Detour

The predecessor of UNIX is MULTICS. MULTICS had very innovative features:

• Everything is a file

• “single level storage”: No separation between file space and address space.

• Dynamic linking: Processes can access data and code outside of their address space, and add/use/execute this data.

• Re-configuration of hardware while operating. e.g. Removal of CPU, disks.

• Designed for security: Rings, master to user

• Hierarchical file system and symbolic links

• Per-process stacks within the kernel

• Command processor as user application.

2.1 Sequential Programs

A sequential program consists of a single thread1 of execution. A sequential program executes the Fetch-Execute Cycle:

1. Fetch instruction PC points to

2. Decode and Execute instruction

3. Move the PC

Each sequential program has a stack2. Threads provide a way to express concurrency and parallelism:

• They enable parallel execution if the underlying hardware supports it, in which case programs can run faster.

• They enable better processor utilisation: when one thread blocks, another thread may execute.

Threads may block, i.e. cease execution for a period of time or until some condition has been met. When a thread blocks, it is not executing instructions. Threads also:

• Avoid blocking the program due to slow I/O. This problem arises when the program sends/receives a message or waits for a page fault to finish. Threads enable overlap of I/O with other activities within a single program.

• Make it easy to share data: instead of using multiple processes, the threads share the same address space.

1A thread is represented as a structure or object

2The “top” of the stack has the lowest address by convention!


Descriptio 2.1: Address space for a single/multi-threaded application. The address space runs from 0 to 2^32 − 1 and contains the instructions, the heap, free space, and the stack. In a multi-threaded application, each thread has its own stack (stack 1, stack 2, ...) within the same address space.

The state of a single thread is similar to that of a process. The thread itself has

• A program counter (PC)

• Set of registers for computation

• A stack, or thread-local storage. The heap is shared, while the stack is not.

We use threads because:

1. Resource Utilisation: Blocked/Waiting threads give up resources (e.g. CPU) to other threads.

2. Parallelism: Multiple threads execute simultaneously which improves performance.

3. Responsiveness: Dedicate threads to UI and others to loading or long tasks.

4. Priority: Priorities determine the amount of CPU time

5. Modularisation: Organisation of execution tasks and responsibilities.

2.2 Spawning Threads

Threads provide a way for programmers to express concurrency in a program. In threaded concurrent programs there are multiple threads of execution, all occurring at the same time.

Kernel API

($KI/thread.h)

• thread_fork(NAME, PROC, ENTRY, DATA1, DATA2): Create a new thread from an existing one. If PROC is empty, the process is inherited from the caller. The new thread will start on the same CPU and calls ENTRY(DATA1, DATA2). The types are void *data1 and unsigned long data2.

• thread_exit(): Terminate the calling thread.

• thread_yield(): Yield execution but remain runnable.

The implementation is in $KTh/thread.c.

Related thread libraries and functions:

• join: Block current thread until another thread finishes. Does not exist in OS/161.

• pthreads: POSIX threads, a popular threading API

• OpenMP: A cross-platform, simple multi-processing and threading API.


• GPGPU Programming: General purpose GPU programming.

All threads share access to the program’s global variables, but their function activations are private to that thread. There are several possible implementations:

1. Hardware support: If there are P processors, each with C cores and a multithreading degree of M per core, then P · C · M threads can execute simultaneously.

2. Time-sharing: Multiple threads take turns on the same hardware, rapidly switching between each thread.

3. Combination of Hardware support and time-sharing.

When timesharing, the switch from one thread to another is a context switch. Three steps happen:

1. Decide which thread will run next (scheduling)

2. Save register contents of current thread

3. Load register contents of next thread

A context switch can happen when

1. Call to thread_yield, voluntarily allow another thread to run.

This is a high-level context switch.

2. Call to thread_exit, terminate the current thread.

3. Thread blocks via a call to wchan_sleep.

4. The running thread is pre-empted (involuntarily stops running). This requires the use of a scheduler.

This is a low-level context switch.

[Diagram: thread states. Threads in the ready pool are dispatched to the CPU (Running). A running thread returns to Ready via pre-emption or thread_yield, becomes Blocked on a wait channel via wchan_sleep when a resource is unavailable, is moved back to Ready by wchan_wakeall/wchan_wakeone when the resource becomes available, and leaves the system via thread_exit.]

A thread can be in one of three possible states:

• Running: Currently executing

• Ready: Ready to execute

• Blocked: Waiting for something, not ready to execute.

Thread execution continuously changes the context, so the context must be saved and restored carefully. The code handling context switches is in $KAM/thread/switch.S, which switches from the old switchframe pointer in $a0 to the new switchframe pointer in $a1. The registers stored are

1. $s0 – $s6, $s8, the callee-save registers except $s7.

2. $gp, the global pointer.

3. $ra, the return address.


Descriptio 2.2: Stack during a yield and stack during a pre-emption. In a voluntary yield, the stack holds the existing frames, then frames for thread_yield and thread_switch, with the switchframe on top. In a pre-emption, the interrupt pushes a trap frame and interrupt-handler stack frames before thread_yield, thread_switch, and the switchframe.

2.2.1 Yield and Pre-emption

Switching from one thread to another is a context switch3. When a thread calls thread_yield, it voluntarily gives up CPU time to other threads. There are several causes for a context switch:

Cause                      Effect
thread_yield               Voluntarily allows other threads to run
thread_exit                Current thread is terminated
Blocks (via wchan_sleep)   Dormant until a condition becomes true
Pre-emption                Involuntarily allows other threads to run

1. Program calls thread_yield.

2. thread_yield calls thread_switch.

3. thread_switch chooses a new thread.

4. thread_switch creates a switchframe on the top of the stack.

5. Calls switchframe_switch to perform low-level context switch.

thread_switch is the caller. It saves/restores caller-save registers.

switchframe_switch is the callee. It saves/restores callee-save registers.

MIPS R3000 is pipelined. Delay slots are used to protect against:

• Load-use hazards

• Control hazards

A thread can also give up CPU time involuntarily (pre-emption). In timesharing, the scheduling quantum is a limit on the CPU time a thread can use before it must yield the CPU. The scheduling quantum is reset every time a thread is dispatched to the CPU, and the thread is not obligated to use the CPU until the scheduling quantum expires (i.e. the quantum is an upper bound). Pre-emption requires the threading library to have an interrupt that transfers control from a thread to the threading library.

The scheduling quantum can be neither too long nor too short:

• Too short: Much of the CPU time is wasted on excessive context switches.

• Too long: The response time of processes may be too large. This is especially a problem for interactive applications.

3The code for handling context switch is in $KTh/thread.c:thread_switch and this calls $KAMT/switch.S:switchframe_switch


An interrupt is an event that occurs during the execution of a program. Interrupts are caused by system devices (a timer, disk controller, network interface, etc.). When an interrupt occurs, the hardware transfers control to a fixed location, the interrupt handler. The interrupt handlers (in a trap table) are initialised during bootup.

The interrupt handler does:

1. Move the stack pointer from the user stack to the kernel stack.

Remember: No kernel data on user stack!

2. Create a trap frame to record thread context at time of interrupt.

3. Determine which device caused the interrupt and perform device-specific processing.

4. Restore the saved thread context from the trap frame.

Note. Another interrupt cannot happen during the execution of an interrupt handler, because the interrupt handler disables interrupts. However, this behaviour is architecture-specific.

If the interrupt handler determines that the current thread exceeded its quantum, thread_yield is called:

[Diagram: stack after a pre-emption. The interrupt pushes a trap frame and the interrupt handler's stack frames on top of the existing frames; the handler calls thread_yield, which calls thread_switch, which calls switchframe_switch, leaving the switchframe on top of the stack.]

A key difference between trap and switch frames is that the switch frame does not need to store registers that don’t have to be preserved across function calls. Pre-emption (involuntary) generates a trap frame; yielding (voluntary) generates a switch frame.

Detour

DTSS (Dartmouth Time Sharing System), which ran on GE hardware, is one of the first time-sharing operating systems.

2.3 Atomicity and Volatility

A problem arises when we allow two threads to access the same address space. Suppose there is a variable a, and threads 1 and 2 are both instructed to perform a ← a + 1 and launched in parallel. In MIPS, the instructions for this are (for example)

lw   $v0, 0x8049a1c($0)
addi $v0, $v0, 1
sw   $v0, 0x8049a1c($0)

[Diagram: two interleavings. If both threads execute lw before either executes sw, one increment is lost and the result is a + 1; if one thread's lw/sw pair completes before the other's lw, the result is a + 2.]


The final value of a can be a + 1 or a + 2, depending on how the scheduler arranges the threads. This situation is a race condition (or, more specifically, a data race), since the result now depends on the order of execution of the code. The result of this program is no longer deterministic. This code is a critical section: a piece of code that accesses a shared variable and must not be concurrently executed by more than one thread.

Race conditions can occur for other reasons. The volatile keyword prevents compiler optimisations that cache variables in registers. This ensures that each access to the volatile variable is an access to memory. Variables (subject to change) shared between threads should be declared volatile.

The CPU can also re-order load instructions, which sometimes causes race conditions. The MIPS R3000 CPU does not have this feature.

Detour

To find a critical section, we can use several techniques:

• Inspect each variable and determine whether it is possible for multiple threads to read/write the variable at the same time.

• Constants and read-only memory do not cause race conditions.

A solution to this problem is to have an instruction that computes a ← a + 1 in a single step, or atomically4. An example of a non-atomic instruction is assignment: assigning one 32-bit integer to another can require two steps, moving 16 bits each time, which could corrupt data in unexpected ways. Unfortunately, most instruction combinations do not have such a bundled instruction.

Thus what we can do instead is to build a general set of synchronisation primitives.

2.3.1 Volatile

It is faster to access values from a register than from memory. Compilers optimise for this. However, if we have

int sharedval = 0;

int f() { ... }

int g() { ... }

If f and g both access sharedval, the value of this shared variable may diverge in the registers of the two functions, even if we use synchronisation primitives. The compiler may also reorder accesses. The volatile keyword prevents these optimisations.

volatile int sharedval = 0;

int f() { ... }

int g() { ... }

2.4 Spinlocks

The fundamental problem in concurrent programming is to execute a series of instructions atomically. In this chapter we attack this problem directly using a lock. Consider the critical section

a = a + 1;

To use a lock, we declare a global lock variable of some kind, and check the availability of the lock around the criticalsection:

lock_t mutex; // global lock variable
...
lock(&mutex);
a = a + 1; // critical section
unlock(&mutex);

The state of the lock variable at any time is either available (or unlocked, free), meaning no thread holds the lock, or acquired (or locked, held). A lock provides mutual exclusion and is often called a mutex.

An attempt to acquire a held lock blocks the current thread until the lock is released, avoiding any race conditions. If the lock is acquired by a thread, that thread is the owner of the lock.

There are several criteria we aim for when building a lock:

4“Atomic” means as a unit. A commonly associated phrase is “all or nothing”: to an external observer it should either appear as if all the actions packaged in an atom occurred, or none of them occurred.


Descriptio 2.3: Naïve implementation of a spinlock. The acquire operation repeatedly tests the flag and sets it once it is found clear; the release operation clears the flag.

• Mutual Exclusion: A lock prevents multiple threads from entering the same critical section.

• Fairness: Every thread contending for a lock has a chance to acquire the lock, and thus is not starved.

• Performance: The time overhead added by using the lock should be minimal. There are a few cases:

1. No contention: A single thread is running, acquires and releases the lock.

2. Multiple contending threads on same CPU

3. Multiple contending threads on different CPUs.

One of the earliest solutions on a single-processor system is to disable interrupts during critical sections:

void acquire() { disableInterrupts(); }
void release() { enableInterrupts(); }

This ensures any code placed in an acquire(); /* ... */ release(); block executes atomically, since control can never be handed to another thread. This method is very simple but has many drawbacks:

• The thread needs to perform a privileged operation, i.e. turning the interrupts on and off, so we must trust whatever thread makes the acquire() and release() calls. A program can monopolise the processor by calling acquire() at the beginning of its execution, and there is no way for the OS to regain control of the system.

• This method will not work on multiple processors.

• Turning off interrupts is a very risky operation and is inefficient.

Nevertheless, this approach can be used by the internals of the OS, since the OS can trust itself and ensure atomicity when accessing its own data structures.

A very naïve implementation of the lock is a spinlock5 based on a Boolean value. The lock spin-waits, i.e. has its condition repeatedly checked in a loop.

typedef bool *spinlock_t;

void spinlock_acquire(spinlock_t l)
{
    while (*l); // "Block" the thread until the lock is released
    *l = true;
}

void spinlock_release(spinlock_t l)
{
    *l = false;
}

Unfortunately, this does not work, since the dereference and compare operations are not guaranteed to be atomic. The spin-waiting of the lock also wastes resources.

We have to rely on hardware-based functionality to provide a lock. The simplest bit of hardware is a test-and-set (or atomic exchange6) instruction. Test-and-set implements the following function atomically:

int test_and_set(int *old_p, int new)
{
    int old = *old_p;
    *old_p = new;
    return old;
}

5Interface for spinlock can be found in $KI/spinlock.h

6xchg src, addri in x86


This function “tests” the old value while simultaneously “setting” the new value. The key is that this sequence must be executed atomically, which can be implemented in hardware. On top of this it is possible to build a simple spin-lock.

struct spinlock
{
    /*
     * 0: Lock available
     * 1: Lock held
     */
    int flag;
};

void spinlock_init(struct spinlock *l) { l->flag = 0; }

void spinlock_acquire(struct spinlock *l)
{
    while (test_and_set(&l->flag, 1) == 1); // Spin-wait
}

void spinlock_release(struct spinlock *l) { l->flag = 0; }

This lock requires a pre-emptive scheduler, one which interrupts a thread via a timer. The spin-lock does not provide fairness guarantees: threads under contention may spin forever, leading to starvation. Spin-locks work reasonably well if the number of threads is similar to the number of CPUs. Another hardware primitive is compare-and-swap (on SPARC) or compare-and-exchange (on x86), which does:

On a single cycle, spin-locks work reasonably well if the number of threads is similar to the number of CPUs.Another hardware primitive is compare-and-swap (on SPARC) or compare-and-exchange (on x86), which does:

int compare_and_swap(int *ptr, int expected, int new)
{
    int old = *ptr;
    if (old == expected)
        *ptr = new;
    return old;
}

We can likewise build a spin-lock with this:

void spinlock_acquire(struct spinlock *l)
{
    // Spin as long as the old value is 1
    while (compare_and_swap(&l->flag, 0, 1) == 1);
}

In MIPS, a pair of (I-format) instructions exist to help build locks:

• ll $t, i($s) (Load-linked): $t← [$s + i]

• sc $t, i($s) (Store-conditional):

– (Success): If the value at [$s + i] has not changed since the last ll instruction, store [$s + i] ← $t and set$t← 1

– (Failure): If the value has been changed, $t← 0

The lock can then be implemented as

void spinlock_acquire(struct spinlock *lock)
{
    while (1)
    {
        while (ll(&lock->flag) == 1); // Spin
        if (sc(&lock->flag, 1) == 1)
        {
            // If flag := 1 succeeded, return. Otherwise try again.
            return;
        }
    }
}


or it can be written succinctly using short-circuiting:

void spinlock_acquire(struct spinlock *lock)
{
    while (ll(&lock->flag) || !sc(&lock->flag, 1));
}

Kernel API

In OS/161, a spinlock structure is defined in $KI/spinlock.h, which offers the interface functions:

• spinlock_init(lock): Initialises the lock object.

• spinlock_acquire(lock)

• spinlock_release(lock)

The implementation is in $KTh/spinlock.c.

The spinlock keeps track of its owner (a struct cpu *), since any thread owning the spinlock effectively owns the CPU.

Consider the code

li $t, 1     // Load the value 1 into $t
ll $s, lock  // Load the value of the lock into $s
sc $t, lock  // Try to store $t into the lock if its value has not changed

At the end of the sc instruction, the values of $s and $t can be used to determine the owner of the spinlock:

$s (lock before ll)   $t (success?)   lock (before sc)   lock (after sc)   Status
0                     0               1                  1                 Cannot determine whether the lock is held
0                     1               0                  1                 Current thread acquired the lock
1                     0               0                  0                 Cannot determine whether the lock is held
1                     1               1                  1                 Another thread holds the lock

Note that since a register can take non-binary values, it is impossible to determine the status of the lock when $t = 0. An example usage is an atomic c ← c + 1.

Note: because spinlocks can replace locks but are inefficient, their operation icons are the same as locks but with a top.

2.5 Wait channels

Sometimes a thread will need to wait for something, e.g.

• Wait for a lock to be released by another thread

• Wait for data

• Wait for input from keyboard

• Wait for busy device to idle.

When a thread sleeps, it stops running:

1. Scheduler chooses a new thread to run.

2. Context switch from blocking thread to the new thread

3. Sleeping thread is held in a wait queue.

4. Eventually a sleeping thread is signaled and awakened by another thread.

5. Sleeping threads are not awakened by the scheduler.

The queue is realised by a wait channel


Kernel API

Wait channels can be found in $KI/wchan.h and are implemented with queues in $KTh/wchan.c.

• wchan_create(char const *name): Name should be a string constant.

• wchan_destroy(struct wchan *): Destroy a wait channel, which must be empty and unlocked.

• wchan_sleep(struct wchan *): Blocks the calling thread on the given wait channel, causing a context switch. The current thread sleeps until it is woken up by wchan_wakeone or wchan_wakeall.

Pre-condition: The channel must be locked. Post-condition: The channel is unlocked.

• wchan_wakeall(struct wchan *): Unblock all threads sleeping on the channel.

• wchan_wakeone(struct wchan *): Unblock one thread (i.e. dequeue) sleeping on the channel.

• wchan_lock(struct wchan *): Prevent operations on the channel. The wait channel holds a spinlock, and this operation acquires it.

• wchan_unlock(struct wchan *): Unlock the wait channel lock.

• wchan_isempty(struct wchan *)

struct wchan * is the data type for a wait channel.

Recall Figure 2.1. Ready threads are queued on the ready queue. Blocked threads are queued on wait channels.

2.5.1 Locks

The spin-lock wastes resources on a single CPU, or when the number of threads exceeds the number of CPUs. Spending a scheduling quantum spinning is not an efficient use of resources, so we have to get creative.

Instead of waiting endlessly on a spin-lock, we can yield when the flag is active:

void lock(lock_t* lock)

{

while(test_and_set (&lock ->flag , 1) == 1)

yield ();

}

Instead of spinning to wait on a condition, this lock yields the current thread. Unfortunately, it is still costly on a round-robin scheduler. If there are 100 threads and the thread holding the lock is pre-empted, the other 99 threads will each perform a context switch. We also have not solved the starvation problem at all: a thread could get caught in an endless yield loop.

Instead of leaving everything to chance, we use a queue to store the next thread that is supposed to wake up. The lock stores a spinlock (“guard”), a boolean (“flag”), and a wait channel (“queue”).

struct lock

{

int flag;

struct spinlock *guard;

struct wchan *wc;

};

• lock_acquire:

1. Acquire the guard lock

2. If the flag is not set, set the flag and release the guard lock.

Now the current thread owns the lock.

3. If the flag is set

– (wchan_lock) Enqueue the current thread onto the wait channel.

– (spinlock_release) Release the guard lock.

– (wchan_sleep) Sleep on the queue.


Descriptio 2.4: Second attempt at a lock

Descriptio 2.5: Implementation of a lock given in spring 2019

• lock_release:

1. Acquire the guard lock.

2. If the wait channel is empty, unset the flag.

3. Otherwise, wake up one thread from the wait channel.

• lock_try_acquire:

1. Acquire guard lock

2. If lock is held, release and return false.

3. Otherwise, set the lock owner.

4. Release the spin lock and return true.

Kernel API

A lock interface is provided in $KI/synch.h, with the functions:

• lock_create(char const *name): Returns struct lock *

• lock_destroy(struct lock*)

• lock_acquire(struct lock*)

• lock_release(struct lock*)

Unlike spinlocks, the implementation is not given and a stub is provided in $KTh/thread.c.

Assignment 1 Part 1. Implement the lock in $KI/synch.h based on the stub in $KTh/thread.c. Then test your lock with sys161 kernel "sy2;q". Find a way for a thread to test its ownership of the lock.

The while loop in the implementation is necessary. Consider the following example:

1. Suppose there are 3 threads.

2. Thread 3 acquires the lock.

3. Threads 1 and 2 call lock_acquire and sleep on the wchan.

4. Thread 3 releases the lock, waking thread 1. Before thread 1 runs, another thread may acquire the lock, so when thread 1 finally runs the flag can be set again: it must re-check the flag in a loop rather than assume it now owns the lock.


2.5.2 Deadlocks

Consider the locks and two functions:

lock *lockA , *lockB;

int funcA()

{

lock_acquire(lockA);

lock_acquire(lockB);

...

lock_release(lockA);

lock_release(lockB);

}

int funcB()

{

lock_acquire(lockB);

lock_acquire(lockA);

...

lock_release(lockB);

lock_release(lockA);

}

Consider the following order of execution:

Thread 1                       Thread 2
lock_acquire(lockA)
                               lock_acquire(lockB)
lock_acquire(lockB)  (waits)
                               lock_acquire(lockA)  (waits)

This creates a deadlock, since each thread is now waiting for the other to release a lock. They are permanently stuck.

Two techniques to solve this problem:

• No hold and wait: Prevent a thread from requesting resources if it currently holds allocated resources. A thread that wishes to hold several resources must lock them all at once.

Example for 2 resources:

lock_acquire(la);

while (!lock_try_acquire(lb))

{

lock_release(la);

lock_acquire(la);

}

This impacts performance.

• Resource ordering: Order the resource types and require that each thread acquire resources in increasing resource-type order.

2.6 Semaphores

A semaphore7 is a synchronisation primitive for enforcing mutual exclusion and solving some other synchronisation problems.

7from Ancient Greek σῆμα (sêma, “sign”) and -φόρος (-phóros, “bearing, bearer”)


Definition

A semaphore is an object with an integer value which supports two atomic operations:

• P a (“wait”): If the semaphore value is greater than 0, decrement the value. Otherwise wait until the value is greater than 0, and then decrement.

• V b (“signal”): Increment value of the semaphore.

Types of semaphores:

• Binary semaphore: A semaphore with a single resource, behaves like a lock but does not track ownership.

• Counting semaphore: A semaphore with an arbitrary number of resources.

afrom Dutch passering, “Passing” a railroad signal

bfrom Dutch vrijgave, “Release”

Kernel API

Interface of a semaphore is found in $KI/synch.h. The type is struct semaphore *.

• sem_create(char const *name, int initial): Create a semaphore with given initial count.

• sem_destroy(struct semaphore *)

• P(struct semaphore *)

• V(struct semaphore *)

Dijkstra style semaphore is implemented in $KTh/synch.c.

The semaphore can be used to guard critical sections via:

volatile int total = 0;

struct semaphore *sem = sem_create("mutex", 1);

void add()

{

P(sem);

{

++total;

}

V(sem);

}

The semaphore holds four fields:

struct semaphore

{

char *name;

struct wchan *wchan;

struct spinlock *lock;

volatile int count;

};

2.6.1 Producer-Consumer Queue

Suppose we have some threads (producers) that add items to a buffer, and threads (consumers) that remove items from the buffer. Suppose we also want:

• Consumers do not consume if the buffer is empty.

• The buffer has finite size and when the buffer is full, producers wait.


Descriptio 2.6: Implementation of a Semaphore, where c is the count

struct semaphore *sem_items , *sem_spaces;

sem_items = sem_create("Buffer Items", 0);

sem_spaces = sem_create("Buffer Spaces", N);

void produce ()

{

P(sem_spaces );

// Add items to the buffer

V(sem_items );

}

void consume ()

{

P(sem_items );

// Remove item from the buffer

V(sem_spaces );

}

Notice that there is still a race condition when modifying the buffer. A third synchronisation primitive is required to protect the buffer.

Let’s consider another scenario. Suppose we have two queues, A and B, and we need two functions to move items between A and B, with the following requirements:

• Items need to stay in order.

• dequeue should not be called on an empty queue.

• One queue can only be used by one thread at a time.

• Items must be enqueued onto B in the same order that they are dequeued from A, and vice versa.

• No deadlocks.

struct semaphore
    *nA, *nB,          // Number of elements in A, B; initialised to N, 0
    *lockA, *lockB,    // Lock for each queue; initialised to 1
    *lockAB, *lockBA;  // Atomicity of transfer operations; initialised to 1


AtoB := P(nA); P(lockAB); P(lockA); x ← dequeue(A); V(lockA); P(lockB); enqueue(B, x); V(lockB); V(lockAB); V(nB)

BtoA := P(nB); P(lockBA); P(lockB); x ← dequeue(B); V(lockB); P(lockA); enqueue(A, x); V(lockA); V(lockBA); V(nA)

2.7 Condition Variables

Locks are not the only primitives that are needed to build concurrent programs. There are cases when a thread wishes tocheck if a condition is true before continuing its execution.

For example, the join() operation in pthread checks if another thread has finished execution. We could use a shared volatile variable, but this is hugely inefficient, as the thread spins and wastes CPU time. We would like some way to put a thread to sleep until the condition becomes true.

Definition

A condition variable is an explicit queue that threads can put themselves on when some state of execution (the condition) is false.

Each condition variable is intended to work together with a lock, and the variables are only used from within the critical section protected by the lock.

Kernel API

Interface for a Condition Variablea is provided in $KI/synch.h.

• cv_wait(cv, lock): Blocks the calling thread, and releases the lock associated with the condition variable.

Once the thread is unblocked, the lock is re-acquired by the calling thread.

• cv_signal(cv, lock): Unblock one of the threads that was previously blocked on the signalled condition.

• cv_broadcast(cv, lock): Like signal, but unblocks all threads.

For all three operations, the current thread must hold the lock passed as the argument into the functions. In normal circumstances, the same lock is used with the same CV.

The type for a CV is struct cv *.

aMesa-style. There are alternative CV semantics which ensure some condition must be true when the CV exits, but they tend to be slower. The Mesa semantics are implemented on most platforms.

• Mesa style: The thread that calls cv_signal or cv_broadcast continues execution.

• Hoare style: The thread that calls cv_signal or cv_broadcast is blocked and gives up the lock, and the waiting thread continues executing.

The calling thread of cv_wait holds the lock before and after cv_wait. Between the call and the return, the caller’s lock is released and other threads may enter.

The following code using semaphores can be converted to use locks and condition variables:

struct semaphore *sa , *sb , *sc;

void func1() {

P(sa);

funcA ();

V(sa);

P(sc);

}

void func2() {

P(sb);

funcB ();


V(sb);

V(sc);

}

(Hint: The solution below does not quite work)

struct lock *la , *lb;

int nResource = 0;

struct cv *cv;

struct lock *lcv;

void func1() {

lock_acquire(la);

funcA ();

lock_release(la);

lock_acquire(lcv);

--nResource;

while (nResource <= 0)

cv_wait(cv , lcv);

lock_release(lcv);

}

void func2() {

lock_acquire(lb);

funcB ();

lock_release(lb);

lock_acquire(lcv);

++ nResource;

if (nResource > 0)

cv_signal(cv , lcv);

lock_release(lcv);

}

In a Producer-Consumer queue, the producer waits for the condition c < N to be true. When the condition is not true, a thread can wait on the corresponding condition variable. When a thread detects that a condition is true, it uses cv_signal or cv_broadcast to notify some or all waiting threads.

For example:

volatile int nGeese = 100;

lock *mutex;

int safe_to_walk ()

{

lock_acquire(mutex);

while (nGeese > 0)

{

lock_release(mutex);

lock_acquire(mutex);

}

}



Repetitively acquiring and releasing a lock provides an opportunity for a context switch. However, the thread should not be waiting for the lock, but for the condition g = 0 to become true.

Instead we can use a cv:

volatile int nGeese = 100;

lock *mutex;

cv *zeroGeese;

int safe_to_walk ()

{

lock_acquire(mutex);

while (nGeese > 0)

cv_wait(zeroGeese , mutex);

}

The access of variable nGeese is protected by the mutex (recall the exit condition for cv_wait).


It is recommended to use a while loop, since sometimes a thread may wake up without a signal (a spurious wakeup), so it is necessary to re-check the condition. Another reason is that when a thread is woken up, it is merely placed on the ready queue, and the condition may change again before it runs. A more subtle reason is given in producer-consumer queues.

2.7.1 Join and Exit

Consider the program

bool done = false;

struct lock *mutex;

struct cv *c;

void child(void *ptr, unsigned long arg)

{

printf("child\n");

th_exit ();

}

int main()

{

/* Initialise Lock and CV */

printf("Parent: Begin\n");

thread_fork("child", NULL , child , NULL , 0);

th_join ();

printf("Parent: End\n");

return 0;

}

We can implement th_join() and th_exit() as follows:

void th_exit ()

{

lock_acquire(mutex);

done = true;

cv_signal(c, mutex );

lock_release(mutex);

}

void th_join ()


{

lock_acquire(mutex);

while (done == false)

cv_wait(c, mutex);

lock_release(mutex);

}


To understand the necessity of every condition, we have two cases:

1. The parent creates the child thread, continues running, and immediately calls th_join().

2. The parent creates the child; the child immediately runs and calls th_exit(), after which the parent calls th_join().

We shall consider the following variations on the above code:

1. No signalling variable: th_join() simply waits on the CV, and th_exit() simply signals it.

In case (2), the child calls th_exit() and exits immediately; the signal is lost because no thread is waiting yet, so the parent gets stuck when it calls th_join().

2. No lock (suppose we could wait on a condition variable without a lock, as in pthread):

The check-and-wait is no longer atomic. This creates a race condition in the following order of execution: the parent checks the condition and sees that the flag is not yet set. Then, before the parent thread waits on the CV, the child thread sets the flag and signals, but there are no threads to receive this signal. After the child exits, the parent sleeps and gets stuck.

Assignment 1 Part 2. Implement the condition variables in $KI/synch.h based on the stub in $KTh/thread.c. Then test your implementation with sys161 kernel "sy3;q".


2.7.2 Producer-Consumer Queue with Condition Variables

The CV can also implement a Producer-Consumer buffer.

volatile int count = 0; /* Must initially be 0 */

struct lock *mutex;

struct cv *notfull , *notempty;

produce(item): acquire the mutex; while count = N, wait on notfull; add the item; increment count; signal notempty; release the mutex.

consume(): acquire the mutex; while count = 0, wait on notempty; remove an item; decrement count; signal notfull; release the mutex.

Several alternative broken implementations exist:

1. Single CV, if instead of while:

We have the variables

volatile int count = 0; /* Must initially be 0 */

struct cv *cv;

struct lock *lock;

Because there is only one CV and lock, we can omit the variable name in the diagrams:

produce(item): acquire the lock; if count = N, wait; add the item; increment count; signal.

consume(): acquire the lock; if count = 0, wait; remove an item; decrement count; signal.

Consider the following execution order, in which a consumer C2 is spuriously woken up and C1 finds no item to consume: consumers C1 and C2 both check the buffer, find it empty, and wait. The producer P adds an item and signals, waking C1. But C2 wakes up spuriously and runs first; having already passed its if, it consumes the item. When C1 finally runs, there is no item left to consume.

The philosophy of a CV is to notify a thread that the state of the world has possibly changed. With Mesa semantics, always use while loops.

2. Single CV, with while:

The problem above is easy to fix by changing the if to a while:

produce(item): acquire the lock; while count = N, wait; add the item; increment count; signal.

consume(): acquire the lock; while count = 0, wait; remove an item; decrement count; signal.

The problem now is that if there are multiple producers and one producer’s signal wakes up another producer, all threads can end up asleep, which is bad. We need two CVs: one to wake up the consumers and one for the producers.

Assignment 1 Part 3. Use locks, semaphores, wait channels, and condition variables to implement an efficient and fair scheme for allowing vehicles to traverse intersections.

Detour

In a Mesa-style CV, when blocked threads are signalled, they are put onto the ready queue; they may run briefly, only to be put onto the wait channel of the lock used with the CV. This is a double wakeup. To mitigate this we can use wait morphing: put the thread woken in cv_signal directly onto the wait channel of the lock, or mark the signalled thread as the lock owner prior to putting it onto the ready queue.


Caput 3

Processes and the Kernel

A program is a piece of data. A running program, or a process, is the result of executing a program. The machine state of a process is what a program can read or update when it is running. The machine state consists of:

• Memory/Address space

• Registers

• PC (Program counter)

• Stack pointer/Frame pointer

3.1 Process Creation

How are programs transformed into processes? The OS enters a series of stages when starting a process from a program:

1. Load the code and static data into memory.

Programs are usually stored in some kind of executable format.

Modern OSes perform this lazily, loading data only when it is necessary. This requires paging and swapping, which are covered in the chapter on Virtual Memory.

2. Allocate memory for the program’s run-time stack (or just stack), and initialise the stack with the argc and argv arguments.

3. Allocate memory for the program’s heap. The heap starts small and grows via the malloc() API.

4. Other initialisation tasks, such as I/O.

For example, in UNIX systems each process has three open file descriptors, stdin, stdout, and stderr. This allows the program to read input from the terminal and output to the screen.

A process can be in one of three states, running, ready, and blocked, which are similar to thread states. A process which initiates an I/O operation may be blocked (and have its CPU time allocated to another process) until the I/O operation is complete.

To track the state of each process, the OS keeps a process list.

3.2 System calls for Processes

A process is an environment in which an application program runs. A process includes virtualised resources that the program can use:

• Threads

• Virtual memory, to be used for the program’s code and data.

• Other resources (e.g. file and socket descriptors)


Processes are created and managed by the kernel. Each program’s process isolates it from other programs in other processes. Each process is associated with a process identifier, or PID.

Processes can be created, managed, and destroyed. OS/161 supports a variety of functions to perform these tasks.

                 Linux                  OS/161
Creation         fork, execv            fork, execv
Destruction      exit, kill             exit
Synchronisation  wait, waitpid, pause   waitpid
Attributes       getpid, getuid, nice   getpid

Note. The OS/161 process management calls are not implemented yet.

Note. It is not possible in OS/161 to have two trapframes back-to-back, since when an interrupt happens, interrupts are turned off and are not turned back on until mips_trap.

Kernel API

In $KI/syscall.h:

• fork(): Clones the caller process (the parent) and spawns a clone (the child).

After the fork,

– Parent and child execute copies of the program.

– Virtual memories of parent and child are identical at the time of fork, but may diverge afterwards.

– fork is called by the parent, but returns in both the parent and child.

– Returns 0 in the child, and the child’s PID in the parent. Returns < 0 if it failed.

• _exit(code): Terminate the process that calls it. The process supplies an exit code, which the kernel records.

• waitpid(int pid): Lets a process wait for another to terminate and retrieve its exit status code.

The idea that fork() returns in both the parent and the child is a bit surprising. The parent and child processes can be thought of as running in parallel universes, and fork() returns a different number in each one. This number allows for easy differentiation of parent and child.

Detour: Where is my system call?

One might wonder where the user-level interfaces for system calls such as fork() are generated. A grep finds no trace of any function body for fork(), even if we take the user directory $U into account.

The system calls are generated in the system call library, in $UL/libc/syscalls/gensyscalls.sh. This script transforms the list of system calls, which is easily accessible in $KIK/syscall.h, into a MIPS file. The SYSCALL macro is defined in $UI/libc/arch/mips/syscalls-mips.S, and the generated file is located in build/user/lib/libc/syscalls.S.

But this isn’t the end. If we naïvely call any tool in testbin, we get an error:

$ osrun ’p testbin/forktest; q’

...

Unknown syscall 0

...

This is because the system call handler in $KAMS/syscall.c contains a gigantic switch-case clause with a case for every single system call. The unhandled fork() is routed to “Unknown syscall 0”.

If both processes run on a single CPU, the behaviour after fork() is not deterministic: either the parent or the child may run first. The order of execution is decided by the CPU scheduler.

void child_code ();

void parent_code(int pid);

int main()

{


int rc = fork ();

if (rc < 0)

/* fork failed */

exit (-1);

else if (rc == 0)

child_code ();

else

/* rc == child ’s pid */

parent_code(rc);

}

The parent, can for example, wait for the child to exit:

void parent_code(int pid)

{

waitpid(pid);

/* Child exited */

}

3.2.1 execv()

Kernel API

execv() changes the program that a process is running.

• The calling process’s current virtual memory is destroyed.

• The process gets a new virtual memory and the new program runs on it.

• The process id stays the same.

For example,

int main()

{

int rc = 0;

char *args [3];

args [0] = (char *) "/testbin/argtest";

args [1] = (char *) "string";

args [2] = NULL; // Terminator

rc = execv(args[0], args);

/* If execv succeeds, the code below is not executed */

printf("If you see this then execv failed\n");

printf("rc = %d, errno = %d\n", rc , errno);

exit (0);

}

We could combine fork() and execv() to spawn a new process and obtain its PID:

int main()

{

char *args [...];

rc = fork ();

if (rc == 0)

{

status = execv(args[0], args);

// At this stage execv failed.

exit (-1);

}

else

{


/* Now rc holds pid of newly spawned process */

parent_code ();

}

}

One may ask, why such an odd semantic? Why is there no syscall that just spawns another process and returns its PID? The reason is that the fork-exec semantic is essential to building a UNIX shell: since the shell can run code after fork() but before exec(), the shell code can change the environment of the about-to-start program.

For example, the shell may see

$ ./prog 1 2 > out.txt

In this example, the output of ./prog 1 2 is redirected into out.txt. The shell implements this very easily: before calling exec, stdout is redirected to the file out.txt.

3.3 Kernel Privilege

The CPU implements different levels of execution privilege as a security and isolation mechanism. The kernel runs atthe highest privilege level.

Applications run at a lower privilege level because user code should not be able to perform dangerous tasks such as

• Modify the page table

• Halting the CPU

Programs cannot execute code or instructions belonging to a higher level of privilege1. This allows the kernel to isolate each process from the others and from the kernel itself.

System calls, or syscalls, are the interface between processes and the kernel. Since application programs cannot directly call the kernel, how does a program make a system call such as fork()?

There are only two things that make kernel code run:

1. Interrupts: Generated by devices when the normal flow of execution needs to yield to special handlers.

Interrupts are raised by devices (hardware) and transfer control to a fixed location where the interrupt handler is located. Interrupt handlers are part of the kernel. When an interrupt occurs, the processor switches to privileged execution mode. This is how the kernel gets execution privilege.

2. Exceptions: When a running program enters an illegal state or needs special handling, i.e. exceptions result from the execution of an instruction.

Exceptions are conditions (arithmetic overflows, illegal instructions, page faults) that occur during the execution ofa program.

Exceptions are detected by the CPU during instruction execution. The CPU handles exceptions like it handles interrupts, via exception handlers, which are part of the kernel2.

3. Syscalls (This is a type of exception)

To perform a system call (i.e. transfer control to the kernel), the application needs to cause an exception to make the kernel execute. In MIPS:

1. The application puts a code3 corresponding to a system call (e.g. fork, getpid) into a specified location.

In OS/161, the code goes in register v0. For example, li v0, 0 loads the call code for fork into v0, since SYS_fork is defined to be 0.

2. The application executes the special-purpose instruction syscall, which triggers EX_SYS.

3. The kernel exception handler checks v0 to determine which system call has been made.

The code and code location are part of the kernel ABI (Application Binary Interface). System calls:

1The Meltdown vulnerability on Intel chips let user code bypass execution privilege and access other parts of memory.

2The MIPS exception types can be found in $KAMI/trapframe.h

3The OS/161 system call codes are in $KIK/syscall.h.


• Can take parameters using the registers a0, a1, a2, a3.

• The success/fail code is returned in a3.

• The return value or error code is returned in v0.

When an application generates a system call:

1. Application calls library wrapper function for system call.

2. Library function performs syscall.

3. Kernel exception handler starts:

(a) Create trap frame to save application program state

(b) Determine that the exception is a system call

(c) Query v0 to see which system call is generated.

(d) (System call body)

(e) Restore application program state from trap frame

(f) Return from exception

4. Library wrapper function returns

5. Application continues execution

In OS/161:

1. Load into register v0 the syscall code for system call.

2. Load into register a0 to a3 the system call parameters

3. Raise a syscall exception using the syscall instruction.

4. CPU raises privilege. The exception handler common_exception executes and disables interrupts.

5. Switch the current stack for the current thread from user stack to kernel stack.

6. Generate a trapframe on kernel stack using mips_trap.

7. The system call dispatcher, syscall, is called.

8. syscall calls the appropriate handler depending on trapframe.v0.

9. If a timer interrupt happens at this time,

(a) The newly spawned common_exception detects this and does nothing.

(b) Interrupt handler for the clock is called.

(c) thread_yield is called, causing a context switch.

(d) Eventually context switches back to this thread.

10. The system call handler executes and populates v0 (return value) and a3 (success/fail code).

11. Increment the PC

12. mips_trap returns to common_exception. The trapframe data is restored.

13. Switch to unprivileged mode using rfe.

Every OS/161 process thread has two stacks, but only uses one at a time. (The previous illustration in these notes, in which a thread has only one stack, is inaccurate.)

• User stack: used while application code is executing

– Located in the application’s virtual memory

– Holds activation records for application functions

– Kernel creates this stack when it sets up the virtual address memory.


• Kernel stack4: Used while the thread is executing kernel code, after an exception or interrupt.

– Stack is a kernel structure

– This stack holds activation records for kernel functions

– This stack holds trap frames and switch frames.

3.4 Inter-Process Communications

Processes are isolated from each other, but they can still communicate. Inter-Process Communication (IPC) is a family of methods used to send data between processes.

• File: Data to be shared is written to a file and is accessible for both processes.

• Socket: Data is sent via network interface between processes

• Pipe: Data is sent unidirectionally from one process to another via an OS-managed buffer

• Shared Memory: Data is sent via block of shared memory visible to both processes.

This method requires synchronisation, and is not very flexible.

• Message Passing/Queue: A queue/data stream provided by the OS to send data between processes.

4The t_stack field of the thread structure points to this stack.


Caput 4

Virtual Memory

A bit can store 0 or 1. A byte can store 8 bits.

4.1 Physical Memory

If a physical address is represented by P bits, the maximum addressable amount of physical memory is 2^P bytes.1 18 bits is sufficient to address 2^18 bytes, or 256 KB.

The kernel implements a private virtual memory for each process. The virtual memory of a process holds the code, data, and stack for the program that is running. If virtual addresses are V bits, the maximum size of a virtual memory is 2^V bytes.2 Applications only see virtual addresses, e.g.:

• The PC and stack pointer hold virtual addresses of the next instruction/stack.

• All pointers are virtual addresses.

• Targets of j instructions are virtual addresses.

Each process is isolated in its own virtual memory and cannot access other processes’ virtual memories.3 This serves several purposes:

• Isolate processes/kernel from each other

• Potential to support virtual memory much larger than physical memory.

4.2 Address Translation

Each virtual memory is mapped to a different part of physical memory. When a process tries to access a virtual address, the virtual address is translated to its corresponding physical address. This translation is performed in hardware, by the Memory Management Unit (MMU), using information provided by the kernel.

There are several different methods of address translation.

4.2.1 Dynamic Relocation

Definition

The offset or relocation (R) is the position in physical memory where a process’s virtual memory begins. The limit (L) is the amount of memory used by the process.

The MMU has a register for each. Given a virtual address v, it can be translated as

p(v) := { v + R       if v < L
        { exception   otherwise

Addresses that cannot be translated will produce exceptions.

1In the Sys161 MIPS architecture, physical addresses are 32 bits long, for a maximum memory of 2^32 bytes, or 4 GB.

2For MIPS, V = 32

3Special mechanisms exist to inject into other process’ memories. This type of program could be flagged as a virus


Descriptio 4.1: Dynamic Relocation

Descriptio 4.2: Segmented memory of two processes

Unfortunately, this unsegmented model suffers from numerous problems. In OS/161, since the program code and data exist near 0x0 and the top of the stack is at 0x7FFFFFFF, a single contiguous block containing both would require 2 GB of space.

Although efficient, dynamic relocation suffers from fragmentation.

4.2.2 Segmentation

Instead of mapping the entire virtual memory to physical memory, we map each segment of virtual memory that the application uses separately. The kernel maintains an offset and limit for each segment. With segmentation, a virtual address can be thought of as having two parts:

• K bits of segment id (2^K possible segments).

• V − K bits of address within the segment.

The kernel decides where each segment is placed in physical memory. Two different implementations of segmentation exist:

• The MMU has a relocation register and a limit register for each segment.

Let Ri be the relocation offset and Li the limit for the ith segment.

1: procedure Translate(v)
2:     (s, a) ← split(v)          ▷ Segment number and address
3:     if a ≥ Ls then
4:         exception
5:     else
6:         return p ← Rs + a
7:     end if
8: end procedure

In this method, each thread has a fixed number of segments.

• Maintain a segment table. In this method, the number of segments per thread is capped by the table size.

One bit in the segment table may be reserved for protection. The segment table is stored in the MMU.

Fragmentation is still possible in segmentation.
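The segment-table variant can be sketched in C. The field widths and table contents below are hypothetical example values, not MIPS or OS/161 ones:

```c
#include <stdint.h>

#define SEG_BITS 2    /* hypothetical: top 2 bits of a 16-bit address are the segment id */
#define ADDR_BITS 14  /* remaining bits are the offset within the segment */

struct segment { uint32_t reloc; uint32_t limit; };

/* Example segment table (hypothetical values). */
const struct segment demo_table[1 << SEG_BITS] = {
    { 0x1000, 0x0200 },  /* segment 0 */
    { 0x8000, 0x0100 },  /* segment 1 */
    { 0, 0 }, { 0, 0 },  /* unused segments: limit 0 faults every access */
};

/* Split v into (segment, offset), check the segment's limit, and apply
 * its relocation.  Returns the physical address or -1 for an exception. */
int64_t seg_translate(uint32_t v, const struct segment *table) {
    uint32_t s = v >> ADDR_BITS;               /* segment number */
    uint32_t a = v & ((1u << ADDR_BITS) - 1);  /* offset in segment */
    if (a >= table[s].limit)
        return -1;                             /* beyond the segment limit */
    return (int64_t)table[s].reloc + a;
}
```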


Descriptio 4.3: Single-level paging with a virtual address length of 16 bits (figure: the virtual address splits into a page number and an offset; the page table maps the page number to a frame number, which combines with the offset to form the physical address)

Descriptio 4.4: Two-level paging with a virtual address length of 16 bits (figure: the page number splits into a directory index and a table index; the directory selects a second-level page table, which yields the frame number)

4.2.3 Paging

This is the big one: virtual memory is divided into fixed-size chunks called pages. Each process has a page table, mapping the process's virtual addresses to physical addresses.

Each row in the page table is a page table entry (PTE).

    Page    Frame   Valid?  ReadOnly?
    0x0     14      1       0

A page table can get very large. The size of a page table is

(Page Table Size) = (Number of Pages) · (Size of PTE)

To reduce the size of the page table, we use multi-level paging[4]: we organise the page table by splitting it into multiple levels. If a table contains no valid PTEs, do not create the table.

How many levels of paging is optimal? Ideally, each table should fit on a single page. As V (the virtual address length) grows, so does the need for more tables. For example:

• When V = 40, the page size is 4KB, and a PTE is 4 bytes:

• There are 2^40/2^12 = 2^28 pages in virtual memory.

• 2^12/2^2 = 2^10 PTEs fit on a single page.

• Up to 2^28/2^10 = 2^18 page tables are needed, so the directory must hold 2^18 entries.

• The directory takes 2^18 · 2^2 = 2^20 bytes, or 1MB, of space.

Formulae:

• BitsOffset = log2 PageSize

• PTEsPerPage = PageSize/PTESize

• BitsPageNum = log2 PTEsPerPage

• Levels = ⌈(V − BitsOffset)/BitsPageNum⌉

Notice that V − BitsOffset is the number of bits left for the page index.
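The formulae can be checked with a small C helper (a sketch; all sizes are assumed to be powers of two):

```c
/* Exact log2 for powers of two. */
static int ilog2(unsigned long x) {
    int b = 0;
    while (x > 1) { x >>= 1; b++; }
    return b;
}

/* Number of paging levels for V-bit virtual addresses, following the
 * formulae above: Levels = ceil((V - BitsOffset) / BitsPageNum). */
int paging_levels(int V, unsigned long page_size, unsigned long pte_size) {
    int bits_offset = ilog2(page_size);              /* BitsOffset  */
    int bits_pagenum = ilog2(page_size / pte_size);  /* BitsPageNum */
    return (V - bits_offset + bits_pagenum - 1) / bits_pagenum;
}
```

For the worked example above (V = 40, 4KB pages, 4-byte PTEs) this yields 3 levels.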


Descriptio 4.5: Example: page size is 2^20, V = 64, a page table entry is 2^4 bytes (figure: the virtual address splits into a 12-bit level-1 index, 16-bit level-2 and level-3 indices, and a 20-bit page offset)

In paging, the kernel:

• Manages MMU registers on address space switches (context switch from a thread in one process to a thread in a different process)

• Creates and manages page tables

• Manages (allocates/deallocates) physical memory

• Handles exceptions raised by the MMU

MMU:

• Translate virtual addresses to physical addresses

• Check and raise exceptions when necessary

4.3 Translation Lookaside Buffer

Constantly fetching address translations through a page table can add significant overhead. Since program instructions are indexed by virtual addresses as well, every instruction fetch is accompanied by a page table lookup.

The solution is to include a Translation Lookaside Buffer (TLB) in the MMU. The TLB is a small, fast, dedicated cache of address translations in the MMU. Each entry in the TLB stores a (#Page → #Frame) mapping.

When the MMU needs to translate a virtual address on page p, two different things can happen, depending on the implementation of the MMU:

• Hardware-Managed TLB:

1: if (p, f) ∈ TLB then
2:     return f                        ▷ TLB Hit
3: else
4:     f ← PageTable.FrameNumber(p)
5:     TLB.evict()
6:     TLB.add(p, f)
7:     return f                        ▷ TLB Miss
8: end if

The MMU handles TLB misses, including page table lookup and replacement of TLB entries. This requires the MMU to understand the kernel's page table format.

• Software-Managed TLB:

1: if (p, f) ∈ TLB then
2:     return f                        ▷ TLB Hit
3: else
4:     Raise exception                 ▷ TLB Miss
5: end if

The kernel in this case must determine the frame number for p and add (p, f) to the TLB, evicting another entry if necessary.

MIPS uses a software-managed TLB, handled in $KAMI/tlb.h. The MIPS TLB has room for 64 entries. Each entry is 64 bits (8 bytes) long.

• Paging does not introduce external fragmentation

• Multi-level paging reduces the amount of memory required to store page-to-frame mappings.

• TLB misses are increasingly expensive with deeper page tables.

[4] Linux supports 4-level or even 5-level (in recent kernel versions) paging.


Descriptio 4.6: MIPS TLB entry (figure: the 32-bit high word holds a 20-bit page number and a 6-bit PID, with the remaining bits unused; the 32-bit low word holds a 20-bit frame number plus write-permission and valid bits)

4.4 Address Space in OS/161

struct addrspace is very simple. Each program is divided into 3 segments: Code, Data, and Stack.

struct addrspace {
    vaddr_t as_vbase1;     /* Base vaddr of code segment */
    paddr_t as_pbase1;     /* Base paddr of code segment */
    size_t  as_npages1;    /* Size (in pages) of code segment */
    vaddr_t as_vbase2;     /* Base vaddr of data segment */
    paddr_t as_pbase2;     /* Base paddr of data segment */
    size_t  as_npages2;    /* Size (in pages) of data segment */
    paddr_t as_stackpbase; /* Base paddr of stack */
};

(figure: as_vbase1/as_npages1 and as_vbase2/as_npages2 locate the code and data segments in virtual memory, which map to physical memory at as_pbase1 and as_pbase2; the stack maps to as_stackpbase)

4.5 Executable and Linkable Format

A program's code and data is described in an executable file. OS/161, like some other operating systems, expects the executable and linkable format (ELF).

The ELF contains:

• Address space segment descriptions. The ELF header describes the segment images.

• The ELF file identifies the (virtual) address of the program’s first instruction (entry point)

• Some other information: section descriptors, symbol tables. These are useful to compilers, linkers, debuggers, loaders, and other tools.

In OS/161, the dumbvm implementation assumes that an ELF file has two segments:

• Code/Text Segment: Contains code and read-only data

• Data Segment: Contains other data

Note that ELF does not store stack data. To execute a new program:

1. Count number of arguments and copy them into kernel

2. Copy program path into kernel

3. Open program file using vfs_open


4. Create a new address space, switch the process to the new address space, and call as_activate.

as_activate is a terribly named function whose only purpose is to clear the TLB.

5. Using the opened program file, load the program image

6. Copy the arguments into the new address space. Consider copying the arguments onto the user stack as part of as_define_stack.

7. Delete old address space.

8. Call enter_new_process with the arguments on the stack, the stack pointer (from as_define_stack), and the program entry point (from vfs_open).

Assignment 2b. Implement execv in OS/161.

The main difficulty of doing this is argument passing. When copying from/to userspace, the kernel must use copyin or copyout.

Common problems:

• copyinstr/copyoutstr's length does not count the NULL terminator.

• User pointers should be of the type userptr_t.

• Make sure to pass a pointer to the top of the stack to enter_new_process.

Pointers to strings must be 4-byte aligned. Strings do not have to be aligned.
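One way to avoid alignment mistakes is to compute the stack layout before copying anything. This is a sketch under the assumptions above (strings unaligned, pointer array 4-byte aligned, 4-byte pointers on 32-bit MIPS); argv_stack_space is a hypothetical helper, not part of OS/161:

```c
#include <stddef.h>
#include <string.h>

/* How many bytes of user stack the argument strings plus a 4-byte-
 * aligned argv array (argv[0..argc-1] and a NULL terminator) need. */
size_t argv_stack_space(int argc, char *argv[]) {
    size_t bytes = 0;
    for (int i = 0; i < argc; i++)
        bytes += strlen(argv[i]) + 1;   /* include each NUL byte */
    bytes = (bytes + 3) & ~(size_t)3;   /* pad so the pointer array is aligned */
    bytes += (size_t)(argc + 1) * 4;    /* 4-byte pointers on 32-bit MIPS (assumption) */
    return bytes;
}
```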

4.6 Virtual Memory for the Kernel

We would like the kernel to live in virtual memory as well. There are some challenges:

• Bootstrapping: Since the kernel helps to implement virtual memory, how can the kernel run in virtual memory when it is just starting?

Solutions are architecture specific.

• Sharing: Sometimes data need to be copied between kernel and application program.

This can be addressed by making the kernel and process virtual memories overlap, i.e. the address range above 0x80000000 is always kernel address space and is shared among processes.

When the CPU is in an unprivileged state, the CPU can only translate addresses below 0x80000000.

Note. The Sys161 emulator only allows for at most 1GB of physical memory.

• kseg0 (512MB): For kernel data structures, stacks, etc.

To translate kseg0 addresses, subtract 0x80000000 from the virtual address. Thus kseg0 maps to the first 512MB of physical memory. The kernel does not have to use all of this memory. This region uses dynamic relocation, not the TLB.

• kseg1 (512MB): For addressing devices

To translate kseg1 addresses, subtract 0xA0000000 from the virtual address. This region is also mapped to the first 512MB of physical memory.

• kseg2 (1GB): Unused

Note. Physical memory is divided into frames. Frame use is managed by the kernel in the coremap. OS/161 does not have this.


Descriptio 4.7: OS/161 virtual memory and physical memory mappings (figure: user memory kuseg is paged; kernel memory comprises kseg0 starting at 0x80000000, kseg1 starting at 0xA0000000, and kseg2; kseg0 and kseg1 both map onto the start of physical memory, part of which is unusable)

4.7 Exploiting Secondary Storage

Some programs are very large, on the order of several gigabytes. It is inefficient to load the entire program into memory at the same time. We can allow pages from virtual memories to be stored in secondary storage (i.e. on disks or SSDs). Pages/segments are swapped between secondary storage and primary memory.

When swapping is used, some pages of virtual memory will be in memory, and others will not. The set of virtual pages present in physical memory is the resident set of a process. A process's resident set will change over time.

To track which pages are in physical memory, each PTE needs to contain an extra bit, the present bit. Note that a non-present page should not be in the TLB. When a process tries to access a page that is not in memory, the problem is detected because the page's present bit is zero.

• In a hardware-managed TLB, the MMU detects this when it checks the page's PTE, and generates an exception for the kernel to handle.

• In software-managed TLB, the kernel detects the problem when it checks the page’s PTE after a TLB miss.

The event of accessing a non-resident page is a page fault. When this happens,

1. Swap the page into memory from secondary storage, evicting another page from memory.

2. Update PTE (set present bit)

3. Return from the exception so the application can retry the virtual memory access that caused the page fault.

4.7.1 Optimising Page Faults

Page faults are slow: accessing secondary storage can be orders of magnitude slower than RAM. To improve the performance of virtual memory with on-demand paging, reduce the occurrence of page faults.

• Limit number of processes, so that there is enough physical memory per process.

• Try to be smart about which pages are kept in physical memory, and which are evicted.

• Hide latencies e.g. by pre-fetching pages before a process needs them.

Some impractical page replacement policies:

• The optimal page replacement policy is to evict the page that will not be referenced for the longest time. However, this requires knowledge of the future.

• Least Recently Used: the MMU cannot maintain a clock and store an access time for each page.

Real programs do not access virtual memories randomly. Instead, they exhibit locality.

• Temporal Locality: Programs are more likely to access pages that they have accessed recently.

• Spatial Locality: Programs are likely to access parts of memory that are close to parts of memory they have accessed recently.


Locality helps the kernel keep page fault rates low. A simple scheme:

• Add a use bit (or reference bit) to each PTE.

• Set the use bit each time the page is used.

• Periodically the kernel clears all use bits.

This gives the clock algorithm.

1: p : int                      ▷ Index of next candidate frame
2: procedure ClockEvict
3:     while use[p] do
4:         use[p] ← false
5:         p ← (p + 1) mod N    ▷ N is the number of frames
6:     end while
7:     evict(p)
8:     p ← (p + 1) mod N
9: end procedure
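A C rendering of the same algorithm, with a small demo scenario (N, the frame count, is an arbitrary example value):

```c
#include <stdbool.h>

#define N 8                  /* number of physical frames (example) */
static bool use_bit[N];      /* one use bit per frame */
static int hand = 0;         /* index of the next candidate frame */

/* Clock replacement: sweep forward, clearing the use bit of every
 * recently-used frame (its "second chance"), and evict the first frame
 * whose use bit is already clear.  Returns the evicted frame number. */
int clock_evict(void) {
    while (use_bit[hand]) {
        use_bit[hand] = false;
        hand = (hand + 1) % N;
    }
    int victim = hand;
    hand = (hand + 1) % N;
    return victim;
}

/* Demo: frames 0 and 1 were used recently, so both get a second
 * chance and frame 2 is evicted. */
int clock_demo(void) {
    use_bit[0] = use_bit[1] = true;
    return clock_evict();
}
```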


Caput 5

Scheduling

5.1 Simple Scheduling Model

We are given a set of jobs to schedule. Only one job can run at a time. For each job, we are given the arrival time a_i and the run time r_i.

Definition

The response time is the time between the job's arrival and the beginning of execution. The turnaround time is the time between the job's arrival and the end of execution.

We must decide when each job should run, to achieve some goal. Jobs are not atomic and can be interrupted to yield the CPU to another job.

The simplest scheduling algorithm is First-Come-First-Serve (FCFS):

• Jobs run in order of arrival.

• Simple, avoids starvation.

The figures use the example:

                    j1   j2   j3   j4
    a_i (Arrival)    0    0    0    5
    r_i (Runtime)    5    8    3    2
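The FCFS schedule for this example can be checked with a short simulation (a sketch; jobs are assumed to be given in arrival order):

```c
/* FCFS: each job starts when the previous one finishes, or when it
 * arrives if that is later.  Returns the average turnaround time. */
double fcfs_avg_turnaround(int n, const int arrive[], const int run[]) {
    double total = 0;
    int t = 0;                              /* current time */
    for (int i = 0; i < n; i++) {
        if (t < arrive[i]) t = arrive[i];   /* CPU idle until arrival */
        t += run[i];                        /* job i finishes at t */
        total += t - arrive[i];             /* turnaround = finish - arrival */
    }
    return total / n;
}
```

For the example jobs, the turnaround times are 5, 13, 16, and 13, giving an average of 11.75.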

Round-Robin (used in OS161):

• Pre-emptive FCFS

Shortest-Job-First:

• Increasing order of runtime

• Minimises average turnaround time

• Starvation is possible.

Shortest-Remaining-Time-First:

Descriptio 5.1: FCFS Scheduling (Gantt chart of the example jobs over time 0–20)


Descriptio 5.2: Round-Robin Scheduling (Gantt chart of the example jobs over time 0–20)

Descriptio 5.3: Shortest-Job-First Scheduling (Gantt chart of the example jobs over time 0–20)

• Pre-emptive variant of SJF. Arriving jobs pre-empt the running job.

• Select the one with shortest remaining time when a job exits.

• Starvation is possible.

In CPU scheduling, the “jobs” to be scheduled are the threads. CPU scheduling differs from the simple scheduling model:

• The runtimes of threads are normally not known.

• Threads are sometimes not runnable. e.g. when blocked.

• Threads may have different priorities.

The objective of a scheduler is to achieve a balance between:

• Responsiveness

• Fairness

• Efficiency

5.2 Multi-Level Feedback Queues

Multi-Level Feedback Queues (MLFQ) are the most commonly used scheduling algorithm. The objectives are:

• Good responsiveness for interactive threads (these are frequently blocked). Higher priorities are given to interactive threads, and they run whenever they are ready.

• Non-interactive threads should make as much progress as possible.

Descriptio 5.4: Shortest-Remaining-Time-First Scheduling (Gantt chart of the example jobs over time 0–20)


Descriptio 5.5: Completely Fair Scheduling. Thickness is proportional to factor. (Gantt chart of three jobs over time 0–50)

In MLFQ, there are n levels of queues (Qn, qn), . . . , (Q1, q1), with

    Priority: Qn > · · · > Q1
    Quantum:  q1 ≥ · · · ≥ qn

There are n round-robin ready queues, where the priority of Qi is greater than that of Qj if i > j. Threads in Qi have quantum qi.

1. Scheduler selects a thread from the highest priority queue to run. Threads in Qi−1 are only selected if Qi is empty.

2. Pre-empted threads from Qi are enqueued on the next lower-priority queue Qi−1.

This indicates that the thread is less likely to block.

3. When a thread wakes from blocking, it is put onto Qn (highest priority).

This indicates that the thread is more likely to block.

Unfortunately, MLFQ is not immune to starvation, but the easy fix is to bump every thread to the top queue once in a while.

Note. Many variants of MLFQ will pre-empt low-priority threads when a thread wakes, to ensure a fast response.

5.3 Linux Completely Fair Scheduler

The Linux Completely Fair Scheduler (CFS) is used in Linux. Each thread can be assigned a weight. The goal of the scheduler is to ensure that each thread gets a “share” of the processor in proportion to its weight.

1. Track the “virtual” runtime of each runnable thread.

2. Always run the thread with the lowest virtual runtime.

The virtual runtime is the actual runtime adjusted by the thread weights.

tvirtual := tactual · (Σj wj)/wi

where the ratio (Σj wj)/wi is the thread's factor.

Time advances slowly for high-weight threads. Example (Thread 1 will run at time t):

    Time  Thread  Weight  Factor  Actual T.  Virtual T.
    t     1       25      2       5          10
          2       20      5/2     5          12.5
          3       5       10      5          50

The tie-breaker at equal virtual time is usually the last-executed time.
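The virtual-runtime formula and the table above can be checked in a few lines of C:

```c
/* CFS virtual runtime: actual runtime scaled by the factor
 * (sum of all weights) / (this thread's weight), so virtual time
 * advances slowly for high-weight threads. */
double virtual_runtime(double actual, double weight, double total_weight) {
    return actual * (total_weight / weight);
}
```

With weights 25, 20, and 5 (total 50) and 5 units of actual runtime each, this reproduces the virtual times 10, 12.5, and 50 from the table.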

5.4 Scheduling on Multi-Core Processors

Two methods:

• Per-core ready queue: This one scales to a large number of cores more easily.

• Shared ready queue: Access must be guarded by a lock, and the CPUs will contend for the lock. Also, a thread may run on a different core when woken up, giving poor cache affinity.


5.4.1 Load-Balancing

In the per-core design, queues may have different lengths. This results in load imbalance across the cores: some cores may be idle while others are busy. This is not an issue in the shared-queue design. Per-core designs typically need some mechanism for thread migration.


Caput 6

Devices and Device Management

Devices are how a computer receives input and produces output, e.g. keyboard, printer, touch screen. Examples in Sys/161:

• Timer/Clock: Current time, timer, beep

• Disk drive: Persistent storage

• Serial console: Character I/O

• Text screen: Character-oriented graphics

• Network Interface: Packet I/O

6.1 Device I/O

The bus is a communication pathway between various devices in a computer.

• Internal bus: The memory bus or front-side bus is for communication between the CPU and RAM. It is fast and close to the CPU.

• Peripheral or Expansion bus: Allows devices in the computer to communicate.

A bridge connects two different buses. Communication with devices is carried out through device registers located in memory. There are three primary types:

• Status: Tells you something about device’s current state. Typically read only.

• Command: Issue a command to the device by writing a particular value to this register.

• Data: Used to transfer larger blocks of data to/from device.

Some device registers are combinations of primary types e.g. status and command. IRQ means interrupt request.

6.1.1 Device Drivers

A device driver is the part of the kernel that interacts with a device. Example: writing a character to a serial output device:

// only one writer at a time
P(write_semaphore);
write_characters();
while (status != COMPLETED)
    read_IRQ();
V(write_semaphore);

To avoid polling, we may use a semaphore combined with interrupts:

• Device Driver Write:


P(write_semaphore);
write_characters();

• Interrupt Handler

write_IRQ(); // Acknowledge completion
V(write_semaphore);

How can a device driver access device registers? Two methods:

• Port-Mapped I/O with special I/O instructions: Device registers are assigned port numbers, which correspond to regions of memory in a separate, smaller address space.

Special I/O instructions on x86 transfer data between a specified port and a CPU register.

• Memory Mapped I/O: Each device register has a physical memory address. Data transfer is done with load/store.

Large data blocks can be transferred using other methods:

• Program-Controlled I/O: The device driver moves the data between memory and a buffer on the device. The CPU is busy while data is transferred.

• Direct Memory Access (DMA): The device itself is responsible for moving data to/from memory. The CPU is not busy during this data transfer.

6.2 Hard Disks

A hard disk is a commonly used persistent storage device. A hard disk is made from a number of spinning, ferromagnetic-coated platters read/written by a R/W head.

• A disk is an array of numbered blocks (sectors)

• Each block is the same size (e.g. 512 bytes)

• Blocks are the unit of transfer between disk and memory. Usually one or more contiguous blocks can be transferred in a single operation.

• Assume for simplicity that each track contains the same number of sectors.

This is a massive simplification but will hold if the disk is a drum instead of a platter.

In a seek:

1. Move the R/W head to the correct track (radius).

2. Rotate the platter so the correct sector is under the R/W head

Cost model for Disk I/O:

• Seek time: Move the R/W head to the appropriate track. This depends on the seek distance.

• Rotational latency: Wait until the target sector spins to the R/W head. This depends on the rotation speed of the disk.

• Transfer time: Wait while the target sector spins past the R/W head.

(Request Service Time) = (Seek Time) + (Rotational Latency) + (Transfer Time)

• BytesPerTrack = DiskCapacity/NumTracks

• BytesPerSector = BytesPerTrack/NumSectorsPerTrack

• Maximum rotational latency: MaxLatency = 60/RPM

• AverageSeek = MaxSeek/2

• AverageLatency = MaxLatency/2


• SectorLatency = MaxLatency/NumSectorsPerTrack

• TransferTime = SectorLatency ·#ConsecutiveSectors

• RequestServiceTime = Seek + RotationalLatency + TransferTime.

If the position of the R/W head and platter is not known, we use AverageSeek and AverageLatency.

e.g. Suppose DiskCapacity = 2^32, NumTracks = 2^20, NumSectorsPerTrack = 2^8, RPM = 10000, MaxSeek = 0.02. All time units are in seconds.

• BytesPerTrack = 2^12

• BytesPerSector = 2^4

• MaxLatency = 0.006

• AverageSeek = 0.01

• AverageLatency = 0.003

• SectorLatency = 0.006/256

• RequestServiceTime = 0.01 + 0.003 + 0.006/256 ≈ 0.0130
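The cost model and this example can be reproduced in C (a sketch of the formulae above; all times in seconds):

```c
/* Expected service time for a random request of `sectors` consecutive
 * sectors: average seek + average rotational latency + transfer time. */
double request_service_time(double rpm, double max_seek,
                            double sectors_per_track, int sectors) {
    double max_latency = 60.0 / rpm;       /* time for one full revolution */
    double avg_seek = max_seek / 2.0;
    double avg_latency = max_latency / 2.0;
    double sector_latency = max_latency / sectors_per_track;
    return avg_seek + avg_latency + sector_latency * sectors;
}
```

For the example parameters (RPM = 10000, MaxSeek = 0.02, 256 sectors per track, one sector) this gives about 0.0130 seconds.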

Large transfers to/from a disk device are more efficient than smaller ones; that is, the cost per byte is smaller for large transfers. Sequential I/O is faster than non-sequential I/O. Sequential I/O is not always possible, so we group requests to try to reduce the average request time.

Note. Historically, seek time is the dominating cost.

6.2.1 Disk Head Scheduling

The R/W tasks are put onto a task queue. The goal of disk head scheduling is to reduce seek time by controlling the order in which requests are serviced. Disk head scheduling may be performed in software, hardware, or a combination of both.

Possible algorithms:

• First-come-first-served: Fair and simple but offers no optimisation for seek time.

• Shortest-Seek-Time-First (SSTF): Choose closest request. Seek times are reduced but requests may starve.

• Elevator Algorithm (SCAN): The disk head moves in one direction until there are no more requests in front of it. Then it reverses.

Device registers in Sys/161:

    Offset  Size  Type                Description
    0       4     Status              Number of sectors
    4       4     Status and Command  Status
    8       4     Command             Sector Number
    12      4     Status              Rotation Speed
    32768   512   Data                Transfer Buffer (1 sector)

• disk_semaphore starts with 1, and disk_completion_semaphore starts with 0.

• Device driver write handler (remember that OS/161 uses memory-mapped I/O):

P(disk_semaphore);
/*
 * 1. Copy data from memory to device transfer buffer (in RAM)
 * 2. Write target sector number to disk sector number register
 * 3. Write "write" command to disk status register
 */
P(disk_completion_semaphore); // Wait for request completion
V(disk_semaphore);


• Interrupt handler for disk device:

/*
 * 1. Make device ready again.
 * 2. Write disk status register to acknowledge completion
 */
V(disk_completion_semaphore);

• Device driver read handler:

P(disk_semaphore);
/*
 * 1. Write target sector number to disk sector number register
 * 2. Write "read" command to disk status register
 */
P(disk_completion_semaphore); // Wait for request completion
V(disk_semaphore);

• Interrupt handler for disk device:

/*
 * 1. Make device ready again.
 * 2. Write disk status register to acknowledge completion
 */
V(disk_completion_semaphore);

6.2.2 Solid State Drives

An SSD has no moving parts. SSDs have integrated circuits for persistent storage instead of magnetic surfaces. There are a variety of implementations:

• DRAM: Requires constant power to keep values

• Flash Memory: Traps electrons in a quantum cage

An SSD is logically divided into blocks and pages:

• 2,4,8 KB pages (initialised to 1’s)

• 32KB to 4MB blocks

• Read/Writes are at page level.

Writing/Deleting from Flash Memory:

• Naive solution:

1. Read whole block into memory

2. Re-initialise the entire block

3. Write back to SSD

• SSD controller handles requests (faster):

1. Mark page to be deleted/overwritten as invalid

2. Write to an unused page

3. Update translation table

4. Requires garbage collector

Note. Each block of an SSD has a limited number of write cycles before it becomes read-only. SSD controllers perform wear leveling, distributing writes evenly across blocks so the blocks wear down at an even rate.

Hence defragmentation can be harmful to the lifespan of an SSD. Additionally, since there are no moving parts, defragmentation offers no performance advantage.


6.2.3 Persistent RAM

Values are persistent in the absence of power.

• ReRAM: Resistive RAM

• 3D XPoint, Intel Optane

This can be used to improve the performance of secondary storage.


Caput 7

File Systems

Files are persistent, named data objects.

• Data consists of a sequence of numbered bytes.

• Files may change size over time

• Files may have associated metadata (type, timestamp, access control)

Definition

File systems are data structures and algorithms used to store, retrieve, and access files.

• Logical file system: High-level API; what a user sees

• Virtual file system: Abstracts the lower-level file systems and presents the different underlying file systems to the user as one.

This layer does not have to exist.

• Physical file system: How files are stored on physical media.

7.1 File Interface

File system has the interface calls:

• open(filename, flags): Returns a file identifier (or handle, descriptor), which is used in subsequent operations to identify the file.

Other operations require the file descriptor as a parameter.

• close(file): Closes a file descriptor.

Kernels keep track of valid file descriptors for each process.

• read, write: Read/Write copies data from/to a file into a virtual address space.

R/W operations start from the current file position and update the current file position as bytes are read/written.

• seek/lseek: Moves the file position. Used for non-sequential reading/writing

Seeks on Windows and Linux platforms do not check whether the sought position is valid, but R/W on an invalid position will cause an error. (Behaviour is API dependent.)

• get/set file meta-data. (Unix fstat, chmod, ls -la)

Each file descriptor has an associated file position. The position is 0 when the file is opened.


7.2 Directories and File Names

A directory is a file that maps file names to i-numbers. An i-number is a unique (within a file system) identifier for a file or directory. Given an i-number, the file system can find the data and meta-data for the file.

Directories provide a way for applications to group related files. Since directories can be nested, a filesystem's directories can be viewed as a tree, with a single root directory. The files are leaves in the directory tree.

Files may be identified by pathnames which describe a path from the root directory to the file. Directories also havepathnames.

Only the kernel is permitted to edit directories (as files).

7.2.1 Links

A hard link is an association between a name (string) and an i-number. Each entry in a directory is a hard link.

• A hard link is created upon file creation.

• Once a file is created, additional hard links can be made to it[1].

Linking to an existing file creates a new pathname for that file.

It is not possible to link to a directory; this avoids cycles.

Hard links are rare and in most use cases are replaced by soft links (or symbolic links). Soft links are files storing the path to a hard link.

Hard links can be removed using unlink <path>. This removes the link specified by <path>. The file's content is not actually removed, but if no links remain, the file system loses the ability to access it.

7.2.2 Virtual File Systems

Sometimes a system has multiple file systems. Mounting does not make two file systems into one file system. Mounting creates a single, hierarchical namespace that combines the namespaces of the two file systems. The new namespace is temporary and exists only while the file system is mounted.

7.3 File System Implementation

Data need to be stored persistently: file data, file meta-data, directories and links, and file system meta-data. Non-persistent information is not stored: file descriptors, file positions, the open file table, and cached copies of persistent data.

We shall use the Very Simple File System (VSFS) as an example. Suppose:

• The disk size is 256 KB.

• The sector size is 512 B, for a total of 512 sectors on the disk.

Memory is usually byte addressable and disk is sector addressable.

• 8 consecutive sectors are a block.

Blocks are 4 KB with 64 total blocks.

Blocks 8, . . . , 63 (the last 56 blocks) are used to store user data. The first 8 blocks are used to store metadata.

• Each i-node is 256 B, and blocks 3, . . . , 7 are allocated to i-nodes, so there are 80 total i-nodes/files.

The i-nodes are stored in an array and the index into the array is the file’s index number (i-number).

• Block 1 is allocated to a bitmap to indicate which i-nodes are unused.

• Block 2 is allocated to a bitmap to indicate which blocks are unused.

A block size of 4 KB means we can track 32K i-nodes and blocks, which is far more than we actually need.

• Block 0 is the superblock and contains meta-information about the entire file system: how many i-nodes and blocks are in the system, where the i-node table begins, etc.

[1] The unix linking utility is ln [-s] <target> <source>. -s indicates a soft link.


Descriptio 7.1: Allocation of the first 8 blocks of VSFS (figure: block 0 is the superblock, block 1 the i-node bitmap, block 2 the data bitmap, and blocks 3–7 hold the i-nodes)

7.3.1 i-nodes

i-node fields may include:

• File type, permission, length

• Number of file blocks

• Time of last access/file update

• Time of last i-node update

• Number of hard links

• Direct data block pointers

• Single, double, and triple indirect data block pointers.

Assume that disk blocks can be referenced by a 4-byte address. Then the maximum storage the file system can address is

2^32 · 4 KB = 16 TB

Assume that there is enough room for 12 direct pointers to blocks in an i-node. Then the maximum possible file size is

12 · 4 KB = 48 KB

To store larger files, we need indirect pointers: pointers to blocks full of direct pointers. The maximum file size for n indirect pointers of k indirection levels, where each pointer block consists of M pointers, is

n · M^k · (Block size)
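Combining the direct pointers with this formula gives the overall maximum file size; a small C sketch:

```c
#include <stdint.h>

/* Maximum file size for d direct pointers plus n indirect pointers of
 * k indirection levels, where each pointer block holds M pointers. */
uint64_t max_file_size(uint64_t d, uint64_t n, uint64_t k,
                       uint64_t M, uint64_t block_size) {
    uint64_t reach = 1;               /* data blocks reachable per pointer */
    for (uint64_t i = 0; i < k; i++)
        reach *= M;                   /* M^k */
    return (d + n * reach) * block_size;
}
```

With 4KB blocks and 4-byte block addresses (so M = 1024), 12 direct pointers alone give 48 KB, and adding one single-indirect pointer gives (12 + 1024) · 4 KB.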

• i-node bitmap: Keeps track of which i-node is used

• data bitmap: Keeps track of which data blocks are used.

Suppose we have a file path P = /D1/ · · · /Dm/F .

• open(filename) when F already exists:

1. Read i-node of Di (check permission)

2. Read data of Di to find i-node of Di+1

3. Repeat until we reach the file i-node F .

• create(filename)

1. Read i-node of Di (check permission)

2. Read data of Di to find i-node of Di+1

3. Repeat until we reach Dm

4. Read and write i-node bitmap to indicate that a new i-node is being allocated.

5. Read the file i-node F: this step is needed since an i-node is smaller than a block, so we must read the block's content before modifying it and writing it back.

6. Write to file i-node F

7. Write to folder i-node Dm

• read(file):


1. Read file i-node F to acquire the pointers.

2. Dereference indirect pointers until a direct pointer is reached.

3. Read file content.

4. Write i-node F to change file last access time.

• write(file):

1. Read file i-node F to acquire the pointers.

2. Read data bitmap

3. Write data bitmap

4. Write file data (do not need to read first since this is a whole block write)

5. Write i-node F to change the file's last update time.

The OS maintains an i-node cache to improve performance.

7.3.2 Alternatives to Pointers

VSFS uses a per-file index (direct and indirect pointers) to access blocks. There are two alternative methods:

• Chaining: Each block includes a pointer to the next block

Chaining uses a lot of space for pointers: a 4 TB file may use up to 4 GB of pointer space.

• External Chaining: The chain is kept as an external structure.

Microsoft’s File Allocation Table (FAT) uses external chaining.

Chaining is acceptable for sequential access, but is very slow for random access.

7.4 File System Design

File system parameters:

• How many i-nodes should a file system have?

• How many direct and indirect blocks should an i-node have?

• What is the block size?

For a general-purpose file system (designed to be efficient for the common case):

• Most files are small (around 2 KB).

• The average file size is growing.

• A file system holds roughly 100 K files on average.

• Directories usually contain few files.

• Even as disks grow larger, file systems stay roughly 50% full on average.

A single logical file system operation may require several disk I/O operations. What if a logical operation fails half-way? A crash wipes out in-memory file system structures and can leave persistent (on-disk) structures in an inconsistent intermediate state. Persistent structures should be crash consistent, i.e. consistent when the system restarts after a failure.

Solutions (none perfect):

• Special-purpose consistency checkers (e.g. fsck for Unix ext2 file systems):

The checkers run after a crash and before normal operations resume. The checker finds, and attempts to repair, inconsistent file system data structures.

• Journaling (Veritas, NTFS, ext3), also called write-ahead logging:

Record file system meta-data changes in a journal (log), so that a sequence of changes can be written to disk in a single operation.

Only after the changes have been journaled are the on-disk data structures updated (hence write-ahead logging).

After a failure, redo the journaled updates in case they were not applied before the failure.


Caput 8

Virtual Machines

A virtual machine is a simulated or emulated computer system. Sys/161 is an emulation of a MIPS R3000. Virtual machines provide the ability for one machine to act as many.

Operating systems and programs can run on virtual machines1 in isolation. The OS and programs should not be aware of the virtualised hardware and should operate normally, without modification or patching.

8.1 Hypervisors

A system is composed of a CPU, RAM, and other devices. The CPU executes the instructions of both the OS and user programs. All of these components need to be virtualised.

The virtual CPU executes the instructions of its operating system (the guest) and the user programs running within that OS. The hypervisor captures these instructions and translates them into instructions for the real CPU, where they are executed as a normal program.

1Virtual machines were invented in the 1960s and were very slow. They resurged in 2005, when CPUs started to offer hardware support for virtualisation.


Additamentum A

OS/161

The examples in this course use OS/161 version 1.99, an educational operating system that runs on a MIPS simulator. The following convention is used for register allocation1:

Number  Name    Use
0       z0      Always zero
1       at      Assembler reserved
2       v0      Return value / syscall number
3       v1      Return value
4–7     a0–a3   Subroutine arguments
8–15    t0–t7   Temporaries (caller-save)
16–23   s0–s7   Saved (callee-save)
24–25   t8–t9   Temporaries (caller-save)
28      gp      Global pointer
29      sp      Stack pointer
30      s8/fp   Frame pointer (callee-save)
31      ra      Return address

A.1 Testing The OS

OS/161 does not reclaim memory when threads exit, and thus will eventually run out of memory, causing a kernel panic. So instead of testing interactively with

$ sys161 kernel

OS/161 kernel [? for menu]: test

...

the commands can be combined succinctly, so that the kernel exits directly after a test ends:

$ sys161 kernel "test; q"

A very useful breakpoint to set is in the panic function, whose interface is in $KI/lib.h and whose implementation is in $KL/kprintf.c.

A.1.1 Debugging

gdb is used to debug the OS, but it has to be used in a specific way: running gdb directly on the simulator would debug the simulator itself, which is not what we want. Instead, the following workflow is used:

1. Compile and install the OS.

2. Prepare two instances of ssh. Make sure they log onto the same machine (run $ hostname to check the machine name).

3. $ sys161 -w kernel: This instructs the simulator, sys161, to run the kernel and wait for a debugger to attach.

4. In the other instance, execute

1The register allocation can be found in $KAMI/kern/regdefs.h


$ cd $ROOT

$ cs350-gdb kernel

(gdb) dir ../os161-1.99/kern/compile/ASST0

(gdb) target remote unix:.sockets/gdb

5. Now the debugger is ready, and the sys161 instance should report that a debugger has been attached. To continue the program, use cont (shortcut c), or set breakpoints.

A.2 Directory Abbreviations

$PROJ = os161-1.99

$ROOT = root

$K = $PROJ/kern

$KA = $K/arch

$KAM = $KA/mips

$KAMI = $KAM/include

$KAMS = $KAM/syscall

$KAMT = $KAM/thread

$KC = $K/conf

$KD = $K/dev

$KF = $K/fs

$KI = $K/include

$KL = $K/lib

$KP = $K/proc

$KSu = $K/startup

$KSp = $K/synchprobs

$KSc = $K/syscall

$KT = $K/test

$KTh = $K/thread

$KVf = $K/vfs

$KVm = $K/vm

$U = $PROJ/user

$UB = $U/bin

$UI = $U/include

$UL = $U/lib

$UT = $U/testbin
