LectureCA All Slides


Transcript of LectureCA All Slides

Page 1: LectureCA All Slides


Slide 1 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Slide 2 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

This is the collection of lecture slides* of the lecture „Computer Architecture“ taught in the winter semester 06/07 at the University of Duisburg-Essen. I slightly revised the surveys of the subjects and added slide numbers.

* Actually, this is the internet version of the lecture slides. With respect to the slides used in the lectures, animations are removed (errors hopefully as well) and additional text is added.

Stefan Freinatis, March 2007

Slide 3 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Lecture:
Dr.-Ing. Stefan Freinatis
Fachgebiet Verteilte Systeme (Prof. Geisselhardt)
Room BB 1017

Exercises:
Dipl.-Math. Kerstin Luck
Fachgebiet Verteilte Systeme
Room BB 910

Slide 4 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Times & Dates Computer Architecture

1. 25.10.06
(01.11.06 All Saints' Day, public holiday in NRW, no lectures)
2. 08.11.06
3. 15.11.06
4. 22.11.06
5. 29.11.06
6. 06.12.06
7. 13.12.06
8. 20.12.06
9. 10.01.07
10. 17.01.07
11. 24.01.07
12. 31.01.07
13. 07.02.07

Lecture: 08:15 – 09:45
Exercise: 10:00 – 10:45


Slide 5 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Resources Computer Architecture

Homepage „Verteilte Systeme“:
http://www.fb9dv.uni-duisburg.de/vs/de/index.htm

Direct link to the homepage of the lecture „Computer Architecture“:
http://www.fb9dv.uni-duisburg.de/vs/en/education/dv3/index2006.htm

Select ‚English‘ → ‚Lectures‘ → ‚Winter semester 2006/2007‘ → ‚Computer Architecture‘

Slide 6 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Topics

Introduction & History

1. Operating Systems (slide 34)
System layers, batching, multi-programming, time sharing

2. File Systems (slide 65)
Storage media, files & directories, disk scheduling

3. Process Management (slide 151)
Processes, threads, IPC, scheduling, deadlocks

4. Memory Management (slide 351)
Memory, paging, segmentation, virtual memory, caches

Slide 7 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Literature

[HP03] J. Hennessy, D. Patterson: Computer Architecture – A Quantitative Approach, 3rd ed., Elsevier Science, 2003, ISBN 1-55860-724-2.

[HP06] J. Hennessy, D. Patterson: Computer Architecture – A Quantitative Approach, 4th ed., Elsevier Science, 2006, ISBN 0-12-370490-1.

[Ta01] A. Tanenbaum: Modern Operating Systems, 2nd ed., Prentice Hall, 2001, ISBN 0-13-092641-8.

[Sil00] A. Silberschatz: Applied Operating System Concepts, 1st ed., John Wiley & Sons, 2000, ISBN 0-471-36508-4.

Slide 8 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Introduction


Slide 9 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Introduction

Computer Architecture is the conceptual design and fundamental operational structure of a computer system [Wikipedia].

Computer Architecture encompasses [HP03 p.9]:

• Instruction set architecture
stack or accumulator or general purpose register architecture

• Organization
memory system, bus structure, CPU design

• Hardware
machine specifics, logic design, technology

Slide 10 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Application Areas

General purpose desktops
balanced performance for a range of tasks, graphics, video, audio

Scientific desktops and servers
high-performance floating point and graphics

Commercial servers
databases, transaction processing, highly reliable

Embedded computing
low power, small size, safety critical

Slide 11 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer

Introduction

A computer is a person or an apparatus that is capable of processing information by applying calculation rules.

A computer is a machine for manipulating data according to a list of instructions known as a program [Wikipedia].

Generalized, technology-independent definition.

Slide 12 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

~ 5000 BC Basis of calculating is counting.
10 fingers ⇒ decimal system

~ 1000 BC Abacus (Suan Pan, Soroban)

Chinese Suan Pan, Roman Abacus


Slide 13 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

← Book from 1958

See also: http://www.ee.ryerson.ca/~elf/abacus/leeabacus/

Finger technique (from a Japanese book, 1954) ↓

Slide 14 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

300 BC … 1000 AD Roman numeral system
addition system, no zero

Numeral  Value
I            1
V            5
X           10
L           50
C          100
D          500
M         1000

Value ‚19‘: XVIIII or XIX

Not suitable for performing multiplications.

Slide 15 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

~ 500 AD Hindu-Arabic numeral system,
place value system, introduction of 0

Indian (3rd century bc)

Indian (8th century)

West-Arabic (11th century)

European (15th century)

European (16th century)

Today

Forms the basis for the development of calculation on machines.

Slide 16 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

1623 Wilhelm Schickard
Calculation machine

1641 Blaise Pascal
Adding machine

1679 G. W. Leibniz
Dyadic system (binary system)

1808 J. M. Jacquard
Punch card controlled loom


Slide 17 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

1833 Charles Babbage
Analytical Engine

• Data memory, program memory

• Instruction based operation

• Conditional jumps

• I/O unit

Slide 18 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

1847 George Boole
Logic on mathematical statements

1890 H. Hollerith
Punch card based tabulating machine

Digital data logging on punch cards.
First electromechanical data processing.

Slide 19 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

1936 Alan Turing
Philosophy of information, Turing machine
Founder of Computer Science

1941 Konrad Zuse
First electromechanical computer Z3
Binary arithmetic, floating point

Z3 rebuilt in 1961

Slide 20 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

Characteristics of the first 5 operative digital computers

Computer                   Nation   Shown working  Digital  Binary  Electronic  Programmable
Zuse Z3                    Germany  May 1941       Yes      Yes     No          By punched film stock
Atanasoff-Berry Computer   USA      Summer 1941    Yes      Yes     Yes         No
Colossus                   UK       1943           Yes      Yes     Yes         Partially, by rewiring
Harvard Mark I             USA      1944           Yes      No      No          By punched paper tape
ENIAC                      USA      1944           Yes      No      Yes         Partially, by rewiring
ENIAC (1948 modification)  USA      1948           Yes      No      Yes         By function table ROM

Information source: Wikipedia on „Z3“ or on „ENIAC“, English


Slide 21 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

History

Introduction

1945 John v. Neumann
Concept of universal computer systems
Founder of Computer Architecture

The von Neumann model of a universal computer (stored program computer): Input and Output exchange data with Memory; the Control unit fetches instructions from Memory and issues control signals to all units; the ALU exchanges data with Memory.

Slide 22 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Model

Introduction

A computer consists of 5 units:

• Control Unit
Interpretation of the program. Timing control of the units.

• Memory
Storage for program and data. Addressable storage locations. Read / Write.

• ALU (Arithmetic Logic Unit)
Performs calculations.

• Input Unit
• Output Unit
Communication with the environment

Slide 23 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Model

Introduction

Microcomputer: the microprocessor (CPU) is connected to memory and to input / output devices (keyboard, monitor, ...) via address, data, and control buses.

Today:
• The input unit and output unit are combined (not necessarily physically!) to form the Input/Output unit (short: I/O unit).
• The control unit and the ALU are combined to form the microprocessor.

The v. Neumann model (or architecture) basically still applies to the majority of modern computer systems.

Slide 24 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Characteristics

von Neumann Model

• Architecture is independent of the problem to be processed
Universal stored program computer, not tailored to a specific problem.

• Random accessible memory locations
Selection of a location by means of an address. All locations have the same capacity.

• Both program and data reside in memory
The state of the machine (control unit) decides whether the content of a memory location is considered data or code.

• Computer is centrally controlled
The CPU has the master role.

• Sequential processing
Execution of a program is done instruction by instruction.


Slide 25 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Model

Introduction

Steps in executing an instruction:

Instruction phase
1. Fetch the instruction from memory and put it into the instruction register (in the CPU).
2. Evaluate (decode) the instruction.

Data phase
3. When needed for this particular instruction, address the data (the operands) in memory.
4. Fetch the data (usually into CPU internal registers).
5. Perform the operation on the data (usually carried out by the ALU) and write back the results.
6. Adjust the address counter to point to the next instruction.
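The steps above can be sketched as a toy fetch-decode-execute loop; the instruction format and opcodes below are invented for illustration and do not correspond to any real ISA:

```python
# Toy stored-program machine: one dict serves as memory for both code and data.
# Instruction format (invented for illustration): (opcode, operand_address).
def run(memory, pc=0):
    acc = 0  # accumulator register inside the "CPU"
    while True:
        opcode, addr = memory[pc]      # step 1: fetch instruction into the CPU
        # step 2: decoding is the dispatch below
        if opcode == "LOAD":           # steps 3+4: address and fetch the operand
            acc = memory[addr]
        elif opcode == "ADD":          # step 5: the "ALU" performs the operation
            acc = acc + memory[addr]
        elif opcode == "STORE":        # step 5: write back the result
            memory[addr] = acc
        elif opcode == "HALT":
            return memory
        pc += 1                        # step 6: advance the address counter

# Program computing C = A + B, with A, B, C at addresses 10, 11, 12.
mem = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("STORE", 12), 3: ("HALT", 0),
       10: 2, 11: 3, 12: 0}
run(mem)
print(mem[12])  # 5
```

Note how code and data share the same memory: only the state of the machine decides whether a location is interpreted as an instruction or as an operand.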

Slide 26 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Bus System

v. Neumann Bottleneck

Introduction

Memory accesses in executing C = A + B (A, B, C: data in memory). Over time, the CPU side and the memory side exchange over the address bus and data bus: address of instruction → instruction, address of A → A, address of B → B, address of C → C.

Slide 27 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

v. Neumann Bottleneck

Introduction

The data is processed faster by the CPU than it can be taken from or stored in memory.

The processor↔memory interface is crucial for the overall computation performance.

Reduction of the bottleneck effect through introduction of a hierarchical memory organization.

Register – Cache – Main memory

Slide 28 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance

Performance: the work done in a certain amount of time.

Introduction

Performance is defined like power: P = W / t

Work can have the meaning of
• processing an instruction,
• carrying out a floating-point or an integer operation,
• processing a standardized program (benchmark).


Slide 29 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance

• Clock rate [Hz]The frequency at which the CPU is clocked.

• MIPSMillion instructions per second

• FLOPSFloating point operations per second

Introduction

Popular performance measures
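How these measures relate can be shown with a small calculation; all figures below are invented for illustration and describe a hypothetical CPU:

```python
# Hypothetical CPU, invented figures: 2 GHz clock, 4 clock cycles per instruction.
clock_rate_hz = 2.0e9
cycles_per_instruction = 4.0

instructions_per_second = clock_rate_hz / cycles_per_instruction
mips = instructions_per_second / 1e6   # million instructions per second
print(mips)  # 500.0

# If, say, one in ten instructions is a floating point operation:
flops = instructions_per_second * 0.1
print(flops)  # 50000000.0, i.e. 50 MFLOPS
```

The same clock rate yields very different MIPS figures depending on the cycles per instruction, which is one reason these measures are of limited expressiveness.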

Slide 30 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance

Introduction

Many performance measures are not very expressive, as they do not

• consider the number of instructions being carried out per cycle (parallel execution),

• cover the effective throughput between CPU and memory,

• distinguish between complex instruction set computers (CISC) and reduced instruction set computers (RISC).

Slide 31 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Performance

Introduction

Computer performance compared to a VAX-11/780 from 1978.

Figure from [HP06 p.3]

Slide 32 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Moore‘s Law

Gordon Moore empirically observed in 1965 that the number of transistors on a chip doubles approximately every 12 months.

In 1975 he revised his prediction to the number of transistors on a chip doubling every two years.

See also: www.thocp.net/biographies/papers/moores_law.htm

Moore‘s Law: N(t) ≈ N0 · 10^(0.15 · t), where t is in [years]

Gordon E. Moore

Introduction
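The formula can be checked numerically: since 10^(0.15 · 2) ≈ 2, it encodes a doubling roughly every two years. A short sketch (the starting count N0 is arbitrary):

```python
def transistors(n0, t_years):
    # Moore's Law as stated on the slide: N(t) ≈ N0 * 10**(0.15 * t)
    return n0 * 10 ** (0.15 * t_years)

two_year_factor = transistors(1000, 2) / 1000
print(round(two_year_factor, 3))  # 1.995, i.e. approximately a doubling
```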


Moore‘s Law

Image source: Wikipedia on „Moore‘s Law“, English

Slide 34 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Operating Systems
System layers (36)
Early computer systems (42)
Batch systems (46)
Multi-program systems (50)
Time sharing systems (54)
Modern systems (57)

Slide 35 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

An operating system is a program that acts as an intermediary between a user of a computer and the computer hardware [Sil00 p.3].

Purpose: provision of an environment in which a user can execute programs.

Objectives:

• to make the system convenient to use
Usability, extending the machine beyond low level hardware programming

• to use the hardware in an efficient manner
Resource management, manage the hardware allocation among different programs

Slide 36 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

Computer system layers. Figure from [Sil00 p.4]


Slide 37 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

System Layers

1. Hardware – provides basic computing resources: CPU, memory, I/O, and devices connected to I/O.

2. Operating system – coordinates the use of the hardware among the various application programs for the various users.

3. Application programs – the programs used to solve the computing problems of the users.

4. Users – people, machines, or other computers using the computer system.

Operating systems

Slide 38 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

System Layers

Operating systems

Computer system layers. Figure from lecture CA WS 05/06, original source unknown

Slide 39 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

Usability – the operating system as an Extended Machine

The architecture of most computers at the machine language level is awkward to program, especially for I/O.

The operating system

• shields the programmer from the hardware details,

• provides simple(r) interfaces,

• offers high level abstractions and, in this view, presents the user with the equivalent of an extended machine.

See also [Ta01 p.4]

Slide 40 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

Resource Management – the operating system as a Resource Manager

Computer resources: processor(s), memory, timers, disks, network interfaces, printer, graphics card, ...

The operating system

• keeps track of who is using which resource,

• grants or denies resource requests,

• accounts for the usage of resources.

See also [Ta01 p.5]


Slide 41 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Resource Management

Operating systems

Resource management may be divided into

• time management (e.g. CPU time, printer time),

• and space management (e.g. memory or disk space).

Resource management incorporates

• process management,

• memory management,

• file system management,

• device management.

Before going into these subjects, let‘s have a look at the computer development since 1945.

Slide 42 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems

Operating systems

First computer generation (1945 – 55)

• Vacuum tubes

• A single group of people did all the work
design, construction, programming, operating, maintenance

• Programming in machine language
plugboard, no programming languages

• Users directly interact with the computer system

• Programs directly interact with the hardware

• No operating system

Slide 43 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems

Operating systems

First computer generation (1945 – 55)

IBM 407 Accounting Machine
Electromechanical tabulator

Source: http://www.columbia.edu/acis/history/407.html

Wiring panel (plugboard)

Slide 44 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems

Operating systems

First computer generation (1945 – 55)

IBM 402 plugboard

Source: http://www.columbia.edu/acis/history/plugboard.html


Slide 45 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Early Computer Systems

Operating systems

First computer generation (1945 – 55)

Slide 46 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems

Operating systems

Second computer generation (1955 – 65)

• Transistors, mainframe computers

• First high level programming languages
Fortran (Formula Translation), Algol (Algorithmic Language), Lisp (List Processing)

• No direct user interaction with the computer
Everything went via the computer operators.

• Users submit jobs to the operator
job = program + data + control information

• Operator batched jobs
Composition of jobs with similar needs

Slide 47 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems

Operating systems

Figure from [Ta01 p.9]

Structure of a typical FMS (Fortran Monitor System) batch job

Second computer generation (1955 – 65)

Slide 48 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems

Operating systems

Figure from [Ta01 p.8]

Batch job processing scene [Tanenbaum]

Second computer generation (1955 – 65)


Slide 49 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Batch Systems

Operating systems

Second computer generation (1955 – 65)

• Resident monitor program in memory
The monitor program loads one job after another (from tape). (Memory: monitor program + one job.)

• Sequenced job input
Jobs from tape or from a card reader. The monitor program cannot select jobs on its own.

• One job in memory at a time

• CPU often idle
waiting for slow I/O devices

Slide 50 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems

Operating systems

Third computer generation (1965 – 80)

• Integrated Circuits

• Disks
Direct access to several jobs on disk. Now the operating system can select jobs (job scheduling).

• Multiprogrammed Batch Systems
Several jobs in memory at the same time (Memory: operating system + jobs 1 ... 4).
The operating system shares CPU time among the jobs (CPU scheduling).
Better CPU utilization

Slide 51 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems

Operating systems

Assume program A being executed on a single-program computer. The program needs two I/O operations. CPU usage over time: A1, I/O, A2, I/O, A3.

Assume program B being executed on the same computer at some other time. The program needs no I/O. CPU usage over time: B1, B2, B3.

Slide 52 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems

Operating systems

Now assume programs A and B being executed on a multi-program computer. While A waits for its I/O operations, the CPU executes B. CPU usage over time: A1, B1 (during A's first I/O), A2, B2 (during A's second I/O), A3, B3.

Total execution time on a single-program computer: all CPU bursts and I/O waits in sequence.

Total execution time on a multi-program computer: shorter, since B's CPU bursts overlap with A's I/O waits.
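The gain can be quantified with a small calculation; the durations below are invented for illustration, and the sketch simplifies by assuming B's work can fill A's I/O wait time completely:

```python
# Invented durations in ms, following the slide's structure:
# program A runs as A1, I/O, A2, I/O, A3; program B runs as B1, B2, B3 (no I/O).
a_cpu = [20, 20, 20]   # CPU bursts A1, A2, A3
a_io = [30, 30]        # A's two I/O waits
b_cpu = [25, 25, 25]   # CPU bursts B1, B2, B3

# Single-program computer: A (including its I/O waits) completes, then B runs.
single = sum(a_cpu) + sum(a_io) + sum(b_cpu)

# Multi-program computer: while A waits for I/O, the CPU executes B, so up to
# sum(a_io) milliseconds of B's work cost no extra wall-clock time.
overlap = min(sum(b_cpu), sum(a_io))
multi = single - overlap

print(single, multi)  # 195 135
```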


Slide 53 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprogram Systems

Operating systems

Third computer generation (1965 – 80)

• Multiprogram computers were still batch systems

• Desire for quicker response time
It took hours/days until the output was ready. A single misplaced comma could cause a compilation to fail, and the programmer wasted half a day [Ta01 p.11].

• Desire for interactivity
Users wanted to have the machine ‘for themselves’, working online.

⇒ These requests paved the way for time sharing systems (still in the third computer generation)

Slide 54 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Time Sharing Systems

Operating systems

Third computer generation (1965 – 80)

• Direct user interaction
Many users share a computer simultaneously. Terminals ↔ Host.

• Many jobs awaiting execution

• Multiple job execution with high frequency switching
The operating system must provide more sophisticated CPU scheduling.

• Disk as backing store for memory
Virtual memory. The operating system performs swapping, address translation, and memory protection (memory management).

• Disk as input / output storage
Need for the OS to manage user data (file system management)

Slide 55 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Time Sharing Systems

Assume programs A and B as previously. Execution on a time sharing system: the CPU switches between A and B in small time slices; in the figure, program B has finished while A still waits for I/O (CPU idle).

Time sharing is not necessarily faster. Compare to the multiprogramming example.

Small time slices allow for interactivity (quasi-parallel execution).

Slide 56 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Layout

Batch system: memory holds the monitor program and one job.

Multi program system: memory holds the operating system and several jobs (job 1 ... job 4).

Time sharing system: memory holds the operating system and many programs (program 1 ... program n); only a part of them fits into working memory at a time.


Slide 57 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Modern Systems

Operating systems

Fourth computer generation (1980 – present)

• Single-chip CPUs

• Personal Computers

• Real-Time Systems

• Multiprocessor Systems

• Distributed Systems

• Embedded Systems

CP/M

MS-DOS, DR-DOS

Windows 1.0 ... Windows 98 / ME

Windows NT 4.0 ... 2003, XP

XENIX, MINIX, Linux, FreeBSD

Slide 58 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real Time Systems

Modern systems

• Rigid time requirements

• Hard Real Time
Industrial control & robotics
Guaranteed response times
Slimmed OS features (no virtual memory)

• Soft Real Time
Multimedia, virtual reality
Less restrictive time requirements

Slide 59 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multiprocessor Systems

Modern systems

• n processors in the system (n > 1), tightly coupled

• Resource sharing

• Symmetric Multiprocessing
Each CPU runs an identical copy of the OS.
All CPUs are peers (no master-slave).

• Asymmetric Multiprocessing
Each CPU is assigned a specific task.
Task assignment by a master CPU.

Slide 60 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Distributed Systems

Modern systems

• n computers/processors (n > 1), loosely coupled

• Individual computers

• Autonomous operation

• Communication via network
File sharing, message exchange

• Network Operating System


Slide 61 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Embedded Systems

Modern systems

• Dedicated to specific tasks

• Encapsulated in a host device
invisible, usually not repaired when defective

• Small in size, low energy

• Sometimes safety-critical
automotive drive-by-wire, medical apparatus

• Custom(ized) operating system
Little or no file I/O, sometimes multitasking, no fancy OSs.

Slide 62 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Resource Management

Operating systems

• Process management
Creation of processes (programs in execution) and sharing the CPU among them. Control of execution time. Enabling communication between processes.

• Memory management
Assigning memory areas to processes. Organizing virtual memory.

• File system management
Creation and organization of a logical storage location where data (user data, system data, programs) can be persistently stored in terms of files. Assigning rights and managing accesses. Maintenance.

• Device management
Low level administrative work related to the specifics of the I/O devices. Translations, low level stream processing. Usually done by device drivers.

Slide 63 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

An operating system in the wide sense is the software package for making a computer operable.

The operating system in the narrow sense is the one program running all the time on the computer (the kernel). It consists of several tasks and is asked for services through system calls.

Image source: Wikipedia on ‚kernel‘, English

Slide 64 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Operating Systems

Operating system categories

• Single User - Single Tasking: CP/M, MS-DOS

• Single User - Multi Tasking: Windows, MacOS

• Multi User - Single Tasking

• Multi User - Multi Tasking: Unix, VMS


Slide 65 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

File System Management

Slide 66 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)

Slide 67 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Storage Media

Storage hierarchy. Figure from [Sil00 p.31]: from primary storage (low access time) down to secondary storage (high access time).

Slide 68 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Storage Media

Cost versus access time for DRAM and magnetic disks [HP06 p.359]. Flash memory lies in the access-time gap between them (around 1 ms ... 10 ms).


Slide 69 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Storage Media

Requirements for secondary storage:

• Store large amounts of data
Much more data than fits into (virtual) memory

• Persistent store
The information must survive the termination of the process creating or using it.

• Concurrent access to data
Multiple processes should be able to access the data simultaneously.

⇒ Storage of data on secondary storage media in terms of files.

Slide 70 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)

Slide 71 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Magnetic disk drive principle. Figure from [Sil00 p.29]: the host controller in the computer talks to the disk controller in the disk drive.

Slide 72 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Sector: smallest addressable unit on a magnetic disk.
Data size between 32 and 4096 bytes (standard: 512 bytes).

Several sectors may be combined to form a logical block. The composition is usually performed by a device driver. In this way the higher software layers only deal with abstract devices that all have the same block size, independent of the physical sector size. Such a block is also termed a cluster.

A disk sector (512 bytes of data). Figure from [Ta01 p.315]


Slide 73 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Formatted disk capacity:

bytes per sector × sectors per track = capacity of a track
× cylinders (number of tracks on a platter side) = capacity of one platter side
× tracks per cylinder (heads) = capacity of all platter sides = disk capacity

Example: CHS = (7, 2, 9), sector size 512 bytes
C = cylinders, H = heads = tracks per cylinder, S = sectors per track
Capacity = 512 × 9 × 7 × 2 bytes = 64512 bytes = 63 kB
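The calculation can be written out directly; the function below simply restates the formula from this slide:

```python
def formatted_capacity(cylinders, heads, sectors_per_track, bytes_per_sector=512):
    track = bytes_per_sector * sectors_per_track  # capacity of a track
    platter_side = track * cylinders              # capacity of one platter side
    return platter_side * heads                   # all platter sides = disk capacity

# The slide's example: CHS = (7, 2, 9) with 512-byte sectors.
cap = formatted_capacity(cylinders=7, heads=2, sectors_per_track=9)
print(cap, cap // 1024)  # 64512 63, i.e. 63 kB
```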

Slide 74 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Disk parameters for the original IBM PC floppy disk and a Western Digital WD 18300 hard disk [Ta01 p.301] (tracks per cylinder = heads).

Slide 75 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Physical disk geometry

On older disks the number of sectors per track was the same for all cylinders.

The physics of the inner track sectors defined the maximum number of bytes per sector.

From physics, the outer sectors could have stored more bytes than defined,as the areas are bigger.

Waste of space / capacity

Slide 76 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Modern disks are divided into zones with more sectors in the outer zones than in the inner zones (zone bit recording).

Physical geometry (left) and corresponding virtual geometry example (right)
Figure from [Ta01 p.302], modified (annotation: „This must be seen as two sectors“)


Slide 77 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Physical geometry: the true physical disk layout. With modern disks, only the internal electronics knows about it.
CHS (for old disks) or not published any more

Virtual geometry: the disk layout published to the external world (device driver, operating system, user).
CHS (e.g. the WD 18300 example)
LBA (logical block addressing)
Disk sectors are just numbered consecutively without regard to the physical geometry.

A disk is a random access storage device.
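For disks that still publish a CHS geometry, the conventional CHS ↔ LBA mapping numbers sectors consecutively, track by track. A minimal sketch; the geometry values reuse the small CHS = (7, 2, 9) example from the capacity slide:

```python
def chs_to_lba(c, h, s, heads, sectors_per_track):
    # Sectors count from 1 within a track; cylinders and heads count from 0.
    return (c * heads + h) * sectors_per_track + (s - 1)

def lba_to_chs(lba, heads, sectors_per_track):
    c, rest = divmod(lba, heads * sectors_per_track)
    h, s = divmod(rest, sectors_per_track)
    return (c, h, s + 1)

print(chs_to_lba(0, 0, 1, heads=2, sectors_per_track=9))  # 0, the very first sector
print(lba_to_chs(63, heads=2, sectors_per_track=9))       # (3, 1, 1)
```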

Slide 78 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Magnetic Disks

Low level formatting: creation of the physical geometry on the disk platters. Defective disk areas are masked out and replaced by spare areas. Done by disk drive internal software.

High level formatting: a partition receives a boot block and an empty file system (free storage administration, root directory). Done by an application program or by an operating system administration tool.

Partitioning: the disk is divided into independent partitions, each logically acting as a separate disk. Definition of a master boot record in the first sector of the disk. Done by an application program.

Slide 79 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Logical Disk Layout

File system

Magnetic disks

Figure from [Ta01 p.400], modified

Slide 80 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)


Slide 81 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Files

A file is a named collection of related information recorded on secondary storage. [Sil00 p.346]

A file is a logical storage unit. It is an abstract data type. [Sil00 p.345, 347]

Files are an abstraction mechanism for storing information and retrieving it back later. [after Ta01 p.380]

Slide 82 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Structure

Logical file structure examples [Ta01 p.382]

Files

Slide 83 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Structure

a) Byte sequence: unstructured. The OS does not know or care what is in the file; the meaning is imposed by the application program. Maximum flexibility. Approach used by Unix and Windows.

b) Sequence of records (fixed length): each record has some internal structure. Background idea: read/write operations from secondary storage have record size.

c) Tree of records: highly structured. Records may be of variable size. Access to a record is through a key (e.g. „Pony"). Lookup/read/write/append are performed by the OS, not by the application program. Approach used on large mainframe computers (commercial data processing systems).

Files

Slide 84 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Access

Figure from [Sil00 p.355], modified

• Sequential Access: simple and most common. Based on the tape model of a file. Data is processed in order (byte after byte, or record after record). Operations: read, write, rewind. Records need not be of the same length (e.g. text files, with each line forming a record; remember Pascal readln, writeln).

Files


Slide 85 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Access

• Direct Access: bytes or fixed-length logical records. Records are numbered; access can be in no particular order, by record number. Based on the disk model of a file. Useful for immediate access to large data records (e.g. a database). Operations: read, write, seek.

Figure from [Sil00 p.355], modified

[Figure: a file as a sequence of numbered records 1-12, with a file pointer; seek by byte or by record]

Files

Slide 86 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Access

• Indexed Access: an index file holds keys; the keys point to records within the relative file. Suited for tree structures.

Example of index file and relative file, figure from [Sil00 p.358]

Files

Slide 87 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Names

• Name assigned by the creating process: andrew, 2day, urgent!, fig_14

• Case sensitivity: Andrew, ANDREW, andrew. Unix: case sensitive. MS-DOS: case insensitive.

• Two-part file names (basename.extension): readme.txt, prog.c.Z, lecture.doc

Extensions are often just conventions, not mandatory by the operating system (although convenient when the OS knows about them).

Files

Slide 88 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Attributes

• Additional information about a file. Which attributes exist depends on the operating system and the file system.

• Assigned by the operating system.

• Stored in the file system.

Some possible file attributes:

File type: regular file, directory file, ...
Text/binary flag: whether the file content is text or binary
Creation date: date of file creation
Hidden flag: whether or not the file name is displayed in listings
Temp flag: if set, the file is a temporary file and is deleted on process exit
Access rights: who can access the file, and in what way

Files


Slide 89 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Types

Unix: regular files, directories, block special files, character special files. Windows: regular files, directories.

Files

Text files (also termed ASCII files): contain bytes (words in Unicode) according to a standardized character set, such as EBCDIC, ASCII or Unicode. The content is directly printable (screen, printer). Data.

Binary files: contents not intended to be printed (at least not directly). The content has meaning only to the programs using the files. Program (binary executable) or data.

Directories: files for maintaining the logical structure of the file system.

Slide 90 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Single-level directory

Figure from [Sil00 p.360]

Directories

A directory is a named logical place to put files in.

• Early operating systems (CP/M, MS-DOS 1.0)

• Still used in tiny embedded systems

• File names are unique

This is the directory entry for the file called records, pointing to the file content on the storage media.

This is the file content of the file records.

Slide 91 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

user1 user2 user3 user4

Directories

Two-level directory

Figure from [Sil00 p.361], modified

• Absolute file names, relative file names, path names. Examples: /user1/test, /user3/test (absolute); test, ../user4/data (relative); /user3 (path)

• Absolute file names are unique

• Hierarchical structure (tree of depth 1)

root directory

sub directories

Slide 92 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directories

Multi-level directory

Figure from [Sil00 p.363]


Slide 93 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Directory

Directories

• Hierarchical structure of arbitrary depth: tree structure, graph structure. Logical organization structure.

• One root directory; arbitrary number of sub (sub-sub, ...) directories.

• Efficient file search: tree/graph traversal routines, much faster than sequential search.

• Logical grouping: system files, user files, shared files, ...

• Most common structure

• Generalization of two-level directory

Slide 94 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Directory

Figure from [Sil00 p.365]

Directories

Acyclic graph directory structure

• Additional directory entries (Links)

• Shared directories

• Shared files

• More than one absolute name for a file (or a directory)

• Dangling link problem


Slide 95 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Directory

Figure from [Sil00 p.365]

Directories

General graph directory structure

Allowing links to point to directories creates the possibility of cycles.

Avoiding cycles:

Forbid any links to directories. No more shared directories then.

Use a cycle detection algorithm.

Slide 96 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Management

Now turning from the user's view to the implementor's view. Users are concerned with how files are named, what operations are allowed, and what the directories look like.

Implementors are interested in

how files and directories are stored on the disk,

how the disk space is managed,

and how to make everything work efficiently.


Slide 97 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)

Slide 98 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Implementation

• Contiguous Allocation

• Indexed Allocation

• Linked Allocation: chained blocks, chained pointers

The most important issue in implementing files is the way the available disk space is allocated to files.

Slide 99 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous Allocation

File Implementation

Each file occupies a set of contiguous blocks on the disk. A file is defined by its disk address (first block) and by its length in block units.

Advantage

• Simple implementation: for each file we just need to know its start block and its length.

• Fast access: access in one continuous operation, minimum head seeks.

Disadvantage

• Disk fragmentation: problem of finding space for a new file. The final file size must be known in advance!
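The "finding space" problem can be sketched as a search over a hole list. A minimal first-fit sketch; the hole list and sizes are made up for illustration:

```python
# First-fit search over a hole list, illustrating the allocation problem
# of contiguous allocation. Holes are (start_block, length) pairs.

def first_fit(holes, nblocks):
    """Return the start block for a file of nblocks, carving it out of
    the first hole that is large enough; None if no hole fits."""
    for i, (start, length) in enumerate(holes):
        if length >= nblocks:
            if length == nblocks:
                del holes[i]                       # hole used up entirely
            else:
                holes[i] = (start + nblocks, length - nblocks)
            return start
    return None                                    # no single hole fits

holes = [(10, 3), (20, 8), (40, 5)]
print(first_fit(holes, 5))   # -> 20 (first hole of size >= 5)
print(holes)                 # -> [(10, 3), (25, 3), (40, 5)]
print(first_fit(holes, 6))   # -> None: 11 blocks are free in total,
                             #    but external fragmentation leaves no
                             #    hole big enough
```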

Slide 100 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous AllocationFile Implementation

(a) Contiguous allocation of disk space for 7 files

(b) State of the disk after files D and E have been removed. Figure from [Ta01 p.401]


Slide 101 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous AllocationFile Implementation

External fragmentation: free disk space is broken into chunks (holes) which are spread all over the disk. New files are put into available holes, often not filling them entirely and thus leaving smaller holes. A big problem arises when the largest available hole is too small for a new file.

Internal fragmentation: a file usually does not fill up its last block entirely, so the remaining space in the block is left unused.

Slide 102 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation

File Implementation

Each file is a linked list of disk blocks. The blocks may be scattered anywhere on the disk.

Each block has, besides its data, a pointer to the next block. The pointer is a number (a block number).

[Figure: chained blocks on disk; each block holds data and a next pointer, the last pointer being nil]

Chained blocks

Slide 103 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation

File Implementation

Figure from [Sil00 p.380]

Advantage

• Simple implementation: only the first block number is needed.

• No external fragmentation: files consist of blocks scattered over the disk. No more 'useless' blocks on the disk.

The file 'jeep' starts with block 9. It consists of blocks 9, 16, 1, 10, and 25, in this order.

Slide 104 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation

File Implementation

Disadvantage

• Free space management: somehow all the free blocks must be recorded in some free-block pool.

• Higher access time: more seeks to access the whole file, owing to block scattering.

• Space reduction: some bytes of each block are needed for the pointer.

• Reliability: if a pointer is broken, the remainder of the file is inaccessible.

• Not efficient for random access: to get to block k we must walk along the chain.


Slide 105 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation

File Implementation

In particular the last disadvantage of the chained-blocks allocation method, its unsuitability for random access to files, led to the chained-pointers allocation method.

A table contains as many entries as there are disk blocks. The entries are numbered by block number. The block numbers of a file are linked in this table in chain manner (as with chained blocks). This table is called the file allocation table (FAT).

Figure from [Sil00 p.382]

Slide 106 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Linked Allocation

File Implementation

Figures from [Ta01 p.403, 404], modified

Chained block allocation

Chained pointer allocation (FAT)

The FAT is stored on disk and is loaded into memory when the operating system starts up.

Slide 107 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Chained pointers

Advantage

• Simple implementation: one simple table for both file allocation and the free-block pool.

• Whole block available for data: no more pointers taking away data space.

• Suitable for random access: although the principle of getting to block k did not change, the search (counting) is now done on the block numbers, not on the blocks themselves.

Linked Allocation

Disadvantage

• FAT takes up disk space and memory (when cached): one table entry for each disk block; table size proportional to disk size.

• Higher access time (compared to contiguous allocation): it still needs many seeks to collect all the scattered blocks.
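Walking a FAT chain can be sketched in a few lines, using the slide's 'jeep' example (blocks 9, 16, 1, 10, 25). The sentinel values are simplified: here 0 marks a free block and -1 marks end-of-file, where a real FAT uses reserved bit patterns such as 0xFFF:

```python
# Walking a FAT chain. fat[b] holds the number of the block that
# follows block b in the file; -1 ends the chain (toy convention).

def fat_chain(fat, first_block):
    blocks = []
    b = first_block
    while b != -1:
        blocks.append(b)
        b = fat[b]
    return blocks

fat = [0] * 32                      # 0 = free block (toy convention)
fat[9], fat[16], fat[1], fat[10], fat[25] = 16, 1, 10, 25, -1

print(fat_chain(fat, 9))            # -> [9, 16, 1, 10, 25]

# Random access to block k of the file now means k lookups in the
# (cached) table rather than k disk reads:
print(fat_chain(fat, 9)[3])         # block index 3 of 'jeep' -> disk block 10
```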

Slide 108 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation

File Implementation

Each file is assigned an index block.

The index block is an array of block numbers, listing in order the blocks belonging to the file. To get to block k of a file, one reads the k-th entry of the index block.

[Figure: an index block on disk whose entries point to the data blocks of a file]


Slide 109 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation

File Implementation

The file 'jeep' is described by index block 19.

The index block has 8 entries, of which 5 are used.

Index blocks are also called index nodes, short i-nodes or inodes.

Figure from [Sil00 p.383]

Slide 110 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation

Advantage

• Good for random access: fast determination of block k of a file.

• Lower memory occupation: only for those files currently in use (open files) are the corresponding index blocks loaded into memory.

• Lower disk space occupation: only as many index blocks are needed as there are files in the file system.

File Implementation

Disadvantage

• Free block management: a separate free-block pool must be available.

• Index block utilization: unused entries in the index block waste space.

Slide 111 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation

• Linked index blocks: the last entry in an index block points to another index block (chaining).

• Multilevel index blocks: an entry does not point to the data but to a first-level index block (single indirect block), which then points to the data. Optionally, additional levels are available through second-level and third-level index blocks.

• Combined scheme: most entries point to the data directly. The remaining entries point to first-level, second-level and third-level index blocks. Used by Unix.

File Implementation

What if a file needs more blocks than there are entries available in an index block?

Slide 112 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indexed Allocation

Combined scheme example (Unix V7), from [Ta01 p.447], modified

File Implementation

Note: the inodes are not disk blocks, but records stored in disk blocks. The single/double/triple indirect blocks are disk blocks.
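Finding block k under the combined scheme is pure index arithmetic. A sketch with assumed parameters (10 direct entries, index blocks holding 256 block numbers each); the real V7 values differ slightly, but the decision logic is the same:

```python
# Which pointer leads to block k of a file under the combined scheme?
# Parameters are assumed example values, not the exact V7 ones.

NDIRECT = 10        # direct entries in the inode (assumed)
PER_BLOCK = 256     # block numbers per index block (assumed)

def locate(k):
    """Return which pointer chain reaches file block k, plus the
    entry indices to follow at each indirection level."""
    if k < NDIRECT:
        return ("direct", k)
    k -= NDIRECT
    if k < PER_BLOCK:
        return ("single indirect", k)
    k -= PER_BLOCK
    if k < PER_BLOCK ** 2:
        return ("double indirect", k // PER_BLOCK, k % PER_BLOCK)
    k -= PER_BLOCK ** 2
    return ("triple indirect", k // PER_BLOCK ** 2,
            (k // PER_BLOCK) % PER_BLOCK, k % PER_BLOCK)

print(locate(3))      # -> ('direct', 3)
print(locate(100))    # -> ('single indirect', 90)
print(locate(5000))   # -> ('double indirect', 18, 126)
```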


Slide 113 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)

Slide 114 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation

Before accessing a file, the file must first be opened by the operating system. For that, the OS uses the path name supplied by the user to locate the directory entry.

A directory entry provides

• the name of the file,

• the information needed to find the blocks of the file,

• and information about the file‘s attributes.

Slide 115 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry

Attribute placement

The attributes may be stored

a) together with the file name in the directory entry (MS-DOS, VMS)

b) or off the directory entry (Unix). Figure from [Ta01 p.406]

Directory Implementation

Slide 116 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry

MS-DOS directory entry

Directory entry size: 32 bytes.

File attributes are stored in the entry.

The first block number points to the first file block, or rather to the corresponding entry in the FAT (DOS uses chained pointers).

Figure from [Ta01 p.440]

Directory Implementation
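A 32-byte entry like this can be decoded with fixed-offset unpacking. A sketch following the classic FAT-16 on-disk layout (8-byte name, 3-byte extension, attribute byte, 10 reserved bytes, time, date, first block number, 4-byte size); the sample bytes below are made up:

```python
# Decoding a 32-byte MS-DOS (FAT) directory entry with struct.

import struct

def parse_entry(raw):
    name, ext, attr, _res, time, date, first, size = struct.unpack(
        "<8s3sB10sHHHI", raw)        # 8+3+1+10+2+2+2+4 = 32 bytes
    return {
        "name": (name.decode("ascii").rstrip() + "." +
                 ext.decode("ascii").rstrip()).rstrip("."),
        "attributes": attr,
        "first_block": first,
        "size": size,
    }

# Hypothetical entry: file JEEP.TXT, archive flag set, first block 9.
raw = (b"JEEP    " + b"TXT" + bytes([0x20]) + bytes(10) +
       struct.pack("<HHHI", 0, 0, 9, 1200))
e = parse_entry(raw)
print(e["name"], e["first_block"], e["size"])   # -> JEEP.TXT 9 1200
```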

Page 30: LectureCA All Slides

30

Slide 117 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis


Directory Entry

Unix directory entry (Unix V7)

Entry size: 16 bytes. Modern Unix versions allow for longer file names.

File attributes are stored in the inode.

The rest of the inode points to the file blocks.

Directory Implementation

directory entry

Slide 118 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Entry

MS-DOS file attributes

A : Archive flag

D: Directory flag

V: Volume label flag

Figure from [Ta01 p.440]

Directory Implementation

A D V S H R

S : System file flag

H: Hidden flag

R: Read-only flag


Slide 119 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation

An MS-DOS directory (not the entry) is itself a file (a binary file) with the file type attribute set to directory.

The disk blocks pointed to contain other directory entries (each again 32 bytes in size), which describe either files or further directories (sub directories).

Upon installing an MS-DOS file system, a root directory is created automatically.

The same applies to Unix: when the file type attribute is set to directory, the file blocks contain directory entries.

Windows 2000 and its descendants (NTFS) treat directories as entities different from files.

Slide 120 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Directory Implementation

MS-DOS directory

[Figure: directory entries with the directory attribute set point to disk blocks containing further directory entries; entries with the directory attribute not set (regular files) point to disk blocks containing file data]


Slide 121 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Lookup

How to find a file name in a directory

• Linear search: each directory entry has to be compared against the search name (string compare). Slow for large directories.

• Binary search: needs a directory sorted by name. Entering and deleting files requires moving directory entries around in order to keep them sorted (insertion sort).

• Hash table: in addition to each file name, a hash value (a number) is created and stored. The search is then done on the hash value, not on the name.

• B-tree: file names are nodes and leaves in a balanced tree. NTFS.

Directory Implementation
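The hash-table variant can be sketched with an in-memory bucket table: the lookup costs one hash computation plus a short collision-chain scan, instead of a string compare over every entry. A minimal sketch; real file systems use persistent on-disk hash structures, and the names and inode numbers below are made up:

```python
# Directory lookup via a hash table over file names.

NBUCKETS = 8

def bucket(name):
    return hash(name) % NBUCKETS      # any stable string hash would do

directory = [[] for _ in range(NBUCKETS)]

def add(name, inode):
    directory[bucket(name)].append((name, inode))

def lookup(name):
    for n, inode in directory[bucket(name)]:
        if n == name:                 # string compare only within one bucket
            return inode
    return None

add("mbox", 60); add("minix", 81); add("src", 59)
print(lookup("mbox"))    # -> 60
print(lookup("nope"))    # -> None
```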

Slide 122 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File Lookup

The steps in looking up the file /usr/ast/mbox in classical Unix

Directory Implementation

Figure from [Ta01 p.447]
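The lookup walks the path one component at a time: start at the root inode, and for each component search the current directory for the next inode number. A sketch of that loop; the inode numbers below are made up to mirror the figure:

```python
# Resolving /usr/ast/mbox step by step, one directory search per
# path component, as in the classical Unix lookup.

dirs = {                        # inode number -> directory entries
    1: {"usr": 6},              # inode 1 is the root directory
    6: {"ast": 26, "jim": 45},
    26: {"mbox": 60, "src": 92},
}

def namei(path):
    inode = 1                   # absolute lookup starts at the root
    for part in path.strip("/").split("/"):
        inode = dirs[inode][part]
    return inode

print(namei("/usr/ast/mbox"))   # -> 60
```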

Slide 123 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)

Slide 124 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block ManagementTo keep track of the blocks available for allocation (free blocks), the operating system must somehow maintain a free block pool.

When a file is created, the pool is searched for free blocks. When a file is deleted, the freed blocks are added to the pool.

File systems using a FAT do not need a separate free block pool: free blocks are simply marked in the table by a 0.

Free block pool implementations:

• Linked list
• Free list
• Bit map


Slide 125 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block Management

Linked list: the free blocks form a linked list where each block points to the next one (chained blocks).

• Simple implementation: only the first block number is needed.

• Quick access: new blocks are prepended (LIFO principle).

• Disk I/O: updating the pointers involves I/O.

• Block modification: the modified content hinders an 'undelete' of the block.

Figure from [Sil00 p.388]

Slide 126 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis


Free Block Management

Free list

The free block numbers are listed in a table. The table is stored in disk blocks. The table blocks may be linked together.

• Space: each free block requires 4 bytes in the table.

• Management: adding and deleting block numbers takes time, in particular when a table block is almost full (additional disk I/O required).

Figure from [Ta01 p.413]

Slide 127 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free Block Management

Bit Map

To each existing block on the disk a bit is assigned. When a block is free, the bit is set; when the block is occupied, the bit is reset (or vice versa). All bits form a bit map.

• Compact: each block is represented by a single bit. Fixed size.

• Logical order: neighboring bits represent neighboring blocks. It is quite easy to find contiguous blocks, or blocks located close together.

• Conversion block number ↔ bit position: from the block number the corresponding bit position must be calculated, and vice versa.

Figure from [Ta01 p.413]

Slide 128 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)


Slide 129 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Layout

File system

Figure from [Ta01 p.400], modified

Each partition starts with a boot block (first block), which is followed by the file system. The boot block may be modified by the file system.

Slide 130 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Boot sector | FAT | FAT copy | Root dir | Files and directories

Layout of the FAT file system

File System Layout

Information about the file system is stored in the boot sector.

The number of entries in the root directory is limited, except for FAT-32, where the root directory is a cluster chain.

A copy of the FAT is kept for reliability reasons.

Microsoft FAT-32 specification at http://www.microsoft.com/whdc/system/platform/firmware/fatgen.mspx

Slide 131 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System Layout

Possible file system layouts for a UNIX file system:

Super block | Inodes | Root dir | Files and directories

Super block | Free block pool | Inodes | Root dir | Files and directories

The inode for the root directory is located at a fixed place.

Super block: information about the file system (block size, volume label, size of inode list, next free inode, next free block, ...)

Free block pool: bit map free block management

Slide 132 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

MFT | System files | File area

Layout of the NTFS file system

File System Layout

System files: files storing metadata about the file system. Actually, the MFT itself is a system file.

MFT: Master File Table. A linear sequence of 1 kB records; each record describes one file or directory. The MFT is a file and may be located anywhere on the disk.

More about NTFS: http://www.ntfs.com/ntfs_basics.htm

Information about the file system is stored in the boot block.


Slide 133 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)

Slide 134 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cylinder Skew

Cylinder skew example

Assumption: reading from inner tracks towards outer tracks.

Here: skew = 3 sectors.

After the head has moved to the next track, sector 0 arrives just in time; reading can continue right away.

Performance improvement when reading multiple tracks.

Physical disk geometry, figure from [Ta01 p.316]

Disk Performance
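The required skew follows from the drive parameters: while the head performs its track-to-track seek, the platter keeps rotating, so the next track must be offset by roughly seek time divided by the time one sector takes to pass under the head. A sketch; all drive parameters below are assumed example values, not from the figure:

```python
# Estimating the cylinder skew in sectors.

import math

rpm = 7200                     # assumed rotation speed
sectors_per_track = 160        # assumed
track_to_track_seek_ms = 1.0   # assumed

rotation_ms = 60_000 / rpm                    # one full rotation: ~8.33 ms
sector_ms = rotation_ms / sectors_per_track   # time for one sector to pass

# Round up so sector 0 has not yet passed when the seek finishes.
skew = math.ceil(track_to_track_seek_ms / sector_ms)
print(skew)   # -> 20 sectors of skew for these parameters
```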

Slide 135 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Disk SchedulingModern disk drives are addressed as large one-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer. The array of logical blocks is mapped into the sectors of the disk sequentially. Sector 0 is the first sector of the first track on the outermost cylinder. Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost.

However, it is difficult to convert a logical block into CHS:

• The disk may have defective sectors which are replaced by spare sectors from elsewhere on the disk.

• Owing to zone bit recording, the number of sectors per track is not the same for all cylinders. After [Sil00 p.436]

Disk Performance

Slide 136 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Disk Scheduling

• Fast access desired (high disk bandwidth)

Disk bandwidth is the total number of bytes transferred, divided by the total time from the first request for service until completion of the last transfer.

• Bandwidth depends on

– Seek time, the time for the disk to move the heads to the cylinder containing the desired sector.

– Rotational latency, the additional time spent waiting for the disk to rotate the desired sector to the disk head.

• Seek time is roughly proportional to seek distance.

• Scheduling goal: minimize seek time. In earlier days scheduling was done by the OS; nowadays it is done either by the OS (which then has to guess the physical disk geometry) or by the integrated disk drive controller.

Disk Performance


Slide 137 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Disk Scheduling

• First-Come First-Served (FCFS)

• Shortest Seek Time First (SSTF)

• SCAN

• C-SCAN

• C-LOOK

Scheduling Algorithms

For the following examples, assume a single-sided disk with 200 tracks. Read requests are queued in some queue, which currently holds requests for tracks 98, 183, 37, 122, 14, 124, 65 and 67. The head is at track 53.

Disk Performance
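The algorithms on the following slides can be compared by total head movement on exactly this example. A small simulation sketch for FCFS and SSTF:

```python
# Total head movement for the example queue under FCFS and SSTF.

queue = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53

def movement(order, start):
    """Sum of seek distances when servicing tracks in the given order."""
    pos, total = start, 0
    for t in order:
        total += abs(t - pos)
        pos = t
    return total

def sstf(requests, start):
    """Repeatedly pick the pending request closest to the head."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

fcfs_total = movement(queue, head)              # service in arrival order
sstf_total = movement(sstf(queue, head), head)  # shortest seek first
print(fcfs_total, sstf_total)                   # -> 640 236
```

SSTF visits 65, 67, 37, 14, 98, 122, 124, 183 and cuts the total head movement from 640 to 236 tracks on this workload.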

Slide 138 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FCFS

Disk Scheduling

The requests are serviced in the order of their entry (first entry is served first).

Figure from [Sil00 p.437]


Slide 139 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

SSTF

Disk Scheduling

The next request served is the one closest to the current position (shortest seek time).

Figure from [Sil00 p.438]


Slide 140 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

SCAN

Disk Scheduling

Disk arm starts at one end of the disk and sweeps over to the other end, thereby servicing the requests.

At the other end the head reverses direction and servicing continues on the return trip.

Figure from [Sil00 p.439]


Page 36: LectureCA All Slides

36

Slide 141 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

C-SCAN

Disk Scheduling

The disk arm starts at one end of the disk and sweeps over to the other end, servicing requests along the way. At the other end, the head returns to the beginning of the disk without servicing requests on the return trip.

Figure from [Sil00 p.440]


Slide 142 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

C-LOOK

Like SCAN or C-SCAN, but the head moves only as far as the final request in each direction.

Figure from [Sil00 p.441]


Disk Scheduling

Slide 143 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

• SSTF is common and has a natural appeal

• SCAN and C-SCAN perform better for systems that place a heavy load on the disk.

• Performance depends on the number and typesof requests.

• Requests for disk service are influenced by the file allocation method.

• Either SSTF or LOOK is a reasonable choice as default algorithm.

Disk Scheduling

Disk Performance

Slide 144 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

File System ManagementStorage Media (67)

Magnetic Disks (71)

Files and Directories (81, 90)

File Implementation (98)

Directory Implementation (114)

Free Block Management (124)

File System Layout (129)

Cylinder skew, disk scheduling (135)

Floppy Disks (145)


Slide 145 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

Figure from www.computermuseum.li

• Portable storage media
• 8" floppy in 1969: capacity 80 K ... 1.2 M
• 5.25" floppy in 1978: capacity 360 K ... 1.2 M
• 3.5" floppy in 1987: capacity 720 K, 1.44 MB

Floppy disks are by now almost entirely displaced by flash memory (e.g. USB sticks), except for the purpose of booting computers (bootable floppies).

Slide 146 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

Two-sided floppy disk

[Figure: two-sided floppy disk showing the direction of rotation, the track numbers (0, 1, 2, ...), the sector numbers, side 0 (front) and side 1 (back), and the start of the tracks. The first sector is addressed as BIOS 0,0,1; the marked example sector is addressed as BIOS 0,2,6 (side, track, sector), i.e. CHS 2,0,6, or as BDOS sector 42]

Slide 147 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

[Same figure as on the previous slide: the example sector is addressed as BIOS 0,2,6 (side, track, sector), i.e. CHS 2,0,6, or as BDOS sector 42]

BIOS = Basic Input Output System. Stored in (EP)ROM.

Sector access through invoking a software interrupt and addressing a sector by means of CHS.

BDOS = Basic Disk Operating System. Originates from the CP/M operating system. Higher abstraction level than the BIOS.

Sector access through invoking a software interrupt and addressing a sector by means of a logical consecutive sector number (1, 2, ...).

Slide 148 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

Sector Structure

[Figure: a track divided into sectors 1 ... 9; each sector consists of an address field and a data field, separated by inter-record gaps]

Address field: Sync | IAM | track index | head index | sector index | sector length | CRC
Data field: DAM | 128-1024 data bytes | CRC/ECC

CRC: Cyclic Redundancy Check
ECC: Error Checking/Correction
IAM: Index Address Mark
DAM: Data Address Mark


Slide 149 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

Starting sector numbers for system and data areas (FAT file system). All numbers are in decimal notation.

Disk    | Boot sector | FAT 1 | FAT 2 | Root dir | Data
360 K   | 1           | 2     | 4     | 6        | 13
720 K   | 1           | 2     | 5     | 8        | 15
1.2 M   | 1           | 2     | 9     | 16       | 30
1.44 M  | 1           | 2     | 11    | 20       | 34
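These starting sectors follow directly from the FAT layout: boot sector, two FATs, root directory, then data. A sketch recomputing them; the per-format FAT and root-directory sizes (in sectors) are inferred from the table itself:

```python
# Recomputing the starting sectors of the FAT file system areas.
# Sector numbers are 1-based; the boot sector occupies sector 1.

def area_starts(fat_sectors, rootdir_sectors):
    fat1 = 2
    fat2 = fat1 + fat_sectors
    rootdir = fat2 + fat_sectors      # two FAT copies of equal size
    data = rootdir + rootdir_sectors
    return fat1, fat2, rootdir, data

print(area_starts(2, 7))    # 360 K  -> (2, 4, 6, 13)
print(area_starts(9, 14))   # 1.44 M -> (2, 11, 20, 34)
```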

Slide 150 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Floppy Disks

[Figure: track 0 of a 360 kB floppy disk. Track 0, side 0 (sectors 1-9): sector 1 holds the bootstrap loader, sectors 2-5 hold FAT(1)-FAT(4), sectors 6-9 hold Dir.(1)-Dir.(4). Track 0, side 1 (sectors 10-18): sectors 10-12 hold Dir.(5)-Dir.(7), sectors 13-18 hold Data(1)-Data(6)]

Dir. = allocated space for the root directory. Track 0 of a 360 kB floppy disk.

Slide 151 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Process Management

Slide 152 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management

Processes (153)

Threads (178)

Interprocess Communication (IPC) (195)

Scheduling (247)

Real-Time Scheduling (278)

Deadlocks (318)


Slide 153 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Processes

A process is a set of identifiable, repeatable actions which are ordered in some way and contribute to the fulfillment of an objective. (General definition)

A process is a program in execution. (Computer-oriented definition)

• Process: dynamic, active. Acting according to the recipe (cooking) is a process.

• Program: static, passive. A cooking recipe is a program.

Slide 154 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Model

• Several processes work quasi-parallel.

• A process is a unit of work.

• Conceptually, each process has its own virtual CPU. In reality, the real CPU switches back and forth from process to process.

Processes

Sequential view / Process model view

Processes make progress over time

Figure from [Ta01 p.72]

Slide 155 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Processes

a) CPU-bound: spends more time doing computations; few, very long CPU bursts.

b) I/O-bound: spends more time doing I/O than computations; many short CPU bursts.

Figure from [Ta01 p.134]

A process may be described as either (more or less)

Slide 156 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Space

Processes

A process is an executing program, and encompasses the current values of the program counter, of the registers, of the variables and of the stack.

code section (text section or segment)This is the actual program code (the machine instructions).

CS

DS

SS

SP

PC

Memory

CPU

data section (data segment)This segment contains global variables (global to theprocess, not global to the computer system).

stack section (stack segment)The stack contains temporary data (local variables,return addresses, function parameters).


Slide 157 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process States

Figure from [Sil00 p.89]

Note: Only in the running state the process needs CPU cycles, in all other states it is actually ‚frozen‘ (or nonexistent any more).

Slide 158 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process States

• New: The process is created. Resources are allocated.

• Ready: The process is ready to be (re)scheduled.

• Running: The CPU is allocated to the process, that is, the program instructions are being executed.

• Waiting: The process is waiting for some event to occur. Without this event the process cannot continue – even if the CPU were free.

• Terminated: Work is done. The process is taken off the system (off the queues) and its resources are freed.

Slide 159 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Events at which processes are created

• Operating System Start-Up: Most of the system processes are created here. A large portion of them are background processes (daemons).

• Interactive User Request: A user requests an application to start.

• Batch job: Jobs that are scheduled to be carried out when the system has the resources available (e.g. calendar-driven events, low priority jobs).

• Existing process gives birth: An existing process (e.g. a user application or a system process) creates a process to carry out some related (sub)tasks.

Slide 160 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Creation

• Parent process creates a child process, which in turn may create other processes, forming a tree of processes.

• Resource sharing
– Parent and child share all resources.
– Child shares a subset of parent's resources.
– Parent and child share no resources.

• Execution
– Parent and child execute concurrently.
– Parent waits until child terminates.

• Address Space
– Child is a copy of the parent.
– Child has a program loaded into it.

System calls for creating a child process: Unix: fork(), Windows: CreateProcess()


Slide 161 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

fork() example

#include <stdio.h>
#include <unistd.h>

int main() {
    int result;
    printf("Parent, my pid is: %d\n", getpid());
    result = fork();                /* from here on think parallel */
    if (result == 0) {              /* executed by child only */
        printf("Child, my pid is: %d\n", getpid());
        ...
    } else {                        /* executed by parent only */
        printf("Parent, my pid is: %d\n", getpid());
        ...
    }
    return 0;
}

getpid() is a system call that tells a process its pid (process identifier), which is a unique process number within the system.

Slide 162 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

fork() example

Terminal output:

Parent, my pid is: 189
Child, my pid is: 190
Parent, my pid is: 189

The order depends on whether parent or child is scheduled first after fork().

[Figure: before fork(), a single process (pid = 189) with its PC at the fork() call; after fork(), two processes (pid = 189 and pid = 190), each with its own PC continuing right after fork().]

Slide 163 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Creation

Figure from [Sil00 p.96]

A Unix process tree

Slide 164 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Termination

Events at which processes are terminated

• Process asks the OS to delete it: Work is done. Resources are deallocated (memory is freed, open files are closed, used I/O buffers are flushed).

• Parent terminates child
– Child may have exceeded allocated resources.
– Task assigned to child is no longer required.
– Parent's parent is exiting. Some OSs do not allow a child to continue when its parent terminates. Cascading termination (a subtree is deleted).

System calls for self-termination of a process: Unix: exit(), Windows: ExitProcess()
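As a sketch of how these calls fit together, the following combines fork(), exit() and the parent-side wait (waitpid() is the POSIX call; error handling omitted):

```c
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates itself with exit(code);
   the parent blocks until the child terminates and
   returns the child's exit code. */
int spawn_and_wait(int code) {
    pid_t pid = fork();
    if (pid == 0) {
        /* child: do some work, then self-terminate */
        exit(code);
    }
    /* parent: wait for the child to terminate */
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

waitpid() also frees the child's process-table entry; without it the terminated child would linger as a zombie.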


Slide 165 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Control Block

• Operating system maintains a process table
• Each entry represents a process
• Entry often termed PCB (process control block)

A PCB contains all information about a process that must be saved when the process is switched from running into waiting or ready, such that it can later be restarted as if it had never been stopped.

Info regarding process management, memory occupation and open files.

Figure from [Sil00 p.89]: PCB example
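As an illustration, a PCB can be sketched as a plain C structure; the field names below are hypothetical, and real kernels store many more entries:

```c
#include <stdint.h>

/* Hypothetical PCB sketch -- fields vary per operating system. */
enum proc_state { P_NEW, P_READY, P_RUNNING, P_WAITING, P_TERMINATED };

struct pcb {
    int             pid;            /* process identifier            */
    enum proc_state state;          /* scheduling state              */
    uint64_t        pc;             /* saved program counter         */
    uint64_t        regs[16];       /* saved general-purpose regs    */
    void           *page_table;     /* memory-management information */
    int             open_files[16]; /* open file descriptors         */
    int             priority;       /* scheduling priority           */
};

/* Saving a context means copying CPU state into the PCB
   so the process can later resume as if never stopped. */
void save_context(struct pcb *p, uint64_t pc_value) {
    p->pc = pc_value;
    p->state = P_READY;
}
```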

Slide 166 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Control Block

Table from [Ta01 p.80]

Typical fields of a PCB

Slide 167 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Context Switch

The task of switching the CPU from one process to another is termed context switch (sometimes also process switch):

• Saving the state of the old process: saving the current context of the process in its PCB.

• Loading the state of the new process: restoring the former context of the process from its PCB.

Context switching is pure administrative overhead. The duration of a switch lies in the range of 1 ... 1000 µs. The switch time depends on the hardware. Processors with multiple sets of registers are faster at switching. Context switching poses a certain bottleneck, which is one reason for the introduction of threads.

Slide 168 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Context Switch

[Figure from [Sil00 p.90]: two processes alternating on the CPU; each switch-over costs a context switch time.]


Slide 169 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

On a uniprocessor system there is only one process running; all others have to wait until they are scheduled. They are waiting in some scheduling queue:

• Job Queue: holds the future processes of the system.

• Ready Queue (also called CPU queue): holds all processes that reside in memory and are ready to execute.

• Device Queue (also called I/O queue): each device has a queue holding the processes waiting for I/O completion.

• IPC Queue: holds the processes that wait for some IPC (inter-process communication) event to occur.

Slide 170 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

[Figure from [Sil00 p.92], modified: the ready queue and some device queues (tape, Ethernet, disk, terminal). Each queue entry holds a waiting process with its registers; some device queues are empty.]

Slide 171 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

From the job queue a new process is initially put into the ready queue. It waits until it is dispatched (selected for execution). Once the process is allocated the CPU, one of these events may occur:

• Interrupt: The time slice may have expired or some higher-priority process is ready. Hardware error signals (exceptions) may also cause a process to be interrupted.

• I/O request: The process requests I/O and is shifted to a device queue. After the I/O device is ready, the process is put into the ready queue to continue.

• IPC request: The process wants to communicate with another process through some blocking IPC feature. Like I/O, but here the "I/O device" is another process.

A note on the terminology: strictly speaking, a process (in the sense of an active entity) only exists when it is allocated the CPU. In all other cases it is a ‚dead body'.

Slide 172 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

[Queueing diagram of process scheduling: from the job queue (processes are new) into the ready queue (processes are ready), then to the CPU (process is running). An interrupt returns the process to the ready queue; an I/O request moves it to a device queue and an IPC request to the IPC queue (processes are waiting for events), from where it re-enters the ready queue.]


Slide 173 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

The OS selects processes from queues and puts them into other queues. This selection task is done by schedulers.

• Long-term Scheduler: originates from batch systems. Selects jobs (programs) from the pool and loads them into memory. Invoked rather infrequently (seconds ... minutes). Can be slow. Has influence on the degree of multiprogramming (number of processes in memory). Some modern OSs do not have a long-term scheduler any more.

• Short-term Scheduler: selects one process from among the processes that are ready to execute, and allocates the CPU to it. Initiates the context switches. Invoked very frequently (in the range of milliseconds). Must be fast, that is, must not consume much CPU time compared to the processes.

Slide 174 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

[Figure: schedulers and their queues – the long-term scheduler moves jobs from the job queue into the ready queue; the short-term scheduler dispatches processes from the ready queue to the CPU.]

Slide 175 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

Sometimes it may be advantageous to remove processes temporarily from memory in order to reduce the degree of multiprogramming. At some later time the process is reintroduced into memory and can be continued. This scheme is called swapping, performed by a medium-term scheduler.

[Figure: medium-term scheduler – processes are swapped out of the ready queue into a swap queue and later swapped back in.]

Slide 176 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Concept

• Program in execution: several processes may be carried out in parallel.

• Resource grouping: each process is related to a certain task and groups together the required resources (address space, PCB).

Traditional multi-processing systems:

• Each process is executed sequentially. No parallelism inside a process.

• Blocked operations → blocked process: any blocking operation (e.g. I/O, IPC) blocks the process. The process must wait until the operation finishes.

In traditional systems each process has a single thread of control.


Slide 177 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management

Processes (153)

Threads (178)

Interprocess Communication (IPC) (195)

Scheduling (247)

Real-Time Scheduling (278)

Deadlocks (318)

Slide 178 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

A thread is
• a piece of yarn,
• a screw spire,
• a line of thoughts.

Here: a sequence of instructions that may execute in parallel with others.

A thread is a line of execution within the scope of a process. A single-threaded process has a single line of execution (sequential execution of program code); the process and the thread are the same. In particular, a thread is a basic unit of CPU utilization.

Slide 179 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

Figure from [Ta01 p.82]: three single-threaded processes in parallel vs. a process with three parallel threads.

Slide 180 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

As an example, consider a word processing application:
- Reading from keyboard
- Formatting and displaying pages
- Periodically saving to disk
- ... and lots of other tasks

A single-threaded process would quite quickly result in an unhappy user, since (s)he always has to wait until the current operation is finished.

Multiple processes? Each process would have its own isolated address space.

Multiple threads! The threads operate in the same address space and thus have access to the data.
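That shared address space can be sketched with POSIX threads (assuming a pthreads-capable system): both threads write into the same global array, something separate processes could not do without extra IPC.

```c
#include <pthread.h>

/* Global data: visible to every thread of the process. */
static int results[2];

static void *worker(void *arg) {
    int idx = *(int *)arg;
    results[idx] = idx + 100;   /* write into the shared address space */
    return NULL;
}

/* Start two threads, wait for both, and combine their results. */
int run_two_threads(void) {
    pthread_t t0, t1;
    int a0 = 0, a1 = 1;
    pthread_create(&t0, NULL, worker, &a0);
    pthread_create(&t1, NULL, worker, &a1);
    pthread_join(t0, NULL);     /* joining makes both writes visible */
    pthread_join(t1, NULL);
    return results[0] + results[1];
}
```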


Slide 181 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

Figure from [Ta01 p.86]: three-threaded word processing application – one thread reading the keyboard, one formatting and displaying pages, one saving to disk.

Slide 182 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

• Multiple executions in the same environment: all threads have exactly the same address space (the process address space).

• Each thread has its own registers, stack and state.

Figure from [Sil00 p.116]

Slide 183 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

Table from [Ta01 p.83]: items shared by all threads in a process vs. items private to each thread.

Slide 184 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

User Level Threads

• Take place in user space: the operating system does not know about the application's internal multi-threading.

• Can be used on OSs not supporting threads: only needs some thread library (like pthreads) linked to the application.

• Each process has its own thread table, maintained by the routines of the thread library.

• Customized thread scheduling: the processes use their own thread scheduling algorithm. However, no timer-controlled scheduling is possible since there are no clock interrupts inside a process.

• Blocking system calls do block the process: all threads are stopped because the process is temporarily removed from the CPU.


Slide 185 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

User Level Threads

Thread management is performed by the application.

Examples:
- POSIX Pthreads
- Mach C-threads
- Solaris threads

Figure from [Ta01 p.91]

Slide 186 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Kernel Threads

• Take place in kernel: the operating system manages the threads of each process.

• Available only on multi-threaded OSs: the operating system must support multi-threaded application programs.

• No thread administration inside the process, since this is done by the kernel. Thread creation and management, however, is generally somewhat slower than with user level threads [Sil00 p.118].

• No customized scheduling: the user process cannot use its own customized scheduling algorithm.

• No problem with blocking system calls: a blocking system call causes a thread to pause. The OS activates another thread, either from the same process or from another process.

Slide 187 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Kernel Threads

Thread management is performed by the operating system.

Examples:
- Windows 95/98/NT/2000
- Solaris
- Tru64 UNIX
- BeOS
- Linux

Figure from [Ta01 p.91]

Slide 188 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multithreading Models

Many-to-One Model

• Many user level threads are mapped to a single kernel thread.
• Used on systems that do not support kernel threads.

Figure from [Sil00 p.118]


Slide 189 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multithreading Models

One-to-One Model

Each user level thread is mapped to one kernel thread.

Figure from [Sil00 p.119]

Slide 190 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multithreading Models

Many-to-Many Model

Many user level threads are mapped to many kernel threads.

Figure from [Sil00 p.119]

Slide 191 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multithreading

Solaris 2 multi-threading example

Figure from [Sil00 p.121]

Slide 192 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

Windows 2000:
• Implements one-to-one mapping
• Each thread contains
- a thread id
- a register set
- separate user and kernel stacks
- a private data storage area

Linux:
• One-to-one model (pthreads), many-to-many (NGPT)
• Thread creation is done through the clone() system call. clone() allows a child to share the address space of the parent. This system call is unique to Linux; source code is not portable to other UNIX systems.


Slide 193 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Threads

Java:
• Provides support at language level.
• Thread scheduling in the JVM.

Example: creation of a thread by inheriting from the Thread class:

class Worker extends Thread {
    public void run() {
        System.out.println("I am a worker thread");
    }
}

public class MainThread {
    public static void main(String args[]) {
        Worker worker1 = new Worker();
        worker1.start();   // thread creation and automatic call of run() method
        System.out.println("I am the main thread");
    }
}

Slide 194 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management

Processes (153)

Threads (178)

Interprocess Communication (IPC) (195)

Scheduling (247)

Real-Time Scheduling (278)

Deadlocks (318)

Slide 195 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

IPC

Purpose of Inter-Process Communication:

• Managing critical activities: making sure that two (or more) processes do not get into each others' way when engaging in critical activities. (Process and thread synchronization)

• Sequencing: making sure that proper sequencing is assured in case of dependencies among processes.

• Passing information: processes are independent of each other and have private address spaces. How can a process pass information (or data) to another process? (Data exchange – less important for threads since they operate in the same environment.)

Slide 196 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Race Conditions

[Figure from [Ta01 p.101]: print spooling example – two processes and a spooler directory with shared variables out and in; in points to the next empty slot.]

Situations where two or more processes access some shared resource, and the final result depends on who runs precisely when, are called race conditions.


Slide 197 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Race Conditions

Processes A and B want to print a file. Both have to enter the file name into a spooler directory. out points to the next file to be printed; this variable is accessed only by the printer daemon, which currently is busy with slot 4. in points to the next empty slot. Each process entering a file name into the empty slot must increment in.

Now consider this situation: Process A reads in (value = 7) into some local variable. Before it can continue, the CPU is switched over to B. Process B reads in (value = 7) and stores its value locally. Then the file name is entered into slot 7 and the local variable is incremented by 1. Finally the local variable is copied to in (value = 8). Process A is running again. According to its local variable, the file name is entered into slot 7 – erasing the file name put there by B. Finally in is incremented. User B is waiting in the printer room for years ...

Slide 198 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Race Conditions

Another example at machine instruction level. Shared variable x (initially 0); each process increments x via a register:

Scenario 1 (sequential):
Process 1: R1 ← x; R1 = R1+1; R1 → x    (x = 1)
Process 2: R3 ← x; R3 = R3+1; R3 → x    (x = 2)

Scenario 2 (interleaved):
Process 1: R1 ← x                        (x = 0)
Process 2: R3 ← x; R3 = R3+1; R3 → x    (x = 1)
Process 1: R1 = R1+1; R1 → x            (x = 1)

Depending on the interleaving, the final value of x is 2 or 1.
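The interleaving above can be provoked (non-deterministically) with two unsynchronized threads; a sketch assuming pthreads:

```c
#include <pthread.h>

#define ITERS 100000

static volatile int x = 0;          /* shared variable, unprotected */

static void *incrementer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        x = x + 1;                  /* load, add, store: not atomic */
    return NULL;
}

/* Run two incrementing threads concurrently.
   Lost updates (Scenario 2) make the result at most 2*ITERS,
   and often less. */
int race_demo(void) {
    pthread_t t1, t2;
    x = 0;
    pthread_create(&t1, NULL, incrementer, NULL);
    pthread_create(&t2, NULL, incrementer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return x;
}
```

The outcome varies from run to run, which is exactly what makes race conditions hard to debug.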

Slide 199 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Critical Regions

How to avoid race conditions? Find some way to prohibit more than one process from manipulating the shared data at the same time → "mutual exclusion".

Part of the time a process is doing some internal computations and other things that do not lead to race conditions. Sometimes, however, a process needs to access shared resources or does other critical things that may lead to race conditions. These parts of a program are called critical regions (or critical sections).

[Figure: timeline of process A alternating between non-critical code and its critical region.]

Slide 200 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Critical Regions

Four conditions to provide correctly working mutual exclusion:

1. No two processes simultaneously in their critical regions, which would otherwise controvert the concept of mutuality.

2. No assumptions about process speeds: no predictions on process timings or priorities. Must work with all processes.

3. No process outside its critical regions must block other processes, simply because there is no reason to hinder others entering their critical region.

4. No process must wait forever to enter a critical region, for reasons of fairness and to avoid deadlocks.


Slide 201 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Critical Regions

Figure from [Ta01 p.103]: mutual exclusion using critical regions

Slide 202 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

Proposals for achieving mutual exclusion:

• Disabling interrupts: the process disables all interrupts and thus cannot be taken away from the CPU. Not appropriate – unwise to give a user process full control over the computer.

• Lock variables: a process reads a shared lock variable. If the lock is not set, the process sets the variable (locking) and uses the resource. In the period between evaluating and setting the variable the process may be interrupted. Same problem as with the printer spooling example.

Slide 203 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

Proposals for achieving mutual exclusion (continued):

• Strict Alternation: the shared variable turn keeps track of whose turn it is. Both processes alternate in accessing their critical regions.

/* Process 0 */                  /* Process 1 */
while (1) {                      while (1) {
    while (turn != 0) ;              while (turn != 1) ;
    critical_region();               critical_region();
    turn = 1;                        turn = 0;
    noncritical_region();            noncritical_region();
}                                }

Slide 204 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

• Strict Alternation (continued)

Busy waiting wastes CPU time. No good idea when one process is much slower than the other: a process can be blocked by a process that is currently outside its critical region – a violation of condition 3.

[Figure: timeline with process 0 busy waiting for turn = 0 while process 1 is still in its non-critical region.]


Slide 205 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

Proposals for achieving mutual exclusion (continued):

• Peterson Algorithm (two processes, process number is either 0 or 1)

/* shared variables */
int turn;
bool interested[2];

void enter_region(int process) {
    int other = 1 - process;
    interested[process] = TRUE;
    turn = process;
    while (turn == process && interested[other] == TRUE) ;
}

void leave_region(int process) {
    interested[process] = FALSE;
}

Slide 206 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

Peterson Algorithm (continued)

Assume process 0 and process 1 both simultaneously call enter_region():

Process 0: other = 1; interested[0] = true; turn = 0
Process 1: other = 0; interested[1] = true; turn = 1

Both are manipulating turn at the same time. Whichever store is last is the one that counts. Assume process 1 was slightly later, thus turn = 1.

Process 0 waits in: while (turn == 0 && interested[1] == TRUE);
Process 1 waits in: while (turn == 1 && interested[0] == TRUE);

Process 0 passes its while statement, whereas process 1 keeps busy waiting therein. Later, when process 0 calls leave_region(), process 1 is released from the loop.

A correctly working algorithm, but it uses busy waiting.

Slide 207 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

Proposals for achieving mutual exclusion (continued):

• Test and Set Lock (TSL): an atomic (indivisible) operation at machine level; it cannot be interrupted. TSL reads the content of the memory word lock into register R and then stores a nonzero value at the memory address lock. The memory bus is locked, so no other process(or) can access lock. The CPU must support TSL. Still busy waiting.

Pseudo assembler listing providing the functions enter_region() and leave_region():

enter_region:
    TSL R, lock         ; read lock into R and set lock (atomic)
    CMP R, #0           ; was lock zero (free)?
    JNZ enter_region    ; if not: loop (busy waiting)
    RET

leave_region:
    MOV lock, #0        ; release the lock
    RET
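The same test-and-set idea is available in portable C through C11 atomics; a minimal spinlock sketch, assuming <stdatomic.h> support (atomic_flag_test_and_set plays the role of TSL):

```c
#include <stdatomic.h>

/* The flag plays the role of the memory word `lock`. */
static atomic_flag region_lock = ATOMIC_FLAG_INIT;
static int counter = 0;

static void enter_region(void) {
    /* atomic_flag_test_and_set sets the flag and returns its
       previous value in one indivisible step (the TSL). */
    while (atomic_flag_test_and_set(&region_lock))
        ;                               /* busy waiting, as on the slide */
}

static void leave_region(void) {
    atomic_flag_clear(&region_lock);    /* MOV lock, #0 */
}

/* Increment a shared counter inside the critical region. */
int locked_increment(void) {
    enter_region();
    counter++;
    int v = counter;
    leave_region();
    return v;
}
```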

Slide 208 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutual Exclusion

Intermediate Summary

• Disabling Interrupts – not recommended for multi-user systems.
• Lock Variables – problem remains the same.
• Strict Alternation – violation of condition 3; busy waiting.
• Peterson Algorithm – busy waiting.
• TSL instruction – solves the problem through an atomic operation; should be used without busy waiting.

In essence, what the last three solutions do is this: a process checks whether entry to its critical region is allowed. If it is not, the process just sits in a tight loop waiting until it is. This has unexpected side effects, such as the priority inversion problem.


Slide 209 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Priority Inversion Problem

Consider a computer with two processes:
Process H with high priority,
Process L with low priority.

The scheduling rules are such that H is run whenever it is in the ready state. At a certain moment, with L in its critical region, H becomes ready and is scheduled. H now begins busy waiting, but since L is never scheduled while H is running, L never has the chance to leave its critical region. H loops forever. This is sometimes referred to as the priority inversion problem.

Solution: blocking a process instead of wasting CPU time.

Slide 210 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Sleep and wake up

sleep(): a system call that causes the caller to block, that is, the process voluntarily goes from the running state into the waiting state. The scheduler switches over to another process.

wakeup(process): a system call that causes process to awake from its sleep() and to continue execution. If process is not asleep at that moment, the wakeup signal is lost.

Note: these two calls are fictitious representatives of real system calls whose names and parameters depend on the particular operating system.

Slide 211 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Producer – Consumer Problem

• Shared buffer with limited size: the buffer allows for a maximum of N entries (it is bounded). The problem is also known as the bounded buffer problem.

• Producer puts information into the buffer. When the buffer is full, the producer must wait until at least one item has been consumed.

• Consumer removes information from the buffer. When the buffer is empty, the consumer must wait until at least one new item has been entered.

Producer – Consumer implementation example (this implementation suffers from race conditions):

const int N = 100;          /* buffer capacity            */
int count = 0;              /* number of items in buffer  */

void producer() {
    while (TRUE) {                          /* constantly producing */
        int item = produce_item();
        if (count == N) sleep();            /* sleep when buffer is full */
        insert_item(item);
        count++;
        if (count == 1) wakeup(consumer);   /* buffer was empty beforehand:
                                               wake up waiting consumer(s) */
    }
}

void consumer() {
    while (TRUE) {                          /* constantly consuming */
        if (count == 0) sleep();            /* sleep when buffer is empty (A) */
        item = remove_item();
        count--;
        if (count == N-1) wakeup(producer); /* buffer was full beforehand:
                                               wake up waiting producer(s) */
        consume_item(item);
    }
}


Slide 213 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Producer – Consumer Problem

A race condition may occur in this case:

The buffer is empty and the consumer has just read count to see if it is 0. At that instant (see A in the listing) the scheduler decides to switch over to the producer.

The producer inserts an item in the buffer, increments count and notices that count is now 1. Reasoning that count was just 0 and thus the consumer must be sleeping, the producer calls wakeup() to wake the consumer up.

However, the consumer was not yet asleep; it was taken away from the CPU shortly before it could enter sleep(). The wakeup signal is lost.

When the consumer is rescheduled and resumes at A, it will go to sleep. Sooner or later the producer has filled up the buffer and goes to sleep as well.

Both processes will sleep forever.

Slide 214 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Producer – Consumer Problem

Reasons for the race condition:

• The variable count is unconstrained: any process has access at any time.

• Evaluating count and going to sleep is a non-atomic operation: the prerequisite(s) that led to sleep() may have changed by the time sleep() is reached.

Workaround:

• Add a wakeup waiting bit: when the bit is set, sleep() resets that bit and the process stays awake.

• Each process must have a wakeup bit assigned. Although this is possible, the principal problem is not solved.

What is needed is something that tests a variable and goes to sleep – dependent on that variable – in a single non-interruptible manner.

Slide 215 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores

• Introduced by Dijkstra (1965).

• Counting the number of wakeups: an integer variable counts the number of wakeups for future use.

• Two operations: down and up. down is a generalization of sleep; up is a generalization of wakeup. Both operations are carried out as a single, indivisible operation (usually in the kernel). Once a semaphore operation is started, no other process can access the semaphore.

/* principle of the down-operation */
down(int *sem) {
    if (*sem < 1) sleep();
    (*sem)--;
}

/* principle of the up-operation */
up(int *sem) {
    (*sem)++;
    if (*sem == 1) /* wakeup a process */ ;
}

Slide 216 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores

• up and down are system calls, in order to make sure that the operating system briefly disables all interrupts while carrying out the few machine instructions implementing up and down.

• Semaphores should be lock-protected, at least in multi-processor systems, to prevent another CPU from simultaneously accessing a semaphore. The TSL instruction helps out.

Producer – Consumer problem using semaphores (next page). Definition of variables:

const int N = 10;
typedef int semaphore;      /* a semaphore is an integer         */
semaphore empty = N;        /* counting empty slots              */
semaphore full = 0;         /* counting full slots               */
semaphore mutex = 1;        /* mutual exclusion on buffer access */


Producer – Consumer implementation example (this implementation does not suffer from race conditions):

void producer() {
    while (TRUE) {
        int item = produce_item();
        down(&empty);       /* possibly sleep; decrement empty counter     */
        down(&mutex);       /* possibly sleep; claim mutex (set it to 0)   */
        insert_item(item);
        up(&mutex);         /* release mutex, wake up other process        */
        up(&full);          /* increment full counter, possibly wake other */
    }
}

void consumer() {
    while (TRUE) {
        down(&full);        /* possibly sleep; decrement full counter       */
        down(&mutex);       /* possibly sleep; claim mutex (set it to 0)    */
        item = remove_item();
        up(&mutex);         /* release mutex, wake up other process         */
        up(&empty);         /* increment empty counter, possibly wake other */
        consume_item(item);
    }
}
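The listing maps almost one-to-one onto POSIX semaphores, where sem_wait() and sem_post() are the real down and up; a sketch assuming pthreads and a Linux-style libc (sem_init() is deprecated on some platforms):

```c
#include <pthread.h>
#include <semaphore.h>

#define N 4
#define ITEMS 20

static int buffer[N], in_pos = 0, out_pos = 0, consumed_sum = 0;
static sem_t empty_slots, full_slots, buf_mutex;

static void *producer(void *arg) {
    (void)arg;
    for (int item = 1; item <= ITEMS; item++) {
        sem_wait(&empty_slots);         /* down(&empty) */
        sem_wait(&buf_mutex);           /* down(&mutex) */
        buffer[in_pos] = item;
        in_pos = (in_pos + 1) % N;
        sem_post(&buf_mutex);           /* up(&mutex)   */
        sem_post(&full_slots);          /* up(&full)    */
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITEMS; i++) {
        sem_wait(&full_slots);          /* down(&full)  */
        sem_wait(&buf_mutex);           /* down(&mutex) */
        consumed_sum += buffer[out_pos];
        out_pos = (out_pos + 1) % N;
        sem_post(&buf_mutex);           /* up(&mutex)   */
        sem_post(&empty_slots);         /* up(&empty)   */
    }
    return NULL;
}

/* Run one producer and one consumer to completion;
   returns the sum of all consumed items (1+2+...+ITEMS). */
int run_bounded_buffer(void) {
    sem_init(&empty_slots, 0, N);       /* counting empty slots */
    sem_init(&full_slots, 0, 0);        /* counting full slots  */
    sem_init(&buf_mutex, 0, 1);         /* mutual exclusion     */
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return consumed_sum;
}
```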

Slide 218 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores

Assume N = 5.

Scenario: producer is working, no consumer present. Initial condition: empty = 5, full = 0. With each produced item, empty decreases and full increases (4/1, 3/2, 2/3, 1/4, 0/5). When empty reaches 0, the producer goes to sleep in down(&empty).

Scenario: consumer is working, no producer present. Initial condition: empty = 0, full = 5. With each consumed item, full decreases and empty increases. When full reaches 0, the consumer goes to sleep in down(&full).

Slide 219 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores

Scenario: consumer waking up producer. Assume N = 5, initial condition: empty = 1, full = 4. The producer fills the last slot (empty = 0, full = 5) and goes to sleep on its next down(&empty). When a consumer removes an item (empty = 1, full = 4), its up(&empty) wakes up the producer.

Slide 220 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores

Scenario: producer waking up consumer. Assume N = 5, initial condition: empty = 4, full = 1. The consumer empties the buffer (empty = 5, full = 0) and goes to sleep on its next down(&full). When the producer inserts an item (empty = 4, full = 1), its up(&full) wakes up the consumer.


Slide 221 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Semaphores

Assume N = 5. Initial condition: empty = 3, full = 2.

[Timing diagram: producer and consumer run overlapping; their down() and up() operations on empty and full interleave.]

If processes overlap, then temporarily it may be that empty + full ≠ N.

Note that consumer and producer may almost concurrently change the same semaphore legally.

Mutual Exclusion

Slide 222 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Mutex

• Simplified semaphore
Used when counting is not needed.

• Two states
Locked or unlocked. Used for managing mutual exclusion (hence the name).

Pseudo assembler listing implementing mutex_lock() and mutex_unlock():

mutex_lock:    TSL R, mutex         | get and set mutex
               CMP R, #0            | was it unlocked?
               JZ ok                | if yes: jump to ok
               CALL thread_yield    | if no: sleep
               JMP mutex_lock       | try again acquiring mutex
ok:            RET

mutex_unlock:  MOV mutex, #0        | unlock mutex
               RET
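The same busy-wait-and-yield lock can be sketched at the C level with C11 atomics; atomic_flag_test_and_set is the analogue of the TSL instruction. An illustrative sketch of ours, not part of the slides:

```c
#include <stdatomic.h>
#include <sched.h>

static atomic_flag mutex_flag = ATOMIC_FLAG_INIT;   /* clear = unlocked */

void mutex_lock(void) {
    /* test-and-set returns the previous value: nonzero means it was locked */
    while (atomic_flag_test_and_set(&mutex_flag))
        sched_yield();                    /* CALL thread_yield: give up the CPU */
}                                         /* loop back: JMP mutex_lock */

void mutex_unlock(void) {
    atomic_flag_clear(&mutex_flag);       /* MOV mutex, #0 */
}
```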

Mutual Exclusion

Slide 223 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Monitors

• High-level synchronization primitive
at programming language level. Direct support by some programming languages.

• A collection of procedures, variables and data structures grouped together in a module

Mutual Exclusion

A monitor has multiple entry points.
Only one process can be in the monitor at a time.
Enforces mutual exclusion – fewer chances for programming errors.

• Monitor implementation
Compiler handles the implementation, or library functions using semaphores.

Slide 224 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Monitors
Mutual Exclusion

monitor example;
    integer i;
    condition c;

    procedure producer();
    ...
    end;

    procedure consumer();
    ...
    end;
end monitor;

A monitor in Pidgin Pascal, from [Ta01 p.115]

Variables are not accessible from outside the monitor's own methods (encapsulation).

Functions (methods) are publicly accessible to all processes; however, only one process at a time may call a monitor function.

If the buffer is full, the producer must wait. If the buffer is empty, the consumer must wait.


Slide 225 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Monitors
Mutual Exclusion

• How can a process wait inside a monitor?
It cannot simply be put to sleep, because then no other process could enter the monitor meanwhile.

• Use a condition variable!
A condition variable supports two operations.

wait(): suspend this process until it is signaled. The suspended process is not considered inside the monitor any more. Another process is allowed to enter the monitor.

signal(): wake up one process waiting on the condition variable. No effect if nobody is waiting. The signaling process automatically leaves the monitor (Hoare monitor).

Condition variables usable only inside a monitor.
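POSIX threads offer condition variables outside a language-level monitor; a mutex plays the part of the monitor lock. Note that pthreads uses Mesa-style (signal-and-continue) semantics rather than the Hoare semantics described above, which is why the wait sits in a while loop. A minimal sketch of ours, not from the slides:

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* the 'monitor' lock */
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;   /* condition variable */
static int ready = 0;
static int observed = 0;

static void *waiter(void *unused) {
    pthread_mutex_lock(&m);            /* enter the monitor */
    while (!ready)                     /* Mesa semantics: re-check after wakeup */
        pthread_cond_wait(&c, &m);     /* wait(): release m, sleep, re-acquire m */
    observed = ready;
    pthread_mutex_unlock(&m);          /* leave the monitor */
    return NULL;
}

static void *signaler(void *unused) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);           /* signal(): wake one waiting process */
    pthread_mutex_unlock(&m);          /* signaler leaves explicitly (no Hoare handoff) */
    return NULL;
}

int run_demo(void) {
    pthread_t w, s;
    pthread_create(&w, NULL, waiter, NULL);
    pthread_create(&s, NULL, signaler, NULL);
    pthread_join(w, NULL);
    pthread_join(s, NULL);
    return observed;                   /* 1 once the waiter has seen the signal */
}
```

The while loop makes the demo correct in either interleaving: if the signaler runs first, the waiter finds ready already set and never sleeps.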

Slide 226 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Monitors
Mutual Exclusion

Producer-Consumer problem with monitors, from [Ta01 p.117]

Slide 227 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Processes approaching barrier Waiting for C to arrive All processes continuing

Barriers
IPC

• Group synchronization
Intended for groups of processes rather than for two processes.

• Processes wait at a barrier for the others
according to the all-or-none principle

• After all have arrived, all can proceed

Figure from [Ta01 p.124]

Slide 228 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Barriers
Application example

Process 2 working on these elements

Process 3 working onthese elements

... and so on for theremaining elements

An array (e.g. an image) is updated frequently by some process 0 (producer). Many processes are working in parallel on certain array elements (consumers). All consumers must wait until the array has been updated and can then start working again on the updated input.

Process 0

IPC

Process 1 working on these elements


Slide 229 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

IPC
Intermediate Summary (II)

• Semaphores
Counting variable, used in a non-interruptible manner. down() may put the caller to sleep, up() may wake up another process.

• Mutexes
Simplified semaphore with two states. Used for mutual exclusion.

• Monitors
High-level construct for achieving mutual exclusion at programming language level.

• Barriers
Used for synchronizing a group of processes.

These mechanisms all serve for process synchronization.
For data exchange among processes something else is needed: messages.

Slide 230 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Messages

• Kernel-supported mechanism for data exchange

Eliminates the need for ‚self-made‘ (user-programmed) communication via shared resources such as shared files or shared memory.

IPC

[Figure: process 1 calls send(); the data (a message) is copied from user space into system buffers in the OS (kernel space); process 2 calls receive() and the data is copied from kernel space to its user space.]

• Two basic operations, provided by the kernel (system calls):
send(): send data
receive(): receive data

Slide 231 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Communication

• Processes must name each other explicitly

Messages

- send(P, message): send data to process P
- receive(Q, message): receive data from process Q

• Communication link properties
- One process pair has exactly one link
- The link may be unidirectional or bidirectional

• Both processes must exist
As the name direct implies, you cannot send a message to a future process.

Symmetry in addressing. Both processes need to know each other by some identifier. This is no problem if both were fork()ed off the same parent beforehand, but is a problem when they are ‚strangers‘ to each other.

Slide 232 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Indirect Communication
Messages

- Each mailbox has a unique identifier
- Processes communicate when they access the same mailbox

• Communication link properties
- Link is established when processes share a mailbox
- A link may be associated with many processes (broadcast)
- Unidirectional or bidirectional communication

• Messages are sent to / received from mailboxes
The mailbox must exist, but not necessarily the receiving process yet.

• Primitives
- send(A, message): send message to mailbox A
- receive(A, message): receive message from mailbox A


Slide 233 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Synchronous Communication
Messages

• Also called blocking send / receive

• Sender waits for receiver to receive the data
The send() system call blocks until the receiver has received the message.

[Figure: process 1 send() → kernel buffer → process 2 receive(); an acknowledgement from the receiver unblocks the sender. A single buffer (for the pair) is sufficient.]

• Receiver waits for sender to send dataThe receive() system call blocks until a message is arriving.

Slide 234 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Asynchronous Communication
Messages

• Also called non-blocking send / receive

• Sender drops message and passes on
The send() system call returns to the caller as soon as the kernel has the message.

• Receiver peeks for messages
The receive() system call does not block, but rather returns an error code telling whether there is a message or not. The receiver must do polling to check for messages.

[Figure: process 1 send() → kernel buffers → process 2 receive(). Multiple buffers (for each pair) are needed.]

Slide 235 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Messages

• Send by copy
The message is copied to a kernel buffer at send time. At receive time the message is copied to the receiver. Copying takes time.

• Send by reference
A reference (a memory address or a handle) is copied to the receiver, which uses the reference to access the data. The data usually resides in a kernel buffer (it is copied there beforehand). Fast read access.

• Fixed-size messages
The kernel buffers are of fixed size – as are the messages. Straightforward system-level implementation. Big messages must be constructed from many small messages, which makes user-level programming somewhat more difficult.

• Variable-size messages
Sender and receiver must communicate about the message size. Best use of kernel buffer space; however, buffers must not grow indefinitely.

IPC

Slide 236 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

UNIX IPC Mechanisms

• Pipes
Simple(st) communication link between two processes. Applies the first-in first-out principle. Works like an invisible file, but is no file.
Operations: read(), write().

• FIFOs
Also called named pipes. Works like a file. May exist in modern Unices just in the kernel (and not in the file system). There can be more than one writer or reader on a FIFO.
Operations: open(), close(), read(), write().

• Messages
Allow for message transfer. Messages can have types. A process may read all messages or only those of a particular type. Message communication works according to the first-in first-out principle.
Operations: msgget(), msgsnd(), msgrcv(), msgctl().
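The four message operations can be sketched in a few lines. This is our own demonstration (not from the lecture) in which a process sends a message to itself through a freshly created private queue:

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <string.h>

struct msgbuf { long mtype; char mtext[32]; };  /* mtype > 0 selects a message type */

int msq_demo(char *out) {
    int id = msgget(IPC_PRIVATE, IPC_CREAT | 0600);  /* create a private queue */
    if (id < 0) return -1;
    struct msgbuf m = { 1, "Hallo!" };
    msgsnd(id, &m, sizeof m.mtext, 0);               /* enqueue (FIFO per type) */
    struct msgbuf r;
    msgrcv(id, &r, sizeof r.mtext, 1, 0);            /* receive next message of type 1 */
    msgctl(id, IPC_RMID, NULL);                      /* remove the queue again */
    strcpy(out, r.mtext);
    return 0;
}
```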

IPC


Slide 237 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

UNIX IPC Mechanisms

• Semaphores
Creation and manipulation of sets of semaphores.
Operations: semget(), semop(), semctl().

IPC

For an introduction into the UNIX IPC mechanisms (with examples) see

Stefan Freinatis: Interprozeßkommunikation unter Unix - eine Einführung, Technischer Bericht, Fachgebiet Datenverarbeitung, Universität Duisburg, 1994.http://www.fb9dv.uni-duisburg.de/vs/members/fr/ipc.pdf

• Shared memory
A selectable part of the address space of process P1 is mapped into the address space of another process P2 (or others). The processes have simultaneous access.
Operations: shmget(), shmat(), shmdt(), shmctl().

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define FIXSIZE 80                     // fixed message size

int main() {
    int fd[2];                         // file descriptors for pipe
    pipe(fd);                          // create pipe
    int result = fork();               // duplicate process
    if (result == 0) {                 // start child's code
        printf("This is the child, my pid is: %d\n", getpid());
        close(fd[1]);                  // we do not need writing
        char buf[256];                 // a buffer
        read(fd[0], buf, FIXSIZE);     // wait for message from parent
        printf("Child: received message was: %s\n", buf);
        exit(0);                       // good bye
    }                                  // end child, start parent
    close(fd[0]);                      // we do not need reading
    printf("This is the parent, my pid is: %d\n", getpid());
    char msg[FIXSIZE] = "Hallo!";      // message padded to fixed size
    write(fd[1], msg, FIXSIZE);        // write message to child
    return 0;
}

Simple pipe example. Parent is writing, child is reading.

Slide 239 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Classical IPC Problems
IPC

Figure from [Ta01 p.125]

• Five philosophers sitting at a table
The problem can be generalized to more than five philosophers, of course.

• Each either eats or thinks
• Five forks available
• Eating needs 2 forks
Slippery spaghetti, one needs two forks!

• Pick one fork at a time
Either first the right fork and then the left one, or vice versa.

The dining philosophers
An artificial synchronization problem posed and solved by Edsger Dijkstra in 1965.

Slide 240 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Dining philosophers
Classical IPC problems

The life of these philosophers consists of alternate periods of eating and thinking. When a philosopher becomes hungry, she tries to acquire her left and right fork, one at a time, in either order. If successful in acquiring two forks, she eats for a while, then puts down the forks and continues to think.

Can you write a program that

makes the philosophers eating and thinking (thus creation of 5 threads or processes, one for each philosopher),

allows maximum utilization (parallelism), that is, two philosophers may eat at a time (no simple solution with just one philosopher eating at a time),

is not centrally controlled by somebody instructing the philosophers,

and that never gets stuck?

Text from [Ta01 p.125]


Slide 241 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Dining philosophers
Classical IPC problems

const int N=5;

void philosopher(int i) { // N philosophers in parallel

while(TRUE){ // for the whole life

think();

take_fork(i); // take left fork

take_fork((i+1)%N); // take right fork

eat();

put_fork(i); // put left fork

put_fork((i+1)%N); // put right fork

}

}

A nonsolution to the dining philosophers problem

If all philosophers take their left fork simultaneously, none will be able to take the right fork. All philosophers get stuck. Deadlock situation.
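A deadlock-free approach (following the idea of guarding fork acquisition with a mutex and one semaphore per philosopher, as in [Ta01]) lets a philosopher start eating only when neither neighbour eats. The sketch below translates that idea to POSIX semaphores; the helper names and the test-only structure are ours:

```c
#include <semaphore.h>

#define N 5
#define LEFT(i)  (((i) + N - 1) % N)
#define RIGHT(i) (((i) + 1) % N)
enum { THINKING, HUNGRY, EATING };

static int state[N];          /* what each philosopher is doing */
static sem_t mutex;           /* protects state[] */
static sem_t s[N];            /* one semaphore per philosopher */

static void test_phil(int i) {
    /* philosopher i may eat if hungry and neither neighbour is eating */
    if (state[i] == HUNGRY &&
        state[LEFT(i)] != EATING && state[RIGHT(i)] != EATING) {
        state[i] = EATING;
        sem_post(&s[i]);      /* up(&s[i]): grant both forks at once */
    }
}

void take_forks(int i) {
    sem_wait(&mutex);         /* enter critical region */
    state[i] = HUNGRY;
    test_phil(i);             /* try to acquire both forks */
    sem_post(&mutex);
    sem_wait(&s[i]);          /* block if the forks were not granted */
}

void put_forks(int i) {
    sem_wait(&mutex);
    state[i] = THINKING;
    test_phil(LEFT(i));       /* a hungry neighbour may now eat */
    test_phil(RIGHT(i));
    sem_post(&mutex);
}

void init_table(void) {
    sem_init(&mutex, 0, 1);
    for (int i = 0; i < N; i++) sem_init(&s[i], 0, 0);
}
```

Because both forks are granted atomically under the mutex, the all-take-the-left-fork deadlock cannot occur, and two non-adjacent philosophers may eat in parallel.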

Slide 242 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Classical IPC Problems
IPC

• Database system
such as an airline reservation system.

• Many competing processes wish to read and write
Many reading processes are not the problem, but if one process wants to write, no other process may have access – not even readers.

The Readers and Writers Problem
An artificial shared database access problem by Courtois et al., 1971.

How to program the readers and writers?

Writer waits until all readers are gone
Not good. Usually there are always readers present. Indefinite wait.

Writer blocks new readers
A solution. The writer waits until the old readers are gone and meanwhile blocks new readers.

Slide 243 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Classical IPC Problems
IPC

The sleeping barber problem

An artificial queuing situation problem

Figure from [Ta01 p.130]

customer chairs

barber sleeps when no customers are present

Slide 244 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Sleeping Barber
IPC

How to program the barber and the customers without getting into race conditions?

The barber shop has one barber, one barber chair, and n chairs for customers, if any, to sit on. If there are no customers present, the barber sits down in the barber chair and falls asleep. When a customer arrives, he has to wake up the sleeping barber. If additional customers arrive while the barber is cutting a customer’s hair, they either sit down (if there are empty chairs) or leave the shop (if all chairs are full). Text from [Ta01 p.129]

const int CHAIRS=5; // number of chairs

typedef int semaphore;

semaphore customers = 0; // number of customers waiting

semaphore barbers = 0; // number of barbers waiting

semaphore mutex = 1; // for mutual exclusion

int waiting = 0;


Slide 245 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

IPC

void barber() {              // barber process
    while (TRUE) {           // for the whole life
        down(&customers);    // sleep if no customers
        down(&mutex);        // acquire access to 'waiting'
        waiting--;
        up(&barbers);        // one barber ready to cut
        up(&mutex);          // release 'waiting'
        cut_hair();          // cut hair (non critical)
    }
}

void customer() {            // customer process
    down(&mutex);            // enter critical region
    if (waiting < CHAIRS) {  // when seats available
        waiting++;           // one more waiting
        up(&customers);      // tell barber if first customer
        up(&mutex);          // release 'waiting'
        down(&barbers);      // sleep if no barber available
        get_haircut();       // get serviced
    } else up(&mutex);       // shop is full, leave
}

A solution to the sleeping barber problem [Ta01 p.131]

Slide 246 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

Processes (153)

Threads (178)

Interprocess Communication (IPC) (195)

Scheduling (247)

Real-Time Scheduling (278)

Deadlocks (318)

Slide 247 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

• Better CPU utilization through multiprogramming

• Scheduling: switching CPU among processes

• Productivity depends on CPU bursts

Figure from [Ta01 p.134]

Slide 248 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Short-Term Scheduler
Scheduling

[Figure: job queue → ready queue → CPU, with the short-term scheduler selecting from the ready queue.]

Also called CPU scheduler. Selects one process from among the ready processes in memory and dispatches it. The dispatcher is a module that finally gives CPU control to the selected process (switching context, switching from kernel mode to user mode, loading the PC).


Slide 249 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling decisions
Scheduling

CPU scheduling decisions may take place when a process

1. switches from running to waiting,
2. switches from running to ready,
3. switches from waiting to ready,
4. or terminates.

Figure from [Sil00 p.89]

Slide 250 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Preemptive(ness)
Scheduling

Preemptiveness determines the way of multitasking.

With non-preemptive scheduling (cooperative scheduling), a running process loses the CPU only because
the process became blocked,
it completed,
or it voluntarily gave up the CPU.

With preemptive scheduling the operating system can additionally force a context switch at any time to satisfy the priority policies. This allows the system to more reliably guarantee each process a regular "slice" of operating time.

Slide 251 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Preemptive(ness)
Scheduling

Cooperative (non-preemptive) scheduling:

• CPU occupation depends on processin particular on the CPU burst distribution.

• Applicable on any hardware platform• Lesser problems with shared resources

at least the elementary parts of shared data structures are not inconsistent

Preemptive scheduling:

• Scheduler can interrupt• Special timer hardware required

for the timer-controlled interrupts of the scheduler.

• Synchronization of shared resourcesAn interrupted process may leave shared data inconsistent.

Slide 252 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling Criteria

• CPU utilization
Keeping the CPU as busy as possible. The utilization usually ranges from 40% (lightly loaded system) to 90% (heavily loaded).

• Throughput
The number of processes that are completed per time unit. For long processes the throughput rate may be one process per hour, for short ones it may be 10 per second.

• Turnaround time
The interval from the time of submission to the time of completion of a process. Includes the time to get into memory, time spent in the ready queue, execution time on the CPU and I/O time.
[With real-time scheduling this time period is called reaction time.]

Scheduling
The scheduling policy depends on what criteria are emphasized [Sil00 p.140]


Slide 253 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling Criteria

• Waiting time
The scheduling algorithm does not affect the time a process executes or spends doing I/O. It only affects the amount of time a process spends waiting in the ready queue. The waiting time is the sum of the time spent waiting in the ready queue.

• Response time
Irrespective of the turnaround time, some processes produce an output fairly early and continue computing new results while previous results are output to the user. The response time is the time from the submission of a request until the first response is produced.
[Remark: In the exercises the response time is defined as the time from submission until the process starts (that is, until the first machine instruction is executing).]

Scheduling

Different systems (batch systems, interactive computers, control systems) may put focus on different scheduling criteria. See next slide.

Slide 254 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling

Criteria importance by system [Ta01 p.137]

Slide 255 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Optimization
Scheduling

• Maximize(average(CPU utilization))

• Maximize(average(throughput))

• Minimize(average(turnaround time))

• Minimize(average(waiting time))

• Minimize(average(response time))

Sometimes it is desirable to optimize the minimum or maximum values rather than the average. For example, to guarantee that all users receive a good service in terms of responsiveness, we may want to minimize the maximum response time. [Note: we do not delve into ‘optimization’ any further].

Common criteria:

Slide 256 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Static / Dynamic Scheduling
Scheduling

With static scheduling all decisions are made before the system starts running. This only works when there is perfect information available in advance about the work needed to be done and the deadlines that have to be met. Static scheduling – if applied – is used in real-time systems that operate in a deterministic environment.

With dynamic scheduling all decisions are made at run time. Little needs to be known in advance. Dynamic scheduling is required when the number and type of requests is not known beforehand (non-deterministic environment). Interactive computer systems like personal computers use dynamic scheduling. The scheduling algorithm is carried out as a (hopefully short) system process in-between the other processes.


Slide 257 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Scheduling Algorithms

• First Come – First Served

• Shortest Job First

• Priority Scheduling

• Round Robin

• Multilevel Queueing

Scheduling

These algorithms typically are dynamic scheduling algorithms.

Slide 258 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

First Come - First Served
Scheduling

The process that entered the ready queue first will be the first one scheduled. The ready queue is a FIFO queue. Cooperative scheduling (no preemption).

Process    Burst time
P1         24 ms
P2          3 ms
P3          3 ms

Let the processes arrive in the order P1, P2, P3. The Gantt chart for the schedule is:

| P1 (0–24) | P2 (24–27) | P3 (27–30) |    t [ms]

Waiting time for P1 = 0 ms, for P2 = 24 ms, for P3 = 27 ms.

Average waiting time: (0 ms + 24 ms + 27 ms) / 3 = 17 ms.

Slide 259 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

First Come - First Served
Scheduling

Let the processes now arrive in the order P2, P3, P1. The Gantt chart for the schedule is:

| P2 (0–3) | P3 (3–6) | P1 (6–30) |    t [ms]

Waiting time for P1 = 6 ms, for P2 = 0 ms, for P3 = 3 ms.

Average waiting time: (6 ms + 0 ms + 3 ms) / 3 = 3 ms.

• Much better average waiting time than in the previous case.
• With FCFS the waiting time generally is not minimal.
• No preemption.
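The two averages are easy to check mechanically: under FCFS each process waits exactly the sum of the bursts queued in front of it. A small helper of our own (not from the slides):

```c
/* average waiting time under First Come - First Served,
   all processes arriving at t = 0 in the given order */
double fcfs_avg_wait(const int burst[], int n) {
    int t = 0;
    long total_wait = 0;
    for (int i = 0; i < n; i++) {
        total_wait += t;   /* process i has waited until now */
        t += burst[i];     /* then it occupies the CPU for its burst */
    }
    return (double)total_wait / n;
}
```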

Slide 260 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First (SJF)
Scheduling

Associate with each process the length of its next CPU burst. Use these lengths to schedule the process with the shortest time.

Two schemes:

• Non-preemptive SJF
Once the CPU is given to the process, it cannot be preempted until the CPU burst is completed.

• Preemptive SJF
When a new process arrives with a CPU burst length less than the remaining burst time of the current process, the CPU is given to the new process. This scheme is known as Shortest Remaining Time First (SRTF).

With respect to the waiting time, SJF is provably optimal: it gives the minimum average waiting time for a given set of processes. Processes with long bursts may suffer from starvation.


Slide 261 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First
Scheduling

Process    Arrival time    Burst time
P1          0 ms            7 ms
P2          2 ms            4 ms
P3          4 ms            1 ms
P4          5 ms            4 ms

For non-preemptive scheduling the Gantt chart is:

| P1 (0–7) | P3 (7–8) | P2 (8–12) | P4 (12–16) |    t [ms]

Waiting time for P1 = 0 ms, for P2 = 6 ms, for P3 = 3 ms, for P4 = 7 ms.

Average waiting time: (0 ms + 6 ms + 3 ms + 7 ms) / 4 = 4 ms.

Slide 262 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First
Scheduling

Process    Arrival time    Burst time
P1          0 ms            7 ms
P2          2 ms            4 ms
P3          4 ms            1 ms
P4          5 ms            4 ms

For preemptive scheduling (SRTF) the Gantt chart is:

| P1 (0–2) | P2 (2–4) | P3 (4–5) | P2 (5–7) | P4 (7–11) | P1 (11–16) |    t [ms]

Waiting time for P1 = 9 ms, for P2 = 1 ms, for P3 = 0 ms, for P4 = 2 ms.

Average waiting time: (9 ms + 1 ms + 0 ms + 2 ms) / 4 = 3 ms.
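SRTF schedules are easy to get wrong by hand; a millisecond-step simulation reproduces the waiting times of the example. This is a sketch of our own (not from the slides), with wait computed as turnaround minus burst:

```c
/* millisecond-step simulation of Shortest Remaining Time First
   for up to 16 processes; wait[i] = completion - arrival - burst */
void srtf_wait(const int arrive[], const int burst[], int n, int wait[]) {
    int rem[16], done = 0, t = 0;
    for (int i = 0; i < n; i++) rem[i] = burst[i];
    while (done < n) {
        int best = -1;
        for (int i = 0; i < n; i++)        /* pick shortest remaining time */
            if (arrive[i] <= t && rem[i] > 0 &&
                (best < 0 || rem[i] < rem[best]))
                best = i;
        if (best < 0) { t++; continue; }   /* CPU idle until next arrival */
        rem[best]--;                       /* run the chosen process 1 ms */
        t++;
        if (rem[best] == 0) {              /* process completed at time t */
            wait[best] = t - arrive[best] - burst[best];
            done++;
        }
    }
}
```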

Slide 263 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First
Scheduling

Predicting the CPU burst time

The next CPU burst is predicted as the exponential average of the measured lengths of previous bursts:

τn+1 = α · tn + (1 − α) · τn

tn = actual length of the nth burst
τn+1 = predicted length of the next burst
α (0 ≤ α ≤ 1) controls the relative contributions of the recent and the past history

Slide 264 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First
Scheduling

Figure from [Sil00 p.144]

Exponential average for α = ½ and τ0 = 10


Slide 265 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shortest Job First
Scheduling
Exponential average for α = ½ and τ0 = 10

τ1 = ½ · 6 + ½ · 10 = 8

τ2 = ½ · 4 + ½ · 8 = 6

τ3 = ½ · 6 + ½ · 6 = 6

τ4 = ½ · 4 + ½ · 6 = 5

τ5 = ½ · 13 + ½ · 5 = 9

τ6 = ½ · 13 + ½ · 9 = 11

τ7 = ½ · 13 + ½ · 11 = 12

τn+1 = α · tn + (1 − α) · τn
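The prediction formula is a one-liner in code; the test below reproduces exactly the τ values of the worked example (τ0 = 10, α = ½, bursts 6, 4, 6, 4, 13, 13, 13). The helper name is ours:

```c
/* exponential average: tau_{n+1} = alpha * t_n + (1 - alpha) * tau_n */
double next_tau(double alpha, double t_n, double tau_n) {
    return alpha * t_n + (1.0 - alpha) * tau_n;
}
```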

Slide 266 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Priority Scheduling
Scheduling

Each process is assigned a priority. The process with the highest priority is allocated the CPU.

Two schemes:

• Non-preemptive
• Preemptive
When a new process arrives with a priority higher than that of the running process, the CPU is given to the new process.

• SJF scheduling is a special case of priority scheduling in which the ‚priority' is the inverse of the CPU burst length.

• Solution to starvation problem: The priority of a process increases as the waiting time increases (aging technique).

Slide 267 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Priority Scheduling
Scheduling

Assume low numbers representing high priorities.

Process    Burst time    Priority
P1         10 ms         3
P2          1 ms         1
P3          2 ms         4
P4          1 ms         5
P5          5 ms         2

All processes arrive at time 0. For non-preemptive scheduling the Gantt chart is:

| P2 (0–1) | P5 (1–6) | P1 (6–16) | P3 (16–18) | P4 (18–19) |    t [ms]
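With all processes ready at t = 0, non-preemptive priority scheduling is a sort by priority followed by an FCFS pass. A small helper of our own (not from the slides) that reproduces the waiting times of this example:

```c
/* non-preemptive priority scheduling, all processes ready at t = 0;
   a low priority number means a high priority */
void prio_wait(const int burst[], const int prio[], int n, int wait[]) {
    int done[16] = {0}, t = 0;
    for (int k = 0; k < n; k++) {
        int best = -1;
        for (int i = 0; i < n; i++)   /* pick highest-priority remaining process */
            if (!done[i] && (best < 0 || prio[i] < prio[best]))
                best = i;
        wait[best] = t;               /* it waited until now */
        t += burst[best];             /* and then runs to completion */
        done[best] = 1;
    }
}
```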

Slide 268 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Priority Scheduling
Scheduling

Processes sorted by priority:

Process    Arrival time    Burst time    Priority
P1          0 ms           10 ms         3
P2          2 ms            1 ms         1
P3          2 ms            2 ms         4
P4          6 ms            1 ms         5
P5         12 ms            5 ms         2

Here: preemptive scheduling.

[Timing diagram over 0–20 ms showing for each process when it is running and when it is ready.]


Slide 269 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Each process gets a small unit of CPU time (time quantum), usually 10–100 milliseconds. After the quantum has elapsed, the process is preempted and added to the end of the ready queue.

Burst ≤ quantum
When the current CPU burst is smaller than the time quantum, the process itself will release the CPU (changing state into waiting).

Burst > quantum
The process is interrupted and another process is dispatched.

If the time quantum is very large compared to the processes‘ burst times, the scheduling policy is the same as FCFS.

If the time quantum is very small, the round robin policy turns into processor sharing (seems as if each process has its own processor).

Slide 270 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Process    Burst time
P1         53 ms
P2         17 ms
P3         68 ms
P4         24 ms

Suppose a time quantum of 20 ms. The Gantt chart for the schedule is:

| P1 (0–20) | P2 (20–37) | P3 (37–57) | P4 (57–77) | P1 (77–97) | P3 (97–117) | P4 (117–121) | P1 (121–134) | P3 (134–154) | P3 (154–162) |    t [ms]

Waiting time for P1 = 0 + 57 + 24 = 81 ms, for P2 = 20 ms, for P3 = 37 + 40 + 17 = 94 ms, for P4 = 57 + 40 = 97 ms.

Average waiting time: (81 + 20 + 94 + 97) / 4 = 73 ms.
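The per-round bookkeeping can be automated with a small FIFO simulation (our own sketch, not from the slides); it reproduces the waiting times 81, 20, 94 and 97 ms of the example:

```c
/* round robin with all processes arriving at t = 0;
   wait[i] collects the total time process i spends in the ready queue */
void rr_wait(const int burst[], int n, int quantum, int wait[]) {
    enum { MAXQ = 64 };                   /* generous bound on requeue events */
    int rem[16], ready_since[16], queue[MAXQ], head = 0, tail = 0, t = 0;
    for (int i = 0; i < n; i++) {
        rem[i] = burst[i];
        wait[i] = 0;
        ready_since[i] = 0;
        queue[tail++] = i;                /* initial FIFO order */
    }
    while (head < tail) {
        int p = queue[head++];
        wait[p] += t - ready_since[p];    /* time spent waiting this round */
        int run = rem[p] < quantum ? rem[p] : quantum;
        t += run;
        rem[p] -= run;
        if (rem[p] > 0) {                 /* quantum expired: back of the queue */
            ready_since[p] = t;
            queue[tail++] = p;
        }
    }
}
```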

Slide 271 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Round Robin typically has higher average turnaround times than SJF, but better response times.

Context switch and performance
The smaller the time quantum, the more the context switches affect performance. Below, a process with a 10 ms burst is shown with time quanta of 12, 6 and 1 ms.

Figure from [Sil00 p.148]

Context switches cause overhead

Slide 272 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Turnaround time depends on time quantum

Figure from [Sil00 p.149]

All processes arrive at the same time. Ready queue order: P1, P2, P3, P4.

[Figure: turnaround time as a function of the time quantum, with the processes' burst times given.]


Slide 273 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Average turnaround time for time quantum = 1 ms

[Gantt chart: P1–P4 interleaved in 1 ms slices over 0–20 ms]

Turnaround (P1) = 15 ms

Turnaround (P2) = 9 ms

Turnaround (P3) = 3 ms

Turnaround (P4) = 17 ms

Average turnaround: (15 + 9 + 3 + 17) ms / 4 = 11 ms

Slide 274 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Average turnaround time for time quantum = 2 ms

[Gantt chart: P1–P4 interleaved in 2 ms slices over 0–20 ms]

Turnaround (P1) = 14 ms

Turnaround (P2) = 10 ms

Turnaround (P3) = 5 ms

Turnaround (P4) = 17 ms

Average turnaround: (14 + 10 + 5 + 17) ms / 4 = 11.5 ms

Slide 275 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Round Robin
Scheduling

Average turnaround time for time quantum = 6 ms

Turnaround (P1) = 6 ms

Turnaround (P2) = 9 ms

Turnaround (P3) = 10 ms

Turnaround (P4) = 17 ms

Average turnaround: (6 + 9 + 10 + 17) ms / 4 = 10.5 ms

[Gantt chart: schedule with time quantum = 6 ms over 0–20 ms]

Side note: the policy now is like FCFS

Slide 276 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multilevel Queue
Scheduling

The ready queue is partitioned into separate queues. Each queue has its own CPU scheduling algorithm. There is also scheduling between the queues (inter-queue scheduling).

Figure from [Sil00 p.150]

Inter-queue scheduling: fixed priority or time slicing


Slide 277 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Process Management

Processes (153)

Threads (178)

Interprocess Communication (IPC) (195)

Scheduling (247)

Real-Time Scheduling (278)

Deadlocks (318)

Slide 278 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling
Scheduling

[Timing diagram: the technical process issues requests to the RT system with distance Tdist. For each request at time r, the corresponding process waits (ready) for Tw, is context-switched in (Tcs), starts at s, executes (inclusive output) for ∆e and completes at c. TR and TRmax are marked on the time axis; d is the deadline.]

Real-time condition: TR ≤ TRmax, otherwise real-time violation.

Slide 279 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling
Scheduling

A technical process generates events (periodically or not). A real-time computing system is requested to respond to the events. The response must be delivered within the period TRmax.

The technical system requests computation by raising an interrupt at time r at the real-time system. The time from the occurrence of the request (interrupt) until the context switch of the corresponding computer process is the waiting time Tw. Switching the context takes the time TCS. The point in time at which execution starts is the start time s. The execution time ∆e is the net CPU time needed for execution (even if the process is interrupted). The process finishes at completion time c.

Slide 280 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling
Scheduling

The reaction time (also called response time) TR is the time interval between the request (the interrupt) and the end of the process: TR = Tw + TCS + ∆e. This is the time interval the technical system has to wait until the response is received.

Starting from the request, the maximum response time TRmax defines the deadline d (a point in time) at which the real-time system must have responded.

Note: For all following considerations, the context switch time TCS is neglected, that is, we assume TCS = 0 µs, in accordance with D. Zöbel, W. Albrecht: Echtzeitsysteme, page 24, ISBN 3-8266-0150-5.

A hard real-time system must not violate the real-time conditions.


Slide 281 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Violation

Two technical processes TP1 and TP2 on some machine require response from a real-time system. The corresponding computer processes are P1 and P2. The technical processes generate events as follows:

TP2

TP1

0 5 10

0 5 10

t [ms]

t [ms]a b c d

a b cTRmax2

The execution time of P1 is 1ms, the execution time of P2 is 4 ms, and the scheduling algorithm is preemptive priority scheduling. The context switch time is considered negligible (0 µs).

RT-SchedulingExample RT.1

Response must be given at the latest just before the next event (thus within Tdist).


Slide 282 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Violation

Case 1: P1 low priority, P2 high priority.

Process P1 (serving TP1): priority LOW, execution time 1 ms, TRmax1 = 4 ms.
Process P2 (serving TP2): priority HIGH, execution time 4 ms, TRmax2 = 6 ms.

[Timing diagram: on each event the machine runs P2 first (4 ms) and P1 only afterwards, so P1's response time TR exceeds TRmax1.]

Real-time violation, response to TP1 is too late!

Slide 283 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Violation

Case 2: P1 high priority, P2 low priority.

Process P1 (serving TP1): priority HIGH, execution time 1 ms, TRmax1 = 4 ms.
Process P2 (serving TP2): priority LOW, execution time 4 ms, TRmax2 = 6 ms.

[Timing diagram: on each event the machine runs P1 first (1 ms) and P2 afterwards; both response times stay within TRmax1 and TRmax2.]

No real-time violation. Fine!

Slide 284 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling

Theorem

For a system with n processors (n ≥ 2) there is no optimal scheduling algorithm for a set of processes P1 ... Pm unless all starting times s1, ..., sm, all execution times ∆e1, ..., ∆em, and all completion times c1, ..., cm are known (deterministic systems).

An algorithm is optimal when it finds an effective solution if such exists.

Often, technical processes (or natural processes) are non-deterministic, at least in part.


Slide 285 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

Find a schedule by searching all combinations of processes.

Of each process (non-preemptive!) the following must be known in advance:
• the request time (interrupt arrival time) r — known in case of periodical technical processes
• the response time TR — known from analysis or worst-case measurements
• the deadline d — given by the technical system

Example:

Process  request time ri  deadline di  execution time ∆e
P1       0 ms             30 ms        20 ms
P2       0 ms             90 ms        50 ms
P3       0 ms             100 ms       30 ms

Slide 286 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

Search tree for the example (root Ø, one level per scheduling position):

Ø
  P1 -> P1,P2 -> P1,P2,P3
        P1,P3 -> P1,P3,P2
  P2 -> P2,P1 -> P2,P1,P3
        P2,P3 -> P2,P3,P1
  P3 -> P3,P1 -> P3,P1,P2
        P3,P2 -> P3,P2,P1

For n processes: tree depth (number of levels) = n, number of combinations = n!

Slide 287 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

Sequence P1, P2, P3: [Gantt chart: P1 runs 0–20, P2 runs 20–70, P3 runs 70–100; the deadlines d1 = 30, d2 = 90, d3 = 100 are all met.]

Sequence P1, P3, P2: [Gantt chart: P1 runs 0–20, P3 runs 20–50, P2 runs 50–100 and misses d2 = 90.] Real-time violation.

Slide 288 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

Sequence P2, P1, P3: [Gantt chart: P2 runs 0–50, P1 runs 50–70 and misses d1 = 30, P3 runs 70–100.] Real-time violation.

Sequence P2, P3, P1: [Gantt chart: P2 runs 0–50, P3 runs 50–80, P1 runs 80–100 and misses d1 = 30.] Real-time violation.


Slide 289 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

Sequence P3, P1, P2: [Gantt chart: P3 runs 0–30, P1 runs 30–50 and misses d1 = 30, P2 runs 50–100 and misses d2 = 90.] Real-time violation.

Sequence P3, P2, P1: [Gantt chart: P3 runs 0–30, P2 runs 30–80, P1 runs 80–100 and misses d1 = 30.] Real-time violation.

Slide 290 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

[Search tree for the example, as on slide 286; all sequences except P1, P2, P3 lead to real-time violations.]

The only solution: P1 must be first, P2 must be second.

Slide 291 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Branch-and-Bound Scheduling

[Search tree for the example, as on slide 286.]

For small n one may directly investigate the n! combinations at the leaves. For larger n it is recommended to start from the root and investigate the nodes level by level. When a node violates the real-time condition, the corresponding subtree can be disregarded.
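The pruned search described above can be sketched in a few lines of Python (an illustrative sketch, not lecture code); the process data are taken from the example on slide 285, and the early return plays the role of discarding a subtree as soon as a node violates the real-time condition:

```python
from itertools import permutations

# Example from slide 285: all requests arrive at t = 0, non-preemptive.
# name: (execution time, deadline) in ms
procs = {"P1": (20, 30), "P2": (50, 90), "P3": (30, 100)}

def feasible(order):
    """Run the sequence back to back and check every deadline."""
    t = 0
    for name in order:
        e, d = procs[name]
        t += e          # the process runs to completion
        if t > d:       # deadline missed: prune this branch
            return False
    return True

solutions = [order for order in permutations(procs) if feasible(order)]
print(solutions)  # [('P1', 'P2', 'P3')]
```

As on slide 290, only the sequence P1, P2, P3 survives.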

Slide 292 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadline Scheduling

Priority scheduling: the process with the closest deadline has highest priority. When processes have the same deadline, selection is done arbitrarily or according to FCFS.

The deadline scheduling algorithm is also known as earliest deadline first (EDF). The algorithm is optimal for the one-processor case. If there is a solution, it is found. If none is found, then there is no solution.

• Non-preemptive: The algorithm is carried out after a running process finishes. Intermediate requests are saved (interrupt flip-flops) meanwhile.
• Preemptive: The algorithm is carried out when a request arrives (interrupt routine) or after a process finishes.


Slide 293 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadline Scheduling

Example RT.2: Non-preemptive scheduling

Process  request time ri  deadline di  execution time ∆e
P1       0 ms             5 ms         4 ms
P2       0 ms             7 ms         1 ms
P3       0 ms             7 ms         2 ms
P4       0 ms             13 ms        5 ms

[Gantt chart: P1 runs 0–4, P2 runs 4–5, P3 runs 5–7, P4 runs 7–12; the deadlines d1 = 5, d2 = d3 = 7, d4 = 13 are all met.]

The deadline of P2 and P3 is the same, so the choice is arbitrary. Could be sequence P3, P2 as well.

Slide 294 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadline Scheduling

Example RT.3: Preemptive scheduling

Process  request time ri  deadline di  execution time ∆e
P1       0 ms             4 ms         2 ms
P2       3 ms             14 ms        3 ms
P3       6 ms             12 ms        3 ms
P4       5 ms             10 ms        4 ms

[Gantt chart: P1 runs 0–2; P2 runs 3–5 and is preempted by P4 (5–9); P3 runs 9–12; P2 finishes 12–13. Deadlines d1, d4, d3, d2 are all met.]

Remember, context switch time is neglected.

Slide 295 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadline Scheduling

Continuation of example RT.3

t = 0 ms: Request for P1 arrives. Since there is no other process, P1 is scheduled.

t = 2 ms: P1 finishes. Since there are no requests, the scheduler has nothing to do.

t = 3 ms: Request for P2 arrives. Since there is no other process, P2 is scheduled.

t = 5 ms: Request for P4 arrives. The deadline d4 is closer than the deadline of the running process P2. P4 has higher priority and is scheduled.

t = 6 ms: Request for P3 arrives. Deadline d3 is more distant than any other, so nothing changes. P4 continues.

t = 9 ms: P4 finishes. The closest deadline now is d3, so P3 is scheduled.

t = 12 ms: P3 finishes. The closest deadline now is d2, so P2 is scheduled again.

t = 13 ms: P2 finishes. There are no processes ready. Nothing to schedule.
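The walk-through above can be reproduced with a small 1 ms-step simulation of preemptive deadline scheduling (a sketch under the slides' assumption TCS = 0, not lecture code):

```python
# Preemptive EDF simulation of example RT.3.
# name: (request time r, execution time delta_e, deadline d) in ms
procs = {"P1": (0, 2, 4), "P2": (3, 3, 14), "P3": (6, 3, 12), "P4": (5, 4, 10)}

remaining = {name: e for name, (r, e, d) in procs.items()}
finish = {}
t = 0
while len(finish) < len(procs):
    ready = [n for n, (r, e, d) in procs.items() if r <= t and n not in finish]
    if not ready:
        t += 1                                   # CPU idle (here: 2 ms .. 3 ms)
        continue
    run = min(ready, key=lambda n: procs[n][2])  # earliest deadline first
    remaining[run] -= 1                          # run for 1 ms, TCS = 0
    t += 1
    if remaining[run] == 0:
        finish[run] = t

print(finish)  # {'P1': 2, 'P4': 9, 'P3': 12, 'P2': 13}
```

The completion times match the narrative: P1 at 2 ms, P4 at 9 ms, P3 at 12 ms, P2 at 13 ms; every deadline is met.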

Slide 296 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadline Scheduling

For multi-processor systems, the algorithm is not optimal.

Example RT.4: Three processes and two processors. Non-preemptive scheduling.

Process  request time ri  deadline di  execution time ∆e
P1       0 ms             10 ms        8 ms
P2       0 ms             9 ms         5 ms
P3       0 ms             9 ms         4 ms

[Gantt chart: P2 and P3, having the closest deadlines, are scheduled first on the two processors; P1 can only start after one of them finishes and overruns its deadline d1 = 10.]

Real-time violation for P1.


Slide 297 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling

„Schedulability Test“

When there are n processes that are
• periodic,
• independent of each other,
• preemptable,
and the response is to be delivered latest at the end of each period (that is, TRmax = Tdist),

then the processes can be scheduled on a single processor without real-time violation if

    ∑ (i = 1 … n)  ∆ei / Tdist,i  ≤  1

[Diagram: within each period Tdist the response must arrive by TRmax = Tdist.]
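The test is a one-line utilization check; the sketch below (not lecture code) applies it to the task sets of examples RT.5 and RT.6 from the following slides:

```python
def schedulable(tasks):
    """Schedulability test: sum of delta_e_i / T_dist_i must not exceed 1."""
    return sum(e / t_dist for e, t_dist in tasks) <= 1

# Example RT.5: (execution time, period) in ms; utilization ~ 0.93 -> schedulable
print(schedulable([(15, 30), (25, 70), (15, 200)]))  # True
# Example RT.6: utilization ~ 1.13 -> overutilization, not schedulable
print(schedulable([(2, 4), (3, 14), (5, 12)]))       # False
```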

Slide 298 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling

Example RT.5:

Process  execution time ∆e  deadline di
P1       15 ms              k · 30 ms
P2       25 ms              k · 70 ms
P3       15 ms              k · 200 ms

∑ ∆ei / Tdist,i = 15/30 + 25/70 + 15/200 = 0.5 + 0.36 + 0.075 = 0.935 ≤ 1

The processes can be scheduled. Deadline scheduling would yield:

[Gantt chart over 0–200 ms: P1 runs at the start of each 30 ms period; P2 and P3 fill the gaps, executing in several segments.]

Slide 299 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling

Continuation of example RT.5

t = 0 ms: Requests for P1, P2, P3 arrive. P1 has the closest deadline and is scheduled.

t = 15 ms: P1 finishes. The deadline of P2 is closer than the deadline of P3. P2 is scheduled.

t = 30 ms: Request for P1 arrives. Reevaluation of the deadlines yields that P1 has highest priority. P1 is scheduled.

t = 45 ms: P1 finishes. The deadline of P2 still is closer than the deadline of P3. P2 is scheduled.

t = 55 ms: P2 finishes. The only waiting process is P3. P3 thus is scheduled.

t = 60 ms: Request for P1 arrives. Reevaluation of the deadlines yields that P1 has highest priority. P1 is scheduled.

t = 70 ms: Request for P2 arrives. Deadline of P1 is closest, P1 continues. ...

Slide 300 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling

Example RT.6:

Process  execution time ∆e  deadlines di
P1       2 ms               k · 4 ms
P2       3 ms               k · 14 ms
P3       5 ms               k · 12 ms

∑ ∆ei / Tdist,i = 2/4 + 3/14 + 5/12 = 0.5 + 0.215 + 0.42 = 1.135 > 1

This means an overutilization of the microprocessor. The processor would have to execute more than one process at a time (which is impossible). Therefore there is no schedule that would not violate the real-time condition sooner or later (on a single-processor system). The schedulability test failed.


Slide 301 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling

Priority scheduling: the process with the least laxity has highest priority. For equal laxities the selection policy is arbitrary or FCFS.

Laxity: ∆lax = (d − now) − ∆e

The laxity is the period of time left in which a process can be started without violating its deadline. At the latest when the laxity is 0 the process must be started, otherwise it will not finish in time. The execution time ∆e of the process must be known, of course.

[Diagram: from now, the laxity ∆lax precedes the execution time ∆e, which ends at the deadline d.]

now is the point in time at which the laxity is (re)calculated. Usually this is the point in time at which a new request arrives (preemptive scheduling) or at which a process finishes.
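The definition translates directly into a small helper (a sketch, not lecture code; the sample values are the t = 0 laxities of example RT.8 on a later slide):

```python
def laxity(d, now, delta_e):
    """delta_lax = (d - now) - delta_e: time left before the process
    must be started so that it can still meet its deadline."""
    return (d - now) - delta_e

# Laxities of P1..P4 from example RT.8 at now = 0 ms, as (deadline, exec time):
print([laxity(d, 0, e) for d, e in [(1, 1), (6, 5), (5, 3), (8, 5)]])
# [0, 1, 2, 3]
```

A negative result means the deadline can no longer be met, as happens to P4 in example RT.8.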

Slide 302 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling

Example RT.7: Three processes and two processors. Non-preemptive scheduling. Same processes as in example RT.4.

Process  request time ri  deadline di  execution time ∆e
P1       0 ms             10 ms        8 ms
P2       0 ms             9 ms         5 ms
P3       0 ms             9 ms         4 ms

Deadline scheduling focuses on the deadline, but does not take into account the execution time ∆e of a process. Laxity scheduling does; it sometimes finds a solution that deadline scheduling does not find.

The processes now undergo laxity scheduling (see next slide).

Slide 303 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling

Continuation of example RT.7

t = 0 ms: Requests for P1, P2, P3 arrive. The laxities are: ∆lax1 = 2 ms, ∆lax2 = 4 ms, ∆lax3 = 5 ms. Least laxity is ∆lax1, so P1 is scheduled on processor 1. Processor 2 is not yet assigned, so P2 is chosen (∆lax2 < ∆lax3).

t = 5 ms: P2 finishes. The only process waiting is P3, so it is scheduled.

t = 8 ms: P1 finishes. No new processes to schedule.

[Gantt chart: Processor 1 runs P1 0–8; Processor 2 runs P2 0–5 and P3 5–9.]

No real-time violation, as opposed to the deadline scheduling example RT.4.

Slide 304 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling

Example RT.8: Four processes and two processors. Non-preemptive scheduling.

Process  request time ri  deadline di  execution time ∆e
P1       0 ms             1 ms         1 ms
P2       0 ms             6 ms         5 ms
P3       0 ms             5 ms         3 ms
P4       0 ms             8 ms         5 ms

Laxity scheduling, like deadline scheduling, is generally not optimal for multi-processors. That is, it does not always find a solution.

Continuation on next slide.


Slide 305 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling

Continuation of example RT.8

t = 0 ms: Requests for P1, P2, P3, P4 arrive. The laxities are: ∆lax1 = 0 ms, ∆lax2 = 1 ms, ∆lax3 = 2 ms, ∆lax4 = 3 ms. Least laxity is ∆lax1, so P1 is scheduled on processor 1. Second least laxity is ∆lax2, so P2 is chosen for processor 2.

t = 1 ms: P1 finishes. Least laxity is ∆lax3 (now 1 ms), so P3 is scheduled on processor 1.

t = 4 ms: P3 finishes. Least laxity is ∆lax4 (now −1 ms), so P4 is scheduled on processor 1 ... but it is already too late (negative laxity).

[Gantt chart: Processor 1 runs P1 0–1, P3 1–4, P4 4–9; Processor 2 runs P2 0–5. P4 misses its deadline d4 = 8.]

Real-time violation for P4.

Slide 306 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Laxity Scheduling

Continuation of example RT.8

However, there exists a schedule that works well:

[Gantt chart of a non-violating schedule found through deadline scheduling: Processor 1 runs P1 0–1 and P2 1–6; Processor 2 runs P3 0–3 and P4 3–8, meeting d4 = 8.]

Scheduling non-preemptive processes in a multi-processor system is a complex problem. This is even the case in a two-processor system when all request times ri are the same and all deadlines di are the same.

Slide 307 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Priority scheduling for periodical preemptive processes where the deadlines are equal to the periods. The process with the highest frequency (repetition rate) has highest priority. Static scheduling.

[Diagram: technical process 1 with a long period Tdist; technical process 2 with a shorter period Tdist.]

Computer process P2 has higher priority than process P1 since its rate is higher.

Although the algorithm is not optimal, it is often used in real-time applications because it is fast and simple (at run time!). Note: static scheduling!

Slide 308 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

The classic static real-time scheduling algorithm for preemptable, periodic processes is RMS (Rate Monotonic Scheduling). It can be used for processes that meet the following conditions:

• Each periodic process must complete within its period.
• No process is dependent on any other process.
• Each process needs the same amount of CPU time on each burst.
• Any non-periodic processes have no deadlines.
• Process preemption occurs instantaneously and with no overhead.

RMS works by assigning each process a fixed priority equal to the frequency of occurrence of its triggering event. For example, a process that must run every 30 ms (≈ 33 Hz) receives priority 33, and a process that must run every 40 ms (= 25 Hz) receives priority 25. The priorities are linear with the rate, which is why it is called rate monotonic.

A more thorough explanation can be found in [Ta01 p.472].
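The fixed-priority assignment described above can be sketched like this (illustrative, not lecture code):

```python
# RMS: static priority = event rate in Hz (higher rate -> higher priority).
def rms_priority(period_ms):
    return 1000 / period_ms

periods = {"A": 30, "B": 40, "C": 50}   # periods in ms (as in example RT.9)
prio = {name: round(rms_priority(p)) for name, p in periods.items()}
print(prio)  # {'A': 33, 'B': 25, 'C': 20}
```

Because the priorities are fixed, the scheduler only ever has to pick the ready process with the largest stored number.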


Slide 309 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Example RT.9: Three periodic processes [Ta01 p.471]

Process  request time ri  deadline di     execution time ∆e
A        k · 30 ms        (k+1) · 30 ms   10 ms
B        k · 40 ms        (k+1) · 40 ms   15 ms
C        k · 50 ms        (k+1) · 50 ms   5 ms

[Figure from [Ta01 p.471]]

Slide 310 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Continuation of example RT.9

The processes A, B, C scheduled with
• Rate Monotonic Scheduling (RMS),
• Deadline scheduling (EDF).

[Figure from [Ta01 p.473]]

Slide 311 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Continuation of example RT.9

Up to t = 90 the choices of EDF and RMS are the same. At t = 90 process A is requested again. The RMS scheduler votes for A (process A4 in the figure) since its priority is higher than the priority of B; thus B is interrupted. The deadline scheduler, in contrast, has a choice because the deadline of A is the same as the deadline of B (dA = dB = 120). In practice, preempting B has some nonzero cost associated with it, therefore it is better to let B continue.

See the next example (example RT.10) to dispel the idea that RMS and EDF would always give the same results.

Slide 312 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Example RT.10: Like RT.9, but process A now has 15 ms execution time.

Process  request time ri  deadline di     execution time ∆e
A        k · 30 ms        (k+1) · 30 ms   15 ms
B        k · 40 ms        (k+1) · 40 ms   15 ms
C        k · 50 ms        (k+1) · 50 ms   5 ms

∑ ∆ei / Tdist,i = 15/30 + 15/40 + 5/50 = 0.5 + 0.375 + 0.1 = 0.975 ≤ 1

The schedulability test yields that the processes are schedulable. Nevertheless, RMS fails in this example while EDF does not. →


Slide 313 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Continuation of example RT.10

RMS leads to a real-time violation. Process C is missing its deadline dC = 50.

Figure from [Ta01 p.474]

Slide 314 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

Why did RMS fail?

Using static priorities only works if the CPU utilization is not too high. It was proved* that RMS is guaranteed to work for any system of periodic processes if

    ∑ (i = 1 … n)  ∆ei / Tdist,i  ≤  n · (2^(1/n) − 1)

For n = 2 processes, RMS will work for sure if the CPU utilization is below 0.828.
For n = 3 processes, RMS will work for sure if the CPU utilization is below 0.780.
For n → ∞ processes, RMS will work for sure if the CPU utilization is below ln 2 (≈ 0.693).

* C.L. Liu, James Layland: Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment, Journal of the ACM, 1973, http://citeseer.ist.psu.edu/liu73scheduling.html
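The bound is easy to evaluate numerically (a sketch, not lecture code):

```python
import math

def rms_bound(n):
    """Liu & Layland utilization bound for RMS: n * (2^(1/n) - 1)."""
    return n * (2 ** (1 / n) - 1)

print(round(rms_bound(2), 3))  # 0.828
print(round(rms_bound(3), 2))  # 0.78
print(round(math.log(2), 3))   # limit for n -> infinity: 0.693
```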

Slide 315 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Rate Monotonic Scheduling

In example RT.9 the utilization was 0.808 (thus higher than 0.780). Why did it work?

We were just lucky. With different periods and execution times, a utilization of 0.808 might fail. In example RT.10 the utilization was so high that there was little hope RMS could work.

In contrast to RMS, deadline scheduling always works for any schedulable set of processes (single-processor system). Deadline scheduling can achieve 100% CPU utilization. The price paid is a more complex algorithm [Ta01 p.475]. Because RMS is static, all priorities are known at run time; selecting the next process is a matter of just a few machine instructions.

Slide 316 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Real-Time Scheduling

Overview of real-time scheduling algorithms:

• Branch and Bound („Planen durch Suchen“): Try all permutations of processes. Preferably used in static scheduling.
• Deadline (EDF) („Planen nach Fristen“): Earliest deadline has highest priority. Execution time is not taken into account. Preferably used in dynamic scheduling.
• Laxity („Planen nach Spielräumen“): Least laxity has highest priority. Execution time is taken into account. Preferably used in dynamic scheduling.
• RMS („Planen nach monotonen Raten“): Highest repetition rate (frequency) has highest priority. Execution time is not taken into account. Preferably used in static scheduling.


Slide 317 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Processes (153)

Threads (178)

Interprocess Communication (IPC) (195)

Scheduling (247)

Real-Time Scheduling (278)

Deadlocks (318)

Slide 318 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Consider two processes requiring exclusive access to some shared resources (e.g. file, tape drive, printer, CD writer).

Process 1:
{
    request(resource1);
    request(resource2);
    ...
    release(resource1);
    release(resource2);
}

Process 2:
{
    request(resource2);
    request(resource1);
    ...
    release(resource2);
    release(resource1);
}

request() is a fictitious system call for requesting exclusive access to a resource. When access cannot be granted, the call blocks until the resource is available.

Slide 319 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

When the two processes are executed sequentially (one after the other), no problem arises.

[Timeline: Process 1 executes its complete request/release sequence first; Process 2 runs afterwards.]

Slide 320 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

When process 1 has acquired the resources before process 2 starts trying the same, no problem arises. Process 2 just has to wait.

[Timeline: Process 1 holds both resources; Process 2 blocks on request(resource2) until Process 1 releases it.]


Slide 321 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

[Timeline: in parallel, Process 1 executes request(resource1) and Process 2 executes request(resource2); then each blocks requesting the other's resource.]

Occasionally, when both processes are carried out in parallel as depicted above, both their attempts to acquire the missing resource cause the processes to block. Since each process holds a resource that the other one needs, and since neither process can release its resource, both processes wait forever (deadlock).

Slide 322 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

A set of processes is deadlocked when each process in the set is waiting for an event that only another process in the set can cause.

Waiting for an event:
• Waiting for the availability of a resource
• Waiting for some input
• Waiting for a message (IPC) or a signal
• ... or any other type of event that a process is waiting for in order to continue

Slide 323 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Classical deadlock problem from the non-computer world:

[Figure: four cars at a crossing, each labeled "Yields to car at right".]

Every car ought to give way to the car on the right. None will proceed.

Figure from lecture slides „Computer Architecture" WS 05/06 (Basermann / Jungmaier)

Slide 324 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Resources
• Anything a process / thread needs to continue. Examples: I/O devices like printer, tape, CD-ROM, files, but also internal resources such as the process table, thread table, file allocation table, or semaphores / mutexes.
• Exclusive access: Only one process at a time can use the resource (e.g. printer, or writing to a shared file).
• Non-exclusive access: More than one process can use the resource at the same time (e.g. reading from a shared file).
• Preemptable resources: The resource can (with some non-zero cost) be temporarily taken away from a process and given to another process (e.g. memory swapping).
• Non-preemptable resources: The resource cannot be temporarily assigned to another process (e.g. printer, CD writer) without leading to garbage.


Slide 325 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

The following four conditions must be present for a deadlock to occur:

• Mutual Exclusion: Each resource is either currently assigned to exactly one process or is available.
• Hold and Wait: A process currently holding a resource can request new resources.
• Non-preemptable resources: Resources previously granted cannot be forcibly taken away from a process.
• Circular Wait: There must be a circular chain of processes, each of which is waiting for a resource held by another process in the chain.

If one of these conditions is absent, no deadlock is possible.

Slide 326 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlock Modeling

Resource allocation graphs:

a) Holding a resource (process A holds resource R).
b) Requesting a resource (process B requests resource S).
c) Deadlock situation: process D requests U, which is held by process C; process C requests T, which is held by D.

Figure from [Ta01 p.165]

Slide 327 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlock Modeling

Resource allocation order leading to a deadlock (processes A, B, C, over time).

Figure from [Ta01 p.166]

Slide 328 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlock Modeling

Example of a resource allocation order not resulting in a deadlock (steps (o), (p), (q), over time).

Figure from [Ta01 p.166]


Slide 329 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Strategies for dealing with deadlocks:

1. Ignore the problem. Sounds silly, but in fact many operating systems do exactly this, assuming that deadlocks occur rarely.
2. Detection & Recovery. The OS tries to detect deadlocks and then takes some recovery action.
3. Avoidance. Resources are granted in such a way that deadlocks cannot occur.
4. Prevention. Trying to break at least one of the four conditions such that no deadlock can happen.

Slide 330 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Strategy 1 (Ignoring the problem)

Most operating systems, including UNIX and Windows, just ignore the problem on the assumption that most users would prefer an occasional deadlock to a rule restricting all users to one process, one open file, and one of everything.

If deadlocks could be eliminated for free, there would not be much discussion. But the price is high. If deadlocks occur on average once a year, but system crashes owing to hardware failures and software errors occur once a week, nobody would be willing to pay a large penalty in performance or convenience to eliminate deadlocks (after [Ta01 p.167]).

For these reasons, the deadlock problem often is disregarded.

Slide 331 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Strategy 2 (Detection & Recovery)

The operating system tries to detect deadlocks and to recover.

Process A holds R and wants S

Process B holds nothing and wants T

Process C holds nothing and wants S

Process D holds U and wants S and T

Process E holds T and wants V

Process F holds W and wants S

Process G holds V and wants U.

Example DL.1 : Consider the following system state:

Is the system deadlocked, and if so, which processes are involved?
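One way to answer this is to derive the wait-for graph from the lists above and search it for a cycle (a sketch of the idea, not the lecture's graphical method, which follows on the next slide):

```python
# Hold/want lists of example DL.1.
holds = {"A": {"R"}, "D": {"U"}, "E": {"T"}, "F": {"W"}, "G": {"V"}}
wants = {"A": {"S"}, "B": {"T"}, "C": {"S"}, "D": {"S", "T"}, "E": {"V"},
         "F": {"S"}, "G": {"U"}}

# P waits for Q if P wants a resource that Q currently holds.
owner = {res: p for p, rs in holds.items() for res in rs}
waits_for = {p: {owner[r] for r in rs if r in owner} for p, rs in wants.items()}

def on_cycle(start):
    """DFS: is `start` reachable from itself via wait-for edges?"""
    seen, stack = set(), list(waits_for.get(start, ()))
    while stack:
        p = stack.pop()
        if p == start:
            return True
        if p not in seen:
            seen.add(p)
            stack.extend(waits_for.get(p, ()))
    return False

print(sorted(p for p in wants if on_cycle(p)))  # ['D', 'E', 'G']
```

Processes D, E and G wait for one another in a cycle and are therefore deadlocked.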

Slide 332 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Continuation of example DL.1 (deadlock detection)

Constructing the resource allocation graph (a): the extracted cycle (b) shows the processes and resources involved in a deadlock.

Figure from [Ta01 p.169]


Slide 333 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Deadlock detection with multiple instances of a resource type

We have (respectively we define):
• n processes P1 ... Pn and m resource classes.
• Ei = the number of existing resource instances of resource class i, 1 ≤ i ≤ m.
• E is the existing resource vector, E = (E1 ... Em).
• A is the available resource vector. Each Ai in A gives the number of currently available resource instances. A = (A1 ... Am).
• The relation X ≤ Y is defined to be true if each Xi ≤ Yi.

Slide 334 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Deadlock detection with multiple instances of a resource type

Definition of the current allocation matrix C (row i holds the resources currently allocated to Pi) and the request matrix R (row i holds the resources Pi still requests).

Figure from [Ta01 p.171]

Slide 335 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Deadlock detection with multiple instances of a resource type

Deadlock detection algorithm:

1. All processes are initially unmarked.
2. Look for an unmarked process Pi for which row Ri ≤ A. Here the algorithm is looking for a process that can be run to completion (the resource demands of the process can be satisfied immediately).
3. If such a Pi is found, add row Ci to A and mark Pi. Go to step 2. After Pi is (or would have) finished, its resources are given back to the pool. The process is marked (in the sense of 'successful completion').
4. If no such process exists, terminate.

All unmarked processes, if any, are deadlocked!
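The four steps map directly onto a few lines of Python (a sketch, not lecture code). The matrices below are an assumption reconstructed from the DL.2 walk-through two slides further on (resource classes: tape drives, plotters, scanners, CD-ROMs), since the original figure is not reproduced in this text:

```python
def detect_deadlock(C, R, A):
    """Steps 1-4: repeatedly mark a process whose request row Ri <= A,
    return its allocation Ci to the pool, and report the unmarked rest."""
    A = list(A)
    unmarked = set(range(len(C)))
    progress = True
    while progress:
        progress = False
        for i in sorted(unmarked):
            if all(r <= a for r, a in zip(R[i], A)):   # step 2: Ri <= A
                A = [a + c for a, c in zip(A, C[i])]   # step 3: add Ci to A
                unmarked.discard(i)                    # mark Pi
                progress = True
    return unmarked                                    # deadlocked processes

# Assumed DL.2 state (rows P1, P2, P3):
C = [[0, 0, 1, 0], [2, 0, 0, 1], [0, 1, 2, 0]]   # current allocation matrix
R = [[2, 0, 0, 1], [1, 0, 1, 0], [2, 1, 0, 0]]   # request matrix
print(detect_deadlock(C, R, [2, 1, 0, 0]))       # set(): no deadlock

# DL.3: P2 additionally holds a plotter, so A = (2 0 0 0):
C2 = [[0, 0, 1, 0], [2, 1, 0, 1], [0, 1, 2, 0]]
print(detect_deadlock(C2, R, [2, 0, 0, 0]))      # {0, 1, 2}: all deadlocked
```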

Slide 336 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Example DL.2 (deadlock detection algorithm):

Consider the following system state (figure from [Ta01 p.173]):

Is there (or will there be) a deadlock in the system?


Slide 337 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Continuation of example DL.2 (deadlock detection algorithm)

Checking P1: R1 is not ≤ A (CD-ROM is missing). P1 cannot run and is not marked.
Checking P2: R2 is not ≤ A (scanner is missing). P2 cannot run and is not marked.
Checking P3: R3 is ≤ A, thus P3 can run and is marked. Its resources are given back to the pool: A = (2 2 2 0).
Checking P1: R1 still is not ≤ A (CD-ROM still not available).
Checking P2: R2 now is ≤ A, thus P2 can run and is marked. Its resources are given back to the pool: A = (4 2 2 1).
Checking P1: R1 now is ≤ A. P1 can run and is marked. Its resources are given back to the pool: A = (4 2 3 1) = E.
No more unmarked processes: termination.

No deadlocks.

Slide 338 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2

Example DL.3 (deadlock detection algorithm):

Same as DL.2, but now C2 = (2 1 0 1) and thus A = (2 0 0 0).

Checking P1: R1 is not ≤ A (CD-ROM is missing). P1 cannot run and is not marked.
Checking P2: R2 is not ≤ A (scanner is missing). P2 cannot run and is not marked.
Checking P3: R3 is not ≤ A (plotter is missing). P3 cannot run and is not marked.
All processes checked. Nothing will change: termination.

The entire system is deadlocked!

Slide 339 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 2 (Detection & Recovery)

Recovery options:

• Resource Preemption: Forcibly taking away a resource from a process. May have ill side effects. Difficult or even impossible in many cases.
• Process Rollback: A process periodically writes its complete state to a file (checkpointing). In case of a deadlock, the process is rolled back to an earlier state in which it occupied fewer resources. Program(ming) overhead!
• Killing Processes: Crudest but simplest method. One or more processes from the chain are terminated and must be started all over again at some later point in time. May also cause ill effects; consider a process updating a database twice instead of once.

Slide 340 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks

Strategy 3 (Avoidance)

Do not allow system states that may result in a deadlock.

A state is said to be safe when it is not deadlocked and there exists some scheduling order in which every process can run to completion even if all of them request their maximum number of resources. An unsafe state may result in a deadlock, but does not have to.

[Table: for each process, the number of resource instances currently held (allocation) and the maximum number of resource instances needed (requests).]

Assume there is a total number of 10 instances available. Then the state is a safe state, since there is a way to run all processes.


Slide 341 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 3

a) Starting situation (question: is this a safe state?). There are 3 resources left in the pool.
b) B is granted 2 additional resources.
c) B has finished. Now 5 resources are free.
d) C is granted another 5 resources.
e) C has finished. Now 7 resources are free. Process A can be run without problems. Thus (a) is a safe state.

Figure from [Ta01 p.177]

Slide 342 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 3

a) Starting situation as before (this is a safe state).
b) A is granted one additional resource.
c) B is granted the remaining 2 resources.
d) B has finished. A and C cannot run because each of them needs 5 resources to complete. Deadlock.

Any other sequence starting from (b) also ends up in a deadlock. Therefore state (b) is an unsafe state. The move from (a) to (b) brought the system from a safe state to an unsafe state.

Figure from [Ta01 p.177]

Slide 343 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 3DeadlocksBanker’s Algorithm (Dijkstra 1965)

Think of a small-town banker who deals with a group of customers to whom he has granted lines of credit. If granting a request leads to an unsafe state, the request is denied. If a request leads to a safe state, the request is granted.

Knowing that not all customers need their credit line immediately, the banker has reserved 10 money units instead of 22 to service them.

Initial state

There are four customers (processes) demanding a total of 22 money units (resources).

The banker (operating system) has provided 10 money units in total.

Slide 344 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 3
Banker's Algorithm (continued)

The banker’s algorithm considers each request as it occurs. A request is granted when the state remains safe, otherwise the request is postponed until later.

a) Initial state (safe)

b) Safe state: C’s maximum request can be satisfied. When C has paid back the 4 money units, B’s request (or D’s) can be satisfied. ...

c) Unsafe state: If any of the customers requests the maximum, the banker would be stuck (deadlock).

Figure from [Ta01 p.178]

(a) (b) (c)


Slide 345 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 3
Banker's Algorithm for multiple resource instances

Current allocation matrix C Request matrix R

Existing

Possessed (allocated)

Available

Figure from [Ta01 p.179]

Slide 346 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 3
Banker's Algorithm for multiple resource instances

1. Look for a row Ri whose unmet requirements are smaller than or equal to A. If no such row exists, the system will deadlock sooner or later since no process can run to completion.

2. Assume the process of the row chosen requests its maximum resources (which is guaranteed to be possible) and finishes. Mark the process as terminated and add its resources to the pool A.

3. Repeat steps 1 and 2 until either all processes are marked (in which case the initial state was safe), or until a deadlock occurs (in which case the initial state was unsafe).
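The three steps can be sketched directly in code. This is an illustrative Python sketch, not lecture material; the allocation and request matrices are reconstructed from the pool arithmetic of the worked example on the following slide (pool (1 0 2 0), completion order D, A, B, C, E).

```python
def bankers_safe(available, allocation, request):
    """Steps 1-3 above: repeatedly pick a process whose unmet request
    fits into the pool A, let it finish, and reclaim its resources."""
    pool = list(available)
    pending = list(allocation)            # process names in table order
    order = []
    while pending:
        p = next((q for q in pending
                  if all(r <= a for r, a in zip(request[q], pool))), None)
        if p is None:
            return None                   # step 1 fails: state is unsafe
        # Granting the request and reclaiming everything afterwards is a
        # net gain of the process's current allocation.
        pool = [a + c for a, c in zip(pool, allocation[p])]
        pending.remove(p)
        order.append(p)
    return order                          # all processes marked: safe

# Matrices reconstructed from the worked example's pool arithmetic.
allocation = {"A": (3, 0, 1, 1), "B": (0, 1, 0, 0), "C": (1, 1, 1, 0),
              "D": (1, 1, 0, 1), "E": (0, 0, 0, 0)}
request = {"A": (1, 1, 0, 0), "B": (0, 1, 1, 2), "C": (3, 1, 0, 0),
           "D": (0, 0, 1, 0), "E": (2, 1, 1, 0)}
print(bankers_safe((1, 0, 2, 0), allocation, request))
# ['D', 'A', 'B', 'C', 'E']
```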

Slide 347 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 3
Banker's Algorithm for multiple resource instances

The pool is A = (1 0 2 0).

Process D can be scheduled next because (0 0 1 0) ≤ (1 0 2 0).
When finished, the pool is A = (1 0 1 0) + (1 1 1 1) = (2 1 2 1).

Process A can be scheduled because (1 1 0 0) ≤ (2 1 2 1).
When finished, the pool is A = (1 0 2 1) + (4 1 1 1) = (5 1 3 2).

Process B can be scheduled because (0 1 1 2) ≤ (5 1 3 2).
When finished, the pool is A = (5 0 2 0) + (0 2 1 2) = (5 2 3 2).

Process C can be scheduled because (3 1 0 0) ≤ (5 2 3 2).
When finished, the pool is A = (2 1 3 2) + (4 2 1 0) = (6 3 4 2).

Process E can be scheduled because (2 1 1 0) ≤ (6 3 4 2).
When finished, the pool is A = (4 2 3 2) + (2 1 1 0) = (6 3 4 2).

Slide 348 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 3
Banker's Algorithm for multiple resource instances

The state shown is a safe state since we have found at least one way to complete all processes. Other sequences are possible.

No more processes. All processes have successfully completed.

In practice the banker's algorithm is of minor use, because

• processes rarely know in advance the maximum number of resources needed,

• the number of processes is not constant over time as users log in and out (or other events require computational attention).


Slide 349 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Deadlocks
Strategy 4 (Deadlock Prevention)
Break (at least) one of the four conditions for a deadlock.

• Avoiding mutual exclusion
Sometimes possible. Instead of using a printer exclusively, the processes write into a print-spooler directory. This way several processes can use the printer at the same time. However, an internal system table (e.g. the process table) cannot be spooled. The same applies to a CD writer.

• Breaking "hold and wait"
Processes request all their resources at once ("either all or none"). However, not all processes know their demand from the beginning. Moreover, the resources are then not optimally used (degradation in multiprogramming).

Variation: each time an additional resource is needed, the process first releases all its resources and then tries to acquire all of them at once. This way a process does not occupy resources while waiting for a new one.

Slide 350 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Strategy 4

• Attacking the "no preemption" condition
Forcibly removing a resource from a process is hardly practicable.

• Breaking "circular wait"
Provide a global numbering of all resources (ranking). Resource requests must be made in ascending order. This way a resource-allocation graph can have no cycles. In the figure, B cannot request the scanner, even if it were available.

Deadlocks

1. Imagesetter
2. Scanner
3. Plotter
4. Tape drive
5. CD-ROM drive

A               B
Scanner         Plotter

However, not all resources allow for a reasonable order. How to order table slots, disk spooling space, locked database records?
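Ordered resource acquisition can be enforced in code by always taking locks in ascending rank. The following is a minimal Python sketch (the helper name and the use of thread locks are illustrative, not from the lecture; the ranking follows the slide's numbering).

```python
import threading

# Global resource ranking as on the slide; requests must ascend.
RANK = {"imagesetter": 1, "scanner": 2, "plotter": 3,
        "tape drive": 4, "cd-rom drive": 5}
LOCKS = {name: threading.Lock() for name in RANK}

def with_resources(names, action):
    """Acquire the locks in ascending rank, run the action, release in
    reverse order. Since every thread locks in the same global order,
    a circular wait (and hence a deadlock) cannot arise."""
    ordered = sorted(names, key=RANK.__getitem__)
    for n in ordered:
        LOCKS[n].acquire()
    try:
        return action()
    finally:
        for n in reversed(ordered):
            LOCKS[n].release()

print(with_resources(["plotter", "scanner"], lambda: "job done"))  # job done
```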

Slide 351 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Memory Management

Slide 352 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Management

Memory (353)

Paging (381)

Segmentation (400)

Paged Segments (412)

Virtual Memory (419)

Caches (471)


Slide 353 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory
Core Memory

Image source: http://www.psych.usyd.edu.au/pdp-11/core.html

• Period: 1950 ... 1975

• Non-volatile

• Matrix of magnetic cores

• Storing a bit by changing the magnetic polarity of a core

• Access time 3 µs ... 300 ns

• Destructive read
After reading a core, the content is lost. A read cycle must be followed by a write cycle in order to restore the content.

Slide 354 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory
Semiconductor Memory (≈1970 ...)

• Storing a bit by charging a capacitor (sometimes just the self-capacitance of a transistor)

• One transistor per bit
High density / capacity per area unit

• Volatile

• Destructive read

• Self-discharging
Periodic refresh needed

Dynamic memory (DRAM)

Image source: http://www.research.ibm.com/journal/rd/391/adler.html

Memory Management

Slide 355 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory
Semiconductor Memory (≈1970 ...)

• Storing a bit in a flip-flop
Setting / resetting the flip-flop

• 6 transistors per bit
More chip area than with DRAM

• Volatile

• Non-destructive read

• No self-discharge

• Fast!

Static memory (SRAM)

Image source: Wikipedia on „SRAM“ (English)

Memory Management

Slide 356 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Hierarchy

Memory hierarchy levels in typical desktop / server computers, figure from [HP06 p.288]

Program(mer)s want unlimited amounts of fast memory. Economical solution: memory hierarchy.

Memory Management


Slide 357 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Main Memory

• Central to computer system

• Large array of words / bytes

• Many programs at a time
for multiprogramming / multitasking to be effective

Memory layout of a time-sharing system: the operating system plus programs 1 ... n reside together in working memory.

Slide 358 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Binding
• Program = binary executable file
• Code/data accessible via addresses

...
i = i + 1;
check(i);
...

Addresses in the source code are symbolic,here: i (a variable) and check (a function).

The compiler typically binds the symbolic addresses to relocatable addresses, such as "i is 14 bytes from the beginning of the module". The compiler may also be instructed to produce absolute addresses (non-relocatable code).

The loader finally binds the relocatable addresses to absolute addresses, such as "i is at 74014", when loading the code into memory.

Memory Management

Slide 359 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Binding Schemes

The binding of code and data to logical memory addresses can be done at three stages:

• Compile time (program creation)
The resulting code is absolute code. All addresses are absolute. The program must be loaded exactly at a particular logical address in memory.

• Load time
The code must be relocatable, that is, all addresses are given as an offset from some starting address (relative addresses). The loader calculates and fills in the resulting absolute addresses at load time (before execution starts).

• Execution time
The relocatable code is executed. Address translation from relative to absolute addresses takes place at execution time (for every single memory access). Special hardware is needed (MMU).

Memory Management

Slide 360 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Logical / Physical Addresses

Logical address
The address generated by the CPU, also termed virtual address. All logical addresses form the logical (virtual) address space.

Physical address
The address seen by the memory. All physical addresses form the physical address space.

In compile-time and load-time address-binding schemes the logical and the physical addresses are the same.

In execution-time address-binding the logical and physical addresses differ.

Memory Management


Slide 361 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Management Unit
Hardware device that maps logical addresses to physical addresses (MMU).

A program (a process) deals with logical addresses; it never sees the real physical addresses.

Figure from [Sil00 p.258]

Memory Management

Slide 362 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Protection

• Protecting the kernel against user processes
No user process may read, modify or even destroy kernel data (or kernel code). Access to kernel data (system tables) only through system calls.

• Protecting user processes from one another
No user process may read or modify other processes' data or code. Any data exchange between processes only via IPC.

MMU equipped with a limit register:

• Loaded with the highest allowed logical address
This is done by the dispatcher as part of the context switch.

• Any address beyond the limit causes an error

• Assumption: contiguous physical memory per process

Memory Management

Slide 363 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Protection

Figure from [Sil00 p.266]

Limit register for protecting process spaces against each other

Memory Management

Slide 364 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Occupation

Obtaining better memory-space utilization. Initially the entire program plus its data (variables) needed to be in memory.

• Dynamic Loading: "Load what is needed when it is needed."

• Overlays: "Replace code by other code."

• Dynamic Linking (Shared Libraries): "Use shared code rather than back-pack everything."

• Swapping: "Temporarily kick out a process from memory."

Memory Management


Slide 365 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Dynamic Loading

• Routines are kept on disk
The main program is loaded into memory.

• Routine loaded when needed
Upon each call it is checked whether the routine is in memory. If not, the routine is loaded into memory.

• Unused routines are never loaded
Although the total program size may be large, the portion that is actually executed can be much smaller.

• No special OS support required
Dynamic loading is implemented by the user. System libraries (and corresponding system calls) may help the programmer.

Memory Occupation

Slide 366 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Overlays

• Existing code is replaced by new code
Similar to dynamic loading, but instead of adding new routines to the memory, existing code is replaced by the loaded code.

• No special OS support required
Overlay technique implemented by the user.

Example: consider a two-pass assembler:

Pass 1: 70 kB
Pass 2: 80 kB
Symbol table: 20 kB
Common routines: 30 kB

Pass 1 and pass 2 do not need to be in memory at the same time → Overlay

Loading everything at once would require 200 kB.

Memory Occupation

Slide 367 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory

Overlays

Figure from [Sil00 p.262]

Pass 1, when finished, is overlaid by pass 2. An additional overlay driver is needed (10 kB), but the total memory requirement now is 140 kB instead of 200 kB.

Memory Occupation

Slide 368 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Dynamic Linking

• Different processes use the same code
This is especially true for shared system libraries (e.g. reading from keyboard, graphical output on screen, networking, printing, disk access).

• Single copy of shared code in memory
Rather than linking the libraries statically to each program (which increases the size of each binary executable), the libraries (or individual routines) are linked dynamically during execution time. Each library resides only once in physical memory.

• "Stub"
A piece of program code initially located at the library references in the program. When first called, it loads the library (if not yet loaded) and replaces itself with the address of the library routine.

• OS support required
since a user process cannot look beyond its own address space to see whether (and where) the library code is located in physical memory (protection!).

Memory Occupation


Slide 369 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Swapping

• A process can be swapped temporarily out of memory to a backing store, and then brought back into memory for continued execution.

• Backing store: fast disk large enough to accommodate copies of all memory images for all users; must provide direct access to these memory images.

• Roll out, roll in: swapping variant used for priority-based scheduling algorithms; a lower-priority process is swapped out so a higher-priority process can be loaded and executed.

• Major part of swap time is transfer time; total transfer time is directly proportional to the amount of memory swapped.

Memory Occupation

Slide 370 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Swapping

Figure from [Sil00 p.263]

Figure: Process P1 is swapped out, and process P2 is swapped in.

Memory Occupation

Slide 371 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Allocation

Allocation of physical memory to a process:

• Contiguous
The physical memory space is contiguous (linear) for each process.
Fixed-sized partitions
Variable-sized partitions
Placement schemes: first fit, best fit, worst fit

• Non-Contiguous
The physical memory space per process is fragmented (has holes).
Paging
Segmentation
Combination of Paging and Segmentation

Memory Management

Slide 372 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Figure: memory divided into fixed-sized partitions holding the operating system, processes 1 ... 4, and one free partition.

Contiguous Memory Allocation

• Fixed-sized partitions
Memory is divided into fixed-sized partitions. Originally used by IBM OS/360, no longer in use today.

The physical memory allocated to a process is contiguous (no holes).

Simple to implement

Degree of multiprogramming is bound by the number of partitions

Internal fragmentation


Slide 373 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Contiguous Memory Allocation
The physical memory allocated to a process is contiguous (no holes).

• Variable-sized partitions
Partitions are of variable size.

OS must keep a free list listing free memory (holes)

OS must provide a placement scheme

Degree of multiprogramming only limited by available memory

No (or very little) internal fragmentation

External fragmentation
The holes may be too small for a new process

Figure: variable-sized partitions holding the operating system and processes 1 ... 4.

Slide 374 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Compaction
Reducing external fragmentation (for variable-sized partitions)

Figure: before and after compaction. The occupied partitions (operating system, processes 1 ... 4) are moved together so that the free memory forms one contiguous block. The copy operation is expensive.

Slide 375 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Placement Schemes
Satisfying a request of size n from a list of free holes.

• First fit
Find the first hole that is large enough. Fastest method.

• Best fit
Find the smallest hole that is large enough. The entire list must be searched (unless it is sorted by hole size). This strategy produces the smallest leftover hole.

• Worst fit
Find the largest hole. Search the entire list (unless sorted). This strategy produces the largest leftover hole, which may be more useful than the smallest leftover hole from the best-fit approach.

Common to all schemes: find a large enough hole, allocate the portion needed, and return the remainder (leftover hole) to the free list.
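The three schemes can be sketched as one small function over a list of hole sizes. This is an illustrative Python sketch; the request size and hole sizes below are made up for demonstration.

```python
def place(holes, n, scheme):
    """Return (hole index, leftover size) for a request of size n,
    or None if no hole is large enough."""
    fitting = [(i, size) for i, size in enumerate(holes) if size >= n]
    if not fitting:
        return None
    if scheme == "first":
        i, size = fitting[0]
    elif scheme == "best":
        i, size = min(fitting, key=lambda t: t[1])
    else:                                  # "worst"
        i, size = max(fitting, key=lambda t: t[1])
    return i, size - n

holes = [100, 500, 200, 300, 600]          # illustrative free-hole list
print(place(holes, 212, "first"))          # (1, 288)
print(place(holes, 212, "best"))           # (3, 88)
print(place(holes, 212, "worst"))          # (4, 388)
```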

Slide 376 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

First Fit
Example: we need a given amount of memory. The search starts at the bottom.

Figure: the search stops at the first hole that is large enough; the request is placed there, leaving a leftover hole.


Slide 377 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Best Fit
Example: we need a given amount of memory. The search starts at the bottom.

Figure: all holes must be examined; here the top hole fits best. The request is placed there, leaving a leftover hole.

This scheme creates the smallest leftover hole among the three schemes.

Slide 378 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Worst Fit
Example: we need a given amount of memory. The search starts at the bottom.

Figure: all holes must be examined; here the bottom hole is found to be the largest. The request is placed there, leaving a leftover hole.

This scheme creates the largest leftover hole among the three schemes.

Slide 379 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Allocation

• Contiguous
The physical memory space is contiguous (linear) for each process.
Fixed-sized partitions
Variable-sized partitions
Placement schemes: first fit, best fit, worst fit

• Non-Contiguous
The physical memory space of a process is fragmented (has holes).
Paging
Segmentation
Combination of Paging and Segmentation

Allocation of physical memory to a process

Slide 380 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Management

Memory (353)

Paging (381)

Segmentation (400)

Paged Segments (412)

Virtual Memory (419)

Caches (471)


Slide 381 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging

• Physical address space of a process can be non-contiguous

• Physical memory divided into fixed-sized frames
Frame size is a power of 2, between 512 bytes and 8192 bytes

• Logical memory divided into pages
Page size is identical to frame size.

• OS keeps track of all free frames (free-frame list)

• Running a program of size n pages requires finding n free frames

• Page table translates logical to physical addresses.

• Internal fragmentation, no external fragmentation.

Slide 382 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation
Address generated by the CPU is divided into:

Page number p: used as an index into a page table which contains the base address f of the corresponding frame in physical memory.

Page offset d: the offset from the frame start; physical memory address = f + d.

logical address: | p (m - n bits) | d (n bits) |

The logical address is m bits wide. Page size = frame size = 2^n.

Paging

Slide 383 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging

low memory

high memory

Figure from [Sil00 p.270]

Physical address = f + d
f = PageTable[p]
p = the m - n most significant bits of the logical address
d = the n least significant bits

Slide 384 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging

Figure from [Sil00 p.271]

Paging model: logical address space is contiguous, whereasthe corresponding physical address space is not.


Slide 385 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging
Figure from [Sil00 p.272]

Example: n = 2 (page size is 4 bytes), m = 4 (logical address space is 16 bytes).

What is the physical address of k?

k is located at logical address 10D = 1010B, so p = 2 (the upper m - n bits) and d = 2 (the lower n bits).

Page table (page number → frame address): 0 → 20, 1 → 24, 2 → 4, 3 → 8.

f = PageTable[2] = 4
Physical address = f + d = 4 + 2 = 6
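The translation can be sketched in a few lines of Python (hypothetical helper, with the page table storing frame addresses as on the slide).

```python
def translate(logical, n, page_table):
    """Split the logical address into page number p (upper bits) and
    offset d (lower n bits); the table yields the frame address f."""
    p = logical >> n                 # page number
    d = logical & ((1 << n) - 1)     # offset within the page
    f = page_table[p]                # frame *address*, as on the slide
    return f + d

# Slide's numbers: n = 2, page table 0 -> 20, 1 -> 24, 2 -> 4, 3 -> 8,
# and k at logical address 10.
print(translate(10, 2, [20, 24, 4, 8]))   # 6
```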

Slide 386 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Free-Frame List

Before allocation, the free-frame list contains frames 14, 13, 18, 20 and 15. A new process with four pages (0 ... 3) is to be loaded.

After allocation, the page table of the new process maps page 0 → frame 14, page 1 → frame 13, page 2 → frame 18, page 3 → frame 20. Only frame 15 remains in the free-frame list.

The OS must maintain a table of free frames (free-frame list).

Paging

Slide 387 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page-Table

Where to locate the page table?

• Dedicated registers within the CPU
Only suitable for small memory. Used e.g. in the PDP-11 (8 page registers, each page 8 kB, 64 kB main memory in total). Fast access (high-speed registers).

• Table in main memory
A dedicated CPU register, the page-table base register (PTBR), points to the table in memory (the table currently in use). With each context switch the PTBR is reloaded (then pointing to another page table in memory). The actual size of the page table is given by a second register, the page-table length register (PTLR).

With the latter scheme we need two memory accesses, one for the page table and one for accessing the memory location itself. Slowdown! Solution: a special hardware cache, the translation look-aside buffer (TLB).

Paging

Slide 388 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Translation Look-Aside Buffer

A translation look-aside buffer (TLB) is a small fast-lookup associative memory.

The associative registers contain page ↔ frame entries (key | value). When a page number is presented to the TLB, all keys are checked simultaneously. If the desired page number is not in the TLB, the translation must be fetched from the page table in memory.

Paging

Example TLB contents (key = page number, value = frame address or frame number):
5 → 12, 0 → 14, 1 → 13, 4 → 4, 2 → 18, 6 → 15, 9 → 17, 3 → 20.
Presenting page number 2 yields frame 18.


Slide 389 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Translation Look-Aside Buffer

Paging hardware with TLB. Figure from [Sil00 p.276]

Paging

Slide 390 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Access Time
Assume: memory access time = 100 ns, TLB access time = 20 ns.

When page number is in TLB (hit):total access time = 20ns + 100ns = 120ns

When page number is not in TLB (miss):total access time = 20ns + 100ns + 100ns = 220ns

With 80% hit ratio:average access time = 0.8 · 120ns + 0.2 · 220ns = 140ns

With 98% hit ratio:average access time = 0.98 · 120ns + 0.02 · 220ns = 122ns

Paging
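The calculations above can be reproduced with a small helper. This is an illustrative sketch; the `levels` parameter is an assumption that generalizes the one-level case here to the four-level figure computed on slide 398.

```python
def effective_access(hit_ratio, tlb=20, mem=100, levels=1):
    """Average access time in ns: a TLB hit costs one memory access,
    a miss additionally walks a `levels`-deep page table."""
    hit = tlb + mem
    miss = tlb + levels * mem + mem
    return hit_ratio * hit + (1 - hit_ratio) * miss

print(round(effective_access(0.80), 1))            # 140.0
print(round(effective_access(0.98), 1))            # 122.0
print(round(effective_access(0.98, levels=4), 1))  # 128.0 (cf. slide 398)
```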

Slide 391 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Protection
With paging, the processes' memory spaces are automatically protected against each other since each process is assigned its own set of frames.

If a process tries to access a page that is not in the page table (or is marked invalid, see next slide), the OS traps the process.

Figure from [Sil00 p.272]

Page table (page number → frame address): 0 → 20, 1 → 24, 2 → 4, 3 → 8.

Valid physical addresses: 20 ... 23, 24 ... 27, 04 ... 07, 08 ... 11.

Paging

Slide 392 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Frame Attributes
Each frame may be characterized by additional bits in the page table.

Figure from [Sil00 p.277]

Valid / invalid
Whether the frame is currently allocated to the process

Read-only
Frame is read-only

Execute-only
Frame contains code

Shared
Frame is accessible to other processes as well.

Paging


Slide 393 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Pages
Implementation of shared memory through paging is rather easy.

A shared page is a page whose frame is allocated to other processes as well. Several processes share a page in that each of the shared pages is mapped to the same frame in physical memory.

Shared code must be non-self-modifying code (reentrant code).

Figure on the next slide:

Three processes are using an editor. The editor needs 3 pages for its code. Rather than loading the code three times into memory, the code is shared. It is loaded only once into memory, but is visible to each process as if it were their private code.

The data (the text edited), of course, is private to each process. Each process thus has its own data frame.

Paging

Slide 394 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Pages

Figure from [Sil00 p.283]

Note: free memory is shown in gray, occupied memory in white.

Figure: each of the three processes has pages 0 ... 3; the editor-code pages map to shared frames, while the private data page maps to a frame of its own.

Pages 0,1,2 of each process are mapped to physical frames 3,4,6.

Slide 395 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging
Logical address space of modern CPUs: 2^32 ... 2^64

Assume: 32-bit CPU, frame size = 4 kB
⇒ 2^32 / 2^12 = 2^20 page table entries (per process)

Each entry: 20 bit + 20 bit = 5 byte
(20 bit for the page number, 20 bit for the frame number, which is less than the 32 bit required for the frame address)

⇒ 2^20 × 5 byte = 5 MB per page table!

Page table entry: page number (20 bit) | frame number (20 bit)
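The arithmetic can be restated as a tiny sketch:

```python
# The slide's arithmetic for a 32-bit CPU with 4 kB frames, restated.
entries = 2 ** 32 // 2 ** 12       # pages per process: 2^20
entry_bytes = (20 + 20) // 8       # 20-bit page number + 20-bit frame number
table_mb = entries * entry_bytes // 2 ** 20
print(entries, table_mb)           # 1048576 5
```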

Slide 396 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Two-Level Paging
Often a process will not use all of its logical address space. Rather than allocating the page table contiguously in main memory (for the worst case), the page table is divided into small pieces and is paged itself.

outer page table: its output points to a frame containing page table entries (the inner page table)

inner page table: its output points to the final destination frame

Paging


Slide 397 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Two-Level Paging

logical address: | p1 (10 bit) | p2 (10 bit) | d (12 bit) |
(page numbers p1, p2; page offset d)

Numbers are for the 32-bit, 4 kB frame example.

Figure from [Sil00 p.279]

The outer page table has max 2^10 entries; each page of the inner table has 2^10 entries. The last step of the lookup yields the final destination frame in memory.

Paging
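The 10/10/12 split can be sketched with plain bit operations (the sample address below is arbitrary, chosen only for illustration).

```python
def split_two_level(addr):
    """Split a 32-bit logical address into the 10/10/12 fields:
    outer index p1, inner index p2, offset d within the 4 kB frame."""
    p1 = (addr >> 22) & 0x3FF      # top 10 bits
    p2 = (addr >> 12) & 0x3FF      # next 10 bits
    d = addr & 0xFFF               # low 12 bits
    return p1, p2, d

p1, p2, d = split_two_level(0x00403ABC)    # arbitrary sample address
print(p1, p2, d)                           # 1 3 2748
assert (p1 << 22) | (p2 << 12) | d == 0x00403ABC   # fields reassemble
```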

Slide 398 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Paging

• Tree-structure principle
Each outer page entry defines the root node of a tree.

• Two- / three- / four-level paging
SPARC (32 bit): three-level paging. Motorola 68030 (32 bit): four-level paging.

• Better memory utilization
than using a contiguous (and possibly maximum-sized) page table.

• Increase in access time
since we hop several times until the final memory location is reached. Caching (TLB), however, helps out a lot.

Four-level paging with 98% hit rate:
Effective access time = 0.98 · 120 ns + 0.02 · 520 ns = 128 ns

Paging

Slide 399 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Memory (353)

Paging (381)

Segmentation (400)

Paged Segments (412)

Virtual Memory (419)

Caches (471)

Slide 400 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation
User views of logical memory:

Linear array of bytes
Reflected by the 'Paging' memory scheme.

A collection of variable-sized entities
The user thinks in terms of "subroutines", "stack", "symbol table", "main program", which are somehow located somewhere in memory.

Figure from [Sil00 p.285]

Segmentation supports this user view. The logical address space is a collection of segments.


Slide 401 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation

Segmentation model: the user space (logical address space) consists of a collection of segments which are mapped through the segmentation architecture onto the physical memory.

Figure: segments 1 ... 4 of the user space appear at (differently ordered) places in physical memory.

Slide 402 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation

• Physical address space of a process can be non-contiguous, as with paging

• Logical address consists of a tuple <segment number, offset>

• Segment table maps logical addresses onto physical addresses
base: physical address of the segment
limit: length of the segment

• Segment table can hold additional segment attributes
like the frame attributes (see paging).

• Shared segments
Shared segments are mapped to the same segment in physical memory.

Slide 403 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation

Figure from [Sil00 p.286]

s selects the entry from the table. Offset d is checked against the maximum size of the segment (limit). Final physical address = base + d.
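The lookup with its limit check can be sketched as follows. The base/limit values are illustrative, in the style of the [Sil00 p.287] example referenced below.

```python
def seg_translate(s, d, segment_table):
    """segment_table[s] = (base, limit); trap when the offset d is
    not smaller than the segment's limit."""
    base, limit = segment_table[s]
    if d >= limit:
        raise MemoryError("segmentation violation: offset beyond limit")
    return base + d

# Illustrative base/limit values for a five-segment process.
table = {0: (1400, 1000), 1: (6300, 400), 2: (4300, 400),
         3: (3200, 1100), 4: (4700, 1000)}
print(seg_translate(2, 53, table))   # 4353
```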

Slide 404 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation

• Segments are variable-sized
Dynamic memory allocation required (first fit, best fit, worst fit).

• External fragmentation
In the worst case the largest hole may not be large enough to fit in a new segment. Note that paging has no external fragmentation problem.

• Each process has its own segment table
like with paging, where each process has its own page table. The size of the segment table is determined by the number of segments, whereas the size of the page table depends on the total amount of memory occupied.

• Segment table located in main memory
as is the page table with paging

• Segment table base register (STBR)
points to the current segment table in memory

• Segment table length register (STLR)
indicates the number of segments


Slide 405 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation
Example:

A program is being assembled. The compiler determines the sizes of the individual components (segments) as follows:

Segment          Size
stack            1100 byte
subroutine       1000 byte
function sqrt()   400 byte
symbol table     1000 byte
main program      400 byte

Slide 406 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Segmentation
Example (continued):

The process is assigned 5 segments in memory as well as a segment table.

Figure from [Sil00 p.287]

Slide 407 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Shared Segments

Processes P1 and P2 share the editor code. Segment 0 of each process is mapped onto the same physical segment at address 43062.

Figure from [Sil00 p.288]

The data segments are private to each process, so segment 1 of each process is mapped to its own segment in physical memory.

Segmentation

Slide 408 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging versus Segmentation

• With paging, physical memory is divided into fixed-size frames. When memory space is needed, as many free frames are occupied as necessary. These frames can be located anywhere in memory; the user process always sees a logically contiguous address space.

• With segmentation, the memory is not systematically divided. When a program needs k segments (usually of different sizes), the OS tries to place these segments in the available memory holes. The segments can be scattered around memory. The user process does not see one contiguous address space, but a collection of segments (of course each individual segment is contiguous, as is each page or frame).


Slide 409 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging versus Segmentation

Figure: paging allocates fixed-size frames (13 ... 20); segmentation places variable-size segments seg1 ... seg4 into memory holes.

Paging is based on fixed-size units of memory (frames); the unused rest of a frame is internal fragmentation.

Segmentation is based on variable-size units of memory (segments); free memory between segments can be allocated.

Slide 410 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paging versus Segmentation

Paging:
• Each process is assigned its page table.
• Page table size proportional to allocated memory
• Often large page tables and/or multi-level paging
• Internal fragmentation
• Free memory is quickly allocated to a process
• The Motorola 68000 line is based on a flat address space

Segmentation:
• Each process is assigned a segment table
• Segment table size proportional to the number of segments
• Usually small segment tables
• External fragmentation
• Lengthy search times when allocating memory to a process
• The Intel 80x86 family is based on segmentation

Slide 411 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Management

Memory (353)

Paging (381)

Segmentation (400)

Paged Segments (412)

Virtual Memory (419)

Caches (471)

Slide 412 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments

[Figure: segments seg1–seg4, shown first as contiguous memory areas (segmentation), then sliced into pages scattered over frames 13–20 (paged segments)]

Combining segmentation with paging yields paged segments.

With segmentation, each segment is a contiguous space in physical memory. With paged segments, each segment is sliced into pages. The pages can be scattered in memory.


Slide 413 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments

[Figure: logical process space with segments seg1–seg4; each segment has its own page table mapping its pages to frame numbers 13–20 in physical memory; unused memory within the last frame of a segment is internal fragmentation]

Each segment has its own page table.

Slide 414 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments

The MULTICS operating system (a predecessor of UNIX) solved the problems of external fragmentation and lengthy search times by paging the segments.

This solution differs from pure segmentation in that each segment table entry does not contain the base address of the segment, but rather the base address of a page table for this segment.

In contrast to pure paging, where each process is assigned a page table, here each segment is assigned a page table.

The processes still see just segments – not knowing that the segments themselves are paged.

With paged segments no more time is spent on optimal segment placement; however, some internal fragmentation is introduced.

Slide 415 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments

The logical address is a tuple <segment number s, offset d>. The segment number is added to the STBR (segment table base register) and thus points to a segment table entry. The segment table is located in main memory. From the entry the page table base is derived, which points to the beginning of the corresponding page table in memory. The first part p of the offset d determines the entry in the page table. The output of the page table is the frame address f (or alternatively a frame number). Finally f + d´ is the physical memory address.

Steps in resolving the final physical address:

PageTable = SegmentTable[s].base;
f = PageTable[p];
final address = f + d´

The logical address is thus split into the fields s | p | d´ (the offset d consists of p and d´).

Explanation of next slide (principle of paged segments)
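The translation steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code; the page size, segment numbers and frame numbers are made-up values, and the segment table is modelled as a plain dictionary of page tables.

```python
# Sketch of paged-segment address translation (sizes and table contents
# are illustrative assumptions, not from the lecture).
PAGE_SIZE = 256  # bytes per page; the offset d´ is d mod PAGE_SIZE

# Segment table: segment number s -> page table (a list of frame numbers).
segment_table = {
    0: [5, 9, 2],   # segment 0 occupies frames 5, 9, 2
    1: [7, 1],      # segment 1 occupies frames 7, 1
}

def translate(s, d):
    """Resolve logical address <segment s, offset d> to a physical address."""
    page_table = segment_table[s]        # segment table entry -> page table
    p, d_prime = divmod(d, PAGE_SIZE)    # split offset d into <p, d´>
    f = page_table[p]                    # page table entry -> frame number
    return f * PAGE_SIZE + d_prime       # physical address = f + d´

print(translate(0, 300))  # page 1 of segment 0 -> frame 9 -> 9*256 + 44 = 2348
```

The three statements in `translate` correspond one-to-one to the pseudocode steps on the slide.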

Slide 416 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments

Principle of paged segments


Slide 417 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Paged Segments

• Combination of segmentation and paging
User view is segmentation, memory allocation scheme is paging

• Used by modern processors / architectures

Example: Intel 80386

The CPU has 6 segment registers which act as a quick 6-entry segment table.

Up to 16384 segments per process are possible, in which case the segment table resides in main memory.

The maximum segment size is 4 GB. Within each segment we have a flat address scheme of 2^32 byte addresses.

The page size is 4 kB. A two-level paging scheme is used.

Slide 418 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Memory (353)

Paging (381)

Segmentation (400)

Paged Segments (412)

Virtual Memory (419)

Caches (471)

Slide 419 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

What if the physical memory is smaller than required by a process?

Dynamic Loading / Overlays
Require special precautions and extra work by the programmer.

It would be much easier if we did not have to worry about the memory size and could leave the problem of fitting a larger program into smaller memory to the operating system.

„Virtual Memory“

Memory is abstracted into an extremely large uniform array of storage, independent of the amount of physical memory available.

Slide 420 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

• Based on locality assumption
No process can access all its code and data at the same time; therefore the entire process space does not need to be in memory at all times.

• Only parts of the process space are in memory
The remaining ones are on disk and are loaded when demanded.

• Logical address space can be much larger than physical address space
A program larger than physical memory can be executed.

More programs can (partially) reside in memory, which increases the degree of multiprogramming!


Slide 421 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

[Figure: a program larger than physical memory; the part currently needed resides in memory (next to the OS and free memory), the rest on the backing store (usually a disk)]

Virtual memory concept (one program)

Slide 422 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

[Figure: programs A, B and C share physical memory; the pieces A´, A´´, B´, B´´, C´, C´´ reside on the backing store]

Virtual memory concept (three programs)

Slide 423 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

Virtual memory can be implemented by means of

• Demand Segmentation
Used in early Burroughs computer systems and in IBM OS/2. Complex segment-replacement algorithms.

• Demand Paging
Commonly used today. Physical memory is divided into frames (paging principle). Demand paging applies to both paging systems and paged segment systems.

Figure next slide: Virtual memory usually is much larger than physical memory (e.g. modern 64-bit processors). The pages currently needed by a process are in memory, the other pages reside on disk. The page table records whether a page is in memory or on disk.

Slide 424 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

page table

Virtual Memory

Figure from [Sil00 p.299]

disk

Virtual memory consists of more pages than there are frames in physical memory


Slide 425 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Demand Paging

A page is brought from disk into memory when it is needed (when it is demanded by the process).

• Less I/O
than loading the entire program (at least for the moment)

• Less memory needed
since a (hopefully) great part of the program remains on disk

• Faster response
The process can start earlier since loading is quicker

• More processes in memory
The memory saved can be given to other processes

Loading a page on demand is done by the pager (a part of the operating system – usually a daemon process).

Virtual Memory

Slide 426 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Demand Paging

Q: How does the OS know that a page is demanded by a process?

A: When the process tries to access a page that is not in memory! A process does not know whether or not a page is in memory; only the OS knows.

Each page table entry has a validity bit (v):
• If v = 1 ⇒ the page is in memory
• If v = 0 ⇒ the page is on disk
• The validity bit is also termed valid-invalid bit

During address translation, when the validity bit is found to be 0, the hardware causes a page fault trap to the operating system.

Virtual Memory

Slide 427 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Fault

A page fault is the event that a non-memory-resident page was accessed by some process.

Steps in demand paging:

1. A reference to some page is made.

2. The page is not listed in the table (or is marked invalid), which causes a page fault trap (a hardware interrupt) to the operating system.

3. An internal table is checked (usually kept with the process control block) to determine whether the reference was a valid or an invalid memory access. If the reference was valid, a free frame has to be found.

Virtual Memory

Slide 428 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Fault

4. A disk operation is scheduled to read the desired page into the free frame.

5. When disk read is complete, the internal tables are updated to reflect that the page now is in memory.

6. The process is restarted at the instruction that caused the page fault trap. The process can now access the page.

These steps are symbolized in the next figure →

Virtual Memory


Slide 429 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

Figure from [Sil00 p.301]

Page table – indicating that pages 0, 2 and 5 are currently in memory, while pages1, 3, 4, 6, 7 are not.

Slide 430 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Fault

Figure from [Sil00 p.302]

Steps in handling a page fault

Virtual Memory

Slide 431 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Performance of Demand Paging

Page fault rate 0 ≤ p ≤ 1
Average probability that a memory reference will cause a page fault:
if p = 0 ⇒ no page faults at all
if p = 1 ⇒ every reference causes a page fault

Memory access time t_ma
Time to access physical memory (usually in the range of 10 ns ... 150 ns)

Effective access time t_eff
Average effective memory access time. This time finally counts for system performance:

t_eff = (1 – p) · t_ma + p · page fault time

Virtual Memory

Slide 432 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Performance of Demand Paging

Page fault time
The time from the failed memory reference until the machine instruction continues:

Trap to the OS
Context switch
Check validity
Find a free frame
Schedule disk read
Context switch to another process (optional)
Place page in frame
Adjust tables
Context switch and restart process

Assuming a disk system with an average latency of 8 ms, an average seek time of 15 ms and a transfer time of 1 ms (and neglecting that the disk queue may hold other processes waiting for disk I/O), and assuming the execution time of the page fault handling instructions to be 1 ms, the page fault time is 25 ms.

Virtual Memory


Slide 433 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Performance of Demand Paging

Effective access time (with t_ma = 100 ns and a page fault time of 25 ms):

t_eff = (1 – p) · 100 ns + p · 25 ms
      = 100 ns + p · 24,999,900 ns

When each memory reference causes a page fault (p = 1), the system is slowed down by a factor of 250,000.

When one out of 1000 references causes a page fault (p = 0.001), the system is slowed down by a factor of about 250.

For less than a 10% degradation, the page fault rate p must be less than 0.0000004 (1 page fault in 2.5 million references).

Virtual Memory
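The slide's numbers are easy to reproduce. The following is a small sketch (not from the lecture) using the slide's values of t_ma = 100 ns and a page fault time of 25 ms:

```python
# Effective access time for demand paging, using the slide's numbers:
# t_ma = 100 ns, page fault time = 25 ms. All times are in nanoseconds.
def t_eff(p, t_ma=100, t_fault=25_000_000):
    return (1 - p) * t_ma + p * t_fault

print(t_eff(0))      # 100 ns: no page faults
print(t_eff(0.001))  # ~25100 ns: one fault per 1000 refs, ~250x slowdown
print(t_eff(1))      # 25,000,000 ns = 25 ms: every reference faults
```

Setting t_eff ≤ 110 ns and solving for p reproduces the bound p < 0.0000004 for less than 10% degradation.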

Slide 434 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Performance of Demand Paging

Some possibilities for lowering the page fault rate:

• Increase page size
With larger pages the likelihood of crossing page boundaries is lower.

• Use a „good“ page replacement scheme
Preferably one that minimizes page faults.

• Assign „sufficient“ frames
The system constantly monitors memory accesses, creates page-usage statistics and adjusts the number of allocated frames on the fly. Costly, but used in some systems (so-called working-set model).

• Enforce program locality
Programs can contribute to locality by minimizing cross-page accesses. This applies to the implemented algorithms as well as to the addressing modes of the individual machine instructions.

Virtual Memory

Slide 435 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Size

What should be the page (= frame) size?

Small pages: little internal fragmentation, large page tables, slower disk I/O, more page faults.

Large pages: internal fragmentation, smaller page tables, faster disk I/O, fewer page faults.

Intel 80386: 4 kB
Intel Pentium II: 4 kB or 4 MB
Sun UltraSparc: 8 kB, 64 kB, 512 kB, 4 MB

The trend goes toward larger pages. Page faults are more costly today because the gap between CPU speed and disk speed has increased.

Virtual Memory

Slide 436 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Attributes

In addition to the validity bit v, each page may be equipped with the following attribute bits in the page table entry:

• Reference bit r
Upon any reference to the page (read / write) the bit is set. Once the bit is set it remains set until cleared by the OS.

• Modify bit m
Each time the page is modified (write access), the bit is set. The bit remains set until cleared by the OS.

A page that has been modified is also called dirty; the modify bit is also termed dirty bit. When the page is not modified it is clean.

Virtual Memory


Slide 437 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Finding Free Frames

What options does the OS have when needing free frames?

• Terminate another process
Not acceptable. The process may already have done some work (e.g. changed a database) which may mistakenly be repeated when the process is started again.

• Swap out a process
An option only in rare circumstances (e.g. thrashing).

• Hold some frames in spare
Sooner or later the spare frames are used up. Memory utilization is lower since the spare memory is not used productively.

• Borrow frames
Yes! Take an allocated frame, use it, and give it (or another one) back to the owner later.

⇒ Page Replacement

Virtual Memory

Slide 438 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement

Page replacement scheme:

If there is a free frame, use it;

otherwise use a page-replacement algorithm to select a victim frame.

Save the victim page to disk and adjust the tables of the owner process.

Read in the desired page and adjust the tables.

This requires two page transfers (one out, one in).

Improvement
Preferably use a victim page that is clean (not modified, m = 0). Clean pages do not need to be saved to disk.

Virtual Memory

Slide 439 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement

Figure from [Sil00 p.309]

Need for page replacement: user process 1 wants to access module M (page 3), but all memory is occupied. Now a victim frame needs to be determined.

Virtual Memory

Slide 440 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement

Figure from [Sil00 p.310]

Page replacement: the victim is saved to disk (1) and the page table is adjusted (2). The desired page is read in (3) and the table is adjusted again. In this figure the victim used to be a page of the same process (or of the same segment, in case of paged segments).

Virtual Memory


Slide 441 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement

Global page replacement
The victim frame can be taken from the set of all frames, that is, one process can take a frame from another. Processes can affect each other's page fault rate, though.

Local page replacement
The victim frame may only be taken from the process's own set of frames, that is, the number of allocated frames per process does not change. No impact on other processes.

The figure on the previous slide shows a local page replacement strategy.

Virtual Memory

Slide 442 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement

Page replacement algorithms:

First-in first-out (FIFO)
and its variations second-chance and clock

Optimal page replacement (OPT)

Least Recently Used (LRU)

LRU Approximations

Desired: lowest page-fault rate!

The algorithms are evaluated by applying them to memory reference strings.

Virtual Memory

Slide 443 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Reference Strings

Assume the following address sequence (e.g. recorded by tracing the memory accesses of a process):

0100, 0432, 0101, 0612, 0102, 0103, 0104, 0101, 0611, 0102, 0103, 0104, 0101, 0610, 0102, 0103, 0104, 0101, 0609, 0102, 0105

Assuming a page size of 100 bytes, the sequence can be reduced to

1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1

This memory reference string lists the pages accessed over time (at the time steps at which the accessed page changes).

If there is only 1 frame available, the sequence causes 11 page faults. If there are 3 frames available, the sequence causes 3 page faults.

Virtual Memory
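The reduction from address trace to reference string can be sketched mechanically. This is an illustrative snippet (not from the lecture), using the slide's trace and its page size of 100 bytes:

```python
# Derive the memory reference string of the slide from the address trace,
# assuming a page size of 100 bytes (slide 443).
trace = [100, 432, 101, 612, 102, 103, 104, 101, 611, 102, 103,
         104, 101, 610, 102, 103, 104, 101, 609, 102, 105]

pages = [a // 100 for a in trace]          # page number of each access
# Collapse consecutive repeats: only page *changes* matter.
ref_string = [p for i, p in enumerate(pages) if i == 0 or p != pages[i - 1]]
print(ref_string)  # [1, 4, 1, 6, 1, 6, 1, 6, 1, 6, 1]
```

With a single frame, every entry of this string is a page fault (11 faults), matching the slide.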

Slide 444 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page faults versus number of frames

Memory Reference Strings

Figure from [Sil00 p.312]

In general, the more frames are available, the lower the expected number of page faults.


Slide 445 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FIFO Page Replacement

Example VM.1

Principle: Replace the oldest page (old = swap-in time).

Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Number of frames: 3

[Figure from [Sil00 p.313]: frame contents over time]

Total: 15 page faults.

Virtual Memory

Slide 446 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FIFO Page Replacement

Example VM.2

Memory reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Number of frames: 3

Frame contents after each reference (oldest page first; * marks a page fault):

ref:  1  2  3  4  1  2  5  1  2  3  4  5
      1  1  1  2  3  4  1  1  1  2  5  5
         2  2  3  4  1  2  2  2  5  3  3
            3  4  1  2  5  5  5  3  4  4
      *  *  *  *  *  *  *        *  *

9 page faults

Virtual Memory

Slide 447 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FIFO Page Replacement

Example VM.3

Memory reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5 (as in VM.2)
Number of frames: 4

Frame contents after each reference (oldest page first; * marks a page fault):

ref:  1  2  3  4  1  2  5  1  2  3  4  5
      1  1  1  1  1  1  2  3  4  5  1  2
         2  2  2  2  2  3  4  5  1  2  3
            3  3  3  3  4  5  1  2  3  4
               4  4  4  5  1  2  3  4  5
      *  *  *  *        *  *  *  *  *  *

10 page faults

Although we have more frames available than previously, the page fault rate did not decrease!

Virtual Memory

Slide 448 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

FIFO Page Replacement

From examples VM.2 and VM.3 it can be noticed that the number of page faults for 4 frames is greater than for 3 frames.

This unexpected result is known as Belady's Anomaly¹:

For some page-replacement algorithms the page-fault rate may increase as the number of allocated frames increases.

¹ Laszlo Belady, R. Nelson, G. Shedler: An anomaly in space-time characteristics of certain programs running in a paging machine, Communications of the ACM, Volume 12, Issue 6, June 1969, Pages 349–353, ISSN 0001-0782, also available online as PDF from the ACM.

Virtual Memory
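Belady's anomaly is easy to reproduce with a small FIFO simulator. The following sketch (not from the lecture) counts page faults for the reference string of examples VM.2 and VM.3:

```python
from collections import deque

def fifo_faults(refs, nframes):
    """Count page faults for FIFO page replacement."""
    frames, queue, faults = set(), deque(), 0
    for page in refs:
        if page not in frames:
            faults += 1
            if len(frames) == nframes:         # must evict the oldest page
                frames.discard(queue.popleft())
            frames.add(page)
            queue.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))  # 9
print(fifo_faults(refs, 4))  # 10 <- more frames, yet more faults (Belady)
```

The same function can reproduce the curve on the next slide by sweeping `nframes` from 1 to 7.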


Slide 449 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Belady‘s Anomaly

Figure from [Sil00 p.314]

Page faults versus number of frames for the string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5.

Virtual Memory

Slide 450 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Second-Chance Algorithm

This algorithm is a derivative of the FIFO algorithm:

Start with the oldest page.

Inspect the page.

If r = 0: replace the page. Done.

If r = 1: give the page a second chance by clearing r and moving the page to the top of the FIFO.

Proceed to the next oldest page.

When a page is used often enough to keep the r bit set, it will never be replaced. This avoids the problem of throwing out a heavily used page (as may happen with strict FIFO). If all pages have r = 1, however, the algorithm degenerates to FIFO.

Virtual Memory

Slide 451 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Second-Chance Algorithm

Figure from [Ta01 p.218]

Example: page A is the oldest in the FIFO (see a) and has r = 1. With pure FIFO it would have been replaced. However, as r = 1 it is given a second chance and is moved to the top of the FIFO (see b). The algorithm continues with page B.

Virtual Memory

Slide 452 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Clock Algorithm

Second chance constantly moves pages within the FIFO (overhead)! When the FIFO is arranged as a circular list, the overhead is less.

Initially the hand (a pointer) points to the oldest page. The algorithm applied is then second chance.

Figure from [Ta01 p.219]

Virtual Memory
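The circular arrangement can be sketched as follows. This is a minimal illustration (not from the lecture): pages sit in a circular buffer, and the hand sweeps over them, clearing r bits until it finds a page with r = 0.

```python
# Minimal sketch of the clock algorithm (second chance on a circular list).
class Clock:
    def __init__(self, nframes):
        self.nframes = nframes
        self.pages = []        # circular list of [page, r-bit] entries
        self.hand = 0          # points at the candidate victim

    def access(self, page):
        """Reference a page; return True if this access caused a fault."""
        for entry in self.pages:
            if entry[0] == page:       # hit: set the reference bit
                entry[1] = 1
                return False
        if len(self.pages) < self.nframes:    # free frame available
            self.pages.append([page, 1])
            return True
        while self.pages[self.hand][1] == 1:  # give second chances
            self.pages[self.hand][1] = 0
            self.hand = (self.hand + 1) % self.nframes
        self.pages[self.hand] = [page, 1]     # replace the victim
        self.hand = (self.hand + 1) % self.nframes
        return True

c = Clock(3)
faults = sum(c.access(p) for p in [1, 2, 3, 2, 4, 1])
print(faults)  # 5 page faults for this trace
```

Note that no page is ever moved within the list; only the hand advances, which is the overhead saving over plain second chance.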


Slide 453 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Optimal Page Replacement

Example VM.4

Principle: Replace the page that will not be used for the longest time.

Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Number of frames: 3

[Figure from [Sil00 p.315]: frame contents over time]

Total: 9 page faults.

Slide 454 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

LRU Page Replacement

Example VM.5

Principle: Replace the page that has not been used for the longest time.

Memory reference string: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1
Number of frames: 3

[Figure from [Sil00 p.315]: frame contents over time]

Virtual Memory

Slide 455 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

LRU Page Replacement

Possible LRU implementations:

• Counter implementation
Every page table entry has a counter field. The system hardware must have a logical counter. With each page access the counter value is copied to the entry.
– Update on each page access required
– Searching the table for finding the LRU page
– Must account for counter overflow

• Stack implementation
Keep a stack containing all page numbers. Each time a page is referenced, its number is searched and moved to the top. The top holds the MRU page, the bottom holds the LRU page.
– Update on page access required
– Searching the stack for the current page number

Virtual Memory
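The stack implementation maps naturally onto an ordered dictionary. The following sketch (not from the lecture; names are illustrative) moves each referenced page to the MRU end and evicts from the LRU end:

```python
from collections import OrderedDict

# Sketch of the stack implementation of LRU: most recently used page at
# one end, least recently used page at the other.
class LRU:
    def __init__(self, nframes):
        self.nframes = nframes
        self.stack = OrderedDict()   # first item = LRU page, last = MRU

    def access(self, page):
        """Reference a page; return True if this access caused a fault."""
        fault = page not in self.stack
        if fault and len(self.stack) == self.nframes:
            self.stack.popitem(last=False)   # evict least recently used
        self.stack.pop(page, None)
        self.stack[page] = True              # move page to the top (MRU)
        return fault

lru = LRU(3)
refs = [1, 2, 3, 1, 4]       # '2' is least recently used when 4 arrives
faults = [lru.access(p) for p in refs]
print(faults)            # [True, True, True, False, True]
print(list(lru.stack))   # [3, 1, 4] -> 2 was evicted
```

The move-to-top on every access is exactly the update cost the slide mentions as the drawback of this scheme.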

Slide 456 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

LRU Page Replacement

Example of the stack implementation principle

bottom of stack

Figure from [Sil00 p.317]

Virtual Memory


Slide 457 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

LRU Approximation

Not many systems provide sufficient hardware support for true LRU page replacement. ⇒ Approximate LRU!

• Use reference bit
When looking for the LRU page, take a page with r = 0.
No ordering among the pages (only used and unused).

• History field
Each page table entry has a history field h (e.g. a byte).
When the page is accessed, the most significant bit (e.g. bit 7) is set.
Periodically (e.g. every 100 ms) the bits are shifted right.
When looking for the LRU page, take the page with the smallest unsigned_int(h).
Better ordering among the pages (256 history values).

Virtual Memory

Slide 458 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

LRU Approximation

History field examples:

00000000 = not used in the last 8 time periods
11111111 = used in each of the past 8 periods
01001000 = used in the last period and in the fifth-last period

[Table: example history fields with their values as unsigned integers; the page with the smallest value is chosen as victim]

Virtual Memory
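The history-field (aging) scheme can be sketched in a few lines. This is an illustrative snippet (not from the lecture): each period, every history byte is shifted right and the reference bit is OR-ed into bit 7.

```python
# Sketch of the aging (history-field) LRU approximation for one page.
def age(history, referenced):
    """One period: shift the byte right, OR the reference bit into bit 7."""
    history >>= 1
    if referenced:
        history |= 0b1000_0000
    return history & 0xFF      # keep it an 8-bit field

h = 0
for r in [1, 1, 0, 1]:         # page referenced in periods 1, 2 and 4
    h = age(h, r)
print(f"{h:08b}")  # 10110000 -> larger value = more recently used
```

Comparing these bytes as unsigned integers gives the ordering the slide describes: the page with the smallest value is the victim.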

Slide 459 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement

[Plot: page faults per 1000 references versus number of frames allocated (6 ... 14), for FIFO, Clock, LRU and OPT]

Figure from lecture slides WS 05/06

Exemplary page fault rates. Differences are noticeable only for smaller numbers of frames.

Virtual Memory

Slide 460 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Page Replacement – Algorithms Summary

First-in first-out (FIFO)
Simplest algorithm, easy to implement, but has the worst performance. The clock version is somewhat better as it does not replace busy pages.

Optimal page replacement (OPT)
Not of practical use as one must know the future! Used for comparisons only. Lowest page fault rate of all algorithms.

Least Recently Used (LRU)
The best usable algorithm, but requires much execution time or highly sophisticated hardware.

LRU Approximations
Slightly worse than LRU, but faster. Applicable in practice.

Virtual Memory


Slide 461 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Thrashing

Figure from [Sil00 p.326]

When the number of allocated frames falls below a certain number of pages actively used by a process, the process will cause page fault after page fault.

This high paging activity is called thrashing.

A too high degree of multiprogramming results in thrashing, because each process does not have „enough“ frames.

Virtual Memory

Slide 462 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Thrashing – Countermeasures

• Switch to local page replacement
A thrashing process cannot steal frames from others. The page device queue (used by all) however is still full of requests – lowering overall system performance.

• Swap out
The thrashing process or some other process can be swapped out for a while. The choice depends on process priorities.

• Assign „sufficient“ frames
How many frames are sufficient?

Working-set model: All page references are monitored (online memory reference string creation). The pages recently accessed form the working set. Its size is used as the number of ‚sufficient‘ frames.

Virtual Memory

Slide 463 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Working-Set

The working-set model uses a parameter ∆ to define the working-set window. The set of pages referenced within the last ∆ references defines the working set WS. The OS allocates to the process enough frames to maintain the size of the working set. Keeping track of the working set requires the observation of memory accesses (constantly or in time intervals).

Figure from [Sil00 p.328]

Virtual Memory
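The working set at a given time is simply the set of distinct pages in the last ∆ references. A minimal sketch (not from the lecture; the reference string is made up):

```python
# Sketch of the working-set model: the working set at time t is the set
# of distinct pages referenced within the last DELTA references.
def working_set(refs, t, delta):
    return set(refs[max(0, t - delta + 1): t + 1])

refs = [1, 2, 1, 5, 7, 7, 7, 7, 5, 1]
print(working_set(refs, 9, 4))  # pages in the last 4 references: {7, 5, 1}
```

The OS would allocate `len(working_set(...))` frames to this process; if the sum of working-set sizes exceeds the number of physical frames, some process must be swapped out to prevent thrashing.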

Slide 464 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality

Demand paging is transparent to the user program. A program however can enforce locality (at least for data).

Assume a page size of 128 words and consider the following program (Program A), which clears the elements of a 128 x 128 matrix column-wise:

int A[][] = new int[128][128];

for (int j = 0; j < 128; j++)      // column
    for (int i = 0; i < 128; i++)  // row
        A[i][j] = 0;

Virtual Memory


Slide 465 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality

In row-major storage, a multidimensional array is laid out in linear memory such that rows are stored one after the other. It is the approach used by C, Java, and many other languages, with the notable exception of Fortran.

For example, the matrix

1 2 3
4 5 6

is defined in C as

int A[2][3] = { {1,2,3}, {4,5,6} };

and is stored in memory row-wise: 1, 2, 3 (row 1), then 4, 5, 6 (row 2), from low to high addresses.

Virtual Memory

Slide 466 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality

Thus, each row of the 128 x 128 matrix occupies one page.

If the operating system allocates only one frame (for the data) to process A, the process will cause 128 x 128 = 16384 page faults!

This is because the process clears one word in each page (word j of rows i, i+1, i+2, ...), then the next word, thus „jumping“ from page to page in the inner loop:

for (int j = 0; j < 128; j++)
    for (int i = 0; i < 128; i++)
        A[i][j] = 0;

Virtual Memory

Slide 467 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality

By changing the loop order, the process first finishes one page before going to the next (Program B):

int A[][] = new int[128][128];

for (int i = 0; i < 128; i++)
    for (int j = 0; j < 128; j++)
        A[i][j] = 0;

Now, if the operating system allocates only one frame (for the data) to process B, the process will cause only 128 page faults!

Virtual Memory
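The fault counts of the two programs can be verified with a small simulation. This sketch (not from the lecture) assumes, as the slides do, a page size of one matrix row (128 words), row-major storage, and a single allocated data frame:

```python
# Page faults for the two loop orders of the matrix example, assuming
# one matrix row per page (row-major storage) and one allocated frame.
def count_faults(order, n=128):
    current_page, faults = None, 0
    for i, j in order(n):
        page = i                 # A[i][j] lies in page i (row-major)
        if page != current_page: # with one frame, any page change faults
            faults += 1
            current_page = page
    return faults

col_major = lambda n: ((i, j) for j in range(n) for i in range(n))  # Program A
row_major = lambda n: ((i, j) for i in range(n) for j in range(n))  # Program B

print(count_faults(col_major))  # 16384 = 128 * 128
print(count_faults(row_major))  # 128
```

The only difference between the two generators is the nesting order of the loops, exactly as on the slides.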

Slide 468 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Program Locality

Locality is also influenced by the addressing modes at machine instruction level.

Consider a three-address instruction, such as ADD A,B,C, which performs C := A + B. In the worst case the operands A, B, C are located in 3 different pages.

Another example is the PDP-11 instruction MOV @(R1)+,@(R2)+ (addressing mode 3 for the source operand), which in the worst case straddles 6 pages.

Virtual Memory


Slide 469 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Virtual Memory

• Separation of logical and physical memory
The user / programmer can think of an extremely large virtual address space.

• Pure paging / paged segments
Virtual memory can be implemented upon both memory allocation schemes.

• Execution of large programs
which do not fit into physical memory in their entirety.

• Better multiprogramming
as there can be more programs in memory.

• Not suitable for hard real-time systems!
Virtual memory is the antithesis of hard real-time computing. This is because the response times cannot be guaranteed, owing to the fact that processes may influence each other (page device queue, thrashing, ...).

Slide 470 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture

Memory (353)

Paging (381)

Segmentation (400)

Paged Segments (412)

Virtual Memory (419)

Caches (471)

Slide 471 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Memory Hierarchy

Memory hierarchy levels in typical desktop / server computers, figure from [HP06 p.288]

The farther away from the CPU, the larger and slower the memory. The hierarchy is a consequence of locality.

Caches

Slide 472 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Locality Principle

Programs tend to reuse data and instructions.

Rule of thumb:

A program spends 90% of its execution time in only 10% of the code.

Temporal locality: recently accessed items are likely to be accessed in near future.

Spatial locality: items whose addresses are near one another tend to be referenced close together in time.

[HP06 p.38]

Caches


Slide 473 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Locality Principle

Example of a memory-access trace of a process

Figure from [Sil00 p.327]

Slide 474 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches

Cache: a safe place for hiding or storing things
Webster's Dictionary [HP06 p. C-1]

Here: fast memory that stores copies of data from the most frequently used main memory locations. Used by the CPU to reduce the average time to access memory locations.

Effect: instructions (in execution) can proceed quicker.
Instruction fetch is quicker; memory operands are accessed quicker.

Result: faster program execution (from the CPU's point of view) ⇒ improved system performance

Slide 475 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cached Memory Access

• CPU requests content from a memory location

• Cache is checked for this datum

• When present, deliver datum from cache

• When not, transfer datum from main memory to cache

• Then deliver from cache to CPU

Steps in accessing memory (here: reading from memory), simplified.

Caches
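The read steps above can be sketched as follows. This is an illustrative model (not from the lecture): main memory is a byte array, the cache is a dictionary keyed by block address, and the block size is an assumed value.

```python
# Sketch of a cached memory read: check the cache, fill the block from
# memory on a miss, then deliver the byte from the cache.
BLOCK = 16                       # illustrative block size in bytes

memory = bytes(range(256))       # stand-in for main memory
cache = {}                       # block address -> cached block (data area)

def read(addr):
    block_addr, offset = divmod(addr, BLOCK)
    if block_addr not in cache:                          # cache miss
        start = block_addr * BLOCK
        cache[block_addr] = memory[start:start + BLOCK]  # fill from memory
    return cache[block_addr][offset]                     # deliver from cache

print(read(35))   # address 35 -> block 2, offset 3 -> memory[35] = 35
```

A second `read(35)` would hit in the cache and skip the memory transfer, which is the whole point of the scheme.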

Slide 476 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches

To take advantage of spatial locality, a cache contains blocks of data rather than individual bytes. A block is a contiguous line of processor words; it is also called a cache line.

Common block sizes: 8 ... 128 bytes

[Figure: block transfer between main memory and cache, word transfer between cache and CPU]

Cache components:
• Data area
• Tag area
• Attribute area


Slide 477 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Data Area

All blocks in the cache make up the data area.

[Figure: blocks 0, 1, 2, ..., B–1, each N bytes]

Cache capacity = B · N bytes

Caches

Slide 478 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Tag Area

The block addresses of the cached blocks make up the tags of the cache lines¹. All tags form the tag area.

[Figure: data area (blocks 0 ... B–1, N bytes per block) with a tag per block]

¹ This statement is slightly simplified. In real caches, often just a fraction of the block address is used as tag.

Caches

Slide 479 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Attribute Area

The attribute area contains attribute bits (V, D) for each cache line.

[Figure: data area, tag area, and attribute bits V and D per block 0 ... B–1]

• Validity bit V
indicates whether the cache line holds valid data:
V = 1 ⇒ data is valid; V = 0 ⇒ data is invalid

• Dirty bit D
indicates whether the cache line data is modified with respect to main memory:
D = 1 ⇒ data is modified; D = 0 ⇒ data is not modified

Caches

Slide 480 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches. Each cache line plus its tag plus its attributes forms a slot.

[Figure: a cache slot comprises the attribute bits (V, D), the tag, and the cache line (N bytes); slots 0, 1, ..., B–1]


Slide 481 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches. How to find a certain byte in the cache?

• Block address is compared against all tags simultaneously

• In case of a match (cache hit), the offset selects the byte

The address generated by the CPU is divided into two fields.

• High order bits make up the block address
• Low order bits determine the offset within that block

[Address layout (m bits total): block address (m − n bits) | offset (n bits)]

Remark: CPU address space = 2m, Cache line size (block size) = 2n

Caches

Slide 482 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Address. Memory can be considered as an array of blocks.

[Figure: memory as an array of 4-byte blocks. Memory addresses 0, 4, 8, 12, ..., 36 (decimal) are the start addresses of the blocks with block addresses 0, 1, 2, ..., 9.]

The block address should not be confused with the memory address at which the block starts. The block address is a block number:

block address = memory address DIV block size
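The DIV/MOD arithmetic can be sketched in a few lines of Python (block size 4 bytes, matching the figure; the function names are ours):

```python
def block_address(mem_addr, block_size=4):
    # Block number of the block containing the byte at mem_addr.
    return mem_addr // block_size

def block_offset(mem_addr, block_size=4):
    # Position of the byte within its block.
    return mem_addr % block_size

# The byte at memory address 13 lies in block 3 (bytes 12..15), at offset 1.
```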

Slide 483 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches

[Figure: cache mechanism. The CPU memory address is split into block address and offset. A comparator matches the block address against all tags (with their V and D bits), yielding hit/miss; on a hit, the offset selects the requested data from the cache line (data out).]

Slide 484 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Hit Rate

Cache capacity is smaller than the capacity of main memory.

Consequently, not all memory locations can be mirrored in the cache. When a required datum is found in the cache, we have a cache hit, otherwise a cache miss.

The miss rate is the fraction of cache accesses that result in a miss.

The hit rate is the fraction of cache accesses that result in a hit.

Hit rate = (number of hits) / (number of memory accesses)


Slide 485 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Amdahl's Law

Used to find the maximum expected improvement to an overall system when a part of the system is improved. The law is a general law, not restricted to caches or computers.

I = 1 / ((1 − P) + P/S)

P: proportion of the system improved, 0 ≤ P ≤ 1
S: speedup of that proportion, S > 0, usually S > 1
I: maximum expected improvement, I > 0 (usually I > 1)

Slide 486 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Amdahl's Law. Example: „30% of the computations can be made twice as fast."

⇒ P = 0.3, S = 2

Improvement I = 1 / ((1 − 0.3) + 0.3/2) = 1 / (0.7 + 0.15) ≈ 1.177

Amdahl's Law in the special case of parallelization:

I = 1 / (F + (1 − F)/N)

F: proportion of sequential calculations (no speedup possible), 0 ≤ F ≤ 1
N: grade of parallelism (e.g. N processors), N > 0

See lecture „AdvancedComputer Architecture“

Caches
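Both formulas can be checked with a small Python sketch (function names are ours):

```python
def amdahl(P, S):
    # Amdahl's law: overall improvement I when a proportion P
    # of the system is sped up by factor S.
    return 1.0 / ((1.0 - P) + P / S)

def amdahl_parallel(F, N):
    # Special case: proportion F is sequential, the rest runs on N units.
    return 1.0 / (F + (1.0 - F) / N)

# 30% of the computations twice as fast: amdahl(0.3, 2) = 1/0.85, about 1.18
```

Note that amdahl_parallel(F, N) is just amdahl(P, S) with P = 1 − F and S = N.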

Slide 487 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches

Example: Assume cache access = 1 ns, main memory access = 100 ns, 90% hit rate. What is the overall improvement?

P = 0.9, S = 100 ns / 1 ns = 100

I = 1 / ((1 − 0.9) + 0.9/100) = 1 / 0.109 ≈ 9.17

Typical memory hierarchy:

               CPU (registers)  Cache (SRAM)  Main memory (DRAM)  I/O devices (disks)
Access time    250 ps           1 ns          100 ns              10 ms
Memory space   500 Byte         64 kB         1 GB                1 TB

Memory accesses (as seen by the CPU) now are more than 9 times as fast as without a cache.

Slide 488 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Read Access: reading from memory (improvement).

• CPU requests datum

• Search cache while fetching block from memory

• Cache hit: deliver datum, discard fetched block

• Cache miss: put block in cache and deliver datum

In case of a hit, the datum is available quickly. In case of a miss there is no benefit from the cache, but also no harm.

Things are not that easy when writing into memory. Let's look at the cases of a write hit and a write miss.

Caches


Slide 489 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Write Hit Policy. Assume a write hit. How to keep cache and main memory consistent on write-accesses?

[Figure: CPU → Cache → Write Buffer → Memory]

• Write through: The datum is written to both the block in the cache and the block in memory.
  Cache always clean (no dirty bit required). CPU write stall (problem reduced through a write buffer). Main memory always has the most current copy (cache coherency in multi-processor systems).

• Write back: The datum is only written to the cache (dirty bit is set). The modified block is written to main memory once it is evicted from the cache.
  Write speed = cache speed. Multiple writes to the same block still result in only one write to memory. Less memory bandwidth needed.

Caches

Slide 490 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Write Miss Policy. Assume a write miss. What to do?

• Write allocate: The block containing the referenced datum is transferred from main memory to the cache. Then one of the write hit policies is applied. Normally used with write-back caches.

• No-write allocate: Write misses do not affect the cache. Instead the datum is modified only in main memory. Write hits however do affect the cache. Normally used with write-through caches.

Caches

Slide 491 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Write Miss Policy. Assume an empty cache and the following sequence of memory operations. What are the numbers of hits and misses when using no-write allocate versus write allocate?

Operation        No-write allocate   Write allocate
WriteMem[100]    miss                miss
WriteMem[100]    miss                hit
ReadMem[200]     miss                miss
WriteMem[200]    hit                 hit
WriteMem[100]    miss                hit

Caches
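The hit/miss sequence for both policies can be reproduced with a toy simulation (our own sketch; it models the cache as a plain set of block addresses and ignores eviction):

```python
def simulate(ops, write_allocate):
    # ops: list of ("R"/"W", block address); returns "hit"/"miss" per op.
    cached = set()      # blocks currently in the cache (no eviction modelled)
    outcome = []
    for op, block in ops:
        if block in cached:
            outcome.append("hit")
        else:
            outcome.append("miss")
            # Reads always allocate; writes allocate only under write allocate.
            if op == "R" or write_allocate:
                cached.add(block)
    return outcome

ops = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
```

simulate(ops, write_allocate=False) yields miss, miss, miss, hit, miss; with write_allocate=True it yields miss, hit, miss, hit, hit, matching the table.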

Slide 492 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Caches

[Figure: main memory blocks mapped into a much smaller cache]

• Where exactly are the blocks placed in the cache? ⇒ Cache Organization
• What if the cache is full? ⇒ Replacement Strategies


Slide 493 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization. Where can a block be placed in the cache?

• Direct Mapped: With this mapping scheme a memory block can be placed in only one particular slot. The slot number is calculated from
  ((memory address) DIV (block size)) MOD (slots in cache)

• Fully Associative: The block can be placed in any slot.

• Set Associative: The block can be placed in a restricted set of slots. A set is a group of slots. The block is first mapped onto the set and can then be placed anywhere within the set. The set number is calculated from
  ((memory address) DIV (block size)) MOD (number of sets in cache)

Caches

Slide 494 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Mapped

Each memory block is mapped to exactly one slot in the cache (many-to-one mapping). If the slot is occupied (V = 1), the cache line is evicted.

[Figure: memory addresses 0, 4, 8, ..., 28 (4-byte blocks) mapped onto cache slots 0–3. Block size = 4 byte, cache capacity = 4 · 4 = 16 byte.]

Caches

Slide 495 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Mapped

Examples. Slot = ((memory address) DIV (block size)) MOD (slots in cache)

• In which slot goes the block located at address 12D?
  12 DIV 4 = 3, 3 MOD 4 = 3 (slot 3)

• In which slot goes the block located at address 20D?
  20 DIV 4 = 5, 5 MOD 4 = 1 (slot 1)

• Where goes the byte located at address 23D?
  23 DIV 4 = 5, 5 MOD 4 = 1, 23 MOD 4 = 3
  The byte goes in cache line (slot) 1 at offset 3

offset within slot = (memory address) MOD (block size)

Caches
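The same arithmetic as a small Python sketch (4-byte blocks and 4 slots as in the examples; the function name is ours):

```python
def slot_and_offset(addr, block_size=4, num_slots=4):
    # Direct mapping: slot = (addr DIV block size) MOD slots in cache,
    # offset = addr MOD block size.
    slot = (addr // block_size) % num_slots
    offset = addr % block_size
    return slot, offset

# Address 23 -> slot 1, offset 3 (binary 10111: slot bits 01, offset bits 11)
```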

Slide 496 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Mapped: extracting slot number and offset directly from the memory address.

Example: Where goes the byte located at address 23D?

23D = 10111B ⇒ offset bits 11B = 3, slot bits 01B = 1 ⇒ slot 1, offset 3

The lower bits of the block address select the slot. The size of the slot field depends on the number of slots (size = ld(number of slots)). ld = logarithmus dualis (base 2).

[Address layout (m bits total): tag bits | slot | offset (n bits); tag and slot together form the block address (m − n bits)]

Caches


Slide 497 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Mapped

[Figure: 64 kByte cache using four-word (16 Byte) blocks; figure from lecture CA WS05/06. The 32-bit address is divided into byte offset (bits 0–1), word offset (bits 2–3), slot (bits 4–15, 4 K lines) and tag (bits 16–31). The stored 16-bit tag and the valid bit are checked against the address tag (hit); a multiplexer selects one of the four 32-bit words from the 128-bit cache line.]

Caches

Slide 498 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Direct Mapped: explanations for the previous slide.

Logical address space of CPU: 2^32 byte.

Number of cache slots: 64 kB / 16 Byte = 4K = 4096 slots.

Bits 0–1 determine the position of the selected byte in a word. However, as the CPU uses 4-byte words as smallest entity, the byte offset is not used.

Bits 2–3 determine the position of the word within a cache line.

Bits 4 to 15 (12 bits) determine the slot: 2^12 = 4K = number of slots.

Bits 16 to 31 are compared against the tags to see whether or not the block is in the cache.

Caches

Slide 499 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Fully Associative

A memory block can go in any cache slot (many-to-many mapping).

[Figure: memory blocks at addresses 0, 4, 8, ..., 36 can each go into any of the 4 cache slots (4 choices).]

Slot selection:
• check all tags (preferably simultaneously)
• take a slot with V = 0 (a free slot)
• otherwise select a slot according to some replacement strategy

Caches

Slide 500 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative

A memory block goes into a set, and can be placed anywhere within the set (many-to-some mapping).

[Figure: 2-way set associative cache. Slots 0 and 1 form set 0; slots 2 and 3 form set 1. Memory blocks at addresses 0, 4, 8, ..., 36 map onto the sets.]

Slot selection:
• determine the set from the block address
• in this set, take a free slot ...
• ... or evict a slot according to some replacement strategy

Caches


Slide 501 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative

Set = ((memory address) DIV (block size)) MOD (sets in cache)

Example: In which set goes the block located at address 12D?
12 DIV 4 = 3 (block address), 3 MOD 2 = 1 (set 1). In which slot the block finally goes depends on occupation and replacement strategy.

[Address layout: tag bits | set | offset]

Similar to direct mapping, the low order bits of the block address determine the destination set.

Caches
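A one-line sketch of the set calculation (2 sets and 4-byte blocks as in the example; the function name is ours):

```python
def target_set(addr, block_size=4, num_sets=2):
    # The low-order bits of the block address select the set.
    return (addr // block_size) % num_sets

# Address 12 -> block address 3 -> set 1
```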

Slide 502 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative: N-way set associative cache.

N = number of slots per set, not the number of sets. N is a power of 2; common values are 2, 4, 8.

Extremes:
• N = 1: There is only one slot per set, that is, each slot is a set. The set number (thus the slot) is drawn from the block address. ⇒ Direct Mapped
• N = B: There is only one set containing all slots (B = number of blocks in cache = number of slots). ⇒ Fully Associative

Caches

Set Associative

AMD Opteron Cache: two-way set associative. Figure from [HP06 p. C-13].

Caches

Slide 504 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Set Associative: the Opteron cache.

The physical address is 40 bits. It is divided into a 34-bit block address (subdivided into 25 tag bits and 9 index bits) and a 6-bit byte offset.

Cache capacity: 64 kB in 64-byte blocks (1024 slots).

The cache is two-way set associative: ⇒ 512 sets of 2 cache lines each.

Hardware: two arrays of 512 cache lines each, that is, each set has one cache line in array 1 and one in array 2.

Figure: The index selects the set (2^9 = 512). The two tags of the set are compared against the tag bits; the valid bit must be set for a hit. On a hit, the corresponding data is delivered using the winning input of a 2:1 multiplexer. The data goes to „Data in" of the CPU. The victim buffer is needed when a cache line has to be written back to main memory (replacement).

Caches


Slide 505 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization

[Figure: where can block 12 of main memory be placed in the cache, depending on the organization? Figure from [HP06 p. C-7]]

Caches

Slide 506 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization. For the previous figure, assume block 12 and block 20 being used very often. What is the problem?

Fully associative: No special problem. Both blocks can be stored in the cache at the same time.

Direct mapped: Problem! Only one of them can be stored at a time, since both map to the same slot: 12 mod 8 = 20 mod 8 = 4.

Set associative: No special problem. Both blocks can be stored in the same set at the same time.

Caches

Slide 507 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Organization

Direct Mapped: hard allocation (no choice). Simple & inexpensive. No replacement strategy required. If a process uses 2 blocks mapping to the same slot, cache misses are high.

Fully Associative: full choice. Expensive (hardware) searching for a free slot. Replacement strategy required.

Set Associative: compromise between direct mapped and fully associative. Some choice. Replacement strategy required.

Caches

Slide 508 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Replacement Strategies: strategies for selecting a slot to evict (when necessary).

• Random: Victim cache lines are selected randomly. A hardware pseudo-random number generator generates slot numbers.

• Least-Recently Used (LRU): Relies on the temporal locality principle. The least recently used block is hoped to have the smallest likelihood of (re)usage. Expensive hardware.

• First-in, First-out (FIFO): Approximation of LRU by selecting the oldest block (oldest = load time).
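LRU bookkeeping can be illustrated with a toy model of a single cache set (our own sketch, not hardware-accurate):

```python
from collections import OrderedDict

class LRUSet:
    # Toy model of one N-way cache set with LRU replacement;
    # it stores block tags only, no data.
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # least recently used tag comes first

    def access(self, tag):
        # Returns "hit" or "miss"; on a miss the least recently used
        # block is evicted when the set is full.
        if tag in self.blocks:
            self.blocks.move_to_end(tag)   # mark as most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict the LRU block
        self.blocks[tag] = True
        return "miss"
```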


Slide 509 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Replacement Strategies

Table: Data cache misses per 1000 instructions. Data from [HP06 p. C-10], collected for the Alpha architecture, block size = 64 byte.

            Two-way                Four-way               Eight-way
Capacity    LRU    Rand   FIFO    LRU    Rand   FIFO    LRU    Rand   FIFO
16 kB       114.1  117.3  115.5   111.7  115.1  113.3   109.0  111.8  110.4
64 kB       103.4  104.3  103.9   102.4  102.3  103.1    99.7  100.5  100.3
256 kB       92.2   92.1   92.5    92.1   92.1   92.5    92.1   92.1   92.5

• LRU is best for small caches
• little difference between all strategies for large caches

Caches

Slide 510 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Miss Categories

• Compulsory Misses: The very first access to a block cannot be in the cache, so the block must be loaded. Also called cold-start misses.

• Capacity Misses: Owing to the limited capacity of the cache, capacity misses will occur in addition to compulsory misses.

• Conflict Misses: In set associative or direct mapped caches too many blocks may map to the same set (or slot). Also called collision misses.

• Coherency Misses: owing to cache flushes to keep multiple caches coherent in a multiprocessor. Not considered in this lecture (see lecture „Advanced Computer Architecture").

Caches

Slide 511 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Optimization

Average memory access time = hit time + miss rate · miss penalty

Reducing miss rate:
• larger block size
• larger cache capacity
• higher associativity

Reducing miss penalty:
• multilevel caches
• read over write

Reducing hit time:
• avoiding address translation

Caches

Slide 512 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size

The data area gets larger cache lines (but fewer lines); the overall cache capacity remains the same. Common block sizes are 32 ... 128 bytes.

• reduced miss rate, taking advantage of spatial locality: more accesses will likely go to the same block
• increased miss penalty: more bytes have to be fetched from main memory
• increased conflict misses: the cache has fewer slots (per set)
• increased capacity misses: only for small caches. In case of high locality (e.g. repeated access to only one byte in a block) the remaining bytes are unused and waste cache capacity.

Caches


Slide 513 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size

[Figure: miss rate versus block size, for several cache capacities; from HP06 p. C-26]

Caches

Slide 514 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size

Average memory access time = hit time + miss rate · miss penalty

For the previous figure, assume the memory system takes 80 clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Assume the hit time to be 1 clock cycle independent of block size. Which block size has the smallest average memory access time?

4K cache, 16 byte block:

Average memory access time = 1 + (8.57 % · 82) = 8.027 clock cycles

4K cache, 32 byte block:

Average memory access time = 1 + (7.24 % · 84) = 7.082 clock cycles

... and so on for all cache sizes and block sizes.

Caches
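The calculation can be replayed for any block size (a sketch under the stated assumptions: 80 cycles overhead, then 16 bytes every 2 cycles, hit time 1 cycle):

```python
def miss_penalty(block_size):
    # 80 clock cycles overhead, then 16 bytes every 2 clock cycles.
    return 80 + 2 * (block_size // 16)

def amat(hit_time, miss_rate, penalty):
    # Average memory access time = hit time + miss rate * miss penalty.
    return hit_time + miss_rate * penalty

# 4K cache, 16-byte blocks, 8.57% miss rate: 1 + 0.0857 * 82 = 8.027 cycles
```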

Slide 515 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Block Size

Average memory access time (in clock cycles) versus block size, for 4 different cache capacities. The best (smallest) access time per column, thus per cache, is marked green in the original slide.

Block size   Miss penalty      4K      16K     64K     256K
16           82             8.027   4.231   2.673   1.894
32           84             7.082   3.411   2.134   1.588
64           88             7.160   3.323   1.933   1.449
128          96             8.469   3.659   1.979   1.470
256          112           11.651   4.685   2.288   1.549

Caches

Slide 516 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Capacity

The cache is enlarged by adding more cache slots.

• reduced miss rate, owing to fewer capacity misses
• potentially increased hit time, owing to increased complexity
• increased hardware & power consumption

Miss rates for block size 64 bytes:

Cache capacity    4K      16K     64K     256K
Miss rate        7.00%   2.64%   1.06%   0.51%

(each quadrupling of the capacity leaves 38 %, 40 % and 48 % of the previous miss rate)

Caches


Slide 517 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Associativity

The higher the associativity, the more slots per set. Common associativities are 1 (direct mapped), 2, 4, 8.

• reduced miss rate, primarily owing to fewer conflict misses
• increased hit time: time needed for finding a free slot in the set

Rules of Thumb:
• Eight-way set associative is almost as effective as fully associative.
• A direct mapped cache with capacity N has about the same miss rate as a two-way set associative cache of capacity N/2.

Caches

Slide 518 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Associativity

Miss rate [%] versus degree of associativity, for six cache capacities. Data from [HP06 p. C-23].

Degree    4K     8K     16K    64K    128K   512K
1-way     9.8    6.8    4.9    3.7    2.1    0.8
2-way     7.6    4.9    4.1    3.1    1.9    0.7
4-way     7.1    4.4    4.1    3.0    1.9    0.6
8-way     7.1    4.4    4.1    2.9    1.9    0.6

Caches

Slide 519 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Associativity

[Figure: hit time versus associativity]

Caches

Slide 520 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches: building a cache hierarchy.

[Figure: CPU → L1 → L2 → L3 → Main Memory]

• First-level cache (L1): small high speed cache, usually located in the CPU
• Second-level cache (L2): fast and bigger cache located close to the CPU (chip set)
• Third-level cache (L3), optional: separate memory chip between L2 and main memory

Caches


Slide 521 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches. Multi-level caches reduce the average miss penalty because on a miss the block can be fetched from the next cache level instead of from main memory.

Distinction between local and global cache considerations (local misses versus global references):

Local miss rate = (number of cache misses) / (number of cache accesses)   — local to a cache (e.g. L1, L2, ...)

Global miss rate = (number of cache misses) / (number of memory references by CPU)

Caches

Slide 522 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches. Example CH.1: Suppose that in 1000 memory references there are 40 misses in L1 and 20 misses in L2. What are the various miss rates?

[Figure: CPU → L1 → L2 → Main Memory]

Local miss rate L1 = global miss rate L1 = 40 / 1000 = 4 %   (these 4 % go from L1 to L2)

Local miss rate L2 = 20 / 40 = 50 %

Global miss rate L2 = 20 / 1000 = 2 %   (these 2 % go from L2 to main memory)

The local miss rate of L2 is large because L1 skims the cream of the memory accesses.

Caches

Slide 523 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches

[Figure: CPU → L1 → L2 → Main Memory]

Average memory access time = hit time L1 + miss rate L1 · miss penalty L1

miss penalty L1 = hit time L2 + local miss rate L2 · miss penalty L2

⇒ Average memory access time
  = hit time L1 + miss rate L1 · (hit time L2 + local miss rate L2 · miss penalty L2)

Caches

Slide 524 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Multi-Level Caches. Using the miss rates from example CH.1, and the following data:

hit time L1 = 1 clock cycle,
hit time L2 = 10 clock cycles,
miss penalty L2 = 200 clock cycles,

the average memory access time is

hit time L1 + miss rate L1 · (hit time L2 + local miss rate L2 · miss penalty L2)
= 1 + 0.04 · (10 + 0.5 · 200) = 5.4 clock cycles

Caches
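The two-level formula as a sketch, replaying example CH.1 (L1 miss rate 4%, L2 local miss rate 50%; with hit time L1 = 1, hit time L2 = 10 and miss penalty L2 = 200 this gives 1 + 0.04 · (10 + 0.5 · 200) = 5.4 clock cycles):

```python
def two_level_amat(hit_l1, miss_rate_l1, hit_l2, local_miss_l2, penalty_l2):
    # An L1 miss costs the L2 hit time plus, on an L2 miss in turn,
    # the main-memory miss penalty.
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_l2 * penalty_l2)
```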


Slide 525 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Read over Write

Assume a direct mapped write-through cache with 512 slots, and a four-word write buffer that is not checked on a read miss.

[Figure: CPU → Cache → Write Buffer → Memory]

SW R3, 512(R0)    ; mem[512] := R3   (cache slot 0)   store word
LW R1, 1024(R0)   ; R1 := mem[1024]  (cache slot 0)   load word
LW R2, 512(R0)    ; R2 := mem[512]   (cache slot 0)

Read-after-write hazard: The data in R3 is placed in the write buffer. The first load causes a read miss; the cache line is discarded. The second load again causes a read miss. If the write buffer has not completed writing R3 into memory, that load will read an incorrect value from mem[512].

Caches

Slide 526 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Read over Write: solutions to the previous problem („giving reads priority over writes").

• Read misses wait until the write buffer is empty; thereafter the required memory block is fetched into the cache.

• Check the contents of the write buffer: if the referenced data is not in the buffer, let the read access continue fetching the block into the cache. The write buffer is flushed later when the memory system is available.

Also applicable to write-back caches: the dirty block is put into a write buffer that allows inspection in case of a read miss. Read misses check the buffer before going directly to memory.

Caches

Slide 527 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation. What addresses are cached, virtual or physical addresses? A fully virtual cache uses logical addresses only; a fully physical cache uses physical addresses only.

[Figure 1: CPU → (virtual address) → virtual cache → (virtual address) → translation (segment tables / page tables / TLB) → (physical address) → memory]

[Figure 2: CPU → (virtual address) → translation → (physical address) → physical cache → (physical address) → memory]

Caches

Slide 528 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation

Fully virtual cache:
• No address translation time on a hit
• Cache must have copies of protection information (protection info must be fetched from page/segment tables)
• Cache flush on process switch (individual virtual addresses usually refer to different physical addresses)
• Shared memory: different virtual addresses refer to the same physical address, so copies of the same data can sit in the cache

Fully physical cache:
• Works very well with shared memory accesses
• Always address translation (time): hits are of no advantage regarding address translation

Caches


Slide 529 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation. Solution: get the best from both virtual and physical caches.

Two issues in accessing a cache:
• Indexing the cache, that is, calculating the target set (or slot with direct mapping)
• Comparing tags: comparing the tag field with (parts of) the block address

The page offset (the part that is identical in both virtual and physical address space) is used to index the cache. In parallel, the virtual part of the address is translated into the physical address and used for tag comparison. Improved hit time.

⇒ „virtually indexed, physically tagged cache"

Caches

Slide 530 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Address Translation: virtually indexed, physically tagged cache.

[Figure: the virtual address is split into page number and page offset. The page offset (plus word offset) indexes the cache directly, while the page number is translated (TLB, page table) into the frame address of the physical address. The translated frame address is compared against the cache tags; on a hit the data is delivered to the CPU, otherwise the next memory level is accessed.]

Caches

Slide 531 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Cache Optimization

Summary of basic cache optimizations. Data from [HP06 p. C-39]. + = improves a factor, – = hurts a factor, blank = no impact.

Technique                      Hit time  Miss penalty  Miss rate  Complexity  Comment
Larger block size                            –            +           0       Trivial
Larger cache capacity             –                       +           1       Widely used for L2
Higher associativity              –                       +           1       Widely used
Multi-level caches                           +                        2       Costly hardware; harder if L1 block size ≠ L2 block size
Read over write                              +                        1       Widely used
Avoiding address translation      +                                   1       Widely used

Caches

Slide 532 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Exam Computer Architecture

Date: 09.03.2007 (March 9th, 2007)
Time: 8:30 hrs
Location: ST 025/118, Duisburg - Ruhrort!


Slide 533 Computer Architecture WS 06/07 Dr.-Ing. Stefan Freinatis

Computer Architecture