
OPERATING SYSTEM

TABLE OF CONTENTS

Chapter 1: Introduction to Operating System

• What Is an Operating System?

• History of Operating System

• Features

• Examples of Operating Systems

Chapter 2: Operating System Structure

• System Components

• Operating System Services

• System Calls and System Programs

• Layered Approach Design

• Mechanisms and Policies

Chapter 3: Process

• Definition of Process

• Process State

• Process Operations

• Process Control Block

• Process (computing)

• Sub-processes and multi-threading

• Representation

• Process management in multi-tasking operating systems

• Processes in Action

• Some Scheduling Disciplines


Chapter 4: Threads

• Threads

• Thread Creation, Manipulation and Synchronization

• User Level Threads and Kernel Level Threads

• Context Switch

Chapter 5: The Central Processing Unit (CPU)

• The Architecture of Mic-1

• Simple Model of a Computer - Part 3

• The Fetch-Decode-Execute Cycle

• Instruction Set

• Microprogram Control versus Hardware Control

• CISC versus RISC

• CPU Scheduling

• CPU/Process Scheduling

• Scheduling Algorithms

Chapter 6: Inter-process Communication

• Critical Section

• Mutual Exclusion

• Proposals for Achieving Mutual Exclusion

• Semaphores

Chapter 7: Deadlock

• Definition

• Deadlock Condition

• Dealing with Deadlock Problem


Chapter 8: Memory Management

• About Memory

• Heap Management

• Using Memory

Chapter 9: Caching and Intro to File Systems

• Introduction to File Systems

• File System Implementation

• An old Homework problem

• File Systems

• Files on disk or CD-ROM

• Memory Mapping Files

Chapter 10: Directories and Security

• Security

• Protection Mechanisms

• Directories

• Hierarchical Directories

• Directory Operations

• Naming Systems

• Security and the File System

• Design Principles

• A Sampling of Protection Mechanisms

Chapter 11: File System Implementation

• The User Interface to Files

• The User Interface to Directories

• Implementing File Systems


• Node

• Software Levels

• Multiplexing and Arm Scheduling

Chapter 12: Networking

• Network

• Basic Concepts

• Other Global Issues


CHAPTER 1

INTRODUCTION TO OPERATING SYSTEM

What Is an Operating System?

An operating system (commonly abbreviated OS or O/S) is an interface between hardware and user; it is responsible for the management and coordination of activities and the sharing of the limited resources of the computer. The operating system acts as a host for applications that are run on the machine. As a host, one of the purposes of an operating system is to handle the details of the operation of the hardware. This relieves application programs from having to manage these details and makes it easier to write applications. Almost all computers, including handheld computers, desktop computers, supercomputers, and even video game consoles, use an operating system of some type. Some of the oldest models may, however, use an embedded operating system that is contained on a compact disc or other data storage device.

Operating systems offer a number of services to application programs and users. Applications access these services through application programming interfaces (APIs) or system calls. By invoking these interfaces, the application can request a service from the operating system, pass parameters, and receive the results of the operation. Users may also interact with the operating system through some kind of software user interface (UI), such as typing commands at a command line interface (CLI) or using a graphical user interface (GUI, commonly pronounced "gooey"). For hand-held and desktop computers, the user interface is generally considered part of the operating system. On large multi-user systems like Unix and Unix-like systems, the user interface is generally implemented as an application program that runs outside the operating system. (Whether the user interface should be included as part of the operating system is a point of contention.)
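To make the idea of a system call concrete, here is a minimal sketch in C, assuming a POSIX environment: the program asks the kernel, through the write() interface, to send bytes to standard output, passing parameters and receiving a result exactly as described above.

    /* Minimal sketch: requesting an OS service through the POSIX API.
     * write() traps into the kernel, which performs the I/O on the
     * program's behalf and returns the result. */
    #include <string.h>     /* strlen() */
    #include <unistd.h>     /* write() */

    int main(void) {
        const char *msg = "hello from user space\n";
        /* Ask the kernel to write to file descriptor 1 (standard output). */
        ssize_t n = write(1, msg, strlen(msg));
        return (n < 0) ? 1 : 0;   /* a negative return signals an error */
    }

On most systems the library call is a thin wrapper: the real work happens in the kernel, which validates the parameters and drives the hardware.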

Common contemporary operating systems include Mac OS, Windows, Linux, BSD, and Solaris. While servers generally run Unix or Unix-like systems, the embedded-device market is split among several operating systems.


The operating system is the most important program that runs on a computer. Every general-purpose computer must have an operating system to run other programs. Operating systems perform basic tasks, such as recognizing input from the keyboard, sending output to the display screen, keeping track of files and directories on the disk, and controlling peripheral devices such as disk drives and printers.

For large systems, the operating system has even greater responsibilities and powers. It is like a traffic cop -- it makes sure that different programs and users running at the same time do not interfere with each other. The operating system is also responsible for security, ensuring that unauthorized users do not access the system.

Operating systems can be classified as follows:

• Multi-user: Allows two or more users to run programs at the same time. Some operating systems permit hundreds or even thousands of concurrent users.

• Multiprocessing: Supports running a program on more than one CPU.

• Multitasking: Allows more than one program to run concurrently.

• Multithreading: Allows different parts of a single program to run concurrently.

• Real-time: Responds to input instantly. General-purpose operating systems, such as DOS and UNIX, are not real-time.

Operating systems provide a software platform on top of which other programs, called application programs, can run. The application programs must be written to run on top of a particular operating system. Your choice of operating system, therefore, determines to a great extent the applications you can run. For PCs, the most popular operating systems are DOS, OS/2, and Windows, but others are available, such as Linux.

As a user, you normally interact with the operating system through a set of commands. For example, the DOS operating system contains commands such as COPY and RENAME for copying files and changing the names of files, respectively. The commands are accepted and executed by a part of the operating system called the command processor or command line interpreter. Graphical user interfaces allow you to enter commands by pointing and clicking at objects that appear on the screen.
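As an illustration of what a command processor does at its core, the sketch below implements a toy read-and-dispatch loop in C. The command names echo DOS, but the dispatch logic and messages are purely hypothetical; real interpreters such as COMMAND.COM are far more involved.

    /* Hypothetical sketch of a command processor's read-dispatch loop. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char line[256];
        for (;;) {
            printf("> ");
            fflush(stdout);                       /* show the prompt */
            if (fgets(line, sizeof line, stdin) == NULL)
                break;                            /* end of input */
            line[strcspn(line, "\n")] = '\0';     /* strip the newline */
            if (strcmp(line, "EXIT") == 0)
                break;
            else if (strncmp(line, "COPY", 4) == 0)
                printf("(a real interpreter would copy files here)\n");
            else if (strncmp(line, "RENAME", 6) == 0)
                printf("(a real interpreter would rename a file here)\n");
            else if (line[0] != '\0')
                printf("Bad command or file name\n");
        }
        return 0;
    }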

History of Operating System

The history of computer operating systems recapitulates to a degree the recent history of computer hardware.

Operating systems (OSes) provide a set of functions needed and used by most application programs on a computer, and the necessary linkages for the control and synchronization of the computer's hardware. On the first computers, without an operating system, every program needed the full hardware specification to run correctly and perform standard tasks, as well as its own drivers for peripheral devices like printers and card-readers. The growing complexity of hardware and application programs eventually made operating systems a necessity.


Background

Early computers lacked any form of operating system. The user had sole use of the machine and would arrive armed with program and data, often on punched paper tape. The program would be loaded into the machine, and the machine would be set to work until the program completed or crashed. Programs could generally be debugged via a front panel using switches and lights. It is said that Alan Turing was a master of this on the early Manchester Mark 1 machine, and he was already deriving the primitive conception of an operating system from the principles of the Universal Turing machine.

Later machines came with libraries of support code, which would be linked to the user's program to assist in operations such as input and output. This was the genesis of the modern-day operating system. However, machines still ran a single job at a time; at Cambridge University in England the job queue was at one time a washing line from which tapes were hung with different colored clothes-pegs to indicate job-priority.

As machines became more powerful, the time to run programs diminished and the time to hand off the equipment became very large by comparison. Accounting for and paying for machine usage moved on from checking the wall clock to automatic logging by the computer. Run queues evolved from a literal queue of people at the door, to a heap of media on a jobs-waiting table, or batches of punch-cards stacked one on top of the other in the reader, until the machine itself was able to select and sequence which magnetic tape drives were online. Where program developers had originally had access to run their own jobs on the machine, they were supplanted by dedicated machine operators who looked after the well-being and maintenance of the machine and were less and less concerned with implementing tasks manually. When commercially available computer centers were faced with the implications of data lost through tampering or operational errors, equipment vendors were put under pressure to enhance the runtime libraries to prevent misuse of system resources. Automated monitoring was needed not just for CPU usage but for counting pages printed, cards punched, cards read, disk storage used and for signaling when operator intervention was required by jobs such as changing magnetic tapes.

All these features were building up towards the repertoire of a fully capable operating system. Eventually the runtime libraries became an amalgamated program that was started before the first customer job and could read in the customer job, control its execution, clean up after it, record its usage, and immediately go on to process the next job. Significantly, it became possible for programmers to use symbolic program-code instead of having to hand-encode binary images, once task-switching allowed a computer to perform translation of a program into binary form before running it. These resident background programs, capable of managing multistep processes, were often called monitors or monitor-programs before the term OS established itself.

An underlying program offering basic hardware-management, software-scheduling and resource-monitoring may seem a remote ancestor to the user-oriented OSes of the personal computing era. But there has been a shift in meaning. With the era of commercial computing, more and more "secondary" software was bundled in the OS package, leading eventually to the perception of an OS as a complete user-system with utilities, applications (such as text editors and file managers) and configuration tools, and having an integrated graphical user interface. The true descendant of the early operating systems is what is now called the "kernel". In technical and development circles the old restricted sense of an OS persists because of the continued active development of embedded operating systems for all kinds of devices with a data-processing component, from hand-held gadgets up to industrial robots and real-time control-systems, which do not run user-applications at the front-end. An embedded OS in a device today is not so far removed as one might think from its ancestor of the 1950s.


The mainframe era

It is generally thought that the first operating system used for real work was GM-NAA I/O, produced in 1956 by General Motors' Research division for its IBM 704. Most other early operating systems for IBM mainframes were also produced by customers.

Early operating systems were very diverse, with each vendor or customer producing one or more operating systems specific to their particular mainframe computer. Every operating system, even from the same vendor, could have radically different models of commands, operating procedures, and such facilities as debugging aids. Typically, each time the manufacturer brought out a new machine, there would be a new operating system, and most applications would have to be manually adjusted, recompiled, and retested.

Systems on IBM hardware: The state of affairs continued until the 1960s when IBM, already a leading hardware vendor, stopped work on existing systems and put all its effort into developing the System/360 series of machines, all of which used the same instruction architecture. IBM also intended to develop a single operating system for the new hardware, the OS/360. The problems encountered in the development of the OS/360 are legendary, and are described by Fred Brooks in The Mythical Man-Month, a book that has become a classic of software engineering. Because of performance differences across the hardware range and delays with software development, a whole family of operating systems was introduced instead of a single OS/360.

IBM wound up releasing a series of stop-gaps followed by three longer-lived operating systems:

• OS/MFT for mid-range systems. This had one successor, OS/VS1, which was discontinued in the 1980s.

• OS/MVT for large systems. This was similar in most ways to OS/MFT (programs could be ported between the two without being re-compiled), but had more sophisticated memory management and a time-sharing facility, TSO. MVT had several successors, including the current z/OS.

• DOS/360 for small System/360 models had several successors including the current z/VSE. It was significantly different from OS/MFT and OS/MVT.

• IBM maintained full compatibility with the past, so that programs developed in the sixties can still run under z/VSE (if developed for DOS/360) or z/OS (if developed for OS/MFT or OS/MVT) with no change.

Other mainframe operating systems: Control Data Corporation developed the SCOPE operating system in the 1960s, for batch processing. In cooperation with the University of Minnesota, the KRONOS and later the NOS operating systems were developed during the 1970s, which supported simultaneous batch and timesharing use. Like many commercial timesharing systems, its interface was an extension of the DTSS time sharing system, one of the pioneering efforts in timesharing and programming languages.

In the late 1970s, Control Data and the University of Illinois developed the PLATO system, which used plasma panel displays and long-distance time sharing networks. PLATO was remarkably innovative for its time; the shared memory model of PLATO's TUTOR programming language allowed applications such as real-time chat and multi-user graphical games.

UNIVAC, the first commercial computer manufacturer, produced a series of EXEC operating systems. Like all early mainframe systems, these were batch-oriented systems that managed magnetic drums, disks, card readers and line printers. In the 1970s, UNIVAC produced the Real-Time Basic (RTB) system to support large-scale time sharing, also patterned after the Dartmouth BASIC system.

Burroughs Corporation introduced the B5000 in 1961 with the MCP (Master Control Program) operating system. The B5000 was a stack machine designed to exclusively support high-level languages with no machine language or assembler and indeed the MCP was the first OS to be written exclusively in a high-level language (ESPOL, a dialect of ALGOL). MCP also introduced many other ground-breaking innovations, such as being the first commercial implementation of virtual memory. MCP is still in use today in the Unisys ClearPath/MCP line of computers.

Project MAC at MIT, working with GE, developed Multics and General Electric Comprehensive Operating Supervisor (GECOS), which introduced the concept of ringed security privilege levels. After Honeywell acquired GE's computer business, it was renamed to General Comprehensive Operating System (GCOS).

Digital Equipment Corporation developed many operating systems for its various computer lines, including TOPS-10 and TOPS-20 time sharing systems for the 36-bit PDP-10 class systems. Prior to the widespread use of UNIX, TOPS-10 was a particularly popular system in universities, and in the early ARPANET community.

In the late 1960s through the late 1970s, several hardware capabilities evolved that allowed similar or ported software to run on more than one system. Early systems had utilized microprogramming to implement features on their systems in order to permit different underlying architecture to appear to be the same as others in a series. In fact most 360's after the 360/40 (except the 360/165 and 360/168) were microprogrammed implementations. But soon other means of achieving application compatibility were proven to be more significant.

Minicomputers and the rise of UNIX

The UNIX operating system was developed at AT&T Bell Laboratories in the late 1960s. Because it was essentially free in early editions, easily obtainable, and easily modified, it achieved wide acceptance. It also became a requirement within the Bell operating companies. Since it was written in the high-level C language, when that language was ported to a new machine architecture, UNIX could be ported along with it. This portability permitted it to become the choice for a second generation of minicomputers and the first generation of workstations. By widespread use it exemplified the idea of an operating system that was conceptually the same across various hardware platforms. It was still owned by AT&T, however, and that limited its use to groups or corporations who could afford to license it. It became one of the roots of the open source movement.

Digital Equipment Corporation also created the simple RT-11 system for its 16-bit PDP-11 class machines, and the VMS system for the 32-bit VAX computer.

Another system which evolved in this time frame was the Pick operating system, developed and sold by Microdata Corporation. It is an example of a system which started as a database application support program and graduated to system work.

The case of 8-bit home computers and game consoles


Home computers: Although most small 8-bit home computers of the 1980s, such as the Commodore 64, the Atari 8-bit family, the Amstrad CPC, and the ZX Spectrum series, could use a disk-loading operating system such as CP/M or GEOS, they could generally work without one. In fact, most if not all of these computers shipped with a built-in BASIC interpreter in ROM, which also served as a crude operating system, allowing minimal file management operations (such as deletion and copying) and sometimes disk formatting, along of course with application loading and execution, which sometimes required a non-trivial command sequence, as on the Commodore 64.

The fact that the majority of these machines were bought for entertainment and educational purposes and were seldom used for more "serious" or business/science oriented applications, partly explains why a "true" operating system was not necessary.

Another reason is that they were usually single-task and single-user machines and shipped with minimal amounts of RAM, usually between 4 and 256 kilobytes, with 64 and 128 being common figures, and 8-bit processors, so an operating system's overhead would likely compromise the performance of the machine without really being necessary.

Even the available word processor and integrated software applications were mostly self-contained programs which took over the machine completely, as also did video games.

Game consoles and video games: Since virtually all video game consoles and arcade cabinets designed and built after 1980 were true digital machines (unlike the analog Pong clones and derivatives), some of them carried a minimal form of BIOS or built-in game, such as the ColecoVision, the Sega Master System and the SNK Neo Geo. There were however successful designs where a BIOS was not necessary, such as the Nintendo NES and its clones.

Modern day game consoles and videogames, starting with the PC-Engine, all have a minimal BIOS that also provides some interactive utilities such as memory card management, audio or video CD playback, and copy protection, and sometimes carry libraries for developers to use. Few of these cases, however, would qualify as a "true" operating system.

The most notable exceptions are probably the Dreamcast game console, which includes a minimal BIOS like the PlayStation but can load the Windows CE operating system from the game disk, allowing easy porting of games from the PC world, and the Xbox game console, which is little more than a disguised Intel-based PC running a secret, modified version of Microsoft Windows in the background. Furthermore, there are Linux versions that will run on a Dreamcast and later game consoles as well.

Long before that, Sony had released a kind of development kit called the Net Yaroze for its first PlayStation platform, which provided a series of programming and developing tools to be used with a normal PC and a specially modified "Black PlayStation" that could be interfaced with a PC and download programs from it. These operations require in general a functional OS on both platforms involved.

In general, it can be said that videogame consoles and coin-operated arcade machines used at most a built-in BIOS during the 1970s, 1980s and most of the 1990s, while from the PlayStation era onward they became more and more sophisticated, to the point of requiring a generic or custom-built OS to aid in development and expandability.

The personal computer era: Apple, PC/MS/DR-DOS and beyond

The development of microprocessors made inexpensive computing available for the small business and hobbyist, which in turn led to the widespread use of interchangeable hardware components using a common interconnection (such as the S-100, SS-50, Apple II, ISA, and PCI buses), and an increasing need for 'standard' operating systems to control them. The most important of the early OSes on these machines was Digital Research's CP/M-80 for the 8080 / 8085 / Z-80 CPUs. It was based on several Digital Equipment Corporation operating systems, mostly for the PDP-11 architecture. Microsoft's first operating system, M-DOS, was designed along many of the PDP-11 features, but for microprocessor-based systems. MS-DOS (or PC-DOS when supplied by IBM) was based originally on CP/M-80. Each of these machines had a small boot program in ROM which loaded the OS itself from disk. The BIOS on the IBM-PC class machines was an extension of this idea and has accreted more features and functions in the 20 years since the first IBM-PC was introduced in 1981.

The decreasing cost of display equipment and processors made it practical to provide graphical user interfaces for many operating systems, such as the generic X Window System that is provided with many UNIX systems, or other graphical systems such as Microsoft Windows, the RadioShack Color Computer's OS-9 Level II/MultiVue, Commodore's AmigaOS, Apple's Mac OS, or even IBM's OS/2. The original GUI was developed at Xerox Palo Alto Research Center in the early '70s (the Alto computer system) and imitated by many vendors.

The rise of virtualization

Operating systems originally ran directly on the hardware itself and provided services to applications. With VM/CMS on System/370, IBM introduced the notion of the virtual machine, where the operating system itself runs under the control of a hypervisor instead of being in direct control of the hardware. VMware popularized this technology on personal computers. Over time, the line between virtual machine monitors and operating systems has blurred:

• Hypervisors grew more complex, gaining their own application programming interfaces, memory management and file systems.

• Virtualization has become a key feature of operating systems, as exemplified by Hyper-V in Windows Server 2008 or HP Integrity Virtual Machines in HP-UX.

• In some systems, such as POWER5 and POWER6-based servers from IBM, the hypervisor is no longer optional.

• Applications have been re-designed to run directly on a virtual machine monitor.

• In many ways, virtual machine software today plays the role formerly held by the operating system, including managing the hardware resources (processor, memory, I/O devices), applying scheduling policies, or allowing system administrators to manage the system.

Features

Program execution: The operating system acts as an interface between an application and the hardware. The user interacts with the hardware from "the other side". The operating system is a set of services which simplifies development of applications. Executing a program involves the creation of a process by the operating system. The kernel creates a process by assigning memory and other resources, establishing a priority for the process (in multi-tasking systems), loading program code into memory, and executing the program. The program then interacts with the user and/or other devices, performing its intended function.
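As a sketch of what process creation looks like from a program's point of view, the classic POSIX pattern below asks the kernel to create a process (fork) and load a program image into it (exec); the program launched, /bin/ls, is just a convenient stand-in.

    /* Sketch: creating a process and loading a program into it (POSIX). */
    #include <stdio.h>
    #include <sys/wait.h>   /* waitpid() */
    #include <unistd.h>     /* fork(), execl(), _exit() */

    int main(void) {
        pid_t pid = fork();              /* kernel duplicates this process */
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {                  /* child: replace its image */
            execl("/bin/ls", "ls", "-l", (char *)NULL);
            perror("execl");             /* reached only if exec failed */
            _exit(127);
        }
        int status;
        waitpid(pid, &status, 0);        /* parent waits for the child */
        return 0;
    }

Behind these two calls the kernel does the work described above: it assigns memory, sets a priority, loads the program code, and begins executing it.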


Interrupts: Interrupts are central to operating systems, as they provide an efficient way for the operating system to interact with and react to its environment. The alternative is to have the operating system "watch" the various sources of input for events that require action (polling), which is not a good use of CPU resources. Interrupt-based programming is directly supported by most CPUs. Interrupts provide a computer with a way of automatically running specific code in response to events. Even very basic computers support hardware interrupts, and allow the programmer to specify code which may be run when that event takes place.

When an interrupt is received the computer's hardware automatically suspends whatever program is currently running, saves its status, and runs computer code previously associated with the interrupt. This is analogous to placing a bookmark in a book when someone is interrupted by a phone call and then taking the call. In modern operating systems interrupts are handled by the operating system's kernel. Interrupts may come from either the computer's hardware or from the running program.

When a hardware device triggers an interrupt the operating system's kernel decides how to deal with this event, generally by running some processing code. How much code gets run depends on the priority of the interrupt (for example: a person usually responds to a smoke detector alarm before answering the phone). The processing of hardware interrupts is a task that is usually delegated to software called device drivers, which may be either part of the operating system's kernel, part of another program, or both. Device drivers may then relay information to a running program by various means.

A program may also trigger an interrupt to the operating system. If a program wishes to access hardware for example, it may interrupt the operating system's kernel, which causes control to be passed back to the kernel. The kernel will then process the request. If a program wishes additional resources (or wishes to shed resources) such as memory, it will trigger an interrupt to get the kernel's attention.
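True interrupt handlers live inside the kernel, but POSIX signals give user programs a loosely analogous model: code registered in advance runs automatically when an asynchronous event arrives. The sketch below installs a handler for SIGINT; it is an analogy for illustration, not a real hardware interrupt handler.

    /* Sketch: an asynchronous event handler in user space (POSIX signals),
     * loosely analogous to associating code with an interrupt. */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    static void on_interrupt(int signo) {
        (void)signo;
        got_signal = 1;      /* just record the event; do real work elsewhere */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_handler = on_interrupt;
        sigaction(SIGINT, &sa, NULL);    /* register a handler for Ctrl-C */

        while (!got_signal)
            pause();                     /* sleep until a signal arrives */
        printf("interrupt received, cleaning up\n");
        return 0;
    }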

Protected mode and supervisor mode: Modern CPUs support something called dual mode operation. CPUs with this capability use two modes: protected mode and supervisor mode, which allow certain CPU functions to be controlled and affected only by the operating system kernel. Here, protected mode does not refer specifically to the 80286 (Intel's x86 16-bit microprocessor) CPU feature, although its protected mode is very similar to it. CPUs might have other modes similar to 80286 protected mode as well, such as the virtual 8086 mode of the 80386 (Intel's x86 32-bit microprocessor or i386).

However, the term is used here more generally in operating system theory to refer to all modes which limit the capabilities of programs running in that mode, providing things like virtual memory addressing and limiting access to hardware in a manner determined by a program running in supervisor mode. Similar modes have existed in supercomputers, minicomputers, and mainframes as they are essential to fully supporting UNIX-like multi-user operating systems.

When a computer first starts up, it is automatically running in supervisor mode. The first few programs to run on the computer (the BIOS, the bootloader, and the operating system) have unlimited access to hardware, and this is required because, by definition, initializing a protected environment can only be done outside of one. However, when the operating system passes control to another program, it can place the CPU into protected mode.

In protected mode, programs may have access to a more limited set of the CPU's instructions. A user program may leave protected mode only by triggering an interrupt, causing control to be passed back to the kernel. In this way the operating system can maintain exclusive control over things like access to hardware and memory.
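To make that transition visible, the Linux-specific sketch below uses the raw syscall(2) interface: the trap instruction behind it switches the CPU out of the restricted mode, the kernel performs the operation, and control returns to the program. It is shown only to illustrate the mode switch; portable code would simply call write().

    /* Sketch (Linux-specific): entering the kernel explicitly via syscall(2). */
    #define _GNU_SOURCE
    #include <sys/syscall.h>   /* SYS_write */
    #include <unistd.h>        /* syscall() */

    int main(void) {
        const char msg[] = "entered the kernel and came back\n";
        /* Equivalent to write(1, msg, len), without the libc wrapper.
         * The trap switches to supervisor mode; the return switches back. */
        long n = syscall(SYS_write, 1, msg, sizeof msg - 1);
        return (n < 0) ? 1 : 0;
    }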


The term "protected mode resource" generally refers to one or more CPU registers, which contain information that the running program isn't allowed to alter. Attempts to alter these resources generally causes a switch to supervisor mode, where the operating system can deal with the illegal operation the program was attempting (for example, by killing the program).

Memory management: Among other things, a multiprogramming operating system kernel must be responsible for managing all system memory which is currently in use by programs. This ensures that a program does not interfere with memory already used by another program. Since programs time share, each program must have independent access to memory.

Cooperative memory management, used by many early operating systems, assumes that all programs make voluntary use of the kernel's memory manager and do not exceed their allocated memory. This system of memory management is almost never seen anymore, since programs often contain bugs which can cause them to exceed their allocated memory. If a program fails, it may cause memory used by one or more other programs to be affected or overwritten. Malicious programs or viruses may purposefully alter another program's memory, or may affect the operation of the operating system itself. With cooperative memory management, it takes only one misbehaving program to crash the system.

Memory protection enables the kernel to limit a process' access to the computer's memory. Various methods of memory protection exist, including memory segmentation and paging. All methods require some level of hardware support (such as the 80286 MMU) which doesn't exist in all computers.

In both segmentation and paging, certain protected mode registers specify to the CPU what memory address it should allow a running program to access. Attempts to access other addresses will trigger an interrupt which will cause the CPU to re-enter supervisor mode, placing the kernel in charge. This is called a segmentation violation or Seg-V for short, and since it is both difficult to assign a meaningful result to such an operation, and because it is usually a sign of a misbehaving program, the kernel will generally resort to terminating the offending program, and will report the error.
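The toy program below provokes exactly this sequence on a typical POSIX system: the bad store traps to the kernel, which, finding no legitimate mapping, terminates the process with SIGSEGV. The faulting address is arbitrary; do not expect the final print to run.

    /* Sketch: provoking a segmentation violation (Seg-V). */
    #include <stdio.h>

    int main(void) {
        volatile int *bad = (volatile int *)0x1;  /* address we may not touch */
        printf("about to fault...\n");
        *bad = 42;               /* CPU traps; kernel delivers SIGSEGV */
        printf("never reached\n");
        return 0;
    }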

Windows 3.1 through Windows Me had some level of memory protection, but programs could easily circumvent it. Under Windows 9x all MS-DOS applications ran in supervisor mode, giving them almost unlimited control over the computer. A general protection fault would be produced, indicating a segmentation violation had occurred; however, the system would often crash anyway.

In most Linux systems, part of the hard disk is reserved for virtual memory when the operating system is installed. This part is known as swap space. Windows systems use a swap file instead of a partition.

Virtual memory: The use of virtual memory addressing (such as paging or segmentation) means that the kernel can choose what memory each program may use at any given time, allowing the operating system to use the same memory locations for multiple tasks.

If a program tries to access memory that isn't in its current range of accessible memory, but nonetheless has been allocated to it, the kernel will be interrupted in the same way as it would if the program were to exceed its allocated memory. (See section on memory management.) Under UNIX this kind of interrupt is referred to as a page fault.


When the kernel detects a page fault it will generally adjust the virtual memory range of the program which triggered it, granting it access to the memory requested. This gives the kernel discretionary power over where a particular application's memory is stored, or even whether or not it has actually been allocated yet.
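One visible consequence of this discretionary power is lazy allocation: the kernel can hand out a large virtual range and attach physical pages only on first touch. The sketch below illustrates the idea with an anonymous POSIX mmap; MAP_ANONYMOUS and the exact paging behavior are platform-dependent (this is written against Linux/BSD conventions).

    /* Sketch: lazy allocation via an anonymous mapping. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 1UL << 30;  /* 1 GiB of virtual address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        p[0] = 'a';              /* first touch: page fault, kernel maps a page */
        p[len - 1] = 'z';        /* another fault at the far end */
        printf("touched two pages of a 1 GiB mapping\n");
        munmap(p, len);
        return 0;
    }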

In modern operating systems, application memory which is accessed less frequently can be temporarily stored on disk or other media to make that space available for use by other programs. This is called swapping, as an area of memory can be used by multiple programs, and what that memory area contains can be swapped or exchanged on demand.

Multitasking: Multitasking refers to the running of multiple independent computer programs on the same computer; giving the appearance that it is performing the tasks at the same time. Since most computers can do at most one or two things at one time, this is generally done via time sharing, which means that each program uses a share of the computer's time to execute.

An operating system kernel contains a piece of software called a scheduler, which determines how much time each program will spend executing and in which order execution control should be passed to programs. Control is passed to a process by the kernel, which allows the program access to the CPU and memory. At a later time control is returned to the kernel through some mechanism, so that another program may be allowed to use the CPU. This passing of control between the kernel and applications is called a context switch.

An early model which governed the allocation of time to programs was called cooperative multitasking. In this model, when control is passed to a program by the kernel, it may execute for as long as it wants before explicitly returning control to the kernel. This means that a malicious or malfunctioning program may not only prevent any other programs from using the CPU, but it can hang the entire system if it enters an infinite loop.

The philosophy governing preemptive multitasking is that of ensuring that all programs are given regular time on the CPU. This implies that all programs must be limited in how much time they are allowed to spend on the CPU without being interrupted. To accomplish this, modern operating system kernels make use of a timed interrupt. A protected mode timer is set by the kernel which triggers a return to supervisor mode after the specified time has elapsed. (See above sections on Interrupts and Dual Mode Operation.)
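As a user-space analogy to that timed interrupt, the sketch below arms a POSIX interval timer that "preempts" a busy loop with SIGALRM every 100 ms; a real kernel does the equivalent with a hardware timer and a context switch, so this is illustrative only.

    /* Sketch: a timed interrupt in miniature (POSIX interval timer). */
    #include <signal.h>
    #include <stdio.h>
    #include <sys/time.h>

    static volatile sig_atomic_t ticks = 0;

    static void on_tick(int signo) {
        (void)signo;
        ticks++;                 /* a scheduler would pick the next task here */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_handler = on_tick;
        sigaction(SIGALRM, &sa, NULL);

        /* Fire every 100 ms: it_interval is the period, it_value the delay. */
        struct itimerval tv = { {0, 100000}, {0, 100000} };
        setitimer(ITIMER_REAL, &tv, NULL);

        while (ticks < 10)       /* "run" until ten time slices elapse */
            ;                    /* busy work, interrupted on every tick */
        printf("preempted %d times\n", (int)ticks);
        return 0;
    }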

On many single-user operating systems cooperative multitasking is perfectly adequate, as home computers generally run a small number of well-tested programs. Windows NT was the first version of Microsoft Windows to enforce preemptive multitasking, but it didn't reach the home user market until Windows XP (since Windows NT was targeted at professionals).

Kernel Preemption: In recent years concerns have arisen because of the long latencies often associated with some kernel run-times, sometimes on the order of 100 ms or more in systems with monolithic kernels. These latencies often produce noticeable slowness in desktop systems, and can prevent operating systems from performing time-sensitive operations such as audio recording and some communications.

Modern operating systems extend the concepts of application preemption to device drivers and kernel code, so that the operating system has preemptive control over internal run-times as well. Under Windows Vista, the introduction of the Windows Display Driver Model (WDDM) accomplishes this for display drivers, and in Linux, the preemptable kernel model introduced in version 2.6 allows all device drivers and some other parts of kernel code to take advantage of preemptive multi-tasking.


Under Windows prior to Windows Vista, and under Linux prior to version 2.6, all driver execution was co-operative, meaning that if a driver entered an infinite loop it would freeze the system.

Disk access and file systems: Access to files stored on disks is a central feature of all operating systems. Computers store data on disks using files, which are structured in specific ways in order to allow for faster access, higher reliability, and to make better use out of the drive's available space. The specific way in which files are stored on a disk is called a file system, and enables files to have names and attributes. It also allows them to be stored in a hierarchy of directories or folders arranged in a directory tree.

Early operating systems generally supported a single type of disk drive and only one kind of file system. Early file systems were limited in their capacity, speed, and in the kinds of file names and directory structures they could use. These limitations often reflected limitations in the operating systems they were designed for, making it very difficult for an operating system to support more than one file system.

While many simpler operating systems support a limited range of options for accessing storage systems, operating systems like UNIX and Linux support a technology known as a virtual file system, or VFS. An operating system like UNIX allows a wide array of storage devices, regardless of their design or file systems, to be accessed through a common application programming interface (API). This makes it unnecessary for programs to have any knowledge about the device they are accessing. A VFS allows the operating system to provide programs with access to an unlimited number of devices, with an infinite variety of file systems installed on them, through the use of specific device drivers and file system drivers.

A connected storage device such as a hard drive is accessed through a device driver. The device driver understands the specific language of the drive and is able to translate that language into a standard language used by the operating system to access all disk drives. On UNIX this is the language of block devices.

When the kernel has an appropriate device driver in place, it can then access the contents of the disk drive in raw format, which may contain one or more file systems. A file system driver is used to translate the commands used to access each specific file system into a standard set of commands that the operating system can use to talk to all file systems. Programs can then deal with these file systems on the basis of filenames, and directories/folders, contained within a hierarchical structure. They can create, delete, open, and close files, as well as gather various information about them, including access permissions, size, free space, and creation and modification dates.
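From a program's point of view this uniformity means the same handful of calls works on any mounted file system. The POSIX sketch below opens a file and reads its size and contents; the path is a placeholder and could just as well live on ext3, NTFS via a driver, or a CD-ROM.

    /* Sketch: one API, any file system. The VFS routes these calls to
     * whichever file system driver owns the path. */
    #include <fcntl.h>      /* open() */
    #include <stdio.h>
    #include <sys/stat.h>   /* fstat() */
    #include <unistd.h>     /* read(), close() */

    int main(void) {
        /* Placeholder path: could be ext3, NTFS, ISO 9660, NFS, ... */
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        struct stat st;
        fstat(fd, &st);                        /* size, permissions, dates */
        printf("size: %lld bytes\n", (long long)st.st_size);

        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf - 1);
        if (n > 0) {
            buf[n] = '\0';
            printf("contents: %s", buf);
        }
        close(fd);
        return 0;
    }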

Various differences between file systems make supporting all file systems difficult. Allowed characters in file names, case sensitivity, and the presence of various kinds of file attributes make the implementation of a single interface for every file system a daunting task. Operating systems tend to recommend the use of (and so support natively) file systems specifically designed for them; for example, NTFS in Windows and ext3 and ReiserFS in Linux. However, in practice, third-party drivers are usually available to give support for the most widely used file systems in most general-purpose operating systems (for example, NTFS is available in Linux through NTFS-3g, and ext2/3 and ReiserFS are available in Windows through FS-driver and rfstool).

Device drivers: A device driver is a specific type of computer software developed to allow interaction with hardware devices. Typically this constitutes an interface for communicating with the device through the specific computer bus or communications subsystem that the hardware is connected to, providing commands to and/or receiving data from the device, and, on the other end, the requisite interfaces to the operating system and software applications. It is a specialized, hardware-dependent computer program, also specific to an operating system, that enables another program (typically an operating system, an applications software package, or a program running under the operating system kernel) to interact transparently with a hardware device, and it usually provides the interrupt handling required for asynchronous, time-dependent hardware interfacing.

The key design goal of device drivers is abstraction. Every model of hardware (even within the same class of device) is different. Newer models that provide more reliable or better performance are also released by manufacturers, and these newer models are often controlled differently. Computers and their operating systems cannot be expected to know how to control every device, both now and in the future. To solve this problem, OSes essentially dictate how every type of device should be controlled. The function of the device driver is then to translate these OS-mandated function calls into device-specific calls. In theory a new device, which is controlled in a new manner, should function correctly if a suitable driver is available. This new driver will ensure that the device appears to operate as usual from the operating system's point of view.
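A common way an OS "dictates how every type of device should be controlled" is an operations table: a struct of function pointers that every driver of a given class must fill in. The sketch below is loosely modeled on tables such as Linux's file_operations; all of the names (block_device_ops, mydisk_*) are illustrative inventions.

    /* Sketch: the OS defines a uniform operations table; each driver
     * supplies its own implementations. All names here are hypothetical. */
    #include <stddef.h>
    #include <stdio.h>

    struct block_device_ops {                 /* contract dictated by the OS */
        int (*open)(void);
        int (*read_block)(size_t lba, void *buf);
    };

    /* One particular driver's implementations for one model of disk. */
    static int mydisk_open(void) {
        puts("mydisk: open");
        return 0;
    }
    static int mydisk_read(size_t lba, void *buf) {
        (void)buf;
        printf("mydisk: read block %zu\n", lba);
        return 0;
    }

    static const struct block_device_ops mydisk_ops = {
        .open = mydisk_open,
        .read_block = mydisk_read,
    };

    int main(void) {
        /* The kernel calls only through the table, never the driver
         * functions directly, so any conforming device "appears to
         * operate as usual". */
        const struct block_device_ops *dev = &mydisk_ops;
        char buf[512];
        dev->open();
        dev->read_block(0, buf);
        return 0;
    }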

Networking: Currently most operating systems support a variety of networking protocols, hardware, and applications for using them. This means that computers running dissimilar operating systems can participate in a common network for sharing resources such as computing, files, printers, and scanners using either wired or wireless connections. Networks can essentially allow a computer's operating system to access the resources of a remote computer to support the same functions as it could if those resources were connected directly to the local computer. This includes everything from simple communication, to using networked file systems or even sharing another computer's graphics or sound hardware. Some network services allow the resources of a computer to be accessed transparently, such as SSH which allows networked users direct access to a computer's command line interface.

Client/server networking involves a program on one computer, the client, connecting via a network to another computer, called a server. Servers, usually running UNIX or Linux, offer (or host) various services to other network computers and users. These services are usually provided through ports, or numbered access points, beyond the server's network address. Each port number is usually associated with at most one running program, which is responsible for handling requests to that port. Such a program is called a daemon; being a user program, it can in turn access the local hardware resources of that computer by passing requests to the operating system kernel.
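The sketch below is a minimal daemon of the kind just described: it binds to a port, waits for one connection, and answers it using ordinary kernel services. The port number 9090 and the reply text are arbitrary choices, and error handling is pared down for brevity.

    /* Sketch: a minimal TCP "daemon" bound to one port (POSIX sockets). */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int srv = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local interface */
        addr.sin_port = htons(9090);               /* the service's port */

        if (bind(srv, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("bind");
            return 1;
        }
        listen(srv, 8);                        /* queue up to 8 clients */

        int client = accept(srv, NULL, NULL);  /* block until a request */
        const char reply[] = "hello from the daemon\n";
        write(client, reply, sizeof reply - 1);/* service the request */
        close(client);
        close(srv);
        return 0;
    }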

Many operating systems support one or more vendor-specific or open networking protocols as well; for example, SNA on IBM systems, DECnet on systems from Digital Equipment Corporation, and Microsoft-specific protocols (SMB) on Windows. Specific protocols for specific tasks may also be supported, such as NFS for file access. Protocols like ESound (esd) can easily be extended over the network to provide sound from local applications on a remote system's sound hardware.

Security: A computer being secure depends on a number of technologies working properly. A modern operating system provides access to a number of resources, which are available to software running on the system, and to external devices like networks via the kernel.

The operating system must be capable of distinguishing between requests which should be allowed to be processed and others which should not be processed. While some systems may simply distinguish between "privileged" and "non-privileged", systems commonly have a form of requester identity, such as a user name. To establish identity there may be a process of authentication. Often a username must be quoted, and each username may have a password. Other methods of authentication, such as magnetic cards or biometric data, might be used instead. In some cases, especially connections from the network, resources may be accessed with no authentication at all (such as reading files over a network share). Also covered by the concept of requester identity is authorization: the particular services and resources accessible by the requester once logged into a system, tied either to the requester's user account or to the variously configured groups of users to which the requester belongs.

In addition to the allow/disallow model of security, a system with a high level of security will also offer auditing options. These would allow tracking of requests for access to resources (such as "who has been reading this file?"). Internal security, or security from an already running program, is only possible if all possibly harmful requests must be carried out through interrupts to the operating system kernel. If programs can directly access hardware and resources, they cannot be secured.

External security involves a request from outside the computer, such as a login at a connected console or some kind of network connection. External requests are often passed through device drivers to the operating system's kernel, where they can be passed onto applications, or carried out directly. Security of operating systems has long been a concern because of highly sensitive data held on computers, both of a commercial and military nature. The United States Government Department of Defense (DoD) created the Trusted Computer System Evaluation Criteria (TCSEC) which is a standard that sets basic requirements for assessing the effectiveness of security. This became of vital importance to operating system makers, because the TCSEC was used to evaluate, classify and select computer systems being considered for the processing, storage and retrieval of sensitive or classified information.

Network services include offerings such as file sharing, print services, email, web sites, and file transfer protocols (FTP), most of which can have compromised security. At the front line of security are hardware devices known as firewalls or intrusion detection/prevention systems. At the operating system level, there are a number of software firewalls available, as well as intrusion detection/prevention systems. Most modern operating systems include a software firewall, which is enabled by default. A software firewall can be configured to allow or deny network traffic to or from a service or application running on the operating system. Therefore, one can install and be running an insecure service, such as Telnet or FTP, and not have to be threatened by a security breach because the firewall would deny all traffic trying to connect to the service on that port.

An alternative strategy, and the only sandbox strategy available in systems that do not meet the Popek and Goldberg virtualization requirements, is for the operating system not to run user programs as native code, but instead to either emulate a processor or provide a host for a p-code based system such as Java.

Internal security is especially relevant for multi-user systems; it allows each user of the system to have private files that the other users cannot tamper with or read. Internal security is also vital if auditing is to be of any use, since a program can potentially bypass the operating system, inclusive of bypassing auditing.

Example: Microsoft Windows: While the Windows 9x series offered the option of having profiles for multiple users, they had no concept of access privileges and did not allow concurrent access, and so were not true multi-user operating systems. In addition, they implemented only partial memory protection. They were accordingly widely criticised for lack of security.

The Windows NT series of operating systems, by contrast, are true multi-user, and implement absolute memory protection. However, many of the advantages of being a true multi-user operating system were nullified by the fact that, prior to Windows Vista, the first user account created during the setup process was an administrator account, which was also the default for new accounts. Though Windows XP did have limited accounts, the majority of home users did not change to an account type with fewer rights – partially due to the number of programs which unnecessarily required administrator rights – and so most home users ran as administrator all the time.

Windows Vista changes this by introducing a privilege elevation system called User Account Control. When logging in as a standard user, a logon session is created and a token containing only the most basic privileges is assigned. In this way, the new logon session is incapable of making changes that would affect the entire system. When logging in as a user in the Administrators group, two separate tokens are assigned. The first token contains all privileges typically awarded to an administrator, and the second is a restricted token similar to what a standard user would receive. User applications, including the Windows Shell, are then started with the restricted token, resulting in a reduced privilege environment even under an Administrator account. When an application requests higher privileges or "Run as administrator" is clicked, UAC will prompt for confirmation and, if consent is given (including administrator credentials if the account requesting the elevation is not a member of the administrators group), start the process using the unrestricted token.

Example: Linux/Unix: Linux and UNIX both have two-tier security, which limits any system-wide changes to the root user, a special user account on all UNIX-like systems. While the root user has virtually unlimited permission to affect system changes, programs running as a regular user are limited in where they can save files, what hardware they can access, etc. In many systems, a user's memory usage, their selection of available programs, their total disk usage or quota, available range of programs' priority settings, and other functions can also be locked down. This provides the user with plenty of freedom to do what needs to be done, without being able to put any part of the system in jeopardy (barring accidental triggering of system-level bugs) or make sweeping, system-wide changes. The user's settings are stored in an area of the computer's file system called the user's home directory, which is also provided as a location where the user may store their work, a concept later adopted by Windows as the 'My Documents' folder. Should a user have to install software outside of their home directory or make system-wide changes, they must become the root user temporarily, usually with the su or sudo command, which is answered with the computer's root password when prompted. Some systems (such as Ubuntu and its derivatives) are configured by default to allow select users to run programs as the root user via the sudo command, using the user's own password for authentication instead of the system's root password. One is sometimes said to "go root" or "drop to root" when elevating oneself to root access.
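As a tiny concrete illustration of the two-tier model, the C sketch below attempts to open a root-only file. Run as a regular user, the kernel refuses; run as root, it succeeds. /etc/shadow is the customary example on Linux systems, but any root-restricted file would do.

    /* Sketch: observing two-tier UNIX security from a program.
     * /etc/shadow is readable only by root on typical Linux systems. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        printf("running as uid %d\n", (int)getuid());   /* 0 means root */
        int fd = open("/etc/shadow", O_RDONLY);
        if (fd < 0) {
            printf("open failed: %s (expected for a regular user)\n",
                   strerror(errno));
        } else {
            printf("open succeeded (running with root privileges)\n");
            close(fd);
        }
        return 0;
    }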

File system support in modern operating systems: Support for file systems is highly varied among modern operating systems, although there are several common file systems for which almost all operating systems include support and drivers.

Solaris: In earlier releases, the Sun Microsystems Solaris operating system defaulted to non-journaled (non-logging) UFS for bootable and supplementary file systems. Solaris (like most operating systems based upon open standards and/or open source) defaulted to, supported, and extended UFS.

Support for other file systems and significant enhancements were added over time, including Veritas Software Corp.'s journaling VxFS, Sun's clustering QFS, Sun's journaling UFS, and Sun's ZFS (open source, poolable, 128-bit, compressing, and error-correcting).

Kernel extensions were added to Solaris to allow for bootable Veritas VxFS operation. Logging, or journaling, was added to UFS in Sun's Solaris 7. Releases of Solaris 10, Solaris Express, OpenSolaris, and other open-source variants of the Solaris operating system later supported bootable ZFS.


Logical Volume Management allows a file system to span multiple devices for the purpose of adding redundancy, capacity, and/or throughput. Legacy environments in Solaris may use Solaris Volume Manager (formerly known as Solstice DiskSuite); multiple operating systems (including Solaris) may use Veritas Volume Manager. Modern Solaris-based operating systems obviate the need for volume management by leveraging virtual storage pools in ZFS.

Linux: Many Linux distributions support some or all of ext2, ext3, ext4, ReiserFS, Reiser4, JFS, XFS, GFS, GFS2, OCFS, OCFS2, and NILFS. The ext file systems, namely ext2, ext3 and ext4, are based on the original Linux file system. Other file systems have been developed by companies to meet their specific needs, by hobbyists, or adapted from UNIX, Microsoft Windows, and other operating systems. Linux has full support for XFS and JFS, along with FAT (the MS-DOS file system) and HFS, the primary file system of the classic Macintosh.

In recent years support for Microsoft Windows NT's NTFS file system has appeared in Linux, and is now comparable to the support available for other native UNIX file systems. ISO 9660 and Universal Disk Format (UDF), the standard file systems used on CDs, DVDs, and Blu-ray discs, are also supported. It is possible to install Linux on the majority of these file systems. Unlike other operating systems, Linux and UNIX allow any file system to be used regardless of the medium it is stored on, whether a hard drive, a disc (CD, DVD, ...), a USB key, or even a file located on another file system.

Microsoft Windows: Microsoft Windows currently supports the NTFS and FAT file systems, along with network file systems shared from other computers and the ISO 9660 and UDF file systems used for CDs, DVDs, and other optical discs such as Blu-ray. Under Windows each file system is usually limited in application to certain media; for example, CDs must use ISO 9660 or UDF, and as of Windows Vista, NTFS is the only file system on which the operating system can be installed. Windows Embedded CE 6.0, Windows Vista Service Pack 1, and Windows Server 2008 support exFAT, a file system more suitable for flash drives.

Mac OS X: Mac OS X supports HFS+ with journaling as its primary file system, derived from the Hierarchical File System of the earlier Mac OS. Mac OS X has facilities to read and write FAT, UDF, and other file systems, and to read NTFS (the open-source, cross-platform NTFS-3G implementation provides read-write NTFS support for Mac OS X users), but it cannot be installed on them. Due to its UNIX heritage, Mac OS X now supports virtually all the file systems supported by the UNIX VFS. Recently Apple Inc. started work on porting Sun Microsystems' ZFS file system to Mac OS X, and preliminary support is already available in Mac OS X 10.5.

Special-purpose file systems: FAT file systems are commonly found on floppy disks, flash memory cards, digital cameras, and many other portable devices because of their relative simplicity. FAT's performance compares poorly to most other file systems, as it uses overly simplistic data structures that make file operations time-consuming, and it makes poor use of disk space in situations where many small files are present. ISO 9660 and Universal Disk Format are two common formats that target compact discs and DVDs. Mount Rainier is a newer extension to UDF, supported by Linux 2.6 kernels and Windows Vista, that facilitates rewriting to DVDs in the same fashion as has long been possible with floppy disks.

Journaled file systems: File systems may provide journaling, which provides safe recovery in the event of a system crash. A journaled file system writes some information twice: first to the journal, which is a log of file system operations, then to its proper place in the ordinary file system. Journaling is handled by the file system driver, which keeps track of each operation that changes the contents of the disk. In the event of a crash, the system can recover to a consistent state by replaying a portion of the journal. Many Linux and UNIX file systems provide journaling, including ReiserFS, JFS, and ext3.
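The write-ahead idea behind journaling can be sketched in a few lines of C. This is a toy illustration, not how any real driver is structured: journal_append() and fs_apply() are made-up stand-ins that merely print what a driver would durably log and then apply in place.

    #include <stdio.h>

    /* Hypothetical stand-ins for the driver's real work. */
    static void journal_append(const char *entry) {
        printf("journal: %s\n", entry);   /* would be a durable log write plus a flush */
    }

    static void fs_apply(const char *entry) {
        printf("apply:   %s\n", entry);   /* would be the in-place update */
    }

    int main(void) {
        /* 1. Record the intended change in the journal first, then commit it. */
        journal_append("write block 42");
        journal_append("commit");
        /* 2. Only then update the block in its proper place. A crash before the
         * commit record means recovery discards the change; a crash after it
         * means recovery replays the change. Either way the file system ends
         * up in a consistent state. */
        fs_apply("write block 42");
        return 0;
    }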

In contrast, non-journaled file systems typically need to be examined in their entirety by a utility such as fsck or chkdsk for any inconsistencies after an unclean shutdown. Soft updates is an alternative to journaling that avoids the redundant writes by carefully ordering the update operations. Log-structured file systems and ZFS also differ from traditional journaled file systems in that they avoid inconsistencies by always writing new copies of the data, eschewing in-place updates.

Graphical user interfaces: Most modern computer systems support graphical user interfaces (GUI), and often include them. In some computer systems, such as the original implementations of Microsoft Windows and the Mac OS, the GUI is integrated into the kernel.

While technically a graphical user interface is not an operating system service, incorporating support for one into the operating system kernel can allow the GUI to be more responsive by reducing the number of context switches required for the GUI to perform its output functions. Other operating systems are modular, separating the graphics subsystem from the kernel and the rest of the operating system. In the 1980s, UNIX, VMS and many others were built this way, and Linux and Mac OS X are also built this way. Modern releases of Microsoft Windows such as Windows Vista implement a graphics subsystem that is mostly in user space, whereas in the versions between Windows NT 4.0 and Windows Server 2003 the graphics drawing routines exist mostly in kernel space. Windows 9x had very little distinction between the interface and the kernel.

Many computer operating systems allow the user to install or create any user interface they desire. The X Window System in conjunction with GNOME or KDE is a commonly found setup on most Unix and Unix-like (BSD, Linux, Minix) systems. A number of Windows shell replacements have been released for Microsoft Windows, offering alternatives to the included Windows shell, but the shell itself cannot be separated from Windows.

Numerous Unix-based GUIs have existed over time, most derived from X11. Competition among the various vendors of Unix (HP, IBM, Sun) led to much fragmentation, and an effort to standardize in the 1990s on COSE and CDE largely failed, eventually eclipsed by the widespread adoption of GNOME and KDE. Prior to open-source toolkits and desktop environments, Motif was the prevalent toolkit/desktop combination (and was the basis upon which CDE was developed).

Graphical user interfaces evolve over time. For example, Windows has modified its user interface almost every time a new major version of Windows is released, and the Mac OS GUI changed dramatically with the introduction of Mac OS X in 1999.

Examples of Operating Systems

Microsoft Windows

Microsoft Windows is a family of proprietary operating systems that originated as an add-on to the older MS-DOS operating system for the IBM PC. Modern versions are based on the newer Windows NT kernel that was originally intended for OS/2. Windows runs on x86, x86-64 and Itanium processors. Earlier versions also ran on the DEC Alpha, MIPS, Fairchild (later Intergraph) Clipper and PowerPC architectures (some work was done to port it to the SPARC architecture).

As of June 2008, Microsoft Windows holds a large share of the worldwide desktop market. Windows is also used on servers, supporting applications such as web servers and database servers. In recent years, Microsoft has spent significant marketing and research & development money to demonstrate that Windows is capable of running any enterprise application, which has resulted in consistent price/performance records (see the TPC) and significant acceptance in the enterprise market.

The most widely used version of the Microsoft Windows family is Windows XP, released on October 25, 2001.

In November 2006, after more than five years of development work, Microsoft released Windows Vista, a major new version of the Microsoft Windows family containing a large number of new features and architectural changes. Chief amongst these are a new user interface and visual style called Windows Aero, a number of new security features such as User Account Control, and a few new multimedia applications such as Windows DVD Maker. A server variant based on the same kernel, Windows Server 2008, was released in early 2008.

Windows 7 is currently under development; Microsoft has stated that it intends to scope its development to a three-year timeline, placing its release sometime after mid-2009.

UNIX and UNIX-like operating systems

Ken Thompson wrote B, mainly based on BCPL, which he used to write Unix, based on his experience in the MULTICS project. B was replaced by C, and Unix developed into a large, complex family of inter-related operating systems which have influenced every modern operating system.

The Unix-like family is a diverse group of operating systems, with several major sub-categories including System V, BSD, and Linux. The name "UNIX" is a trademark of The Open Group which licenses it for use with any operating system that has been shown to conform to their definitions. "Unix-like" is commonly used to refer to the large set of operating systems which resemble the original Unix.

Unix-like systems run on a wide variety of machine architectures. They are used heavily for servers in business, as well as workstations in academic and engineering environments. Free software Unix variants, such as GNU, Linux and BSD, are popular in these areas.

Market share statistics for freely available operating systems are usually inaccurate, since most free operating systems are not purchased, making usage under-represented. On the other hand, market share statistics based on total downloads of free operating systems are often inflated, as there is no economic disincentive to acquiring multiple copies, so users can download several systems, test them, and decide which they like best.

Some Unix variants like HP's HP-UX and IBM's AIX are designed to run only on that vendor's hardware. Others, such as Solaris, can run on multiple types of hardware, including x86 servers and PCs. Apple's Mac OS X, a hybrid kernel-based BSD variant derived from NeXTSTEP, Mach, and FreeBSD, has replaced Apple's earlier (non-Unix) Mac OS.

Unix interoperability was sought by establishing the POSIX standard. The POSIX standard can be applied to any operating system, although it was originally created for various Unix variants.

Mac OS X

Mac OS X is a line of proprietary, graphical operating systems developed, marketed, and sold by Apple Inc., the latest of which is pre-loaded on all currently shipping Macintosh computers. Mac OS X is the successor to the original Mac OS, which had been Apple's primary operating system since 1984. Unlike its predecessor, Mac OS X is a UNIX operating system built on technology that had been developed at NeXT through the second half of the 1980s and up until Apple purchased the company in early 1997.

The operating system was first released in 1999 as Mac OS X Server 1.0, with a desktop-oriented version (Mac OS X v10.0) following in March 2001. Since then, five more distinct "end-user" and "server" editions of Mac OS X have been released, the most recent being Mac OS X v10.5, which was first made available in October 2007. Releases of Mac OS X are named after big cats; Mac OS X v10.5 is usually referred to by Apple and users as "Leopard".

The server edition, Mac OS X Server, is architecturally identical to its desktop counterpart but usually runs on Apple's line of Macintosh server hardware. Mac OS X Server includes work group management and administration software tools that provide simplified access to key network services, including a mail transfer agent, a Samba server, an LDAP server, a domain name server, and others.

Plan 9

Ken Thompson, Dennis Ritchie and Douglas McIlroy at Bell Labs designed and developed the C programming language to build the operating system Unix. Programmers at Bell Labs went on to develop Plan 9 and Inferno, which were engineered for modern distributed environments. Plan 9 was designed from the start to be a networked operating system, and had graphics built-in, unlike Unix, which added these features to the design later. Plan 9 has yet to become as popular as Unix derivatives, but it has an expanding community of developers. It is currently released under the Lucent Public License. Inferno was sold to Vita Nuova Holdings and has been released under a GPL/MIT license.

Real-time operating systems

A real-time operating system (RTOS) is a multitasking operating system intended for applications with fixed deadlines (real-time computing). Such applications include some small embedded systems, automobile engine controllers, industrial robots, spacecraft, industrial control, and some large-scale computing systems.

An early example of a large-scale real-time operating system was Transaction Processing Facility developed by American Airlines and IBM for the Sabre Airline Reservations System.

Embedded systems

Embedded systems use a variety of dedicated operating systems. In some cases, the "operating system" software is directly linked to the application to produce a monolithic special-purpose program. In the simplest embedded systems, there is no distinction between the OS and the application. Embedded systems that have fixed deadlines use a real-time operating system such as VxWorks, eCos, QNX, MontaVista Linux and RTLinux.

Some embedded systems use operating systems such as Symbian OS, Palm OS, Windows CE, BSD, and Linux, although such operating systems do not support real-time computing.

Windows CE shares APIs similar to desktop Windows' but shares none of its codebase.

Hobby development


Operating system development, or OSDev for short, has a large, almost cult-like hobbyist following, and some operating systems, such as Linux, have grown out of hobby projects. The design and implementation of an operating system requires skill and determination, and the term can cover anything from a basic "Hello World" boot loader to a fully featured kernel. A classic example is the Minix operating system, designed by A. S. Tanenbaum as a teaching tool but heavily used by hobbyists before Linux eclipsed it in popularity.

Other

Older operating systems which are still used in niche markets include OS/2 from IBM; Mac OS, the non-Unix precursor to Apple's Mac OS X; BeOS; and XTS-300. Some, most notably AmigaOS 4 and RISC OS, continue to be developed as minority platforms for enthusiast communities and specialist applications. OpenVMS, formerly from DEC, is still under active development by Hewlett-Packard. There were also a number of operating systems for 8-bit computers, such as Apple's DOS (Disk Operating System) 3.2 and 3.3 for the Apple ][, ProDOS, UCSD, and CP/M, which was available for various 8- and 16-bit environments.

Research and development of new operating systems continues. GNU Hurd is designed to be backwards compatible with Unix, but with enhanced functionality and a microkernel architecture. Singularity is a project at Microsoft Research to develop an operating system with better memory protection, based on the .NET managed code model. Systems development follows the same model used by other software development, involving maintainers, version control "trees", forks, "patches", and specifications. Following the AT&T-Berkeley lawsuit and the Unix wars, new unencumbered systems were based on 4.4BSD, which forked into the FreeBSD and NetBSD efforts to replace the missing code. More recent forks include DragonFly BSD and Darwin, both derived from BSD Unix.


CHAPTER 2

OPERATING SYSTEM STRUCTURE

System Components

Even though not all systems have the same structure, many modern operating systems share the goal of supporting the following types of system components.

Process Management

The operating system manages many kinds of activities, ranging from user programs to system programs like the printer spooler, name servers, file server, and so on. Each of these activities is encapsulated in a process. A process includes the complete execution context (code, data, PC, registers, OS resources in use, etc.).

It is important to note that a process is not a program. A process is only one instance of a program in execution; many processes can be running the same program. The five major activities of an operating system in regard to process management are:

• Creation and deletion of user and system processes.

• Suspension and resumption of processes.

• A mechanism for process synchronization.

• A mechanism for process communication.

• A mechanism for deadlock handling.

Main-Memory Management

Primary memory or main memory is a large array of words or bytes, each with its own address. Main memory provides storage that can be accessed directly by the CPU: for a program to be executed, it must be in main memory.

The major activities of an operating system in regard to memory management are:

• Keep track of which parts of memory are currently being used and by whom.

• Decide which processes are loaded into memory when memory space becomes available.

• Allocate and deallocate memory space as needed.

File Management

A file is a collection of related information defined by its creator. Computers can store files on disk (secondary storage), which provides long-term storage. Some examples of storage media are magnetic tape, magnetic disk, and optical disk. Each of these media has its own properties, such as speed, capacity, data transfer rate, and access method.


A file system is normally organized into directories to ease use. These directories may contain files and other directories.

The five major activities of an operating system in regard to file management are:

1. The creation and deletion of files.

2. The creation and deletion of directories.

3. The support of primitives for manipulating files and directories.

4. The mapping of files onto secondary storage.

5. The backup of files on stable storage media.

I/O System Management

The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the device driver knows the peculiarities of the specific device to which it is assigned.

Secondary-Storage Management

Generally speaking, systems have several levels of storage, including primary storage, secondary storage, and cache storage. Instructions and data must be placed in primary storage or cache to be referenced by a running program. Because main memory is too small to accommodate all data and programs, and because its data are lost when power is lost, the computer system must provide secondary storage to back up main memory. Secondary storage consists of tapes, disks, and other media designed to hold information that will eventually be accessed in primary storage. Storage at each level (primary, secondary, cache) is ordinarily divided into bytes or words consisting of a fixed number of bytes. Each location in storage has an address; the set of all addresses available to a program is called an address space.

The three major activities of an operating system in regard to secondary storage management are:

1. Managing the free space available on the secondary-storage device.

2. Allocation of storage space when new files have to be written.

3. Scheduling the requests for storage access.

Networking

A distributed system is a collection of processors that do not share memory, peripheral devices, or a clock. The processors communicate with one another through communication lines called a network. The communication-network design must consider routing and connection strategies, and the problems of contention and security.

Protection System

If a computer system has multiple users and allows the concurrent execution of multiple processes, then the various processes must be protected from one another's activities. Protection refers to a mechanism for controlling the access of programs, processes, or users to the resources defined by a computer system.

Command Interpreter System


A command interpreter is an interface between the operating system and the user. The user gives commands, which are executed by the operating system (usually by turning them into system calls). The main function of a command interpreter is to get and execute the next user-specified command.

The command interpreter is usually not part of the kernel, since multiple command interpreters (shells, in UNIX terminology) may be supported by an operating system, and they do not really need to run in kernel mode. There are two main advantages to separating the command interpreter from the kernel.

First, if we want to change the way the command interpreter looks, i.e., change its interface, we can do so only when the command interpreter is separate from the kernel; we cannot change kernel code, so we could not modify the interface otherwise.

Second, if the command interpreter were part of the kernel, it would be possible for a malicious process to gain access to parts of the kernel that it should not reach. To avoid this scenario it is advantageous to keep the command interpreter separate from the kernel.

Operating Systems Services

Following are five services provided by an operating system for the convenience of users.

Program Execution

The purpose of a computer system is to allow the user to execute programs, so the operating system provides an environment where the user can conveniently run them. The user does not have to worry about memory allocation, multitasking, or the like; these things are taken care of by the operating system.

Running a program involves allocating and deallocating memory and, when several processes run at once, CPU scheduling. These functions cannot be given to user-level programs, so user-level programs cannot help the user run programs independently without help from the operating system.

I/O Operations

Each program requires input and produces output, which involves the use of I/O. The operating system hides from the user the details of the underlying I/O hardware; all the user sees is that the I/O has been performed, without any details. So by providing I/O, the operating system makes it convenient for users to run programs.

For efficiency and protection, users cannot control I/O directly, so this service cannot be provided by user-level programs.

File System Manipulation

The output of a program may need to be written into new files, or input taken from existing files. The operating system provides this service: the user does not have to worry about secondary storage management, but simply gives a command to read or write a file and sees the task accomplished. Thus the operating system makes it easier for user programs to accomplish their tasks.

This service involves secondary storage management. The speed of I/O that depends on secondary storage management is critical to the speed of many programs, and hence it is best left to the operating system rather than giving individual users control of it. It would not be difficult for user-level programs to provide these services, but for the above-mentioned reasons it is best if this service is left with the operating system.


Communications

There are instances where processes need to communicate with each other to exchange information, whether they are running on the same computer or on different computers. By providing this service the operating system relieves the user of the worry of passing messages between processes. In cases where messages need to be passed to processes on other computers through a network, this can also be done by user programs; such a user program may be customized to the specifics of the hardware through which the message transits and provide that service as an interface to the operating system.

Error Detection

An error in one part of the system may cause malfunctioning of the complete system. To avoid such situations the operating system constantly monitors the system for errors. This relieves the user of the worry of errors propagating to various parts of the system and causing malfunctions.

This service cannot be handled by user programs, because it involves monitoring and, in some cases, altering areas of memory, deallocating the memory of a faulty process, or perhaps taking the CPU away from a process that goes into an infinite loop. These tasks are too critical to be handed over to user programs; a user program given these privileges could interfere with the correct (normal) operation of the operating system.

System Calls and System Programs

System calls provide an interface between a process and the operating system. System calls allow user-level processes to request services that the process itself is not allowed to perform. In handling the trap, the operating system enters kernel mode, where it has access to privileged instructions, and can perform the desired service on behalf of the user-level process. It is because of the critical nature of these operations that the operating system itself performs them every time they are needed. For example, for I/O a process makes a system call telling the operating system to read or write a particular area, and this request is satisfied by the operating system.
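As a concrete illustration, the short C program below requests all of its I/O through system calls. This is a minimal sketch using the standard POSIX open(), read(), write() and close() calls; the file name input.txt is just an example.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        char buf[128];
        int fd = open("input.txt", O_RDONLY);         /* trap: ask the kernel to open a file */
        if (fd < 0)
            return 1;                                 /* the kernel refused the request */
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)   /* trap: the kernel performs the read */
            write(STDOUT_FILENO, buf, (size_t)n);     /* trap: the kernel writes to stdout */
        close(fd);                                    /* trap: release the descriptor */
        return 0;
    }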

System programs provide basic functioning to users so that they do not need to write their own environment for program development (editors, compilers) and program execution (shells). In some sense, they are bundles of useful system calls.

Layered Approach Design

A layered design makes the system easier to debug and modify, because changes affect only limited portions of the code, and the programmer does not have to know the details of the other layers. Information is also kept only where it is needed and is accessible only in certain ways, so bugs affecting that data are limited to a specific module or layer.

Mechanisms and Policies

Policy specifies what is to be done, while mechanism specifies how it is to be done. For instance, the timer construct for ensuring CPU protection is a mechanism; the decision of how long the timer is set for a particular user is a policy decision.

The separation of mechanism and policy is important for providing flexibility to a system. If the interface between mechanism and policy is well defined, a change of policy may affect only a few parameters; if the interface between the two is vague or ill-defined, a change might involve much deeper changes to the system.

Once the policy has been decided, the programmer can choose his or her own implementation. The underlying implementation may also be changed for a more efficient one without much trouble if the mechanism and policy are well defined. Specifically, separating the two provides flexibility in two ways. First, the same mechanism can be used to implement a variety of policies, so changing the policy might not require developing a new mechanism, but just changing the parameters of an existing mechanism from a library of mechanisms. Second, the mechanism can be changed (for example, to increase its efficiency or to move to a new platform) without changing the overall policy.
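The split can be made concrete in a few lines of C. In this sketch (all names and numbers are illustrative, not taken from any real kernel), over_limit() is the mechanism, which enforces whatever limit it is handed, and quantum_for() is the policy, which decides what that limit should be:

    #include <stdio.h>

    /* Mechanism: enforce a given CPU time limit (the "how"). */
    static int over_limit(int used_ms, int limit_ms) {
        return used_ms > limit_ms;
    }

    /* Policy: decide what the limit should be for a class of process (the "what"). */
    static int quantum_for(int interactive) {
        return interactive ? 20 : 200;   /* milliseconds; illustrative values */
    }

    int main(void) {
        int limit = quantum_for(1);      /* the policy picks 20 ms for interactive jobs */
        /* Changing the policy means changing quantum_for(); the mechanism
         * over_limit() is reused unchanged. */
        printf("preempt after 35 ms? %s\n", over_limit(35, limit) ? "yes" : "no");
        return 0;
    }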


CHAPTER 3

PROCESS

Definition of Process

The term "process" was first used by the designers of the MULTICS in 1960's. Since then, the term process, used somewhat interchangeably with 'task' or 'job'. The process has been given many definitions for instance

• A program in Execution.

• An asynchronous activity.

• The 'animated spirit' of a procedure in execution.

• The entity to which processors are assigned.

• The 'dispatchable' unit.

Many more definitions have been given. As we can see, there is no universally agreed-upon definition, but "a program in execution" seems to be the most frequently used, and it is the concept we will use in the present study of operating systems.

Now that we have agreed upon a definition of process, the question is: what is the relation between a process and a program? Is it the same beast with a different name, called a program when sleeping (not executing) and a process when executing? To be precise, a process is not the same as a program. In the following discussion we point out some of the differences.

A process is not the same as a program; a process is more than the program code. A process is an 'active' entity, as opposed to a program, which is considered a 'passive' entity. A program is an algorithm expressed in some suitable notation (e.g., a programming language). Being passive, a program is only a part of a process. A process, on the other hand, includes:

• Current value of Program Counter (PC)

• Contents of the processors registers

• Value of the variables

• The process stack (SP), which typically contains temporary data such as subroutine parameters, return addresses, and temporary variables.

• A data section that contains global variables.

• A process is the unit of work in a system.


Process State

The process state consists of everything necessary to resume the process's execution if it is somehow put aside temporarily. The process state consists of at least the following:

• Code for the program.

• Program's static data.

• Program's dynamic data.

• Program's procedure call stack.

• Contents of general purpose registers.

• Contents of program counter (PC).

• Contents of program status word (PSW).

• Operating system resources in use.

A process goes through a series of discrete process states.

New State: The process being created.

Running State: A process is said to be running if it has the CPU, that is, process actually using the CPU at that particular instant.

Blocked (or waiting) State: A process is said to be blocked if it is waiting for some event to happen, such as an I/O completion, before it can proceed. Note that a blocked process is unable to run until some external event happens.

Ready State: A process is said to be ready if it could use a CPU were one available. A ready process is runnable but temporarily stopped to let another process run.

Terminated state: The process has finished execution.

Process Operations

Process Creation

In general-purpose systems, some way is needed to create processes as needed during operation. There are four principal events that lead to process creation:

• System initialization.

• Execution of a process-creation system call by a running process.

• A user request to create a new process.

• Initiation of a batch job.

Foreground processes interact with users. Background processes stay in the background, sleeping, but spring to life to handle activity such as email, web pages, printing, and so on. Background processes are called daemons.

A process may create a new process via a process-creation system call such as fork, which creates an exact clone of the calling process. When it does so, the creating process is called the parent process and the created one is called the child process. Only one parent is needed to create a child process: unlike plants and animals that use sexual reproduction, a process has only one parent. This creation of processes yields a hierarchical structure of processes in which each child has only one parent but each parent may have many children. After the fork, the two processes, the parent and the child, have the same memory image, the same environment strings and the same open files, but each has its own distinct address space: if either process changes a word in its address space, the change is not visible to the other process.
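A minimal POSIX sketch of this parent/child relationship, using the standard fork(), getpid() and waitpid() calls:

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();                  /* clone the calling process */
        if (pid < 0) {
            perror("fork");                  /* creation failed */
            return 1;
        } else if (pid == 0) {
            /* Runs in the child's copy of the address space. */
            printf("child:  pid=%d\n", (int)getpid());
        } else {
            /* Runs in the parent; pid identifies the one child. */
            waitpid(pid, NULL, 0);           /* wait for the child to finish */
            printf("parent: child %d finished\n", (int)pid);
        }
        return 0;
    }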

Following are some reasons for the creation of a process:

• A user logs on.

• A user starts a program.

• The operating system creates a process to provide a service, e.g., to manage a printer.

• Some program starts another process, e.g., Netscape calls xv to display a picture.

Process Termination

A process terminates when it finishes executing its last statement. Its resources are returned to the system, it is purged from any system lists or tables, and its process control block (PCB) is erased, i.e., the PCB's memory space is returned to a free memory pool. A process terminates usually due to one of the following reasons:

• Normal Exit: Most processes terminate because they have done their job. In UNIX, this call is exit.

• Error Exit: The process discovers a fatal error; for example, a user tries to compile a program whose source file does not exist.

• Fatal Error: An error caused by the process, due to a program bug, for example executing an illegal instruction, referencing non-existent memory, or dividing by zero.

• Killed by Another Process: A process executes a system call telling the operating system to terminate some other process. In UNIX, this call is kill. In some systems, when a process is killed, all the processes it created are killed as well (UNIX does not work this way).

Process States

A process goes through a series of discrete process states.

• New State: The process is being created.

• Terminated State: The process has finished execution.

• Blocked (waiting) State: When a process blocks, it does so because logically it cannot continue, typically because it is waiting for input that is not yet available. Formally, a process is said to be blocked if it is waiting for some event to happen (such as an I/O completion) before it can proceed. In this state a process is unable to run until some external event happens.

• Running State: A process is said to be running if it currently has the CPU, that is, it is actually using the CPU at that particular instant.

• Ready State: A process is said to be ready if it could use a CPU were one available. It is runnable but temporarily stopped to let another process run.

Logically, the 'Running' and 'Ready' states are similar. In both cases the process is willing to run, only in the case of 'Ready' state, there is temporarily no CPU available for it. The 'Blocked' state is different from the 'Running' and 'Ready' states in that the process cannot run, even if the CPU is available.


Process State Transitions

Following are the six possible transitions among the above-mentioned five states.

Transition 1 occurs when a process discovers that it cannot continue. If a running process initiates an I/O operation before its allotted time expires, it voluntarily relinquishes the CPU.

This state transition is:

Block (process-name): Running → Blocked.

Transition 2 occurs when the scheduler decides that the running process has run long enough and it is time to let another process have CPU time.

This state transition is:

Time-Run-Out (process-name): Running → Ready.

Transition 3 occurs when all other processes have had their share and it is time for the first process to run again.

This state transition is:

Dispatch (process-name): Ready → Running.

Transition 4 occurs when the external event for which a process was waiting (such as arrival of input) happens.

This state transition is:

Wakeup (process-name): Blocked → Ready.

Transition 5 occurs when the process is created.

This state transition is:

Admitted (process-name): New → Ready.

Transition 6 occurs when the process has finished execution.

This state transition is:

Exit (process-name): Running → Terminated.
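These five states and six transitions form a small state machine, which the following C sketch encodes directly. The event names mirror the transition labels above; the table is illustrative rather than taken from any real kernel.

    #include <stdio.h>
    #include <string.h>

    typedef enum { NEW, READY, RUNNING, BLOCKED, TERMINATED } pstate;

    /* Apply a named transition; returns the new state, or the old state
     * (with a warning) if the event is not legal from there. */
    static pstate apply(pstate s, const char *event) {
        if (s == NEW     && !strcmp(event, "admitted"))     return READY;
        if (s == READY   && !strcmp(event, "dispatch"))     return RUNNING;
        if (s == RUNNING && !strcmp(event, "time-run-out")) return READY;
        if (s == RUNNING && !strcmp(event, "block"))        return BLOCKED;
        if (s == BLOCKED && !strcmp(event, "wakeup"))       return READY;
        if (s == RUNNING && !strcmp(event, "exit"))         return TERMINATED;
        fprintf(stderr, "illegal transition: %s\n", event);
        return s;
    }

    int main(void) {
        pstate s = NEW;
        const char *trace[] = { "admitted", "dispatch", "block",
                                "wakeup", "dispatch", "exit" };
        for (int i = 0; i < 6; i++)
            s = apply(s, trace[i]);     /* walk one legal path through the diagram */
        printf("final state: %s\n", s == TERMINATED ? "terminated" : "not terminated");
        return 0;
    }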


Process Control Block

A process in an operating system is represented by a data structure known as a process control block (PCB) or process descriptor. The PCB contains important information about the specific process including

• The current state of the process i.e., whether it is ready, running, waiting, or whatever.

• Unique identification of the process in order to track "which is which" information.

• A pointer to parent process.

• Similarly, a pointer to child process (if it exists).

• The priority of process (a part of CPU scheduling information).

• Pointers to locate memory of processes.

• A register save area.

• The processor it is running on.

The PCB is a central store of information that allows the operating system to locate key facts about a process. Thus, the PCB is the data structure that defines a process to the operating system.
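A PCB can be pictured as a C structure. The fields below mirror the list above, but the names and layout are illustrative rather than taken from any particular kernel:

    #include <stdio.h>

    typedef enum { NEW, READY, RUNNING, BLOCKED, TERMINATED } proc_state;

    struct pcb {
        int           pid;            /* unique identification of the process */
        proc_state    state;          /* ready, running, waiting, etc. */
        struct pcb   *parent;         /* pointer to the parent process */
        struct pcb   *first_child;    /* pointer to a child process, if any */
        int           priority;       /* CPU scheduling information */
        void         *mem_tables;     /* pointers to locate the process's memory */
        unsigned long registers[16];  /* register save area */
        int           cpu;            /* the processor it is running on */
    };

    int main(void) {
        printf("pcb size: %zu bytes\n", sizeof(struct pcb));
        return 0;
    }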

Process (computing)

In computing, a process is an instance of a computer program, consisting of one or more threads, that is being sequentially executed by a computer system that has the ability to run several computer programs concurrently.

A computer program itself is just a passive collection of instructions, while a process is the actual execution of those instructions. Several processes may be associated with the same program; for example, opening up several instances of the same program often means more than one process is being executed. In the computing world, processes are formally defined by the operating system (OS) running them and so may differ in detail from one OS to another.

A single computer processor executes one or more (multiple) instructions at a time (per clock cycle), one after the other (this is a simplification; for the full story, see superscalar CPU architecture). To allow users to run several programs at once (e.g., so that processor time is not wasted waiting for input from a resource), single-processor computer systems can perform time-sharing. Time-sharing allows processes to switch between being executed and waiting (to continue) to be executed. In most cases this is done very rapidly, providing the illusion that several processes are executing 'at once'. (This is known as concurrency or multiprogramming.) Using more than one physical processor on a computer, permits true simultaneous execution of more than one stream of instructions from different processes, but time-sharing is still typically used to allow more than one process to run at a time. (Concurrency is the term generally used to refer to several independent processes sharing a single processor; simultaneously is used to refer to several processes, each with their own processor.) Different processes may share the same set of instructions in memory (to save storage), but this is not known to any one process. Each execution of the same set of instructions is known as an instance— a completely separate instantiation of the program.


For security and reliability reasons most modern operating systems prevent direct communication between 'independent' processes, providing strictly mediated and controlled inter-process communication functionality.

Sub-processes and multi-threading

Thread (computer science)

A process may split itself into multiple 'daughter' sub-processes or threads that execute in parallel, running different instructions on much of the same resources and data (or, as noted, the same instructions on logically different resources and data).

Multithreading is useful when various 'events' occur in an unpredictable order and should be processed in a different order than they arrive, for example based on response-time constraints. Multithreading makes it possible for the processing of one event to be temporarily interrupted by an event of higher priority. Multithreading may result in more efficient CPU time utilization, since the CPU may switch to low-priority tasks while waiting for other events to occur.

For example, a word processor could perform a spell check as the user types, without "freezing" the application: a high-priority thread could handle user input and update the display, while a low-priority background thread runs the time-consuming spell-checking utility. As a result, the entered text is shown on the screen immediately, while spelling mistakes are indicated or corrected shortly afterwards.
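A minimal sketch of that structure using POSIX threads; the function name and messages are invented for illustration:

    #include <pthread.h>
    #include <stdio.h>

    /* Background worker: the time-consuming task. */
    static void *spell_check(void *arg) {
        (void)arg;
        printf("background thread: spell checking...\n");
        return NULL;
    }

    int main(void) {
        pthread_t tid;
        pthread_create(&tid, NULL, spell_check, NULL); /* spawn the worker */
        printf("main thread: handling user input\n");  /* stays responsive */
        pthread_join(tid, NULL);                       /* wait for the worker */
        return 0;
    }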

Multithreading allows a server, such as a web server, to serve requests from several users concurrently, so requests are not left unserved while the server is busy processing another request. One simple solution is one thread that puts every incoming request in a queue, and a second thread that processes the requests one by one in first-come, first-served order. However, if the processing time is very long for some requests (such as large file requests or requests from users with a slow network access data rate), this approach gives long response times even for requests that do not require long processing, since they may have to wait in the queue. One thread per request would reduce the response time substantially for many users and may reduce CPU idle time and increase the utilization of CPU and network capacity. In cases where the communication protocol between client and server is a session involving a sequence of several messages and responses in each direction (as with the TCP transport protocol used for web browsing), creating one thread per communication session also reduces the complexity of the program substantially, since each thread is an instance with its own state and variables.

In a similar fashion, multi-threading would make it possible for a client such as a web browser to communicate efficiently with several servers concurrently.

A process that has only one thread is referred to as a single-threaded process, while a process with multiple threads is referred to as a multi-threaded process. Multi-threaded processes have the advantage over multi-process systems that they can perform several tasks concurrently without the extra overhead needed to create a new process and handle synchronised communication between these processes. However, single-threaded processes have the advantage of even lower overhead.

Representation

In general, a computer system process consists of (or is said to 'own') the following resources:

• An image of the executable machine code associated with a program.


• Memory (typically some region of virtual memory); which includes the executable code, process-specific data (input and output), a call stack (to keep track of active subroutines and/or other events), and a heap to hold intermediate computation data generated during run time.

• Operating system descriptors of resources that are allocated to the process, such as file descriptors (Unix terminology) or handles (Windows), and data sources and sinks.

• Security attributes, such as the process owner and the process' set of permissions (allowable operations).

• Processor state (context), such as the content of registers, physical memory addressing, etc. The state is typically stored in computer registers when the process is executing, and in memory otherwise.

The operating system holds most of this information about active processes in data structures called process control blocks (PCB).

Any subset of resources, but typically at least the processor state, may be associated with each of the process' threads in operating systems that support threads or 'daughter' processes.

The operating system keeps its processes separated and allocates the resources they need so that they are less likely to interfere with each other and cause system failures (e.g., deadlock or thrashing). The operating system may also provide mechanisms for inter-process communication to enable processes to interact in safe and predictable ways.

Process management in multi-tasking operating systems

Process management (computing)

A multitasking* operating system may just switch between processes to give the appearance of many processes executing concurrently or simultaneously, though in fact only one process can be executing at any one time on a single-core CPU (unless using multi-threading or other similar technology).

It is usual to associate a single process with a main program, and 'daughter' ('child') processes with any spin-off, parallel processes, which behave like asynchronous subroutines. A process is said to own resources, of which an image of its program (in memory) is one such resource. (Note, however, that in multiprocessing systems, many processes may run off of, or share, the same reentrant program at the same location in memory— but each process is said to own its own image of the program.)

Processes are often called tasks in embedded operating systems. The sense of 'process' (or task) is 'something that takes up time', as opposed to 'memory', which is 'something that takes up space'. (Historically, the terms 'task' and 'process' were used interchangeably, but the term 'task' seems to be dropping from the computer lexicon.)

The above description applies to both processes managed by an operating system, and processes as defined by process calculi.

If a process requests something for which it must wait, it will be blocked. When the process is in the blocked state, it is eligible for swapping to disk, but this is transparent in a virtual memory system, where blocks of memory values may really be on disk and not in main memory at any time. Note that even unused portions of active processes/tasks (executing programs) are eligible for swapping to disk. Not all parts of an executing program and its data have to be in physical memory for the associated process to be active.

*Tasks and processes refer essentially to the same entity. And, although they have somewhat different terminological histories, they have come to be used as synonyms. Today, the term process is generally preferred over task, except when referring to 'multitasking', since the alternative term, 'multiprocessing', is too easy to confuse with multiprocessor (which is a computer with two or more CPUs).

In the process model, all software on the computer is organized into a number of sequential processes. A process includes a PC, registers, and variables. Conceptually, each process has its own virtual CPU. In reality, the CPU switches back and forth among processes (this rapid switching back and forth is called multiprogramming).

We're starting with the CPU as a resource, so we need an abstraction of CPU use. We define a process as the OS's representation of a program in execution so that we can allocate CPU time to it. (Other definitions range from "the thing pointed to by a PCB" to "the animated spirit of a procedure.") Note the difference between a program and a process: the ls program on disk is a program; the ls instance running on a computer is a process.

Process states

The various process states, displayed in a state diagram, with arrows indicating possible transitions between states.

An operating system kernel that allows multitasking needs processes to have certain states. The names of these states are not standardised, but they have similar functionality.

• First, the process is "created" - it is loaded from secondary storage device (hard disk or CD-ROM...) into main memory. After that process scheduler assigns him state "waiting".

• When process is "waiting" it waits for scheduler to do so-called context switch and load the process into the processor. The process state then becomes "running", and processor executes processes instructions.

• If a process needs to wait for a resource (for user input, for a file to open, ...), it is assigned the "blocked" state. The process state is changed back to "waiting" when the process no longer needs to wait.

• Once the process finishes execution, or is terminated by the operating system, it is no longer needed. The process is removed instantly, or moved to the "terminated" state, where it waits to be removed from main memory.

Inter-process communication

When processes communicate with each other it is called "Inter-process communication" (IPC).

It is possible for the two processes to run on different machines. Operating systems differ from one to another, so mediators (called protocols) are needed.

History


By the early 1960s, computer control software had evolved from monitor control software (e.g., IBSYS) to executive control software. Computers got faster, but computer time was still neither cheap nor fully used; this made multiprogramming possible and necessary.

Multiprogramming means that several programs run concurrently ("at the same time"). At first they ran on a single processor (i.e., a uniprocessor) and shared scarce resources. Multiprogramming is also a basic form of multiprocessing, a much broader term.

Programs consist of sequences of instructions for a processor. A single processor can run only one instruction at a time, so it is impossible to run more than one program at literally the same time. A program might need a resource (input, for example) that has a long delay, or might start a slow operation (such as output to a printer), leaving the processor idle (unused). To keep the processor busy at all times, the execution of such a program was halted and a second (or nth) program was started or restarted. Users perceived that the programs ran "at the same time" (hence the term concurrent).

Shortly thereafter, the notion of a 'program' was expanded to the notion of an 'executing program and its context'. The concept of a process was born.

This became necessary with the invention of re-entrant code.

Threads came somewhat later. However, with the advent of time-sharing, computer networks, and multiple-CPU shared-memory computers, the old "multiprogramming" gave way to true multitasking, multiprocessing and, later, multithreading.

Processes in Action

At any given moment a process is in one of several states:

(An aside on naming: these gender-neutral "parent" and "child" terms are something of an innovation. In the early days of computer science, talk of father and son processes was more common. This tradition worked in reverse at IBM, where processes were female. Because male mammals don't bear young, this is one of the few times where IBM nomenclature is more sensible.)

[State diagram: the running, ready, and blocked states, with transitions Dispatch (ready → running), Quantum expired (running → ready), Block for I/O (running → blocked), and I/O completes (blocked → ready).]

The functions of the states are:

• Running: the process is executing on the processor. Only one process is in this state on a given processor.

• Blocked: the process is waiting for some external event, for example disk I/O.

• Ready: the process is ready to run.


These states may be defined implicitly: a process is in the ready state if it's on the ready queue, or blocked if it's on the blocked queue. Frequently there is more than one ready or blocked queue; there may be multiple ready queues to reflect job priorities and multiple blocked queues to represent the events for which the processes are waiting.

The act of removing one process from the running state and putting another there is called a context switch, because the context (that is, the running environment: all the user credentials, open files, etc.) of one process is changed for another. We'll talk more about the details of this next lecture, but you should think about what constitutes a process context.

Good questions to ask are "why does a process leave the running state?" and "how does the OS pick the process to run?" The answers to those questions make up the subtopic of process scheduling.

There are several kinds of schedulers. The broad division is between preemptive and nonpreemptive schedulers, with nonpreemptive schedulers further divided into cooperative and run-to-completion.

Run-to-completion schedulers are the easiest to understand. The process leaves the running state exactly once, when it exits. A process never enters the blocked state. Examples are batch systems. Some web servers are conceptually run to completion, but because they are usually implemented on systems with a more complex scheduler, their behavior is more complex.

Processes in a cooperative multitasking environment tell the OS when to switch them: they explicitly block for I/O or specifically give up the CPU to other processes. Examples are Apple's original multitasking system and some Java systems.

A preemptive multitasking system interrupts (preempts) a running process if it has had the CPU too long, forcing a context switch. UNIX is a preemptive multitasking system.

The time a process can keep the CPU is called the system's time quantum. The choice of time quantum can have a profound effect on system performance. Small time quanta give good interactive performance to short interactive jobs (which are likely to block for I/O). Larger quanta are better for long-running CPU-bound jobs because they do not force as many context switches (which do not move the computation forward). If the time quantum is so small that the system spends more time switching processes than doing useful work, the system is said to be thrashing. Thrashing is a condition we shall see in other subsystems as well; the general definition is a system spending more time on overhead than on useful work.
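A back-of-the-envelope calculation makes the trade-off concrete. If every quantum ends with a context switch of fixed cost, the fraction of CPU time lost to switching is cost / (quantum + cost); the 1 ms switch cost and the quantum sizes below are made-up numbers for illustration:

    #include <stdio.h>

    int main(void) {
        const double switch_cost = 1.0;               /* ms per context switch (assumed) */
        const double quanta[] = { 2.0, 10.0, 100.0 }; /* candidate quantum sizes, in ms */
        for (int i = 0; i < 3; i++) {
            double overhead = switch_cost / (quanta[i] + switch_cost);
            printf("quantum %6.1f ms -> %4.1f%% of CPU time lost to switching\n",
                   quanta[i], 100.0 * overhead);
        }
        return 0;
    }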


Process Scheduling and Implementation

Scheduling

Last lecture we discussed half of process scheduling, when a process gives up the CPU. Today we start with the other half, which process is scheduled to take its place. This is our first introduction to scheduling algorithms which will be a repeating topic in the course. Operating systems schedule pages of memory, disk blocks, and several other things. The algorithms discussed today and variations on them tuned for specific other applications are important tools for your bag of OS design tricks.

Why not just pick a process at random? Congratulations - that is a scheduling discipline: random scheduling. It has the advantage that it's easy to implement, but it gives somewhat unpredictable results.

N.B. if you have a homogeneous set of jobs, it may be an effective scheduling mechanism! All scheduling mechanisms involve design tradeoffs. The relevant parameters to trade off in process scheduling include:

• Response Time: the time for processes to complete. The OS may want to favor certain types of processes or to minimize a statistical property like average time.

• Implementation Time: this includes the complexity of the algorithm and its maintenance.

• Overhead: the time to decide which process to schedule and to collect the data needed to make that selection.

• Fairness: to what extent are different users' processes treated differently?

Some Scheduling Disciplines

First-In-First-Out (FIFO) and Round Robin

The ready queue is a single FIFO queue where the next process to be run is the one at the front of the queue. Processes are added to the back of the ready queue. This is a simple discipline to implement, and with equal-sized quanta on a preemptive scheduling system it results in each process getting roughly equal time on the processor. In the limit, i.e., a preemptive system with a quantum the size of one machine instruction and no context-switch overhead, the discipline is called processor sharing, and each of n processes gets 1/n of the CPU time.

As quanta get larger, FIFO tends to discriminate against short jobs that give up the CPU quickly for I/O while long CPU-bound jobs hold it for their full quantum.
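The core of a round-robin dispatcher can be sketched in a few lines. This is an illustration only - Process, ready and run_for_quantum() are hypothetical helpers, not part of any particular OS:

/* Sketch of round-robin dispatch. "ready" is a FIFO queue of
   Process pointers; run_for_quantum() runs one time quantum. */
while (!ready.empty()) {
    Process *p = ready.dequeue();   /* next process: front of the queue */
    run_for_quantum(p);             /* one quantum on the CPU */
    if (!p->finished)
        ready.enqueue(p);           /* not done: back of the line */
}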


Priority Scheduling

FIFO is egalitarian - all processes are treated equally. It is often reasonable to discriminate between processes based on their relative importance. (The payroll calculations may be more important than my video game.)

One method of handling this is to assign each process a priority and run the highest-priority process. (What to do on a tie puts us back to square one - we pick a scheduling policy.)

This solves FIFO's problem with interactive jobs in a mixed workload. Interactive jobs are given high priority and run whenever there are any. Lower-priority CPU-bound jobs share what's left. Particularly aggressive priority schedulers reschedule jobs whenever a job moves on any queue, so interactive jobs would be able to run immediately after their I/O completes.

The CTSS system in Tanenbaum uses a different quantum at each priority scheduling level. More complex systems have rules about moving processes between priority levels. (Systems that move processes between multiple priorities based on their behavior are sometimes called multilevel feedback queues.) A sketch of the dispatch step follows.
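The dispatch step of such a multilevel scheme can be sketched as follows (again hypothetical names; one FIFO queue per priority level, highest level scanned first):

/* Sketch: scan from the highest priority level down and run the
   first ready process found; ties within a level are FIFO. */
Process *pick_next(ReadyQueue ready[], int levels) {
    for (int pri = levels - 1; pri >= 0; pri--)
        if (!ready[pri].empty())
            return ready[pri].dequeue();
    return 0;   /* nothing is ready */
}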

Priority Problems - Starvation and Inversion

When processes cooperate in a priority scheduling system, there can be interactions between the processes that confuse the priority system. Consider three processes, A, B and C, where A has the highest priority (runs first) and C the lowest, with B having a priority between them. A blocks waiting for C to do something. B will run to completion even though A, a higher-priority process, could continue if C would run. This is sometimes referred to as a priority inversion. It happens in real systems - the Mars Pathfinder mission suffered a failure due to a priority inversion.

Starvation is simpler to understand. Imagine our 2-level priority system above with an endless, fast stream of interactive jobs. Any CPU-bound jobs will never run.

The final problem with priority systems is how to determine priorities. They can be statically assigned to each program (ls always runs with priority 3) or each user (root always runs with priority 3), or computed on the fly (process aging). All of these have their problems. For every scheduling strategy, there is a counter-strategy.

Shortest Job First (SJF)

An important metric of interactive job performance is the response time of the process (the amount of time that the process is in the system, i.e., on some queue). SJF minimizes the average response time for the system. Processes are labelled with their expected processing time, and the shortest one is scheduled first.

The problem, of course, is determining those response times. For batch processes that run frequently, guesses are easy to come by. Other programs have run times that vary widely (e.g. a prime tester runs quickly on even numbers and slowly on primes). In general, the problem of determining run times a priori is impossible.


There is hope, however, in the form of heuristics (that is, algorithms that provide good guesses). The simplest is to use a moving average. An average run time is kept for each program, and after each run of that program it is recomputed as:

estimate = α × (old estimate) + (1 − α) × measurement        (for constant α, 0 < α ≤ 1)

Moving averages are another powerful tool for your design toolkit.
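As a minimal sketch of the update (the choice α = 0.5 is arbitrary, for illustration only):

/* Called after each run of a program to refresh its run-time estimate. */
double update_estimate(double old_estimate, double measurement) {
    const double alpha = 0.5;   /* weight given to history, 0 < alpha <= 1 */
    return alpha * old_estimate + (1.0 - alpha) * measurement;
}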

Process Implementation

The operating system represents a process primarily in a data structure called a Process Control Block (PCB). You'll see Task Control Block (TCB) and other variants. When a process is created, it is allocated a PCB that includes:

• CPU Registers

• Pointer to Text (program code)

• Pointer to uninitialized data

• Stack Pointer

• Program Counter

• Pointer to Data

• Root directory

• Default File Permissions

• Working directory

• Process State

• Exit Status

• File Descriptors

• Process Identifier (pid)

• User Identifier (uid)

• Pending Signals

• Signal Maps

• Other OS-dependent information

These are some of the major elements that make up the process context, although not all of them are directly manipulated on a context switch.
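As a sketch, a PCB might look something like the following C structure. The field names and sizes here are illustrative only, not any particular OS's layout:

#include <signal.h>

enum { NREGS = 32, NOFILE = 64 };     /* illustrative sizes */

struct pcb {
    int pid;                          /* process identifier */
    int uid;                          /* user identifier */
    int state;                        /* running, ready, blocked, ... */
    unsigned long pc;                 /* saved program counter */
    unsigned long sp;                 /* saved stack pointer */
    unsigned long regs[NREGS];        /* saved CPU registers */
    void *text, *data, *bss;          /* code, initialized and uninitialized data */
    int exit_status;
    struct file *fd_table[NOFILE];    /* open file descriptors */
    sigset_t pending_signals;
    /* root/working directory, signal maps, other OS-dependent info ... */
};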

Context Switching

The act of switching from one process to another is somewhat machine-dependent. A general outline is:


• The OS gets control (either because of a timer interrupt or because the process made a system call).

• Operating system processing info is updated (pointer to the current PCB, etc.)

• Processor state is saved (registers, memory map and floating point state, etc)

• This process is replaced on the ready queue and the next process is selected by the scheduling algorithm

• The new process's operating system and processor state is restored

• The new process continues (to this process it looks like a blocking call has just returned, or as if an interrupt service routine - not a signal handler - has just returned)

Context switches must be made as safe and fast as possible. Safe because isolation must be maintained and fast because any time spent doing them is stolen from processes doing useful work. Linux’s well-tuned context switch code runs in about 5 microseconds on a high-end Pentium.

Process Creation

There are two main models of process creation - the fork/exec and the spawn models. On systems that support fork, a new process is created as a copy of the original one and then explicitly executes (exec) a new program to run. In the spawn model the new program and arguments are named in the system call, a new process is created and that program run directly.

Fork is the more flexible model. It allows a program to arbitrarily change the environment of the child process before starting the new program. Typical fork pseudo-code looks like:

if (fork() == 0) {
    /* child process */
    change standard input
    block signals for timers
    run the new program
} else {
    /* parent process */
    wait for child to complete
}
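For comparison, here is the same pattern with the actual POSIX calls - a minimal runnable sketch (running ls -l is an arbitrary choice):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    pid_t pid = fork();               /* child gets 0, parent gets the child's pid */
    if (pid == 0) {
        /* child process: replace this image with a new program */
        execlp("ls", "ls", "-l", (char *)NULL);
        perror("execlp");             /* reached only if exec failed */
        exit(1);
    }
    waitpid(pid, NULL, 0);            /* parent: wait for child to complete */
    return 0;
}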

With spawn, by contrast, any parameters of the child process's operating environment that must be changed must be included in the parameters to spawn, and spawn will have a standard way of handling them. There are various ways to handle the proliferation of parameters that results; for example, AmigaDOS® uses tag lists - linked lists of self-describing parameters - to solve the problem.

The steps of process creation are similar for both models. The OS gains control after the fork or spawn system call, and creates and fills a new PCB. Then a new address space (memory) is allocated for the process. Fork creates a copy of the parent address space, and spawn creates a new address space derived from the program. Then the PCB is put on the run list and the system call returns.

An important difference between the two systems is that the fork call must create a copy of the parent address space. This can be wasteful if that address space will be deleted and rewritten in a few instructions’ time. One solution to this problem has been a second system call, vfork, that lets the child process use the parent’s memory until an exec is made. We’ll discuss other systems to mitigate the cost of fork when we talk about memory management.

CHAPTER 4

THREADS

Threads

Despite the fact that a thread must execute within a process, the process and its associated threads are different concepts. Processes are used to group resources together; threads are the entities scheduled for execution on the CPU.

A thread is a single sequential stream of execution within a process. Because threads have some of the properties of processes, they are sometimes called lightweight processes. Threads allow multiple streams of execution within a process. In many respects, threads are a popular way to improve application performance through parallelism. The CPU switches rapidly back and forth among the threads, giving the illusion that the threads are running in parallel. Like a traditional process (i.e., a process with one thread), a thread can be in any of several states (Running, Blocked, Ready or Terminated). Each thread has its own stack: since each thread will generally call different procedures, it has its own execution history, and this is why each thread needs its own stack. In an operating system with a thread facility, the basic unit of CPU utilization is a thread. A thread has, or consists of, a program counter (PC), a register set, and a stack space. Threads are not independent of one another the way processes are; as a result, threads share with other threads their code section, data section and OS resources (collectively known as a task), such as open files and signals.


Process and Threads

A process is an execution stream in the context of a particular process state.

• An execution stream is a sequence of instructions.

• Process state determines the effect of the instructions. It usually includes (but is not restricted to):

o Registers

o Stack

o Memory (global variables and dynamically allocated memory)

o Open file tables

o Signal management information

Key concept: processes are separated: no process can directly affect the state of another process.

Process is a key OS abstraction that users see - the environment you interact with when you use a computer is built up out of processes.

• The shell you type stuff into is a process.

• When you execute a program you have just compiled, the OS generates a process to run the program.

• Your WWW browser is a process.

Organizing system activities around processes has proved to be a useful way of separating out different activities into coherent units.

Two concepts: uniprogramming and multiprogramming.

• Uniprogramming: only one process at a time. Typical example: DOS. Problem: users often wish to perform more than one activity at a time (load a remote file while editing a program, for example), and uniprogramming does not allow this. So DOS and other uniprogrammed systems put in things like memory-resident programs that are invoked asynchronously, but they still have separation problems. One key problem with DOS is that there is no memory protection - one program may write the memory of another program, causing weird bugs.

• Multiprogramming: multiple processes at a time. Typical of Unix plus all currently envisioned new operating systems. Allows system to separate out activities cleanly.

Multiprogramming introduces the resource sharing problem - which processes get to use the physical resources of the machine when? One crucial resource: CPU. Standard solution is to use preemptive multitasking - OS runs one process for a while, then takes the CPU away from that process and lets another process run. Must save and restore process state. Key issue: fairness. Must ensure that all processes get their fair share of the CPU.

How does the OS implement the process abstraction? Uses a context switch to switch from running one process to running another process.


How does machine implement context switch? A processor has a limited amount of physical resources. For example, it has only one register set. But every process on the machine has its own set of registers. Solution: save and restore hardware state on a context switch. Save the state in Process Control Block (PCB). What is in PCB? Depends on the hardware.

• Registers - almost all machines save registers in PCB.

• Processor Status Word.

• What about memory? Most machines allow memory from multiple processes to coexist in the physical memory of the machine. Some may require Memory Management Unit (MMU) changes on a context switch. But some early personal computers switched all of a process's memory out to disk (!!!).

Operating Systems are fundamentally event-driven systems - they wait for an event to happen, respond appropriately to the event, then wait for the next event.

Examples:

• User hits a key. The keystroke is echoed on the screen.

• A user program issues a system call to read a file. The operating system figures out which disk blocks to bring in, and generates a request to the disk controller to read the disk blocks into memory.

• The disk controller finishes reading in the disk block and generates an interrupt. The OS moves the read data into the user program and restarts the user program.

• A Mosaic or Netscape user asks for a URL to be retrieved. This eventually generates requests to the OS to send request packets out over the network to a remote WWW server. The OS sends the packets.

• The response packets come back from the WWW server, interrupting the processor. The OS figures out which process should get the packets, then routes the packets to that process.

• Time-slice timer goes off. The OS must save the state of the current process, choose another process to run, then give the CPU to that process.

When building an event-driven system with several distinct serial activities, threads are a key structuring mechanism for the OS.

A thread is again an execution stream in the context of a thread state. Key difference between processes and threads is that multiple threads share parts of their state. Typically, allow multiple threads to read and write same memory. (Recall that no processes could directly access memory of another process). But, each thread still has its own registers. Also has its own stack, but other threads can read and write the stack memory.

What is in a thread control block? Typically just registers. Don't need to do anything to the MMU when switch threads, because all threads can access same memory.

Typically, an OS will have a separate thread for each distinct activity. In particular, the OS will have a separate thread for each process, and that thread will perform OS activities on behalf of the process. In this case we say that each user process is backed by a kernel thread.


• When process issues a system call to read a file, the process's thread will take over, figure out which disk accesses to generate, and issue the low level instructions required to start the transfer. It then suspends until the disk finishes reading in the data.

• When process starts up a remote TCP connection, its thread handles the low-level details of sending out network packets.

Having a separate thread for each activity allows the programmer to program the actions associated with that activity as a single serial stream of actions and events. Programmer does not have to deal with the complexity of interleaving multiple activities on the same thread.

Why allow threads to access same memory? Because inside OS, threads must coordinate their activities very closely.

• If two processes issue read file system calls at close to the same time, must make sure that the OS serializes the disk requests appropriately.

• When one process allocates memory, its thread must find some free memory and give it to the process. Must ensure that multiple threads allocate disjoint pieces of memory.

Having threads share the same address space makes it much easier to coordinate activities - can build data structures that represent system state and have threads read and write data structures to figure out what to do when they need to process a request.

One complication that threads must deal with: asynchrony. Asynchronous events happen arbitrarily as the thread is executing, and may interfere with the thread's activities unless the programmer does something to limit the asynchrony. Examples:

• An interrupt occurs, transferring control away from one thread to an interrupt handler.

• A time-slice switch occurs, transferring control from one thread to another.

• Two threads running on different processors read and write the same memory.

Asynchronous events, if not properly controlled, can lead to incorrect behavior. Examples:

• Two threads need to issue disk requests. First thread starts to program disk controller (assume it is memory-mapped, and must issue multiple writes to specify a disk operation). In the meantime, the second thread runs on a different processor and also issues the memory-mapped writes to program the disk controller. The disk controller gets horribly confused and reads the wrong disk block.

• Two threads need to write to the display. The first thread starts to build its request, but before it finishes a time-slice switch occurs and the second thread starts its request. The combination of the two threads issues a forbidden request sequence, and smoke starts pouring out of the display.

• For accounting reasons the operating system keeps track of how much time is spent in each user program. It also keeps a running sum of the total amount of time spent in all user programs. Two threads increment their local counters for their processes, then concurrently increment the global counter. Their increments interfere, and the recorded total time spent in all user processes is less than the sum of the local times.


So, programmers need to coordinate the activities of the multiple threads so that these bad things don't happen. Key mechanism: synchronization operations. These operations allow threads to control the timing of their events relative to events in other threads. Appropriate use allows programmers to avoid problems like the ones outlined above.

Thread Creation, Manipulation and Synchronization

We first must postulate a thread creation and manipulation interface. Will use the one in Nachos:

class Thread {
public:
    Thread(char* debugName);
    ~Thread();
    void Fork(void (*func)(int), int arg);
    void Yield();
    void Finish();
};

The Thread constructor creates a new thread. It allocates a data structure with space for the TCB.

• To actually start the thread running, must tell it what function to start running when it runs. The Fork method gives it the function and a parameter to the function.

• What does Fork do? It first allocates a stack for the thread. It then sets up the TCB so that when the thread starts running, it will invoke the function and pass it the correct parameter. It then puts the thread on a run queue someplace. Fork then returns, and the thread that called Fork continues.

• How does OS set up TCB so that the thread starts running at the function? First, it sets the stack pointer in the TCB to the stack. Then, it sets the PC in the TCB to be the first instruction in the function. Then, it sets the register in the TCB holding the first parameter to the parameter. When the thread system restores the state from the TCB, the function will magically start to run.

• The system maintains a queue of runnable threads. Whenever a processor becomes idle, the thread scheduler grabs a thread off of the run queue and runs the thread.

• Conceptually, threads execute concurrently. This is the best way to reason about the behavior of threads. But in practice, the OS only has a finite number of processors, and it can't run all of the runnable threads at once. So, must multiplex the runnable threads on the finite number of processors.

Let's do a few thread examples. First example: two threads that increment a variable.

int a = 0;

void sum(int p) {
    a++;
    print("%d : a = %d\n", p, a);
}

void main() {
    Thread *t = new Thread("child");
    t->Fork(sum, 1);
    sum(0);
}

• The two calls to sum run concurrently. What are the possible results of the program? To understand this fully, we must break the sum subroutine up into its primitive components.

• Sum first reads the value of a into a register. It then increments the register, then stores the contents of the register back into a. It then reads the values of the control string, p and a into the registers that it uses to pass arguments to the print routine. It then calls print, which prints out the data.

• The best way to understand the instruction sequence is to look at the generated assembly language (cleaned up just a bit). You can have the compiler generate assembly code instead of object code by giving it the -S flag. It will put the generated assembly in the same file name as the .c or .cc file, but with a .s suffix.
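For example, on a typical Unix system:

cc -S sum.cc        # leaves the generated assembly in sum.s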

la a, %r0          ! load the address of a into %r0
ld [%r0], %r1      ! load the value of a into %r1
add %r1, 1, %r1    ! increment it
st %r1, [%r0]      ! store the new value back into a
ld [%r0], %o3      ! Parameters are passed starting with %o0
mov %o0, %o1       ! p becomes the second argument to print
la .L17, %o0       ! the address of the format string is the first argument
call print

• So when the two calls execute concurrently, the result depends on how the instructions interleave. What are the possible results? Each of the following pairs shows the two printed lines in the order they appear:

0: 1, 1: 2        0: 1, 1: 1
1: 2, 0: 1        1: 1, 0: 1
1: 1, 0: 2        0: 2, 1: 2
0: 2, 1: 1        1: 2, 0: 2

So the results are nondeterministic - you may get different results when you run the program more than once. So, it can be very difficult to reproduce bugs. Nondeterministic execution is one of the things that make writing parallel programs much more difficult than writing serial programs.

• Chances are, the programmer is not happy with all of the possible results listed above. Probably we wanted the value of a to be 2 after both threads finish. To achieve this, we must make the increment operation atomic. That is, we must prevent the interleaving of the instructions in a way that would interfere with the additions.

• Concept of atomic operation. An atomic operation is one that executes without any interference from other operations - in other words, it executes as one unit. Typically build complex atomic operations up out of sequences of primitive operations. In our case the primitive operations are the individual machine instructions.

• More formally, if several atomic operations execute, the final result is guaranteed to be the same as if the operations executed in some serial order.

• In our case above, build an increment operation up out of loads, stores and add machine instructions. Want the increment operation to be atomic.

• Use synchronization operations to make code sequences atomic. First synchronization abstraction: semaphores. A semaphore is, conceptually, a counter that supports two atomic operations, P and V. Here is the Semaphore interface from Nachos:

class Semaphore {
public:
    Semaphore(char* debugName, int initialValue);
    ~Semaphore();
    void P();
    void V();
};

• Here is what the operations do:


o Semaphore (name, count): creates a semaphore and initializes the counter to count.

o P (): Atomically waits until the counter is greater than 0, then decrements the counter and returns.

o V (): Atomically increments the counter.
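These semantics can be pictured with standard C++ primitives. This is a sketch for intuition only, not the Nachos implementation:

#include <mutex>
#include <condition_variable>

class CountingSemaphore {
public:
    explicit CountingSemaphore(int initialValue) : count(initialValue) {}
    void P() {                        /* wait until count > 0, then decrement */
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return count > 0; });
        --count;
    }
    void V() {                        /* increment and wake one waiter */
        std::lock_guard<std::mutex> lk(m);
        ++count;
        cv.notify_one();
    }
private:
    std::mutex m;
    std::condition_variable cv;
    int count;
};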

• Here is how we can use the semaphore to make the sum example work:

int a = 0;
Semaphore *s;

void sum(int p) {
    int t;
    s->P();
    a++;
    t = a;
    s->V();
    print("%d : a = %d\n", p, t);
}

void main() {
    Thread *t = new Thread("child");
    s = new Semaphore("s", 1);
    t->Fork(sum, 1);
    sum(0);
}

• We are using semaphores here to implement a mutual exclusion mechanism. The idea behind mutual exclusion is that only one thread at a time should be allowed to do something. In this case, only one thread should access a. Use mutual exclusion to make operations atomic. The code that performs the atomic operation is called a critical section.

• Semaphores do much more than mutual exclusion. They can also be used to synchronize producer/consumer programs. The idea is that the producer is generating data and the consumer is consuming data. So a Unix pipe has a producer and a consumer. You can also think of a person typing at a keyboard as a producer and the shell program reading the characters as a consumer.

• Here is the synchronization problem: make sure that the consumer does not get ahead of the producer. But, we would like the producer to be able to produce without waiting for the consumer to consume. Can use semaphores to do this. Here is how it works:


Semaphore *s;

void consumer(int dummy) {
    while (1) {
        s->P();
        consume the next unit of data
    }
}

void producer(int dummy) {
    while (1) {
        produce the next unit of data
        s->V();
    }
}

void main() {
    s = new Semaphore("s", 0);
    Thread *t = new Thread("consumer");
    t->Fork(consumer, 1);
    t = new Thread("producer");
    t->Fork(producer, 1);
}

In some sense the semaphore is an abstraction of the collection of data.

• In the real world, pragmatics intrude. If we let the producer run forever and never run the consumer, we have to store all of the produced data somewhere. But no machine has an infinite amount of storage. So, we want to let the producer get ahead of the consumer if it can, but only by a given amount. We need to implement a bounded buffer which can hold only N items. If the bounded buffer is full, the producer must wait before it can put any more data in.

Semaphore *full;
Semaphore *empty;

void consumer(int dummy) {
    while (1) {
        full->P();
        consume the next unit of data
        empty->V();
    }
}

void producer(int dummy) {
    while (1) {
        empty->P();
        produce the next unit of data
        full->V();
    }
}

void main() {
    empty = new Semaphore("empty", N);
    full = new Semaphore("full", 0);
    Thread *t = new Thread("consumer");
    t->Fork(consumer, 1);
    t = new Thread("producer");
    t->Fork(producer, 1);
}

An example of where you might use a producer and consumer in an operating system is the console (a device that reads and writes characters from and to the system console). You would probably use semaphores to make sure you don't try to read a character before it is typed.

• Semaphores are one synchronization abstraction. There is another called locks and condition variables.

• Locks are an abstraction specifically for mutual exclusion only. Here is the Nachos lock interface:

class Lock {
public:
    Lock(char* debugName);  // initialize lock to be FREE
    ~Lock();                // deallocate lock
    void Acquire();         // these are the only operations on a lock
    void Release();         // they are both *atomic*
};

• A lock can be in one of two states: locked and unlocked. Semantics of lock operations:

o Lock (name): creates a lock that starts out in the unlocked state.

o Acquire (): Atomically waits until the lock state is unlocked, then sets the lock state to locked.

o Release (): Atomically changes the lock state to unlocked from locked.
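For example, to protect a shared counter with this interface (a small sketch of our own):

Lock *countLock = new Lock("count lock");
int count = 0;

void increment() {
    countLock->Acquire();   /* at most one thread gets past this at a time */
    count++;                /* the critical section */
    countLock->Release();
}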

In assignment 1 you will implement locks in Nachos on top of semaphores.


• What are requirements for a locking implementation?

o Only one thread can acquire lock at a time. (safety)

o If multiple threads try to acquire an unlocked lock, one of the threads will get it. (liveness)

o All unlocks complete in finite time. (liveness)

• What are desirable properties for a locking implementation?

o Efficiency: take up as little resources as possible.

o Fairness: threads acquire the lock in the order they ask for it. There are also weaker forms of fairness.

o Simple to use.

• When using locks, you typically associate a lock with pieces of data that multiple threads access. When one thread wants to access a piece of data, it first acquires the lock. It then performs the access, then unlocks the lock. So, the lock allows threads to perform complicated atomic operations on each piece of data.

• Can you implement an unbounded buffer using only locks? There is a problem - if the consumer wants to consume a piece of data before the producer produces the data, it must wait. But locks do not allow the consumer to wait until the producer produces the data. So, the consumer must loop until the data is ready. This is bad because it wastes CPU resources.

• There is another synchronization abstraction called condition variables just for this kind of situation. Here is the Nachos interface:

class Condition {
public:
    Condition(char* debugName);
    ~Condition();
    void Wait(Lock *conditionLock);
    void Signal(Lock *conditionLock);
    void Broadcast(Lock *conditionLock);
};

• Semantics of condition variable operations:

o Condition (name): creates a condition variable.

o Wait (Lock *l): Atomically releases the lock and waits. When Wait returns the lock will have been reacquired.


o Signal (Lock *l): Enables one of the waiting threads to run. When Signal returns the lock is still acquired.

o Broadcast (Lock *l): Enables all of the waiting threads to run. When Broadcast returns the lock is still acquired.

• All operations on a given condition variable must use the same lock. In assignment 1 you will implement condition variables in Nachos on top of semaphores.

• Typically, you associate a lock and a condition variable with a data structure. Before the program performs an operation on the data structure, it acquires the lock. If it has to wait before it can perform the operation, it uses the condition variable to wait for another operation to bring the data structure into a state where it can perform the operation. In some cases you need more than one condition variable.
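The canonical shape of such an operation looks like this (a sketch; data_ready stands for whatever predicate the operation needs):

l->Acquire();
while (!data_ready)        /* re-check after every wakeup - see the Mesa discussion below */
    c->Wait(l);
/* ... operate on the data structure ... */
l->Release();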

• Let's say that we want to implement an unbounded buffer using locks and condition variables. In this case we have 2 consumers.

Lock *l;
Condition *c;
int avail = 0;

void consumer(int dummy) {
    while (1) {
        l->Acquire();
        if (avail == 0)
            c->Wait(l);
        consume the next unit of data
        avail--;
        l->Release();
    }
}

void producer(int dummy) {
    while (1) {
        l->Acquire();
        produce the next unit of data
        avail++;
        c->Signal(l);
        l->Release();
    }
}

void main() {
    l = new Lock("l");
    c = new Condition("c");
    Thread *t = new Thread("consumer");
    t->Fork(consumer, 1);
    t = new Thread("consumer");
    t->Fork(consumer, 2);
    t = new Thread("producer");
    t->Fork(producer, 1);
}

• There are two variants of condition variables: Hoare condition variables and Mesa condition variables. For Hoare condition variables, when one thread performs a Signal, the very next thread to run is the waiting thread. For Mesa condition variables, there are no guarantees when the signalled thread will run. Other threads that acquire the lock can execute between the signaler and the waiter. The example above will work with Hoare condition variables but not with Mesa condition variables.

• What is the problem with Mesa condition variables? Consider the following scenario: three threads, with thread 1 producing data and threads 2 and 3 consuming data.

o Thread 2 calls consumer, and suspends.

o Thread 1 calls producer, and signals thread 2.

o Instead of thread 2 running next, thread 3 runs next, calls consumer, and consumes the element. (Note: with Hoare monitors, thread 2 would always run next, so this would not happen.)

o Thread 2 runs, and tries to consume an item that is not there. Depending on the data structure used to store produced items, may get some kind of illegal access error.

• How can we fix this problem? Replace if with a while.

void consumer(int dummy) {
    while (1) {
        l->Acquire();
        while (avail == 0)
            c->Wait(l);
        consume the next unit of data
        avail--;
        l->Release();
    }
}

In general, this is a crucial point. Always put a while loop around your condition variable wait. If you don't, you can get really obscure bugs that show up very infrequently.

• In this example, what is the data that the lock and condition variable are associated with? The avail variable.

• People have developed a programming abstraction that automatically associates locks and condition variables with data. This abstraction is called a monitor. A monitor is a data structure plus a set of operations (sort of like an abstract data type). The monitor also has a lock and, optionally, one or more condition variables.

• The compiler for the monitor language automatically inserts a lock operation at the beginning of each routine and an unlock operation at the end of the routine. So, programmer does not have to put in the lock operations.

• Monitor languages were popular in the middle 80's - they are in some sense safer because they eliminate one possible programming error. But more recent languages have tended not to support monitors explicitly, and expose the locking operations to the programmer. So the programmer has to insert the lock and unlock operations by hand. Java takes a middle ground - it supports monitors, but also allows programmers to exert finer grain control over the locked sections by supporting synchronized blocks within methods. But synchronized blocks still present a structured model of synchronization, so it is not possible to mismatch the lock acquire and release.

• Laundromat Example: A local Laundromat has switched to a computerized machine allocation scheme. There are N machines, numbered 1 to N. By the front door there are P allocation stations. When you want to wash your clothes, you go to an allocation station and put in your coins. The allocation station gives you a number, and you use that machine. There are also P deallocation stations. When your clothes finish, you give the number back to one of the deallocation stations, and someone else can use the machine. Here is the alpha release of the machine allocation software:

allocate(int dummy) {
    while (1) {
        wait for coins from user
        n = get();
        give number n to user
    }
}

deallocate(int dummy) {
    while (1) {
        wait for number n from user
        put(n);
    }
}

main() {
    for (i = 0; i < P; i++) {
        t = new Thread("allocate");
        t->Fork(allocate, 0);
        t = new Thread("deallocate");
        t->Fork(deallocate, 0);
    }
}

• The key parts of the scheduling are done in the two routines get and put, which use an array data structure a to keep track of which machines are in use and which are free.

int a[N];

int get() {
    for (i = 0; i < N; i++)
        if (a[i] == 0) {
            a[i] = 1;
            return (i+1);
        }
}

void put(int i) {
    a[i-1] = 0;
}

• It seems that the alpha software isn't doing all that well. Just looking at the software, you can see that there are several synchronization problems.

• The first problem is that sometimes two people are assigned to the same machine. Why does this happen? We can fix this with a lock:

int a[N];
Lock *l;

int get() {
    l->Acquire();
    for (i = 0; i < N; i++)
        if (a[i] == 0) {
            a[i] = 1;
            l->Release();
            return (i+1);
        }
    l->Release();
}

void put(int i) {
    l->Acquire();
    a[i-1] = 0;
    l->Release();
}

So now we have fixed the multiple-assignment problem. But what happens if someone comes into the laundry when all of the machines are already taken? What does get return? We must fix it so that the system waits until there is a machine free before it returns a number. The situation calls for condition variables.

int a[N];
Lock *l;
Condition *c;

int get() {
    l->Acquire();
    while (1) {
        for (i = 0; i < N; i++)
            if (a[i] == 0) {
                a[i] = 1;
                l->Release();
                return (i+1);
            }
        c->Wait(l);
    }
}

void put(int i) {
    l->Acquire();
    a[i-1] = 0;
    c->Signal(l);
    l->Release();
}

• What data is the lock protecting? The a array.


• When would you use a broadcast operation? Whenever you want to wake up all waiting threads, not just one - for an event that happens only once. For example, a bunch of threads may wait until a file is deleted. The thread that actually deleted the file could use a broadcast to wake up all of the threads.

• Also use a broadcast for allocation/deallocation of variable sized units. Example: concurrent malloc/free.

Lock *l;
Condition *c;

char *malloc(int s) {
    l->Acquire();
    while (cannot allocate a chunk of size s)
        c->Wait(l);
    allocate chunk of size s;
    l->Release();
    return pointer to allocated chunk;
}

void free(char *m) {
    l->Acquire();
    deallocate m;
    c->Broadcast(l);
    l->Release();
}

Example with malloc/free. Initially we start out with 10 bytes free.

Time  Process 1                 Process 2                     Process 3
 0    malloc(10) - succeeds     malloc(5) - suspends on lock  malloc(5) - suspends on lock
 1                              gets lock - waits
 2                                                            gets lock - waits
 3    free(10) - broadcast
 4                              resume malloc(5) - succeeds
 5                                                            resume malloc(5) - succeeds
 6    malloc(7) - waits
 7                                                            malloc(3) - waits
 8                              free(5) - broadcast
 9    resume malloc(7) - waits
10                                                            resume malloc(3) - succeeds

What would happen if we changed c->Broadcast(l) to c->Signal(l)? At step 10, process 3 would not wake up, and it would not get the chance to allocate available memory. What would happen if we changed the while loop to an if?

• You will be asked to implement condition variables as part of assignment 1. The following implementation is INCORRECT. Please do not turn this implementation in.

class Condition {
private:
    int waiting;
    Semaphore *sema;
};

void Condition::Wait(Lock* l) {
    waiting++;
    l->Release();
    sema->P();
    l->Acquire();
}

void Condition::Signal(Lock* l) {
    if (waiting > 0) {
        sema->V();
        waiting--;
    }
}

As we mentioned earlier, in many respects threads operate in the same way as processes. Some of the similarities and differences are:


Similarities

• Like processes, threads share the CPU, and only one thread at a time is active (running).

• Like processes, threads within a process execute sequentially.

• Like processes, threads can create children.

• And like processes, if one thread is blocked, another thread can run.

Differences

• Unlike processes, threads are not independent of one another.

• Unlike processes, all threads can access every address in the task.

• Unlike processes, threads are designed to assist one another. Note that processes might or might not assist one another, because processes may originate from different users.

Why Threads?

Following are some reasons why we use threads in designing operating systems.

1. A process with multiple threads makes a great server, for example a print server.

2. Because threads can share common data, they do not need to use interprocess communication.

3. By their very nature, threads can take advantage of multiprocessors.

Threads are cheap in the sense that:

1. They only need a stack and storage for registers; therefore, threads are cheap to create.

2. Threads use very few resources of the operating system in which they are working. That is, threads do not need a new address space, global data, program code or operating system resources.

3. Context switching is fast when working with threads, because we only have to save and/or restore the PC, SP and registers.

But this cheapness does not come free - the biggest drawback is that there is no protection between threads.

User Level Threads and Kernel Level Threads

User-Level Threads

User-level threads are implemented in user-level libraries, rather than via system calls, so thread switching does not need to call the operating system or cause an interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and manages them as if they were single-threaded processes.

Advantages:


The most obvious advantage of this technique is that a user-level threads package can be implemented on an Operating System that does not support threads. Some other advantages are

• User-level threads do not require modifications to the operating system.

• Simple Representation:

Each thread is represented simply by a PC, registers, stack and a small control block, all stored in the user process address space.

• Simple Management:

This simply means that creating a thread, switching between threads and synchronization between threads can all be done without intervention of the kernel.

• Fast and Efficient:

Thread switching is not much more expensive than a procedure call.

Disadvantages:

• There is a lack of coordination between threads and the operating system kernel. Therefore, the process as a whole gets one time slice irrespective of whether it has one thread or 1000 threads within it. It is up to each thread to relinquish control to the other threads.

• User-level threads require non-blocking system calls, i.e., a multithreaded kernel. Otherwise, the entire process will block in the kernel, even if there are runnable threads left in the process. For example, if one thread causes a page fault, the whole process blocks.

Kernel-Level Threads

In this method, the kernel knows about and manages the threads. No runtime system is needed in this case. Instead of a thread table in each process, the kernel has a thread table that keeps track of all threads in the system. In addition, the kernel also maintains the traditional process table to keep track of processes. The operating system kernel provides system calls to create and manage threads.

Advantages:

• Because the kernel has full knowledge of all threads, the scheduler may decide to give more time to a process having a large number of threads than to a process having a small number of threads.

• Kernel-level threads are especially good for applications that frequently block.

Disadvantages:

• Kernel-level threads are slow and inefficient. For instance, kernel thread operations are hundreds of times slower than those of user-level threads.

• Since the kernel must manage and schedule threads as well as processes, it requires a full thread control block (TCB) for each thread to maintain information about threads. As a result there is significant overhead and increased kernel complexity.


Advantages of Threads over Multiple Processes

• Context Switching: threads are very inexpensive to create and destroy, and they are inexpensive to represent. For example, they require space to store the PC, the SP, and the general-purpose registers, but they do not require space for shared memory information, information about open files or I/O devices in use, etc. With so little context, it is much faster to switch between threads; in other words, a context switch using threads is relatively easy.

• Sharing: threads allow the sharing of many resources that cannot be shared between processes - for example, the code section, the data section, and operating system resources such as open files.

Disadvantages of Threads over Multiprocesses

• Blocking: the major disadvantage is that if the kernel is single-threaded, a system call by one thread will block the whole process, and the CPU may be idle during the blocking period.

• Security: since there is extensive sharing among threads, there is a potential security problem. It is quite possible that one thread overwrites the stack of another thread (or damages shared data), although it is very unlikely, since threads are meant to cooperate on a single task.

Application that Benefits from Threads

A proxy server satisfying the requests for a number of computers on a LAN would benefit from a multi-threaded process. In general, any program that has to do more than one task at a time could benefit from multitasking. For example, a program that reads input, processes it, and writes output could have three threads, one for each task.

Application that cannot benefit from Threads

Any sequential process that cannot be divided into parallel tasks will not benefit from threads, as the threads would each block until the previous one completes. For example, a program that displays the time of day would not benefit from multiple threads.

Resources used in Thread Creation and Process Creation

When a new thread is created it shares its code section, data section and operating system resources like open files with other threads. But it is allocated its own stack, register set and a program counter.

The creation of a new process differs from that of a thread mainly in the fact that all the shared resources of a thread are needed explicitly for each process. So though two processes may be running the same piece of code, they need to have their own copy of the code in main memory to be able to run. Two processes also do not share other resources with each other. This makes the creation of a new process very costly compared to that of a new thread.

Context Switch

To give each process on a multiprogrammed machine a fair share of the CPU, a hardware clock generates interrupts periodically. This allows the operating system to schedule all processes in main memory (using a scheduling algorithm) to run on the CPU at equal intervals. Each time a clock interrupt occurs, the interrupt handler checks how much time the current running process has used. If it has used up its entire time slice, then the CPU scheduling algorithm (in the kernel) picks a different process to run. Each switch of the CPU from one process to another is called a context switch.

Major Steps of Context Switching

• The values of the CPU registers are saved in the process table of the process that was running just before the clock interrupt occurred.

• The registers are loaded from the process picked by the CPU scheduler to run next.

In a multiprogrammed uniprocessor computing system, context switches occur frequently enough that all processes appear to be running concurrently. If a process has more than one thread, the Operating System can use the context switching technique to schedule the threads so they appear to execute in parallel. This is the case if threads are implemented at the kernel level. Threads can also be implemented entirely at the user level in run-time libraries. Since in this case no thread scheduling is provided by the Operating System, it is the responsibility of the programmer to yield the CPU frequently enough in each thread so all threads in the process can make progress.

Action of Kernel to Context Switch among Threads

The threads share a lot of resources with other peer threads belonging to the same process, so a context switch among threads of the same process is easy. It involves switching the register set, the program counter and the stack. It is relatively easy for the kernel to accomplish this task.

Action of kernel to Context Switch among Processes

Context switches among processes are expensive. Before a process can be switched its process control block (PCB) must be saved by the operating system. The PCB consists of the following information:

• The process state.

• The program counter, PC.

• The values of the different registers.

• The CPU scheduling information for the process.

• Memory management information regarding the process.

• Possible accounting information for this process.


• I/O status information of the process.

When the PCB of the currently executing process has been saved, the operating system loads the PCB of the next process that has to be run on the CPU. This is a heavy task and it takes a lot of time.

User Threads

User threads are implemented entirely in user space. The programmer of the thread library writes code to synchronize threads and to context switch them, and they all run in one process. The operating system is unaware that a thread system is even running.

User-level threads replicate some amount of kernel level functionality in user space. Examples of user-level threads systems are Nachos and Java (on OSes that don’t support kernel threads).

Because the OS treats the running process like any other there is no additional kernel overhead for user-level threads. However, the user-level threads only run when the OS has scheduled their underlying process (making a blocking system call blocks all the threads.)

Kernel Threads

Some OS kernels support the notion of threads and schedule them directly. There are system calls to create threads and manipulate them in ways similar to processes. Synchronization and scheduling may be provided by the kernel.

Kernel-level threads have more overhead in the kernel (a kernel thread control block) and more overhead in their use (manipulating them requires a system call). However the abstraction is cleaner (threads can make system calls independently).


CHAPTER 5

THE CENTRAL PROCESSING UNIT (CPU)

This chapter gives some more detail on the Central Processing Unit (CPU) and leads up to where we can write significant programs in assembly/machine code. First we will give an overview of how a processor and memory function together to execute a single machine instruction - the famous fetch-decode-execute cycle.
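To fix ideas, here is the cycle sketched in C for a hypothetical 16-bit accumulator machine. Everything here is illustrative: the 4-bit opcode layout and the Mac-1-style opcode names are our assumptions, not actual Mic-1 microcode:

unsigned short memory[4096];           /* 4096 16-bit words */
unsigned short pc = 0, ir, ac = 0;     /* program counter, instruction register, accumulator */
const unsigned short AMASK = 0x0FFF;   /* masks off the address part of an instruction */

enum { LODD = 0, STOD = 1, ADDD = 2, HALT = 15 };   /* illustrative opcodes */

void cpu(void) {
    for (;;) {
        ir = memory[pc++];             /* fetch the instruction, advance the PC */
        int opcode  = ir >> 12;        /* decode: top 4 bits are the opcode */
        int address = ir & AMASK;      /*         bottom 12 bits are the address */
        switch (opcode) {              /* execute */
        case LODD: ac = memory[address]; break;       /* load direct */
        case STOD: memory[address] = ac; break;       /* store direct */
        case ADDD: ac = ac + memory[address]; break;  /* add direct */
        case HALT: return;
        }
    }
}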

A CPU consists of three major parts:

1. The internal registers, the ALU and the connecting buses - sometimes called the data path;

2. The input-output interface, which is the gateway through which data are sent and received from main memory and input-output devices;

3. The control part, which directs the activities of the data path and input-output interface, e.g. opening and closing access to buses, selecting ALU function, etc. We will avoid going into much detail about the control.

A fourth part, main memory, is never far from the CPU but from a logical point of view is best kept separate.

We will pay most attention to the data path part of the processor, and what must happen in it to cause useful things to happen - to cause program instructions to be executed.

In the system we describe, the control part is implemented by microprogram, i.e. how the fetching, decoding and execution of a machine instruction can be implemented by execution of a set of sequencing steps called a microprogram. Note on terminology: the term microprogram was devised in the early 1950s long before microprocessors were ever dreamt of.

The Architecture of Mic-1

Figure 6.1 shows the data path part of our hypothetical CPU from (Tanenbaum, 1990), page 170 onwards. Here, we briefly describe the components of Figure 6.1. Then we give a qualitative discussion of how it executes program instructions. Finally we describe the execution of instructions in some detail.

Figure 6.1: Mic-1 CPU (from Tanenbaum, Structured Computer Organisation, 3rd ed.)


Registers

There are 16 identical 16-bit registers, but they are not general purpose; each has a special use:

PC, program counter: The PC points to the memory location that holds the next instruction to be executed;

AC, accumulator: The accumulator is like the display register in a calculator; most operations use it implicitly as an unmentioned input, and the result of any operation is placed in it. For now, we can ignore all the others, though we give brief descriptions below.

SP, stack-pointer: Used for maintaining a data area called the stack; the stack is used for remembering where we came from when we call subprograms; likewise for remembering data when an interrupt is being processed; it is also used as a communication medium for passing data to subprograms; finally, it is used as storage area for local variables in subprograms;

IR, Instruction Register: Holds the instruction (the actual instruction data) currently being executed.

TIR, Temporary Instruction Register: Holds temporary versions of the instruction while it is being decoded.

0, +1, -1: Constants; it is handy to have copies of them close by - avoids wasting time accessing main memory.

AMASK: Another constant, used for masking (ANDing) out the address part of the instruction, i.e. AMASK AND IR gives the address.

SMASK: ditto for stack (relative) addresses.

A, B, ..., F: General purpose registers, but general purpose only for the microprogrammer; i.e. assembly language cannot address them.

Internal Buses

There are three internal buses, A and B (source) buses and C (destination) bus.

External Buses

The address bus and the data bus. A minor point to note: many buses, in particular those in many of the Intel 80X86 family, use the same physical bus (connections) for both address and data; it's simple to do - the control part of the bus just has to make sure all users of the bus know when it carries data and when it carries an address.

Latches

A and B latches hold stable versions of the A and B buses. There would be problems if, for example, AC was connected straight into the A input of the ALU while, meanwhile, the output of the ALU was connected back to AC: which version of AC should be used? The answer would be continuously changing; the latches hold a stable copy while the operation completes.

A-Multiplexer (AMUX)


The ALU input A can be fed with either: (i) the contents of the A latch; or (ii) the contents of MBR, i.e. what was originally the contents of a memory location.

ALU

In Mac-1a the ALU may perform just one of four functions:

0: A + B; note `plus' (arithmetic addition), rather than OR;

1: A AND B;

2: A, i.e. A passes straight through and B is ignored;

3: NOT A, the bitwise complement of A.

Any other functions have to be programmed.

Shifter

The shifter is not a register - it passes the ALU output straight through, either shifted left, shifted right, or not shifted at all.

Memory Address Register (MAR) and Memory Buffer Register (MBR) and Memory

The MAR is a register which is used as a gateway - a `buffer' - onto the address bus. Likewise the MBR (it might be better to call this memory data register) for the data bus.

The memory is considered to be a collection of cells or locations, each of which can be addressed individually, and thus written to or read from. Effectively, memory is like an array in C, Basic or any other high-level language. For brevity, we shall refer to this memory `array' as M, the address of a general cell as x, and so the contents of the cell at address x as M[x], or (x).

To read from a memory cell, the controller must cause the following to happen:

1. Put an address, x, in MAR;


2. Request a read - by asserting a read control line;

3. At some time later, the contents of M[x] appear in MBR, from where the controller can cause them to be ...

4. ... transferred to the AC or somewhere else.

To write to a memory cell, the controller must cause something similar to happen:

1. Put an address, x, in MAR;

2. Put the data in MBR;

3. Request a write - by asserting a write control line;

4. At some time later, the data arrive in memory cell M[x].
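The same two sequences are easy to mimic in software. The following C sketch (our own illustration - mar, mbr and mem are illustrative names, not part of any real machine) models reading and writing through the two gateway registers:

#include <stdio.h>
#include <stdint.h>

static uint16_t mem[4096];   /* the memory "array" M */
static uint16_t mar, mbr;    /* gateway registers onto the address and data buses */

static void mem_read(void)  { mbr = mem[mar]; }   /* read:  MBR gets M[MAR] */
static void mem_write(void) { mem[mar] = mbr; }   /* write: M[MAR] gets MBR */

int main(void)
{
    uint16_t ac;
    mar = 7; mbr = 55; mem_write();   /* write 55 to cell 7           */
    mar = 7; mem_read();              /* read cell 7 back into MBR... */
    ac = mbr;                         /* ...then transfer MBR to AC   */
    printf("AC = %u\n", ac);
    return 0;
}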

It is a feature of all general purpose computers that executable instructions and data occupy the same memory space. Often, programs are organised so that there are blocks of instructions and blocks of data. But, there is no fundamental reason, except tidiness and efficiency, why instructions and data cannot be mixed up together.

Register Transfer Language

To describe the details of operation of the CPU, we use a simple language called Register Transfer Language (RTL). The notation is as follows.

(x) denotes the contents of location x; sometimes written M[x], or even just x. Think of an envelope with £100 in it, and your address on it.

Reg denotes a register; Reg = PC, IR, AC, R1 or R2.

((Reg)) denotes the contents of the address contained in Reg. Think of an envelope containing another envelope.

We use <- to denote transfer: A <- B. Pronounce this as `A gets B'. In the case of A <- (x), we say `A gets contents of x'.
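For example (our own illustration), loading memory cell x into the accumulator is written AC <- (x), storing it back is (x) <- AC, and the fetch step of the instruction cycle can be written as the sequence MAR <- PC; read; IR <- MBR; PC <- PC + 1.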

Simple Model of a Computer - Part 3

Back in section 2.7, we produced a simple model of a computer. Here we show it again, Figure 6.2.


Figure 6.2: Mechanical Computer

At the end of section 2.7 we admitted that we had been telling only half the truth! And we admitted that we had to fit the program into memory as well. Fine, here goes. We're going to use the same program.

In this more realistic model, the person operating the CPU has no list of instructions available on the desk, but must read one instruction at a time from memory.

Recall what was needed: add the contents of memory cell 0 to the contents of memory cell 1, store the result in cell 2; if the result is greater-than-or-equal-to 40, put 1 in cell 3, otherwise put 0 in cell 3. (We are adding marks, and cell 3 contains an indicator of Pass (1) or Fail (0).)

And here is the program, with appropriate numerical codes (so that the instructions can be stored in memory). The numerically coded instruction is given in four hexadecimal digits; the first digit gives the operation required (load, add, store, ...) - the opcode; the last three digits give the address or data - the operand.

The opcodes are as follows:

Figure 6.3: Opcodes

I have to renumber the program steps from P1-P14 to P101 ..., for reasons which will soon become evident. Also, we will use hexadecimal numbering.


P101: Load contents of memory 0 into AC. Code: 0 000

P102: Add contents of memory 1 to contents of AC. Code: 2 001

P103: Store the contents of AC in memory 2. Code: 1 002

P104: Load the constant 40 into the AC. Code: 7 028 (40dec is 28Hex)

P105: Store the contents of AC in memory 4. Code: 1 004

P106: Load the contents of memory 2 into AC. Code: 0 002

P107: Subtract contents of memory 4 from contents of AC. Code: 3 004

P108: If AC is positive, jump to instruction P10c. Code: 4 10c

P109: Load the constant 0 into the AC. Code 7 000

P10a: Store the contents of AC in memory 3. Code 1 003

P10b: Jump to P10e. Code 6 10e

P10c: Load the constant 1 into the AC. Code 7 001

P10d: Store the contents of AC in memory 3. Code 1 003

P10e: Stop.

We now have to revise Figure 6.2 to show the program, Figure 6.4. The revisions are as follows:

• Show the additional memory (containing the program).

• Show a Program Counter (PC) register that keeps track of the address of the next instruction.

• Show the Instruction Register (IR); this tells the CPU operator what to do for the current step.

Figure 6.4: Mechanical Computer with Program


In this revised model, the CPU operator has no list of instructions on his/her desk (the CPU); he/she must go through the following cycle of steps for each instruction step:

Fetch: (a) Take the number in the PC; (b) place it in MAR; (c) shout "Bus"; (d) add one to the number in the PC - to make it point to the next step; (e) wait until a new number arrives in the MBR; (f) take the number in the MBR and put it in the Instruction Register;

Decode: (a) Take the number in IR; (b) Take the top digit (opcode), look it up in Figure 6.3, and see what has to be done; (c) take the number in the bottom three digits - this signifies the operand.

Execute: Perform the action required. E.g. Add contents of memory 1 to contents of AC (2 001). Opcode is 2, operand is 001. We've already done this: (a) write 1 on a piece of paper and place it in MAR; (b) put a tick against Read; (c) shout "Bus"; (d) some time later, the contents of cell 1 (33) will arrive in MBR; (e) look at what is in AC and in MBR, use the calculator to add them (22 + 33); (f) write down a copy of the result and put it in AC. Thus, in the case shown, a piece of paper with 55 on it would be put in to AC.

If the operation is a jump, then all the operator does is take the operand (the jump-to address) and place it in the PC - thus stopping the PC pointing to the next instruction in sequence.

There we have it. The famous fetch-decode-execute cycle. The CPU is a pretty busy place!

The Fetch-Decode-Execute Cycle

How do the CPU and its controller execute a sequence of instructions? Let us start by considering the execution of the instruction at location 0x100; what follows is an endless loop of the so-called fetch-decode-execute cycle.

Fetch: Read the next instruction and put it in the Instruction Register. Point to the next instruction, ready for the next Fetch.

Decode: Figure out what the instruction means;

Execute: Do what is required by that instruction; if it is a JUMP type instruction, then update the PC to point to the jumped-to instruction. Go back to Fetch.
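To make the cycle concrete, here is a minimal C sketch of an interpreter for the opcodes used in the worked example above (0 = load, 1 = store, 2 = add, 3 = subtract, 4 = jump if AC >= 0, 6 = jump, 7 = load constant). It is our own illustration, not the full Mac-1 definition; in particular we invent opcode 0xF as a stop code, since none was given above, and we take the conditional jump as "AC >= 0" because the worked example must pass a total of exactly 40.

#include <stdint.h>

static uint16_t mem[4096];    /* 12-bit address space                 */
static int16_t  ac;           /* accumulator                          */
static uint16_t pc = 0x101;   /* matches step numbering P101... above */

void run(void)
{
    for (;;) {
        uint16_t ir = mem[pc++];     /* FETCH, then point past it      */
        uint16_t op = ir >> 12;      /* DECODE: top hex digit = opcode */
        uint16_t x  = ir & 0x0FFF;   /* bottom three digits = operand  */
        switch (op) {                /* EXECUTE */
        case 0x0: ac = (int16_t)mem[x];   break;  /* load direct        */
        case 0x1: mem[x] = (uint16_t)ac;  break;  /* store              */
        case 0x2: ac += (int16_t)mem[x];  break;  /* add                */
        case 0x3: ac -= (int16_t)mem[x];  break;  /* subtract           */
        case 0x4: if (ac >= 0) pc = x;    break;  /* jump if AC >= 0    */
        case 0x6: pc = x;                 break;  /* unconditional jump */
        case 0x7: ac = (int16_t)x;        break;  /* load constant      */
        default:  return;                         /* 0xF: our stop code */
        }
    }
}

Loading the codes listed above into mem[0x101] onwards (0x0000, 0x2001, 0x1002, ...) and calling run() reproduces the paper-and-pencil trace.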

Instruction Set

We now examine the instruction set by which assembly programmers can program the machine. We will call the machine Mac-1a; Mac-1a is a restricted version of Tanenbaum's Mac-1. The main characteristics of Mac-1a are: a data word length of 16 bits and an address size of 12 bits.

Exercise. What is the maximum number of words we can have in the main memory of Mac-1a (neglecting memory-mapped input-output)? How many bytes?

There are two addressing modes: immediate and direct; we will neglect Tanenbaum's local and indirect for the meanwhile.

It is accumulator based: that is, everything is done through AC; thus, `Add' is done as follows: put operand 1 in AC, add the memory location to it, and the result is put in AC; if necessary, i.e. if we want to retain the result, the contents of the AC are then copied to memory.


The Mac-1a programmer has no access to the PC or other CPU registers. Also, for present purposes, assume that SP does not exist. A limited version of the Mac-1 instruction set is shown in Figure 6.5. The columns are as follows:

Binary code for instruction: I.e. what the instruction looks like in computer memory - machine code.

Mnemonic: The name given to the instruction. Used when coding in assembly code.

Long name: Descriptive name for instruction.

Action: What the instruction actually does, described formally in register transfer language (RTL).


Figure 6.5: Mac-1a Instruction Set (limited version of Mac-1)

Microprogram Control versus Hardware Control

Control of the CPU - fetch, decode, execute - is done by a microcontroller which obeys a program of microinstructions. We might think of the microcontroller as a black-box such as that shown in Figure 6.6. The microcontroller has a set of inputs and a set of outputs - just like any other circuit, ALU, multiplexer, etc. Therefore, instead of microprogramming, it can be made from logic hardware.

Figure 6.6: Controller Black-box, either Microcontroller or Logic

To design the circuit, all you have to do is prepare a truth-table (6 input columns - the op-code (4 bits) plus the N and Z flags - and 22 output columns), and generate the logic.

There is no reason why this hardware circuit could not decode an instruction in ONE clock period, i.e. a lot faster than the microcode solution.

The microprogrammed solution allows arbitrarily complex instructions to be built up. It may also be more flexible: for example, there were many machines that users could microprogram themselves; and there were computers which differed only by their microcode, perhaps one optimised for execution of C programs, another for COBOL programs.

On the other hand, if implemented on a chip, control store takes up a lot of chip space. And, as you can see by examining (Tanenbaum, 1990), microcode interpretation may be relatively slow - and gets slower the more instructions there are.


Figure 6.7 shows the full Mac-1 CPU with its microcontroller unit.

Figure 6.7: Mac-1 CPU including control (from Tanenbaum, Structured Computer Organisation, 3rd ed.)

CISC versus RISC

Machines with large sets of complex (and perhaps slow) instructions (implemented with microcode), are called CISC - complex instruction set computer.

Those with small sets of relatively simple instructions, probably implemented in logic, are called RISC - reduced instruction set computer.

Most early machines - before about 1965 - were RISC. Then the fashion switched to CISC. Now the fashion is switching back to RISC, albeit with some special go-faster features that were not present on early RISC.

CISC machines are easier to program in machine and assembly code (see next chapter), because they have a richer set of instructions. But nowadays fewer and fewer programmers use assembly code, and compilers are becoming better. It comes down to a trade-off: complexity of `silicon' (microcode and CISC) or complexity of software (highly efficient optimising compilers and RISC).

CPU Scheduling

What is CPU scheduling? Determining which processes run when there are multiple runnable processes. Why is it important? Because it can have a big effect on resource utilization and the overall performance of the system.

By the way, the world went through a long period (late 80's, early 90's) in which the most popular operating systems (DOS, Mac) had NO sophisticated CPU scheduling algorithms. They were single threaded and ran one process at a time until the user directed them to run another process. Why was this true? More recent systems (Windows NT) are back to having sophisticated CPU scheduling algorithms. What drove the change, and what will happen in the future?

Basic assumptions behind most scheduling algorithms:

• There is a pool of runnable processes contending for the CPU.

• The processes are independent and compete for resources.

• The job of the scheduler is to distribute the scarce resource of the CPU to the different processes ``fairly'' (according to some definition of fairness) and in a way that optimizes some performance criteria.

In general, these assumptions are starting to break down. First of all, CPUs are not really that scarce - almost everybody has several, and pretty soon people will be able to afford lots. Second, many applications are starting to be structured as multiple cooperating processes. So, a view of the scheduler as mediating between competing entities may be partially obsolete.

How do processes behave? First, consider the CPU/IO burst cycle. A process will run for a while (the CPU burst), perform some IO (the IO burst), then run for a while more (the next CPU burst). How long between IO operations? It depends on the process.

• IO Bound processes: processes that perform lots of IO operations. Each IO operation is followed by a short CPU burst to process the IO, then more IO happens.

• CPU bound processes: processes that perform lots of computation and do little IO. Tend to have a few long CPU bursts.

One of the things a scheduler will typically do is switch the CPU to another process when one process does IO. Why? The IO will take a long time, and we don't want to leave the CPU idle while waiting for the IO to finish.

When we look at CPU burst times across the whole system, we see the exponential or hyperexponential distribution in Fig. 5.2.

What are possible process states?

• Running - process is running on CPU.

• Ready - ready to run, but not actually running on the CPU.

• Waiting - waiting for some event like IO to happen.


When do scheduling decisions take place? When does the CPU choose which process to run? There are a variety of possibilities:

• When a process switches from running to waiting. This could be because of an IO request, a wait for a child to terminate, or a wait for a synchronization operation (like lock acquisition) to complete.

• When a process switches from running to ready - on completion of an interrupt handler, for example. A common example of an interrupt handler is the timer interrupt in interactive systems. If the scheduler switches processes in this case, it has preempted the running process. Another common interrupt handler is the IO completion handler.

• When process switches from waiting to ready state (on completion of IO or acquisition of a lock, for example).

• When a process terminates.

How to evaluate scheduling algorithm? There are many possible criteria:

• CPU Utilization: Keep CPU utilization as high as possible. (What is utilization, by the way?).

• Throughput: number of processes completed per unit time.

• Turnaround Time: mean time from submission to completion of process.

• Waiting Time: Amount of time spent ready to run but not running.

• Response Time: Time between submission of requests and first response to the request.

• Scheduler Efficiency: The scheduler doesn't perform any useful work, so any time it takes is pure overhead. So, we need to make the scheduler very efficient.

Big difference: Batch and Interactive systems. In batch systems, typically want good throughput or turnaround time. In interactive systems, both of these are still usually important (after all, want some computation to happen), but response time is usually a primary consideration. And, for some systems, throughput or turnaround time is not really relevant - some processes conceptually run forever.

Difference between long and short term scheduling. The long term scheduler is given a set of processes and decides which ones should start to run. Once they start running, they may suspend because of IO or because of preemption. The short term scheduler decides which of the jobs the long term scheduler has admitted as runnable should actually run next.

Let's start looking at several vanilla scheduling algorithms.

First-Come, First-Served. There is one ready queue; the OS runs the process at the head of the queue, and new processes come in at the end of the queue. A process does not give up the CPU until it either terminates or performs IO.

Consider the performance of the FCFS algorithm for three compute-bound processes: P1 (takes 24 seconds), P2 (takes 3 seconds) and P3 (takes 3 seconds). If they arrive in order P1, P2, P3, what is the

• Waiting Time? (0 + 24 + 27) / 3 = 17


• Turnaround Time? (24 + 27 + 30) / 3 = 27.

• Throughput? 3 processes in 30 seconds, i.e. one completion every 10 seconds on average.

What if the processes arrive in order P2, P3, P1? What is the

• Waiting Time? (0 + 3 + 6) / 3 = 3

• Turnaround Time? (3 + 6 + 30) / 3 = 13.

• Throughput? Still 3 processes in 30 seconds - one completion every 10 seconds on average.
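These figures are easy to check mechanically. A small C sketch (our own) that computes average waiting and turnaround time for any arrival order under FCFS:

#include <stdio.h>

/* FCFS: processes run to completion in queue order.
   Process i waits for the sum of all bursts before it,
   and completes at that sum plus its own burst. */
static void fcfs(const int burst[], int n)
{
    int clock = 0;
    double wait_sum = 0, tat_sum = 0;
    for (int i = 0; i < n; i++) {
        wait_sum += clock;      /* time process i spent waiting */
        clock += burst[i];
        tat_sum += clock;       /* process i completes now      */
    }
    printf("avg waiting %.2f, avg turnaround %.2f\n",
           wait_sum / n, tat_sum / n);
}

int main(void)
{
    int order1[] = {24, 3, 3};  /* P1, P2, P3 */
    int order2[] = {3, 3, 24};  /* P2, P3, P1 */
    fcfs(order1, 3);            /* prints 17.00 and 27.00 */
    fcfs(order2, 3);            /* prints 3.00 and 13.00  */
    return 0;
}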

Shortest-Job-First (SJF) can eliminate some of the variance in waiting and turnaround time. In fact, it is optimal with respect to average waiting time. Big problem: how does the scheduler figure out how long it will take the process to run?

For a long term scheduler running on a batch system, the user will give an estimate. It is usually pretty good - if it is too short, the system will cancel the job before it finishes; if too long, the system will hold off on running the process. So, users give pretty good estimates of overall running time.

For the short-term scheduler, we must use the past to predict the future. The standard way: use a time-decayed exponentially weighted average of previous CPU bursts for each process. Let Tn be the measured time of the nth burst and sn be the predicted size of the next CPU burst. Then choose a weighting factor w, where 0 <= w <= 1, and compute sn+1 = w Tn + (1 - w) sn. s0 is defined as some default constant or system average.

w tells us how to weight the most recent burst relative to past history. If we choose w = .5, the last observation has as much weight as the entire rest of the history. If we choose w = 1, only the last observation has any weight. Let us do a quick example.
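A quick worked example (ours): take w = 0.5 and an initial guess s0 = 10. If the first two measured bursts are T0 = 6 and T1 = 4, then s1 = 0.5 * 6 + 0.5 * 10 = 8, and s2 = 0.5 * 4 + 0.5 * 8 = 6. The prediction tracks the recent shortening of the bursts while still remembering the older history.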

Preemptive vs. non-preemptive SJF schedulers. A preemptive scheduler reruns the scheduling decision when a process becomes ready. If the new process has priority over the running process, the CPU preempts the running process and executes the new process. A non-preemptive scheduler only makes a scheduling decision when the running process voluntarily gives up the CPU. In effect, it allows every running process to finish its CPU burst.

Consider 4 processes P1 (burst time 8), P2 (burst time 4), P3 (burst time 9) and P4 (burst time 5) that arrive one time unit apart in order P1, P2, P3, P4. Assume that after its burst happens, a process is not re-enabled for a long time (at least 100 time units, say). What does a preemptive SJF scheduler do? What about a non-preemptive scheduler?
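One possible worked answer (ours, taking P1 to arrive at time 0): the preemptive (shortest-remaining-time) scheduler runs P1 from 0 to 1, lets P2 preempt it (4 remaining vs P1's 7) and run 1-5, then runs P4 5-10, P1 10-17 and P3 17-26; the waiting times are 9, 0, 15 and 2, averaging 6.5. The non-preemptive scheduler lets P1 finish (0-8), then picks the shortest arrived job in turn: P2 8-12, P4 12-17, P3 17-26; the waiting times are 0, 7, 15 and 9, averaging 7.75.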

Priority Scheduling. Each process is given a priority, then CPU executes process with highest priority. If multiple processes with same priority are runnable, use some other criteria - typically FCFS. SJF is an example of a priority-based scheduling algorithm. With the exponential decay algorithm above, the priorities of a given process change over time.

Assume we have 5 processes P1 (burst time 10, priority 3), P2 (burst time 1, priority 1), P3 (burst time 2, priority 3), P4 (burst time 1, priority 4), P5 (burst time 5, priority 2). Lower numbers represent higher priorities. What would a standard priority scheduler do?
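One possible answer (ours, assuming all five arrive together): run P2 0-1, P5 1-6, then the two priority-3 processes in FCFS order, P1 6-16 and P3 16-18, and finally P4 18-19. The waiting times are 6, 0, 16, 18 and 1, for an average of 8.2.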

Big problem with priority scheduling algorithms: starvation or blocking of low-priority processes. We can use aging to prevent this - make the priority of a process go up the longer it stays runnable but isn't run.


What about interactive systems? We cannot just let any process run on the CPU until it gives it up - we must give users a response in a reasonable time. So, we use an algorithm called round-robin scheduling. It is similar to FCFS but with preemption. We have a time quantum or time slice. Let the first process in the queue run until it expires its quantum (i.e. runs for as long as the time quantum), then run the next process in the queue.

Implementing round-robin requires timer interrupts. When we schedule a process, we set the timer to go off after the time quantum amount of time expires. If the process does IO before the timer goes off, no problem - just run the next process. But if the process expires its quantum, we do a context switch: save the state of the running process and run the next process.

How well does RR work? Well, it gives good response time, but can give bad waiting time. Consider the waiting times under round robin for 3 processes P1 (burst time 24), P2 (burst time 3), and P3 (burst time 4) with time quantum 4. What happens, and what is the average waiting time? What gives the best waiting time?
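One possible worked answer (ours, assuming all three arrive at time 0): P1 runs 0-4, P2 runs 4-7 and finishes, P3 runs 7-11 and finishes, and P1 then runs alone from 11 to 31. The waiting times are 7, 4 and 7, averaging 6. The best average waiting time comes from running the short jobs first non-preemptively (P2, P3, P1), giving (0 + 3 + 7) / 3, roughly 3.3.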

What happens with a really small quantum? It looks like you've got a CPU that is 1/n as powerful as the real CPU, where n is the number of processes. The problem with a small quantum is context switch overhead.

What about having a really small quantum supported in hardware? Then, you have something called multithreading. Give the CPU a bunch of registers and heavily pipeline the execution. Feed the processes into the pipe one by one. Treat memory access like IO - suspend the thread until the data comes back from the memory. In the meantime, execute other threads. Use computation to hide the latency of accessing memory.

What about a really big quantum? It turns into FCFS. Rule of thumb - want 80 percent of CPU bursts to be shorter than time quantum.

Multilevel Queue Scheduling - like RR, except that there are multiple queues. Typically, we classify processes into separate categories and give a queue to each category. So, we might have system, interactive and batch processes, with the priorities in that order. We could also allocate a percentage of the CPU to each queue.

Multilevel Feedback Queue Scheduling - Like multilevel scheduling, except processes can move between queues as their priority changes. Can be used to give IO bound and interactive processes CPU priority over CPU bound processes. Can also prevent starvation by increasing the priority of processes that have been idle for a long time.

A simple example of a multilevel feedback queue scheduling algorithm: we have 3 queues, numbered 0, 1, 2, with correspondingly decreasing priority. So, for example, we execute a task in queue 2 only when queues 0 and 1 are empty.

A process goes into queue 0 when it becomes ready. When we run a process from queue 0, we give it a quantum of 8 ms. If it expires its quantum, it moves to queue 1. When we execute a process from queue 1, we give it a quantum of 16 ms. If it expires its quantum, it moves to queue 2. In queue 2, we run an RR scheduler with a large quantum in an interactive system, or an FCFS scheduler in a batch system. Of course, we preempt queue 2 processes whenever a new process becomes ready.

Another example of a multilevel feedback queue scheduling algorithm: the Unix scheduler. We will go over a simplified version that does not include kernel priorities. The point of the algorithm is to fairly allocate the CPU between processes, with processes that have not recently used a lot of CPU resources given priority over processes that have.


Processes are given a base priority of 60, with lower numbers representing higher priorities. The system clock generates an interrupt between 50 and 100 times a second, so we will assume a value of 60 clock interrupts per second. The clock interrupt handler increments a CPU usage field in the PCB of the interrupted process every time it runs.

The system always runs the highest priority process. If there is a tie, it runs the process that has been ready longest. Every second, it recalculates the priority and CPU usage field for every process according to the following formulas.

• CPU usage field = CPU usage field / 2

• Priority = CPU usage field / 2 + base priority

So, when a process has not used much CPU recently, its priority rises. The priorities of IO bound processes and interactive processes therefore tend to be high and the priorities of CPU bound processes tend to be low (which is what you want).

Unix also allows users to provide a ``nice'' value for each process. Nice values modify the priority calculation as follows:

• Priority = CPU usage field / 2 + base priority + nice value

So, you can reduce the priority of your process to be ``nice'' to other processes (which may include your own).
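A C sketch of the once-a-second recalculation (our own rendering of the formulas above; the struct and field names are illustrative, not the real kernel's):

#include <stdio.h>

#define BASE_PRIORITY 60

struct proc {
    int cpu_usage;   /* bumped by the clock interrupt while running */
    int nice;        /* user-supplied nice value                    */
    int priority;    /* recomputed every second; lower = better     */
};

static void recalc(struct proc *p)
{
    p->cpu_usage /= 2;                                        /* decay usage  */
    p->priority = p->cpu_usage / 2 + BASE_PRIORITY + p->nice; /* new priority */
}

int main(void)
{
    struct proc hog  = { 60, 0, 0 };   /* ran for a full second (60 ticks) */
    struct proc idle = {  4, 0, 0 };   /* mostly blocked on IO             */
    recalc(&hog);
    recalc(&idle);
    printf("hog %d, idle %d\n", hog.priority, idle.priority);  /* 75 vs 61 */
    return 0;
}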

In general, multilevel feedback queue schedulers are complex pieces of software that must be tuned to meet requirements.

Anomalies and system effects associated with schedulers.

Priority interacts with synchronization to create a really nasty effect called priority inversion. A priority inversion happens when a low-priority thread acquires a lock, then a high-priority thread tries to acquire the lock and blocks. Any middle-priority threads will prevent the low-priority thread from running and unlocking the lock. In effect, the middle-priority threads block the high-priority thread.

How to prevent priority inversions? Use priority inheritance. Any time a thread holds a lock that other threads are waiting on, give the thread the priority of the highest-priority thread waiting to get the lock. Problem is that priority inheritance makes the scheduling algorithm less efficient and increases the overhead.

Preemption can interact with synchronization in a multiprocessor context to create another nasty effect - the convoy effect. One thread acquires the lock, then suspends. Other threads come along and need to acquire the lock to perform their operations. Everybody suspends until the thread that holds the lock wakes up. At this point the threads are synchronized, and will convoy their way through the lock, serializing the computation. This drives down processor utilization.

If we have non-blocking synchronization via operations like LL/SC, we don't get convoy effects caused by suspending a thread competing for access to a resource. Why not? Because threads don't hold resources and prevent other threads from accessing them.

A similar effect arises when scheduling CPU and IO bound processes. Consider a FCFS algorithm with several IO bound processes and one CPU bound process. All of the IO bound processes execute their bursts quickly and queue up for access to the IO device. The CPU bound process then executes for a long time. During this time all of the IO bound processes have their IO requests satisfied and move back into the run queue. But they don't run - the CPU bound process is running instead - so the IO device idles. Finally, the CPU bound process gets off the CPU, and all of the IO bound processes run for a short time then queue up again for the IO devices. The result is poor utilization of the IO device - it is busy for a time while it processes the IO requests, then idle while the IO bound processes wait in the run queues for their short CPU bursts. In this case an easy solution is to give IO bound processes priority over CPU bound processes.

In general, a convoy effect happens when a set of processes need to use a resource for a short time, and one process holds the resource for a long time, blocking all of the other processes. Causes poor utilization of the other resources in the system.

CPU/Process Scheduling

The assignment of physical processors to processes allows processors to accomplish work. The problem of determining when processors should be assigned and to which processes is called processor scheduling or CPU scheduling.

When more than one process is runnable, the operating system must decide which one to run first. The part of the operating system concerned with this decision is called the scheduler, and the algorithm it uses is called the scheduling algorithm.

Goals of scheduling (objectives)

In this section we try to answer the following question: what does the scheduler try to achieve?

Many objectives must be considered in the design of a scheduling discipline. In particular, a scheduler should consider fairness, efficiency, response time, turnaround time, throughput, etc. Some of these goals depend on the system one is using (for example, a batch system, interactive system or real-time system), but there are also some goals that are desirable in all systems.

General Goals

Fairness: Fairness is important under all circumstances. A scheduler makes sure that each process gets its fair share of the CPU and no process can suffer indefinite postponement. Note that giving equivalent or equal time is not fair. Think of safety control and payroll at a nuclear plant.

Policy Enforcement: The scheduler has to make sure that system's policy is enforced. For example, if the local policy is safety then the safety control processes must be able to run whenever they want to, even if it means delay in payroll processes.

Efficiency: A scheduler should keep the system (or in particular the CPU) busy one hundred percent of the time when possible. If the CPU and all the Input/Output devices can be kept running all the time, more work gets done per second than if some components are idle.

Response Time: A scheduler should minimize the response time for interactive users.

Turnaround: A scheduler should minimize the time batch users must wait for their output.

Throughput: A scheduler should maximize the number of jobs processed per unit time.

A little thought will show that some of these goals are contradictory. It can be shown that any scheduling algorithm that favors some class of jobs hurts another class of jobs. The amount of CPU time available is finite, after all.

Preemptive vs Nonpreemptive Scheduling


Scheduling algorithms can be divided into two categories with respect to how they deal with clock interrupts.

Nonpreemptive Scheduling

A scheduling discipline is nonpreemptive if, once a process has been given the CPU, the CPU cannot be taken away from that process.

Following are some characteristics of nonpreemptive scheduling:

1. In a nonpreemptive system, short jobs are made to wait by longer jobs, but the overall treatment of all processes is fair.

2. In a nonpreemptive system, response times are more predictable because incoming high-priority jobs cannot displace waiting jobs.

3. In nonpreemptive scheduling, a scheduler executes jobs in the following two situations.

a. When a process switches from running state to the waiting state.

b. When a process terminates.

Preemptive Scheduling

A scheduling discipline is preemptive if, once a process has been given the CPU, the CPU can be taken away from it.

The strategy of allowing processes that are logically runnable to be temporarily suspended is called preemptive scheduling, in contrast to the "run to completion" method.

Scheduling Algorithms

CPU Scheduling deals with the problem of deciding which of the processes in the ready queue is to be allocated the CPU.

Following are some scheduling algorithms we will study:

• FCFS Scheduling.

• Round Robin Scheduling.

• SJF Scheduling.

• SRT Scheduling.

• Priority Scheduling.

• Multilevel Queue Scheduling.

• Multilevel Feedback Queue Scheduling.

First-Come-First-Served (FCFS) Scheduling

Other names of this algorithm are:

• First-In-First-Out (FIFO)


• Run-to-Completion

• Run-Until-Done

First-Come-First-Served is perhaps the simplest scheduling algorithm. Processes are dispatched according to their arrival time on the ready queue. Being a nonpreemptive discipline, once a process has the CPU, it runs to completion. FCFS scheduling is fair in the formal or human sense of fairness, but it is unfair in the sense that long jobs make short jobs wait and unimportant jobs make important jobs wait.

FCFS is more predictable than most other schemes, since jobs are served strictly in arrival order. The FCFS scheme is not useful in scheduling interactive users because it cannot guarantee good response time. The code for FCFS scheduling is simple to write and understand. One of the major drawbacks of this scheme is that the average waiting time is often quite long.

The First-Come-First-Served algorithm is rarely used as a master scheme in modern operating systems but it is often embedded within other schemes.

Round Robin Scheduling

One of the oldest, simplest, fairest and most widely used algorithm is round robin (RR).

In the round robin scheduling, processes are dispatched in a FIFO manner but are given a limited amount of CPU time called a time-slice or a quantum.

If a process does not complete before its CPU-time expires, the CPU is preempted and given to the next process waiting in a queue. The preempted process is then placed at the back of the ready list.

Round Robin Scheduling is preemptive (at the end of the time-slice); therefore it is effective in time-sharing environments in which the system needs to guarantee reasonable response times for interactive users.

The only interesting issue with the round robin scheme is the length of the quantum. Setting the quantum too short causes too many context switches and lowers CPU efficiency. On the other hand, setting the quantum too long may cause poor response time and approximate FCFS.

In any event, the average waiting time under round robin scheduling is often quite long.

Shortest-Job-First (SJF) Scheduling

Another name for this algorithm is Shortest-Process-Next (SPN).

Shortest-Job-First (SJF) is a non-preemptive discipline in which the waiting job (or process) with the smallest estimated run-time-to-completion is run next. In other words, when the CPU is available, it is assigned to the process that has the smallest next CPU burst.

The SJF scheduling is especially appropriate for batch jobs for which the run times are known in advance. Since the SJF scheduling algorithm gives the minimum average waiting time for a given set of processes, it is provably optimal.

The SJF algorithm favors short jobs (or processes) at the expense of longer ones.


The obvious problem with SJF scheme is that it requires precise knowledge of how long a job or process will run, and this information is not usually available.

The best SJF can do is to rely on user estimates of run times.

In a production environment where the same jobs run regularly, it may be possible to provide a reasonable estimate of run time, based on the past performance of the process. But in a development environment users rarely know how their program will execute.

Like FCFS, SJF is non-preemptive; therefore, it is not useful in a timesharing environment in which reasonable response time must be guaranteed.

Priority Scheduling

The basic idea is straightforward: each process is assigned a priority, and the runnable process with the highest priority is allowed to run. Equal-priority processes are scheduled in FCFS order. The Shortest-Job-First (SJF) algorithm is a special case of the general priority scheduling algorithm.

An SJF algorithm is simply a priority algorithm where the priority is the inverse of the (predicted) next CPU burst. That is, the longer the CPU burst, the lower the priority and vice versa.

Priority can be defined either internally or externally. Internally defined priorities use some measurable quantities or qualities to compute priority of a process.

Examples of Internal priorities are

• Time limits.

• Memory requirements.

• File requirements, for example, number of open files.

• CPU vs I/O requirements.

Externally defined priorities are set by criteria that are external to the operating system, such as

• The importance of process.

• Type or amount of funds being paid for computer use.

• The department sponsoring the work.

• Politics.

Priority scheduling can be either preemptive or non-preemptive:

• A preemptive priority algorithm will preempt the CPU if the priority of the newly arrived process is higher than the priority of the currently running process.

• A non-preemptive priority algorithm will simply put the new process at the head of the ready queue.


A major problem with priority scheduling is indefinite blocking or starvation. A solution to the problem of indefinite blockage of the low-priority process is aging. Aging is a technique of gradually increasing the priority of processes that wait in the system for a long period of time.

Multilevel Queue Scheduling

A multilevel queue scheduling algorithm partitions the ready queue into several separate queues.

In multilevel queue scheduling, processes are permanently assigned to one queue, based on some property of the process, such as

• Memory size

• Process priority

• Process type

The algorithm chooses the process from the occupied queue that has the highest priority, and runs that process either

• Preemptive or

• Non-preemptively

Each queue has its own scheduling algorithm or policy.

Possibility 1

If each queue has absolute priority over lower-priority queues, then no process in a given queue can run unless the queues for all higher-priority processes are empty.

For example, no process in the batch queue could run unless the queues for system processes, interactive processes, and interactive editing processes were all empty.

Possibility 2

If there is a time slice between the queues, then each queue gets a certain amount of CPU time, which it can then schedule among the processes in its queue. For instance:

• 80% of the CPU time to foreground queue using RR.

• 20% of the CPU time to background queue using FCFS.

Since processes do not move between queues, this policy has the advantage of low scheduling overhead, but it is inflexible.

Multilevel Feedback Queue Scheduling

The multilevel feedback queue scheduling algorithm allows a process to move between queues. It uses many ready queues and associates a different priority with each queue.

The algorithm chooses the process with the highest priority from the occupied queues and runs that process either preemptively or non-preemptively. If the process uses too much CPU time, it will be moved to a lower-priority queue. Similarly, a process that waits too long in a lower-priority queue may be moved to a higher-priority queue. Note that this form of aging prevents starvation.

• A process entering the ready queue is placed in queue 0.

• If it does not finish within 8 milliseconds, it is moved to the tail of queue 1.

• If it does not complete within 16 milliseconds there, it is preempted and placed into queue 2.

• Processes in queue 2 run on an FCFS basis, only when queue 0 and queue 1 are empty.

CHAPTER 6

INTERPROCESS COMMUNICATION

Since processes frequently need to communicate with other processes, there is a need for well-structured communication, without using interrupts, among processes.

Race Conditions

In operating systems, processes that are working together often share some common storage (main memory, a file, etc.) that each process can read and write. When two or more processes are reading or writing some shared data and the final result depends on who runs precisely when, we have a race condition. Concurrently executing threads that share data need to synchronize their operations and processing in order to avoid race conditions on shared data. Only one 'customer' thread at a time should be allowed to examine and update the shared variable.

Race conditions are also possible inside operating systems. If the ready queue is implemented as a linked list and the ready queue is being manipulated during the handling of an interrupt, then interrupts must be disabled to prevent another interrupt from arriving before the first one completes. If interrupts are not disabled, the linked list could become corrupted.

Critical Section

How to avoid race conditions?


The key to preventing trouble involving shared storage is to find some way to prohibit more than one process from reading and writing the shared data simultaneously. That part of the program where the shared memory is accessed is called the Critical Section. To avoid race conditions and flawed results, one must identify the code that forms a Critical Section in each thread. The characteristic properties of such code are:

• Code that references one or more variables in a “read-update-write” fashion while any of those variables is possibly being altered by another thread.

• Code that alters one or more variables that are possibly being referenced in “read-update-write” fashion by another thread.

• Code that uses a data structure while any part of it is possibly being altered by another thread.

• Code that alters any part of a data structure while it is possibly in use by another thread.

Here, the important point is that when one process is executing shared modifiable data in its critical section, no other process should be allowed to enter its own critical section.

Mutual Exclusion

A way of making sure that if one process is using a shared modifiable data, the other processes will be excluded from doing the same thing.

Formally, while one process executes the shared variable, all other processes desiring to do so at the same moment should be kept waiting; when that process has finished executing the shared variable, one of the processes waiting to do so should be allowed to proceed. In this fashion, each process executing the shared data (variables) excludes all others from doing so simultaneously. This is called Mutual Exclusion.

Note that mutual exclusion needs to be enforced only when processes access shared modifiable data - when processes are performing operations that do not conflict with one another they should be allowed to proceed concurrently.

Mutual Exclusion Conditions

If we could arrange matters such that no two processes were ever in their critical sections simultaneously, we could avoid race conditions. We need four conditions to hold to have a good solution for the critical section problem (mutual exclusion):

• No two processes may be inside their critical sections at the same moment.

• No assumptions are made about relative speeds of processes or number of CPUs.


• No process running outside its critical section should block other processes.

• No process should have to wait arbitrarily long to enter its critical section.

Proposals for Achieving Mutual Exclusion

The mutual exclusion problem is to devise a pre-protocol (or entry protocol) and a post-protocol (or exit protocol) to keep two or more threads from being in their critical sections at the same time. Tanenbaum examines several proposals for the critical-section, or mutual exclusion, problem.

Problem

When one process is updating shared modifiable data in its critical section, no other process should be allowed to enter its critical section.

Proposal 1 - Disabling Interrupts (Hardware Solution)

Each process disables all interrupts just after entering its critical section and re-enables them just before leaving it. With interrupts turned off, the CPU cannot be switched to another process. Hence, no other process will enter its critical section, and mutual exclusion is achieved.

Disabling interrupts is sometimes a useful technique within the kernel of an operating system, but it is not appropriate as a general mutual exclusion mechanism for user processes. The reason is that it is unwise to give user processes the power to turn off interrupts.

Proposal 2 - Lock Variable (Software Solution)

In this solution, we consider a single, shared (lock) variable, initially 0. When a process wants to enter its critical section, it first tests the lock. If the lock is 0, the process sets it to 1 and then enters the critical section. If the lock is already 1, the process just waits until the (lock) variable becomes 0. Thus, a 0 means that no process is in its critical section, and a 1 means hold your horses - some process is in its critical section.

The flaw in this proposal can be best explained by example. Suppose process A sees that the lock is 0. Before it can set the lock to 1 another process B is scheduled, runs, and sets the lock to 1. When the process A runs again, it will also set the lock to 1, and two processes will be in their critical section simultaneously.
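In (hypothetical) C, the flawed proposal looks like this; the comment marks the window in which the race described above occurs:

int lock = 0;                 /* 0 = free, 1 = some process inside */

void enter_critical(void)
{
    while (lock == 1)
        ;                     /* busy-wait until the lock looks free  */
    /* <-- another process can be scheduled right here, also see      */
    /*     lock == 0, and then both will set it and enter             */
    lock = 1;
}

void leave_critical(void)
{
    lock = 0;
}

The test of the lock and the setting of the lock are two separate steps, which is exactly the flaw described above.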

Proposal 3 - Strict Alternation

In this proposed solution, the integer variable 'turn' keeps track of whose turn it is to enter the critical section. Initially, process A inspects turn, finds it to be 0, and enters its critical section. Process B also finds it to be 0 and sits in a loop continually testing 'turn' to see when it becomes 1. Continuously testing a variable while waiting for some value to appear is called busy-waiting.

Taking turns is not a good idea when one of the processes is much slower than the other. Suppose process 0 finishes its critical section quickly, so both processes are now in their noncritical sections, with the turn handed to process 1. If process 0 now wants to re-enter, it must wait until the slow process 1 has been through its own critical section, even though nobody is inside one: a process outside its critical section is blocking another, which violates condition 3 mentioned above.
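A sketch of strict alternation in C (ours; process 1 is the mirror image, testing for turn != 1 and handing the turn back to 0):

int turn = 0;                 /* whose turn it is to enter */

void process0(void)
{
    for (;;) {
        while (turn != 0)
            ;                 /* busy-wait: not our turn        */
        /* critical section */
        turn = 1;             /* hand the turn to process 1     */
        /* noncritical section: if process 1 is slow to take
           its turn, process 0 is stuck at the busy-wait above
           even though nobody is in a critical section          */
    }
}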

Using System Calls 'sleep' and 'wakeup'


Basically, what the above solutions do is this: when a process wants to enter its critical section, it checks to see if entry is allowed. If it is not, the process goes into a tight loop and waits (i.e., starts busy waiting) until it is allowed to enter. This approach wastes CPU time.

Now we look at an interprocess communication primitive pair: sleep and wakeup.

• Sleep: It is a system call that causes the caller to block, that is, be suspended until some other process wakes it up.

• Wakeup: It is a system call that wakes up the process.

Both 'sleep' and 'wakeup' system calls have one parameter that represents a memory address used to match up 'sleeps' and ‘wakeups’.

The Bounded Buffer Producers and Consumers

The bounded-buffer producers and consumers problem assumes that there is a fixed buffer size, i.e., a finite number of slots is available.

Statement

To suspend the producers when the buffer is full, to suspend the consumers when the buffer is empty, and to make sure that only one process at a time manipulates a buffer so there are no race conditions or lost updates.

As an example how sleep-wakeup system calls are used, consider the producer-consumer problem also known as bounded buffer problem.

Two processes share a common, fixed-size (bounded) buffer. The producer puts information into the buffer and the consumer takes information out.

Trouble arises when

1. The producer wants to put new data in the buffer, but the buffer is already full.

Solution: The producer goes to sleep, to be awakened when the consumer has removed some data.

2. The consumer wants to remove data from the buffer, but the buffer is already empty.

Solution: The consumer goes to sleep until the producer puts some data in the buffer and wakes the consumer up.

This approach also leads to the same race conditions we have seen in earlier approaches, because access to the shared count of items in the buffer is unconstrained. The essence of the problem is that a wakeup call sent to a process that is not (yet) sleeping is lost.

Semaphores

A semaphore is a protected variable whose value can be accessed and altered only by the operations P and V and an initialization operation ('semaphore initialize').

Binary semaphores can assume only the value 0 or the value 1; counting semaphores (also called general semaphores) can assume any nonnegative value.

The P (or wait or sleep or down) operation on semaphores S, written as P(S) or wait (S), operates as follows:


P(S): IF S > 0

THEN S := S - 1

ELSE (wait on S)

The V (or signal or wakeup or up) operation on semaphore S, written as V(S) or signal (S), operates as follows:

V(S): IF (one or more process are waiting on S)

THEN (let one of these processes proceed)

ELSE S := S + 1

Operations P and V are done as single, indivisible, atomic actions. It is guaranteed that once a semaphore operation has started, no other process can access the semaphore until the operation has completed. Mutual exclusion on the semaphore, S, is enforced within P(S) and V(S).

If several processes attempt a P(S) simultaneously, only one process will be allowed to proceed. The other processes will be kept waiting, but the implementation of P and V guarantees that processes will not suffer indefinite postponement.

Semaphores solve the lost-wakeup problem.

Producer-Consumer Problem Using Semaphores

The solution to the producer-consumer problem uses three semaphores, namely full, empty and mutex.

The semaphore 'full' is used for counting the number of slots in the buffer that are full, 'empty' counts the number of slots that are empty, and semaphore 'mutex' makes sure that the producer and consumer do not access the modifiable shared sections of the buffer simultaneously.

Initialization

• Set full buffer slots to 0, i.e., semaphore full = 0.

• Set empty buffer slots to N, i.e., semaphore empty = N.

• To control access to the critical section, set mutex to 1, i.e., semaphore mutex = 1.

Producer ( )

WHILE (true)

produce-Item ( );

P (empty);

P (mutex);

enter-Item ( );

V (mutex);

V (full);


Consumer ( )

WHILE (true)

P (full);

P (mutex);

remove-Item ( );

V (mutex);

V (empty);

consume-Item ( );
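The same structure can be written as a runnable program using POSIX threads and semaphores - a minimal sketch (ours) of the initialization above, with the buffer simplified to a circular array; compile with -lpthread on a POSIX system:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8                                 /* number of buffer slots */

static int buffer[N], in = 0, out = 0;
static sem_t empty_s, full_s, mutex_s;

static void *producer(void *arg)
{
    for (int item = 0; item < 20; item++) {
        sem_wait(&empty_s);                 /* P(empty): wait for a free slot */
        sem_wait(&mutex_s);                 /* P(mutex): lock the buffer      */
        buffer[in] = item; in = (in + 1) % N;         /* enter-Item           */
        sem_post(&mutex_s);                 /* V(mutex)                       */
        sem_post(&full_s);                  /* V(full): one more full slot    */
    }
    return NULL;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 20; i++) {
        sem_wait(&full_s);                  /* P(full): wait for an item      */
        sem_wait(&mutex_s);
        int item = buffer[out]; out = (out + 1) % N;  /* remove-Item          */
        sem_post(&mutex_s);
        sem_post(&empty_s);
        printf("consumed %d\n", item);      /* consume-Item                   */
    }
    return NULL;
}

int main(void)
{
    sem_init(&empty_s, 0, N);               /* empty = N */
    sem_init(&full_s, 0, 0);                /* full  = 0 */
    sem_init(&mutex_s, 0, 1);               /* mutex = 1 */
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}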

CHAPTER 7

DEADLOCK

Definition

“Crises and deadlocks when they occur have at least this advantage that they force us to think.”- Jawaharlal Nehru (1889 - 1964) Indian political leader

A set of processes is in a deadlock state if each process in the set is waiting for an event that can be caused only by another process in the set. In other words, each member of the set of deadlocked processes is waiting for a resource that can be released only by a deadlocked process. None of the processes can run, none of them can release any resources, and none of them can be awakened. It is important to note that the number of processes and the number and kind of resources possessed and requested are unimportant.

The resources may be either physical or logical. Examples of physical resources are Printers, Tape Drives, Memory Space, and CPU Cycles. Examples of logical resources are Files, Semaphores, and Monitors.

The simplest example of deadlock is where process 1 has been allocated non-shareable resource A, say, a tape drive, and process 2 has been allocated non-shareable resource B, say, a printer. Now, if it turns out that process 1 needs resource B (the printer) to proceed and process 2 needs resource A (the tape drive) to proceed, and these are the only two processes in the system, then each blocks the other and all useful work in the system stops. This situation is termed deadlock. The system is in a deadlock state because each process holds a resource being requested by the other process, and neither process is willing to release the resource it holds.

Preemptable and Nonpreemptable Resources

Resources come in two flavors: preemptable and nonpreemptable. A preemptable resource is one that can be taken away from the process with no ill effects. Memory is an example of a preemptable resource. On the other hand, a nonpreemptable resource is one that cannot be taken away from a process without causing ill effects. For example, a CD recorder is not preemptable at an arbitrary moment.

Reallocating resources can resolve deadlocks that involve preemptable resources. Deadlocks that involve nonpreemptable resources are difficult to deal with.

Deadlock Condition

Necessary and Sufficient Deadlock Conditions

Coffman (1971) identified four conditions that must hold simultaneously for there to be a deadlock.

Mutual Exclusion Condition: The resources involved are non-shareable.

Explanation: At least one resource must be held in a non-shareable mode; that is, only one process at a time may claim exclusive control of the resource. If another process requests that resource, the requesting process must be delayed until the resource has been released.

Hold and Wait Condition: A requesting process already holds resources while waiting for additional requested resources.

Explanation: There must exist a process that is holding a resource already allocated to it while waiting for additional resources that are currently being held by other processes.

No-Preemption Condition: Resources already allocated to a process cannot be preempted.

Explanation: Resources cannot be removed from the processes holding them; they are used to completion or released voluntarily by the holding process.

Circular Wait Condition: The processes in the system form a circular list or chain where each process in the list is waiting for a resource held by the next process in the list.

As an example, consider the traffic deadlock in the following figure



Consider each section of the street as a resource.

• Mutual exclusion condition applies, since only one vehicle can be on a section of the street at a time.

• Hold-and-wait condition applies, since each vehicle is occupying a section of the street, and waiting to move on to the next section of the street.

• No-preemption condition applies, since a section of the street that is occupied by a vehicle cannot be taken away from it.

• Circular wait condition applies, since each vehicle is waiting on the next vehicle to move. That is, each vehicle in the traffic is waiting for a section of street held by the next vehicle in the traffic.

The simple rule to avoid traffic deadlock is that a vehicle should only enter an intersection if it is assured that it will not have to stop inside the intersection.

It is not possible to have a deadlock involving only one single process. Deadlock involves a circular "hold-and-wait" condition between two or more processes, and a single process cannot be waiting for a resource that it itself holds. In addition, deadlock is not possible between two threads in a process, because it is the process that holds resources, not the thread; that is, each thread has access to the resources held by the process.

Dealing with Deadlock Problem

In general, there are four strategies for dealing with the deadlock problem:

• The Ostrich Approach: Just ignore the deadlock problem altogether.



• Deadlock Detection and Recovery: Detect deadlock and, when it occurs, take steps to recover.

• Deadlock Avoidance: Avoid deadlock by careful resource scheduling.

• Deadlock Prevention: Prevent deadlock by resource scheduling so as to negate at least one of the four conditions.

Deadlock Prevention

Havender in his pioneering work showed that since all four of the conditions are necessary for deadlock to occur, it follows that deadlock might be prevented by denying any one of the conditions.

Elimination of “Mutual Exclusion” Condition: The mutual exclusion condition must hold for non-shareable resources. That is, several processes cannot simultaneously share a single resource. This condition is difficult to eliminate because some resources, such as the tape drive and printer, are inherently non-shareable. Note that shareable resources, like read-only files, do not require mutually exclusive access and thus cannot be involved in deadlock.

Elimination of “Hold and Wait” Condition: There are two possibilities for eliminating the second condition. The first alternative is that a process be granted all of the resources it needs at once, prior to execution. The second alternative is to disallow a process from requesting resources whenever it holds previously allocated resources. The first strategy requires that all of the resources a process will need be requested at once; the system must grant resources on an “all or none” basis. If the complete set of resources needed by a process is not currently available, the process must wait until the complete set is available. While the process waits, however, it may not hold any resources. Thus the “wait for” condition is denied and deadlocks simply cannot occur. This strategy can lead to serious waste of resources. For example, a program requiring ten tape drives must request and receive all ten drives before it begins executing. If the program needs only one tape drive to begin execution and does not need the remaining drives for several hours, then substantial computer resources (nine tape drives) will sit idle for several hours. This strategy can also cause indefinite postponement (starvation), since not all the required resources may become available at once.

Elimination of “No-Preemption” Condition: The no-preemption condition can be alleviated by forcing a process waiting for a resource that cannot immediately be allocated to relinquish all of its currently held resources, so that other processes may use them to finish. Suppose a system does allow processes to hold resources while requesting additional resources. Consider what happens when a request cannot be satisfied: a process holds resources a second process may need in order to proceed, while the second process may hold the resources needed by the first process. This is a deadlock. This strategy requires that when a process holding some resources is denied a request for additional resources, it must release its held resources and, if necessary, request them again together with the additional resources. Implementation of this strategy effectively denies the “no-preemption” condition. It carries a high cost: when a process releases resources, it may lose all its work to that point. One serious consequence of this strategy is the possibility of indefinite postponement (starvation), since a process might be held off indefinitely as it repeatedly requests and releases the same resources.

Elimination of “Circular Wait” Condition: The last condition, the circular wait, can be denied by imposing a total ordering on all of the resource types and then forcing all processes to request resources in that order (increasing or decreasing). This strategy imposes a total ordering of all resource types and requires that each process request resources in a numerical order of enumeration. With this rule, the resource allocation graph can never have a cycle.



For example, provide a global numbering of all the resources, as shown

1 ≡ Card reader

2 ≡ Printer

3 ≡ Plotter

4 ≡ Tape drive

5 ≡ Card punch

Now the rule is this: processes can request resources whenever they want to, but all requests must be made in numerical order. A process may request first a printer and then a tape drive (order: 2, 4), but it may not request first a plotter and then a printer (order: 3, 2). The problem with this strategy is that it may be impossible to find an ordering that satisfies everyone.
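In code, the ordering rule is typically enforced by always acquiring locks in ascending resource number. The sketch below illustrates the idea with POSIX mutexes standing in for the printer and tape drive from the numbering above; the function name is an assumption for illustration, not part of the original example.

#include <pthread.h>

/* Global ordering, following the table above: printer = 2, tape drive = 4. */
static pthread_mutex_t printer = PTHREAD_MUTEX_INITIALIZER;   /* resource 2 */
static pthread_mutex_t tape    = PTHREAD_MUTEX_INITIALIZER;   /* resource 4 */

/* Every thread that needs both resources takes them in numerical order,
   so a circular wait can never form among threads following the rule. */
void use_printer_and_tape(void) {
    pthread_mutex_lock(&printer);     /* lower-numbered resource first */
    pthread_mutex_lock(&tape);
    /* ... use both resources ... */
    pthread_mutex_unlock(&tape);
    pthread_mutex_unlock(&printer);
}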

Deadlock Avoidance

This approach to the deadlock problem anticipates deadlock before it actually occurs. It employs an algorithm that assesses the possibility that deadlock could occur and acts accordingly. This method differs from deadlock prevention, which guarantees that deadlock cannot occur by denying one of the necessary conditions of deadlock.

If the necessary conditions for a deadlock are in place, it is still possible to avoid deadlock by being careful when resources are allocated. Perhaps the most famous deadlock avoidance algorithm, due to Dijkstra [1965], is the Banker's algorithm, so named because the process is analogous to that used by a banker in deciding whether a loan can be safely made.

Banker’s Algorithm

In this analogy

Customers ≡ processes

Units ≡ resources, say, tape drives

Banker ≡ the Operating System

Customer    Used    Max
A           0       6
B           0       5
C           0       4
D           0       7

Available Units = 10

Fig. 1

In the above figure, we see four customers, each of whom has been granted a number of credit units. The banker reserved only 10 units rather than 22 units to service them. At a certain moment, the situation becomes:

Customer    Used    Max
A           1       6
B           1       5
C           2       4
D           4       7

Available Units = 2

Fig. 2

Safe State: The key to a state being safe is that there is at least one way for all users to finish. In the analogy above, the state of Figure 2 is safe because with 2 units left, the banker can delay any request except C's, thus letting C finish and release all four of its resources. With four units in hand, the banker can let either D or B have the necessary units, and so on.

Unsafe State: Consider what would happen if a request from B for one more unit were granted in Figure 2 above.

We would have the following situation:

Customer    Used    Max
A           1       6
B           2       5
C           2       4
D           4       7

Available Units = 1

Fig. 3

This is an unsafe state.

If all of the customers, namely A, B, C, and D, asked for their maximum loans, then the banker could not satisfy any of them, and we would have a deadlock.

It is important to note that an unsafe state does not imply the existence, or even the eventual existence, of a deadlock. What an unsafe state does imply is simply that some unfortunate sequence of events might lead to a deadlock.



The Banker's algorithm is thus to consider each request as it occurs and see whether granting it leads to a safe state. If it does, the request is granted; otherwise, it is postponed until later. Haberman [1969] has shown that executing the algorithm has complexity proportional to N², where N is the number of processes, and since the algorithm is executed each time a resource request occurs, the overhead is significant.
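For a single resource type, the safety check at the heart of the algorithm is short. The sketch below replays Figures 2 and 3 from above; the function and variable names are illustrative assumptions, not from the original text.

#include <stdbool.h>
#include <stdio.h>

/* Returns true if every customer can finish in some order,
   i.e., the state is safe. Assumes n <= 16 for the done[] array. */
bool is_safe(int used[], int max[], int n, int available) {
    bool done[16] = { false };
    for (int finished = 0; finished < n; finished++) {
        int i;
        for (i = 0; i < n; i++)           /* find someone we can satisfy */
            if (!done[i] && max[i] - used[i] <= available)
                break;
        if (i == n)
            return false;                 /* nobody can finish: unsafe */
        available += used[i];             /* i finishes, returns its units */
        done[i] = true;
    }
    return true;
}

int main(void) {
    int max[]  = { 6, 5, 4, 7 };          /* customers A, B, C, D */
    int fig2[] = { 1, 1, 2, 4 };
    int fig3[] = { 1, 2, 2, 4 };
    printf("Fig. 2: %s\n", is_safe(fig2, max, 4, 2) ? "safe" : "unsafe");
    printf("Fig. 3: %s\n", is_safe(fig3, max, 4, 1) ? "safe" : "unsafe");
    return 0;
}

Run against the figures, this prints "safe" for Figure 2 (C can finish first) and "unsafe" for Figure 3 (with one unit left, no customer's remaining need can be met).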

Deadlock Detection

Deadlock detection is the process of actually determining that a deadlock exists and identifying the processes and resources involved in the deadlock.

The basic idea is to check allocation against resource availability for all possible allocation sequences to determine whether the system is in a deadlocked state. Of course, the deadlock detection algorithm is only half of this strategy. Once a deadlock is detected, there needs to be a way to recover. Several alternatives exist:

• Temporarily preempt resources from deadlocked processes.

• Back off a process to some checkpoint, allowing preemption of a needed resource and restarting the process at the checkpoint later.

• Successively kill processes until the system is deadlock free.

These methods are expensive in the sense that each iteration calls the detection algorithm until the system proves to be deadlock free. The complexity of the algorithm is O(N²), where N is the number of processes. Another potential problem is starvation: the same process may be killed repeatedly.
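For single-instance resources, detection reduces to finding a cycle in the wait-for graph. A minimal depth-first-search sketch follows; the graph encoding, process count, and names are assumptions for illustration, not a prescribed implementation.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 4

/* wait_for[i][j] is true if process i waits for a resource held by j. */
static bool wait_for[NPROC][NPROC];

static bool dfs(int p, bool visiting[], bool done[]) {
    if (visiting[p]) return true;          /* back edge: cycle = deadlock */
    if (done[p]) return false;             /* already cleared */
    visiting[p] = true;
    for (int q = 0; q < NPROC; q++)
        if (wait_for[p][q] && dfs(q, visiting, done))
            return true;
    visiting[p] = false;
    done[p] = true;
    return false;
}

static bool deadlocked(void) {
    bool visiting[NPROC] = { false }, done[NPROC] = { false };
    for (int p = 0; p < NPROC; p++)
        if (dfs(p, visiting, done))
            return true;
    return false;
}

int main(void) {
    wait_for[0][1] = true;                 /* P0 waits on P1 */
    wait_for[1][0] = true;                 /* P1 waits on P0: a cycle */
    printf("%s\n", deadlocked() ? "deadlock" : "no deadlock");
    return 0;
}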

CHAPTER 8

MEMORY MANAGEMENT



About Memory

A Macintosh computer's available RAM is used by the Operating System, applications, and other software components, such as device drivers and system extensions. This section describes both the general organization of memory by the Operating System and the organization of the memory partition allocated to your application when it is launched. This section also provides a preliminary description of three related memory topics:

• temporary memory

• virtual memory

• 24- and 32-bit addressing

For more complete information on these three topics, you need to read the remaining chapters in this book.

Organization of Memory by the Operating System

When the Macintosh Operating System starts up, it divides the available RAM into two broad sections. It reserves for itself a zone or partition of memory known as the system partition. The system partition always begins at the lowest addressable byte of memory (memory address 0) and extends upward. The system partition contains a system heap and a set of global variables, described in the next two sections.

All memory outside the system partition is available for allocation to applications or other software components. In system software version 7.0 and later (or when MultiFinder is running in system software versions 5.0 and 6.0), the user can have multiple applications open at once. When an application is launched, the Operating System assigns it a section of memory known as its application partition. In general, an application uses only the memory contained in its own application partition.

Figure 1-1 illustrates the organization of memory when several applications are open at the same time. The system partition occupies the lowest position in memory. Application partitions occupy part of the remaining space. Note that application partitions are loaded into the top part of memory first.



In Figure 1-1, three applications are open, each with its own application partition. The application labeled Application 1 is the active application.

The System Heap

The main part of the system partition is an area of memory known as the system heap. In general, the system heap is reserved for exclusive use by the Operating System and other system software components, which load into it various items such as system resources, system code segments, and system data structures. All system buffers and queues, for example, are allocated in the system heap.

The system heap is also used for code and other resources that do not belong to specific applications, such as code resources that add features to the Operating System or that provide control of special-purpose peripheral equipment. System patches and system extensions (stored as code resources of type 'INIT') are loaded into the system heap during the system startup process. Hardware device drivers (stored as code resources of type 'DRVR') are loaded into the system heap when the driver is opened.

Most applications don't need to load anything into the system heap. In certain cases, however, you might need to load resources or code segments into the system heap. For example, if you want a vertical retrace task to continue to execute even when your application is in the background, you need to load the task and any data associated with it into the system heap. Otherwise, the Vertical Retrace Manager ignores the task when your application is in the background.

The System Global Variables

The lowest part of memory is occupied by a collection of global variables called system global variables (or low-memory system global variables). The Operating System uses these variables to maintain different kinds of information about the operating environment. For example, the Ticks global variable contains the number of ticks (sixtieths of a second) that have elapsed since the system was most recently started up. Similar variables contain, for example, the height of the menu bar (MBarHeight) and pointers to the heads of various operating-system queues (DTQueue, FSQHdr, VBLQueue, and so forth). Most low-memory global variables are of this variety: they contain information that is generally useful only to the Operating System or other system software components.

Other low-memory global variables contain information about the current application. For example, the ApplZone global variable contains the address of the first byte of the active application's partition. The ApplLimit global variable contains the address of the last byte the active application's heap can expand to include. The CurrentA5 global variable contains the address of the boundary between the active application's global variables and its application parameters. Because these global variables contain information about the active application, the Operating System changes the values of these variables whenever a context switch occurs.

In general, it is best to avoid reading or writing low-memory system global variables. Most of these variables are undocumented, and the results of changing their values can be unpredictable. Usually, when the value of a low-memory global variable is likely to be useful to applications, the system software provides a routine that you can use to read or write that value. For example, you can get the current value of the Ticks global variable by calling the TickCount function.

In rare instances, there is no routine that reads or writes the value of a documented global variable. In those cases, you might need to read or write that value directly. See the chapter "Memory Manager" in this book for instructions on reading and writing the values of low-memory global variables from a high-level language.

Organization of Memory in an Application Partition

When your application is launched, the Operating System allocates for it a partition of memory called its application partition. That partition contains required segments of the application's code as well as other data associated with the application. Figure 1-2 illustrates the general organization of an application partition.

Figure 1-2 Organization of an application partition



Your application partition is divided into three major parts:

• the application stack

• the application heap

• the application global variables and A5 world

The heap is located at the low-memory end of your application partition and always expands (when necessary) toward high memory. The A5 world is located at the high-memory end of your application partition and is of fixed size. The stack begins at the low-memory end of the A5 world and expands downward, toward the top of the heap.

As you can see in Figure 1-2, there is usually an unused area of memory between the stack and the heap. This unused area provides space for the stack to grow without encroaching upon the space assigned to the application heap. In some cases, however, the stack might grow into space reserved for the application heap. If this happens, it is very likely that data in the heap will become corrupted.

The ApplLimit global variable marks the upper limit to which your heap can grow. If you call the MaxApplZone procedure at the beginning of your program, the heap immediately extends all the way up to this limit. If you were to use all of the heap's free space, the Memory Manager would not allow you to allocate additional blocks above ApplLimit. If you do not call MaxApplZone, the heap grows toward ApplLimit whenever the Memory Manager finds that there is not enough memory in the heap to fill a request. However, once the heap grows up to ApplLimit, it can grow no further. Thus, whether you maximize your application heap or not, you can use only the space between the bottom of the heap and ApplLimit.

Unlike the heap, the stack is not bounded by ApplLimit. If your application uses heavily nested procedures with many local variables or uses extensive recursion, the stack could grow downward beyond ApplLimit. Because you do not use Memory Manager routines to allocate memory on the stack, the Memory Manager cannot stop your stack from growing beyond ApplLimit and possibly encroaching upon space reserved for the heap. However, a vertical retrace task checks approximately 60 times each second to see if the stack has moved into the heap. If it has, the task, known as the "stack sniffer," generates a system error. This system error alerts you that you have allowed the stack to grow too far, so that you can make adjustments. See "Changing the Size of the Stack" on page 1-39 for instructions on how to change the size of your application stack.

The Application Stack

The stack is an area of memory in your application partition that can grow or shrink at one end while the other end remains fixed. This means that space on the stack is always allocated and released in LIFO (last-in, first-out) order. The last item allocated is always the first to be released. It also means that the allocated area of the stack is always contiguous. Space is released only at the top of the stack, never in the middle, so there can never be any unallocated "holes" in the stack.

By convention, the stack grows from high memory toward low memory addresses. The end of the stack that grows or shrinks is usually referred to as the "top" of the stack, even though it's actually at the lower end of memory occupied by the stack.

Because of its LIFO nature, the stack is especially useful for memory allocation connected with the execution of functions or procedures. When your application calls a routine, space is automatically allocated on the stack for a stack frame. A stack frame contains the routine's parameters, local variables, and return address. Figure 1-3 illustrates how the stack expands and shrinks during a function call. The leftmost diagram shows the stack just before the function is called. The middle diagram shows the stack expanded to hold the stack frame. Once the function is executed, the local variables and function parameters are popped off the stack. If the function is a Pascal function, all that remains is the previous stack with the function result on top.

Figure 1-3. The application stack



The Application Heap

An application heap is the area of memory in your application partition in which space is dynamically allocated and released on demand. The heap begins at the low-memory end of your application partition and extends upward in memory. The heap contains virtually all items that are not allocated on the stack. For instance, your application heap contains the application's code segments and resources that are currently loaded into memory. The heap also contains other dynamically allocated items such as window records, dialog records, document data, and so forth.

You allocate space within your application's heap by making calls to the Memory Manager, either directly (for instance, using the NewHandle function) or indirectly (for instance, using a routine such as NewWindow, which calls Memory Manager routines). Space in the heap is allocated in blocks, which can be of any size needed for a particular object.
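For illustration, a direct allocation might look like the following sketch. It assumes the classic Toolbox Memory Manager interfaces; the header name, wrapper function, and block size are assumptions, not from the original text.

#include <Memory.h>    /* classic Mac Toolbox Memory Manager; header name assumed */

static void allocate_example(void) {
    Handle h = NewHandle(1024);   /* a relocatable 1 KB block in the heap */
    if (h == NULL) {
        /* Allocation failed even after compaction and purging;
           MemError() reports the specific error code. */
        return;
    }
    /* ... use the block through *h ... */
    DisposeHandle(h);             /* release the block when finished */
}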

The Memory Manager does all the necessary housekeeping to keep track of blocks in the heap as they are allocated and released. Because these operations can occur in any order, the heap doesn't usually grow and shrink in an orderly way, as the stack does. Instead, after your application has been running for a while, the heap can tend to become fragmented into a patchwork of allocated and free blocks, as shown in Figure 1-4. This fragmentation is known as heap fragmentation.

Figure 1-4 A fragmented heap

One result of heap fragmentation is that the Memory Manager might not be able to satisfy your application's request to allocate a block of a particular size. Even though there is enough free space available, the space is broken up into blocks smaller than the requested size. When this happens, the Memory Manager tries to create the needed space by moving allocated blocks together, thus collecting the free space in a single larger block. This operation is known as heap compaction. Figure 1-5 shows the results of compacting the fragmented heap shown in Figure 1-4.

Figure 1-5 A compacted heap

Heap fragmentation is generally not a problem as long as the blocks of memory you allocate are free to move during heap compaction. There are, however, two situations in which a block is not free to move: when it is a nonrelocatable block, and when it is a locked, relocatable block. To minimize heap fragmentation, you should use nonrelocatable blocks sparingly, and you should lock relocatable blocks only when absolutely necessary.

The Application Global Variables and A5 World

Your application's global variables are stored in an area of memory near the top of your application partition known as the application A5 world. The A5 world contains four kinds of data:

• application global variables

• application QuickDraw global variables

• application parameters

• the application's jump table

Each of these items is of fixed size, although the sizes of the global variables and of the jump table may vary from application to application. Figure 1-6 shows the standard organization of the A5 world.

Figure 1-6 Organization of an application's A5 world


The system global variable CurrentA5 points to the boundary between the current application's global variables and its application parameters. For this reason, the application's global variables are found as negative offsets from the value of CurrentA5. This boundary is important because the Operating System uses it to access the following information from your application: its global variables, its QuickDraw global variables, the application parameters, and the jump table. This information is known collectively as the A5 world because the Operating System uses the microprocessor's A5 register to point to that boundary.

Your application's QuickDraw global variables contain information about its drawing environment. For example, among these variables is a pointer to the current graphics port.

Your application's jump table contains an entry for each of your application's routines that is called by code in another segment. The Segment Manager uses the jump table to determine the address of any externally referenced routines called by a code segment.

The application parameters are 32 bytes of memory located above the application global variables; they're reserved for use by the Operating System. The first long word of those parameters is a pointer to your application's QuickDraw global variables.

Temporary Memory

In the Macintosh multitasking environment, each application is limited to a particular memory partition (whose size is determined by information in the 'SIZE' resource of that application). The size of your application's partition places certain limits on the size of your application heap and hence on the sizes of the buffers and other data structures that your application uses. In general, you specify an application partition size that is large enough to hold all the buffers, resources, and other data that your application is likely to need during its execution.

If for some reason you need more memory than is currently available in your application heap, you can ask the Operating System to let you use any available memory that is not yet allocated to any other application. This memory, known as temporary memory, is allocated from the available unused RAM; usually, that memory is not contiguous with the memory in your application's zone. Figure 1-7 shows an application using some temporary memory.

Figure 1-7 Using temporary memory allocated from unused RAM

In Figure 1-7, Application 1 has almost exhausted its application heap. As a result, it has requested and received a large block of temporary memory, extending from the top of Application 2's partition to the top of the allocatable space. Application 1 can use the temporary memory in whatever manner it desires.

Your application should use temporary memory only for occasional short-term purposes that could be accomplished in less space, though perhaps less efficiently. For example, if you want to copy a large file, you might try to allocate a fairly large buffer of temporary memory. If you receive the temporary memory, you can copy data from the source file into the destination file using the large buffer. If, however, the request for temporary memory fails, you can instead use a smaller buffer within your application heap. Although using the smaller buffer might prolong the copying operation, the file is nonetheless copied.
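As a sketch of that fallback strategy, assuming the Toolbox temporary-memory allocator TempNewHandle and illustrative buffer sizes (the function name and sizes here are assumptions):

#include <Memory.h>    /* classic Mac Toolbox; header name assumed */

static void copy_with_best_buffer(void) {
    OSErr err;
    /* Prefer a large temporary-memory buffer outside our partition... */
    Handle buf = TempNewHandle(256L * 1024, &err);
    if (buf == NULL)
        buf = NewHandle(16L * 1024);   /* ...fall back to a small heap buffer */
    if (buf == NULL)
        return;                        /* no memory at all; report an error */
    /* ... copy the file through *buf in chunks; the copy simply takes
       more iterations with the smaller buffer ... */
    DisposeHandle(buf);                /* release temporary memory promptly */
}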

One good reason for using temporary memory only occasionally is that you cannot assume that you will always receive the temporary memory you request. For example, in Figure 1-7, all the available memory is allocated to the two open applications; any further requests by either one for some temporary memory would fail. For complete details on using temporary memory, see the chapter "Memory Manager" in this book.



Virtual Memory

In system software version 7.0 and later, suitably equipped Macintosh computers can take advantage of a feature of the Operating System known as virtual memory, by which the machines have a logical address space that extends beyond the limits of the available physical memory. Because of virtual memory, a user can load more programs and data into the logical address space than would fit in the computer's physical RAM.

The Operating System extends the address space by using part of the available secondary storage (that is, part of a hard disk) to hold portions of applications and data that are not currently needed in RAM. When some of those portions of memory are needed, the Operating System swaps out unneeded parts of applications or data to the secondary storage, thereby making room for the parts that are needed.

It is important to realize that virtual memory operates transparently to most applications. Unless your application has time-critical needs that might be adversely affected by the operation of virtual memory or installs routines that execute at interrupt time, you do not need to know whether virtual memory is operating. For complete details on virtual memory, see the chapter "Virtual Memory Manager" later in this book.

Addressing Modes

On suitably equipped Macintosh computers, the Operating System supports 32-bit addressing, that is, the ability to use 32 bits to determine memory addresses. Earlier versions of system software use 24-bit addressing, where the upper 8 bits of memory addresses are ignored or used as flag bits. In a 24-bit addressing scheme, the logical address space has a size of 16 MB. Because 8 MB of this total are reserved for I/O space, ROM, and slot space, the largest contiguous program address space is 8 MB. When 32-bit addressing is in operation, the maximum program address space is 1 GB.

The ability to operate with 32-bit addressing is available only on certain Macintosh models, namely those with systems that contain a 32-bit Memory Manager. (For compatibility reasons, these systems also contain a 24-bit Memory Manager.) In order for your application to work when the machine is using 32-bit addressing, it must be 32-bit clean, that is, able to run in an environment where all 32 bits of a memory address are significant. Fortunately, writing applications that are 32-bit clean is relatively easy if you follow the guidelines in Inside Macintosh. In general, applications are not 32-bit clean because they manipulate flag bits in master pointers directly (for instance, to mark the associated memory blocks as locked or purgeable) instead of using Memory Manager routines to achieve the desired result.

Heap Management

Applications allocate and manipulate memory primarily in their application heap. As you have seen, space in the application heap is allocated and released on demand. When the blocks in your heap are free to move, the Memory Manager can often reorganize the heap to free space when necessary to fulfill a memory-allocation request. In some cases, however, blocks in your heap cannot move. In these cases, you need to pay close attention to memory allocation and management to avoid fragmenting your heap and running out of memory.

This section provides a general description of how to manage blocks of memory in your application heap. It describes

• relocatable and nonrelocatable blocks



• properties of relocatable blocks

• heap purging and compaction

• heap fragmentation

• dangling pointers

• low-memory conditions

Relocatable and Nonrelocatable Blocks

You can use the Memory Manager to allocate two different types of blocks in your heap: nonrelocatable blocks and relocatable blocks. A nonrelocatable block is a block of memory whose location in the heap is fixed. In contrast, a relocatable block is a block of memory that can be moved within the heap (perhaps during heap compaction). The Memory Manager sometimes moves relocatable blocks during memory operations so that it can use the space in the heap optimally.

The Memory Manager provides data types that reference both relocatable and nonrelocatable blocks. It also provides routines that allow you to allocate and release blocks of both types.

To reference a nonrelocatable block, you can use a pointer variable, defined by the Ptr data type.

TYPE
    SignedByte = -128..127;
    Ptr = ^SignedByte;

A pointer is simply the address of an arbitrary byte in memory, and a pointer to a nonrelocatable block of memory is simply the address of the first byte in the block, as illustrated in Figure 1-8. After you allocate a nonrelocatable block, you can make copies of the pointer variable. Because a pointer is the address of a block of memory that cannot be moved, all copies of the pointer correctly reference the block as long as you don't dispose of it.

Figure 1-8. A pointer to a nonrelocatable block



The pointer variable itself occupies 4 bytes of space in your application partition. Often the pointer variable is a global variable and is therefore contained in your application's A5 world. But the pointer can also be allocated on the stack or in the heap itself.

To reference relocatable blocks, the Memory Manager uses a scheme known as double indirection. The Memory Manager keeps track of a relocatable block internally with a master pointer, which itself is part of a nonrelocatable master pointer block in your application heap and can never move.

When the Memory Manager moves a relocatable block, it updates the master pointer so that it always contains the address of the relocatable block. You reference the block with a handle, defined by the Handle data type.

TYPE
    Handle = ^Ptr;

A handle contains the address of a master pointer. The left side of Figure 1-9 shows a handle to a relocatable block of memory located in the middle of the application heap. If necessary (perhaps to make room for another block of memory), the Memory Manager can move that block down in the heap, as shown in the right side of Figure 1-9.

Figure 1-9 A handle to a relocatable block
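The double-indirection scheme itself can be modeled in a few lines of portable C. Everything below is a toy model for illustration, not actual Toolbox code: a tiny fake heap, a single master pointer, and a simulated compaction.

#include <stdio.h>
#include <string.h>

typedef char *Ptr;           /* mirrors the Pascal Ptr */
typedef Ptr *Handle;         /* a handle is the address of a master pointer */

static char heap[64];        /* a pretend application heap */
static Ptr master;           /* master pointer: fixed home, movable target */

int main(void) {
    master = &heap[32];      /* the block starts at offset 32 */
    strcpy(master, "data");
    Handle h = &master;      /* clients keep only the handle */

    /* "Compaction": the block moves down in the heap, and only the
       master pointer is updated; the handle itself never changes. */
    memmove(&heap[0], &heap[32], 8);
    master = &heap[0];

    printf("%s\n", *h);      /* the handle still finds the data */
    return 0;
}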



Master pointers for relocatable objects in your heap are always allocated in your application heap. Because blocks of master pointers are nonrelocatable, it is best to allocate them as low in your heap as possible. You can do this by calling the MoreMasters procedure when your application starts up.

Whenever possible, you should allocate memory in relocatable blocks. This gives the Memory Manager the greatest freedom when rearranging the blocks in your application heap to create a new block of free memory. In some cases, however, you may be forced to allocate a nonrelocatable block of memory. When you call the Window Manager function NewWindow, for example, the Window Manager internally calls the NewPtr function to allocate a new nonrelocatable block in your application partition. You need to exercise care when calling Toolbox routines that allocate such blocks, lest your application heap become overly fragmented.

Using relocatable blocks makes the Memory Manager more efficient at managing available space, but it does carry some overhead. As you have seen, the Memory Manager must allocate extra memory to hold master pointers for relocatable blocks. It groups these master pointers into nonrelocatable blocks. For large relocatable blocks, this extra space is negligible, but if you allocate many very small relocatable blocks, the cost can be considerable. For this reason, you should avoid allocating a very large number of handles to small blocks; instead, allocate a single large block and use it as an array to hold the data you need.

Properties of Relocatable Blocks

As you have seen, a heap block can be either relocatable or nonrelocatable. The designation of a block as relocatable or nonrelocatable is a permanent property of that block. If relocatable, a block can be either locked or unlocked; if it's unlocked, a block can be either purgeable or unpurgeable. These attributes of relocatable blocks can be set and changed as necessary. The following sections explain how to lock and unlock blocks, and how to mark them as purgeable or unpurgeable.

Locking and Unlocking Relocatable Blocks: Occasionally, you might need a relocatable block of memory to stay in one place. To prevent a block from moving, you can lock it, using the HLock procedure. Once you have locked a block, it won't move. Later, you can unlock it, using the HUnlock procedure, allowing it to move again.

In general, you need to lock a relocatable block only if there is some danger that it might be moved during the time that you read or write the data in that block. This might happen, for instance, if you dereference a handle to obtain a pointer to the data and (for increased speed) use the pointer within a loop that calls routines that might cause memory to be moved. If, within the loop, the block whose data you are accessing is in fact moved, then the pointer no longer points to that data; this pointer is said to dangle.
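A typical pattern is therefore to lock only around the dereference, as in this sketch (classic Toolbox interfaces assumed; the function name is illustrative):

#include <Memory.h>    /* classic Mac Toolbox; header name assumed */

static void walk_items(Handle h) {
    HLock(h);          /* the block cannot move while locked */
    {
        Ptr p = *h;    /* safe to hold this dereferenced pointer now */
        /* ... loop over the data at p, even calling routines that
           might move or purge memory; the block itself is pinned ... */
    }
    HUnlock(h);        /* allow the Memory Manager to move it again */
}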

Using locked relocatable blocks can, however, slow the Memory Manager down as much as using nonrelocatable blocks. The Memory Manager can't move locked blocks. In addition, except when you allocate memory and resize relocatable blocks, it can't move relocatable blocks around locked relocatable blocks (just as it can't move them around nonrelocatable blocks). Thus, locking a block in the middle of the heap for long periods of time can increase heap fragmentation.

Locking and unlocking blocks every time you want to prevent a block from moving can become troublesome. Fortunately, the Memory Manager moves unlocked, relocatable blocks only at well-defined, predictable times. In general, each routine description in Inside Macintosh indicates whether the routine could move or purge memory. If you do not call any of those routines in a section of code, you can rely on all blocks to remain stationary while that code executes. Note that the Segment Manager might move memory if you call a routine located in a segment that is not currently resident in memory.

Purging and Reallocating Relocatable Blocks: One advantage of relocatable blocks is that you can use them to store information that you would like to keep in memory to make your application more efficient, but that you don't really need if available memory space becomes low. For example, your application might, at the beginning of its execution, load user preferences from a preferences file into a relocatable block. As long as the block remains in memory, your application can access information from the preferences file without actually reopening the file. However, reopening the file probably wouldn't take enough time to justify keeping the block in memory if memory space were scarce.

By making a relocatable block purgeable, you allow the Memory Manager to free the space it occupies if necessary. If you later want to prohibit the Memory Manager from freeing the space occupied by a relocatable block, you can make the block unpurgeable. You can use the HPurge and HNoPurge procedures to change back and forth between these two states. A block you create by calling NewHandle is initially unpurgeable.

Once you make a relocatable block purgeable, you should subsequently check handles to that block before using them if you call any of the routines that could move or purge memory. If a handle's master pointer is set to NIL, then the Operating System has purged its block. To use the information formerly in the block, you must reallocate space for it (perhaps by calling the ReallocateHandle procedure) and then reconstruct its contents (for example, by rereading the preferences file). Figure 1-10 illustrates the purging and reallocating of a relocatable block. When the block is purged, its master pointer is set to NIL. When it is reallocated, the handle correctly references a new block, but that block's contents are initially undefined.



Figure 1-10 Purging and reallocating a relocatable block
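A hedged sketch of this check-and-reallocate pattern follows, assuming the classic Toolbox interfaces; kPrefsSize and the function name are illustrative assumptions.

#include <Memory.h>    /* classic Mac Toolbox; header name assumed */

enum { kPrefsSize = 512 };               /* assumed preferences block size */

static void use_prefs(Handle prefs) {
    if (*prefs == NULL) {                /* NIL master pointer: purged */
        ReallocateHandle(prefs, kPrefsSize);
        if (MemError() != noErr)
            return;                      /* could not get the space back */
        /* ... reread the preferences file into *prefs here, since the
           new block's contents are initially undefined ... */
    }
    HNoPurge(prefs);                     /* protect the block while in use */
    /* ... read settings through *prefs ... */
    HPurge(prefs);                       /* make it purgeable again afterward */
}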

Memory Reservation

The Memory Manager does its best to prevent situations in which nonrelocatable blocks in the middle of the heap trap relocatable blocks. When it allocates new nonrelocatable blocks, it attempts to reserve memory for them as low in the heap as possible. The Memory Manager reserves memory for a nonrelocatable block by moving unlocked relocatable blocks upward until it has created a space large enough for the new block. When the Memory Manager can successfully pack all nonrelocatable blocks into the bottom of the heap, no nonrelocatable block can trap a relocatable block, and it has successfully prevented heap fragmentation.

Figure 1-11 illustrates how the Memory Manager allocates nonrelocatable blocks. Although it could place a block of the requested size at the top of the heap, it instead reserves space for the block as close to the bottom of the heap as possible and then puts the block into that reserved space. During this process, the Memory Manager might even move a relocatable block over a nonrelocatable block to make room for another nonrelocatable block.

Figure 1-11. Allocating a nonrelocatable block



When allocating a new relocatable block, you can, if you want, manually reserve space for the block by calling the ReserveMem procedure. If you do not, the Memory Manager looks for space big enough for the block as low in the heap as possible, but it does not create space near the bottom of the heap for the block if there is already enough space higher in the heap.

Heap Purging and Compaction

When your application attempts to allocate memory (for example, by calling either the NewPtr or NewHandle function), the Memory Manager might need to compact or purge the heap to free memory and to fuse many small free blocks into fewer large free blocks. The Memory Manager first tries to obtain the requested amount of space by compacting the heap; if compaction fails to free the required amount of space, the Memory Manager then purges the heap.

When compacting the heap, the Memory Manager moves unlocked, relocatable blocks down until they reach nonrelocatable blocks or locked, relocatable blocks. You can compact the heap manually, by calling either the CompactMem function or the MaxMem function.

In a purge of the heap, the Memory Manager sequentially purges unlocked, purgeable relocatable blocks until it has freed enough memory or until it has purged all such blocks. It purges a block by deallocating it and setting its master pointer to NIL.

If you want, you can manually purge a few blocks or an entire heap in anticipation of a memory shortage. To purge an individual block manually, call the EmptyHandle procedure. To purge your entire heap manually, call the PurgeMem procedure or the MaxMem function.
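A sketch of that anticipatory pattern, assuming the classic Toolbox interfaces (the helper name is illustrative): compact first, and purge only if compaction falls short, mirroring the order the Memory Manager itself uses.

#include <Memory.h>    /* classic Mac Toolbox; header name assumed */

static void make_room(Size needed) {
    /* CompactMem returns the largest contiguous free block it could
       produce by moving relocatable blocks together. */
    if (CompactMem(needed) < needed)
        PurgeMem(needed);      /* purge only if compaction was not enough */
}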

Heap Fragmentation

Heap fragmentation can slow your application by forcing the Memory Manager to compact or purge your heap to satisfy a memory-allocation request. In the worst cases, when your heap is severely fragmented by locked or nonrelocatable blocks, it might be impossible for the Memory Manager to find the requested amount of contiguous free space, even though that much space is actually free in your heap. This can have disastrous consequences for your application. For example, if the Memory Manager cannot find enough room to load a required code segment, your application will crash.

Obviously, it is best to minimize the amount of fragmentation that occurs in your application heap. It might be tempting to think that because the Memory Manager controls the movement of blocks in the heap, there is little that you can do to prevent heap fragmentation. In reality, however, fragmentation does not strike your application's heap by chance. Once you understand the major causes of heap fragmentation, you can follow a few simple rules to minimize it.

The primary causes of heap fragmentation are indiscriminate use of nonrelocatable blocks and indiscriminate locking of relocatable blocks. Each of these creates immovable blocks in your heap, thus creating "roadblocks" for the Memory Manager when it rearranges the heap to maximize the amount of contiguous free space. You can significantly reduce heap fragmentation simply by exercising care when you allocate nonrelocatable blocks and when you lock relocatable blocks.

Throughout this section, you should keep in mind the following rule: the Memory Manager can move a relocatable block around a nonrelocatable block (or a locked relocatable block) at these times only:

• When the Memory Manager reserves memory for a nonrelocatable block (or when you manually reserve memory before allocating a block), it can move unlocked, relocatable blocks upward over nonrelocatable blocks to make room for the new block as low in the heap as possible.

• When you attempt to resize a relocatable block, the Memory Manager can move that block around other blocks if necessary.

In contrast, the Memory Manager cannot move relocatable blocks over nonrelocatable blocks during compaction of the heap.

Deallocating Nonrelocatable Blocks

One of the most common causes of heap fragmentation is also one of the most difficult to avoid. The problem occurs when you dispose of a nonrelocatable block in the middle of the pile of nonrelocatable blocks at the bottom of the heap. Unless you immediately allocate another nonrelocatable block of the same size, you create a gap where the nonrelocatable block used to be. If you later allocate a slightly smaller, nonrelocatable block, that gap shrinks. However, small gaps are inefficient because of the small likelihood that future memory allocations will create blocks small enough to occupy the gaps.

It would not matter if the first block you allocated after deleting the nonrelocatable block were relocatable. The Memory Manager would place the block in the gap if possible. If you were later to allocate a nonrelocatable block as large as or smaller than the gap, the new block would take the place of the relocatable block, which would join other relocatable blocks in the middle of the heap, as desired. However, the new nonrelocatable block might be smaller than the original nonrelocatable block, leaving a small gap.

Whenever you dispose of a nonrelocatable block that you have allocated, you create small gaps, unless the next nonrelocatable block you allocate happens to be the same size as the disposed block. These small gaps can lead to heavy fragmentation over the course of your application's execution. Thus, you should try to avoid disposing of and then reallocating nonrelocatable blocks during program execution.



Reserving Memory

Another cause of heap fragmentation ironically occurs because of a limitation of memory reservation, a process designed to prevent it. Memory reservation never makes fragmentation worse than it would be if there were no memory reservation. Ordinarily, memory reservation ensures that allocating nonrelocatable blocks in the middle of your application's execution causes no problems. Occasionally, however, memory reservation can cause fragmentation, either when it succeeds but leaves small gaps in the reserved space, or when it fails and causes a nonrelocatable block to be allocated in the middle of the heap.

The Memory Manager uses memory reservation to create space for nonrelocatable blocks as low as possible in the heap. (You can also manually reserve memory for relocatable blocks, but you rarely need to do so.) However, when the Memory Manager moves a block up during memory reservation, that block cannot overlap its previous location. As a result, the Memory Manager might need to move the relocatable block up more than is necessary to contain the new nonrelocatable block, thereby creating a gap between the top of the new block and the bottom of the relocated block.

Memory reservation can also fragment the heap if there is not enough space in the heap to move the relocatable block up. In this case, the Memory Manager allocates the new nonrelocatable block above the relocatable block. The relocatable block cannot then move over the nonrelocatable block, except during the times described previously.

Locking Relocatable Blocks

Locked relocatable blocks present a special problem. When relocatable blocks are locked, they can cause as much heap fragmentation as nonrelocatable blocks. One solution is to reserve memory for all relocatable blocks that might at some point need to be locked, and to leave them locked for as long as they are allocated. This solution has drawbacks, however, because then the blocks would lose any flexibility that being relocatable otherwise gives them. Deleting a locked relocatable block can create a gap, just as deleting a nonrelocatable block can.

An alternative partial solution is to move relocatable blocks to the top of the heap before locking them. The MoveHHi procedure allows you to move a relocatable block upward until it reaches the top of the heap, a nonrelocatable block, or a locked relocatable block. This has the effect of partitioning the heap into four areas, as illustrated in Figure 1-12. At the bottom of the heap are the nonrelocatable blocks. Above those blocks are the unlocked relocatable blocks. At the top of the heap are locked relocatable blocks. Between the locked relocatable blocks and the unlocked relocatable blocks is an area of free space. The principal idea behind moving relocatable blocks to the top of the heap and locking them there is to keep the contiguous free space as large as possible.

Figure 1-12. An effectively partitioned heap



Using MoveHHi is, however, not always a perfect solution to handling relocatable blocks that need to be locked. The MoveHHi procedure moves a block upward only until it reaches either a nonrelocatable block or a locked relocatable block. Unlike NewPtr and ReserveMem, MoveHHi does not currently move a relocatable block around one that is not relocatable.

Even if MoveHHi succeeds in moving a block to the top area of the heap, unlocking or deleting locked blocks can cause fragmentation if you don't unlock or delete those blocks beginning with the lowest locked block. A relocatable block that is locked at the top area of the heap for a long period of time could trap other relocatable blocks that were locked for short periods of time but then unlocked.

This suggests that you need to treat relocatable blocks locked for a long period of time differently from those locked for a short period of time. If you plan to lock a relocatable block for a long period of time, you should reserve memory for it at the bottom of the heap before allocating it, then lock it for the duration of your application's execution (or as long as the block remains allocated). Do not reserve memory for relocatable blocks you plan to allocate for only short periods of time. Instead, move them to the top of the heap (by calling MoveHHi) and then lock them.

In practice, you apply the same rules to relocatable blocks that you reserve space for and leave permanently locked as you apply to nonrelocatable blocks: Try not to allocate such blocks in the middle of your application's execution, and don't dispose of and reallocate such blocks in the middle of your application's execution.

After you lock relocatable blocks temporarily, you don't need to move them manually back into the middle area when you unlock them. Whenever the Memory Manager compacts the heap or moves another relocatable block to the top heap area, it brings all unlocked relocatable blocks at the bottom of that partition back into the middle area. When moving a block to the top area, be sure to call MoveHHi on the block and then lock the block, in that order.
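A sketch of that ordering, assuming the classic Toolbox interfaces (the helper name is illustrative):

#include <Memory.h>    /* classic Mac Toolbox; header name assumed */

static void with_block_locked(Handle h) {
    MoveHHi(h);        /* first move the block to the top of the heap */
    HLock(h);          /* then pin it there: MoveHHi before HLock */
    /* ... short-term work through *h ... */
    HUnlock(h);        /* unlocked blocks drift back down during compaction */
}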

Allocating Nonrelocatable Blocks

As you have seen, there are two reasons for not allocating nonrelocatable blocks during the middle of your application's execution. First, if you also dispose of nonrelocatable blocks in the middle of your application's execution, then allocation of new nonrelocatable blocks is likely to create small gaps, as discussed earlier. Second, even if you never dispose of nonrelocatable blocks until your application terminates, memory reservation is an imperfect process, and the Memory Manager could occasionally place new nonrelocatable blocks above relocatable blocks.


There is, however, an exception to the rule that you should not allocate nonrelocatable blocks in the middle of your application's execution. Sometimes you need to allocate a nonrelocatable block only temporarily. If between the times that you allocate and dispose of a nonrelocatable block, you allocate no additional nonrelocatable blocks and do not attempt to compact the heap, then you have done no harm. The temporary block cannot create a new gap because the Memory Manager places no other block over the temporary block.

Summary of Preventing Fragmentation

Avoiding heap fragmentation is not difficult. It simply requires that you follow a few rules as closely as possible. Remember that allocation of even a small nonrelocatable block in the middle of your heap can ruin a scheme to prevent fragmentation of the heap, because the Memory Manager does not move relocatable blocks around nonrelocatable blocks when you call MoveHHi or when it attempts to compact the heap.

If you adhere to the following rules, you are likely to avoid significant heap fragmentation:

• At the beginning of your application's execution, call the MaxApplZone procedure once and the MoreMasters procedure enough times so that the Memory Manager never needs to call MoreMasters for you.

• Try to anticipate the maximum number of nonrelocatable blocks you will need and allocate them at the beginning of your application's execution.

• Avoid disposing of and then reallocating nonrelocatable blocks during your application's execution.

• When allocating relocatable blocks that you need to lock for long periods of time, use the ReserveMem procedure to reserve memory for them as close to the bottom of the heap as possible, and lock the blocks immediately after allocating them.

• If you plan to lock a relocatable block for a short period of time and allocate nonrelocatable blocks while it is locked, use the MoveHHi procedure to move the block to the top of the heap and then lock it. When the block no longer needs to be locked, unlock it.

• Remember that you need to lock a relocatable block only if you call a routine that could move or purge memory and you then use a dereferenced handle to the relocatable block, or if you want to use a dereferenced handle to the relocatable block at interrupt time.

Perhaps the most difficult restriction is to avoid disposing of and then reallocating nonrelocatable blocks in the middle of your application's execution. Some Toolbox routines require you to use nonrelocatable blocks, and it is not always easy to anticipate how many such blocks you will need. If you must allocate and dispose of blocks in the middle of your program's execution, you might want to place used blocks into a linked list of free blocks instead of disposing of them. If you know how many nonrelocatable blocks of a certain size your application is likely to need, you can add that many to the beginning of the list at the beginning of your application's execution. If you need a nonrelocatable block later, you can check the linked list for a block of the exact size instead of simply calling the NewPtr function.
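As an illustration of this technique, the following sketch maintains a simple free list for nonrelocatable blocks of one fixed size. The names MyReleaseBlock and MyAllocateBlock are hypothetical, and the sketch stores the link to the next free block in the first 4 bytes of each free block.

TYPE
   FreeBlockPtr = ^FreeBlock;
   FreeBlock = RECORD
      next: Ptr;        {first 4 bytes of a free block link to the next one}
   END;

VAR
   gFreeList: Ptr;      {head of the free list; set to NIL at startup}

PROCEDURE MyReleaseBlock (thePtr: Ptr);       {call instead of DisposePtr}
BEGIN
   FreeBlockPtr(thePtr)^.next := gFreeList;   {push the block onto the list}
   gFreeList := thePtr;
END;

FUNCTION MyAllocateBlock (blockSize: Size): Ptr;  {call instead of NewPtr}
BEGIN
   IF gFreeList <> NIL THEN
      BEGIN                                   {reuse a block from the list}
         MyAllocateBlock := gFreeList;
         gFreeList := FreeBlockPtr(gFreeList)^.next;
      END
   ELSE
      MyAllocateBlock := NewPtr(blockSize);   {list empty; allocate anew}
END;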

Dangling Pointers

Accessing a relocatable block by double indirection, through its handle instead of through its master pointer, requires an extra memory reference. For efficiency, you might sometimes want to dereference the handle--that is, make a copy of the block's master pointer--and then use that pointer to access the block by single indirection. When you do this, however, you need to be particularly careful. Any operation that allocates space from the heap might cause the relocatable block to be moved or purged. In that event, the block's master pointer is correctly updated, but your copy of the master pointer is not. As a result, your copy of the master pointer is a dangling pointer.

Dangling pointers are likely to make your application crash or produce garbled output. Unfortunately, it is often easy during debugging to overlook situations that could leave pointers dangling, because pointers dangle only if the relocatable blocks that they reference actually move. Routines that can move or purge memory do not necessarily do so unless memory space is tight. Thus, if you improperly dereference a handle in a section of code, that code might still work properly most of the time. If, however, a dangling pointer does cause errors, they can be very difficult to trace.

This section describes a number of situations that can cause dangling pointers and suggests some ways to avoid them.

Compiler Dereferencing

Some of the most difficult dangling pointers to isolate are not caused by any explicit dereferencing on your part, but by implicit dereferencing on the part of the compiler. For example, suppose you use a handle called myHandle to access the fields of a record in a relocatable block. You might use Pascal's WITH statement to do so, as follows:

WITH myHandle^^ DO
BEGIN
   ...
END;

A compiler is likely to dereference myHandle so that it can access the fields of the record without double indirection. However, if the code between the BEGIN and END statements causes the Memory Manager to move or purge memory, you are likely to end up with a dangling pointer.

The easiest way to prevent dangling pointers is simply to lock the relocatable block whose data you want to read or write. Because the block is locked and cannot move, the master pointer is guaranteed always to point to the beginning of the block's data. Listing 1-1 illustrates one way to avoid dangling pointers by locking a relocatable block.

Listing 1-1. Locking a block to avoid dangling pointers

VAR
   origState: SignedByte;                   {original attributes of handle}

origState := HGetState(Handle(myData));     {get handle attributes}
MoveHHi(Handle(myData));                    {move the handle high}
HLock(Handle(myData));                      {lock the handle}
WITH myData^^ DO                            {fill in window data}
BEGIN
   editRec := TENew(gDestRect, gViewRect);
   vScroll := GetNewControl(rVScroll, myWindow);
   hScroll := GetNewControl(rHScroll, myWindow);
   fileRefNum := 0;
   windowDirty := FALSE;
END;
HSetState(origState);                       {reset handle attributes}

The handle myData needs to be locked before the WITH statement because the functions TENew and GetNewControl allocate memory and hence might move the block whose handle is myData.

You should be careful to lock blocks only when necessary, because locked relocatable blocks can increase heap fragmentation and slow down your application unnecessarily. You should lock a handle only if you dereference it, directly or indirectly, and then use a copy of the original master pointer after calling a routine that could move or purge memory. When you no longer need to reference the block with the master pointer, you should unlock the handle. In Listing 1-1, the handle myData is never explicitly unlocked. Instead, the original attributes of the handle are saved by calling HGetState and later are restored by calling HSetState. This strategy is preferable to just calling HLock and HUnlock.

A compiler can generate hidden dereferencing, and hence potential dangling pointers, in other ways, for instance, by assigning the result of a function that might move or purge blocks to a field in a record referenced by a handle. Such problems are particularly common in code that manipulates linked data structures. For example, you might use this code to allocate a new element of a linked list:

myHandle^^.nextHandle := NewHandle(sizeof(myLinkedElement));

This can cause problems because your compiler could dereference myHandle before calling NewHandle. Therefore, you should either lock myHandle before performing the allocation, or use a temporary variable to allocate the new handle, as in the following code:

tempHandle := NewHandle(sizeof(myLinkedElement));
myHandle^^.nextHandle := tempHandle;

Passing fields of records as arguments to routines that might move or purge memory can cause similar problems, if the records are in relocatable blocks referred to with handles. Problems arise only when you pass a field by reference rather than by value. Pascal conventions call for all arguments larger than 4 bytes to be passed by reference. In Pascal, a variable is also passed by reference when the routine called requests a variable parameter. Both of the following lines of code could leave a pointer dangling:

TEUpdate(hTE^^.viewRect, hTE);
InvalRect(theControl^^.contrlRect);

These problems occur because a compiler may dereference a handle before calling the routine to which you pass the handle. Then, that routine may move memory before it uses the dereferenced handle, which might then be invalid. As before, you can solve these problems by locking the handles or using temporary variables.
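For example, you might guard the two calls shown above in either of the following ways. This is a sketch; tempRect is a local variable introduced here, and the other names come from the examples above.

VAR
   tempRect: Rect;

{Pass a stack copy of the field rather than the field itself; }
{the copy cannot move when InvalRect allocates memory.}
tempRect := theControl^^.contrlRect;
InvalRect(tempRect);

{Or lock the handle for the duration of the call.}
HLock(Handle(hTE));
TEUpdate(hTE^^.viewRect, hTE);
HUnlock(Handle(hTE));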

Loading Code Segments

If you call an application-defined routine located in a code segment that is not currently in RAM, the Segment Manager might need to move memory when loading that code segment, thus jeopardizing any dereferenced handles you might be using. For example, suppose you call an application-defined procedure ManipulateData, which manipulates some data at an address passed to it in a variable parameter.

PROCEDURE MyRoutine;
BEGIN
   ...
   ManipulateData(myHandle^);
   ...
END;

You can create a dangling pointer if ManipulateData and MyRoutine are in different segments, and the segment containing ManipulateData is not loaded when MyRoutine is executed. This can happen because you've passed a dereferenced copy of myHandle as an argument to ManipulateData. If the Segment Manager must allocate a new relocatable block for the segment containing ManipulateData, it might move myHandle to do so. If so, the dereferenced handle would dangle. A similar problem can occur if you assign the result of a function in a nonresident code segment to a field in a record referred to by a handle.

You need to be careful even when passing a field in a record referenced by a handle to a routine in the same code segment as the caller, or when assigning the result of a function in the same code segment to such a field. If that routine could call a Toolbox routine that might move or purge memory, or call a routine in a different, nonresident code segment, then you could indirectly cause a pointer to dangle.

Callback Routines

Code segmentation can also lead to a different type of dangling-pointer problem when you use callback routines. The problem rarely arises, but it is difficult to debug. Some Toolbox routines require that you pass a pointer to a procedure in a variable of type ProcPtr. Ordinarily, it does not matter whether the procedure you pass in such a variable is in the same code segment as the routine that calls it or in a different code segment. For example, suppose you call TrackControl as follows:

myPart := TrackControl(myControl, myEvent.where, @MyCallBack);

If MyCallBack were in the same code segment as this line of code, then a compiler would pass to TrackControl the absolute address of the MyCallBack procedure. If it were in a different code segment, then the compiler would take the address from the jump table entry for MyCallBack. Either way, TrackControl should call MyCallBack correctly.


Occasionally, you might use a variable of type ProcPtr to hold the address of a callback procedure and then pass that address to a routine. Here is an example:

myProc := @MyCallBack;
...
myPart := TrackControl(myControl, myEvent.where, myProc);

As long as these lines of code are in the same code segment and the segment is not unloaded between the execution of those lines, the preceding code should work perfectly. Suppose, however, that myProc is a global variable, and the first line of the code is in a different segment from the call to TrackControl. Suppose, further, that the MyCallBack procedure is in the same segment as the first line of the code (which is in a different segment from the call to TrackControl). Then, the compiler might place the absolute address of the MyCallBack routine into the variable myProc. The compiler cannot realize that you plan to use the variable in a different code segment from the one that holds both the routine you are referencing and the routine you are using to initialize the myProc variable. Because MyCallBack and the call to TrackControl are in different code segments, the TrackControl procedure requires that you pass an address in the jump table, not an absolute address. Thus, in this hypothetical situation, myProc would reference MyCallBack incorrectly.

To avoid this problem, make sure to place in the same segment any code in which you assign a value to a variable of type ProcPtr and any code in which you use that variable. If you must put them in different code segments, then be sure that you place the callback routine in a code segment different from the one that initializes the variable.

Invalid Handles

An invalid handle refers to the wrong area of memory, just as a dangling pointer does. There are three types of invalid handles: empty handles, disposed handles, and fake handles. You must avoid empty, disposed, or fake handles as carefully as dangling pointers. Fortunately, it is generally easier to detect, and thus to avoid, invalid handles.

Disposed Handles: A disposed handle is a handle whose associated relocatable block has been disposed of. When you dispose of a relocatable block (perhaps by calling the procedure DisposeHandle), the Memory Manager does not change the value of any handle variables that previously referenced that block. Instead, those variables still hold the address of what once was the relocatable block's master pointer. Because the block has been disposed of, however, the contents of the master pointer are no longer defined. (The master pointer might belong to a subsequently allocated relocatable block, or it could become part of a linked list of unused master pointers maintained by the Memory Manager.)

If you accidentally use a handle to a block you have already disposed of, you can obtain unexpected results. In the best cases, your application will crash. In the worst cases, you will get garbled data. It might, however, be difficult to trace the cause of the garbled data, because your application can continue to run for quite a while before the problem begins to manifest itself.

You can avoid these problems quite easily by assigning the value NIL to the handle variable after you dispose of its associated block. By doing so, you indicate that the handle does not point anywhere in particular. If you subsequently attempt to operate on such a block, the Memory Manager will probably generate a nilHandleErr result code. If you want to make certain that a handle is not disposed of before operating on a relocatable block, you can test whether the value of the handle is NIL, as follows:

IF myHandle <> NIL THEN
   ...;   {handle is valid, so we can operate on it here}
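The disposal itself is a two-line idiom, using the standard DisposeHandle procedure:

DisposeHandle(myHandle);   {release the relocatable block}
myHandle := NIL;           {mark the handle variable as disposed of}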

Empty Handles: An empty handle is a handle whose master pointer has the value NIL. When the Memory Manager purges a relocatable block, for example, it sets the block's master pointer to NIL. The space occupied by the master pointer itself remains allocated, and handles to the purged block continue to point to the master pointer. This is useful, because if you later reallocate space for the block by calling ReallocateHandle, the master pointer will be updated and all existing handles will correctly access the reallocated block.

Once again, however, inadvertently using an empty handle can give unexpected results or lead to a system crash. In the Macintosh Operating System, NIL technically refers to memory location 0. But this memory location holds a value. If you doubly dereference an empty handle, you reference whatever data is found at that location, and you could obtain unexpected results that are difficult to trace.

You can check for empty handles much as you check for disposed handles. Assuming you set handles to NIL when you dispose of them, you can use the following code to determine whether a handle both points to a valid master pointer and references a nonempty relocatable block:

IF myHandle <> NIL THEN
   IF myHandle^ <> NIL THEN
      ...   {we can operate on the relocatable block here}

Note that because Pascal evaluates expressions completely, you need two IF-THEN statements rather than one compound statement in case the value of the handle itself is NIL. Most compilers, however, allow you to use "short-circuit" Boolean operators to minimize the evaluation of expressions. For example, if your compiler uses the operator & as a short-circuit operator for AND, you could rewrite the preceding code like this:

IF (myHandle <> NIL) & (myHandle^ <> NIL) THEN
   ...   {we can operate on the relocatable block here}

In this case, the second expression is evaluated only if the first expression evaluates to TRUE.

It is useful during debugging to set memory location 0 to an odd number, such as $50FFC001. This causes the Operating System to crash immediately if you attempt to dereference an empty handle. This is useful, because you can immediately fix problems that might otherwise require extensive debugging.

Fake Handles: A fake handle is a handle that was not created by the Memory Manager. Normally, you create handles by either directly or indirectly calling the Memory Manager function NewHandle (or one of its variants, such as NewHandleClear). You create a fake handle--usually inadvertently--by directly assigning a value to a variable of type Handle, as illustrated in Listing 1-2.

Listing 1-2. Creating a fake handle

FUNCTION MakeFakeHandle: Handle;    {DON'T USE THIS FUNCTION!}
CONST
   kMemoryLoc = $100;               {a random memory location}
VAR
   myHandle: Handle;
   myPointer: Ptr;
BEGIN
   myPointer := Ptr(kMemoryLoc);    {the address of some memory}
   myHandle := @myPointer;          {the address of a pointer}
   MakeFakeHandle := myHandle;
END;

Remember that a real handle contains the address of a master pointer. The fake handle manufactured by the function MakeFakeHandle in Listing 1-2 contains an address that may or may not be the address of a master pointer. If it isn't the address of a master pointer, then you virtually guarantee chaotic results if you pass the fake handle to a system software routine that expects a real handle.

For example, suppose you pass a fake handle to the MoveHHi procedure. After allocating a new relocatable block high in the heap, MoveHHi is likely to copy the data from the original block to the new block by dereferencing the handle and using, supposedly, a master pointer. Because, however, the value of a fake handle probably isn't the address of a master pointer, MoveHHi copies invalid data. (Actually, it's unlikely that MoveHHi would ever get that far; probably it would run into problems when attempting to determine the size of the original block from the block header.)

Not all fake handles are as easy to spot as those created by the MakeFakeHandle function defined in Listing 1-2. You might, for instance, attempt to copy the data in an existing record (myRecord) into a new handle, as follows:

myHandle := NewHandle(SizeOf(myRecord));   {create a new handle}
myHandle^ := @myRecord;                    {DON'T DO THIS!}

The second line of code does not make myHandle a handle to the beginning of the myRecord record. Instead, it overwrites the master pointer with the address of that record, making myHandle a fake handle.

A correct way to create a new handle to some existing data is to make a copy of the data using the PtrToHand function, as follows:

myErr := PtrToHand(@myRecord, myHandle, SizeOf(myRecord));

The Memory Manager provides a set of pointer- and handle-manipulation routines that can help you avoid creating fake handles.

Low-Memory Conditions


It is particularly important to make sure that the amount of free space in your application heap never gets too low. For example, you should never deplete the available heap memory to the point that it becomes impossible to load required code segments. As you have seen, your application will crash if the Segment Manager is called to load a required code segment and there is not enough contiguous free memory to allocate a block of the appropriate size.

You can take several steps to help maximize the amount of free space in your heap. For example, you can mark as purgeable any relocatable blocks whose contents could easily be reconstructed. By making a block purgeable, you give the Memory Manager the freedom to release that space if heap memory becomes low. You can also help maximize the available heap memory by intelligently segmenting your application's executable code and by periodically unloading any unneeded segments. The standard way to do this is to unload every nonessential segment at the end of your application's main event loop.
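The following sketch shows both measures. The names gCacheData and MyPrintCode are hypothetical, standing for a reconstructible cache block and for any routine in a segment you want to unload; HPurge and UnloadSeg are the standard Toolbox routines.

{Let the Memory Manager release this block if heap space runs low; }
{the application must be prepared to reconstruct its contents.}
HPurge(gCacheData);

{At the end of the main event loop, unload a nonessential code segment }
{by passing the address of any routine in that segment.}
UnloadSeg(@MyPrintCode);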

Memory Cushions: These two measures--making blocks purgeable and unloading segments--help you only by releasing blocks that have already been allocated. It is even more important to make sure, before you attempt to allocate memory directly, that you don't deplete the available heap memory. Before you call NewHandle or NewPtr, you should check that, if the requested amount of memory were in fact allocated, the remaining amount of space free in the heap would not fall below a certain threshold. The free memory defined by that threshold is your memory cushion. You should not simply inspect the handle or pointer returned to you and make sure that its value isn't NIL, because you might have succeeded in allocating the space you requested but left the amount of free space dangerously low.

You also need to make sure that indirect memory allocation doesn't cut into the memory cushion. When, for example, you call GetNewDialog, the Dialog Manager might need to allocate space for a dialog record; it also needs to allocate heap space for the dialog item list and any other custom items in the dialog. Before calling GetNewDialog, therefore, you need to make sure that the amount of space left free after the call is greater than your memory cushion.

The execution of some system software routines requires significant amounts of memory in your heap. For example, some QuickDraw operations on regions can temporarily allocate fairly large amounts of space in your heap. Some of these system software routines, however, do little or no checking to see that your heap contains the required amount of free space. They either assume that they will get whatever memory they need or they simply issue a system error when they don't get the needed memory. In either case, the result is usually a system crash.

You can avoid these problems by making sure that there is always enough space in your heap to handle these hidden memory allocations. Experience has shown that 40 KB is a reasonably safe size for this memory cushion. If you can consistently maintain that amount of space free in your heap, you can be reasonably certain that system software routines will get the memory they need to operate. You also generally need a larger cushion (about 70 KB) when printing.

Memory Reserves

Unfortunately, there are times when you might need to use some of the memory in the cushion yourself. It is better, for instance, to dip into the memory cushion, if necessary, to save a user's document than to reject the request to save the document. Some actions your application performs should not be rejectable simply because they require it to reduce the amount of free space below a desired minimum.

Instead of relying on just the free memory of a memory cushion, you can allocate a memory reserve, some additional emergency storage that you release when free memory becomes low. The important difference between this memory reserve and the memory cushion is that the memory reserve is a block of allocated memory, which you release whenever you detect that essential tasks have dipped into the memory cushion.

That emergency memory reserve might provide enough memory to compensate for any essential tasks that you fail to anticipate. Because you allow essential tasks to dip into the memory cushion, the release of the memory reserve itself should not be a cause for alarm. Under this scheme, your application releases the memory reserve as a precautionary measure whenever essential tasks dip into the cushion. Ideally, however, the application should never actually deplete the memory cushion and use the memory reserve.

Grow-Zone Functions

The Memory Manager provides a particularly easy way for you to make sure that the emergency memory reserve is released when necessary. You can define a grow-zone function that is associated with your application heap. The Memory Manager calls your heap's grow-zone function only after other techniques of freeing memory to satisfy a memory request fail (that is, after compacting and purging the heap and extending the heap zone to its maximum size). The grow-zone function can then take appropriate steps to free additional memory.

A grow-zone function might dispose of some blocks or make some unpurgeable blocks purgeable. When the function returns, the Memory Manager once again purges and compacts the heap and tries to reallocate memory. If there is still insufficient memory, the Memory Manager calls the grow-zone function again (but only if the function returned a nonzero value the previous time it was called). This mechanism allows your grow-zone function to release just a little bit of memory at a time. If the amount it releases at any time is not enough, the Memory Manager calls it again and gives it the opportunity to take more drastic measures. As the most drastic step to freeing memory in your heap, you can release the emergency reserve.

Using Memory

This section describes how you can use the Memory Manager to perform the most typical memory management tasks. In particular, this section shows how you can

• set up your application heap at application launch time

• determine how much free space is available in your application heap

• allocate and release blocks of memory in your heap

• define and install a grow-zone function

The techniques described in this section are designed to minimize fragmentation of your application heap and to ensure that your application always has sufficient memory to complete any essential operations.

Setting Up the Application Heap

When the Process Manager launches your application, it calls the Memory Manager to create and initialize a memory partition for your application. The Process Manager then loads code segments into memory and sets up the stack, heap, and A5 world (including the jump table) for your application.


To help prevent heap fragmentation, you should also perform some setup of your own early in your application's execution. Depending on the needs of your application, you might want to

• change the size of your application's stack

• expand the heap to the heap limit

• allocate additional master pointer blocks

The following sections describe in detail how and when to perform these operations.

Changing the Size of the Stack: Most applications allocate space on their stack in a predictable way and do not need to monitor stack space during their execution. For these applications, stack usage usually reaches a maximum in some heavily nested routine. If the stack in your application can never grow beyond a certain size, then to avoid collisions between your stack and heap you simply need to ensure that your stack is large enough to accommodate that size. If you never encounter system error 28 (generated by the stack sniffer when it detects a collision between the stack and the heap) during application testing, then you probably do not need to increase the size of your stack.

Some applications, however, rely heavily on recursive programming techniques, in which one routine repeatedly calls itself or a small group of routines repeatedly call each other. In these applications, even routines with just a few local variables can cause stack overflow, because each time a routine calls itself, a new copy of that routine's parameters and variables is appended to the stack. The problem can become particularly acute if one or more of the local variables is a string, which can require up to 256 bytes of stack space.

You can help prevent your application from crashing because of insufficient stack space by expanding the size of your stack. If your application does not depend on recursion, you should do this only if you encounter system error 28 during testing. If your application does depend on recursion, you might consider expanding the stack so that your application can perform deeply nested recursive computations. In addition, some object-oriented languages (for example, C++) allocate space for objects on the stack. If you are using one of these languages, you might need to expand your stack.

To increase the size of your stack, you simply reduce the size of your heap. Because the heap cannot grow above the boundary contained in the ApplLimit global variable, you can lower the value of ApplLimit to limit the heap's growth. By lowering ApplLimit, technically you are not making the stack bigger; you are just preventing collisions between it and the heap.

By default, the stack can grow to 8 KB on Macintosh computers without Color QuickDraw and to 32 KB on computers with Color QuickDraw. (The size of the stack for a faceless background process is always 8 KB, whether Color QuickDraw is present or not.) You should never decrease the size of the stack, because future versions of system software might increase the default amount of space allocated for the stack. For the same reason, you should not set the stack to a predetermined absolute size or calculate a new absolute size for the stack based on the microprocessor's type. If you must modify the size of the stack, you should increase the stack size only by some relative amount that is sufficient to meet the increased stack requirements of your application. There is no maximum size to which the stack can grow.

Listing 1-3 defines a procedure that increases the stack size by a given value. It does so by determining the current heap limit, subtracting the value of the extraBytes parameter from that value, and then setting the application limit to the difference.


Listing 1-3. Increasing the amount of space allocated for the stack

PROCEDURE IncreaseStackSize (extraBytes: Size);
BEGIN
   SetApplLimit(Ptr(ORD4(GetApplLimit) - extraBytes));
END;

You should call this procedure at the beginning of your application, before you call the MaxApplZone procedure (as described in the next section). If you call IncreaseStackSize after you call MaxApplZone, it has no effect, because the SetApplLimit procedure cannot change the ApplLimit global variable to a value lower than the current top of the heap.

Expanding the Heap: Near the beginning of your application's execution, before you allocate any memory, you should call the MaxApplZone procedure to expand the application heap immediately to the application heap limit. If you do not do this, the Memory Manager gradually expands your heap as memory needs require. This gradual expansion can result in significant heap fragmentation if you have previously moved relocatable blocks to the top of the heap (by calling MoveHHi) and locked them (by calling HLock). When the heap grows beyond those locked blocks, they are no longer at the top of the heap. Your heap then remains fragmented for as long as those blocks remain locked.

Another advantage to calling MaxApplZone is that doing so is likely to reduce the number of relocatable blocks that are purged by the Memory Manager. The Memory Manager expands your heap to fulfill a memory request only after it has exhausted other methods of obtaining the required amount of space, including compacting the heap and purging blocks marked as purgeable. By expanding the heap to its limit, you can prevent the Memory Manager from purging blocks that it otherwise would purge. This, together with the fact that your heap is expanded only once, can make memory allocation significantly faster.

Allocating Master Pointer Blocks

After calling MaxApplZone, you should call the MoreMasters procedure to allocate as many new nonrelocatable blocks of master pointers as your application is likely to need during its execution. Each block of master pointers in your application heap contains 64 master pointers. The Operating System allocates one block of master pointers as your application is loaded into memory, and every relocatable block you allocate needs one master pointer to reference it.

If, when you allocate a relocatable block, there are no unused master pointers in your application heap, the Memory Manager automatically allocates a new block of master pointers. For several reasons, however, you should try to prevent the Memory Manager from calling MoreMasters for you. First, MoreMasters executes more slowly if it has to move relocatable blocks up in the heap to make room for the new nonrelocatable block of master pointers. When your application first starts running, there are no such blocks that might have to be moved. Second, the new nonrelocatable block of master pointers is likely to fragment your application heap. At any time the Memory Manager is forced to call MoreMasters for you, there are already at least 64 relocatable blocks allocated in your heap. Unless all or most of those blocks are locked high in the heap (an unlikely situation), the new nonrelocatable block of master pointers might be allocated above existing relocatable blocks. This increases heap fragmentation.

To prevent this fragmentation, you should call MoreMasters at the beginning of your application enough times to ensure that the Memory Manager never needs to call it for you. For example, if your application never allocates more than 300 relocatable blocks in its heap, then five calls to MoreMasters should be enough. It's better to call MoreMasters too many times than too few, so if your application usually allocates about 100 relocatable blocks but sometimes might allocate 1000 in a particularly busy session, you should call MoreMasters enough times at the beginning of the program to cover the larger figure.

You can determine empirically how many times to call MoreMasters by using a low-level debugger. First, remove all the calls to MoreMasters from your code and then give your application a rigorous workout, opening and closing windows, dialog boxes, and desk accessories as much as any user would. Then, find out from your debugger how many times the system called MoreMasters. To do so, count the nonrelocatable blocks of size $100 bytes (decimal 256: 64 master pointers at 4 bytes each). Because of Memory Manager size corrections, you should also count any nonrelocatable blocks of size $108, $10C, or $110 bytes. (You should also check to make sure that your application doesn't allocate other nonrelocatable blocks of those sizes. If it does, subtract the number it allocates from the total.) Finally, call MoreMasters at least that many times at the beginning of your application.

Listing 1-4 illustrates a typical sequence of steps to configure your application heap and stack. The DoSetUpHeap procedure defined there increases the size of the stack by 32 KB, expands the application heap to its new limit, and allocates five additional blocks of master pointers.

Listing 1-4. Setting up your application heap and stack

PROCEDURE DoSetUpHeap;
CONST
   kExtraStackSpace = $8000;              {32 KB}
   kMoreMasterCalls = 5;                  {for 320 master ptrs}
VAR
   count: Integer;
BEGIN
   IncreaseStackSize(kExtraStackSpace);   {increase stack size}
   MaxApplZone;                           {extend heap to limit}
   FOR count := 1 TO kMoreMasterCalls DO
      MoreMasters;                        {64 more master ptrs}
END;

To reduce heap fragmentation, you should call DoSetUpHeap in a code segment that you never unload (possibly the main segment) rather than in a special initialization code segment. This is because MoreMasters allocates a nonrelocatable block. If you call MoreMasters from a code segment that is later purged, the new master pointer block is located above the purged space, thereby increasing fragmentation.

Determining the Amount of Free Memory

Because space in your heap is limited, you cannot usually honor every user request that would require your application to allocate memory. For example, every time the user opens a new window, you probably need to allocate a new window record and other associated data structures. If you allow the user to open windows endlessly, you risk running out of memory. This might adversely affect your application's ability to perform important operations such as saving existing data in a window.

It is important, therefore, to implement some scheme that prevents your application from using too much of its own heap. One way to do this is to maintain a memory cushion that can be used only to satisfy essential memory requests. Before allocating memory for any nonessential task, you need to ensure that the amount of memory that remains free after the allocation exceeds the size of your memory cushion. You can do this by calling the function IsMemoryAvailable defined in Listing 1-5.

Listing 1-5. Determining whether allocating memory would deplete the memory cushion

FUNCTION IsMemoryAvailable (memRequest: LongInt): Boolean;
VAR
   total: LongInt;    {total free memory if heap purged}
   contig: LongInt;   {largest contiguous block if heap purged}
BEGIN
   PurgeSpace(total, contig);
   IsMemoryAvailable := ((memRequest + kMemCushion) < contig);
END;

The IsMemoryAvailable function calls the Memory Manager's PurgeSpace procedure to determine the size of the largest contiguous block that would be available if the application heap were purged; that size is returned in the contig parameter. If the size of the potential memory request together with the size of the memory cushion is less than the value returned in contig, IsMemoryAvailable is set to TRUE, indicating that it is safe to allocate the specified amount of memory; otherwise, IsMemoryAvailable returns FALSE.

Notice that the IsMemoryAvailable function does not itself cause the heap to be purged or compacted; the Memory Manager does so automatically when you actually attempt to allocate the memory.

Usually, the easiest way to determine how big to make your application's memory cushion is to experiment with various values. You should attempt to find the lowest value that allows your application to execute successfully no matter how hard you try to allocate memory to make the application crash. As an extra guarantee against your application's crashing, you might want to add some memory to this value. As indicated earlier in this chapter, 40 KB is a reasonable size for most applications.

CONST
   kMemCushion = 40 * 1024;   {size of memory cushion}

You should call the IsMemoryAvailable function before all nonessential memory requests, no matter how small. For example, suppose your application allocates a new, small relocatable block each time a user types a new line of text. That block might be small, but thousands of such blocks could take up a considerable amount of space. Therefore, you should check to see if there is sufficient memory available before allocating each one.

You should never, however, call the IsMemoryAvailable function before an essential memory request. When deciding how big to make the memory cushion for your application, you must make sure that essential requests can never deplete all of the cushion. Note that when you call the IsMemoryAvailable function for a nonessential request, essential requests might have already dipped into the memory cushion. In that case, IsMemoryAvailable returns FALSE no matter how small the nonessential request is.

Some actions should never be rejectable. For example, you should guarantee that there is always enough memory free to save open documents, and to perform typical maintenance tasks such as updating windows. Other user actions are likely to be always rejectable. For example, because you cannot allow the user to create an endless number of documents, you should make the New Document and Open Document menu commands rejectable.
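For example, a menu-command handler might reject the request up front. This is a sketch; kNewDocumentBytes, DoCreateDocument, and ShowLowMemoryAlert are hypothetical application-defined names, and IsMemoryAvailable is the function defined in Listing 1-5.

PROCEDURE DoNewDocument;
BEGIN
   IF IsMemoryAvailable(kNewDocumentBytes) THEN
      DoCreateDocument        {enough room; honor the request}
   ELSE
      ShowLowMemoryAlert;     {reject the request and tell the user}
END;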

Although the decisions of which actions to make rejectable are usually obvious, modal and modeless dialog boxes present special problems. If you want to make such dialog boxes available at all costs, you must ensure that you allocate a large enough memory cushion to handle the maximum number of these dialog boxes that the user could open at once. If you consider a certain dialog box (for instance, a spelling checker) nonessential, you must be prepared to inform the user that there is not enough memory to open it if memory space becomes low.

Allocating Blocks of Memory

As you have seen, a key element of the memory-management scheme presented in this chapter is to disallow any nonessential memory allocation requests that would deplete the memory cushion. In practice, this means that, before calling NewHandle, NewPtr, or another function that allocates memory, you should check that the amount of space remaining after the allocation, if successful, exceeds the size of the memory cushion.

An easy way to do this is never to allocate memory for nonessential tasks by calling NewHandle or NewPtr directly. Instead call a function such as NewHandleCushion, defined in Listing 1-6, or NewPtrCushion, defined in Listing 1-7.

Listing 1-6. Allocating relocatable blocks

FUNCTION NewHandleCushion (logicalSize: Size): Handle;
BEGIN
   IF NOT IsMemoryAvailable(logicalSize) THEN
      NewHandleCushion := NIL
   ELSE
      BEGIN
         SetGrowZone(NIL);           {remove grow-zone function}
         NewHandleCushion := NewHandleClear(logicalSize);
         SetGrowZone(@MyGrowZone);   {install grow-zone function}
      END;
END;

The NewHandleCushion function first calls IsMemoryAvailable to determine whether allocating the requested number of bytes would deplete the memory cushion. If so, NewHandleCushion returns NIL to indicate that the request has failed. Otherwise, if there is indeed sufficient space for the new block, NewHandleCushion calls NewHandleClear to allocate the relocatable block. Before calling NewHandleClear, however, NewHandleCushion disables the grow-zone function for the application heap. This prevents the grow-zone function from releasing any emergency memory reserve your application might be maintaining.

You can define a function NewPtrCushion to handle allocation of nonrelocatable blocks, as shown in Listing 1-7.

Listing 1-7. Allocating nonrelocatable blocks

FUNCTION NewPtrCushion (logicalSize: Size): Ptr;
BEGIN
   IF NOT IsMemoryAvailable(logicalSize) THEN
      NewPtrCushion := NIL
   ELSE
      BEGIN
         SetGrowZone(NIL);           {remove grow-zone function}
         NewPtrCushion := NewPtrClear(logicalSize);
         SetGrowZone(@MyGrowZone);   {install grow-zone function}
      END;
END;

Listing 1-8 illustrates a typical way to call NewPtrCushion.

Listing 1-8. Allocating a dialog record

FUNCTION GetDialog (dialogID: Integer): DialogPtr;
VAR
   myPtr: Ptr;           {storage for the dialog record}
BEGIN
   myPtr := NewPtrCushion(SizeOf(DialogRecord));
   IF MemError = noErr THEN
      GetDialog := GetNewDialog(dialogID, myPtr, WindowPtr(-1))
   ELSE
      GetDialog := NIL;  {can't get memory}
END;

When you allocate memory directly, you can later release it by calling the DisposeHandle and DisposePtr procedures. When you allocate memory indirectly by calling a Toolbox routine, there is always a corresponding Toolbox routine to release that memory. For example, the DisposeWindow procedure releases memory allocated with the NewWindow function. Be sure to use these special Toolbox routines instead of the generic Memory Manager routines when applicable.
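For instance, a window allocated with NewWindow must be released with DisposeWindow, not DisposePtr. This is a sketch; gBoundsRect is a hypothetical Rect variable, and documentProc is the standard window definition constant.

myWindow := NewWindow(NIL, gBoundsRect, 'Untitled', TRUE, documentProc,
                      WindowPtr(-1), TRUE, 0);   {indirect allocation}
...
DisposeWindow(myWindow);   {release with the matching Toolbox routine}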

Maintaining a Memory Reserve

A simple way to help ensure that your application always has enough memory available for essential operations is to maintain an emergency memory reserve. This memory reserve is a block of memory that your application uses only for essential operations and only when all other heap space has been allocated. This section illustrates one way to implement a memory reserve in your application.

To create and maintain an emergency memory reserve, you follow three distinct steps:

• When your application starts up, you need to allocate a block of reserve memory. Because you allocate the block, it is no longer free in the heap and does not enter into the free-space determination done by IsMemoryAvailable.

• When your application needs to fulfill an essential memory request and there isn't enough space in your heap to satisfy the request, you can release the reserve. This effectively ensures that you always have the memory you request, at least for essential operations.

• Each time through your main event loop, you should check whether the reserve has been released. If it has, you should attempt to recover the reserve. If you cannot recover the reserve, you should warn the user that memory is critically short.

To refer to the emergency reserve, you can declare a global variable of type Handle.

VAR
   gEmergencyMemory: Handle;   {handle to emergency memory reserve}

Listing 1-9 defines a procedure that you can call early in your application's execution (before entering your main event loop) to create an emergency memory reserve. This procedure also installs the application-defined grow-zone function.

Listing 1-9. Creating an emergency memory reserve

PROCEDURE InitializeEmergencyMemory;
BEGIN
   gEmergencyMemory := NewHandle(kEmergencyMemorySize);
   SetGrowZone(@MyGrowZone);
END;

The InitializeEmergencyMemory procedure defined in Listing 1-9 simply allocates a relocatable block of a predefined size. That block is the emergency memory reserve. A reasonable size for the memory reserve is whatever size you use for the memory cushion. Once again, 40 KB is a good size for many applications.

CONST
   kEmergencyMemorySize = 40 * 1024;   {size of memory reserve}

When using a memory reserve, you need to change the IsMemoryAvailable function defined earlier in Listing 1-5. You need to make sure, when determining whether a nonessential memory allocation request should be honored, that the memory reserve has not been released. To check that the memory reserve is intact, use the function IsEmergencyMemory defined in Listing 1-10.

Listing 1-10. Checking the emergency memory reserve

FUNCTION IsEmergencyMemory: Boolean;
BEGIN
   IsEmergencyMemory :=
      (gEmergencyMemory <> NIL) & (gEmergencyMemory^ <> NIL);
END;

Then, you can replace the IsMemoryAvailable function defined in Listing 1-5 with the version defined in Listing 1-11.

Listing 1-11. Determining whether allocating memory would deplete the memory cushion

FUNCTION IsMemoryAvailable (memRequest: LongInt): Boolean;
VAR
   total: LongInt;    {total free memory if heap purged}
   contig: LongInt;   {largest contiguous block if heap purged}
BEGIN
   IF NOT IsEmergencyMemory THEN   {is emergency memory available?}
      IsMemoryAvailable := FALSE
   ELSE
      BEGIN
         PurgeSpace(total, contig);
         IsMemoryAvailable := ((memRequest + kMemCushion) < contig);
      END;
END;

As you can see, this is exactly like the earlier version except that it indicates that memory is not available if the memory reserve is not intact.


Once you have allocated the memory reserve early in your application's execution, it should be released only to honor essential memory requests when there is no other space available in your heap. You can install a simple grow-zone function that takes care of releasing the reserve at the proper moment. Each time through your main event loop, you can check whether the reserve is still intact; to do this, add these lines of code to your main event loop, before you make your event call:

IF NOT IsEmergencyMemory THEN
   RecoverEmergencyMemory;

The RecoverEmergencyMemory procedure, defined in Listing 1-12, simply attempts to reallocate the memory reserve.

Listing 1-12. Reallocating the emergency memory reserve

PROCEDURE RecoverEmergencyMemory;
BEGIN
   ReallocateHandle(gEmergencyMemory, kEmergencyMemorySize);
END;

If you are unable to reallocate the memory reserve, you might want to notify the user that because memory is in short supply, steps should be taken to save any important data and to free some memory.

Defining a Grow-Zone Function

The Memory Manager calls your heap's grow-zone function only after other attempts to obtain enough memory to satisfy a memory allocation request have failed. A grow-zone function should be of the following form:

FUNCTION MyGrowZone (cbNeeded: Size): LongInt;

The Memory Manager passes to your function (in the cbNeeded parameter) the number of bytes it needs. Your function can do whatever it likes to free that much space in the heap. For example, your grow-zone function might dispose of certain blocks or make some unpurgeable blocks purgeable. Your function should return the number of bytes, if any, it managed to free.

When the function returns, the Memory Manager once again purges and compacts the heap and tries again to allocate the requested amount of memory. If there is still insufficient memory, the Memory Manager calls your grow-zone function again, but only if the function returned a nonzero value when last called. This mechanism allows your grow-zone function to release memory gradually; if the amount it releases is not enough, the Memory Manager calls it again and gives it the opportunity to take more drastic measures.

Typically a grow-zone function frees space by calling the EmptyHandle procedure, which purges a relocatable block from the heap and sets the block's master pointer to NIL. This is preferable to disposing of the space (by calling the DisposeHandle procedure), because you are likely to want to reallocate the block.

The Memory Manager might designate a particular relocatable block in the heap as protected; your grow-zone function should not move or purge that block. You can determine which block, if any, the Memory Manager has protected by calling the GZSaveHnd function in your grow-zone function.


Listing 1-13 defines a very basic grow-zone function. The MyGrowZone function attempts to create space in the application heap simply by releasing the block of emergency memory. First, however, it checks that (1) the emergency memory hasn't already been released and (2) the emergency memory is not a protected block of memory (as it would be, for example, during an attempt to reallocate the emergency memory block). If either of these conditions isn't true, then MyGrowZone returns 0 to indicate that no memory was released.

Listing 1-13. A grow-zone function that releases emergency storage

FUNCTION MyGrowZone (cbNeeded: Size): LongInt;
VAR
   theA5: LongInt;           {value of A5 when function is called}
BEGIN
   theA5 := SetCurrentA5;    {remember current value of A5; install ours}
   IF (gEmergencyMemory^ <> NIL) & (gEmergencyMemory <> GZSaveHnd) THEN
      BEGIN
         EmptyHandle(gEmergencyMemory);
         MyGrowZone := kEmergencyMemorySize;
      END
   ELSE
      MyGrowZone := 0;       {no more memory to release}
   theA5 := SetA5(theA5);    {restore previous value of A5}
END;

The function MyGrowZone defined in Listing 1-13 saves the current value of the A5 register when it begins and then restores the previous value before it exits. This is necessary because your grow-zone function might be called at a time when the system is attempting to allocate memory and the value in the A5 register is not correct. See the chapter "Memory Management Utilities" in this book for more information about saving and restoring the A5 register.


CHAPTER 9

CACHING AND INTRO TO FILE SYSTEMS

Introduction to File Systems

• File systems. Important topic - the most crucial data is stored in file systems, and file system performance is a crucial component of overall system performance. In practice, it is maybe the most important component.

• What are files? Data that is readily available, but stored on non-volatile media. Standard place to store files: on a hard disk or floppy disk. Also, data may be a network away.

• Most systems let you organize files into a tree structure, so have directories and files.

• What is stored in files? Latex source, Nachos source, FrameMaker source, C++ object files, executables, Perl scripts, shell files, databases, PostScript files, etc.

• Meaning of a file depends on the tools that manipulate it. Meaning of a Latex file is different for the Latex executable than for a standard text editor. Executable file format has meaning to OS. Object file format has meaning to linker.

• Some systems support a lot of different file types explicitly. Macintosh, IBM mainframes do this. Knowledge of file types built into OS, and OS handles different kinds of files differently.

• In Unix, the meaning of a file is simply a sequence of bytes. How do Unix tools tell file types apart? By looking at the contents! For example, how does Unix tell executables apart from shell scripts apart from Perl files when it executes them?

o Perl Scripts - start with #!/usr/bin/perl. In general, if a file starts with #!tool, the Unix shell interprets the file using tool.

o Shell Scripts - start with a #.

o How about executables? Start with Unix executable magic number. Recall Nachos object file format.

o What about PostScript files? Start with something like %!PS-Adobe-2.0, which printing utilities recognize.

The exceptions: directories and symbolic links are explicitly tagged in Unix. A sketch of this kind of content sniffing follows.
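A minimal sketch of the sniffing above, simplified to just these cases (the helper name and buffer size are illustrative, not real kernel or shell code):

#include <stdio.h>
#include <string.h>

/* Guess a file's type the way Unix tools do: by looking at its contents. */
const char *guess_type(const char *path) {
    char buf[32];
    FILE *f = fopen(path, "rb");
    if (f == NULL) return "unreadable";
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    if (strncmp(buf, "#!", 2) == 0)    return "interpreter script (#!tool)";
    if (buf[0] == '#')                 return "shell script";
    if (strncmp(buf, "%!PS", 4) == 0)  return "PostScript";
    return "unknown (perhaps a binary: check the executable magic number)";
}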

• What about Macintosh? All files have a type (pict, text) and the name of program that created the file. When double click on the file, it automatically starts the program that created file and loads the file. Have to have utilities that twiddle the file metadata (types and program names).

• What about DOS? Have an ad-hoc file typing mechanism built into file naming conventions. So, .com and .exe identify two different kinds of executables. .bat identifies a text batch file. These are enforced by OS (because it is involved with launching executables). Other file extensions are recognized by other programs but not by OS.

• File attributes:

o Name

o Type - in Unix, implicit.

o Location - where file is stored on disk

o Size

o Protection

o Time, date and user identification.

• All file system information is stored in nonvolatile storage in a way that it can be reconstructed on a system crash. Very important for data security.

• How do programs access files? Several general ways:

o Sequential - open it, then read or write from beginning to end.

o Direct - specify the starting address of the data.

o Indexed - index file by identifier (name, for example), then retrieve record associated by name.

Files may be accessed more than one way. A payroll file, for example, may be accessed sequentially by paycheck program and indexed by personnel office. Nachos executable files are accessed directly.
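To make the contrast concrete, here is a minimal sketch in C of the first two access methods, assuming fixed-size records for the direct case (RECSIZE and the function name are illustrative):

#include <fcntl.h>
#include <unistd.h>

#define RECSIZE 128   /* assumed fixed record size for direct access */

void read_both_ways(const char *path, long recno) {
    char buf[RECSIZE];
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    /* Sequential: read from beginning to end, in order. */
    while (read(fd, buf, sizeof buf) > 0)
        ;  /* process each chunk as it arrives */

    /* Direct: compute the record's offset, seek straight to it. */
    lseek(fd, recno * RECSIZE, SEEK_SET);
    read(fd, buf, RECSIZE);

    close(fd);
}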

• File structure can be optimized for a given access mode.

o For sequential access, can have file just laid out sequentially on disk. What is the problem?

o For direct access, can have a disk block table telling where each disk block is. To access indexed data, first traverse disk block table to find right disk block, then go to the block containing data.


o For more sophisticated indexed access, may build an index file. Example: IBM ISAM (Indexed Sequential Access Mode). User selects a key, and system builds a two-level index for the key. Uses binary search at each level of index, then linear search within final block. Notice how memory hierarchy considerations drive file implementation.

• Easy to simulate a sequential access file given a direct access file - just keep track of current file position. But simulating direct access file with a sequential access file is a lot harder.

• Fundamental design choice: lots of file formats or few file formats? Unix: few (one) file format. VMS: few (three). IBM: lots (I don't know just how many).

• Advantage of lots of file formats: user probably has one that fits the bill.

• Disadvantage: OS becomes larger. System becomes harder to use (must choose file format, if get it wrong it is a big problem).

• Directory structure. To organize files, many systems provide a hierarchical file system arrangement. Can have files, and then directories of files. Common arrangement: tree of files. Naming can be absolute, relative, or both.

• There is sometimes a need to share files between different parts of the tree. So, structure becomes a graph. Can get to same file in multiple ways. Unix supports two kinds of links:

o Symbolic Links: directory entry is name of another file. If that file is moved, symbolic link still points to (non-existent) file. If another file is copied into that spot, symbolic link all of a sudden points to it.

o Hard Links: sticks with the file. If file is moved, hard link still points to file. To get rid of file, must delete it from all places that have hard links to it.

The link command (ln) sets these links up; a sketch of the corresponding system calls follows.
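The same two kinds of links can be created from a program with the Unix link and symlink calls; a minimal sketch (the file names are illustrative, error handling omitted):

#include <unistd.h>

/* After this, "data.hard" shares data.txt's inode; "data.soft" is a
   small file that just contains the name "data.txt". */
void make_links(void) {
    link("data.txt", "data.hard");      /* hard link: new name, same file */
    symlink("data.txt", "data.soft");   /* soft link: stores the name only */
}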

• Uses for soft links? Can have two people share files. Can also set up source directories, then link compilation directories to source directories. Typically useful file system structuring tool.

• Graph structure introduces complications. First, must be sure not to delete hard linked files until all pointers to them are gone. Standard solution: reference counts. Second, only want to traverse files once even if have multiple references to same file. Standard solution: marking. cp does not handle this well for soft links; tar handles it well.

• What about cyclic graph structures? Problem is that cycles may make reference counts not work - can have a section of graph that is disconnected from rest, but all entries have positive reference counts. Only solution: garbage collect. Not done very often because it takes so long.

• Unix prevents users from making hard links create cycles by only allowing hard links to point to files, not directories. But, we still have some cycles in the structure (from . and ..).

• Memory-mapped files. Standard view of system: have data stored in address space of a process, but data goes away when process dies. If want to preserve data, must write it to disk, then read it back in again when need it.


• Writing IO routines to dump data to disk and back again is a real hassle. What is worse, if programs share data using files, must maintain consistency between file and data read in via some other mechanism.

• Solution: memory-mapped files. Can map part of file into process's address space and read and write the file like a normal piece of memory. Sort of like memory-mapped IO, generalized to user level. So, processes can share persistent data directly with no hassles. Programs can dump data structures to disk without having to write routines to linearize, output and read in data structures.

• Used for stuff like snapshot files in interactive systems.

• In Unix, the system call that sets this up is the mmap system call. How is sharing set up for processes on the same machine? What about processes on different machines?
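• A minimal sketch of the idea: map a file, store into it like ordinary memory, and let the OS get the change back to disk (the path and the edit are illustrative):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void bump_first_byte(const char *path) {
    int fd = open(path, O_RDWR);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) return;

    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p != MAP_FAILED) {
        p[0] += 1;                 /* a plain store; no write() needed */
        munmap(p, st.st_size);     /* changes reach the file on disk   */
    }
    close(fd);
}

One partial answer to the first question above: MAP_SHARED is what lets processes on the same machine see each other's stores to the mapped region; sharing across machines needs a network file system underneath.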

• Next issue: protection. Why is protection necessary? Because people want to share files, but not share all aspects of all files. Want protection on individual file and operation basis.

o Professor wants students to read but not write assignments.

o Professor wants to keep exam in same directory as assignments, but students should not be able to read exam.

o Can execute but not write commands like cp, cat, etc.

For convenience, want to create coarser grain concepts.

• All people in research group should be able to read and write source files. Others should not be able to access them.

• Everybody should be able to read files in a given directory.

• Conceptually, have operations (open, read, write, execute), resources (files) and principals (users or processes). Can describe desired protection using access matrix. Have list of principals across top and resources on the side. Each entry of matrix lists operations that the principal can perform on the resource.

• Two standard mechanisms for access control: access lists and capabilities.

o Access lists: for each resource (like a file), give a list of principals allowed to access that resource and the access they are allowed to perform. So, each row of access matrix is an access list.

o Capabilities: for each resource and access operation, give out capabilities that give the holder the right to perform the operation on that resource. Capabilities must be unforgeable. Each column of access matrix is a capability list.

Instead of organizing access lists on a principal by principal basis, can organize on a group basis.

• Who controls access lists and capabilities? Done under OS control. Will talk more about security later.


• What is the Unix security model? Have three operations - read, write and execute. Each file has an owner and a group. Protections are given for each operation on basis of everybody, group and owner. Like everything else in Unix, is a fairly simple and primitive protection strategy.

• Unix file listing:

4 drwxr-xr-- 2 martin faculty 2048 May 15 21:03 ./

2 drwxr-xr-x 7 martin faculty 512 May 3 17:46 ../

2 -rw-r----- 1 martin faculty 213 Apr 19 22:27 a0.aux

8 -rw-r----- 1 martin faculty 3488 Apr 19 22:27 a0.dvi

4 -rw-r----- 1 martin faculty 1218 Apr 19 22:27 a0.log

72 -rw-r--r-- 1 martin faculty 36617 Apr 19 22:27 a0.ps

6 -rwxr-xr-x 1 martin faculty 2599 Apr 5 18:07 a0.tex*
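• Those rwx columns are just nine mode bits plus a type; a minimal sketch of reading them back with stat() (the output format is ours):

#include <stdio.h>
#include <sys/stat.h>

/* Print the owner/group/other permission bits of a file - the same
   information ls -l renders as rwxr-xr-- and so on. */
void show_mode(const char *path) {
    struct stat st;
    if (stat(path, &st) < 0) return;
    printf("owner %c%c%c group %c%c%c other %c%c%c\n",
           (st.st_mode & S_IRUSR) ? 'r' : '-',
           (st.st_mode & S_IWUSR) ? 'w' : '-',
           (st.st_mode & S_IXUSR) ? 'x' : '-',
           (st.st_mode & S_IRGRP) ? 'r' : '-',
           (st.st_mode & S_IWGRP) ? 'w' : '-',
           (st.st_mode & S_IXGRP) ? 'x' : '-',
           (st.st_mode & S_IROTH) ? 'r' : '-',
           (st.st_mode & S_IWOTH) ? 'w' : '-',
           (st.st_mode & S_IXOTH) ? 'x' : '-');
}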

• How are files implemented on a standard hard-disk based system? It is up to OS to implement it. Why must OS do this? Protection.

• What does a disk look like? It is a stack of platters. Each platter may have two surfaces (one per side). There is one disk head per surface. The surfaces revolve beneath the heads, with the heads riding on a cushion of air. The heads move back and forth across the platters as a unit. The area beneath a stationary head is a track. The set of tracks that can be accessed without moving the heads is a cylinder. Each track is broken up into sectors. A sector is the unit of disk transfer.

• To read a given sector we first move the heads to that sector's cylinder (seek time), then wait for the sector to rotate under the head (latency time), then copy data off of disk into memory (transfer time).

• Typical hard disk statistics: (Sequel 5400 from August 1993, 5.25 inch 4.0Gbyte).

o Platters: 13

o Read/Write heads: 26

o Tracks/Surface: 3,058

o Track Capacity (bytes): 40,448 - 60,928

o Bytes/Sector: 512 - 520

o Sectors/Track: 79-119

o Media Transfer Rate (MB/s): 3.6-5.5

o Track-to-track Seek: 1.3 ms

o Max Seek: 25 ms


o Average Seek: 12 ms

o Rotational Speed: 5,400 rpm

o Average Latency: 5.6 ms
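• A rough worked example with the numbers above: reading one randomly chosen 512-byte sector costs about average seek + average latency + transfer time = 12 ms + 5.6 ms + (512 bytes / 3.6 MB/s, roughly 0.14 ms), or about 17.7 ms in total. Almost all of the cost is mechanical; the data transfer itself is noise.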

• How does this compare to timings for a standard workstation? DECStation 5000 is a standard workstation available in 1993. Had a 33 MHz MIPS R3000, 60 ns memory. How many instructions can execute in 30 ms (about time for average seek plus average latency)? 33 * 30 * 1000 = 990,000. Plus, many operations require multiple disk accesses.

• What does disk look like to OS? Is just a sequence of sectors. All sectors in a track are in sequence; all tracks in a cylinder are in sequence. Adjacent cylinders are in sequence. OS may logically link several disk sectors together to increase effective disk block size.

• How does OS access disk? There is a piece of hardware on the disk called a disk controller. OS issues instructions to disk controller. Can either use IO instructions or memory-mapped IO operations.

• In effect, disk is just a big array of fixed-size chunks. Job of the OS is to implement file system abstractions on top of these chunks.

File System Implementation

• Discuss several file system implementation strategies.

• First implementation strategy: contiguous allocation. Just lay out the file in contiguous disk blocks. Used in VM/CMS - an old IBM interactive system.

Advantages:

• Quick and easy calculation of block holding data - just offset from start of file!

• For sequential access, almost no seeks required.

• Even direct access is fast - just seek and read. Only one disk access.

Disadvantages:

• Where is best place to put a new file?

• Problems when file gets bigger - may have to move whole file!!

• External Fragmentation.

• Compaction may be required, and it can be very expensive.

• Next strategy: linked allocation. All files stored in fixed size blocks. Link together the blocks of a file like a linked list. Advantages:

o No more variable-sized file allocation problems. Everything takes place in fixed-size chunks, which makes memory allocation a lot easier.


o No more external fragmentation.

o No need to compact or relocate files.

Disadvantages:

o Potentially terrible performance for direct access files - have to follow pointers from one disk block to the next!

o Even sequential access is less efficient than for contiguous files because may generate long seeks between blocks.

o Reliability - if lose one pointer, have big problems.

• FAT allocation. Instead of storing next file pointer in each block, have a table of next pointers indexed by disk block. Still have to linearly traverse next pointers, but at least don't have to go to disk for each of them. Can just cache the FAT table and do the traversal all in memory. MS-DOS and OS/2 use this scheme.

• Table pointer of last block in file has EOF pointer value. Free blocks have table pointer of 0. Allocation of free blocks with FAT scheme is straightforward. Just search for first block with 0 table pointer.
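• A sketch of both FAT operations just described, assuming the table is cached in memory as an array indexed by block number (the sizes, markers, and reserved blocks are illustrative):

#define NBLOCKS 65536
#define FAT_EOF 0xFFFF   /* illustrative end-of-file marker      */
#define FAT_FREE 0       /* free blocks hold a 0 next pointer    */

unsigned short fat[NBLOCKS];   /* cached copy of the on-disk table */

/* Follow the next-pointer chain n steps from a file's first block. */
int nth_block(int first, int n) {
    int b = first;
    while (n-- > 0 && fat[b] != FAT_EOF)
        b = fat[b];               /* all in memory - no disk access */
    return b;
}

/* Allocate a free block: linear search for a 0 table entry. */
int alloc_block(void) {
    for (int b = 2; b < NBLOCKS; b++)   /* blocks 0-1 reserved here */
        if (fat[b] == FAT_FREE) {
            fat[b] = FAT_EOF;           /* now a one-block chain */
            return b;
        }
    return -1;                          /* disk full */
}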

• Indexed Schemes. Give each file an index table. Each entry of the index points to the disk blocks containing the actual file data. Supports fast direct file access, and not bad for sequential access.

• Question: how to allocate index table? Must be stored on disk like everything else in the file system. Have basically same alternatives as for file itself! Contiguous, linked, and multilevel index. In practice some combination scheme is usually used. This whole discussion is reminiscent of paging discussions.

• Will now discuss how traditional Unix lays out file system.

• First 8KB - label + boot block. Next 8KB - Superblock plus free inode and disk block cache.

• Next 64KB - inodes. Each inode corresponds to one file.

• Until end of file system - disk blocks. Each disk block consists of a number of consecutive sectors.

• What is in an inode - information about a file. Each inode corresponds to one file. Important fields:

o Mode. This includes protection information and the file type. File type can be normal file (-), directory (d), symbolic link (l).

o Owner

o Number of links - number of directory entries that point to this inode.

o Length - how many bytes long the file is.

o Nblocks - number of disk blocks the file occupies.


o Array of 10 direct block pointers. These are first 10 blocks of file.

o One indirect block pointer. Points to a block full of pointers to disk blocks.

o One doubly indirect block pointer. Points to a block full of pointers to blocks full of pointers to disk blocks.

o One triply indirect block pointer. (Not currently used).

So, a file consists of an inode and the disk blocks that it points to.

• Nblocks and Length do not contain redundant information - can have holes in files. A hole shows up as block pointers that point to block 0 - i.e., nothing in that block.

• Assume block size is 512 bytes (i.e. one sector). To access any of first 512*10 bytes of file, can just go straight from inode. To access data farther in, must go through at least one level of indirection.
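• A sketch of the block lookup this layout implies, assuming 512-byte blocks and 4-byte block pointers (so one indirect block holds 128 pointers); ireadblk is a hypothetical helper that reads one disk block into memory:

#define BLKSIZE 512
#define NDIRECT 10
#define NINDIRECT (BLKSIZE / 4)   /* 128 four-byte pointers per block */

void ireadblk(int blk, void *buf);   /* hypothetical disk-read helper */

/* Map a byte offset in a file to the disk block holding it. */
int offset_to_block(const int *direct, int indirect, long offset) {
    long lbn = offset / BLKSIZE;          /* logical block number */
    if (lbn < NDIRECT)
        return direct[lbn];               /* straight from the inode */
    lbn -= NDIRECT;
    if (lbn < NINDIRECT) {                /* costs one extra disk access */
        int ptrs[NINDIRECT];
        ireadblk(indirect, ptrs);
        return ptrs[lbn];
    }
    return -1;  /* doubly/triply indirect cases omitted from this sketch */
}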

• What does a directory look like? It is a file consisting of a list of (name,inode number) pairs. In early Unix Systems the name was a maximum of 14 characters long, and the inode number was 2 bytes. Later versions of Unix removed this restriction, and each directory entry was variable length and also included the length of the file name.
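• The early on-disk entry is small enough to write down; a sketch (the struct name is ours, the field sizes are the ones just described):

/* One directory entry in early Unix: a 2-byte inode number and a fixed
   14-character name. A directory file is just an array of these. */
struct early_dirent {
    unsigned short d_ino;      /* inode number; 0 means unused slot */
    char           d_name[14];
};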

• Why don't inodes contain names? Because would like a file to be able to have multiple names.

• How does Unix implement the directories . and ..? They are just names in the directory. . points to the inode of the directory, while .. points to the inode of the directory's parent directory. So, there are some circularities in the file system structure.

• User can refer to files in one of two ways: relative to current directory, or relative to the root directory. Where does lookup for root start? By convention, inode number 2 is the inode for the top directory. If a name starts with /, lookup starts at the file for inode number 2.

• How does system convert a name to an inode? There is a routine called namei that does it.

• Do a simple file system example, draw out inodes and disk blocks, etc. Include counts, length, etc.

• What about symbolic links? A symbolic link is a file containing a file name. Whenever a Unix operation has the name of the symbolic link as a component of a file name, it macro substitutes the name in the file in for the component.

• What disk accesses take place when list a directory, cd to a directory, cat a file? Is there any difference between ls and ls -F?

• What about when use the Unix rm command? Does it always delete the file? NO - it decrements the reference count. If the count is 0, then it frees up the space. Does this algorithm work for directories? NO - directory has a reference to itself (.). Use a different command.

• When write a file, may need to allocate more inodes and disk blocks. The superblock keeps track of data that help this process along. A superblock contains:


o the size of the file system

o number of free blocks in the file system

o list of free blocks available in the file system

o index of next free block in free block list

o the size of the inode list

o the number of free inodes in the file system

o a cache of free inodes

o the index of the next free inode in inode cache

• The kernel maintains the superblock in memory, and periodically writes it back to disk. The superblock also contains crucial information, so it is replicated on disk in case part of disk fails.

• When OS wants to allocate an inode, it first looks in the inode cache. The inode cache is a stack of free inodes, the index points to the top of the stack. When the OS allocates an inode, it just decrements index. If the inode cache is empty, it linearly searches inode list on disk to find free inodes. An inode is free if its type field is 0. So, when go to search inode list for free inodes, keep looking until wrap or fill inode cache in superblock. Keep track of where stopped looking - will start looking there next time.

• To free an inode, put it in superblock's inode cache if there is room. If not, don't do anything much. Only check against the number where OS stopped looking for inodes the last time it filled the cache. Make this number the minimum of the freed inode number and the number already there.

• OS stores list of free disk blocks as follows. The list consists of a sequence of disk blocks. Each disk block in this sequence stores a sequence of free disk block numbers. The first number in each disk block is the number of the next disk block in this sequence. The rest of the numbers are the numbers of free disk blocks. The superblock has the first disk block in this sequence.

• To allocate a disk block, check the superblock's block of free disk blocks. If there are at least two numbers, grab the one at the top and decrement the index of next free block. If there is only one number left, it contains the index of the next block in the disk block sequence. Copy this disk block into the superblock's free disk block list, and use it as the free disk block.

• To free a disk block do the reverse. If there is room in the superblock's disk block, push it on there. If not, write superblock's disk block into free block, then put index of newly free disk block in as first number in superblock's disk block.
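• A sketch of the allocation half of this scheme, under illustrative assumptions: sb_free is the superblock's cached block of free disk block numbers, sb_nfree is the stack index, and readblk is a hypothetical helper that reads one disk block:

#define NFREE 100          /* illustrative: numbers per free-list block */

int sb_free[NFREE];        /* superblock's cached free-list block        */
int sb_nfree;              /* index of next free number (stack top)      */

void readblk(int blk, void *buf);   /* hypothetical disk-read helper */

/* Allocate one disk block from the free list. */
int balloc(void) {
    int b = sb_free[--sb_nfree];       /* pop the top number */
    if (sb_nfree == 0) {
        /* Last number: it names the next block in the free-list chain.
           Copy that block's numbers into the superblock cache, then hand
           the chain block itself out as the allocated block. */
        readblk(b, sb_free);
        sb_nfree = NFREE;
    }
    return b;
}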

• Note that OS maintains a list of free disk blocks, but only a cache of free inodes. Why is this?

o Kernel can determine whether inode is free or not just by looking at it. But, cannot with disk block - any bit pattern is OK for disk blocks.

o Easy to store lots of free disk block numbers in one disk block. But, inodes aren't large enough to store lots of inode numbers.


o Users consume disk blocks faster than inodes. So, pauses to search for inodes aren't as bad as searching for disk blocks would be.

o Inodes are small enough to read in lots in a single disk operation. So, scanning lists of inodes is not so bad.

• Synchronizing multiple file accesses. What should correct semantics be for concurrent reads and writes to the same file? Reads and writes should be atomic:

o If a read and a write execute concurrently, the read should observe either the entire write or none of the write.

o Reads can execute concurrently with no atomicity constraints.

• How to implement these atomicity constraints? Implement reader-writer locks for each open file. Here are some operations:

o Acquire read lock: blocks until no other process has a write lock, then increments read lock count and returns.

o Release read lock: decrements read lock count.

o Acquire write lock: blocks until no other process has a write or read lock, then sets the write lock flag and returns.

o Release write lock: clears write lock flag.

• Obtain read or write locks inside the kernel's system call handler. On a Read system call, obtain read lock, perform all file operations required to read in the appropriate part of file, then release read lock and return. On Write system call, do something similar except get write locks.
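• A minimal sketch of such a reader/writer lock, here rendered with POSIX threads as an assumption (a real kernel would build it on its own sleep/wakeup machinery):

#include <pthread.h>

/* Per-open-file reader/writer lock: many readers or one writer. */
struct rwlock {
    pthread_mutex_t m;
    pthread_cond_t  ok;   /* signaled when the lock may be grabbed */
    int readers;          /* count of held read locks  */
    int writing;          /* 1 if a writer holds the lock */
};

void acquire_read(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    while (l->writing)                 /* block until no write lock */
        pthread_cond_wait(&l->ok, &l->m);
    l->readers++;
    pthread_mutex_unlock(&l->m);
}

void release_read(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    if (--l->readers == 0)
        pthread_cond_broadcast(&l->ok);
    pthread_mutex_unlock(&l->m);
}

void acquire_write(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    while (l->writing || l->readers > 0)   /* need exclusive access */
        pthread_cond_wait(&l->ok, &l->m);
    l->writing = 1;
    pthread_mutex_unlock(&l->m);
}

void release_write(struct rwlock *l) {
    pthread_mutex_lock(&l->m);
    l->writing = 0;
    pthread_cond_broadcast(&l->ok);
    pthread_mutex_unlock(&l->m);
}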

• What about Create, Open, Close and Delete calls? If multiple processes have file open, and a process calls Delete on that file, all processes must close the file before it is actually deleted. Yet another form of synchronization is required.

• How to organize synchronization? Have a global file table in addition to local file tables. What does each file table do?

o Global File Table: Indexed by some global file id - for example, the inode index would work. Each entry has a reader/writer lock, a count of number of processes that have file open and a bit that says whether or not to delete the file when last process that has file open closes it. May have other data depending on what other functionality file system supports.

o Local File Table: Indexed by open file id for that process. Has a pointer to the current position in the open file to start reading from or writing to for Write and Read operations.

• For your Nachos assignments, do not have to implement reader/writer locks - can just use a simple mutual exclusion lock.

• What are sources of inefficiency in this file system? Are two kinds - wasted time and wasted space.


• Wasted time comes from waiting to access the disk. Basic problem with system described above: it scatters related items all around the disk.

o Inodes separated from files.

o Inodes in same directory may be scattered around in inode space.

o Disk blocks that store one file are scattered around the disk.

So, system may spend all of its time moving the disk heads and waiting for the disk to revolve.

• The initial layout attempts to minimize these phenomena by setting up free lists so that they allocate consecutive disk blocks for new files. So, files tend to be consecutive on disk. But, as the file system is used, the layout gets scrambled. So, the free list order becomes increasingly randomized, and the disk blocks for files get spread all over the disk.

• Just how bad is it? Well, in traditional Unix, the disk block size equaled the sector size, which was 512 bytes. When they went from 3BSD to 4.0BSD they doubled the disk block size. This more than doubled the disk performance. Two factors:

o Each block access fetched twice as much data, so amortized the disk seek overhead over more data.

o The file blocks were bigger, so more files fit into the direct section of the inode index.

But, still pretty bad. When the file system was first created, it got transfer rates of up to 175 KByte per second. After a few weeks, this deteriorated down to 30 KByte per second. What is worse, this is only about 4 percent (!!!!) of maximum disk throughput. So, the obvious fix is to make the block size even bigger.

• Wasted space comes from internal fragmentation. Each file with anything in it (even small ones) takes up at least one disk block. So, if file size is not an even multiple of disk block size, there will be wasted space off the end of the last disk block in the file. And, since most files are small, there may not be lots of full disk blocks in the middle of files.

• Just how bad is it? It gets worse for larger block sizes. (so, maybe making block size bigger to get more of the disk transfer rate isn't such a good idea...). Did some measurements on a file system at Berkeley, to calculate size and percentage of waste based on disk block size. Here are some numbers:

Space Used (MB)    Percent Waste    Organization
775.2              0.0              Data only, no separation between files
828.7              6.9              Data + inodes, 512 byte block
866.5              11.8             Data + inodes, 1024 byte block
948.5              22.4             Data + inodes, 2048 byte block
1128.3             45.6             Data + inodes, 4096 byte block

• Notice that a problem is that the presence of small files kills large file performance. If only had large files, would make the block size large and amortize the seek overhead down to some very small number. But, small files take up a full disk block and large disk blocks waste space.

• In 4.2BSD they attempted to fix some of these problems.

• Introduced concept of a cylinder group. A cylinder group is a set of adjacent cylinders. A file system consists of a set of cylinder groups.

• Each cylinder group has a redundant copy of the super block, space for inodes and a bit map describing available blocks in the cylinder group. Default policy: allocate 1 inode per 2048 bytes of space in cylinder group.

• Basic idea behind cylinder groups: will put related information together in the same cylinder group and unrelated information apart in different cylinder groups. Use a bunch of heuristics.

• Try to put all inodes for a given directory in the same cylinder group.

• Also try to put blocks for one file adjacent in the cylinder group. Using a bitmap rather than a free list makes it easier to find adjacent groups of blocks. For long files, redirect blocks to a new cylinder group every megabyte. This spreads stuff out over the disk at a large enough granularity to amortize the seek time.

• Important point to making this scheme work well - keep a free space reserve (5 to 10 percent). Once above this reserve, only supervisor can allocate disk blocks. If disk is almost completely full, allocation scheme cannot keep related data together and allocation scheme degenerates to random.

• Increased block size. The minimum block size is now 4096 bytes. Helps read bandwidth and write bandwidth for big files. But, doesn't this waste a lot of space for small files? Solution: introduce the concept of a disk block fragment.

• Each disk block can be chopped up into 2, 4, or 8 fragments. Each file contains at most one fragment which holds the last part of data in the file. So, if have 8 small files they together only occupy one disk block. Can also allocate larger fragments if the end of the file is larger than one eighth of the disk block. The bit map is laid out at the granularity of fragments.

• When increase the size of the file, may need to copy out the last fragment if the size gets too big. So, may copy a file multiple times as it grows. The Unix utilities try to avoid this problem by growing files a disk block at a time.

• Bottom line: this helped a lot - read bandwidth up to 43 percent of peak disk transfer rate for large files.


• Another standard mechanism that can really help disk performance - a disk block cache. OS maintains a cache of disk blocks in main memory. When a request comes in, it can satisfy the request locally if the data is in the cache. This is part of almost any IO system in a modern machine, and can really help performance.

• How does caching algorithm work? Devote part of main memory to cached data. When read a file, put into disk block cache. Before reading a file, check to see if appropriate disk blocks are in the cache.

• What about replacement policy? Have many of same options as for paging algorithms. Can use LRU, FIFO with second chance, etc.

• How easy is it to implement LRU for disk blocks? Pretty easy - OS gets control every time disk block is accessed. So can implement an exact LRU algorithm easily.
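• A sketch of exact LRU for the disk block cache, assuming cache entries are threaded on a doubly linked recency list; touch runs on every block access, which is exactly the hook the OS has here:

/* Cache entries form a doubly linked list ordered by recency of use. */
struct cblock {
    struct cblock *prev, *next;
    int blockno;
    /* ... cached data ... */
};

struct cblock *mru, *lru;   /* the two ends of the recency list */

/* Called on every access: move the block to the MRU end. */
void touch(struct cblock *b) {
    if (b == mru) return;
    /* unlink from current position */
    if (b->prev) b->prev->next = b->next;
    if (b->next) b->next->prev = b->prev;
    if (b == lru) lru = b->prev;
    /* relink at the MRU end */
    b->prev = NULL;
    b->next = mru;
    if (mru) mru->prev = b;
    mru = b;
    if (lru == NULL) lru = b;
}
/* To replace: evict *lru, the least recently used block. */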

• How easy was it to implement an exact LRU algorithm for virtual memory pages? How easy was it to implement an approximate LRU algorithm for virtual memory pages?

• Bottom line: different context makes different cache replacement policies appropriate for disk block caches.

• What is bad case for all LRU algorithms? Sequential accesses. What is common case for file access? Sequential accesses. How to fix this? Use free-behind for large sequentially accessed files - as soon as finish reading one disk block and move to the next, eject first disk block from the cache.

• So what cache replacement policy do you use? Best choice depends on how file is accessed. So, policy choice is difficult because may not know.

• Can use read-ahead to improve file system performance. Most files are accessed sequentially, so can optimistically prefetch disk blocks ahead of the one that is being read.

• Prefetching is a general technique used to increase the performance of fetching data from long-latency devices. Can try to hide latency by running something else concurrently with the fetch.

• With disk block caching, physical memory serves as a cache for the files stored on disk. With virtual memory, physical memory serves as a cache for processes stored on disk. So, have one physical resource shared by two parts of system.

• How much of each resource should file cache and virtual memory get?

o Fixed allocation. Each gets a fixed amount. Problem - not flexible enough for all situations.

o Adaptive - if run an application that uses lots of files, give more space to file cache. If run applications that need more memory, give more to virtual memory subsystem. Sun OS does this.

• How to handle writes. Can you avoid going to disk on writes? Possible answers:

o No - user wants data on stable storage, that's why he wrote it to a file.


o Yes - keep in memory for a short time, and can get big performance improvements. Maybe the file is deleted, so don't ever need to use the disk at all. Especially useful for /tmp files. Or, can batch up lots of small writes into a larger write, or can give the disk scheduler more flexibility.

In general, depends on needs of the system.

• One more question - do you keep data written back to disk in the file cache? Probably - may be read in the near future, so should keep it resident locally.

• One common problem with file caches - if use file system as backing store, can run into double caching. Eject a page, and it gets written back to file. But, disk blocks from recently written files may be cached in memory in the file cache. In effect, file caching interferes with performance of the virtual memory system. Fix this by not caching backing store files.

• An important issue for file systems is crash recovery. Must maintain enough information on disk to recover from crashes. So, modifications must be carefully sequenced to leave disk in a recoverable state at all times.

An old Homework problem

Consider a virtual memory system with 32 bit addresses and 8 KB pages. Each page table entry is 2 bytes long, and there is one level of page table. The entire 32 bit address space is available to processes.

Consider a process that has allocated the top and bottom 128 MB of its address space. How much memory does its page table use? (Note that 1 MB = 1,048,576 bytes.)

Now consider the same process in a two level paging system with the same size pages, but two page identifiers of 10 and 9 bits for the first and second level of the page table. How much memory does this page table consume for the same process?

Answer: The first page table takes the same amount of space for all processes. 8 KB pages imply that the offset takes up 13 bits, leaving 19 bits for the page identifier. There are 2^19 two-byte entries, which is 2^20 = 1,048,576 bytes (or 1 MB).

The second page table requires one main page table of 2^10 two-byte entries, which takes up 2048 bytes (or 2 KB). Each secondary page table addresses 2^9 = 512 pages. Because each page is 8 KB, each secondary page table addresses 4,194,304 bytes (4 MB) of data. Each 128 MB region requires 32 secondary page tables, for a total of 64 secondary page tables, each of 2^9 two-byte entries (1024 bytes). The whole page table takes 2048 + 64 × 1024 = 67,584 bytes, or 66 KB.

Another old homework problem

Consider a virtual memory system. This system also has a hardware virtual memory cache on the processor.

The hit rates and service times for the layers of the memory system are:

Caching level   Hit Rate   Service Time
CPU Cache       90%        1 ns
Main Memory     75%        1 microsecond
Page Fault      100%       10 milliseconds (includes translation and retrieval)

The layers are tried in order, so a main memory hit only occurs on a cache miss; a main memory hit is therefore (0.1 × 0.75) = 7.5% of all memory references. What is the average service time of a memory access on this machine? Note that the service time for a layer is paid whether it hits or misses: the time to serve a cache miss that is in main memory is the service time of the cache plus the service time of the memory, because the service time must be paid to determine that the item you want is not at this level. Which improves memory performance more, improving the processor cache service time by 10% (a 0.9 ns lookup time) or increasing the main memory hit rate by 5% (to 80%) by choosing a better page replacement algorithm? Calculate the average memory access time for both changes.

Answer: Absolute rates are:

Caching level   Total Hit Rate             Service Time
Cache           90%                        1 ns
Main Memory     (0.1 × 0.75) = 7.5%        10 ns
Page Fault      (0.1 × 0.25 × 1) = 2.5%    10 ms

So the average memory access time is

access time = 0.9(1 × 10^-9) + 0.075(1 × 10^-9 + 10 × 10^-9) + 0.025(1 × 10^-9 + 10 × 10^-9 + 10 × 10^-3)
            ≈ 1(1 × 10^-9) + 0.1(10 × 10^-9) + 0.025(10 × 10^-3)
            = 250,002 × 10^-9 s = 250.002 µs

Reducing cache service time to 0.9 ns gives a memory access time of:

access time = 0.9(0.9 × 10^-9) + 0.075(0.9 × 10^-9 + 10 × 10^-9) + 0.025(0.9 × 10^-9 + 10 × 10^-9 + 10 × 10^-3)
            ≈ 1(0.9 × 10^-9) + 0.1(10 × 10^-9) + 0.025(10 × 10^-3)
            = 250,001.9 × 10^-9 s = 250.0019 µs

Increasing main memory hit rate to 80% gives:

access time = 0.9(1 × 10^-9) + 0.08(1 × 10^-9 + 10 × 10^-9) + 0.02(1 × 10^-9 + 10 × 10^-9 + 10 × 10^-3)
            ≈ 1(1 × 10^-9) + 0.1(10 × 10^-9) + 0.02(10 × 10^-3)
            = 200,002 × 10^-9 s = 200.002 µs

So take the increased hit rate.

File Systems

File systems provide applications with permanent storage. More than that, they organize and protect data, and provide a clean interface to allow manipulation of that data. It’s no exaggeration to say that providing a file system is one of the major services of general purpose operating systems, and of less general ones as well. (Even the Palm Pilot has permanent storage.)


Files

A file is a persistent, hardware-independent, named, protected collection of bits, together with a collection of operations that can be executed on them. The access operations generally impose an order on the bits. These attributes define what files are used for.

Persistence implies that the bytes have a meaning that extends in time. Memory used in calculating intermediate results doesn’t have that attribute. One wouldn’t store the memory used in a computation in a file because it has no long-term use. Because the data in files has this long term significance, files are stored on more permanent media. These days, the most common medium is still magnetic disk, although several others are making bids. Some other media that can contain files are memory, flash memory[1], tapes, CD-ROMs, and more esoteric media. Basically anything that can hold information permanently and be read by a computer has held a file system, or will eventually.

By definition, files are largely medium-independent. The same operations are generally allowed on files regardless of the underlying storage medium. There are obvious exceptions - you can write to a CD-ROM at most once, and there are obvious drawbacks to trying to move from a byte at the front of a tape to one at the back. In general, however, code that manipulates files on one medium will work on others. This saves a lot of programmer time, as we’ll see.

Finally, file systems provide a way to name files. This is a seemingly simple function that turns out to be enormously powerful.[2] File systems provide ways to name files that span multiple media on the same machine (the UNIX® file system), loosely connected local area networks (the Network File System (NFS)) and even global name spaces (the Andrew File System (AFS)). Providing a name space outside the confines of memory addresses allows processes to share data and communicate.

Because files are outside memory, they are also outside the protection of the memory protection system.[3]

As a result, the file system has to impose ideas of user identity and related privileges on the data.

The File Abstraction

Although the idea of an abstract, named collection of bytes is easy enough to grasp, the Devil is in the details. Despite the fairly simple idea of what a file is, files on different operating systems can be remarkably different. We’ll discuss the variations in the file abstraction along the following axes:

• Naming

• Data Structure and Access Patterns

• File Types

• Attributes

• Operations

[1] Flash memory holds the data stored in it even when the power is off. So does core memory, but your generation will only see that in museums.

[2] Magicians and conjurers have long believed that to know a thing’s name gives a person power over that thing. So it is with computers. Naming information and the operations thereon is the heart of computer science.


[3] Unless we put them there, like Multics does. Even then the initial permissions on the memory segments are derived from the files themselves.

Naming

A file generally has a name, a string of bits that (usually) correspond to human-readable letters. The operating system defines what characters are valid in file names, and any equivalence classes between them.

For example, UNIX allows any byte except hex 0x2f (ASCII for /) to appear in a filename. MS-DOS limits the character set to uppercase letters and a few symbols (letters are converted to uppercase in all file names). AmigaDOS allows you to specify any capitalization, but internally ignores case. "FILE", "file", and "File" all refer to the same file, although a file will appear in directory listings spelled as its creator spelled it.

Beyond the character set, operating systems impose a structure on the names of files. Under UNIX the restraints are minor - filenames can’t contain /. Compare this to MS-DOS (and the Windows\d\d systems that basically sit atop it), which limits file names to 8 characters plus a three-character extension. (Systems other than MS-DOS have the idea of an extension, or a naming convention for related files.)

Different operating systems depend to differing extents on the structure of the file names. MS-DOS defines an executable file by its extension, while UNIX generally takes it as a hint. Other programs, like compilers, place varying degrees of emphasis on file names. For example gcc uses the file extension to determine which of the languages it supports should be used to compile the source file, although the behavior can be overridden.

File Structure

In its simplest form, often called a flat file, a file is a collection of ordered bytes. Some systems, however, place additional structure on files. For example files under some operating systems consist of records, or collections of bytes. Rather than reading single bytes or seeking to arbitrary offsets, files are always accessed in terms of records. The records can be fixed length or variable.

Arranging a file as records implies the existence of a schema (or description of the record) either embedded in the file in a manner that the OS can read or separately in the system (perhaps elsewhere in the file system).

Many people think of record-based files as database entries, and that’s one common use of them.

Another common type of record based file was the card file - a file that was a sequence of 80 character records that was the electronic equivalent of a stack of punched cards, or of printer lines.

Record-based files may display an ordering that is independent of the way the bits are ordered on the underlying storage medium. The internal structure of the file may reflect this possibility and be significantly more complex than a flat file. We will discuss the details of this when we discuss the implementation of file systems.

Record-based files impose a structure on the data and allow the operating system to keep that structure intact. The flip-side of that is that record-based files are less flexible. Translation between formats or adding a new format is frequently difficult.


I should note that record-based access is provided by and enforced by the operating system. It’s not an application convention (although many applications create the illusion of record-based files in a flat file system.) Record based file systems cannot seek to a specific byte in the file, or ask for a set of bytes that spans a record boundary. Records are the fundamental building block of files in such systems in the same way that bytes are the fundamental building blocks of flat files. (Alternatively, flat files are files with 1 byte unformatted records).

Related to the building blocks of the files is the access method that a file supports. The access methods are an abstraction of the underlying hardware.

A file system that supports only sequential access requires all files to be read or written from the first to the last byte, only. Such access methods are appropriate for files residing on a magnetic tape, for example.

Files on disk or CD-ROM

Note well the distinction between an application’s access pattern and the access method allowed by the OS. An application may read, sequentially, a configuration file that supports random access. That means that the operating system allows any access pattern, but the application chose to read the file sequentially. The reverse cannot be done; a file that supports only sequential access cannot have its bytes read in another order.

Other access methods include read-only, for unmodifiable files, or indexed, for record-based files that have been sorted multiple ways (the OS must support that, of course: this is an OS feature, not a simulation by the application, and the read system calls return the records in sorted order).

File Types

Files have various uses: they hold data for various programs, programs themselves, free-form text intended to be read by humans, and other things. Some operating systems allow the contents of a file to be directly encoded as a file type.

File types can be an explicit piece of information remembered by the operating system (a file attribute - see below) or can be a combination of name, permissions (another attribute - see below) and contents.

How types are encoded in the system and how rigidly type restrictions are enforced determine how strongly typed the file system is. Strongly-typed operating systems require file types to be encoded in every file, and enforce restrictions. As a result files generally have only one function, and their use can be easily controlled.

Older business computers, especially mainframes, had strongly-typed systems. The advantage is that file use, and the procedures that underlie it, can be tightly constrained. If files that can be printed as checks can only be created by a few trusted programs, it makes forging a check more difficult. It also makes creating one for a legitimate purpose that the system hasn’t been programmed for more difficult.

Other general purpose systems, like UNIX and Windows, rely on a combination of filenames, permissions, and file contents to provide typing. These systems generally only care about determining if a file is executable (that is contains a program that can be run) or not, leaving the business of discriminating between data types to the applications that use them. (Applications use similar criteria to differentiate).

UNIX considers a file executable if the user has the right to execute the file encoded in the permissions and if the file is correctly formatted as an executable. Formatting is generally checked by a magic number in the first few bytes of the file. MS-DOS relies on the file extension and the internal format. Check out the file command in UNIX for a list of some of the magic numbers used by modern versions of UNIX.

How deeply types are built into the file system is a tradeoff between codifying practices in the OS and allowing adaptability.

Permissions and Attributes

Permissions encode the operations allowable on a file and what users (strictly, processes acting on behalf of users; we’ll discuss this in a little while) are allowed to perform them. You can think of this as a list of all the possible operations on a file, and of the users allowed to perform each. In practice, the representations are smaller and the listings less exhaustive. Consider UNIX file permissions: each file has an owning user and group, and the rights to read, write, and execute the file are controlled for each of those entities and for other users. For example, a file might be readable and writable by its owner, readable by members of its group, and not accessible to other users.

We will talk more about permissions in a few days, but right now, the owner, group and permissions are interesting as attributes of the file. Attributes are meta-data, that is data about the file itself, not the data within it. Permissions are a good example: they control what processes may access the underlying data of the file, but are independent of that data.

Some other common attributes are:

• creator

• owner

• system file flag

• hidden flag

• temporary flag

• creation time

• modification time

• information about the last backup

• lock information

• current size

• maximum size

Meta-data is used for a variety of reasons. Some of it is for human use: a data provenance. Some of it is for internal OS use, for example the backup flags. The ability of the file system to store both data and data about the data is an important aspect of the system. For example, the make utility would be useless without the meta-data telling the program the relative ages of the files.

File Operations

File operations are a generally simple and intuitive set of operations on files. Not all of these are supported by all file systems. In some file systems, operations are implemented in terms of other file operations.

Still, these give a good feel for file operations.

Create: Create a new file. This may allocate space for the file, or just reserve the name for future action.

Delete: Delete an existing file. The ability to delete a file is distinct from (but often related to) the ability to modify its contents. Some operating systems use the existence of a file to start a service, so the existence of such a file is its most important attribute, and should therefore be protected.

Open: This lets the OS know that the current process will be interested in a file soon. In some sense, it’s extraneous, but in cases where the file resides on a medium that has significant startup cost (a robotic tape cabinet) not returning the open call until the file is ready for access is a good idea. Your Nachos work should give you an idea about some of the OS set up work that’s done here.

Close: Let the OS know that the process is done with this file, and that the OS can reclaim the resources allocated to manipulating the file. (The data and meta-data are updated but remain in the file system, of course.) Some systems delay writes or cache data for future reads. Close is an indication to them that pending writes must be flushed and that cached reads can be discarded.

Seek: Files have a notion of the current byte (or record) of the file that will next be accessed. On files that can be randomly accessed, seek allows the calling process to set the current byte (or record).

Read: Get some data from the file. The OS may take this opportunity to predict future behavior, collect statistics or do other support work in addition to bringing the data from secondary storage to the process’s address space. Reading occurs at the current byte/record.

Write: Write some data to the file. Strictly this means to change existing data in the file to a new value, but many systems also use the write system call to append data. Like read, there may be secondary actions associated with this. Writing occurs at the current byte/record.

Note that reading and writing may cache data, and that such caches have to be coordinated so that all processes see a consistent version of the file.

Append: Add data to the end of the file. This shares aspects with write, but includes the idea that the underlying file is changing size. Append means that the OS must increase the allocation of storage to the file. As I mentioned above, some OSes determine whether a write system call causes an append or a write based on the current byte and the length of the buffer written.

Get/Set Attributes: For those attributes that can be modified directly by users (for example, the backup flag), this provides access.

Rename: Change a name of the file in the file system. This may be an operation on the file or a directory depending on the file system.
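A short sketch exercising several of these operations through the familiar Unix interface (the file names are illustrative, and error handling is mostly omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int demo(void) {
    int fd = open("notes.txt", O_WRONLY | O_CREAT | O_APPEND, 0644); /* create */
    if (fd < 0) return -1;
    write(fd, "hello\n", 6);          /* append: the file grows */
    close(fd);                        /* OS may now flush and reclaim */

    struct stat st;
    stat("notes.txt", &st);           /* get attributes (size, times, ...) */
    printf("size = %lld\n", (long long)st.st_size);

    rename("notes.txt", "notes.old"); /* rename: a directory operation */
    unlink("notes.old");              /* delete */
    return 0;
}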


Memory Mapping Files

As we discussed in the memory management unit, it is sometimes convenient to move a file into memory and access it directly there. The easiest way to do this is to make the file on disk (or whatever) the backing store for a segment (or section of a paged VM space) and let the paging system handle the writes.

The big issue here is consistency - when do changes in memory get reflected to the file in secondary storage?

The paging system is probably pretty lazy about getting changes out to a file, compared to a file write, but writing a whole page to disk on every memory write is probably too slow.

One solution is to keep data about which files are memory mapped (as an attribute) and satisfy reads of those files from the memory system rather than from the file.

Putting Devices in the File System

The simple, powerful semantics of files lend themselves to controlling a variety of resources. For example, terminal input can be thought of as a sequential read-only file, and terminal output as a sequential write-only file. This means that programs can be written to take an input and output file and transparently run interactively or from disk files. Furthermore, by introducing memory-resident sequential files that are written by one process and read by another, called pipes, programs can be linked together in arbitrary ways.

The OS has to do extra work to make a terminal look like a file, or create a pipe, but the result is a simpler programming model for developers: I/O is file I/O.

UNIX takes this to an extreme - nearly every OS resource has a file system interface. Physical disk drives are a special type of file, so are terminals, modems and printers. All of these can be accessed through the familiar file system interface. Recently, the data of running processes has been added to the list of data accessible through the file system (the /proc file system).

Furthermore, the access control mechanisms of files can be directly applied to prevent unauthorized manipulations of the hardware. Raw disk drives or modem devices can only be accessed by privileged users, and those accesses are controlled by the same system as file accesses. A unified access control system is easier to use and only having one to debug implies that it will be more secure.
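The payoff is visible in code: the same calls that write a disk file write a terminal. A minimal sketch (using the standard Unix /dev/tty device file):

#include <fcntl.h>
#include <unistd.h>

/* Write a message to the controlling terminal using ordinary file I/O.
   Nothing here knows it is talking to hardware rather than a disk file. */
void say_hello(void) {
    int fd = open("/dev/tty", O_WRONLY);
    if (fd < 0) return;
    write(fd, "hello, terminal\n", 16);
    close(fd);
}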

All is not a bed of roses, though. Hardware has features that are not easily expressed as file operations.


CHAPTER 10

DIRECTORIES AND SECURITY

The terms protection and security are often used together, and the distinction between them is a bit blurred, but security is generally used in a broad sense to refer to all concerns about controlled access to facilities, while protection describes specific technological mechanisms that support security.

Security

As in any other area of software design, it is important to distinguish between policies and mechanisms. Before you can start building machinery to enforce policies, you need to establish what policies you are trying to enforce. Many years ago, I heard a story about a software firm that was hired by a small savings and loan corporation to build a financial accounting system. The chief financial officer used the system to embezzle millions of dollars and fled the country. The losses were so great the S&L went bankrupt, and the loss of the contract was so bad the software company also went belly-up. Did the accounting system have a good or bad security design? The problem wasn't unauthorized access to information, but rather authorization to the wrong person. The situation is analogous to the old saw that every program is correct according to some specification. Unfortunately, we don't have the space to go into the whole question of security policies here. We will just assume that terms like “authorized access” have some well-defined meaning in a particular context.

Threats

Any discussion of security must begin with a discussion of threats. After all, if you don't know what you're afraid of, how are you going to defend against it? Threats are generally divided into three main categories.

• Unauthorized disclosure. A “bad guy” gets to see information he has no right to see (according to some policy that defines “bad guy” and “right to see”).

• Unauthorized updates. The bad guy makes changes he has no right to change.

• Denial of service. The bad guy interferes with legitimate access by other users.

There is a wide spectrum of denial-of-service threats. At one end, it overlaps with the previous category. A bad guy deleting a good guy's file could be considered an unauthorized update. At the other end of the spectrum, blowing up a computer with a hand grenade is not usually considered an unauthorized update. As this second example illustrates, some denial-of-service threats can only be countered by physical security. No matter how well your OS is designed, it can't protect files from a hand grenade. Another form of denial-of-service threat comes from unauthorized consumption of resources, such as filling up the disk, tying up the CPU with an infinite loop, or crashing the system by triggering some bug in the OS. While there are software defenses against these threats, they are generally considered in the context of other parts of the OS rather than security and protection. In short, discussions of software mechanisms for computer security generally focus on the first two threats.

In response to these threats, countermeasures also fall into various categories. As programmers, we tend to think of technological tricks, but it is also important to realize that a complete security design must involve physical components (such as locking the computer in a secure building with armed guards outside) and human components (such as a background check to make sure your CFO isn't a crook, or checking to make sure those armed guards aren't taking bribes).

The Trojan Horse

Break-in techniques come in numerous forms. One general category of attack that comes in a great variety of disguises is the Trojan Horse scam. The name comes from Greek mythology. The ancient Greeks were attacking the city of Troy, which was surrounded by an impenetrable wall. Unable to get in, they left a huge wooden horse outside the gates as a “gift” and pretended to sail away. The Trojans brought the horse into the city, where they discovered that the horse was filled with Greek soldiers who defeated the Trojans to win the Rose Bowl (oops, wrong story). In software, a Trojan Horse is a program that does something useful--or at least appears to do something useful--but also subverts security somehow. In the personal computer world, Trojan horses are often computer games infected with “viruses.”

Here's the simplest Trojan Horse program I know of. Log onto a public terminal and start a program that does something like this:

print("login: ");
name = readALine();
turnOffEchoing();
print("password: ");
passwd = readALine();
sendMail("badguy", name, passwd);
print("login incorrect");
exit();

A user walking up to the terminal will think it is idle. He will attempt to log in, typing his login name and password. The Trojan Horse program sends this information to the bad guy, prints the message login incorrect and exits. After the program exits, the system will generate a legitimate login: message and the user, thinking he mistyped his password (a common occurrence because the password is not echoed), will try again, log in successfully, and have no suspicion that anything was wrong. Note that the Trojan Horse program doesn't actually have to do anything useful; it just has to appear to.

Design Principles

1. Public Design. A common mistake is to try to keep a system secure by keeping its algorithms secret. That's a bad idea for many reasons. First, it gives a kind of all-or-nothing security. As soon as anybody learns about the algorithm, security is all gone. In the words of Benjamin Franklin, “Three may keep a secret, if two of them are dead.” Second, it is usually not that hard to figure out the algorithm, by seeing how the system responds to various inputs, decompiling the code, etc. Third, publishing the algorithm can have beneficial effects. The bad guys probably have already figured out your algorithm and found its weak points. If you publish it, perhaps some good guys will notice bugs or loopholes and tell you about them so you can fix them.


2. Default = No Access. Start out by granting as little access as possible and adding privileges only as needed. If you forget to grant access where it is legitimately needed, you'll soon find out about it. Users seldom complain about having too much access.

3. Timely Checks. Checks tend to “wear out.” For example, the longer you use the same password, the higher the likelihood it will be stolen or deciphered. Be careful: This principle can be overdone. Systems that force users to change passwords frequently encourage them to use particularly bad ones. A system that forced users to supply a password every time they wanted to open a file would inspire all sorts of ingenious ways to avoid the protection mechanism altogether.

4. Minimum Privilege. This is an extension of point 2. A person (or program or process) should be given just enough powers to get the job done. In other contexts, this principle is called “need to know.” It implies that the protection mechanism has to support fine-grained control.

5. Simple, Uniform Mechanisms. Any piece of software should be as simple as possible (but no simpler!) to maximize the chances that it is correctly and efficiently implemented. This is particularly important for protection software, since bugs are likely to be usable as security loopholes. It is also important that the interface to the protection mechanisms be simple, easy to understand, and easy to use. It is remarkably hard to design good, foolproof security policies; policy designers need all the help they can get.

6. Appropriate Levels of Security. You don't store your best silverware in a box on the front lawn, but you also don't keep it in a vault at the bank. The US Strategic Air Defense calls for a different level of security than my records of the grades for this course. Not only do excessive security mechanisms add unnecessary cost and performance degradation, they can actually lead to a less secure system. If the protection mechanisms are too hard to use, users will go out of their way to avoid using them.

Authentication

Authentication is a process by which one party convinces another of its identity. A familiar instance is the login process, through which a human user convinces the computer system that he has the right to use a particular account. If the login is successful, the system creates a process and associates with it the internal identifier that identifies the account. Authentication occurs in other contexts, and it isn't always a human being that is being authenticated. Sometimes a process needs to authenticate itself to another process. In a networking environment, a computer may need to authenticate itself to another computer. In general, let's call the party that wants to be authenticated the client and the other party the server.

One common technique for authentication is the use of a password. This is the technique used most often for login. There is a value, called the password, that is known to both the server and to legitimate clients. The client tells the server who he claims to be and supplies the password as proof. The server compares the supplied password with what he knows to be the true password for that user.

Although this is a common technique, it is not a very good one. There are lots of things wrong with it.

Direct attacks on the password

The most obvious way of breaking in is a frontal assault on the password. Simply try all possible passwords until one works. The main defense against this attack is the time it takes to try lots of possibilities. If the client is a computer program (perhaps masquerading as a human being), it can try lots of combinations very quickly, but if the password is long enough, even the fastest computer cannot succeed in a reasonable amount of time. If the password is a string of 8 lower-case letters and digits, there are 2,821,109,907,456 possibilities. A program that tried one combination every millisecond would take 89 years to get through them all. If users are allowed to pick their own passwords, they are likely to choose “cute doggie names”, common words, names of family members, etc. That cuts down the search space considerably. A password cracker can go through dictionaries, lists of common names, etc. It can also use biographical information about the user to narrow the search space. There are several defenses against this sort of attack.

• The system chooses the password. The problem with this is that the password will not be easy to remember, so the user will be tempted to write it down or store it in a file, making it easy to steal. This is not a problem if the client is not a human being.

• The system rejects passwords that are too “easy to guess”. In effect, it runs a password cracker when the user tries to set his password and rejects the password if the cracker succeeds. This has many of the disadvantages of the previous point. Besides, it leads to a sort of arms race between crackers and checkers.

• The password check is artificially slowed down, so that it takes longer to go through lots of possibilities. One variant of this idea is to hang up a dial-in connection after three unsuccessful login attempts, forcing the bad guy to take the time to redial.

Eavesdropping

This is a far bigger problem for passwords than brute-force attacks. It comes in many disguises.

• Looking over someone's shoulder while he's typing his password. Most systems turn off echoing, or echo each character as an asterisk to mitigate this problem.

• Reading the password file. In order to verify that the password is correct, the server has to have it stored somewhere. If the bad guy can somehow get access to this file, he can pose as anybody. While this isn't a threat on its own (after all, why should the bad guy have access to the password file in the first place?), it can magnify the effects of an existing security lapse.

• Unix introduced a clever fix to this problem that has since been almost universally copied. Use some hash function f and, instead of storing password, store f(password). The hash function should have two properties: like any hash function, it should generate all possible result values with roughly equal probability, and in addition, it should be very hard to invert--that is, given f(password), it should be hard to recover password. It is quite easy to devise functions with these properties. When a client sends his password, the server applies f to it and compares the result with the value stored in the password file. Since only f(password) is stored in the password file, nobody can find out the password for a given user, even with full access to the password file, and logging in requires knowing password, not f(password). In fact, this technique is secure enough that it was long customary to make the password file publicly readable! (A sketch of the scheme follows this list.)

• Wire tapping. If the bad guy can somehow intercept the information sent from the client to the server, password-based authentication breaks down altogether. It is increasingly the case that authentication occurs over an insecure channel such as a dial-up line or a local-area network. Note that the Unix scheme of storing f(password) is of no help here, since the password is sent in its original form (“plaintext” in the jargon of encryption) from the client to the server. We will consider this problem in more detail below.
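Returning to the store-f(password) idea, here is a minimal sketch in Java. The choice of SHA-256 as f is an assumption made for illustration; classic Unix used crypt(3), and modern systems add per-user salts and deliberately slow hash functions.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class PasswordFile {
    private final Map<String, String> hashedPasswords = new HashMap<>();

    // The one-way function f: easy to compute, hard to invert.
    private static String f(String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(password.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);  // SHA-256 is always available in the JDK
        }
    }

    public void setPassword(String user, String password) {
        hashedPasswords.put(user, f(password));   // only f(password) is ever stored
    }

    public boolean check(String user, String password) {
        return f(password).equals(hashedPasswords.get(user));
    }
}

Even with full read access to hashedPasswords, an attacker learns only f(password), which does not let him log in.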


Spoofing

This is the worst threat of all. How does the client know that the server is who it appears to be? If the bad guy can pose as the server, he can trick the client into divulging his password. We saw a form of this attack above. It would seem that the server needs to authenticate itself to the client before the client can authenticate itself to the server. Clearly, there's a chicken-and-egg problem here. Fortunately, there's a very clever and general solution to this problem.

Challenge-response

There is a wide variety of authentication protocols, but they are all based on a simple idea. As before, we assume that there is a password known to both the (true) client and the (true) server. Authentication is a four-step process.

• The client sends a message to the server saying who he claims to be and requesting authentication.

• The server sends a challenge to the client consisting of some random value x.

• The client computes g(password,x) and sends it back as the response. Here g is a hash function similar to the function f above, except that it has two arguments. It should have the property that it is essentially impossible to figure out password even if you know both x and g(password,x).

• The server also computes g(password,x) and compares it with the response it got from the client.

Clearly this algorithm works if both the client and server are legitimate. An eavesdropper could learn the user's name, x, and g(password,x), but that wouldn't help him pose as the user. If he tried to authenticate himself to the server, he would get a different challenge x', and would have no way to respond. Even a bogus server is no threat: the challenge provides him with no useful information. Similarly, a bogus client does no harm to a legitimate server except for tying him up in a useless exchange (a denial-of-service problem!).
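The four steps can be sketched concretely. Instantiating the two-argument hash g as HMAC-SHA256 is an assumption; the protocol only requires that g be infeasible to invert. Both roles are shown in one program for brevity.

import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class ChallengeResponse {
    // g(password, x): hard to recover password from x and g(password, x).
    static byte[] g(String password, byte[] x) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(password.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        return mac.doFinal(x);
    }

    public static void main(String[] args) throws Exception {
        String password = "shared-secret";   // known to true client and true server

        // Step 2: the server sends a random challenge x.
        byte[] x = new byte[16];
        new SecureRandom().nextBytes(x);

        // Step 3: the client computes the response from the password and x.
        byte[] response = g(password, x);

        // Step 4: the server computes the same value and compares.
        boolean authenticated = Arrays.equals(response, g(password, x));
        System.out.println("authenticated = " + authenticated);
    }
}

An eavesdropper who records x and the response gains nothing, since the next login will use a fresh x.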

Protection Mechanisms

First, some terminology

Objects: The things to which we wish to control access. They include physical (hardware) objects as well as software objects such as files, databases, semaphores, or processes. As in object-oriented programming, each object has a type and supports certain operations as defined by its type. In simple protection systems, the set of operations is quite limited: read, write, and perhaps execute, append, and a few others. Fancier protection systems support a wider variety of types and operations, perhaps allowing new types and operations to be dynamically defined.

Principals: Intuitively, “users”--the ones who do things to objects. Principals might be individual persons, groups or projects, or roles, such as “administrator.” Often each process is associated with a particular principal, the owner of the process.

Rights: Permissions to invoke operations. Each right is the permission for a particular principal to perform a particular operation on a particular object. For example, principal solomon might have read rights for a particular file object.


Domains: Sets of rights. Domains may overlap. Domains are a form of indirection, making it easier to make wholesale changes to the access environment of a process. There may be three levels of indirection: A principal owns a particular process, which is in a particular domain, which contains a set of rights, such as the right to modify a particular file.

Conceptually, the protection state of a system is defined by an access matrix. The rows correspond to principals (or domains), the columns correspond to objects, and each cell is a set of rights. For example, if

access[solomon]["/tmp/foo"] = { read, write }

then I have read and write access to file "/tmp/foo". I say “conceptually” because the access matrix is never actually stored anywhere. It is very large and has a great deal of redundancy (for example, my rights to a vast number of objects are exactly the same: none!), so there are much more compact ways to represent it. The access information is represented in one of two ways: by columns, which are called access control lists (ACLs), and by rows, called capability lists.
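As a concrete illustration of why the matrix is only conceptual, here is a sketch that stores just the non-empty cells. Slicing this structure by column gives ACLs; by row, capability lists. The names and rights are illustrative, not any real system's.

import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class AccessMatrix {
    enum Right { READ, WRITE, EXECUTE }

    // principal -> (object -> rights); absent cells mean "no rights",
    // so the vast empty part of the matrix costs nothing to store.
    private final Map<String, Map<String, Set<Right>>> cells = new HashMap<>();

    void grant(String principal, String object, Right r) {
        cells.computeIfAbsent(principal, p -> new HashMap<>())
             .computeIfAbsent(object, o -> EnumSet.noneOf(Right.class))
             .add(r);
    }

    boolean allowed(String principal, String object, Right r) {
        return cells.getOrDefault(principal, Map.of())
                    .getOrDefault(object, Set.of())
                    .contains(r);
    }

    public static void main(String[] args) {
        AccessMatrix access = new AccessMatrix();
        access.grant("solomon", "/tmp/foo", Right.READ);
        access.grant("solomon", "/tmp/foo", Right.WRITE);
        System.out.println(access.allowed("solomon", "/tmp/foo", Right.READ));   // true
        System.out.println(access.allowed("intruder", "/tmp/foo", Right.READ));  // false
    }
}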

Access Control Lists

An ACL (pronounced “ackle”) is a list of rights associated with an object. A good example of the use of ACLs is the Andrew File System (AFS) originally created at Carnegie-Mellon University and now marketed by Transarc Corporation as an add-on to Unix. This file system is widely used in the Computer Sciences Department. Your home directory is in AFS. AFS associates an ACL with each directory, but the ACL also defines the rights for all the files in the directory (in effect, they all share the same ACL). You can list the ACL of a directory with the fs listacl command:

% fs listacl /u/c/s/cs537-1/public
Access list for /u/c/s/cs537-1/public is
Normal rights:
  system:administrators rlidwka
  system:anyuser rl
  solomon rlidwka

The entry system:anyuser rl means that the principal system:anyuser (which represents the role “anybody at all”) has rights r (read files in the directory) and l (list the files in the directory and read their attributes). The entry solomon rlidwka means that I have all seven rights supported by AFS. In addition to r and l, they include the rights to insert new files in the directory (i.e., create files), delete files, write files, lock files, and administer the ACL list itself. This last right is very powerful: It allows me to add, delete, or modify ACL entries. I thus have the power to grant or deny any rights to this directory to anybody. The remaining entry in the list shows that the principal system:administrators has the same rights I do (namely, all rights). This principal is the name of a group of other principals. The command pts membership system:administrators lists the members of the group.

Ordinary Unix also uses an ACL scheme to control access to files, but in a much stripped-down form. Each process is associated with a user identifier (uid) and a group identifier (gid), each of which is a 16-bit unsigned integer. The inode of each file also contains a uid and a gid, as well as a nine-bit protection mask, called the mode of the file. The mask is composed of three groups of three bits. The first group indicates the rights of the owner: one bit each for read access, write access, and execute access (the right to run the file as a program). The second group similarly lists the rights of the file's group, and the remaining three bits indicate the rights of everybody else. For example, the mode 111 101 101 (0755 in octal) means that the owner can read, write, and execute the file, while members of the owning group and others can read and execute, but not write, the file. Programs that print the mode usually use the characters rwx- rather than 0 and 1. Each zero in the binary value is represented by a dash, and each 1 is represented by r, w, or x, depending on its position. For example, the mode 111101101 is printed as rwxr-xr-x.
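A short sketch of that printing rule: walk the nine bits from the owner's read bit down, emitting r, w, or x for a set bit and a dash for a clear one. The render routine here is illustrative, not any particular system's code.

public class ModeString {
    static String render(int mode) {
        char[] symbols = { 'r', 'w', 'x' };
        StringBuilder s = new StringBuilder();
        for (int bit = 8; bit >= 0; bit--) {
            boolean set = (mode & (1 << bit)) != 0;
            // Position within each group of three picks r, w, or x.
            s.append(set ? symbols[(8 - bit) % 3] : '-');
        }
        return s.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(0755));  // rwxr-xr-x
        System.out.println(render(0046));  // ---r--rw-
    }
}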

In somewhat more detail, the access-checking algorithm is as follows: The first three bits are checked to determine whether an operation is allowed if the uid of the file matches the uid of the process trying to access it. Otherwise, if the gid of the file matches the gid of the process, the second three bits are checked. If neither of the id's match, the last three bits are used. The code might look something like this.

boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;    // owner bits
    else if (p.gid == i.gid)
        mode = i.mode >> 3;    // group bits
    else
        mode = i.mode;         // "other" bits
    switch (operation) {
        case READ:    mode &= 4; break;
        case WRITE:   mode &= 2; break;
        case EXECUTE: mode &= 1; break;
    }
    return (mode != 0);
}

(The expression i.mode >> 3 denotes the value i.mode shifted right by three bit positions, and the operation mode &= 4 clears all but the third bit from the right of mode.) Note that this scheme can actually give a random user more powers over the file than its owner. For example, the mode ---r--rw- (000 100 110 in binary) means that the owner cannot access the file at all, while members of the group can only read the file, and others can both read and write. On the other hand, the owner of the file (and only the owner) can execute the chmod system call, which changes the mode bits to any desired value. When a new file is created, it gets the uid and gid of the process that created it, and a mode supplied as an argument to the creat system call.

Most modern versions of Unix actually implement a slightly more flexible scheme for groups. A process has a set of gid's, and the check to see whether the file is in the process' group checks to see whether any of the process' gid's match the file's gid.

boolean accessOK(Process p, Inode i, int operation) {
    int mode;
    if (p.uid == i.uid)
        mode = i.mode >> 6;              // owner bits
    else if (p.gidSet.contains(i.gid))
        mode = i.mode >> 3;              // group bits: any of the process's gids may match
    else
        mode = i.mode;                   // "other" bits
    switch (operation) {
        case READ:    mode &= 4; break;
        case WRITE:   mode &= 2; break;
        case EXECUTE: mode &= 1; break;
    }
    return (mode != 0);
}

When a new file is created, it gets the uid of the process that created it and the gid of the containing directory. There are system calls to change the uid or gid of a file. For obvious security reasons, these operations are highly restricted. Some versions of Unix only allow the owner of the file to change its gid, only allow him to change it to one of his gid's, and don't allow him to change the uid at all.

For directories, “execute” permission is interpreted as the right to get the attributes of files in the directory. Write permission is required to create or delete files in the directory. This rule leads to the surprising result that you might not have permission to modify a file, yet be able to delete it and replace it with another file of the same name but with different contents!

Unix has another very clever feature--so clever that it is patented! The file mode actually has a few more bits that I have not mentioned. One of them is the so-called setuid bit. If a process executes a program stored in a file with the setuid bit set, the uid of the process is set equal to the uid of the file. This rather curious rule turns out to be a very powerful feature, allowing the simple rwx permissions directly supported by Unix to be used to define arbitrarily complicated protection policies.

As an example, suppose you wanted to implement a mail system that works by putting all mail messages into one big file, say /usr/spool/mbox. I should be able to read only those messages that mention me in the To: or Cc: fields of the header. Here's how to use the setuid feature to implement this policy. Define a new uid mail, make it the owner of /usr/spool/mbox, and set the mode of the file to rw------- (i.e., the owner mail can read and write the file, but nobody else has any access to it). Write a program for reading mail, say /usr/bin/readmail. This file is also owned by mail and has mode srwxr-xr-x. The ‘s’ means that the setuid bit is set. My process can execute this program (because the “execute by anybody” bit is on), and when it does, it suddenly changes its uid to mail so that it has complete access to /usr/spool/mbox. At first glance, it would seem that letting my process pretend to be owned by another user would be a big security hole, but it isn't, because processes don't have free will. They can only do what the program tells them to do. While my process is running readmail, it is following instructions written by the designer of the mail system, so it is safe to let it have access appropriate to the mail system.

There's one more feature that helps readmail do its job. A process really has two uid's, called the effective uid and the real uid. When a process executes a setuid program, its effective uid changes to the uid of the program, but its real uid remains unchanged. It is the effective uid that is used to determine what rights it has to what files, but there is a system call to find out the real uid of the current process. Readmail can use this system call to find out what user called it, and then only show the appropriate messages.
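A toy sketch of the pattern. Java cannot actually run setuid or call getuid(), so the uid and mailbox contents are stubbed; the point is only the division of labor: the effective uid ("mail") is what lets the process open the protected mailbox at all, while the real uid selects which messages to show.

import java.util.List;

public class ReadMail {
    record Message(int toUid, String text) {}

    // Stub standing in for the real-uid system call (getuid in Unix).
    static int getRealUid() { return 1001; }

    public static void main(String[] args) {
        // In a real setuid program, opening this list would require the
        // effective uid "mail"; here it is simply hard-coded.
        List<Message> mbox = List.of(
            new Message(1001, "hi solomon"),
            new Message(1002, "hi faber"));

        int caller = getRealUid();   // who actually ran the program
        for (Message m : mbox) {
            if (m.toUid() == caller) {
                System.out.println(m.text());   // show only the caller's mail
            }
        }
    }
}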

Capabilities

An alternative to ACLs is capabilities. A capability is a “protected pointer” to an object. It designates an object and also contains a set of permitted operations on the object. For example, one capability may permit reading from a particular file, while another allows both reading and writing. To perform an operation on an object, a process makes a system call, presenting a capability that points to the object and permits the desired operation. For capabilities to work as a protection mechanism, the system has to ensure that processes cannot mess with their contents. There are three distinct ways to ensure the integrity of a capability.

Tagged architecture

Some computers associate a tag bit with each word of memory, marking the word as a capability word or a data word. The hardware checks that capability words are only assigned from other capability words. To create or modify a capability, a process has to make a kernel call.

Separate capability segments

If the hardware does not support tagging individual words, the OS can protect capabilities by putting them in a separate segment and using the protection features that control access to segments.

Encryption

Each capability can be extended with a cryptographic checksum that is computed from the rest of the content of the capability and a secret key. If a process modifies a capability, it cannot modify the checksum to match without access to the key. Only the kernel knows the key. Each time a process presents a capability to the kernel to invoke an operation, the kernel checks the checksum to make sure the capability hasn't been tampered with.
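A sketch of the checksum idea, using HMAC-SHA256 as an assumed concrete checksum function. The kernel seals the capability's fields with a key only it knows, and validates the seal whenever the capability is presented.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SealedCapability {
    // In a real system this key would live only inside the kernel.
    private static final byte[] KERNEL_KEY =
        "kernel-only-secret".getBytes(StandardCharsets.UTF_8);

    static byte[] seal(String object, String rights) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(KERNEL_KEY, "HmacSHA256"));
        return mac.doFinal((object + "|" + rights).getBytes(StandardCharsets.UTF_8));
    }

    // Called when a process presents (object, rights, checksum) to the kernel.
    static boolean valid(String object, String rights, byte[] checksum) throws Exception {
        return MessageDigest.isEqual(seal(object, rights), checksum);
    }

    public static void main(String[] args) throws Exception {
        byte[] cap = seal("/tmp/foo", "r");
        System.out.println(valid("/tmp/foo", "r", cap));   // true
        System.out.println(valid("/tmp/foo", "rw", cap));  // false: rights were amplified
    }
}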

Capabilities, like segments, are a “good idea” that somehow seldom seems to be implemented in real systems in full generality. Like segments, capabilities show up in an abbreviated form in many systems. For example, the file descriptor for an open file in Unix is a kind of capability. When a process tries to open a file for writing, the system checks the file's ACL to see whether the access is permitted. If it is, the process gets a file descriptor for the open file, which is a sort of capability to the file that permits write operations. Unix uses the separate segment approach to protect the capability. The capability itself is stored in a table in the kernel and the process has only an indirect reference to it (the index of the slot in the table). File descriptors are not full-fledged capabilities, however. For example, they cannot be stored in files, because they go away when the process terminates.

Directories

Directories are collections of files. Their primary function is to impose a naming system on files and organize them relative to each other. Directories provide a mechanism for collecting the names of files together so that a user can find the file(s) they're interested in. They are an important piece of metadata.


Flat Directories

The simplest kind of directory is a flat list of filenames and the information needed to find the respective file contents. These are rarely used today, although they still do exist. The Palm Pilot has a flat file name space. IBM MVS, which will never completely die, has a flat file name space.

Flat file systems are not often used because they provide no inherent organization to the files and are difficult to make efficient for large collections of files.

Name space collisions have to be avoided in a flat space, so generally some naming convention is followed to prevent them. A common one is to prepend filenames with the user’s name. For example, a collection of my files might have names like:

DISK1.FABER.HOME.PROFILE

DISK1.FABER.PROJECT1.GRADES

Problems with this include enforcing the conventions (what if other users choose $ to separate parts of the file name?) and the inefficiency of holding all the system’s files in one big list. Consider scanning the system directory to print all the files that I own. Either the whole table will have to be searched, or it will have to be kept sorted. Keeping a large list sorted adds overhead, and scanning a large table linearly is not efficient.

Also, updating such a centralized structure will require synchronization - all file creations and deletions will have to enforce mutual exclusion on the table, or we have to introduce a very fine-grained locking mechanism.

Basically, a single directory doesn’t scale easily to many users, both in terms of technical operation and user behavior.

Hierarchical Directories

A natural organization of files is into a hierarchy. That is, files can be seen as being in related classes, and each class has a directory. For example:

• All Files

• User files

• Instructor files

• Faber’s files

This is implemented as a tree of directories where each entry in the directory describes either a file or another directory. If the hierarchy is chosen carefully, the result is many small directories. Scanning each one is a reasonable amount of work, and synchronization is maintained for each directory. Because separate tasks can be confined to separate directories, contention can be made rare.


Besides the efficiency issues, hierarchies are a natural way to organize many systems of data. The desktop metaphor of files in folders and cabinets underscores this (although the hierarchy afforded by hierarchical file systems is richer because there can be nearly arbitrary levels of nesting).

Given that we want to impose such a structure on our file names, we have to describe a syntax to find a file in the tree. This is done by giving a path through the directory tree. The strings that represent such paths are called pathnames. A character is chosen as the path separator. A path is then the list of directories traversed in order to reach the file. Using / as a separator, the file fn in the directory ast, itself in the directory etc, has the pathname /etc/ast/fn. It's inconvenient to name files with their entire pathname all the time; after all, we put related files in the same directory so they'd be close together, and specifying the long pathname hides that. One solution is to add the concept of a current directory to the system and allow paths to be specified relative to it. Paths that begin with the path separator are absolute pathnames; paths that do not are relative and have the current directory prepended.

To facilitate relative naming, many filesystems have special names that refer to the current directory and the parent directory. These are often . and .., respectively. An example relative pathname is ./test/halt, which means the file halt in the directory test, which is a subdirectory of the current directory. (Incidentally, the . directory provides the answer to a UNIX® puzzle: how do you delete a file named -f? Plain rm -f fails because the filename is taken as a switch. The solution is to use a longer relative path: rm ./-f.)
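The resolution rule just described is easy to sketch: an absolute path starts at the root, a relative path has the current directory prepended, and . and .. are interpreted along the way. This is a simplified model; it ignores symbolic links and permission checks.

import java.util.ArrayDeque;
import java.util.Deque;

public class PathResolver {
    static String resolve(String currentDir, String path) {
        Deque<String> parts = new ArrayDeque<>();
        // Relative paths are interpreted from the current directory.
        String start = path.startsWith("/") ? path : currentDir + "/" + path;
        for (String component : start.split("/")) {
            if (component.isEmpty() || component.equals(".")) continue; // stay put
            if (component.equals("..")) { parts.pollLast(); continue; } // go up
            parts.addLast(component);                                   // go down
        }
        return "/" + String.join("/", parts);
    }

    public static void main(String[] args) {
        System.out.println(resolve("/etc/ast", "./test/halt"));  // /etc/ast/test/halt
        System.out.println(resolve("/etc/ast", "../jim/f2"));    // /etc/jim/f2
        System.out.println(resolve("/etc/ast", "/tmp/foo"));     // /tmp/foo
    }
}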

Links

Directories impose a naming structure on files, and in some systems offer the opportunity to give a file multiple names. If the information about a file (its attributes and OS data) is not stored directly in a directory entry, a file may be pointed to by entries in several directories. For example, the same file might be named both /etc/ast/fn and /etc/jim/f2. Such multiple naming is called linking.

There are 2 major forms of links: hard and soft. A hard link is a direct link from the directory entry to the internal file data. All hard links are equivalent, and in file systems that support them, a file cannot be deleted without deleting all its hard links. In general, hard links are restricted to parts of the file system that share internal information. To preserve the tree structure of hierarchical file systems, they generally can't link to directories.

Soft links are a path translation (often also called a symbolic link). They are a pathname that points to the file (or directory) on which to operate. These paths can be absolute or relative. Because they are a visible pathname translation, they are often allowed to point to directories, since programs that rely on the hierarchical nature of the file system (like system utilities) can detect and ignore them. Because they are a translation, soft links can access any parts of the file system they can address, but because they are not linked closely with the internal structure of the file system, they may not be updated when a file is deleted or moved. It's possible that a symbolic link can point to a file that no longer exists. This is called a dangling pointer problem by analogy with the same problem involving freed memory in a program.

Directory Operations

Like files, directories have well-defined operations (a sketch of such an interface follows the list):

Create: Allocate space for a new directory and create the special entries (such as . and ..) in it.

Delete: Remove a directory. Most OSes require the directory to be empty.


Open: Analogous to file open.

Read: Get the information about one or more files.

Write: Change the information about one or more files.

Metadata manipulation: Change the permissions or some other field associated with this directory.

Link: Add a link to an existing file.

Unlink: Remove a link to an existing file (if this is the last link and the file is closed, this usually implies removal of the file).

Rename: Really covered by write, as is file renaming.
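Here is the operation list above expressed as a sketch of a directory interface. The types (FileAttributes, Permissions) are placeholders, not any real system's API.

import java.util.List;

interface Directory {
    Directory create(String name);                    // allocate and add special entries
    void delete(String name);                         // usually must be empty
    Directory open(String name);                      // analogous to file open
    List<FileAttributes> read();                      // info about one or more files
    void write(String name, FileAttributes updated);  // change info about an entry
    void setPermissions(Permissions p);               // metadata manipulation
    void link(String name, FileAttributes existing);  // add a name for an existing file
    void unlink(String name);                         // remove a name; maybe the file too
}

class FileAttributes {}
class Permissions {}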

Other Directory Systems

Although hierarchical systems are by far the most common, there are some other interesting ways to think about file naming:

• Attribute-based naming. Not all data can be organized nicely into a hierarchy. For example, is the right hierarchy /usr/faber/pubs/papers/RST or /usr/faber/papers/published/RST? Attribute-based naming says that filenames should be the set of attributes that a file satisfies. For example, the filename above might be (user=faber, type=published-paper, topic=RST packets).

• Temporal filesystems. The folks at Lucent have added a time axis to their Plan 9 filesystem. Rather than overwriting files for each update, they save all the versions (well, actually they save the state once a day) and in addition to giving a pathname, you can also give a temporal coordinate. You can change directory into the source code from last week. It’s a strange but powerful idea.

• User specific. The folks from Lucent also allow dynamic binding of their file systems. A user specifies a set of directories to bind to a name, and thereafter the user has their own version of that directory. The idea of an execution path is turned into a customized /bin directory.

Naming Systems

The mapping from pathname to filename is just one example of a naming system. A naming system maps a string to a resource (or maybe just to another string). Being able to name a resource is the first step in being able to manipulate it. Some other interesting naming systems are:

• The Domain Name System: hierarchical distributed naming of computers on the Internet. Each “directory,” or domain, is resolved by a different machine in the Internet. The names are parsed backward: in aludra.usc.edu, edu is resolved before usc, and usc before aludra. (The path separator is “.”.)

• X.500 is an alternative attribute-based naming system for hosts and users. The names look like PN=faber, CO=US, IN=USC, DEPT=CS, INST=ISI.

• Printer names. Strings bind to printers in a flat namespace.


• URLs: a combination of DNS and hierarchical filenames. The name compactly describes the communication protocol to use, the name of the machine to contact, and the file to ask about (or other service-dependent identifier).

Most successful systems have a significant naming component. The decision of what elements in a system to name, and how to name them is significant.

Security and the File System

File System Security

The problem addressed by the security system is how information and resources are protected from people. Issues include the contents of data files, which is a privacy issue, and the use of resources, which is an accounting issue. Security must pervade the system, or the system is insecure, but the file system is a particularly good place to discuss security because its protection mechanisms are visible, and the things it protects are very concrete (for a computer system).

We're talking about some interesting stuff when we talk about security. For certain people who like puzzles, finding loopholes in security systems and understanding them to the point of breaking them is a challenge. I understand the lure of this. Remember, however, that everyone using these machines is a student like yourself who deserves the same respect that you do. Breaking into another person's files is like breaking into their home, and should not be taken lightly either by those breaking in, or those who catch them. Uninvited intrusions should be dealt with harshly (for example, it's a felony to break into a machine that stores medical records). If you really want to play around with UNIX® security, get yourself a Linux box and play to your heart's content; don't break into someone's account here and start deleting files.

Policies and Mechanisms

Policies are real world statements about the protection that the system provides. These are all statements of (significantly different) policies:

• Users should not be able to read each other's mail.

• No student should be able to see answer keys before they are made public.

• All users should have access to all data.

The various systems in a computer system that control access to resources are the mechanisms that are used to implement a policy. A good security system is one with clearly stated policy objectives that have been effectively translated into mechanisms.

The fact that data security does not stop with computer security cannot be overstated. If your computer is perfectly secure, and an employee photocopies printouts of your new chip design, don't blame the computer security system.

Design Principles

Although every security system is different, some overriding principles make sense. Here is a list generated by Saltzer and Schroeder from their experience on MULTICS that remain valid today (these are fun to apply to caper movies - next time you watch Mission Impossible or Sneakers or War Games, try to spot the security flaws that let the intruders work their magic):

Public design. Surprisingly, public designs tend to be more secure than private ones. The reason is that the security community as a whole reviews them and reports flaws that can be fixed.


Even if you take pains to keep the source code of your system secret, you should assume that attackers have access to your code. The bad guys will share knowledge; the good guys should, too.

Default access is no access. This holds for subsystems just like login screens. It sounds like a platitude, but it is a principle worth following at all levels. People who need a certain access will let you know about it quickly.

Test for current authority. Just because the user had the right to perform an operation a millisecond ago doesn't mean they can do it now. Test the authority every time, so that revocation of that authority is meaningful.

Give each entity the least privilege required for it to do its job. This may mean creating a bunch of fine-grained privilege levels. The more privilege an entity possesses, the more costly a mistake or misuse of that entity is. Printer daemons that run as root can be exploited to yield logins that run as root.

Build in security from the start. Adding security later almost never works. There are too many holes to plug, and as a practical matter security is nearly impossible to add to a fundamentally insecure system.

In order to make such a design integrable, it must be simple and capable of being applied uniformly.

The system must be acceptable to the users. All security systems are a compromise between security and usability. The more features a system has, the more opportunities there are for exploitation. Furthermore, if a security feature is too onerous to the users, they will just invent ways to circumvent it. These circumventions are then available to the attackers.

An unacceptable security system is automatically attacked from within.

A Sampling of Protection Mechanisms

The idea of protection domains originated with Multics and is a key one for understanding computer security. Imagine a matrix of all protection domains on one axis and all system resources (files) on another.

The contents of each cell in the matrix is the set of operations permitted by a process (or thread) in that domain on that resource.

Domain   File1   File2   Domain1   Domain2
  1       RW      RWX      -        Enter
  2       R       -        -        -

Notice that once domains are defined, the ability to change domains becomes another part of the domain system. Processes in given domains are allowed to enter other domains. A process’s initial domain is a function of the user who starts the process and the process itself.

While the pure domain model makes protection easy to understand, it is almost never implemented.

Holding the domains as a matrix doesn’t scale.

Domains and Rings


UNIX divides processes into 2 parts, a user part and a kernel part. When running as a user, the process has limited abilities; to access hardware, it has to trap into the kernel. The kernel can access all of the OS and hardware, and decides what it will do on a user's behalf based on credentials stored in the PCB.

This is a simplification of the MULTICS system of protection rings. Rather than 2 levels, MULTICS had a 64-ring system where each ring was more privileged than the ones outside it, and checked similar credentials before using its increased powers.

Access Control Lists

Another representation of the domain concept is Access Control Lists (ACLs). These are lists attached to each resource (file) that describe the valid operations on them. Generally the ACL languages are rich enough to describe users and groups of users economically. This economy comes from wildcarding and exclusion operators. Wildcarding provides a way to describe all users meeting a given criterion; exclusion operators allow exclusion of a set of users. Conceptually, though, each file contains a list of the users that can operate on the file.

The UNIX file protection system is similar, but simplified. There are 9 (really 12) bits associated with each file that determine the read, write and execution permissions of the owner, members of the owning group and everyone else. 2 of the other 3 bits allow limited domain switching. They are the setuid and setgid bits, which allow processes running the program to change their user or group id to that of the owner of the file.

When the owner of a file is root, this can confer considerable new power on the process. ACLs are useful and support revocation of rights. That is, when a user is reading the file and the owner wants to stop that, the owner can remove that right. Because the system checks current authority (see above), the read will be stopped.

Capabilities

Another way to encode domain rights is to encode a process's rights in its pointer to the object. For example, file rights would be in the file descriptor; memory rights in a memory pointer. Such pointers with protection information encoded in them are called capabilities.

Capabilities are kept in special lists (called C-lists) that must be protected from direct manipulation by processes. One way is in hardware - the memory actually contains bits, untouchable by the CPU in user mode, that mark a memory location as holding a capability. Another is to make the C-list part of the PCB and only manipulable by the OS. A third is to have the OS encrypt the capabilities with a key unknown to the user.

Capabilities have operations defined on them, like copying, making copies with reduced or amplified rights. When the process presents the capability to the OS, the OS need not verify anything about the user, only whether the capability is valid.

That property makes it hard to revoke a capability, although there are a couple ways: embedded validity checks and indirect access.

Authentication and Security

Central to the idea of protection systems is the idea of an authentication system. An authentication system proves the identities of elements with which a computer system interacts. This can include users and other systems.


In distributed systems, authentication should be 2-way: The user should authenticate to the machine, and the machine to the user.

Generally authentication is accomplished by means of the exchange of a shared secret. The most common shared secret is a password.

Passwords

A password is a string of characters that the user and computer system agree will establish the user's identity to the system. The analogy is to physical passwords, where people who wanted access to a military facility had to recite an unusual phrase to establish their identity to those inside the fort.

Computer passwords are often the weakest part of a computer security system, especially if the passwords can be guessed off-line - that is without alerting the system under attack that it is under attack. Passwords can be stolen (physically or electronically) or guessed.

There are several good rules for choosing a computer password:

• Choose a long one. Most systems allow eight or ten letters - use 'em all. There are only 140,608 3-letter (capital and lower case) passwords; there are more than 50 trillion 8-letter combinations. Guessing 1 in 50 trillion is literally hundreds of millions of times harder than guessing 1 in 140,608.

• Don’t use a common phrase or name. A seminal work in computer security ran a cracking program on a couple hundred donated password files that tested common English words and the top 100 (or so) female names and had an ungodly (better than 50%) hit rate. Hopefully education has gotten better. Note that “common phrase” means anything available in the system dictionary, at least. In my opinion you’re better off not using any English, and non-English words fare little better. No science fiction or fantasy words, either.

• Include some non-letters, e.g., *&$ˆ@.

• Don’t write it down. You’ve changed a difficult puzzle into a physical search.

• Don't get too attached to it; you should change it relatively frequently - every six months or so is a good idea.

Password Storage

It's possible to store passwords in the open without immediately giving away their contents. The system uses a 1-way function (also called a hash function). A 1-way function is an interesting function that is relatively easy to compute, but difficult to invert (essentially the only way to invert it is to compute all the forward transforms looking for one that matches the reverse).

Systems like UNIX® don't store the password, but the result of a 1-way function on the password. To check a user's password, the system takes the password as input, computes the 1-way function on it, and compares it with the result in the password file. If they match, the password was (with high probability) correct. Note that even knowing the algorithm and the hashed password, it's still infeasible to invert the function.


Although it's theoretically reasonable to leave a hashed password file in the open, it is rarely done anymore. There are a couple of reasons:


• In practice, bad passwords are too common, so rather than having to try all possible passwords (or half of them on average), trying a large dictionary of common passwords is often enough to break into some account on the system.

• A password file in the open can be attacked off-line, with the system under attack completely unaware that it is under attack. By forcing the attacker to actually try passwords on the system that they're invading, the system can detect the attack.

Other Shared Secrets

Some other forms of shared secrets include:

• Shared Real Secrets - the user gives the system some information that “only the user knows” and the system quizzes the user on it instead of a password. Good, in that the user rarely has to write such information down. Bad in that there isn’t much information that can’t be found by a determined investigator.

• Code books - a frequent system is to ask the user for a word from a code book. This was in vogue for a while with anti-piracy systems; to gain access to a program, the program would ask the user for the nth word on the mth page of the manual. In practice it meant that the pirate photocopied the manual.

• One time passwords - the computer generates a table of passwords for the user, each of which is to be used once. When the user tries to log in the computer asks him/her for the next password in the sequence. The advantage is that if an attacker manages to steal the password, it cannot be reused. The disadvantage is that an attacker can steal the list (and a user is unlikely to memorize a set of single use passwords).

• Challenge/Response - The system and user agree on some (one-way) function or transformation. At login time, the computer presents the user with a value (called the challenge) and the user responds with the transform of the value. For example if the function were the square root, a challenge of 9 would be correctly answered with a response of 3. In practice, the functions are more complex and usually encoded in hardware. The hardware is often password protected so that theft of the hardware only means that the user cannot log in, not that the intruder can.

Physical ID

Another shared secret can be physical attributes of the human who wants to access the system. Several body measurements identify a user with significant precision: finger lengths, retinal patterns, fingerprints, etc.

Controlling access based on physical features has problems if the features are damaged (a cut fingertip can confuse a fingerprint scanner). It also raises the grisly possibility of theft of those features.

One way to beat a thumbprint scanner is to physically acquire someone’s thumb.


A Sampling of Attacks

Some common attacks on computer systems:

Trojan Horse - This is an apparently benign program that steals information as part of its function. An example is a script that mimics the login prompt, takes a user's password, saves it for the owner of the script, and logs the user in. The legitimate user has access to the account, but so does the owner of the script.

Password Guessing - we talked about this with passwords. I pause to mention the infamous TENEX security hole that Tanenbaum discusses. TENEX allowed user functions to be called on each page fault. Some clever user realized that this allowed passwords to be guessed one letter at a time. The password being guessed was laid out with its first letter at the end of one page and the rest on the next page, which was forced out of memory. Passwords were checked letter by letter, sequentially. If the first letter was correct, there would be a page fault when the system faulted the second letter into memory to check it. By repeating the process, the whole password could be guessed sequentially. This is an interesting example of how multiple OS features combine to affect security.

Social Engineering - This is by far the most difficult to control. An attacker simply lies to a human being and gets the information that they want. The only real cure for this is to educate anyone who has security information (that is everyone) about security.

Buffer overruns - forcing a program to overrun a buffer on the stack, inserting code that the attacker wants run.

Backdoors - sometimes developers leave privileged debugging hooks in place in production systems. One of the well-known offenders here is sendmail. Other production systems used to ship with well-known user names and well-known passwords for remote maintenance.

Viruses and Worms

Viruses are programs contained in other programs, often for malicious purposes. (They needn't be, though - one can imagine benign programs propagated the same way - virus checkers, for example.) Worms are self-replicating independent programs. The distinction is in the method of transmission: a virus needs a host program to be run to propagate it; a worm has no such host, it propagates itself. Both have made national news in their malevolent forms, but both could be used for benign purposes.

Covert Channels

A covert channel is an unintentional communication channel in the system. For example, two processes banned from communicating directly can use the following scheme: one process repeatedly performs a computation known to take a fixed time. The other process alternately loads and unloads the machine with computationally intensive child processes, depending on the bit it wants to send. Loading the machine corresponds to a 1 and unloading the machine a 0. The listening process knows that if its computation takes longer than usual, it should record a 1, and if it's shorter, record a 0. The two can work out the timing and loading (statistically, if necessary) to communicate.
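A toy illustration of that timing channel, with both parties in one process for simplicity: a "sender" thread loads the CPU (bit 1) or idles (bit 0) while the "receiver" times a fixed computation against a calibrated threshold. The loop counts and the 1.5x threshold are arbitrary assumptions; a real channel would tune them statistically, as noted above, and reliability on a lightly loaded multicore machine is not guaranteed.

public class CovertChannel {
    static volatile boolean senderBusy = false;

    // The receiver's fixed computation; its wall-clock time varies with load.
    static long timeFixedComputation() {
        long start = System.nanoTime();
        double sink = 0;
        for (int i = 1; i < 2_000_000; i++) sink += Math.sqrt(i);
        if (sink < 0) System.out.print("");  // keep the loop from being optimized away
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        // Sender: busy-loop on every core while senderBusy is set, idle otherwise.
        for (int c = 0; c < Runtime.getRuntime().availableProcessors(); c++) {
            Thread t = new Thread(() -> {
                double sink = 0;
                while (true) {
                    if (senderBusy) sink += Math.sqrt(sink + 1);  // load the machine
                    else {
                        try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }

        long quiet = timeFixedComputation();        // calibrate on the unloaded machine
        boolean[] message = { true, false, true };  // the bits to smuggle out

        for (boolean bit : message) {
            senderBusy = bit;                       // the sender's only "action"
            Thread.sleep(50);                       // let the load change take effect
            long t = timeFixedComputation();        // the receiver's measurement
            System.out.println("received " + (t > quiet * 3 / 2 ? 1 : 0));
        }
    }
}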

Covert channels are necessarily low bandwidth, and stopping them is difficult. (In the example above, the system would have to guarantee a fixed system load, which would mean slowing the system when it was unloaded.) Most systems don't stop covert channels; only systems that hold serious enough data attempt to.


CHAPTER 11

FILE SYSTEM IMPLEMENTATION

First we look at files from the point of view of a person or program using the file system, and then we consider how this user interface is implemented.

The User Interface to Files

Just as the process abstraction beautifies the hardware by making a single CPU (or a small number of CPUs) appear to be many CPUs, one per “user,” the file system beautifies the hardware disk, making it appear to be a large number of disk-like objects called files. Like a disk, a file is capable of storing a large amount of data cheaply, reliably, and persistently. The fact that there are lots of files is one form of beautification: Each file is individually protected, so each user can have his own files, without the expense of requiring each user to buy his own disk. Each user can have lots of files, which makes it easier to organize persistent data. The filesystem also makes each individual file more beautiful than a real disk. At the very least, it erases block boundaries, so a file can be any length (not just a multiple of the block size) and programs can read and write arbitrary regions of the file without worrying about whether they cross block boundaries. Some systems (not Unix) also provide assistance in organizing the contents of a file.

Systems use the same sort of device (a disk drive) to support both virtual memory and files. The question arises why these have to be distinct facilities, with vastly different user interfaces. The answer is that they don't. In Multics, there was no difference whatsoever. Everything in Multics was a segment. The address space of each running process consisted of a set of segments (each with its own segment number), and the "file system" was simply a set of named segments. To access a segment from the file system, a process would pass its name to a system call that assigned a segment number to it. From then on, the process could read and write the segment simply by executing ordinary loads and stores. For example, if the segment was an array of integers, the program could access the i-th number with a notation like a[i] rather than having to seek to the appropriate offset and then execute a read system call. If the block of the file containing this value wasn't in memory, the array access would cause a page fault, which was serviced as explained in the previous chapter.


This user-interface idea, sometimes called "single-level store," is a great idea. So why is it not common in current operating systems? In other words, why are virtual memory and files presented as very different kinds of objects? There are several possible explanations one might propose:

The address space of a process is small compared to the size of a file system.

There is no reason why this has to be so. In Multics, a process could have up to 256K segments, but each segment was limited to 64K words. Multics allowed for lots of segments because every “file” in the file system was a segment. The upper bound of 64K words per segment was considered large by the standards of the time; the hardware actually allowed segments of up to 256K words (over one megabyte). Most new processors introduced in the last few years allow 64-bit virtual addresses. In a few years, such processors will dominate. So there is no reason why the virtual address space of a process cannot be large enough to include the entire file system.

The virtual memory of a process is transient--it goes away when the process terminates--while files must be persistent.

Multics showed that this doesn't have to be true. A segment can be designated as "permanent," meaning that it should be preserved after the process that created it terminates. Permanent segments do raise a need for one "file-system-like" facility, the ability to give names to segments so that new processes can find them.

Files are shared by multiple processes, while the virtual address space of a process is associated with only that process.

Most modern operating systems (including most variants of Unix) provide some way for processes to share portions of their address spaces anyhow, so this is a particularly weak argument for a distinction between files and segments.

The real reason single-level store is not ubiquitous is probably a concern for efficiency. The usual file-system interface encourages a particular style of access: Open a file, go through it sequentially, copying big chunks of it to or from main memory, and then close it. While it is possible to access a file like an array of bytes, jumping around and accessing the data in tiny pieces, it is awkward. Operating system designers have found ways to implement files that make the common “file like” style of access very efficient. While there appears to be no reason in principle why memory-mapped files cannot be made to give similar performance when they are accessed in this way, in practice, the added functionality of mapped files always seems to pay a price in performance. Besides, if it is easy to jump around in a file, applications programmers will take advantage of it, overall performance will suffer, and the file system will be blamed.

Naming

Every file system provides some way to give a name to each file. We will consider only names for individual files here, and talk about directories later. The name of a file is (at least sometimes) meant to be used by human beings, so it should be easy for humans to use. Different operating systems put different restrictions on names:

Size


Some systems put severe restrictions on the length of names. For example, DOS restricts names to 11 characters, while early versions of Unix (and some still in use today) restrict names to 14 characters. The Macintosh operating system, Windows 95, and most modern versions of Unix allow names to be essentially arbitrarily long. I say "essentially" since names are meant to be used by humans, so they don't really need to be all that long. A name that is 100 characters long is just as difficult to use as one that is forced to be under 11 characters long (but for different reasons). Most modern versions of Unix, for example, restrict names to a limit of 255 characters.

Case

Are upper and lower case letters considered different? The Unix tradition is to consider the names Foo and foo to be completely different and unrelated names. In DOS and its descendants, however, they are considered the same. Some systems translate names to one case (usually upper case) for storage. Others retain the original case, but consider it simply a matter of decoration. For example, if you create a file named "Foo," you could open it as "foo" or "FOO," but if you list the directory, you would still see the file listed as "Foo".

Character Set

Different systems put different restrictions on what characters can appear in file names. The Unix directory structure supports names containing any character other than NUL (the byte consisting of all zero bits), but many utility programs (such as the shell) would have trouble with names that have spaces, control characters or certain punctuation characters (particularly '/'). MacOS allows all of these (e.g., it is not uncommon to see a file name with the Copyright symbol © in it). With the world-wide spread of computer technology, it is becoming increasingly important to support languages other than English, and in fact alphabets other than Latin. There is a move to support character strings (and in particular file names) in the Unicode character set, which devotes 16 bits to each character rather than 8 and can represent the alphabets of all major modern languages from Arabic to Devanagari to Telugu to Khmer.

Format

It is common to divide a file name into a base name and an extension that indicates the type of the file. DOS requires that each name be composed of a base name of eight or fewer characters and an extension of three or fewer characters. When the name is displayed, it is represented as base.extension. Unix internally makes no such distinction, but it is a common convention to include exactly one period in a file name (e.g., foo.c for a C source file).

File Structure

Unix hides the “chunkiness” of tracks, sectors, etc. and presents each file as a “smooth” array of bytes with no internal structure. Application programs can, if they wish, use the bytes in the file to represent structures. For example, a wide-spread convention in Unix is to use the newline character (the character with bit pattern 00001010) to break text files into lines. Some other systems provide a variety of other types of files. The most common are files that consist of an array of fixed or variable size records and files that form an index mapping keys to values. Indexed files are usually implemented as B-trees.

File Types


Most systems divide files into various “types.” The concept of “type” is a confusing one, partially because the term “type” can mean different things in different contexts. Unix initially supported only four types of files: directories, two kinds of special files (discussed later), and “regular” files. Just about any type of file is considered a “regular” file by Unix. Within this category, however, it is useful to distinguish text files from binary files; within binary files there are executable files (which contain machine-language code) and data files; text files might be source files in a particular programming language (e.g. C or Java) or they may be human-readable text in some mark-up language such as html (hypertext markup language). Data files may be classified according to the program that created them or is able to interpret them, e.g., a file may be a Microsoft Word document or Excel spreadsheet or the output of TeX. The possibilities are endless.

In general (not just in Unix) there are three ways of indicating the type of a file:

• The operating system may record the type of a file in meta-data stored separately from the file, but associated with it. Unix only provides enough meta-data to distinguish a regular file from a directory (or special file), but other systems support more types.

• The type of a file may be indicated by part of its contents, such as a header made up of the first few bytes of the file. In Unix, files that store executable programs start with a two-byte magic number that identifies them as executable and selects one of a variety of executable formats (a check of this sort is sketched after this list). In the original Unix executable format, called the a.out format, the magic number is the octal number 0407, which happens to be the machine code for a branch instruction on the PDP-11 computer, one of the first computers to implement Unix. The operating system could run a file by loading it into memory and jumping to the beginning of it. The 0407 code, interpreted as an instruction, jumps to the word following the 16-byte header, which is the beginning of the executable code in this format. The PDP-11 computer is extinct by now, but it lives on through the 0407 code!

• The type of a file may be indicated by its name. Sometimes this is just a convention, and sometimes it's enforced by the OS or by certain programs. For example, the Unix Java compiler refuses to believe that a file contains Java source unless its name ends with .java.
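As promised, here is a short Java sketch of a magic-number check. Treating the first two bytes as a little-endian 16-bit word, as on the PDP-11, is an assumption of this sketch, not something the notes specify.

    import java.io.FileInputStream;
    import java.io.IOException;

    public class MagicCheck {
        // Does the named file start with the a.out magic number 0407 (octal)?
        public static boolean isAoutExecutable(String name) throws IOException {
            try (FileInputStream in = new FileInputStream(name)) {
                int lo = in.read();
                int hi = in.read();
                if (lo < 0 || hi < 0)
                    return false;               // file is shorter than two bytes
                int magic = (hi << 8) | lo;     // assemble little-endian 16-bit word
                return magic == 0407;           // Java octal literal, 263 decimal
            }
        }
    }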

Some systems enforce the types of files more vigorously than others. File types may be enforced

• Not at all,

• Only by convention,

• By certain programs (e.g. the Java compiler), or

• By the operating system itself.

Unix tends to be very lax in enforcing types.

Access Modes

Systems support various access modes for operations on a file.

Sequential: Read or write the next record or next n bytes of the file. Usually, sequential access also allows a rewind operation.

Random: Read or write the nth record or bytes i through j. Unix provides an equivalent facility by adding a seek operation to the sequential operations listed above. This packaging of operations allows random access but encourages sequential access.


Indexed: Read or write the record with a given key. In some cases, the “key” need not be unique--there can be more than one record with the same key. In this case, programs use a combination of indexed and sequential operations: Get the first record with a given key, then get other records with the same key by doing sequential reads.
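Java's RandomAccessFile packages the operations the same way Unix does: reads proceed sequentially from an implicit position, and a seek call repositions that position for random access. A minimal sketch (the file name data.bin is made up):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class AccessModes {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile f = new RandomAccessFile("data.bin", "r")) {
                byte[] buf = new byte[16];
                int n = f.read(buf);   // sequential: read at the current position
                f.seek(1024);          // random: jump to byte offset 1024
                n = f.read(buf);       // and continue sequentially from there
                System.out.println("read " + n + " bytes at offset 1024");
            }
        }
    }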

File Attributes

This is the area where there is the most variation among file systems. Attributes can also be grouped by general category.

Name

Ownership and Protection

Owner, owner's "group," creator, access-control list (information about who can do what to this file: for example, perhaps the owner can read or modify it, other members of his group can only read it, and others have no access).

Time stamps

Time created, time last modified, time last accessed, time the attributes were last changed, etc. Unix maintains the last three of these. Some systems record not only when the file was last modified, but by whom.

Sizes

Current size, size limit, “high-water mark”, space consumed (which may be larger than size because of internal fragmentation or smaller because of various compression techniques).

Type Information

As described above: File is ASCII, is executable, is a “system” file, is an Excel spread sheet, etc.

Misc

Some systems have attributes describing how the file should be displayed when a directory is listed. For example, MacOS records an icon to represent the file and the screen coordinates where it was last displayed. DOS has a "hidden" attribute meaning that the file is not normally shown. Unix achieves a similar effect by convention: The ls program that is usually used to list files does not show files with names that start with a period unless you explicitly request it to (with the -a option).

Unix records a fixed set of attributes in the meta-data associated with a file. If you want to record some fact about the file that is not included among the supported attributes, you have to use one of the tricks listed above for recording type information: encode it in the name of the file, put it into the body of the file itself, or store it in a file with a related name (e.g. “foo.attributes”). Other systems (notably MacOS and Windows NT) allow new attributes to be invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-name, attribute-value) pairs. The attribute name can be any four-character string, and the attribute value can be anything at all. Indeed, some kinds of files put the entire “contents” of the file in an attribute and leave the “body” of the file (called the data fork) empty.

Operations

POSIX, a standard API (application programming interface) based on Unix, provides the following operations (among others) for manipulating files:


• fd = open(name, operation)

• fd = creat(name, mode)

• status = close(fd)

• byte_count = read(fd, buffer, byte_count)

• byte_count = write(fd, buffer, byte_count)

• offset = lseek(fd, offset, whence)

• status = link(oldname, newname)

• status = unlink(name)

• status = stat(name, buffer)

• status = fstat(fd, buffer)

• status = utimes(name, times)

• status = chown(name, owner, group) or fchown(fd, owner, group)

• status = chmod(name, mode) or fchmod(fd, mode)

• status = truncate(name, size) or ftruncate(fd, size)

Some types of arguments and results need explanation.

Status: Many functions return a "status" which is either 0 for success or -1 for errors (there is another mechanism to get more information about what went wrong). Other functions also use -1 as a return value to indicate an error.

Name: A character-string name for a file.

Fd: A "file descriptor", which is a small non-negative integer used as a short, temporary name for a file during the lifetime of a process.

Buffer: The memory address of the start of a buffer for supplying or receiving data.

Whence: One of three codes, signifying from start, from end, or from current location.

Mode: A bit-mask specifying protection information.

Operation: An integer code, one of read, write, read and write, and perhaps a few other possibilities such as append only.

The open call finds a file and assigns a descriptor to it. It also indicates how the file will be used by this process (read only, read/write, etc.). The creat call is similar, but creates a new (empty) file. The mode argument specifies protection attributes (such as "writable by owner but read-only by others") for the new file. (Most modern versions of Unix have merged creat into open by adding an optional mode argument and allowing the operation argument to specify that the file is automatically created if it doesn't already exist.) The close call simply announces that fd is no longer in use and can be reused for another open or creat.


The read and write operations transfer data between a file and memory. The starting location in memory is indicated by the buffer parameter; the starting location in the file (called the seek pointer) is wherever the last read or write left off. The result is the number of bytes transferred. For write it is normally the same as the byte_count parameter unless there is an error. For read it may be smaller if the seek pointer starts out near the end of the file. The lseek operation adjusts the seek pointer (it is also automatically updated by read and write). The specified offset is added to zero, the current seek pointer, or the current size of the file, depending on the value of whence.
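The whence rule fits in a few lines of Java. The constant names below are the standard POSIX ones; the sketch itself is just an illustration of the arithmetic, not kernel code.

    static final int SEEK_SET = 0, SEEK_CUR = 1, SEEK_END = 2;

    // New seek pointer for lseek(fd, offset, whence): the offset is added
    // to zero, the current pointer, or the current file size.
    static long newSeekPointer(long current, long fileSize, long offset, int whence) {
        switch (whence) {
            case SEEK_SET: return offset;             // from the start of the file
            case SEEK_CUR: return current + offset;   // from the current position
            case SEEK_END: return fileSize + offset;  // from the end of the file
            default: throw new IllegalArgumentException("bad whence");
        }
    }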

The function link adds a new name (alias) to a file, while unlink removes a name. There is no function to delete a file; the system automatically deletes it when there are no remaining names for it.

The stat function retrieves meta-data about the file and puts it into a buffer (in a fixed, documented format), while the remaining functions can be used to update the meta-data: utimes updates time stamps, chown updates ownership, chmod updates protection information, and truncate changes the size (files can be made bigger by write, but only truncate can make them smaller). Most come in two flavors: one that takes a file name and one that takes a descriptor for an open file.

All of these calls are documented in more detail in the manual pages available on any Unix system; for example, type man 2 read. The '2' means to look in section 2 of the manual, where system calls are explained.

Other systems have similar operations, and perhaps a few more. For example, indexed or indexed sequential files would require a version of seek to specify a key rather than an offset. It is also common to have a separate append operation for writing to the end of a file.

The User Interface to Directories

We already talked about file names. One important feature that a file name should have is that it be unambiguous: There should be at most one file with any given name. The symmetrical condition, that there be at most one name for any given file, is not necessarily a good thing. Sometimes it is handy to be able to give multiple names to a file. When we consider implementation, we will describe two different ways to implement multiple names for a file, each with slightly different semantics. If there are a lot of files in a system, it may be difficult to avoid giving two files the same name, particularly if there are multiple users independently making up names. One technique to assure uniqueness is to prefix each file name with the name (or user id) of the owner. In some early operating systems, that was the only assistance the system gave in preventing conflicts.

A better idea is the hierarchical directory structure, first introduced by Multics, then popularized by Unix, and now found in virtually every operating system. You probably already know about hierarchical directories, but I would like to describe them from an unusual point of view, and then explain how this point of view is equivalent to the more familiar version.

Each file is named by a sequence of names. Although all modern operating systems use this technique, each uses a different character to separate the components of the sequence when displaying it as a character string. Multics uses '>', Unix uses '/', DOS and its descendants use '\', and MacOS uses ':'. Sequences make it easy to avoid naming conflicts. First, assign a sequence to each user and only let him create files with names that start with that sequence. For example, I might be assigned the sequence ("usr", "solomon"), written in Unix as /usr/solomon. So far, this is the same as just appending the user name to each file name. But it allows me to further classify my own files to prevent conflicts. When I start a new project, I can create a new sequence by appending the name of the project to the end of the sequence assigned to me, and then use this prefix for all files in the project. For example, I might choose /usr/solomon/cs537 for files associated with this course, and name them /usr/solomon/cs537/foo, /usr/solomon/cs537/bar, etc. As an extra aid, the system allows me to specify a "default prefix" and a short-hand for writing names that start with that prefix. In Unix, I use the system call chdir to specify a prefix, and whenever I use a name that does not start with '/', the system automatically adds that prefix.

It is customary to think of the directory system as a directed graph, with names on the edges. Each path in the graph is associated with a sequence of names, the names on the edges that make up the path. For that reason, the sequence of names is usually called a path name. One node is designated as the root node, and the rule is enforced that there cannot be two edges with the same name coming out of one node. With this rule, we can use path names to name nodes. Start at the root node and treat the path name as a sequence of directions, telling us which edge to follow at each step. It may be impossible to follow the directions (because they tell us to use an edge that does not exist), but if it is possible to follow the directions, they will lead us unambiguously to one node. Thus path names can be used as unambiguous names for nodes. In fact, as we will see, this is how the directory system is actually implemented. However, I think it is useful to think of "path names" simply as long names to avoid naming conflicts, since it clearly separates the interface from the implementation.

Implementing File Systems

Files

We will assume that all the blocks of the disk are given block numbers starting at zero and running through consecutive integers up to some maximum. We will further assume that blocks with numbers that are near each other are located physically near each other on the disk (e.g., same cylinder), so that the arithmetic difference between the numbers of two blocks gives a good estimate of how long it takes to get from one to the other. First let's consider how to represent an individual file. There are (at least!) four possibilities:

Contiguous: The blocks of a file are the blocks numbered n, n+1, n+2, ..., m. We can represent any file with a pair of numbers: the block number of the first block and the length of the file (in blocks). The advantages of this approach are

• It's simple

• The blocks of the file are all physically near each other on the disk and in order so that a sequential scan through the file will be fast.

The problem with this organization is that you can only grow a file if the block following the last block in the file happens to be free. Otherwise, you would have to find a long enough run of free blocks to accommodate the new length of the file and copy it. As a practical matter, operating systems that use this organization require the maximum size of the file to be declared when it is created and pre-allocate space for the whole file. Even then, storage allocation has all the problems we considered when studying main-memory allocation including external fragmentation.

Linked List: A file is represented by the block number of its first block, and each block contains the block number of the next block of the file. This representation avoids the problems of the contiguous representation: We can grow a file by linking any disk block onto the end of the list, and there is no external fragmentation. However, it introduces a new problem: Random access is effectively impossible. To find the 100th block of a file, we have to read the first 99 blocks just to follow the list. We also lose the advantage of very fast sequential access to the file since its blocks may be scattered all over the disk. However, if we are careful when choosing blocks to add to a file, we can retain pretty good sequential access performance.


Both the space overhead (the percentage of the space taken up by pointers) and the time overhead (the percentage of the time seeking from one place to another) can be decreased by using larger blocks. The hardware designer fixes the block size (which is usually quite small) but the software can get around this problem by using "virtual" blocks, sometimes called clusters. The OS simply treats each group of (say) four contiguous physical disk sectors as one cluster. Large clusters, particularly if they can be variable size, are sometimes called extents. Extents can be thought of as a compromise between linked and contiguous allocation.

Disk Index: The idea here is to keep the linked-list representation, but take the link fields out of the blocks and gather them together all in one place. This approach is used in the "FAT" file system of DOS, OS/2 and older versions of Windows. At some fixed place on disk, allocate an array I with one element for each block on the disk, and move the link field from block n to I[n] (see Figure 11.17 on page 382). The whole array of links, called a file allocation table (FAT), is now small enough that it can be read into main memory when the system starts up. Accessing the 100th block of a file still requires walking through 99 links of a linked list, but now the entire list is in memory, so the time to traverse it is negligible (recall that a single disk access takes as long as 10's or even 100's of thousands of instructions). This representation has the added advantage of getting the "operating system" stuff (the links) out of the pages of "user data". The pages of user data are now full-size disk blocks, and lots of algorithms work better with chunks that are a power of two bytes long. Also, it means that the OS can prevent users (who are notorious for screwing things up) from getting their grubby hands on the system data.
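In Java, the in-memory walk might look like this sketch, assuming fat[b] holds the number of the block that follows block b in its file and -1 marks the end of a chain:

    // Find the disk block number of block n (counting from 0) of a file
    // whose first block is 'first', by following the in-memory FAT chain.
    static int nthBlock(int[] fat, int first, int n) {
        int block = first;
        for (int i = 0; i < n; i++) {
            block = fat[block];          // one link per step, no disk access
            if (block == -1)
                throw new IllegalArgumentException("file is shorter than n+1 blocks");
        }
        return block;
    }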

The main problem with this approach is that the index array I can get quite large with modern disks. For example, consider a 2 GB disk with 2K blocks. There are a million blocks, so a block number must be at least 20 bits. Rounded up to an even number of bytes, that's 3 bytes--4 bytes if we round up to a word boundary--so the array I is three or four megabytes. While that's not an excessive amount of memory given today's RAM prices, if we can get along with less, there are better uses for the memory.

File Index: Although a typical disk may contain tens of thousands of files, only a few of them are open at any one time, and it is only necessary to keep index information about open files in memory to get good performance. Unfortunately the whole-disk index described in the previous paragraph mixes index information about all files for the whole disk together, making it difficult to cache only information about open files. The inode structure introduced by Unix groups together index information about each file individually. The basic idea is to represent each file as a tree of blocks, with the data blocks as leaves. Each internal block (called an indirect block in Unix jargon) is an array of block numbers, listing its children in order. If a disk block is 2K bytes and a block number is four bytes, 512 block numbers fit in a block, so a one-level tree (a single root node pointing directly to the leaves) can accommodate files up to 512 blocks, or one megabyte in size. If the root node is cached in memory, the “address” (block number) of any block of the file can be found without any disk accesses. A two-level tree, with 513 total indirect blocks, can handle files 512 times as large (up to one-half gigabyte).

The only problem with this idea is that it wastes space for small files. Any file with more than one block needs at least one indirect block to store its block numbers. A 4K file would require three 2K blocks, wasting up to one third of its space. Since many files are quite small, this is a serious problem. The Unix solution is to use a different kind of "block" for the root of the tree.

An index node (or inode for short) contains almost all the meta-data about a file listed above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are small enough that several of them can be packed into one disk block. In addition to the meta-data, an inode contains the block numbers of the first few blocks of the file. What if the file is too big to fit all its block numbers into the inode? The earliest version of Unix had a bit in the meta-data to indicate whether the file was "small" or "big." For a big file, the inode contained the block numbers of indirect blocks rather than data blocks. More recent versions of Unix contain pointers to indirect blocks in addition to the pointers to the first few data blocks. The inode contains pointers to (i.e., block numbers of) the first few blocks of the file, a pointer to an indirect block containing pointers to the next several blocks of the file, a pointer to a doubly indirect block, which is the root of a two-level tree whose leaves are the next blocks of the file, and a pointer to a triply indirect block. A large file is thus a lop-sided tree.

A real-life example is given by the Solaris 2.5 version of Unix. Block numbers are four bytes and the size of a block is a parameter stored in the file system itself, typically 8K (8192 bytes), so 2048 pointers fit in one block. An inode has direct pointers to the first 12 blocks of the file, as well as pointers to singly, doubly, and triply indirect blocks. A file of up to 12+2048+2048*2048 = 4,196,364 blocks or 34,376,613,888 bytes (about 32 GB) can be represented without using triply indirect blocks, and with the triply indirect block, the maximum file size is (12+2048+2048*2048+2048*2048*2048)*8192 = 70,403,120,791,552 bytes (slightly more than 2^46 bytes, or about 64 terabytes). Of course, for such huge files, the size of the file cannot be represented as a 32-bit integer. Modern versions of Unix store the file length as a 64-bit integer, called a "long" integer in Java. An inode is 128 bytes long, allowing room for the 15 block pointers plus lots of meta-data. 64 inodes fit in one disk block. Since the inode for a file is kept in memory while the file is open, locating an arbitrary block of any file requires at most three I/O operations, not counting the operation to read or write the data block itself.
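The lop-sided tree turns into simple arithmetic. A sketch, using the parameters above (12 direct pointers, 2048 pointers per indirect block), of how many indirect blocks must be read to locate a given block of a file:

    // Number of indirect blocks read to locate file block n
    // (0 = reachable via a direct pointer, up to 3 = triply indirect).
    static int indirectReads(long n) {
        final long DIRECT = 12, PER = 2048;
        if (n < DIRECT) return 0;
        n -= DIRECT;
        if (n < PER) return 1;               // singly indirect
        n -= PER;
        if (n < PER * PER) return 2;         // doubly indirect
        n -= PER * PER;
        if (n < PER * PER * PER) return 3;   // triply indirect
        throw new IllegalArgumentException("block number too large");
    }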

Directories

A directory is simply a table mapping character-string human-readable names to information about files. The early PC operating system CP/M shows how simple a directory can be. Each entry contains the name of one file, its owner, size (in blocks) and the block numbers of 16 blocks of the file. To represent files with more than 16 blocks, CP/M used multiple directory entries with the same name and different values in a field called the extent number. CP/M had only one directory for the entire system.

DOS uses a similar directory entry format, but stores only the first block number of the file in the directory entry. The entire file is represented as a linked list of blocks using the disk index scheme described above. All but the earliest version of DOS provide hierarchical directories using a scheme similar to the one used in Unix.

Unix has an even simpler directory format. A directory entry contains only two fields: a character-string name (up to 14 characters) and a two-byte integer called an inumber, which is interpreted as an index into an array of inodes in a fixed, known location on disk. All the remaining information about the file (size, ownership, time stamps, permissions, and an index to the blocks of the file) is stored in the inode rather than the directory entry. A directory is represented like any other file (there's a bit in the inode to indicate that the file is a directory). Thus the inumber in a directory entry may designate a "regular" file or another directory, allowing arbitrary graphs of nodes. However, Unix carefully limits the set of operating system calls to ensure that the set of directories is always a tree. The root of the tree is the file with inumber 1 (some versions of Unix use other conventions for designating the root directory). The entries in each directory point to its children in the tree. For convenience, each directory also has two special entries: an entry with name "..", which points to the parent of the directory in the tree, and an entry with name ".", which points to the directory itself. Inumber 0 is not used, so an entry is marked "unused" by setting its inumber field to 0.

The algorithm to convert from a path name to an inumber might be written in Java as


int namei(int current, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        current = nameToInumber(inode[current], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
    }
    return current;
}

The procedure nameToInumber(Inode node, String name) (not shown) reads through the directory file represented by the inode node, looks for an entry matching the given name and returns the inumber contained in that entry. The procedure namei walks the directory tree, starting at a given inode and following a path described by a sequence of strings. There is a procedure with this name in the Unix kernel. Files are always specified in Unix system calls by a character-string path name. You can learn the inumber of a file if you like, but you can't use the inumber when talking to the Unix kernel. Each system call that has a path name as an argument uses namei to translate it to an inumber. If the argument is an absolute path name (it starts with ‘/’), namei is called with current == 1. Otherwise, current is the current working directory.
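Although the notes leave nameToInumber unshown, here is one way it might look, assuming the classic on-disk entry format described above (a two-byte inumber followed by a 14-byte, NUL-padded name, 16 bytes per entry) and an invented helper dirBytes that returns the contents of the directory file as a byte array. The byte order is also just an assumption of this sketch.

    // One possible nameToInumber: scan 16-byte directory entries for a name.
    static int nameToInumber(Inode dir, String name) {
        byte[] data = dirBytes(dir);         // assumed helper: whole directory file
        for (int off = 0; off + 16 <= data.length; off += 16) {
            int inumber = ((data[off] & 0xff) << 8) | (data[off + 1] & 0xff);
            if (inumber == 0)
                continue;                    // inumber 0 marks an unused entry
            int len = 0;
            while (len < 14 && data[off + 2 + len] != 0)
                len++;                       // find the end of the NUL-padded name
            if (name.equals(new String(data, off + 2, len)))
                return inumber;
        }
        return 0;                            // not found
    }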

Since all the information about a file except its name is stored in the inode, there can be more than one directory entry designating the same file. This allows multiple aliases (called links) for a file. Unix provides a system call link(oldname, newname) to create new names for existing files. The call link("/a/b/c", "/d/e/f") works something like this:

if (namei(1, parse("/d/e/f")) != 0)
    throw new Exception("file already exists");
int dir = namei(1, parse("/d/e"));
if (dir == 0 || inode[dir].type != DIRECTORY)
    throw new Exception("not a directory");
int target = namei(1, parse("/a/b/c"));
if (target == 0)
    throw new Exception("no such file");
if (inode[target].type == DIRECTORY)
    throw new Exception("cannot link to a directory");
addDirectoryEntry(inode[dir], target, "f");

The procedure parse (not shown here) is assumed to break up a path name into its components. If, for example, /a/b/c resolves to inumber 123, the entry (123, "f") is added to directory file designated by "/d/e". The result is that both "/a/b/c" and "/d/e/f" resolve to the same file (the one with inumber 123).

We have seen that a file can have more than one name. What happens if it has no names (does not appear in any directory)? Since the only way to name a file in a system call is by a path name, such a file would be useless. It would consume resources (the inode and probably some data and indirect blocks) but there would be no way to read it, write to it, or even delete it. Unix protects against this "garbage" problem by using reference counts. Each inode contains a count of the number of directory entries that point to it. "User" programs are not allowed to update directories directly. System calls that add or remove directory entries (creat, link, mkdir, rmdir, etc.) update these reference counts appropriately. There is no system call to delete a file, only the system call unlink(name) which removes the directory entry corresponding to name. If the reference count of an inode drops to zero, the system automatically deletes the file and returns all of its blocks to the free list.
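A sketch of how unlink might fit together, in the style of the fragments above. The helpers dirPart, lastPart, removeDirectoryEntry, and freeBlocks are invented for this sketch, and the inode is assumed to keep its reference count in a field named nlink:

    // unlink(name): remove one directory entry; delete the file itself
    // only when its last name disappears.
    static void unlink(String name) throws Exception {
        String[] path = parse(name);          // e.g. "/d/e/f" -> {"d", "e", "f"}
        int dir = namei(1, dirPart(path));    // assumed helper: all but last component
        int target = namei(1, path);
        if (target == 0)
            throw new Exception("no such file or directory");
        removeDirectoryEntry(inode[dir], lastPart(path)); // assumed helpers
        inode[target].nlink--;                // one fewer name for this file
        // When the count hits zero the file is garbage; a real Unix also waits
        // until no process still has the file open before reclaiming the space.
        if (inode[target].nlink == 0)
            freeBlocks(inode[target]);        // assumed helper
    }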

We saw before that the reference counting algorithm for garbage collection has a fatal flaw: If there are cycles, reference counting will fail to collect some garbage. Unix avoids this problem by making sure cycles cannot happen. The system calls are designed so that the set of directories will always be a single tree rooted at inode 1: mkdir creates a new directory, empty except for the . and .. entries, as a leaf of the tree; rmdir is only allowed to delete a directory that is empty (except for the . and .. entries); and link is not allowed to link to a directory. Because links to directories are not allowed, the only place the file system is not a tree is at the leaves (regular files), and that cannot introduce cycles.

Although this algorithm provides the ability to create aliases for files in a simple and secure manner, it has several flaws:

• It's hard to figure out how to charge users for disk space. Ownership is associated with the file, not the directory entry (the owner's id is stored in the inode). A file cannot be deleted without finding all the links to it and deleting them. If I create a file and you make a link to it, I will continue to be charged for it even if I try to remove it through my original name for it. Worse still, your link may be in a directory I don't have access to, so I may be unable to delete the file, even though I'm being charged for its space. Indeed, you could make it much bigger after I have no access to it.

• There is no way to make an alias for a directory.

• As we will see later, links cannot cross boundaries of physical disks.

• Since all aliases are equal, there's no one “true name” for a file. You can find out whether two path names designate the same file by comparing inumbers. There is a system call to get the meta-data about a file, and the inumber is included in that information. But there is no way of going in the other direction: to get a path name for a file given its inumber, or to find a path name of an open file. Even if you remember the path name used to get to the file, that is not a reliable “handle” to the file (for example to link two files together by storing the name of one in the other). One of the components of the path name could be removed, thus invalidating the name even though the file still exists under a different name.

While it's not possible to find the name (or any name) of an arbitrary file, it is possible to figure out the name of a directory. Directories do have unique names because the directories form a tree, and one of the properties of a tree is that there is a unique path from the root to any node. The ".." and "." entries in each directory make this possible. Here, for example, is code to find the name of the current working directory.

class DirectoryEntry {
    int inumber;
    String name;
}

String cwd() {
    FileInputStream thisDir = new FileInputStream(".");
    int thisInumber = nameToInumber(thisDir, ".");
    return getPath(".", thisInumber);
}

String getPath(String currentName, int currentInumber) {
    String parentName = currentName + "/..";
    FileInputStream parent = new FileInputStream(parentName);
    int parentInumber = nameToInumber(parent, ".");
    String fname = inumberToName(parent, currentInumber);
    if (parentInumber == 1)
        return "/" + fname;
    else
        return getPath(parentName, parentInumber) + "/" + fname;
}

The procedure nameToInumber is similar to the procedure with the same name described above, but takes an InputStream as an argument rather than an inode. Many versions of Unix allow a program to open a directory for reading and read its contents just like any other file. In such systems, it would be easy to write nameToInumber as a user-level procedure if you know the format of a directory. The procedure inumberToName is similar, but searches for an entry containing a particular inumber and returns the name field of the entry.

Symbolic Links

To get around the limitations of the original Unix notion of links, more recent versions of Unix introduced the notion of a symbolic link (to avoid confusion, the original kind of link, described in the previous section, is sometimes called a hard link). A symbolic link is a new type of file, distinguished by a code in the inode from directories, regular files, etc. When the namei procedure that translates path names to inumbers encounters a symlink, it treats the contents of the file as a path name and uses it to continue the translation. If the contents of the file is a relative path name (it does not start with a slash), it is interpreted relative to the directory containing the link itself, not the current working directory of the process doing the lookup.

int namei(int current, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (inode[current].type != DIRECTORY)
            throw new Exception("not a directory");
        int parent = current;
        current = nameToInumber(inode[current], path[i]);
        if (current == 0)
            throw new Exception("no such file or directory");
        if (inode[current].type == SYMLINK) {
            String link = getContents(inode[current]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                current = namei(1, linkPath);       // absolute: restart at the root
            else
                current = namei(parent, linkPath);  // relative: start at the link's directory
            if (current == 0)
                throw new Exception("no such file or directory");
        }
    }
    return current;
}

The only change from the previous version of this procedure is the addition of the if statement that handles symbolic links. Any time the procedure encounters a node of type SYMLINK, it recursively calls itself to translate the contents of the file, interpreted as a path name, into an inumber.

Although the implementation looks complicated, it does just what you would expect in normal situations. For example, suppose there is an existing file named /a/b/c and an existing directory /d. Then the command ln -s /a/b /d/e makes the path name /d/e a synonym for /a/b, and also makes /d/e/c a synonym for /a/b/c. From the user's point of view, /d/e looks like just another name for the directory /a/b. In implementation terms, /d/e is a separate node of type symlink (drawn as a hexagon in the original figures) whose contents are the string /a/b.

Here's a more elaborate example that illustrates symlinks with relative path names. Suppose I have an existing directory /usr/solomon/cs537/s90 with various sub-directories and I am setting up project 5 for this semester. I might do something like this:

cd /usr/solomon/cs537
mkdir f96
cd f96
ln -s ../s90/proj5 proj5.old
cat proj5.old/foo.c
cd /usr/solomon/cs537
cat f96/proj5.old/foo.c
cat s90/proj5/foo.c

Logically, proj5.old now appears as a sub-directory of f96; physically, it is a symlink node whose contents, ../s90/proj5, lead back into the s90 tree. All three of the cat commands refer to the same file.

The added flexibility of symlinks over hard links comes at the price of less security. Symlinks are neither required nor guaranteed to point to valid files. You can remove a file out from under a symlink, and in fact, you can create a symlink to a non-existent file. Symlinks can also have cycles. For example, this works fine:

cd /usr/solomon
mkdir bar
ln -s /usr/solomon foo
ls /usr/solomon/foo/foo/foo/foo/bar

However, in some cases, symlinks can cause infinite loops or infinite recursion in the namei procedure. The real version in Unix puts a limit on how many times it will iterate and returns an error code of “too many links” if the limit is exceeded. Symlinks to directories can also cause the “change directory” command cd to behave in strange ways. Most people expect that the two commands

cd foo
cd ..

to cancel each other out. But in the last example, the commands

cd /usr/solomon
cd foo
cd ..

would leave you in the directory /usr. Some shell programs treat cd specially and remember what alias you used to get to the current directory. After cd /usr/solomon; cd foo; cd foo, the current directory is /usr/solomon/foo/foo, which is an alias for /usr/solomon, but the command cd .. is treated as if you had typed cd /usr/solomon/foo.

Mounting

What if your computer has more than one disk? In many operating systems (including DOS and its descendants) a pathname starts with a device name, as in C:\usr\solomon (by convention, C is the name of the default hard disk). If you leave the device prefix off a path name, the system supplies a default current device similar to the current directory. Unix allows you to glue together the directory trees of multiple disks to create a single unified tree. There is a system call

mount(device, mount_point)

where device names a particular disk drive and mount_point is the path name of an existing node in the current directory tree (normally an empty directory). The result is similar to a hard link: the mount point becomes an alias for the root directory of the indicated disk. Here's how it works: the kernel maintains a table of existing mounts represented as (device1, inumber, device2) triples. During namei, whenever the current (device, inumber) pair matches the first two fields in one of the entries, the current device and inumber become device2 and 1, respectively. Here's the expanded code:

int namei(int curi, int curdev, String[] path) {
    for (int i = 0; i < path.length; i++) {
        if (disk[curdev].inode[curi].type != DIRECTORY)
            throw new Exception("not a directory");
        int parent = curi;
        curi = nameToInumber(disk[curdev].inode[curi], path[i]);
        if (curi == 0)
            throw new Exception("no such file or directory");
        if (disk[curdev].inode[curi].type == SYMLINK) {
            String link = getContents(disk[curdev].inode[curi]);
            String[] linkPath = parse(link);
            if (link.charAt(0) == '/')
                curi = namei(1, curdev, linkPath);
            else
                curi = namei(parent, curdev, linkPath);
            if (curi == 0)
                throw new Exception("no such file or directory");
        }
        // If (curdev, curi) is a mount point, cross onto the mounted disk.
        int newdev = mountLookup(curdev, curi);
        if (newdev != -1) {
            curdev = newdev;
            curi = 1;   // the root inumber on the mounted disk
        }
    }
    return curi;
}

In this code, we assume that mountLookup searches the mount table for a matching entry, returning -1 if no matching entry is found. There is also a special case (not shown here) for ".." so that the ".." entry in the root directory of a mounted disk behaves like a pointer to the parent directory of the mount point.
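mountLookup itself can be a linear scan of those triples; a sketch:

    // The mount table: (device1, inumber, device2) triples, filled in by
    // the mount system call. Returns device2 if (dev, inumber) is a mount
    // point, or -1 if it is not.
    static int[][] mountTable = new int[0][]; // populated by mount(); empty here

    static int mountLookup(int dev, int inumber) {
        for (int[] entry : mountTable)
            if (entry[0] == dev && entry[1] == inumber)
                return entry[2];
        return -1;
    }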

The Network File System (NFS) from Sun Microsystems extends this idea to allow you to mount a disk from a remote computer. The device argument to the mount system call names the remote computer as well as the disk drive and both pieces of information are put into the mount table. Now there are three pieces of information to define the “current directory”: the inumber, the device, and the computer. If the current computer is remote, all operations (read, write, creat, delete, mkdir, rmdir, etc.) are sent as messages to the remote computer. Information about remote open files, including a seek pointer and the identity of the remote machine, is kept locally. Each read or write operation is converted locally to one or more requests to read or write blocks of the remote file. NFS caches blocks of remote files locally to improve performance.

Special Files

I said that the Unix mount system call has the name of a disk device as an argument. How do you name a device? The answer is that devices appear in the directory tree as special files. An inode whose type is "special" (as opposed to "directory," "symlink," or "regular") represents some sort of I/O device. It is customary to put special files in the directory /dev, but since it is the inode that is marked "special," they can be anywhere. Instead of containing pointers to disk blocks, the inode of a special file contains information (in a machine-dependent format) about the device. The operating system tries to make the device look as much like a file as possible, so that ordinary programs can open, close, read, or write the device just like a file.

Some devices look more like real files than others. A disk device looks exactly like a file. Reads return whatever is on the disk and writes can scribble anywhere on the disk. For obvious security reasons, the permissions for the raw disk devices are highly restrictive. A tape drive looks sort of like a disk, but a read will return only the next physical block of data on the device, even if more is requested.

The special file /dev/tty represents the terminal. Writes to /dev/tty display characters on the screen. Reads from /dev/tty return characters typed on the keyboard. The seek operation on a device like /dev/tty updates the seek pointer, but the seek pointer has no effect on reads or writes. Reads of /dev/tty are also different from reads of a file in that they may return fewer bytes than requested: Normally, a read will return characters only up through the next end-of-line. If the number of bytes requested is less than the length of the line, the next read will get the remaining bytes. A read call will block the caller until at least one character can be returned. On machines with more than one terminal, there are multiple terminal devices with names like /dev/tty0, /dev/tty1, etc.

Some devices, such as a mouse, are read-only. Write operations on such devices have no effect. Other devices, such as printers, are write-only. Attempts to read from them give an end-of-file indication (a return value of zero). There is a special file called /dev/null that does nothing at all: reads return end-of-file and writes send their data to the garbage bin. (New EPA rules require that this data be recycled. It is now used to generate federal regulations and other meaningless documents.) One particularly interesting device is /dev/mem, which is an image of the memory space of the current process. In a sense, this device is the exact opposite of memory-mapped files. Instead of making a file look like part of virtual memory, it makes virtual memory look like a device.

This idea of making all sorts of things look like files can be very powerful. Some versions of Unix make network connections look like files. Some versions have a directory with one special file for each active process. You can read these files to get information about the states of processes. If you delete one of these files, the corresponding process is killed. Another idea is to have a directory with one special file for each print job waiting to be printed. Although this idea was pioneered by Unix, it is starting to show up more and more in other operating systems.

[Figure: an i-node with direct sector pointers and single, double, and triple indirect blocks.]

The last 3 sector pointers are special. The first points to a block that contains only pointers to sectors; this is a single indirect block. The second points to a block of pointers to single indirect blocks (a double indirect block), and the third to a block of pointers to double indirect blocks (a triple indirect block).

This results in increasing access times for blocks later in the file. Large files will have longer access times to the end of the file. I-nodes specifically optimize for short files.

Directories

Directories are generally simply files with a special interpretation. Some directory structures contain the name of a file, its attributes, and a pointer³ either into its FAT list or to its i-node.

This choice bears directly on the implementation of linking. If attributes are stored directly in the directory node, (hard) linking is difficult because changes to the file must be mirrored in all directories. If the directory entry simply points to a structure (like an i-node) that holds the attributes internally, only that structure needs to be updated.

Multiple Disks

There are two approaches to having multiple disks on a system (where disks are really devices that export a file system interface): the disks can be either explicit or implicit. An example of explicit disk naming is MS-DOS's A:\SYS\FILE.TXT. Other systems, from IBM VM/CMS to Amiga DOS, have done the same thing. Making disks explicit makes the boundaries between physical devices clear. UNIX® clouds the issue by allowing one device to be grafted onto the name space established by another at a mount point. A mount point looks like a directory to the user, but to the operating system it marks the boundary between devices. As a result, the file system appears to be one seamless name space, but there are subsets of the space on different devices.

³ It's usually a sector number, but thinking of it as a pointer is closer to its function.

File System Implementation - Performance

This lecture discusses the details of making the file system perform well. This is primarily done with two mechanisms - caching and file layout.

Caching

Far and away the most effective strategy for improving file system performance is caching. Every time a file is opened or created, the system must access all the directories on the path from the root to the file itself, and many files are referenced often - consider the executables for commonly used utilities. Accesses to the physical disk in these common cases can be avoided if the blocks that would have been read from the disk are cached in memory rather than being read from disk each time.

Cached Blocks

Caching is accomplished by setting aside a portion of memory and using it to store blocks as they are brought into memory. Reads and writes go to the cached copy when possible; blocks are written out to disk when convenient, or when a block must be evicted from the cache to make way for a new one.

The important issue with any caching system is the tradeoff between performance and consistency - in this case, the difference between the versions of file blocks in the cache and on the disk. During normal operation this distinction isn't terribly interesting: processes always read from the file cache first and go to the disk only if the block is not cached. The consistency problem arises when the computer catastrophically fails - for example, a power outage before all the file cache blocks have been written out to disk. When the OS restarts, the file system may be in an inconsistent state. The issue is how writes are cached.

There are several models of consistency that can be used, each of which marks a point on the consistency/performance curve. The Linux EXT2FS file system makes no special concessions to consistency. Its goal is a blazingly fast UNIX® file system, and it's willing to risk the dangers of file system corruption if operation is interrupted at an inopportune moment. The BSD FFS systems take a more conservative approach, but not the most conservative one: they write metadata synchronously to disk - that is, any changes to the file system structures themselves are not cached. File system structures include the free block list and file allocations. Thus it's possible to lose modifications to allocated blocks in a file, but not to have a block appear both on the free list and on a file's allocation list. The slowest and most reliable choice is to write all data synchronously to disk. This is the choice made by MS-DOS and Windows (for floppies, anyway).


The consistency issue underscores the difference between file cache management and virtual memory management algorithms. In paging, all pages are transient and therefore equally important. In the file system some blocks contain data that is essential to the continued correct operation of the filesystem1 and must be treated specially. Modulo these differences, paging algorithms are a good fit for managing the file cache.

In general, a variant of LRU is used, with modifications appropriate to the level of consistency required.

In fact, some systems, like FreeBSD and Solaris, use only one cache for both virtual memory pages and cached file blocks. This allows the number of pages or file blocks to grow as needed.

File System Checking

Most systems have a mechanism for checking the file system integrity, even if they take great pains to maintain that integrity. Occasional bad blocks or software bugs may corrupt data, and it’s worthwhile to periodically check the file systems to reduce the impact of any damage.

The checking process is basically:

• Compare the free list to the list of allocations for all the files. Each block should appear exactly once. Blocks not appearing at all should be considered free. Blocks appearing on both the free list and one file can (probably) be allocated to the file. Blocks allocated to two or more files are a complicated case that probably requires human intervention.

• Check all directories for pointers to the files. If files exist that have no directory entries, delete them and reclaim their allocations. Such orphans can arise in a system like UNIX that allows an unlinked file to remain on the disk while a process has it open.

The less consistency the operating system enforces, the more frequently the file system should be checked.
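A sketch of the block-level half of that check, in C. The scans that fill in the two counter arrays are assumed to exist elsewhere; the disk size and the messages are invented for illustration.

#include <stdio.h>

#define NBLOCKS 1024      /* assumed size of a toy disk */

int free_refs[NBLOCKS];   /* times block appears on the free list */
int file_refs[NBLOCKS];   /* times block appears in some file     */

void check_blocks(void)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (free_refs[b] + file_refs[b] == 0)
            printf("block %d missing: add it to the free list\n", b);
        else if (free_refs[b] > 0 && file_refs[b] == 1)
            printf("block %d free and in use: give it to the file\n", b);
        else if (file_refs[b] > 1)
            printf("block %d in %d files: needs a human\n", b, file_refs[b]);
        else if (free_refs[b] > 1)
            printf("block %d on free list twice: rebuild free list\n", b);
        /* exactly one reference in exactly one place: consistent */
    }
}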

Disk Layout

Avoiding disk seeks is the most fruitful optimization to a file system after caching2. One source of seeks is moving from i-nodes allocated at one end of the disk to data blocks at the other; another is seeking from directory entry to data block. Consider the i-node case.

In the original Berkeley UNIX file system, i-nodes were allocated at the beginning of the disk and data blocks everywhere else. The Berkeley fast file system disperses i-nodes throughout the disk, and does its best to place data sectors near the i-node that owns the file to reduce seek times.

[Figure: i-node and data block placement in the old file system versus the fast file system]

Although that example is UNIX-specific, the concept covers many operating systems. When seeks are expensive, try to place related data physically close together on the disk.

A related issue is reducing rotational latency. Files are usually read sequentially, but if file blocks are laid out contiguously on the disk, additional rotational delays can be incurred. The problem is that after the first block is read and returned, the OS issues a request for the next block, which has partially rotated past the disk head. The disk now must wait one rotation time until the beginning of the sector appears again.

By interleaving file blocks, the time to return the block and get the next request can be spent spinning over an unneeded block.
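As a toy illustration, a 2:1 interleave can be expressed as a renumbering of the sectors on a track; the formula below assumes an odd sector count so the mapping is a permutation, and real layouts are more careful than this.

/* Place logical block i on physical sector (2 * i) % sectors,
 * leaving one unneeded sector between consecutive file blocks so
 * the OS has time to issue the next request. */
int interleave(int logical, int sectors)
{
    return (2 * logical) % sectors;
}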

1. It’s even a little worse than this, because not only are the blocks essential for correct FS operation, sometimes the information in the blocks is interrelated. The easiest example is that moving a block from the free list to a file depends on at least two blocks being consistently updated.

2. Assuming that seeks are expensive in the file system. That’s true of disks, tapes, and CDs, but not of memory. A memory file system can avoid these issues.

The I/O Subsystem

The I/O subsystem is concerned with the work of providing the virtual interface to the hardware. This is the code that has to deal with the idiosyncrasies of devices, converting from the OS’s logical view to the messy realities of the hardware.


Frequently, devices are not directly connected to the bus, but are managed by controllers. This allows multiple devices to share a single bus slot, and bus slots may be a scarce resource. Device controllers are also a frequent location for additional intelligence in the I/O system.

Devices or controllers are frequently controlled by accessing device registers, which pass parameters from CPU to device (or controller). Such device registers are either accessible through special I/O instructions or by memory mapping the device registers into the processor’s address space.1
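Here is a sketch in C of the memory-mapped style of access; the controller, its base address, and its register layout are hypothetical, and a real driver would sleep on an interrupt rather than poll.

#include <stdint.h>

#define CTRL_BASE 0xFFFF1000u   /* assumed physical address of registers */

struct ctrl_regs {
    volatile uint32_t command;  /* write here to start an operation */
    volatile uint32_t status;   /* poll for completion or errors    */
    volatile uint32_t sector;   /* parameter: target sector number  */
};

#define CTRL ((struct ctrl_regs *)(uintptr_t)CTRL_BASE)
#define CMD_READ  1u
#define STAT_BUSY 1u

void start_read(uint32_t sector)
{
    CTRL->sector  = sector;     /* pass the parameter to the device */
    CTRL->command = CMD_READ;   /* kick off the operation           */
    while (CTRL->status & STAT_BUSY)
        ;                       /* busy-wait; volatile forces a real
                                   read of the register each time   */
}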

The Operating System is also responsible for coordinating the interrupts generated by the devices (and controllers). Generally a priority ordering is used: some devices can have their handlers interrupted by higher priority devices. Depending on the services offered by the hardware, this may result in interrupts being delayed or even lost. If a system discards rather than delays interrupts, it must poll device state when a high-priority interrupt handler returns, in case a low-priority interrupt has been lost.

Direct Memory Access (DMA)

In simple systems, the CPU must move each data byte to or from the bus using a LOAD or STORE instruction, as if the data were being moved to memory. This quickly uses up much of the CPU’s computational power. In order to allow systems to support high I/O utilization while the CPU gets useful work done on the users’ behalf, devices are allowed to access memory directly. This direct access of memory by devices (or controllers) is called Direct Memory Access, commonly abbreviated DMA.

The CPU is still responsible for scheduling the memory accesses made by DMA devices, but once the transfer has been programmed, the CPU has no further involvement until it is complete. Typically DMA devices issue interrupts on I/O completion.
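A sketch of what setting up such a transfer might look like; the DMA register layout and the pin_pages/phys_addr helpers are assumptions used for illustration (pinning is discussed just below).

#include <stdint.h>

struct dma_regs {
    volatile uint32_t addr;    /* physical address of the buffer */
    volatile uint32_t count;   /* number of bytes to transfer    */
    volatile uint32_t start;   /* write 1 to begin the transfer  */
};

extern struct dma_regs *dma;                   /* mapped by the kernel */
extern uint32_t phys_addr(void *va);           /* virtual -> physical  */
extern void pin_pages(void *va, uint32_t len); /* keep the pager away  */

void dma_read(void *buffer, uint32_t len)
{
    pin_pages(buffer, len);          /* buffer must not be paged out */
    dma->addr  = phys_addr(buffer);  /* the device sees physical
                                        addresses, not virtual ones  */
    dma->count = len;
    dma->start = 1;
    /* The CPU is now free to run other work; the device raises an
       interrupt on completion, and the handler unpins the pages. */
}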


Because this memory is not being manipulated by the CPU, and therefore addresses may not pass through an MMU, DMA devices often confuse or are confused by virtual memory. It is important to guarantee that memory intended for use by a DMA device is not manipulated by the paging system while the I/O is being performed. Such pages are usually frozen (or pinned) to avoid changes.

1. That’s the address space of the processor itself, not a user process. User processes themselves are generally restricted from accessing devices directly, although such accesses may be allowed to improve performance at the cost of security and stability.

In some sense DMA is simply an intermediate step to general purpose programmability on devices and device controllers. Several such smart controllers exist, with features ranging from bit swapping to digital signal processing, checksum calculation, encryption, and compression, up to general purpose processors.

Dealing with that programmability requires synchronization and care. Moreover, in order for code to be portable, writing an interface to such smart peripherals is often a delicate balancing act between making features available and making the device unrecognizable.

I/O Software

The I/O software of the OS has several goals:

• Device Independence: All peripherals performing the same function should have the same interface. All disks should present logical blocks. All network adapters should accept packets. The protection of devices should be managed consistently; for example, devices should all be accessible by capability, or all by the file system. In practice this ideal is tempered by the need to expose some features of the hardware.

• Uniform Naming: The OS needs to have a way to describe the various devices in the system so that it can administer them. Again the naming system should be as flexible as possible. Systems also have to deal with devices joining or leaving the name space (PCMCIA cards).

• Device Sharing: Most devices are shared at some granularity by processes on a general purpose computer. It’s the I/O system’s job to make sure that sharing is fair (for some fairness metric) and efficient.

• Error Handling: Devices can often deal with errors without user input - retrying a disk read or something similar. Fatal errors need to be communicated to the user in an understandable manner as well. Furthermore, although hiding errors can be good at some level, at other levels they should be seen. Users must be able to tell that their disks are slowly failing.

• Synchrony and Asynchrony: The I/O system needs to deal with the fact that external devices are not synchronized with the internal clock of the CPU. Events on disk drives occur without any regard for the state of the CPU, and the CPU must deal with that. The I/O system code is what turns the asynchronous interrupts into system events that can be handled by the CPU.

Software Levels

Interrupt Handlers

The Interrupt Service Routines (ISRs) are short routines designed to turn the asynchronous events from devices (and controllers) into synchronous ones that the operating system can deal with in time. While an ISR is executing, some set of interrupts is usually blocked, which is a dangerous state of affairs that should be avoided as much as possible.

ISRs generally encode the information about the interrupt into some queue that the OS checks regularly - e.g. on a context switch.

Device Drivers: Device drivers are primarily responsible for issuing the low-level commands to the hardware that get the hardware to do what the OS wants. As a result, much of their code is hardware dependent.

Conceptually, perhaps the most important facet of device drivers is the conversion from logical to physical addressing. The OS may be coded in terms of logical block numbers for a file, but it is the device driver that converts such logical addresses to real physical addresses and encodes them in a form that the hardware can understand.

Device drivers may also be responsible for programming smart controllers, multiplexing requests and demultiplexing responses, and measuring and reporting device performance.

Device Independent OS Code: This is the part of the OS we’ve really been talking the most about. This part of the OS provides consistent device naming and interfaces to the users. It enforces protection, and does logical level caching and buffering.

In addition to providing a uniform interface, the uniform interface is sometimes pierced at this level to expose specific hardware features -- CD audio capabilities, for instance.

The device independent code also provides a consistent error mode to users, letting them know what general errors occurred when the device driver couldn’t recover.

User Code: Even the OS code is relatively rough and ready. User libraries provide simpler interfaces to I/O systems.

A good example is the standard I/O library, which provides a simplified interface to the file system: printf and fopen are easier to use than write and open. Specifically, such libraries handle data formatting and buffering.

Beyond that there are user level programs that specifically provide I/O services (daemons). Such programs spool data, or directly provide the services users require.

Device Driver Specifics: We now consider some of the details of device drivers. Each device is different, with different purposes, different implementations, and different worldviews. We will consider a representative sample of several kinds of device drivers and the specifics of how they work.

Disk Drives: Disk drivers control one or more physical magnetic disks. The most common use of disks is for file systems, but they are also used for swap space, raw backups, databases, and assorted other purposes. The device driver is responsible for logical → physical translations, multiplexing and demultiplexing data transfers, and error handling.

Block Naming: In the simplest case, the logical → physical mapping is just a matter of placing an ordering on the physical disk sectors that matches the logical disk sectors. Differences between logical and physical block sizes can confuse this.

The two may differ because multiple disks managed by the same OS may have different physical block sizes. Also, because the file system overhead may depend strongly on the size of the OS’s logical blocks, a logical block size much in excess of the disk’s sector size is sometimes chosen to keep the size of the OS tables small. One example of this is large disks using a FAT file system. The size of the FAT is directly dependent on the number of sectors (and is bounded by the size of the entry in a FAT cell). Addressing a large disk partition may require using large block sizes.

Error handling may also interfere with a straightforward logical → physical mapping. Some smart disk systems leave a few blocks unassigned when the disk is formatted, for use when other blocks go bad. When a bad block is detected, the data is moved from the bad block to one of the set-aside blocks and the logical → physical mapping is changed so that the logical block maps to the set-aside block rather than the bad block. From the OS point of view, this transparently repairs bad blocks.

Originally, such block-shuffling shenanigans were all done in software, but as disk controllers (and drives) get smarter, this remapping is often done in the hardware. This can make life much trickier.

Multiplexing and Arm Scheduling

The first issue in multiplexing multiple requests is doing so efficiently. Efficiency in this case means returning data to the users as quickly as possible. Most of the latency in serving a disk request is seek time, i.e., the time spent moving the disk arm over the proper track. Scheduling the requests to minimize seek time is often handled by the device driver.

Scheduling user disk requests is called arm scheduling, because the problem is essentially scheduling accesses by the disk arm. Some familiar algorithms appear, as well as some new ones. We will illustrate the scheduling algorithms by using them to order requests for the following tracks (in the order they were queued): 11, 1, 36, 16, 34, 9, 12

• FIFO: The requests are served in the order they appear: 11, 1, 36, 16, 34, 9, 12. This is easy to implement, but almost never used. Without optimizing at all, seek times can vary widely based on the applications making requests.

• Shortest Seek First (SSF): This is analogous to shortest job first; the request that requires moving the arm the least distance is served next: 11, 12, 9, 16, 1, 34, 36. The problem with this algorithm is that it encourages access patterns that keep the disk head in one place. Lone accesses to distant tracks suffer very long access times, or, in the worst case, never get served at all. A compromise between optimization and fairness is needed.

• The Elevator Algorithm: The elevator algorithm tries to keep the disk arm moving in one direction. On our canonical input, with the head moving toward higher tracks, the access pattern is 11, 12, 16, 34, 36, 9, 1. In practice the elevator algorithm strikes a good balance between efficiency and fairness. (A complete sketch of this ordering follows the list.)
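The elevator ordering is simple enough to sketch completely. This toy C program (the head position and request list are the example above) sorts the queue, sweeps upward from the head, then reverses:

#include <stdio.h>
#include <stdlib.h>

static int cmp_asc(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* Serve every request at or above the head in ascending order,
 * then the remaining requests in descending order. */
void elevator(int head, int *req, int n)
{
    qsort(req, n, sizeof(int), cmp_asc);
    int up = 0;
    while (up < n && req[up] < head)
        up++;                          /* first track >= head */
    for (int i = up; i < n; i++)
        printf("%d ", req[i]);         /* sweep upward  */
    for (int i = up - 1; i >= 0; i--)
        printf("%d ", req[i]);         /* reverse sweep */
    printf("\n");
}

int main(void)
{
    int req[] = { 11, 1, 36, 16, 34, 9, 12 };
    elevator(11, req, 7);    /* prints: 11 12 16 34 36 9 1 */
    return 0;
}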

There are other issues in multiplexing, mostly related to how intelligent the controller is. An intelligent controller may be able to handle several outstanding requests from the software, in which case the device driver needs to do a little bookkeeping, but can generally leave it to the controller. Of course, if the controller cannot multiplex, the simple arm scheduling above applies.

Error Handling

The disk driver is the first element of the OS to see an error. Accordingly, it has to adopt some strategy for which errors to try to correct and which to report. There are a variety of things that can go wrong in the disk, and Tanenbaum discusses quite a few. In most cases, the appropriate response to the error is to reset some confused part of the hardware and retry the operation.

Errors that are resolved by this are transient errors. In some cases, transient errors can be ignored. Some of them are not reproducible, and will never bother the system again. Some, however, are predictors of future woe. When a sector shows a sharp increase in checksum errors, it’s likely that the sector or the disk itself is wearing out. A human being or a higher level of the operating system may want to check into the matter further. Good device drivers try to strike a balance between reporting too many and reporting too few errors.

Bigger, Faster, Smarter, More

Disk drives and controllers are getting smarter by the revision. Functionality that was traditionally in the drivers is now being moved to controllers and disk firmware. As a result, device drivers are less concerned with directly manipulating the devices than with programming the controllers to do so. Some important functionalities that have begun to appear in disk hardware:

• Interleaving. The Berkeley fast file system did clever layout of file blocks within a track to reduce the rotational latency when the file was read sequentially. Many disks today encode this layout as the interleaving of sectors on the disk. The disk firmware renumbers the sectors rather than the OS doing layout.

• Caching. Most disks and disk controllers cache one or more tracks of data in memory on the hardware. Sectors read from those tracks are read not from disk, but from the memory, eliminating the rotational and seek delays.

• Bad blocks. As mentioned above, some disk firmware locates and remaps bad blocks on disk directly.

• Arm Scheduling. Smart disk controllers, for example high-end SCSI controllers, allow several outstanding simultaneous requests for data from the same disk. The controller firmware schedules the requests internally.

In some sense, this is all positive news. Hardware is getting smarter, the software has less to do and life is great. There are two problems: the software and hardware may be unaware of each other, and burning algorithms in hardware makes them hard to change.

For an example of the hardware and software being at cross purposes, consider the disk interleaving case. The file system spends some additional time when blocks are allocated to ensure that they’re placed to minimize latency, and then the hardware moves them again because of its interleaving. The result is a block layout that is almost certainly suboptimal. You can find similar problems with the other helpful features above: transparently repaired bad blocks may show up as a performance penalty; caching sectors both in memory and on disk is wasteful and degrades the value of one of the caches; and spending CPU time to schedule the disk arm in the device driver only to have the controller run the same scheduling algorithm in hardware is a waste of CPU time.

The solution is to make sure that the hardware and OS are aware of what the other is doing. Ideally, the OS should detect hardware features and either disable the ones that are replicated in the OS, or disable the OS routines that do work done by the controller. In practice this may be less straightforward.

The other problem, lack of flexibility, exists primarily if the features of the hardware cannot be disabled.


As we have seen, for every scheduling algorithm there is a counter scheduling strategy that confuses it. If your workload is a counter strategy for the algorithms wired into your hardware, performance will suffer. It’s frequently faster to tune or recode an algorithm in the OS than in hardware, but if you can’t disable the smart feature of the hardware, you’re sunk.

Terminals

Terminal is a generic term for the keyboard/screen pair through which much of the computer input in the world today occurs. In times past, this was largely through serial line (RS-232) terminals that passed data a bit at a time, although these days a large number of terminals are intelligent or memory mapped (or both). The console keyboard, screen and mouse of a PC or workstation fall into the latter category.

Serial terminals conceptually process data a character or line at a time. (In reality, data may be transmitted a bit at a time down the serial wire, but the device generally collects 7 or 8 bit words to work with.) Particularly intelligent terminals may have data stream editing capabilities built in, or they may be provided by the device driver. When we say character editing capabilities, we mean everything from cursor control to simple character erasures.

To effect such editing, the device driver often collects characters as the keyboard transmits them, only committing the characters to the standard input of the running process when the enter key is sent. Smart terminals may do the same thing in the hardware.
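A sketch of that collection loop in C; the commit_line routine (which would hand the finished line to the reading process) and the particular edit keys are assumptions.

#define LINE_MAX_LEN 256

extern void commit_line(const char *line, int len);  /* assumed */

static char line[LINE_MAX_LEN];
static int  pos;

/* Called by the ISR for each character the keyboard transmits. */
void driver_receive(char c)
{
    if (c == '\b' || c == 0x7F) {         /* backspace or delete:
                                             erase one character   */
        if (pos > 0)
            pos--;
    } else if (c == '\n' || c == '\r') {  /* enter: commit the line */
        commit_line(line, pos);
        pos = 0;
    } else if (pos < LINE_MAX_LEN) {
        line[pos++] = c;                  /* just collect it */
    }
}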

For output, terminals have a simple command language. A particular string of non-printing characters may serve to move the output cursor (the point at which the next character will be output) to a given position, or to clear or scroll areas of the screen1. The device driver is responsible for arranging for canonical output control sequences to be translated into the specific sequences that the terminal hardware understands.

Even in a world of windowing systems and GUIs, the concept of a terminal is useful for its simplicity and power. The concepts carry over to line-driven modem systems and other simple devices - for example, Coke machines.2

On the other end of the spectrum are bit-mapped displays and modern graphics processors. A bit-mapped display has its drawing memory directly accessible to the driver in a way that allows the driver to draw graphics on the screen directly. Coupled with a pointing device, this allows completely graphically oriented user interfaces (GUIs).

The screen driver directly draws the windows and other screen cues that allow a user to navigate the GUI. There are usually routines either in the kernel or in user libraries to facilitate such constructs. The interface contains simple interface elements (called Widgets or Gadgets or other things depending on whether you’re on an X machine, an Amiga or some lesser machine). These libraries create the visual cues for the user and respond to events generated by the pointer driver.

The pointer driver keeps track of the user’s reference point into the bit-mapped display used to manipulate GUI elements. The driver receives interrupts from the pointer whenever the pointer device moves, changing the location of the reference point. Generally any motion results in an interrupt, so the ISRs that follow the pointer must be quick. Most pointers supply only relative motion events; the device driver must track the reference point and update the GUI element indicating its location.3


The pointer and drawing routines work together to provide an event stream to the OS and applications, which the applications can then use to get user input. Events are asynchronous notifications that contain information like “the user has activated button number 5” or “the reference pointer is over slider 2.” The exact mechanisms of delivering events vary widely, and this class won’t really discuss them in detail.

Hopefully your understanding of IPC already has given you some ideas: a record-based file of events; signals that carry additional information; monitors with procedures defined for various events.

1. Some terminals allow individual pixels to be addressed or vectors drawn for graphics capabilities.

3. Some devices like touch screens and graphics pads do give absolute coordinate locations.

Beyond bit mapped displays are terminals with graphics co-processors. These are dedicated processors that do nothing but render detailed images on the terminal, perhaps employing sophisticated lighting and texture effects. The language used to communicate with such processors may be very intricate, and using it more closely resembles programming a multiprocessor than running a peripheral. We won’t discuss these in detail either, but again your experience with interprocess communication should give you some ideas of the interfaces in use here: locks and semaphores to control access to the lists of polygons to be drawn by the co-processor but arranged by the main CPU, for example, or context switching a color map on the co-processor when the reference point moves from one window to another.

There are other devices in the world, too, that we won’t have time to investigate: sound recording and playback systems that input and output sampled streams of data that have to be filtered in real time; network adapters that need to fragment and reassemble large data blocks into small ones to be transferred across a network (and that may have to determine a route across such a network); and other, stranger things.


CHAPTER 12

NETWORKING

Network

The network layer connects hosts on different physical networks. It extends the ideas of addressing, naming, and routing to their global extreme. The headers added at the network layer are independent of the network hardware.

The network layer solves some difficult distributed problems, e.g., how to store routes from every host to every host efficiently. Actually, it just makes sure that certain routers in the network know enough to send the packet in the right general direction, with each router knowing more about its local area.

I don’t have time to really address these problems in this class, but I strongly advise you to check out one of the networking classes to find out for yourself.

Computer Networking is becoming a bigger and bigger issue every day. It’s a versatile and inexpensive way to share resources and trade data. This section addresses the basic OS issues involved in communicating between computers.

Network vs. System

[Figure: a domain of hosts connected by routers, forming a network]

That’s the same diagram from our discussion of the I/O system, only relabeled to represent a computer network. Some of the issues are remarkably similar. The system still has to address:

• Asynchrony: Events on different hosts are not synchronized.

• Data corruption and reordering: Reordering is similar to the problem of multiplexing responses, and is handled the same way: data in the packets can be used to order them. Because there are more sources of error in the network, the OS has to address errors directly.

• Buffering: Each host is responsible for queuing data until the interested process retrieves it, similar to the way disk blocks are queued.

There are some significant differences between a network and a hardware system, though:

• Autonomy: Individual pieces of hardware in a system are all controlled by the same entity, the owner of the machine or the CPU. In a network, each host may be an autonomous (or self-controlling) entity, with goals that may be in direct opposition to those of other hosts and no central authority to which to appeal to resolve conflicts. If this weren’t enough to worry about in the abstract, there is the problem of two communicating entities sitting in different human domains: the legal requirements on the hardware or even the data content are often at issue.

• Latency: The latency between a CPU and a disk is a few tens of milliseconds at worst, and this is perceived as a glacial pace. The round trip time to a geosynchronous satellite is a quarter second. There are documented reports of packets taking minutes to get from host to host in the Internet. Latencies are often considerably higher in a network, but sometimes they are lower: hosts on an uncongested LAN sometimes use a distributed file system to reduce disk latency. The range of latencies with which systems have to contend is the real issue.

• Connectivity Richness: In a physical box, there are only so many elements that can sit physically on one bus, so the CPU need only concern itself with a few entities. There are millions of computers connected to the Internet. Systems have to exhibit vastly different scaling properties in networks.

Basic Concepts

Because there are so many more elements connected sparsely, issues of naming, addressing, and routing become paramount. It’s important to grasp the distinction between a name, an address, and a route.

Names are a convenient way for humans (or programs) to refer to an entity. My name is Ted; my computer’s name is vermouth.isi.edu. In both cases this is just a convenient string of characters that refers to a physical entity.

Addresses are a special kind of name that can be used to plot a path to reach one of the entities.

My address in Madison, Wisconsin was 765 W. Washington Ave. #302, Madison, WI 53715. My computer’s address is 128.9.160.247. These names are special because they can be used to convey information to the place they name. (Not all things that are namable have an address.)


Routes are a description of how to convey information between two addresses. A route to my address in Wisconsin from USC would be the sets of interstate highways and side streets to use to get from here to there. A route to my computer from a computer at USC would be a list of the IP addresses to pass through in order.

In some sense, addresses and routes are the only entities that are required for networking, but having names is so useful that most general purpose networking systems provide some naming mechanism.

Defining and allocating these entities is one of the most difficult parts of networking. That the Internet provides a global (in the purest sense of that word) naming, addressing, and routing system is nothing short of phenomenal.1 This is only possible because the system was designed to scale to global sizes. Even so, cracks are showing: the address space may not be large enough, routing information taxes the ability of hardware to store and search it, and the one part that we had working, naming, is under assault from lawyers.

Another basic concept that underlies networking is the protocol. A protocol is a set of rules that communicating entities follow in order to communicate meaningfully. For example, exchanging electronic mail requires a sequence of exchanges between the mailing machine and the receiving (or forwarding) machine. The mailer identifies itself, the receiver acknowledges it, the mailer tells who the mail is from, the receiver accepts or rejects the address, the mailer tells who the mail is to, and the receiver again accepts or rejects, and finally the message is exchanged and acknowledged. That set of rules is a protocol.

Protocols give rise to standards. A standard is a formal presentation of a protocol that has been sanctioned by some official body. For example, the electronic mail protocol above has been sanctioned by the Internet Engineering Task Force (IETF). If your system claims to exchange RFC822-compliant mail2, it must follow those rules - this is called conforming to the standard. Of course, if your mailer doesn’t conform to the standard but sends mail without losing any, the only thing that happens is that you can’t put an RFC822-compliant sticker on it. However, because they represent an agreement between major practitioners of the field, conforming to standards provides a loose guarantee that systems will interoperate.

Another word that networkers use a lot is packet. A packet is like a disk block - it’s an element of data exchanged between two hosts. Depending on the underlying hardware and its associated protocols, packets may be fixed or variable length, and there are different maxima and minima for the various packet parameters. It’s best to think of them as atomic elements of networking, although as we’ll see, that can be an illusion.

The final basic distinction to draw is between connection-oriented and connectionless communications.

This is exactly the difference between the post office and the phone system. In the mail, two units of transmission (two letters) have no relation to each other. If you send them to the same place on the same day, you can’t usually tell what order they were sent, or even whether they bear any relation to each other - even if they were sent between the same two people. There’s no state that ties them together. On a phone call, the various transmissions (words, or different family members talking) do have some relation, and that relation is encapsulated by the idea of a call. The words that go in one end of the phone are not arbitrarily reordered, for example.

1. The Internet is not the only such system, of course. Postal addresses form a similar if less structured name/address/route space. The impressive part of the internet is that the space is well enough defined that machines can move the data with minimal per-message human intervention.

2. Internet standards are presented in Request For Comments documents - RFC for short.


There are networks that support both these paradigms. In fact, each can be supported by the other: electronic mail is connectionless in the sense that each piece of mail has no ordering relative to others, yet the mail is transferred using TCP, a connection-oriented protocol.

The Seven Layer Model

The seven layer model is the OSI (an international standards body) model for designing networking. As a tool for understanding the various issues in networking, it’s not bad. As a model for implementation, it’s a recipe for a slow network. We’ll use it to talk about protocol design, but think more seriously about what you’re doing before you implement something this way.

Each level of the stack provides services to the layers above it using building blocks provided by the layers below it. Conceptually, this is very nice, but we’ll see that some services are replicated and some don’t fit neatly into a layer.

Conceptually, each layer adds a header to outgoing packets, and strips it off incoming packets before passing the packet up or down the stack as the case may be. Numbering the layers from the bottom up, an outgoing packet carries the headers nested one inside the other, with the highest layer’s header innermost.

Physical

The physical layer specifies the format of bits on the wire and what kinds of wire you can use. This is very nuts and bolts electrical (or optical!) engineering stuff, and I won’t discuss it in any great detail.

Each type of hardware has its own standard: there’s an Ethernet standard, a FDDI standard, an X.25 standard, and a bunch more. They tell you what kinds of hardware to buy, how far apart nodes can be (or must be), and what you’d see if you hooked up an oscilloscope (or spectrum analyzer) to the medium.

Link

The link layer describes the protocols used by communicating nodes connected by the same physical hardware. The scope of names, addresses and routes is therefore constrained. In a shared medium network3, the link layer is responsible for medium access. Medium access is the process of determining which host has the right to send information on the shared medium. There are many ways to do this. Ethernet uses CSMA/CD (Carrier Sense Multiple Access with Collision Detection), which means that each host listens to the shared line and doesn’t send until the line is silent. That’s the CSMA; the CD is that even listening beforehand, it’s possible for two hosts far enough apart to hear the line clear, begin transmitting, and have their signals collide. If that happens, they both stop transmitting and remain silent for a random time period before trying again. The time they remain silent gets geometrically larger.

Other medium access methods involve passing a token from host to host. Like the conch in Lord of the Flies, the token gives the holder the right to send uninterrupted. Tokens generally have a fixed lifetime, so a host can only transmit for a given time period before it is forced to relinquish the token and pass it to the next host. The protocols guarantee that every host gets the token eventually. FDDI (Fiber Distributed Data Interface) and token rings use tokens.

The link layer is also the first layer that detects (and potentially recovers from) transmission errors.


This is usually accomplished by including a checksum in each packet. A checksum is a mathematical function that depends on the full contents of the packet, like the one-way functions used for authentication.

Upon receiving the packet, a host recomputes the function (treating the checksum field as 0) and rejects the packet unless it gets the same answer as the packet contained.

3. Ethernets are a shared medium because many hosts use the same wire to communicate, while dial-up modems are a point-to-point medium because the connections directly connect only two hosts. I’d use the party line analogy, but I fear that no one of an age to read or hear this knows what one is.

Choosing a good checksum is a difficult tradeoff. The more effective the checksum is at detecting errors, the slower it is to calculate. Because the checksum must be calculated for every packet, the speed of calculating it can determine the network speed. The science of constructing efficient strong checksums is interesting in its own right.
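As one concrete point on that tradeoff, here is the 16-bit ones’-complement checksum used by IP: very cheap to compute, but weaker than a CRC at catching some error patterns.

#include <stddef.h>
#include <stdint.h>

uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                  /* sum 16-bit words */
        sum += (uint32_t)data[0] << 8 | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)                           /* pad a trailing odd byte */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)                  /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;  /* the receiver recomputes over the whole
                               packet, checksum field included, and
                               rejects it unless the result checks out */
}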

Some link layers correct errors, for example by labeling the packets, acknowledging each packet receipt, and retransmitting a packet if there is no acknowledgement in a reasonable time. Another approach is to send redundant data and reconstruct damaged packets. This idea is also used in disk drive arrays, like RAID. You’ll hear more about it in CS 555.

Transport

The transport layer provides link-layer-style guarantees across the network layer. For example, transport resends lost packets and prevents reordering. Techniques similar to the link layer’s are used for these.

Transport also provides demultiplexing within the computer systems. The network layer can name, address and route to a given computer. Within the computer, the transport layer provides a way to name, address and route to given processes.

Transport is also the layer that addresses global performance of the network, for example congestion control and resource allocations.

Session: The session layer provides further multiplexing, control over which endpoint is sending data, and some checkpointing behavior. It’s not often used, and it’s something of an open question whether this functionality is important.

Presentation: This layer is responsible for reformatting data between machines, and providing data-based semantics.

Converting floating point formats between hosts or only returning packets that contain a given type field are things that fall under Presentation’s umbrella.

Like the Session layer, Presentation isn’t often used.

Application: These are protocols designed to carry out some useful, concrete service. SMTP - the email protocol - is an application layer protocol. So is HTTP (although it’s extending its tentacles into Session and Presentation as well).

These are also standardized; there are lengthy documents on what a valid HTTP request looks like or on what behavior an FTP server has to support. They’re dry reading, but important to interoperability.


Other Global Issues

I probably won’t have time to talk about these in detail, but other networking issues include:

• Scalable Naming

• Authentication and Security

• Network Management

• Other Communication models - broadcast & multicast

• Performance tuning of protocols

• Active Networking

Networking Note

Networking deals with interconnected groups of machines talking with each other. It is a very different field than operating systems. There is a lot of standards work, because everyone must agree on what to do when machines are connected together.

What is a network? A collection of machines, links and switches set up so that machines can communicate with each other. Some examples:

• Telephone system. Machines are telephones, links are the telephone lines and switches are the phone switches.

• Ethernet. Machines are computers, there is one link (the Ethernet) and no switches.

• Internet. Machines are computers, there are multiple links, both long-haul and local-area links. The switches are gateways.

A message may have to traverse multiple links and multiple switches to go from source to destination.

Circuit-switched versus packet-switched networks. Basic disadvantage of circuit-switched networks: they cannot use resources flexibly. Basic advantage of circuit-switched networks: they deliver a guaranteed resource.

Basic Networking Concepts:

• Packetization.

• Addressing.

• Routing.

• Buffering.

• Congestion.

• Flow control.

• Unreliable Delivery.


• Fragmentation.

Local Area Networks. Connect machines in a fairly close geographic area. Standard for many years: Ethernet. Standardized by Xerox, Intel and DEC in 1978. Still in wide use.

Physical hardware technology: coax cable about 1/2 inch thick. Maximum length: 500 meters. Can extend with repeaters. Can only have two repeaters between any two machines, so maximum length is 1500 meters.

Vampire taps to connect machines to Ethernet. Attach an ethernet transceiver to tap; the transceiver does the connection between the Ethernet and the host interface. The host interface then connects to the host machine.

Ethernet is 10 Mbps bus with distributed access control. It is a broadcast medium - all transceivers see all packets and pass all packets to host interface. The host interface chooses packets the host should receive and discards others.

Access scheme: Carrier sense multiple access with collision detection. Each access point senses the carrier wave to figure out if the medium is idle. To transmit, a machine waits until the carrier is idle, then starts transmitting. Each transmission consists of a packet; there is a maximum packet size.

Collision detection and recovery. Transceivers monitor the carrier during transmission to detect interference. Interference can happen if two transceivers start sending at the same time. If interference happens, a transceiver detects a collision.

When a collision is detected, the transceiver uses a binary exponential backoff policy to retry the send, adding a random delay to avoid synchronized retries.

Is there a fixed bound on how long it will take a packet to get successfully transmitted? Is any packet guaranteed to be transmitted at all?
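A sketch of the retry loop in C, using the classic Ethernet parameters (backoff window capped at 2^10 slots, give up after 16 attempts); try_send, delay_us, and the slot time constant are assumed primitives. Note that the loop can give up entirely, which answers both questions above: no.

#include <stdlib.h>

#define SLOT_TIME_US 51    /* roughly one 10 Mbps Ethernet slot time */
#define MAX_EXP      10    /* cap the backoff window at 2^10 slots   */

extern int  try_send(const void *frame, int len);  /* 0 on collision */
extern void delay_us(unsigned us);

int ether_send(const void *frame, int len)
{
    for (int attempt = 1; attempt <= 16; attempt++) {
        if (try_send(frame, len))
            return 0;                         /* success */
        int exp   = attempt < MAX_EXP ? attempt : MAX_EXP;
        int slots = rand() % (1 << exp);      /* random delay avoids
                                                 synchronized retries */
        delay_us((unsigned)(slots * SLOT_TIME_US));
    }
    return -1;   /* give up: delivery is never guaranteed */
}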

Addressing. Each host interface has a hardware address built into it. Addresses are 48 bits long. When the host interface hardware changes, the address changes.

There are three kinds of addresses:

• Physical address of one network interface.

• Broadcast address for the network. (All 1's).

• Multicast addresses for a subset of machines on network.

Host interface looks at all packets on the ethernet. It passes a packet on to the host if the address in the packet matches its physical address or the broadcast address. Some host interfaces can also recognize several multicast addresses, and pass packets with those addresses on to the host.

How do vendors avoid ethernet physical address clashes? Buy blocks of addresses from a central authority.

Packet (frame) format.

• Preamble. 64 bits of alternating 1 and 0, to synchronize receivers.

• Destination address. 48 bits.


• Source address. 48 bits.

• Packet type. 16 bits. Helps OS route packets.

• Data. 368-12000 bits (46-1500 bytes).

• CRC. 32 bits.

Ethernet frames are self-identifying. Can just look at frame and know what to do with it. Can multiplex multiple protocols on same machine and network without problems. CRC lets machine identify corrupted packets.
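The layout above, written as a C struct for orientation. The preamble is consumed by the hardware and never reaches software, so it is omitted; this is a sketch, not a portable wire-format definition (real code must handle padding, byte order, and the variable-length data field).

#include <stdint.h>

struct ether_frame {
    uint8_t  dest[6];      /* destination address, 48 bits         */
    uint8_t  src[6];       /* source address, 48 bits              */
    uint16_t type;         /* packet type: lets the OS demultiplex */
    uint8_t  data[1500];   /* payload, 46 to 1500 bytes            */
    uint32_t crc;          /* 32-bit checksum                      */
};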

Token-ring networks. Alternative to ethernet style networks. Arrange network in a ring, and pass a token around that lets machine transmit. Message flows around network until reaches destination. Some problems: long latency, token regeneration.

ARPANET. Ancestor of current Internet. Long-haul packet-switched network. Consisted of about 50 C30 and C300 BBN computers in US and Europe connected by long-haul leased data lines. All computers are dedicated packet-switching machines (PSNs).

Interesting fact: ARPANET, like highway system, was initially a DOD project set up officially for defense purposes.

In the original ARPANET, each computer connected to the ARPANET connected directly to a PSN. Each packet contained the address of the destination machine, and the PSN network routed the packet to that machine. Now this is totally impractical; there is a much more complex local structure before packets get onto the Internet.

Design of Internet driven by several factors.

• Will have multiple networks. Different vendors compete, plus have different technical tradeoffs for local area, wide area and long haul networks.

• People want universal interconnection.

There will be multiple networks around the world. An internetwork, or internet, connects the different networks. So the job of an internet is to route packets between networks.

One goal of internet: Network transparency. Want to have a universal space of machine identifiers and refer to all machines on the internet using this universal space of machine identifiers. Do not want to impose a specific interconnection topology or hardware structure.

Internet architecture. Connect two networks using a gateway machine. The job of the gateway is to route packets from one network to another.

As network topologies become more complicated, gateways must understand how to route data through intermediate networks to reach final destination on a remote network.

In Internet, gateways provide all interconnections between physical networks. All gateways route packets based on the network that the destination is on.

Internet addressing. Each host on the Internet has a unique 32-bit address that is used for all Internet traffic to that host. Each internet address is a (netid, hostid) pair. The netid identifies the network that the host is on; the hostid identifies the host within the network.


The classes of Internet addresses (a decoding sketch follows the list):

• Class A. First Bit: 0. Bits 1-7: Netid. Bits 8-31: Hostid. Can have 128 Class A networks.

• Class B. Bits 0-1: 10. Bits 2-15: Netid. Bits 16-31: Hostid. Can have 16,384 Class B networks.

• Class C. Bits 0-2: 110. Bits 3-23: Netid. Bits 24-31: Hostid. Can have about 2 million Class C networks.

• Class D. (multicast addresses). Bits 0-3: 1110. Used for Internet multicast.

• Class E. Bits 0-3: 1111. Reserved.
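A small C function that decodes the class from the leading bits, following the table above:

#include <stdint.h>

char addr_class(uint32_t addr)
{
    if ((addr >> 31) == 0x0) return 'A';   /* leading 0    */
    if ((addr >> 30) == 0x2) return 'B';   /* leading 10   */
    if ((addr >> 29) == 0x6) return 'C';   /* leading 110  */
    if ((addr >> 28) == 0xE) return 'D';   /* leading 1110 */
    return 'E';                            /* leading 1111 */
}

For example, 36.8.0.47 has first byte 36 = 00100100 in binary, so its leading bit is 0 and it is a class A address.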

Interesting point: The whole structure of the internet is available in RFCs (Requests For Comments), available over the Internet - use the net search functionality for RFC and you'll find pointers. You can read them to figure out what is going on.

Gateways can extract the network portion of an address quickly. Gateways have two responsibilities:

• Route packets based on network id to a gateway connected to that network.

• If they are connected to destination network, make sure the packet gets delivered to correct machine on that network.

Conceptually, an Internet address identifies a host. Exceptions: gateways have multiple internet addresses, at least one per network that they are connected to.

Because network id is encoded in Internet address, a machine's internet address must change if it switches networks.

Dotted Decimal notation: Reading Internet addresses. Four decimal integers, with each integer representing one byte.

• cs.stanford.edu - 36.8.0.47 (What class of network is it on?)

• cs.ucsb.edu - 128.111.41.20

• ecrc.de - 141.1.1.1

• lcs.mit.edu - 18.26.0.36

• sri.org - 199.88.22.5

Who assigns internet addresses? The Network Information Center! A centralized authority. It just allocates network ids, leaving requesting authority to allocate host ids.

Mapping Internet addresses to Physical Network addresses. Will discuss case when physical network is an Ethernet. Given a 32 bit Internet address, gateway must map to a 48 bit Ethernet address. Uses Address Resolution Protocol (ARP).

Gateway broadcasts a packet containing the Internet address of the machine that it wants to send the packet to. When machine receives packet, it sends back a response containing its physical address. Gateway uses physical address to send packet directly to machine.


Also works for machines on same network even when they are not gateways.

Use an address resolution cache to eliminate most ARP traffic.

ARP requests and responses are carried in frames with a specific type field: 0806. (RARP, described below, uses type 8035.) These values are set by the Ethernet standard authority.

How does a machine find out its Internet address? It is stored on disk, and the machine looks there when it boots up. What if it is diskless? It contacts a server and finds out using Reverse ARP (RARP). See RFC 903 (Ross Finlayson, etc.).

A RARP request is broadcast to all machines on the network. The RARP server looks at the physical address of the requestor and sends it a RARP response containing the Internet address. Usually there is a primary RARP server, to avoid excessive traffic.

Now switch to talking about IP - the Internet Protocol. The internet conceptually has three kinds of services layered on top of each other: Connectionless, unreliable packet delivery service, reliable transport service, and application services. IP is the lowest level - the packet delivery.

The basic unit of transfer in the Internet is the IP datagram. IP datagram has header and data. Header contains internet addresses and the Internet routes IP datagrams based on Internet addresses in header.

Internet makes a best effort attempt to deliver each datagram, but does not deal with error cases. In particular, can have:

• Lost Packets

• Duplicated Packets

• Out of order Packets

Higher level software layered on top of IP deals with these conditions.

IP packets always travel from gateway to gateway across physical networks. If the IP packet is larger than the physical network frame size, the IP packet will be fragmented: chopped up into multiple physical packets. IP is designed to deal with this situation and provides for fragmentation.

Once a packet has been fragmented, it must be reassembled back into a complete packet. Fragments are usually reassembled only when they reach their final destination, but one could build a system that reassembled fragments when they got to a physical network with a larger frame size.

Why is there a need for fragmentation? There is no good way to impose a uniform packet size on all networks. Some networks may support large packets for performance, while others can only route small packets. We should not prevent some networks from using large packets just because there exists a network somewhere in the world that cannot handle them. But we must then be able to route large packets through a network that only handles small packets - network transparency.

Important fields in IP header:

• VERS: protocol version.

• LEN: length of header, in 32-bit words.

• TOTAL LEN: total length of IP packet.


• SOURCE IP ADDRESS: IP address of source machine.

• DEST IP ADDRESS: IP address of destination machine.

• TTL: time to live. How many hops the packet may take without getting removed from Internet. Every time a gateway forwards the packet, it decrements this field. Required to deal with things like cycles in routing, etc.

• IDENT: packet identifier. Unique for each source. Typically, source maintains a global counter it increments for every IP datagram sent.

• FLAGS: A do not fragment flag (dangerous) and a more fragments flag - 0 marks end of datagram.

• FRAGMENT OFFSET - gives offset of this fragment in original datagram.

How to reassemble a fragmented packet? Allocate a buffer for each packet. Use IDENT and SOURCE IP ADDRESS to identify the original datagram to which the fragment belongs. Use the FRAGMENT OFFSET field to write each fragment into the correct spot in the buffer. Use the more fragments flag to find the end of the original datagram. Use some mechanism to make sure all fragments have arrived before considering the datagram complete.
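A sketch of that recipe in C. Hash lookup by (IDENT, SOURCE IP), timers, and overlap handling are omitted, and offsets are treated as byte counts (real IP measures FRAGMENT OFFSET in 8-byte units).

#include <stdint.h>
#include <string.h>

#define MAX_DGRAM 65536

struct reassembly {
    uint32_t src_ip;          /* SOURCE IP ADDRESS of the datagram */
    uint16_t ident;           /* IDENT of the datagram             */
    uint8_t  data[MAX_DGRAM];
    uint32_t bytes_seen;      /* crude completeness check          */
    uint32_t total_len;       /* known once last fragment is seen  */
};

void add_fragment(struct reassembly *r, uint32_t offset,
                  const uint8_t *frag, uint32_t len, int more_frags)
{
    memcpy(r->data + offset, frag, len);  /* write into its slot   */
    r->bytes_seen += len;                 /* assumes no duplicates */
    if (!more_frags)                      /* flag 0 marks the end  */
        r->total_len = offset + len;
}

int datagram_complete(const struct reassembly *r)
{
    return r->total_len != 0 && r->bytes_seen >= r->total_len;
}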

Routing IP datagrams. There are multiple possible paths between hosts in an internet. How to decide which path for which datagram?

Routing for hosts on the same network. Hosts realize they are on the same network by looking at the netid of the Internet address, and just use the underlying physical network.

Routing for hosts on different networks. Gateways pass datagrams from network to network until reach a gateway connected to destination network.

Each gateway must decide next gateway to send datagram to.

• Source routing. The source specifies the route in the datagram. Useful for debugging and other cases in which Internet should be forced to use a certain route.

• Host-specific routes. Can specify a specific route for each host. Used mostly for debugging.

• Table driven routing. Each gateway has a table indexed by destination network id. Each table entry tells where to send datagrams destined for that network. Do example on page 82.

• Default routes. Specify a default next gateway to be used if other routing algorithms don't give a route.

Most routers use a combination of table driven routing and default routing. They know how to route some packets, and pass others along to a default router. Eventually, all defaults point to a router that knows how to route ALL packets.
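A sketch of that combination in C; the table layout and sizes are invented for illustration.

#include <stdint.h>

struct route {
    uint32_t netid;       /* destination network id   */
    uint32_t next_hop;    /* gateway to forward it to */
};

#define NROUTES 64
struct route table[NROUTES];
int      nroutes;
uint32_t default_gateway;   /* used when nothing matches */

uint32_t route_lookup(uint32_t dest_netid)
{
    for (int i = 0; i < nroutes; i++)
        if (table[i].netid == dest_netid)
            return table[i].next_hop;     /* we know this one      */
    return default_gateway;               /* pass it along; some
                                             router down the chain
                                             can route ALL packets */
}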

How are routing tables acquired and maintained? There are a lot of different protocols, but the basic idea is that the gateways send messages back and forth advertising routes. Each advertisement says that a specific network is reachable via N hops. Some protocols also include information about the different hops. The gateways use the route advertisements to build routing tables.

Internet was originally designed to survive military attacks. It has lots of physical redundancy and its routing algorithm is very dynamic and resilient to change. If a link goes away, the network should be able to route around the failure and still deliver packets. So, routing tables change in response to changes in the network.

In practice this doesn't always work as well as designed. The chief threat to Internet links these days is backhoes, not bombs. A common error is routing all of the links that are supposed to provide physical redundancy through the same fiber run, so that they are all vulnerable to a single backhoe.

In the original Internet, gateways were partitioned into two groups: core and noncore gateways. Core gateways have complete information about routes. The original core gateways used a protocol called GGP (Gateway to Gateway Protocol) to update routing tables.

GGP lets gateways exchange pairs of messages. Each message advertises that the sender can reach a given network N in D hops. The receiver compares its current route to the new route through the sender, and updates its tables to use the new route if it is better.
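
That update rule is a distance-vector step. A sketch, with dist[] and via[] as hypothetical per-gateway arrays:

    /* dist[net] is this gateway's current hop count to each network and
       via[net] the next hop it uses. An advertisement "sender reaches net
       in d hops" implies a route through the sender costing d + 1 hops;
       adopt it only if it beats the current route. */
    void on_advertisement(int net, int d, int sender, int dist[], int via[])
    {
        if (d + 1 < dist[net]) {
            dist[net] = d + 1;
            via[net]  = sender;
        }
    }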

Famous case: the Harvard gateway bug. A memory fault caused it to advertise a 0 hop route to everybody - and since 0 hops beats any real route, traffic from all over converged on the Harvard gateway!

Problem with GGP - distributed shortest path algorithm may take a long time to converge.

A later algorithm (SPF) replicates a complete database of the network topology in every gateway. Each gateway runs a local shortest path computation to build its tables.
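
That local computation is a standard single-source shortest-path run. A textbook O(n^2) Dijkstra sketch over an assumed cost matrix (real routers use cleverer data structures and dynamic topology sizes):

    #include <limits.h>

    #define NNODES 16        /* assumed size of the topology database */
    #define NOLINK INT_MAX   /* marks a missing link in the cost matrix */

    /* cost[u][v] is the link cost between gateways u and v. On return,
       dist[v] is the distance from `self` and via[v] the first hop
       toward v (-1 if unreachable). */
    void spf(int self, const int cost[NNODES][NNODES],
             int dist[NNODES], int via[NNODES])
    {
        int done[NNODES] = {0};
        for (int v = 0; v < NNODES; v++) { dist[v] = NOLINK; via[v] = -1; }
        dist[self] = 0;

        for (int round = 0; round < NNODES; round++) {
            int u = -1;  /* closest node not yet finalized */
            for (int v = 0; v < NNODES; v++)
                if (!done[v] && dist[v] != NOLINK && (u == -1 || dist[v] < dist[u]))
                    u = v;
            if (u == -1) break;  /* everything reachable is done */
            done[u] = 1;

            for (int v = 0; v < NNODES; v++) {  /* relax u's links */
                if (done[v] || cost[u][v] == NOLINK) continue;
                int nd = dist[u] + cost[u][v];
                if (nd < dist[v]) {
                    dist[v] = nd;
                    via[v]  = (u == self) ? v : via[u];  /* first hop out of self */
                }
            }
        }
    }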

In the current Internet, there is no longer any central backbone or authority. Instead, there are internet providers. The whole system has switched over to private enterprise.

A top-down view of the system. There are 4 Network Access Providers (NAPs). Each NAP is a very fast router connected via high-capacity lines to other gateways and NAPs. The lines may be T3 (45 Mb/s) lines. Typically big communications companies (MCI, Sprint, AT&T) own the lines. The lines are typically fiber.

Organizations go to internet providers to get access to the internet. An internet provider buys a bunch of routers (usually from Cisco) and leases a bunch of lines. The internet provider must also buy access to a NAP or to a gateway that leads to a NAP. The routers talk a route advertisement protocol and implement some routing algorithm.

The internet provider can then turn around and sell internet access to whoever wants to buy it. UCSB buys its internet access from CERFNET, paying $23,000 per year. All of the UC schools will band together and buy internet access from MCI, getting more bandwidth but at a higher price.

Organizations tend to chop their communications up into multiple networks, so there are too many networks in the world to give every network its own Internet network id. For example, the UCSB CS department alone has more than 10 networks.

The solution is subnetting. The Internet views the whole organization as having one network. The organization itself chops the host part of the IP address up into a pair of local network and local host. For example, UCSB has one class B Internet network; the third byte of every IP address identifies a local network, and the fourth byte is the host on that network.
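
A sketch of that split, assuming the stated convention (third byte = subnet, fourth byte = host) and using 128.111.41.7 as an illustrative address on a class B network:

    #include <stdint.h>

    /* For a class B address held as a host-order uint32_t, pull out the
       local network (subnet) byte and the host byte. */
    static inline uint8_t subnet_of(uint32_t addr) { return (addr >> 8) & 0xFF; }
    static inline uint8_t host_of(uint32_t addr)   { return addr & 0xFF; }

    /* For 128.111.41.7, subnet_of() returns 41 and host_of() returns 7,
       so an internal router forwards the packet toward local network 41.
       The Internet at large only ever looks at the class B network id. */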

All IP packets from outside come to one UCSB gateway (by default). As far as the Internet is concerned, all of UCSB has only one network.

Inside UCSB, there is a set of networks connected by routers. These routers interpret the IP address as containing a local network identifier and a host on that network, and route the packet within the UCSB domain. The routers periodically advertise routes using a protocol called RIP.

This is an example of hierarchical routing. Internet routes to UCSB gateway based on Internet network id, then routers within UCSB route based on the subnet id.
