APPLICATION-SPECIFIC RESOURCE
MANAGEMENT IN
REAL-TIME OPERATING SYSTEMS
By
Ameet Patil
SUBMITTED FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT THE
UNIVERSITY OF YORK
YORK, UK
SEPTEMBER 2007
© Copyright by Ameet Patil, 2007
To Parents and my wife Sushma
Table of Contents
Table of Contents i
List of Tables v
List of Figures vii
List of Abbreviations and Symbols xi
Abstract xix
Acknowledgements xxi
Declaration xxiii
1 Introduction 1
1.1 Technological Growth Versus Application Complexity . . . . . 3
1.1.1 Resource Constraints . . . . . . . . . . . . . . . . . . . 7
1.2 Resource Management in RTOS . . . . . . . . . . . . . . . . . 8
1.2.1 Existing Approaches to Efficient Resource Management 9
1.3 Reflection Mechanism . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Thesis Proposition . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Resource Management and Operating System Specialisation 19
2.1 Real-time Embedded Systems . . . . . . . . . . . . . . . . . . 19
2.1.1 Types of Real-time Systems . . . . . . . . . . . . . . . 20
2.1.2 Categorising Real-time Embedded Systems . . . . . . . 21
2.1.3 Application-specific RTOS Specialisation . . . . . . . . 23
2.1.4 Resource-Constrained Real-time Embedded Systems . . 27
2.2 Resource Management in an OS . . . . . . . . . . . . . . . . . 28
2.2.1 CPU Resource . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Memory Resource . . . . . . . . . . . . . . . . . . . . . 32
2.3 Operating System Specialisation . . . . . . . . . . . . . . . . . 39
2.3.1 Specialisation of OS policies . . . . . . . . . . . . . . . 40
2.4 Reflection Mechanisms . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 Reflective Programming Languages . . . . . . . . . . . 44
2.4.2 Reflective Middlewares . . . . . . . . . . . . . . . . . . 48
2.4.3 Reflective OSs . . . . . . . . . . . . . . . . . . . . . . . 51
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Reflection in RTOS for Efficient Resource Management 61
3.1 Modifications to Reflection Mechanism . . . . . . . . . . . . . 62
3.1.1 Modifications to the Process of Reification . . . . . . . 63
3.1.2 Role of the Kernel . . . . . . . . . . . . . . . . . . . . 64
3.1.3 Component Privileges . . . . . . . . . . . . . . . . . . 65
3.1.4 Infolevel for Reified Information . . . . . . . . . . . . . 68
3.1.5 Categorisation of Reified Information . . . . . . . . . . 68
3.1.6 Flow of Reified Information . . . . . . . . . . . . . . . 72
3.1.7 In-kernel Reflection Interface . . . . . . . . . . . . . . . 75
3.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Generic Reflective RTOS Framework . . . . . . . . . . . . . . 78
3.2.1 Core Elements of the Framework . . . . . . . . . . . . 79
3.2.2 Optional Elements of the Framework . . . . . . . . . . 80
3.2.3 Reflective System Modules . . . . . . . . . . . . . . . . 82
3.2.4 Reflective Applications . . . . . . . . . . . . . . . . . . 84
3.2.5 Meta Object Protocol for Reflective Components . . . 85
3.3 Prototype Implementation – DAMROS . . . . . . . . . . . . . 86
3.3.1 Reflection Interface in the Kernel . . . . . . . . . . . . 88
3.3.2 The rManager . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.3 The iManager . . . . . . . . . . . . . . . . . . . . . . . 102
3.3.4 Reflective CPU Scheduler (VRHS) . . . . . . . . . . . 111
3.3.5 Reflective Memory Management System (RMMS) . . . 126
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.4.1 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . 137
3.4.2 Changing Application Behaviour . . . . . . . . . . . . 137
3.4.3 Evaluation of VRHS . . . . . . . . . . . . . . . . . . . 139
3.4.4 Evaluation of RMMS . . . . . . . . . . . . . . . . . . . 159
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4 Support for Reification: a Case Study 169
4.1 Paging Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.2 Reification Calls for Paging . . . . . . . . . . . . . . . . . . . 176
4.2.1 keep(<address>, <size>) . . . . . . . . . . . . . . . . 176
4.2.2 discard(<id>) . . . . . . . . . . . . . . . . . . . . . . . 177
4.3 Inserting Reification Calls . . . . . . . . . . . . . . . . . . . . 177
4.4 Manual Insertion Method . . . . . . . . . . . . . . . . . . . . 179
4.5 Automatic Insertion Method . . . . . . . . . . . . . . . . . . . 181
4.5.1 Automatic Insertion for C Language . . . . . . . . . . 182
4.5.2 Comparison of Manual and Automatic Insertion . . . . 190
4.6 Hybrid Insertion Method . . . . . . . . . . . . . . . . . . . . . 191
4.7 Design of CASP Mechanism . . . . . . . . . . . . . . . . . . . 192
4.7.1 CASPapp Component . . . . . . . . . . . . . . . . . . . 194
4.7.2 CASPos Component . . . . . . . . . . . . . . . . . . . 194
4.7.3 Page-isolation Technique . . . . . . . . . . . . . . . . . 196
4.7.4 Use of the Reflection Framework . . . . . . . . . . . . 197
4.8 Evaluation Strategy . . . . . . . . . . . . . . . . . . . . . . . . 198
4.9 Virtual Memory Simulation . . . . . . . . . . . . . . . . . . . 199
4.9.1 Trace-driven Simulation . . . . . . . . . . . . . . . . . 199
4.9.2 On-the-fly Simulation . . . . . . . . . . . . . . . . . . . 200
4.10 PROTON Virtual Memory Simulator . . . . . . . . . . . . . . 201
4.10.1 PROTON Annotations . . . . . . . . . . . . . . . . . . 202
4.10.2 Simulation of Multiple Applications . . . . . . . . . . . 206
4.10.3 Implementing UD Paging Policies . . . . . . . . . . . . 209
4.11 Simulation Experiments using PROTON . . . . . . . . . . . . 210
4.11.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 210
4.11.2 Single Application . . . . . . . . . . . . . . . . . . . . 213
4.11.3 Multiple Applications . . . . . . . . . . . . . . . . . . . 215
4.11.4 Slow-down Factor . . . . . . . . . . . . . . . . . . . . . 218
4.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
5 Implementation of CASP in a Commodity OS (Linux) 221
5.1 Overview of Linux 2.6.16 Kernel . . . . . . . . . . . . . . . . . 222
5.1.1 CART Implementation in Linux . . . . . . . . . . . . . 223
5.2 Implementation in Linux . . . . . . . . . . . . . . . . . . . . . 224
5.2.1 Reflection Framework . . . . . . . . . . . . . . . . . . . 224
5.2.2 CASP Mechanism . . . . . . . . . . . . . . . . . . . . . 226
5.2.3 Page-isolation in Linux-LRU . . . . . . . . . . . . . . . 227
5.2.4 Page-isolation in Linux-CART . . . . . . . . . . . . . . 227
5.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 228
5.3.1 Hardware Platform . . . . . . . . . . . . . . . . . . . . 228
5.3.2 Benchmark Applications . . . . . . . . . . . . . . . . . 228
5.3.3 Single Application Scenario . . . . . . . . . . . . . . . 230
5.3.4 Multiple Applications Scenario . . . . . . . . . . . . . 245
5.3.5 Memory Usage . . . . . . . . . . . . . . . . . . . . . . 247
5.3.6 Space Overhead . . . . . . . . . . . . . . . . . . . . . . 247
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6 Conclusion 251
6.1 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . 251
6.2 Applications and Limitations . . . . . . . . . . . . . . . . . . 254
6.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . 256
Bibliography 259
List of Tables
3.1 Measured Execution Times of DAMROS Interfaces . . . . . . 138
3.2 No Reflection, Basic RR Scheduler . . . . . . . . . . . . . . . 140
3.3 Reflection with One High Priority Application . . . . . . . . . 140
3.4 Reflection with One High Priority Application . . . . . . . . . 141
3.5 Reflection with One High Priority and Other Varying Priority
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.1 Description of Benchmark Applications . . . . . . . . . . . . . 211
4.2 Single Application Benchmark Results for LRU . . . . . . . . 212
4.3 Two Applications Scenario for LRU (1) . . . . . . . . . . . . . 216
4.4 Two Applications Scenario for LRU (2) . . . . . . . . . . . . . 216
4.5 Three Applications Scenario for LRU . . . . . . . . . . . . . . 217
5.1 Benchmark Applications . . . . . . . . . . . . . . . . . . . . . 230
5.2 Single Application Performance in Linux-LRU (1) . . . . . . . 231
5.3 Single Application Performance in Linux-LRU (2) . . . . . . . 231
5.4 Single Application Performance in Linux-CART (1) . . . . . . 232
5.5 Single Application Performance in Linux-CART (2) . . . . . . 232
5.6 Results for Multiple Applications . . . . . . . . . . . . . . . . 245
5.7 Benchmark Code Size (bytes) . . . . . . . . . . . . . . . . . . 248
5.8 Linux Kernel Image Sizes (in bytes) . . . . . . . . . . . . . . . 248
List of Figures
1.1 Choice of RTOS for Embedded System Implementation [121] . 2
1.2 Trends in Application Complexity and Processor Speed [23] . 4
1.3 Projected Trends in Mobile Application Complexity [23] . . . 5
1.4 Need for Greater Secondary Storage [42] . . . . . . . . . . . . 7
2.1 MPEG Input Streams for Decoding . . . . . . . . . . . . . . . 25
2.2 Hierarchical Scheduling Structure . . . . . . . . . . . . . . . . 29
2.3 Tower of Reflection (Reproduced from [81,105]) . . . . . . . . 42
2.4 Object/Meta-Object Separation and Meta-Hierarchy [128] . . 53
3.1 Reification through the Kernel . . . . . . . . . . . . . . . . . . 65
3.2 Modifications to Reflection . . . . . . . . . . . . . . . . . . . . 66
3.3 In-kernel Reflection Interface . . . . . . . . . . . . . . . . . . . 77
3.4 Structure of a Reflective System Module . . . . . . . . . . . . 83
3.5 Code Snippet of Reify Interface . . . . . . . . . . . . . . . . . 96
3.6 rManager: Saving Reified Information . . . . . . . . . . . . . . 98
3.7 Two-level Scheduler in DAMROS . . . . . . . . . . . . . . . . 112
3.8 Structure of Reflective CPU Scheduler Module . . . . . . . . . 113
3.9 URQ: Representation of Threads . . . . . . . . . . . . . . . . 116
3.10 Operation of the VRHS Model . . . . . . . . . . . . . . . . . . 119
3.11 Pseudo-code of rScheduler Thread . . . . . . . . . . . . . . . . 122
3.12 Application-specific UD Scheduler Blocks . . . . . . . . . . . . 124
3.13 User-Defined FCFS Scheduler . . . . . . . . . . . . . . . . . . 125
3.14 Structure of the RMMS Model . . . . . . . . . . . . . . . . . . 129
3.15 Reflective Memory Management System (RMMS) . . . . . . . 131
3.16 Operation of the RMMS model . . . . . . . . . . . . . . . . . 133
3.17 Application-specific UD Paging Policy . . . . . . . . . . . . . 135
3.18 Pseudo-code for Thread T2 . . . . . . . . . . . . . . . . . . . . 143
3.19 Results of Experiment #1 . . . . . . . . . . . . . . . . . . . . 150
3.20 Results of Experiment #2 . . . . . . . . . . . . . . . . . . . . 152
3.21 Using RR Scheduler . . . . . . . . . . . . . . . . . . . . . . . . 154
3.22 Using UD Scheduler . . . . . . . . . . . . . . . . . . . . . . . 155
3.23 RR Vs UD Scheduler . . . . . . . . . . . . . . . . . . . . . . . 156
3.24 Static Vs Reflective LRU . . . . . . . . . . . . . . . . . . . . . 160
3.25 Experiment #1: Page-faults . . . . . . . . . . . . . . . . . . . 163
3.26 Page-faults for RMMS . . . . . . . . . . . . . . . . . . . . . . 165
4.1 OS Paging Model . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.2 Benchmark Application – ‘scan’ . . . . . . . . . . . . . . . . . 178
4.3 Manual Insertion for ‘scan’ . . . . . . . . . . . . . . . . . . . . 180
4.4 Steps Involved in Automatic Insertion . . . . . . . . . . . . . . 183
4.5 Pass-1 of the cloop Tool . . . . . . . . . . . . . . . . . . . . . 185
4.6 Pass-2 of the cloop Tool . . . . . . . . . . . . . . . . . . . . . 186
4.7 CIL Transformation of ‘scan’ . . . . . . . . . . . . . . . . . . 188
4.8 Automatic Method for ‘scan’ . . . . . . . . . . . . . . . . . . . 189
4.9 Design of CASP Mechanism . . . . . . . . . . . . . . . . . . . 193
4.10 PROTON Design Model . . . . . . . . . . . . . . . . . . . . . 203
4.11 ‘scan’ with Traditional Annotation . . . . . . . . . . . . . . . 204
4.12 ‘scan’ with PROTON Annotations . . . . . . . . . . . . . . . 205
4.13 BSORT Simulation . . . . . . . . . . . . . . . . . . . . . . . . 215
4.14 FFT Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.15 MATVEC Simulation . . . . . . . . . . . . . . . . . . . . . . . 215
4.16 SCAN Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.1 MAD: Results on Linux-LRU and Linux-CART . . . . . . . . 233
5.2 FFT: Results on Linux-LRU and Linux-CART . . . . . . . . . 235
5.3 FFT-I: Results on Linux-LRU and Linux-CART . . . . . . . . 237
5.4 MATVEC: Results on Linux-LRU and Linux-CART . . . . . . 239
5.5 SCAN: Results on Linux-LRU and Linux-CART . . . . . . . . 241
5.6 Summary of Results for Linux-LRU . . . . . . . . . . . . . . . 243
5.7 Summary of Results for Linux-CART . . . . . . . . . . . . . . 243
5.8 Results for Multiple Applications . . . . . . . . . . . . . . . . 243
List of Abbreviations and
Symbols
Abbreviation Meaning
AGEING AGEING is a page replacement policy.
APEX APEX is a two-level disk scheduler.
API Application Programming Interface
ARC Adaptive Replacement Cache page replacement policy.
ATOM ATOM is a static code annotation based trace collection tool.
ATUM ATUM uses microcode to efficiently capture address traces.
BSORT Bubble sort algorithm/application
CAR CLOCK with Adaptive Replacement combines the advantages
of CLOCK and ARC policies.
CART CART is an extension of the CAR policy with a temporal filter
that improves upon the defects of CAR/ARC.
CASP Cooperative Application-Specific Paging mechanism.
CIL C Intermediate Language tool chain.
CLOCK CLOCK is an easy-to-implement page replacement policy.
CLOS CLOS is a programming language that supports reflection.
CORBA Common Object Request Broker Architecture is a standard
enabling software components written in multiple computer
languages and running on multiple computers to work to-
gether.
CPU Central Processing Unit.
DAMROS Dynamically Adaptive Micro-Reflective Operating System.
DBMS DataBase Management System.
DLL Dynamic Link Library.
DMA Direct Memory Access is a feature of modern computers that
allows certain hardware subsystems within the computer to
access system memory for reading and/or writing indepen-
dently of the central processing unit.
DSP Digital Signal Processor.
DVD Digital Versatile Disc is a popular optical disc storage media
format.
EDF Earliest-Deadline-First scheduling policy.
EELRU Early Eviction Least Recently Used page replacement policy.
FCFS First-Come-First-Serve policy.
FERT A notational language – Fault Tolerant Entities for Real-Time
for specifying fault-tolerant requirements on a task-by-task
basis.
FFT Fast Fourier Transformation algorithm/application.
FIFO First-In-First-Out policy.
FP Fixed Priority scheduling policy.
GOPI GOPI is a middleware layer.
GPS Global Positioning System.
GUI Graphical User Interface.
HATS An adaptive hierarchical scheduling scheme using the puppeteer
system for scheduling network bandwidth.
HLS The Hierarchical Loadable Scheduler is a hierarchical schedul-
ing scheme.
ID An Identification number.
iManager The component that manages the interception mechanism in the
DAMROS operating system.
IP Internet Protocol.
IPC Inter-Process Communication.
J2EE Java 2 Platform, Enterprise Edition.
JIT Just-In-Time.
JVM Java Virtual Machine.
KB Kilobytes.
LFU Least Frequently Used page replacement policy.
LIRS Low Inter-reference Recency Set page replacement policy.
LISP List Processing Language is a programming language favoured
for Artificial Intelligence research.
LRFU Least Recently/Frequently Used page replacement policy.
LRU Least Recently Used page replacement policy.
MAD MPEG decoder application.
MATVEC Matrix Vector multiplication application.
MAX Maximum value.
MB Megabytes.
MCU Micro-Controller Unit.
MFU Most Frequently Used page replacement policy.
µ-kernel Micro-kernel operating system architecture.
MM Memory Management.
MMU Memory Management Unit.
MOP Meta Object Protocol.
MPEG The Moving Picture Experts Group, commonly referred to as
simply MPEG, is a working group of ISO/IEC charged with
the development of video and audio encoding standards.
MPI Message Passing Interface is both a specification and its implemen-
tation that allows many computers to communicate with one an-
other.
MRU Most Recently Used page replacement policy.
ms Milliseconds.
µs Microseconds.
OLR Optimal LRU Reduction is an algorithm to reduce the mem-
ory access trace size.
ORB Object Request Broker.
OS Operating System.
PC Personal Computer.
PCB Process Control Block.
PDA Personal Digital Assistant.
POSIX Portable Operating System Interface is the collective name of
a family of related standards specified by the IEEE to define
the application programming interface for software compati-
ble with variants of the Unix operating system, although the
standard can apply to any operating system.
PREMO Page REplacing Memory Object is a pager in the Mach operating
system that executes in the application address space.
PROTON PROTON is a virtual memory simulator to simulate multiple
applications workload.
QNX QNX is a commercial POSIX-compliant Unix-like real-time
operating system, aimed primarily at the embedded systems
market.
RAID Redundant Arrays of Independent Disks is a technology that
employs the simultaneous use of two or more hard disk drives
to achieve greater levels of performance, reliability, and/or
larger data volume sizes.
RAM Random Access Memory.
RISC Reduced Instruction Set Computer represents a CPU design.
RM Rate Monotonic scheduling policy.
rManager The component that manages the reification process in the
DAMROS operating system.
RMI Remote Method Invocation.
RMMS Reflective Memory Management System.
RPM Revolutions Per Minute.
RR Round Robin scheduling policy.
RSS Resident memory Set Size.
RTCC Real-Time Concurrent C is a high-level programming lan-
guage.
RTOS Real-Time Operating System.
SA The Scheduler Activations (SA) model is an API that provides
a kernel interface and scheduler up-call mechanism to support
the hierarchical scheduling scheme.
SAD Safely Allowed Drop is an algorithm to reduce LRU memory
trace size.
SCAN SCAN is a micro-benchmark application to stress the virtual
memory subsystem.
SDL System Description Language is a description language used
to specify details of a network node in the Spring operating
system.
SDRAM Synchronous Dynamic Random Access Memory.
SEGQ Segmented Queue page replacement policy.
SFQ Start-time Fair Queueing is a scheduling policy used to sched-
ule the intermediate nodes of a hierarchical scheduler.
SMART SMART is an optimised scheduling scheme that adapts to the
working set of applications.
SMP Symmetric multiprocessing.
SMS Short Message Service is a communications protocol allowing
the interchange of short text messages between mobile tele-
phone devices.
SRAM Static Random Access Memory.
TCP Transmission Control Protocol.
UD User-Defined policy.
URL Uniform Resource Locator.
URQ Universal Run Queue.
VM Virtual Memory.
VRHS Virtually Reflective Hierarchical Scheduler.
WYNIWYG What You Need Is What You Get.
Calloc The cost of allocating a page in memory.
CASPapp The application component of CASP mechanism.
CASPos The operating system component of CASP mechanism.
Cmajor The cost to handle a major page-fault.
Cminor The cost to handle a minor page-fault.
Cpage The cost of a page-in or page-out operation.
Dlock The minimum amount of memory region (in bytes) that the
OS mechanism needs to lock.
Dsize The size of the memory region (in bytes) being accessed.
ε(t) The time a thread spends in the system executing on the
central processing unit.
Et The time at which a thread finishes its execution and leaves
the system.
Mfree The total free memory at any given time.
Mtotal The total memory in the system.
Nfree The number of available free pages.
Nprocess The number of different application processes running in the
system.
ω(t) The time spent by a thread waiting from the time it entered
the system until it first executes on the CPU.
Oτ The constant time taken by the operating system to perform
activities other than paging.
Pτ The time taken by the operating system to perform paging
activity.
St The time at which a thread first starts its execution.
Sτ The time spent executing the operating system code to per-
form system activities.
Tτ The turn around time of an application process.
TTRnd(t) The time taken by a thread from the time it entered the sys-
tem to the time it leaves the system.
Uτ The time spent executing the application code.
Abstract
Complex applications impose greater resource demands on resource-limited soft real-time
embedded systems. The Real-time Operating System (RTOS) should efficiently manage
the system's resources amongst several such applications. Built for the general case, rather
than to meet application-specific requirements, the RTOS is unable to meet the dynamic
resource demands of the applications. It provides only average-case support for the increasing
application resource requirements, leading to poor application performance. On the other hand,
applications that may be able to predict their resource requirements at runtime have no
control over the RTOS's resource management policies (e.g. CPU, memory, etc.).
In particular, giving applications control over the processor scheduling and memory
management will provide efficient resource management support. In order to provide
such application-specific resource management support, this thesis proposes a reflective
RTOS framework that allows fine-grained changes to the RTOS’s resource management
policies. Reification calls, inserted into the application source code, inform the RTOS about
application-specific resource requirements. The reflection framework uses this information
to adapt the RTOS policies accordingly.
The proposed RTOS framework has been implemented in a prototype µ-kernel DAM-
ROS and also in a commodity OS Linux (2.6.16 kernel). The experiments performed to
evaluate the reflection framework, along with the use of reification calls in the context of
virtual memory (paging), have shown significant improvement in paging performance. The total
number of page-faults in the system was reduced by 22.3% and the application performance
improved by 12.5%.
Acknowledgements
There are many people without whom the work in this thesis would not have
been completed. I would like to express my sincere gratitude to them all. In
particular, my supervisor, Dr. Neil Audsley for his constant encouragement,
guidance and support throughout my years at the university.
I am very grateful to friends and colleagues in the Real-time Systems Group
at York for their vital feedback on my work. In particular, I would like to thank
Adam Betts, Anant Kapdi, Rachel Baker, Ian Broster, Rui Gao, Micheal Ward
and Andrew Borg for their help and encouragement. Thanks to Professor Andy
Wellings for his helpful comments during the assessment process.
Many thanks to my childhood friends Kiran Baloji, Ganesh Jannu and
Prasad Kori for their constant support and for making special those all im-
portant breaks. Special thanks goes to my teacher Jayashree Bhagoji in India
without whom it would not have been possible. She has been instrumental in
shaping my career.
I am indebted to my lovely wife Sushma for her continued support through-
out. Finally, to my parents, my sister Sneha and my brother-in-law Ravi and
all my relatives in India, my gratitude for their patience, love and support.
This work has been as much an effort on their part as it was on mine.
York, UK Ameet Patil
September, 2007
Declaration
I hereby declare that, unless otherwise stated in the text, the research work
presented in this thesis is original and undertaken by myself, Ameet Patil,
between October 2003 and September 2007 under the guidance of my supervi-
sor, Dr. Neil Audsley. I have acknowledged external sources where necessary
through bibliographic referencing. Parts of this thesis have previously been
published as technical reports or conference papers as listed below.
The generic reflective framework for RTOS presented in chapter 3 was ini-
tially published as a work-in-progress paper at the IEEE Real-time Systems
Symposium (RTSS) [95]. The reflection framework along with the µ-kernel
implementation – DAMROS appeared as a full paper in the IEEE Real-time
and Embedded Technology and Applications Symposium (RTAS) [96]. Part
of this work was also published at the International Workshop on Operating
System Platforms for Embedded Real-Time Systems (OSPERT) [14]. The
hierarchical scheduling model – VRHS was published as a technical report [94].
Chapter 4 presents the design and experiments using the on-the-fly simulator
– PROTON that was published as a technical report [93]. The methods of in-
serting application hints, the design of CASP and its implementation in Linux
appeared as a full paper in the IEEE Real-time and Embedded Computing
Systems and Applications Symposium (RTCSA) [97].
Chapter 1
Introduction
The use of a real-time operating system (RTOS) within soft real-time em-
bedded systems aids runtime resource management and supports application
development via the layer of abstraction provided by the RTOS. Almost 71%
of the embedded real-time systems designed in the year 2006 used an RTOS
(see figure 1.1), of which 50% used a commercial RTOS [121]. In systems
with limited resources that support complex applications having varying re-
source requirements, an RTOS should manage the system resources efficiently
to provide application-specific resource management support.
Applications in soft real-time embedded systems are becoming increasingly
complex. The mobile telecommunications industry typifies this rising appli-
cation complexity. Successive product generations place increased demands
upon the target platform [113]. Many computationally intensive applications
such as software radio, cryptography, augmented reality, speech recognition
and mobile applications such as e-mail and word processing are making their
way into future mobile platforms [15]. In [15], it is estimated that in order to
support the above applications, a platform would require about 16 times as
much computing “horsepower” as a 2-GHz Intel Pentium 4 processor.
Figure 1.1: Choice of RTOS for Embedded System Implementation [121]
The complexity and sophistication of the CPUs within many current soft
real-time embedded systems have increased to meet application demands.
System platforms ranging from tiny micro-controller devices, for example the
Infineon XC167CI [2] (containing a 40MHz CPU, 12KB Random Access
Memory (RAM) and 256KB flash), to embedded systems building blocks such
as computer-on-modules, typified by the CM-X255 [3] (containing Intel's XScale
(Arm) PXA255 CPU at up to 400MHz, 64MB SRAM and 512MB flash), are
quite sophisticated and provide advanced features such as virtual memory.
In systems that use an RTOS, efficient resource management is important
in order to support complex applications. Most commercial RTOSs use generic
resource management policies which provide average case support [41,121]. For
instance, to manage virtual memory by paging (by using a secondary storage
device, paging allows applications to use more memory than is physically avail-
able), the most commonly used generic page replacement policy in RTOSs such
as Embedded Linux is the least recently used (LRU) policy [9, 54, 86].
Different applications have different memory access patterns and differ-
ent memory requirements. The LRU policy uses only recency information of
memory page accesses to determine which pages are least used by application
processes. It does not consider other information such as a page’s frequency
of access, etc. For applications with varying memory demands and long-term
memory page access patterns, LRU may not provide the best support.
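The recency-only behaviour of LRU can be made concrete with a minimal sketch (illustrative only, not the thesis implementation; all names here are hypothetical): each frame records the page it holds and the logical time of its last access, and on a fault the frame with the oldest access time is evicted, however frequently its page has been used.

```c
#define NFRAMES 3   /* number of physical page frames (hypothetical) */
#define EMPTY   -1

/* Illustrative LRU sketch: per-frame page number and last-access time. */
static int  frame_page[NFRAMES];
static long frame_last[NFRAMES];
static long now;    /* logical access clock */

void lru_init(void)
{
    for (int i = 0; i < NFRAMES; i++) {
        frame_page[i] = EMPTY;
        frame_last[i] = 0;      /* empty frames look "oldest" */
    }
    now = 0;
}

/* Simulate one memory access; returns 1 on a page-fault, 0 on a hit. */
int lru_access(int page)
{
    int victim = 0;

    now++;
    for (int i = 0; i < NFRAMES; i++) {
        if (frame_page[i] == page) {
            frame_last[i] = now;        /* hit: refresh recency only */
            return 0;
        }
        if (frame_last[i] < frame_last[victim])
            victim = i;                 /* least recently used so far */
    }
    /* Fault: evict the LRU frame, regardless of access frequency. */
    frame_page[victim] = page;
    frame_last[victim] = now;
    return 1;
}
```

With three frames, the reference string 1, 2, 3, 1, 4, 2 incurs five faults: page 2 (last touched at time 2) is evicted for page 4 because only recency is consulted, so it must fault back in immediately afterwards. A policy that also tracked frequency could make a different choice here.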
This thesis proposes a framework for an RTOS to allow runtime changes
to the resource management policies so as to adapt to the application-specific
resource requirements. Resource management in RTOS is examined in the
context of soft real-time embedded systems that use complex CPUs. The rest
of this chapter discusses the importance of resource management in an RTOS
and existing support for applications with dynamic resource requirements (sec-
tions 1.1 and 1.2). The thesis proposal and the contributions are detailed in
sections 1.4 and 1.5. Section 1.6 describes the thesis outline.
1.1 Technological Growth Versus Application
Complexity
In the mobile handset industry, the application complexity has surpassed the
growth in processor speed. Figure 1.2 shows the rise in application complexity
with respect to the first (1G), second (2G) and the third generation (3G) of
mobile handsets. The figure shows that the growth in the processor perfor-
mance has fallen behind the rising application complexity [23]. To meet this
challenge, the industry is driving the development of more powerful CPUs
involving multiple CPU cores.

Figure 1.2: Trends in Application Complexity and Processor Speed [23]

The TriCore [60] microcontroller from Infineon is the first single-core 32-bit
MCU-DSP (microcontroller – digital signal processor) architecture optimised
for real-time embedded systems. It unifies the best of three worlds: the
real-time capabilities of microcontrollers, the computational power of DSPs,
and the price/performance benefits of RISC load-store architectures.
The use of more processing power leads to additional power consumption,
which affects portable, battery-operated devices such as mobile handsets.
Unfortunately, battery technology has not improved as significantly as CPU
technology, restricting the up-time of portable devices to only a few hours [23].
For instance, the talk time of the latest Samsung D840 [8], a multimedia-rich
mobile handset, is up to only 2.8 hours. This does not include normal usage
of the phone, i.e. sending multimedia messages, recording pictures/videos,
playing games, playing music/videos, downloading Internet content and
sending/receiving emails. Under normal use it would require a recharge almost
every day.

Figure 1.3: Projected Trends in Mobile Application Complexity [23]
Figure 1.3 (reproduced from [23]) outlines the observed and predicted
growth in the functionality incorporated in a mobile handset in the previous
and coming years. The first generation (1G in figure 1.2) mobile handsets in
the 1980s were only capable of making voice calls, requiring relatively little
processing power. Later generations introduced a text messaging service, popularly
known as SMS (Short Messaging Service) or Texting. This allowed mobile
users to exchange text messages without having to make voice calls. PC- and
console-based software games slowly made their way into mobile handsets.
Today’s mobile handsets are filled with advanced features. They are Internet-
ready (web browser, chat clients like Yahoo Messenger), multimedia-rich (MP3
playback, voice/still image/video recording, graphics-rich games) and support
various wireless technologies (WiFi, Bluetooth). Many of them can also be
used as a PDA (Personal Digital Assistant).
Such an increase in functionality usually translates into the implementation
of multiple single- or multi-threaded applications. Running more applications
increases the computation and storage requirements, placing greater demands
on the system resources. Memory is one such resource whose demand increases
with application complexity. Mobile handsets, for instance, support the use of
an additional secondary storage device in the form of Flash memory cards. The
usage of flash memory technology in mobile handsets is becoming common (see
figure 1.4). In year 2006 alone, more than 20% of the handsets included nearly
512MB of flash memory [42].
This trend is also apparent in other areas of embedded systems. For
instance, car manufacturers are introducing vehicles with GPS (Global Po-
sitioning System) navigation, real-time traffic reports, satellite radio, DVD
playback, MP3 player, voice-controlled operations, hard-drive music storage
and many such capabilities all in one integrated unit [15]. To provide such
functionality, either multiple complex applications or multiple software com-
ponents integrated to form a single application are deployed in the system.
The role of an RTOS is to efficiently share the system resources such as CPU,
memory, etc. amongst the different applications without affecting the overall
system performance.

Figure 1.4: Need for Greater Secondary Storage [42]
1.1.1 Resource Constraints
Systems that are battery operated and portable often have several resource
constraints. The following are some of the commonly found ones [15]:
• Power: limited by battery capacity. The system (e.g. a mobile
handset) should use minimal power since it operates on battery;
• CPU: the greater a CPU’s processing ability, the more power it con-
sumes [123]. There is therefore a trade-off between CPU speed and
power requirements. Proper sharing of the CPU resource by the RTOS
results in better power utilisation [123];
• Memory: main memory is an important resource for embedded ap-
plications. Primary memory is not only expensive, but the more of it is used
the more power it consumes. The rise in application code size, together with
the increase in memory required for data storage and processing, leads to
greater memory utilisation. The RTOS needs to share memory efficiently
amongst the different applications.
Adhering to such constraints requires that the applications and the un-
derlying RTOS manage the system resources efficiently. Mismanagement of
resources or provision of generic resource management policies in the RTOS
results in resource conflicts often leading to poor application performance [15].
1.2 Resource Management in RTOS
A conventional RTOS is built without the knowledge of applications that would
execute upon it – i.e. the RTOS is built for the general case, rather than to
meet application-specific requirements. Whilst a few custom built RTOSs that
support a single or fixed set of critical applications (e.g. in the aviation indus-
try) use specialised resource management policies, most commercial RTOSs
supporting the entire embedded applications domain (including mobile ap-
plications) implement generic resource management policies. These generic
policies do not identify runtime application-specific resource requirements,
and consequently may not provide the best possible support.
With little or no application-specific support from the generic policies, the
system shows poor performance, forcing developers to disable the use of
certain advanced features such as virtual memory [37,58]. Paging is often dis-
abled because the generic page replacement policies generate considerable
page-swap overhead that affects system performance. Thus, rather
than trying to provide application-specific paging support, the feature is com-
pletely disabled. For example, the ARM microprocessors (ARM7 10T), which
have been widely deployed in embedded systems, have a full MMU (Memory
Management Unit) with support for virtual memory [6], yet generic page re-
placement policies lead to poor virtual memory management because they may
not be able to identify the specific memory requirements of the applications.
Application resource requirements can be dynamic and non-deterministic in
nature, depending upon runtime factors. The performance of soft real-time
embedded systems that are constrained by limited resources may be further
degraded by the use of generic resource management policies in the RTOS.
No single resource management policy can equally satisfy the dynamic
resource requirements of all the applications in the system. The next subsec-
tion describes existing approaches and techniques for providing better resource
management in an RTOS.
1.2.1 Existing Approaches to Efficient Resource Management
Existing approaches are coarse-grained – i.e. they involve applications chang-
ing their own functionality by altering the processes that are executed, whilst
leaving the RTOS unchanged. However, many subtle changes in behaviour
can be achieved without altering the process functionality, rather by modify-
ing the resource management policies and the individual parameters governing
them in an RTOS. Such fine-grained change would allow the same application
functionality to be executed, but perhaps at different times or rates, poten-
tially using different resources. For example, in response to changes in the
environment an application may need to change the rates at which individual
processes are executed, or perhaps the resources that individual processes use.
On the one hand, to satisfy the dynamic requirements of the applications,
the RTOS needs to adapt to changing application behaviour and re-
source requirements. On the other hand, it is the applications, not the RTOS,
that are in a better position to predict their actual behaviour and resource
requirements at runtime. There is therefore a need for the applications to
control and change the way the RTOS manages resources.
Giving the applications complete control over the RTOS resource manage-
ment policies is not an ideal solution. This is because in a multi-programmed
environment a change brought in by one application can have adverse effects
on other applications. Although the application designers are in a better
position to know the resource requirements of the application, for overall
system predictability and safety the control over managing resources should
remain with the underlying RTOS.
Furthermore, by sharing information between the applications and the
RTOS, the resource management policies may be able to adapt or change in
order to support the application resource requirements. It might be possible to
achieve this using reflection [105,112,115,118,128]. The next section describes
the reflection mechanism, first introduced in programming languages, which
allows information to be shared and fine-grained changes to be made to either
code or data at runtime.
1.3 Reflection Mechanism
The mechanism by which an application becomes ‘self-aware’ and changes
itself accordingly, either to alter its behaviour or to improve its performance,
is called reflection [105,112,115,118,128].
The reflection mechanism originated in programming languages such as
Smalltalk, CLOS and Lisp. Many modern programming languages have also
been extended to support reflection [81]. For example, extensions (in the form
of library packages) to programming languages such as Ada, Java and C++
have been developed to support reflection [105].
In order to achieve reflection, an application needs to be aware of many as-
pects of its design and implementation, e.g. its data structures, language con-
structs/semantics, run-time support system (or virtual machine). The mech-
anism by which this information is made available to an application is called
reification [105].
A reflective entity (e.g. the application) is divided into a base-level and
a meta-level component. The base-level component consists of the general
application functionality, or main application code, and reifies information
to the meta-level component. The meta-level component uses the
reified information to adjust the required application functionality at runtime
by changing the code or data in the base-level.
The process of reification can either be implicit or explicit. Implicit reifica-
tion is built into the development model where essential information is auto-
matically reified by the use of language constructs or compiler techniques. Ex-
plicit reification requires the application developer to explicitly add reification
calls into the application source code. Such reification calls could essentially
reify the runtime application-specific resource requirements. However, since
all resource management code resides in the RTOS, the application meta-level
component will not be able to help the application by changing its base-level
code or data.
This thesis proposes a generic reflective framework built into an RTOS
to obtain reified application information in order to bring about fine-grained
changes to the resource management policies. The next section describes the
thesis proposition.
1.4 Thesis Proposition
In order to support applications requiring fine-grained change to RTOS’s re-
source management policies, an RTOS must provide mechanisms that enable
such a change, whilst maintaining the predictability required by the real-time
application in terms of time and resource usage. A generic framework in the
RTOS that sets out guidelines for the resource management policies to auto-
matically handle changes is required.
The central hypothesis of this thesis is:
“Conventional CPU scheduling and memory management policies
in an RTOS provide generic support that does not, in general, al-
low application-specific resource control. This thesis contends that
application-specific control of processor scheduling and memory
management will provide better application support, thereby im-
proving application performance. This thesis proposes a generic
reflective framework in the RTOS to efficiently capture application-
specific resource requirements and bring about fine-grained changes
in the resource management policies. The use of explicit reification
in application source code to specify the resource requirements will
provide better application support and improve performance.”
Using the reflection mechanism [112, 115, 118, 128], a generic reflective
RTOS framework has been proposed that enables the flow/exchange of valu-
able information – (1) within the RTOS between the kernel and several re-
source management modules; and (2) between the RTOS and the application
processes.
Further, with the use of reification (by inserting reification calls into the
application source code), the RTOS is able to gain valuable insight on the
application-specific resource requirements at runtime. This information is then
combined with existing information collected within the RTOS kernel and
forwarded to the concerned resource management module(s). These module(s)
then make fine-grained changes to their policies so as to accommodate the
current application requirements.
1.5 Contribution
This thesis proposes an approach of using a reflection mechanism built into
the RTOS in the form of a generic framework to support application-specific
resource management. At runtime, the approach uses explicit reification to
identify and communicate the application-specific resource requirements to
the RTOS. In particular, this thesis mainly focuses on resource management
pertaining to CPU scheduling and the virtual memory paging technique.
As a first step, a generic reflective RTOS framework has been proposed that
establishes communication paths between applications and the RTOS kernel;
and between the resource management modules and the kernel. The frame-
work allows the reflective resource management modules to make fine-grained
changes to the code/data affecting the resource management or completely
change the resource management policy in use. Under the framework the re-
source management modules can also choose to be non-reflective (in which case
the framework imposes no or minimal overhead onto the respective modules).
An initial prototype µ-kernel, DAMROS [95,96], has been developed in or-
der to implement and verify the proposed reflective framework. Two reflective
system modules, a reflective CPU scheduler and a reflective virtual memory
module, have also been implemented within DAMROS. Several experiments
involving the two reflective resource management modules and custom built
artificial benchmark applications have been performed. The applications are
shown to dynamically adapt the RTOS’s resource management policies ac-
cording to application-specific resource requirements, which resulted in better
application performance.
This thesis describes the use of explicit reification for virtual memory as
a case study in order to capture runtime resource usage information from the
applications.
Three methods of inserting reification calls are described: manual,
automatic and hybrid. For automatic insertion of memory usage reifi-
cation calls, a tool called cloop has been implemented. The tool searches the
application source code for regions with large amounts of data access (data
hot-spots) and inserts reification calls around them. The reification calls inform
the RTOS framework about the application’s future memory requirements and
usage patterns.
In this case study, a simple and efficient Operating System (OS) paging
mechanism called CASP [97] that uses the reified information within the
framework is presented. The case study shows that using the framework, it
is possible to implement a simple reflective module that operates on top of
an existing resource management policy in the system. CASP uses the ‘page-
isolation’ technique that allows it to transparently lock memory pages without
affecting the normal operation of the OS.
An on-the-fly virtual memory simulator, PROTON [93], has been imple-
mented to verify the benefits of explicit reification in the context of virtual
memory. PROTON supports virtual memory simulation for a multiple-
application workload, which helps evaluate overall system performance by
simulating the entire workload (multiple applications) at a time; no existing
virtual memory simulator can simulate multiple applications. Different page
replacement policies can be plugged into PROTON, allowing a system engi-
neer to test and verify the effects of various page replacement policies on the
application workload prior to implementation in an RTOS. Using the simula-
tor, a system engineer can gain valuable insight into an application’s paging
performance before its deployment. PROTON has been used for simulation
experiments involving CASP with applications using explicit reification calls.
Finally, the implementation of the core framework and the CASP mech-
anism in a commodity OS, Linux (2.6.16 kernel), is presented. Experiments
involving several embedded benchmark applications chosen from MiBench [56]
(an embedded benchmark suite) show the effectiveness of CASP and the
framework. The results show a considerable reduction in paging overhead and
a significant improvement in application performance. In Linux, CASP has
been implemented and evaluated in the context of two different page replace-
ment policies: the LRU/CLOCK-based [54] policy and the CART [17] policy.
1.6 Outline
This thesis is organised as follows: the next chapter introduces existing work –
describing constraints on real-time embedded systems, the OS design architec-
tures, resource management in an RTOS pertaining to the CPU and memory
resource, existing OS specialisation techniques and the reflection mechanism
as an OS specialisation technique. The chapter also presents a survey of exist-
ing use of reflection in programming languages, middleware technologies and
reflective OSs.
Complex applications have varying resource requirements that are often
non-deterministic in nature. Thus, any information pertaining to resource
usage and requirements of the applications could be quite valuable to the re-
source management policies of the RTOS. Chapter 3 investigates and proposes
a reflection-based generic RTOS framework that allows runtime adaptation
of the resource management policies depending on application requirements.
Experiments involving a prototype implementation of a µ-kernel, DAMROS,
along with two example reflective system modules, a reflective CPU sched-
uler and a reflective memory management module, are performed to verify the
effectiveness of the RTOS framework.
Further, in chapter 4, a case study on virtual memory (paging) is carried
out to illustrate the various methods of reification in the framework. Three
different methods of inserting reification calls into the application source code:
manual, automatic and hybrid methods are described. The design of another
OS mechanism, CASP [97], for virtual memory management (paging) which
works on top of existing page replacement policies is described. Experiments
involving the use of reification calls along with the CASP mechanism via sim-
ulation show considerable improvement in the performance of paging as well
as the applications.
The scalability of the reflective framework and the CASP mechanism are
then investigated in a commodity OS. Chapter 5 describes implementation of
the reflective framework and the CASP mechanism in two flavours of Linux,
one using an LRU/CLOCK [54] based page replacement policy and the other
using a CART [17] page replacement policy. Experiments involving the frame-
work and CASP are performed on both flavours of Linux and the results
compared against conventional applications and an existing solution based
on Linux’s mlock() primitives [19].
Finally, chapter 6 presents conclusions to the work in this thesis along with
a detailed layout for future work.
Chapter 2
Resource Management and Operating System Specialisation
This chapter provides a detailed study of the existing technology and ap-
proaches involving operating systems, resource management and OS speciali-
sation techniques. The chapter is organised as follows. The next section pro-
vides the background on real-time systems emphasising resource constraints
and the specialisation required to support application-specific requirements.
Section 2.2 discusses existing techniques and policies used in OS resource man-
agement, particularly for the CPU and memory. In section 2.3, OS specialisa-
tion techniques to accommodate increasing application resource demands are
discussed. Finally, in section 2.4, reflection mechanisms are discussed in the
context of programming languages, middlewares and operating systems.
2.1 Real-time Embedded Systems
The last few decades have seen the use of computers in many diverse shapes
and forms in our day-to-day life. Computers are used in devices ranging from
coffee machines to highly sophisticated flight control systems in aircraft. The
systems that embed computer hardware and software for a particular purpose
or application (e.g. a ticketing machine) are called embedded systems [29].
Real-time systems are those in which the time at which a result is pro-
duced is as important as the logical result of the computation itself [29]. Em-
bedded systems requiring such timing behaviour are called real-time embedded
systems. Examples of real-time embedded systems include ticketing machines,
coffee machines, washing machines, automotive anti-lock braking system, in-
dustrial robots, space station control systems, battery operated devices, wire-
less telecommunication systems, aircraft, military defence systems, medical
systems, etc.
2.1.1 Types of Real-time Systems
The cost of an error or failure varies between real-time systems.
For example, if a coffee machine fails or delays delivering a coffee, then the
user can wait a little longer or easily have the machine fixed. However,
if the flight control system of an aircraft misbehaves or fails in flight,
then the end result could be catastrophic. Depending on this factor, real-time
systems are classified into different types:
• Soft real-time systems [11],
• Hard real-time systems [11],
• Weakly-hard real-time systems [21].
Real-time tasks in a system are characterised by several timing constraints
such as deadline, inter-arrival time and jitter [29]. Reliability and predictabil-
ity are the two main characteristics of a real-time system. Soft real-time sys-
tems are those that can occasionally afford to miss a deadline and still be
functional. A typical example is an MP3 player where it is acceptable to have
minor disruptions in sound every now and then caused by delays in decoding.
Hard real-time systems cannot tolerate faults: a deadline miss
in such a system may have catastrophic effects. For example,
the cost of a failure in a flight control system is severe compared to that
in an MP3 player. Such systems may still miss a deadline provided that it
happens in a known, predictable way [29]. Weakly-hard real-time systems are
those real-time systems that can tolerate a clearly specified degree of missed
deadlines [21].
2.1.2 Categorising Real-time Embedded Systems
Many real-time embedded systems that provide complex functionality use
multiple application processes, which in turn can be either single- or multi-
threaded. It is common to find an OS in real-time embedded systems
for efficient management of resources so as to provide better support for such
complex applications. An OS is the main software program that acts as a
bridge between the underlying hardware and the applications that execute
upon it [62]. It is responsible for sharing the available system resources (e.g.
CPU, memory, networks, etc.) amongst the application tasks/processes [120].
In order to support complex real-time applications the OS needs to satisfy
their resource requirements.
Not all real-time embedded systems make use of an OS. In general, real-time
embedded systems can be categorised as follows [130]:
• systems without an OS: these systems are relatively simple and have
everything hardwired into them. Making a change to such systems can
be a time-consuming and difficult process.
• systems that use a simple OS: the OS is mainly involved in monitoring
system activity or it only supports a simple application without much
complexity.
• systems that use a commercial general purpose RTOS [41,131]: such sys-
tems are identified by the characteristics of the RTOS they deploy (e.g.
VxWorks, pSOSystems, OS-9, QNX, Windows CE, etc.). These RTOSs
support different kinds of application timing characteristics. However,
such RTOSs are general purpose OSs that provide average case resource
management support and are not optimised for any particular appli-
cation. For optimisation, such OSs need to be manually tweaked and
configured.
• systems that use a custom built RTOS: increasing software complexity
requires better OS abstractions. Commercial RTOSs are general pur-
pose and tend to provide unnecessary additional features, adding to
system complexity. For critical applications found in hard real-time sys-
tems, the use of a custom built RTOS is common. This ensures that the
RTOS is optimised for the applications concerned and provides the best
possible support. However, such an RTOS is tied to those applications,
and it is not feasible to change the OS each time the application
requirements change.
In recent years, commercial RTOSs have become quite flexible allowing
them to be configured for a required specification. Most embedded systems
(real-time or non real-time) deploy a commercial RTOS rather than using
a home-grown (custom built) one, thus saving the costs of additional
development and maintenance.
2.1.3 Application-specific RTOS Specialisation
The increasing demand for additional functionality increases the system com-
plexity. Applications whose resource requirements depend on certain run-
time stimuli are heavily dependent on the resource management support pro-
vided by the RTOS. In systems with constrained resources, such applica-
tions put greater resource demands on the underlying RTOS requiring cus-
tom application-specific policies. Ideally, an RTOS needs to adapt its resource
management policies according to runtime resource requirements of the appli-
cations.
Changes to RTOS policies can be both static and dynamic in nature. Static
changes are easier to handle in that they only require rebuilding the OS with
the code for new policies. However, dynamic changes are difficult to handle
since the changes must be applied at runtime. As an example of a
dynamic change, consider an application whose output depends on certain
environmental or external factors. Depending on a particular external stimulus,
the application may require a change in its priority (a change in the RTOS’s
CPU scheduling policy) or a different policy to manage its memory
(a change in the RTOS’s memory management policy).
Furthermore, an RTOS treats each resource independently. For instance,
the CPU is independently managed by a CPU scheduler whilst memory is
managed by the memory management module. However, often resources have
an inter-dependency pattern amongst themselves such that a change made to
the management of one resource affects the other.
As an example, consider a complex real-time MPEG [51] video decoder
application. An MPEG video stream consists of several frames. A frame is a
single still image in a MPEG video stream, a group of which produce a motion
video. In the MPEG video standard, depending on the coding method, there
are several types of frames [51, 107]:
• I-Frame: stands for intra-coded. This type of frame is coded independently
of other frames. It is considered the starting point for decoding any
MPEG video stream, and these frames can be randomly indexed in an
MPEG video stream.
• P-Frame: stands for predicted. This type of frame is coded with reference
to a past frame (either an I- or a P-frame). To decode this frame, the
reference frame must be decoded first. These frames are referenced by
future P- and B-frames.
• B-Frame: stands for bi-directional, also called an interpolated frame. This
frame is coded with reference to both past and future frames (either
I- or P-frames). These frames are never themselves used as references and
provide maximum video compression.
Frames can be further divided into entities called macro-blocks, but these are
beyond the scope of this discussion. A group of related frames constitutes a scene;
a valid sequence of frames such as IBBPBBPBBP forms a scene. An MPEG
video stream consists of several scenes, each containing a different set of frames.
Frames in two different scenes are not related to each other and can be decoded
in parallel.
Figure 2.1(a) shows a valid MPEG input stream where a group of frames
collectively belong to a particular scene and a sequence always starts with
an I-frame. However, to improve the decoding time, the future P-frames are
transmitted ahead of time. This is done so that both past and future refer-
ence frames have already been decoded when decoding of a B-frame begins.
Figure 2.1: MPEG Input Streams for Decoding
Thus, the real frame transmission pattern for the original input stream may
be represented as shown in figure 2.1(b).
The performance of an MPEG decoder application depends on the band-
width of the input MPEG video stream, the complexity of the scene currently
being decoded and the number of different frame types it contains. Generally,
a well-encoded MPEG video stream contains more B-frames than
frames of the other types [92]. However, it should be noted that the
frame decode time for each type of frame may vary depending on the scene
complexity.
Returning to the example, if the MPEG decoder application is multi-
threaded such that it decodes different unrelated scenes in parallel, then a
traditional scheduling policy may not schedule the threads efficiently on the
CPU, which may in turn affect the memory management subsystem. This is be-
cause the decoder threads require considerable memory to store the decoded
as well as the to-be-decoded frames. In memory-constrained em-
bedded systems, this results in all the available memory being used up, bringing
the memory manager into action. In a system implementing virtual memory
(e.g. paging), the memory manager evicts some of the memory pages belong-
ing to another thread and allocates the freed pages to the requesting
thread.
Traditionally, CPU schedulers (e.g. fixed priority (FP), earliest-deadline-
first (EDF), etc.) operate on information that is either fixed offline (e.g.
priority) or pertains only to the CPU [62,90]. As a result, the CPU
scheduler may schedule the thread whose pages have just been evicted,
causing a series of page faults. This process continues until the system starts
thrashing [90]. Such experiences often lead to the rejection of paging
as a viable solution for embedded systems.
However, the problem here is not with paging. It is the CPU scheduler,
unaware of the memory manager’s operation, that causes the problem. With
proper cooperation and integration of several resource management modules, it
is possible to share valuable information, thereby avoiding the risk of conflicts
that lead to poor system performance.
For better overall performance, an OS should provide the best possible sup-
port to the resource requirements of applications. The resource requirements
of complex real-time applications depend on several factors, making them
non-deterministic and dynamic in nature. An OS needs to adapt its resource
management policies at runtime to provide the required support. Most com-
mercial RTOSs only support static compile time specialisation.
2.1.4 Resource-Constrained Real-time Embedded Systems
Many real-time embedded systems operate in a constrained environment with
limited resources (e.g. limited processing power, memory and
power). In such systems, it is even more important for an OS to efficiently
manage the available resources amongst the varying requirements of complex
applications. The use of average-case, general purpose resource management
policies generally results in poor application performance.
To achieve the functionality required within these constraints, real-time
embedded applications need to dynamically change their own behaviour and
that of the underlying OS. Usual approaches [13,98,131] are coarse-grained,
involving applications changing their own functionality by altering the pro-
cesses that are executed, whilst leaving the OS unchanged. However, many
subtle changes of behaviour can be achieved without altering process function-
ality, rather by modifying the resource management policies of the OS. Such
fine-grained changes allow the same application functionality to be executed,
but perhaps at different times or rates, potentially using different resources.
For example, in response to changes in the environment an application may
need to change the rates at which individual processes are executed, or perhaps
the resources that individual processes use. The following sections discuss re-
source management (for CPU and memory) in an OS and some existing OS
specialisation techniques.
2.2 Resource Management in an OS
The resource requirements of complex applications vary dynamically at run-
time, requiring an OS to manage system resources efficiently. Adaptive resource
management in the OS is key to providing application-specific support. This
section discusses existing techniques and approaches in OS resource manage-
ment for the main system resources: the CPU and memory.
2.2.1 CPU Resource
A CPU scheduler in an OS is responsible for managing the CPU. Its main ob-
jective is to share the CPU amongst several different competing applications
depending on certain criteria or requirements. There are several scheduling poli-
cies, each using a unique or mixed set of scheduling criteria. For instance, the fixed
priority (FP) [79] scheduling policy uses a priority-based scheme to schedule
processes on a CPU, i.e. the process with the highest priority gets to exe-
cute first. Similarly, the earliest-deadline-first (EDF) scheduling policy uses a
deadline-based scheme, i.e. the process with the earliest deadline gets to exe-
cute first. In order to support application-specific requirements, the OS should
be capable of changing the scheduling scheme or criteria dynamically at run-
time. The concept of using a hierarchical scheduling scheme might be suitable
for this purpose. The following subsection discusses hierarchical scheduling in
detail.
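To make the contrast between the two criteria concrete, they can be sketched in a few lines of Java. The Task model and the ready-queue representation below are assumptions made for this sketch only, not part of any particular RTOS.

```java
import java.util.*;

// Illustrative contrast: the same ready queue dispatched under the FP
// and EDF criteria. The Task model is an assumption for this sketch.
public class SchedulingCriteria {
    public static class Task {
        public final String name;
        public final int priority;  // higher value = more urgent under FP
        public final int deadline;  // earlier deadline = more urgent under EDF
        public Task(String name, int priority, int deadline) {
            this.name = name; this.priority = priority; this.deadline = deadline;
        }
    }

    /** Fixed priority: dispatch the ready task with the highest priority. */
    public static Task pickFP(List<Task> ready) {
        return Collections.max(ready, Comparator.comparingInt((Task t) -> t.priority));
    }

    /** Earliest deadline first: dispatch the task with the earliest deadline. */
    public static Task pickEDF(List<Task> ready) {
        return Collections.min(ready, Comparator.comparingInt((Task t) -> t.deadline));
    }
}
```

The same ready queue thus yields different dispatch decisions under the two policies – exactly the kind of criterion a specialisable OS should be able to switch at runtime.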
Hierarchical Scheduling
Hierarchical scheduling has been widely adopted in OSs. Most OSs implement
a two-level scheduler – one to schedule kernel threads and the other for
application/user threads. Additionally, hierarchical schedulers are also used
in virtual machine implementations (e.g. Java Virtual Machine, VMWare [7],
etc.), network layers, disk scheduling, etc. This section presents existing tech-
niques used to design and implement hierarchical schedulers.
Most hierarchical schedulers proposed in the past are based on a fixed tree
type structure. Figure 2.2 shows a typical hierarchical scheduling structure.
The scheduler at the root node implements a traditional scheduling policy (e.g.
the FP policy, as shown in the figure), those at the leaf nodes implement
application-specific policies (e.g. EDF, Rate Monotonic (RM), etc.), while those
at the intermediate nodes implement either a traditional policy, an optimised
policy, or a Start-time Fair Queueing (SFQ) policy [55] to schedule the
next-level schedulers (either the intermediate nodes or the leaf nodes).
[Figure: a root-node FP scheduler above intermediate nodes and leaf-node
FP/EDF/RM schedulers, which in turn schedule threads T1–T12.]
Figure 2.2: Hierarchical Scheduling Structure
The operation of a hierarchical scheduling scheme is as follows: the sched-
uler at the root node schedules the next level scheduler – either an intermediate
or a leaf node scheduler depending on the depth of the scheduling hierarchy
– making the decision with respect to the policy it implements. In the same
way, if the next level scheduler is an intermediate node, then it schedules the
next level scheduler – either another intermediate or a leaf node scheduler. Fi-
nally, the scheduler at the leaf node schedules the actual thread/process that
takes over the CPU until the next scheduling/pre-emption point. Note that
the complexity of this approach increases with the depth of the scheduling
tree, and it substantially increases the time required to make a scheduling
decision. For this reason, hierarchical scheduling schemes have been shown to
incur considerable overhead [31, 55, 99].
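The root-to-leaf dispatch described above can be sketched as follows. This is a minimal, illustrative model: the Scheduler interface, the fixed-priority root and the comparator-based leaf policies are assumptions made for exposition, not the design of any of the schedulers surveyed in this section.

```java
import java.util.*;

// Minimal two-level scheduling hierarchy: a fixed-priority root node
// dispatches to leaf schedulers, each applying its own policy to its
// own threads. Names and policies are illustrative assumptions.
public class HierarchicalSketch {
    public interface Scheduler {
        String scheduleNext();
        int priority();
    }

    // Leaf scheduler: orders its threads with a per-policy comparator
    // (a stand-in here for EDF/RM-style ordering).
    public static class Leaf implements Scheduler {
        private final int prio;
        private final List<String> threads;
        private final Comparator<String> policy;
        public Leaf(int prio, List<String> threads, Comparator<String> policy) {
            this.prio = prio; this.threads = new ArrayList<>(threads); this.policy = policy;
        }
        public int priority() { return prio; }
        public String scheduleNext() { return Collections.min(threads, policy); }
    }

    // Root scheduler: fixed priority over its child schedulers, then the
    // chosen child makes the final (leaf-level) decision.
    public static String dispatch(List<Scheduler> children) {
        Scheduler next = Collections.max(children,
                Comparator.comparingInt(Scheduler::priority));
        return next.scheduleNext();
    }
}
```

A deeper hierarchy simply repeats this dispatch step at each intermediate level, which is where the decision-time overhead noted above comes from.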
However, the amount of overhead and the efficiency of a hierarchical scheduler
depend on its implementation. For example, a hierarchical scheduler that uses
the SFQ [55] algorithm to schedule the intermediate nodes incurs considerable
time overhead in switching to the required scheduler at the leaf-node level:
although it provides the flexibility for various heterogeneous schedulers to
co-exist in the same system, it does so at a considerable cost in time [55].
Similarly, Vassal [31] is a multi-policy scheduling model that implements a
two-level scheduling solution in Windows NT, allowing applications to introduce
a custom scheduler into the system. The Vassal scheduling scheme
was tested with the large time quantum (greater than 1 ms) supported by
Windows NT, making it infeasible for high-resolution real-time threads [31]. Also,
only one application-defined scheduler can co-exist with the native Windows
NT scheduler. The system was put forth as a tool-kit to build and experiment
with new scheduling policies and did not address issues related to
application-specific scheduling.
The Hierarchical Loadable Scheduler (HLS) [100] is another solution, sim-
ilar to Vassal, where the schedulers are loaded into the kernel at runtime
as drivers. HLS implemented on Windows 2000 kernel imposed considerable
overhead due to context switch time. The context switch time on a 500MHz
Pentium III machine was noted to be 11.7µs in Windows 2000 with HLS as
compared to 7.10µs in the actual Windows 2000 release version [99]. It has
also been noted that HLS adds 0.96µs overhead to the context switch time for
each additional level in the scheduling hierarchy [99].
The SMART [87] scheduler uses an optimised scheduling scheme that
adapts to the working set of applications. SMART provides a time sharing
policy when no real-time threads are running. In the case where both types
of application threads – real-time and non real-time – exist in the system,
SMART uses an optimised scheduling policy [87].
The Scheduler Activations (SA) model [124] implemented in the NetBSD
OS is essentially an Application Programming Interface (API) that provides a
kernel interface and scheduler up-call mechanism (‘sa upcall()’ [124]) to sup-
port the hierarchical scheduling scheme. This model generates huge overheads
– a context switch time of 225µs on a 500MHz G3 processor – making it
unsuitable for real-time use.
MaRTE OS [103] provides an API to support application-defined scheduling.
Applications in MaRTE OS are able to introduce application-specific
scheduling policies into the system, where they co-exist in the hierarchical
structure. However, this approach has also proved to generate enough overhead
to make it slower than the traditional policies [103]. A formal proposal has been
made to include the application-defined scheduling mechanism in the real-time
POSIX standard [104].
Many other hierarchical schedulers have been proposed, such as APEX [80]
– an adaptive two-level disk scheduler for multimedia database
management systems (DBMS) – and HATS [39] – an adaptive hierarchical
scheduler built on the puppeteer [39] system for scheduling network bandwidth.
A scheduler is the main component of an RTOS which is responsible for
distributing the CPU bandwidth amongst different threads/processes in the
system. The time taken to make a scheduling decision is critical to system
performance. In most hierarchical scheduling models, along with the appli-
cation threads/processes, the CPU bandwidth is also shared amongst various
intermediate schedulers. Delays in making scheduling decisions increase the
time spent in the intermediate schedulers, thereby affecting the execution time
of application threads/processes.
2.2.2 Memory Resource
Like the CPU, memory is an important resource in an embedded
system [126]: applications cannot execute without it, and memory requirements
differ from application to application. The key is to support applications with
greater memory requirements even in memory-constrained systems. This
section provides background on the main virtual
memory technique – paging. Paging has been a topic of interest for several
decades. Though many page replacement policies have been proposed, each
has its own advantages and disadvantages. Previous work related to paging
can be classified into three categories: page replacement policies, extensible
and application-controlled paging, and compiler-assisted paging mechanisms.
Page Replacement Policies
LRU- and CLOCK-based policies are the most widely accepted and are used
in most commercial OSs, e.g. Linux [19, 54] and Mach [9, 85]. Because its
paging decisions are based purely on recency, the LRU policy fails to keep in
memory those pages that are frequently accessed over a long period of time. The proposed
improvements to LRU include: LRFU [76, 77], EELRU [111], LRU-K [44,
91], 2Q [65], and more [49]. The CLOCK replacement policy is easier to
implement than LRU and requires less book-keeping. It has been shown that
the performance of CLOCK approximates that of LRU [17].
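To illustrate the lighter book-keeping of CLOCK, the following is a minimal sketch of the classic second-chance algorithm, assuming a fixed set of page frames and one reference bit per resident page; real kernels add many refinements on top of this core.

```java
import java.util.*;

// Minimal sketch of the CLOCK (second-chance) policy: one reference bit
// per resident page; on a fault the hand sweeps, clearing set bits, and
// evicts the first page whose bit is already clear.
public class ClockSketch {
    private final int capacity;
    private final List<Integer> frames = new ArrayList<>();
    private final Map<Integer, Boolean> refBit = new HashMap<>();
    private int hand = 0;

    public ClockSketch(int capacity) { this.capacity = capacity; }

    /** Accesses page p; returns true on a hit, false on a page fault. */
    public boolean access(int p) {
        if (refBit.containsKey(p)) {            // hit: set the reference bit
            refBit.put(p, true);
            return true;
        }
        if (frames.size() < capacity) {         // free frame: no eviction needed
            frames.add(p);
            refBit.put(p, false);
            return false;
        }
        while (refBit.get(frames.get(hand))) {  // sweep: give second chances
            refBit.put(frames.get(hand), false);
            hand = (hand + 1) % capacity;
        }
        refBit.remove(frames.get(hand));        // evict the unreferenced victim
        frames.set(hand, p);                    // load the new page in its frame
        refBit.put(p, false);
        hand = (hand + 1) % capacity;
        return false;
    }
}
```

Note that recently referenced pages survive a sweep, which is how CLOCK approximates LRU without maintaining an ordered list.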
The Adaptive Replacement Cache (ARC) [86] policy builds upon LRU
eliminating some of its disadvantages. For example: unlike LRU, ARC also
captures the frequency features of the workload. Also, ARC is not polluted
by scan (a sequence of one-time use only page requests), a well-known failure
condition of the LRU policy [86].
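The scan-pollution weakness that ARC addresses can be demonstrated with a strict LRU cache. The toy model below, built on Java's LinkedHashMap in access order, shows a one-time sequential scan evicting pages that were being reused; it models plain LRU, not ARC itself.

```java
import java.util.*;

// Toy demonstration of LRU "scan pollution": a one-time sequential scan
// evicts pages that were being reused, because strict LRU tracks
// recency only. This models plain LRU, not ARC.
public class LruScanDemo {
    /** Replays an access trace through a fixed-capacity LRU page set
        and returns the pages still resident afterwards. */
    public static Set<Integer> residentAfter(final int capacity, int[] accesses) {
        LinkedHashMap<Integer, Boolean> lru =
            new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) { // access-order map
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > capacity;   // evict the least-recently-used page
                }
            };
        for (int page : accesses) lru.put(page, Boolean.TRUE);
        return new HashSet<>(lru.keySet());
    }
}
```

The frequently reused pages are displaced by pages that will never be touched again – precisely the failure condition ARC's frequency tracking avoids.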
CAR [17] (i.e. CLOCK with Adaptive Replacement) combines the advantages
of the CLOCK and ARC [86] policies. However, in both ARC and
CAR, two consecutive page hits are enough to pass the test for a page's
long-term utility, even though the page may never be used again [17]. Most
file-handling applications (e.g. file search, databases, etc.) access the same
pages in quick succession and never access them again.
CART [17], an extension of CAR with a temporal filter, improves upon
this defect of CAR/ARC. It uses four page lists – T1, T2, B1 and B2. Lists
T1 and T2 contain pages currently present in memory, whereas lists B1 and
B2 maintain history information about pages that have been recently
reclaimed. Pages in T1 are considered to have short-term utility while pages
in T2 have long-term utility. CART [17] imposes more stringent constraints
than CAR/ARC when deciding a page's long-term utility.
CLOCK-PRO [63] is an improved version of CLOCK combining the advantages
of CLOCK and the LIRS [64] policy; the latter was proposed for better
buffer-cache performance. CLOCK-PRO maintains a circular list of pages with
three clock hands. HAND_hot points to the hot page (a page which is newly
allocated or recently accessed) with the largest recency; any hot pages swept by
this hand turn into cold pages (not recently accessed). HAND_cold points
to the last resident cold page (i.e. the one furthest from the head of the list).
HAND_test points to the last cold page which is in its test period; this hand
is used to terminate the test period of cold pages, and the non-resident cold
pages swept by it leave the circular list for reclamation.
In addition to the above, several other page replacement policies have been
proposed in the past. The next subsection discusses some existing extensible
and application-controlled paging mechanisms offered by OSs.
Extensible and Application-Controlled Paging
In a µ-kernel [120], system modules such as the CPU scheduler, memory man-
ager, etc. can be executed as independent user-space processes. Many ex-
tensible memory management solutions make use of the µ-kernel architecture
to extend the existing paging mechanism, either to use a different policy or to
implement an application-specific one.
The Mach [9] OS provides the user with a certain level of control over pag-
ing of the application concerned. The external pager interface of Mach allows
applications to use their own functions for moving pages to and from the sec-
ondary storage or the swap space. However, Mach does not allow applications
to choose their own page replacement policies.
McNamee et al. [85] extended Mach's external pager interface to allow
applications to use their own page replacement policies. These new pagers,
called the PREMO (Page REplacing Memory Object) pagers, executed in ap-
plication address space. Every virtual memory region allocated in the system
is represented as a memory object. Each PREMO pager is responsible for the
pages belonging to one or more such memory objects. On a page-fault, the
Mach pager uses a global policy to select a page for reclamation and checks
if the selected page belongs to one of the memory objects associated with a
PREMO pager. If true, then the selected page is put back into the page list
and control transferred to the PREMO pager. The PREMO pager would then
return a page from a memory object it governs. Finally, this page is reclaimed
by the Mach pager. The method is shown to add considerable communication
overhead in the system [85].
The VINO [45] OS enables applications to override some or all operations
within the MemoryResource objects to specialise their behaviour. The ‘appli-
cation kernels’ in the V++ Cache kernel [35] are allowed to cache address-
space objects and handle page-faults on these objects.
Applications in Aegis [46], an implementation of the exokernel [47], use
‘library operating systems’ instead, to implement their own memory man-
agement sub-system. The self-paging mechanism in Nemesis [57] presents a
mechanism mainly targeting continuous media applications. An application
in Nemesis [57] is responsible for handling all of its page-faults on its own.
Nemesis provides Quality of Service (QoS) guarantees in terms of the amount
of guaranteed physical memory space and the disk bandwidth available for
the requesting application. On a page-fault, the context is saved in the ap-
plication domain and later handled by the application responsible when it is
scheduled. Other application-controlled paging solutions [18, 37, 58, 73] are
similar to Nemesis [57] in that the applications control all the paging activity
including page-fault handling.
SPIN [22] is an extensible µ-kernel that is capable of loading user-defined
extensions called Spindles (SPIN Dynamically Loadable Extensions) as and
when required by the applications. Spindles are implemented using Modula-3,
a safe programming language, to ensure safety of the system.
Resources in SPIN are managed at two levels – (1) the primary system al-
locator looks after the major system resources such as memory, CPU, etc. and
(2) the secondary user allocator manages the resources already allocated by
the system allocator [22]. SPIN [22] provides user-level extensions for paging
by allowing registration of an event handler for memory management events.
The L4 [78,125] µ-kernel supports recursive construction of address spaces.
On initialisation, a user-level pager in L4 takes hold of the entire physical
memory, σ0. This pager then allocates address spaces to tasks on a first-come-first-served
basis. It can also divide the address space it holds, σ0, into two
parts, σ1 and σ2, such that it remains responsible only for σ1 and has
another pager look after σ2.
Extensible and application controlled paging solutions tend to complicate
the generic page replacement code of an OS and add considerable overhead to
the system [18, 37, 57, 58]. Such solutions have several side effects: some rely
completely on the application programmer to handle paging accurately, and
some impose a performance penalty on the normal operation of the OS's paging
policy, which affects other applications not using the scheme [37, 45, 47, 58].
There is a trade-off between the OS handling everything for the applications
and allowing applications to perform some of the OS operations (paging).
Several critical applications such as databases, RAID servers, garbage col-
lectors in virtual machines, etc. tend to implement their own paging mecha-
nisms due to insufficient support from the OS [57,109,119]. The next subsec-
tion discusses some existing compiler assisted paging mechanisms.
Compiler Assisted Paging Mechanisms
The compiler assisted memory management policy in [82] analyses code, at
compile time, for loops consisting of accesses to array-based data. The com-
piler then inserts primitives LOCK, UNLOCK and ALLOCATE into the com-
piled code to control allocation of memory space for the respective arrays at
run-time. This method assumes that the underlying OS supports allocation
of memory on demand and can lock/unlock pages in memory dynamically at
run-time.
More recently, Brown et al. [26] proposed a similar compiler-assisted paging
solution that uses compiler-inserted prefetch and release hints to manage
physical memory more intelligently. The main focus of this work is on the
insertion of hints into application source code. A run-time layer queues all
hinted requests, for either prefetch or release operations, and later passes them
on to the OS. It is assumed that the OS already supports prefetch and release
operations on memory pages. It is shown that this method adds considerable
overhead to the system increasing the application execution times [26].
A case study presented in this thesis for virtual memory management uses
similar techniques. The approach presented in this thesis uses a reflection
framework to obtain information about application memory access patterns
and accordingly prefetch or release pages from/to the swap space. There are
two advantages to this approach: firstly, it avoids the bottleneck of queued
requests as in [26]; secondly, only the most recent information is used to
prefetch or release pages.
Linux [19] provides memory lock and unlock primitives to certain privileged
applications in the form of ‘mlock()’ and ‘munlock()’ system calls. Applica-
tion processes use ‘mlock()’ to lock a range of memory in the virtual address
space such that, no matter which policy is implemented, the physical pages
mapped to these virtual addresses will not be reclaimed by the page replace-
ment code. Under Linux, the mlock() primitive only locks the virtual memory
pages associated with the application process making it possible for a physical
page to move within different page-lists (e.g. from active to inactive page-
list) [54]. Just before paging-out, pages are reverse mapped to their virtual
pages to check if they have been locked [54]. This process adds considerable
overhead on the page reclamation process.
Linux also implements a system call called ‘madvise’ which can be used
by applications to notify the OS of their probable memory access patterns.
The implementation of ‘madvise’ in the Linux kernel uses such notification
mainly to tune the extent of disk read-ahead pages. In Linux, each new access
to a page in secondary storage (e.g. the disk) makes the kernel read ahead
several pages into its cache for future use. Applications with sequential
memory access patterns typically have iterative, loop-based memory accesses;
read-ahead tuned via ‘madvise’ can still cause unnecessary page reads, for
example when an application reaches the end of its data-access loop and starts
accessing the memory pages from the beginning again.
Other work [32] in this area involves the use of various loop transformation
techniques such as loop permutation, loop fusion, loop distribution, etc. to
achieve data locality in terms of both temporal and spatial reuse of cache lines.
The next section discusses some existing OS specialisation techniques.
2.3 Operating System Specialisation
Specialisation makes either the entire OS or certain of its resource management
policies adapt to application-specific requirements. This provides better
support to applications and helps them achieve better performance.
A general purpose OS implements generic resource management policies
that do not support all applications alike. For example, consider that an
OS implements a high-performance graphics algorithm to support graphics
intensive applications. This functionality would rarely be used by a non-
graphical application.
Specialisation of an OS also depends on the OS architecture. A monolithic
kernel contains all system modules, statically compiled, to form a single chunk
of code. This provides less flexibility for OS specialisation. Addition of nu-
merous different kinds of resource management policies into the kernel would
make the OS code larger. However, modern monolithic OSs such as Linux
are more modular in nature, allowing additional functionality to be added at
runtime by dynamically loading the required modules into the kernel.
On the other hand, a µ-kernel can be easily specialised. At runtime, the
system modules can either be changed, replaced or extended as and when
needed. The L4 µ-kernel provides recursive layers of abstraction, which is
good for specialisation [78]. An exokernel [47] gives the applications complete
control over the OS policies making it ideal for specialisation. However, the
redundant OS libraries impose unnecessary memory overhead in the system.
An exokernel can support all kinds of OS specialisations.
Apart from using the inherent features of the kernel design, it is possible to
use some external techniques to dynamically specialise an OS. Policy-based
resource management specialisation techniques divide the bigger problem into
smaller chunks that can be dealt with individually. The next subsection pro-
vides more insight on the specialisation of OS policies.
2.3.1 Specialisation of OS policies
Each resource in a system is different and needs to be managed differently.
However, information pertaining to one resource might also be useful for man-
aging another resource.
There are numerous CPU scheduling policies proposed in the past: Rate
Monotonic (RM), Earliest Deadline First (EDF), Round Robin (RR), etc. A
scheduling policy may or may not be the most suitable policy for a particu-
lar kind of workload. Specialising this policy depending on the information
obtained at runtime will provide a mechanism to dynamically adapt the pol-
icy for better application support. The applications may want to choose the
scheduling policy themselves. Furthermore, the applications may desire that a
custom built scheduling policy be used by the OS. The possibilities are endless.
However, at some point a decision has to be taken on how far a policy
can be specialised. This is called the granularity [41] of specialisation: the
more fine-grained the control applications have over the specialisation aspects,
the more specialisable the OS is. Specialisation of a policy should be designed
in a way that suits the requirements correctly. If designed properly, any or all
resource management policies of an OS can be specialised.
The approach taken by the Infokernel [13] is to transform OS policies into
mechanisms. The Infokernel gives out information to the applications about
the policies it provides. The applications make use of this information to adapt
themselves in order to gain optimal performance from the OS policies [13].
Real-time applications have stringent resource requirements. Adapting the
OS policies to meet these requirements will provide better application support.
However, the Infokernel takes a different approach: rather than adapting the
OS policies to suit application requirements, it changes/adapts the applications
themselves. The adapted applications no longer adhere to their original
resource requirements and could potentially exhibit completely different
behaviour. This thesis aims to retain the original application behaviour and
focuses on improving the existing resource management support by adapting
the OS policies.
The reflection mechanism [112, 115], often found in programming languages,
can be considered a specialisation technique which, when used in an OS context,
might provide better support for dynamic OS adaptation. The next section
describes reflection in more detail.
2.4 Reflection Mechanisms
A conventional program runs through a predefined, deterministic execution
path; any behavioural change to this path requires the code to be changed,
recompiled and executed again. The ability of an application program
to examine itself at runtime is called ‘self-awareness’ or ‘introspection’ [105].
Using introspection, an application can query its status, check data structures,
etc. at runtime. The mechanism by which an application becomes ‘self-aware’
and changes itself accordingly – either to change its behaviour or to improve its
performance – is called Reflection [112].

Figure 2.3: Tower of Reflection (Reproduced from [81,105])
In order to achieve reflection, an application needs to be aware of many
aspects of its design and implementation, e.g. its data structures, language
constructs/semantics, runtime support system (or virtual machine). The pro-
cess by which this information is made available to an application is called
reification [105].
Reflective systems are generally made up of a ‘base-level’ component and
one or more ‘meta-level’ components or entities operating one above the other
(see figure 2.3). The base-level represents the application program code, with
the meta-level being a model of the base-level that analyses the reified in-
formation. One meta-level component can have further meta-levels above it
resulting in a reflection tower (see figure 2.3).
Generally, the meta-levels are causally connected with each other, such
that a change made by one component is reflected everywhere. Using causal
connection, it is possible for a meta-level to change the behaviour of an ap-
plication without the knowledge of its base-level component. The meta-level
achieves this by intercepting and changing the behaviour of certain function
calls to/from the base-level. The change could be in the form of changing the
value of a data structure or changing the base-level code itself.
For example, consider a check-pointing approach to fault-tolerance. This
functionality can be brought into a system by introducing a suitable meta-
level entity [105]. The calls to all write operations on the check-pointed data
objects are intercepted by the meta-level which then performs the actual check-
pointing of the data-object: storing a copy of the data elsewhere. Once this is
done, the write operation continues as expected in the base-level. The result
as far as the application is concerned is a write operation to the data-object –
it is unaware of the check-pointing. Note that using reflection at runtime, it is
possible to dynamically change – the objects that are check-pointed, the fault-
tolerance mechanism used, etc. – all without the knowledge of the application’s
base-level component.
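In a language such as Java, this style of meta-level interception can be approximated with a dynamic proxy: every call to a write operation is intercepted, the old value is check-pointed, and the call then proceeds as normal, without the caller's knowledge. The Store interface and its method names below are hypothetical, chosen only for this sketch.

```java
import java.lang.reflect.*;
import java.util.*;

// Sketch of meta-level interception for check-pointing, using a Java
// dynamic proxy. The Store interface and its methods are hypothetical
// names used only for this illustration.
public class CheckpointDemo {
    public interface Store {
        void write(String key, String value);
        String read(String key);
    }

    public static class MapStore implements Store {
        private final Map<String, String> data = new HashMap<>();
        public void write(String key, String value) { data.put(key, value); }
        public String read(String key) { return data.get(key); }
    }

    /** Wraps a Store so that every write() first saves the old value;
        the base level remains unaware of the check-pointing. */
    public static Store checkpointed(final Store base, final Map<String, String> checkpoints) {
        return (Store) Proxy.newProxyInstance(
            Store.class.getClassLoader(),
            new Class<?>[] { Store.class },
            new InvocationHandler() {
                public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
                    if (m.getName().equals("write"))        // intercept write operations only
                        checkpoints.put((String) args[0], base.read((String) args[0]));
                    return m.invoke(base, args);            // then proceed as normal
                }
            });
    }
}
```

Swapping the handler at runtime would change which objects are check-pointed, or the mechanism used, without touching the base-level code.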
Depending on the information reified and the level of change, reflection can
be classified into two main types:
• Structural Reflection,
• Behavioural Reflection.
Structural Reflection: is the ability of a programming language to provide
reification of the program structure including any abstract data structures.
For example, a meta-level entity in structural reflection can query all
components/objects of a class, add/delete objects, or even change their
data-type [81]. Structural reflection was first introduced in logic programming
for languages such as Smalltalk-80 [50] and LISP [81].
Behavioural Reflection: is the ability of a programming language to
provide reification of the language semantics and implementation along
with the data and implementation of the runtime system [81]. Behavioural
reflection is difficult to achieve. The meta-level has complete control over the
base-level to bring about any change to aspects such as the way functions
are called, the values of data being written or read, etc. The
next sub-section describes the use of and support for reflection in programming
languages.
2.4.1 Reflective Programming Languages
The reflection mechanism originated in programming languages such as
Smalltalk-80 [50], CLOS and LISP, and many modern programming languages
have been extended to support reflection [81]. For example, extensions
(in the form of library packages) to modern programming languages such as
Ada [106], Java [36] and C++ [106] have been developed to provide support
for reflection [105]. Java, for instance, has a reflection API that provides
facilities to introspect the data structures (e.g. a class) used in a program.
However, Java’s ability to alter program behaviour is very limited: through
the API one can only get/set a field, invoke a method, or instantiate a new
class [36].
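These introspection facilities can be seen in a few lines using only the standard java.lang.reflect API; the Sensor class is a made-up example.

```java
import java.lang.reflect.*;

// Introspecting and manipulating a class at runtime with the standard
// java.lang.reflect API. The Sensor class is a made-up example.
public class IntrospectionDemo {
    public static class Sensor {
        public int threshold = 10;
        public boolean above(int reading) { return reading > threshold; }
    }

    public static boolean probe() {
        try {
            Sensor s = new Sensor();
            Class<?> c = s.getClass();
            Field f = c.getField("threshold");           // look up a field by name
            f.setInt(s, 50);                             // ...and set it reflectively
            Method m = c.getMethod("above", int.class);  // look up a method by signature
            return (Boolean) m.invoke(s, 60);            // ...and invoke it: 60 > 50
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that the API reads and invokes existing members but cannot, for instance, add a method or rename a class – the limitation the systems below set out to remove.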
To overcome these limitations and to provide more reflective support, OpenJava
[106] was introduced. The OpenJava compiler is essentially a macro-translation
parser that translates OpenJava source code into regular Java source
code that exhibits reflection [106]; thereafter, it uses the facilities provided
by the Java Virtual Machine (JVM). OpenJava is considered a result of
the lessons learnt from OpenC++ [106].
OpenC++ [106] uses a low-level parse tree approach instead of using
OpenJava’s strong typed object interface to the syntactic structure of the
source [106]. It supports compile time structural reflection while behavioural
reflection is supported through meta-classes written by meta-level program-
mers.
OpenJIT [84] is a reflective Java just-in-time (JIT) compiler. The OpenJIT
compiler allows class-specific customisations. Mostly written in Java with a
few small Java Native Interface (JNI) stubs for JVM introspection, OpenJIT
also has a few C-level runtime routines. The OpenJIT compiler checks and
modifies itself during execution of a Java application, thereby adapting to the
runtime application-specific requirements [84]. Since most of it is written in
Java, it imposes performance overheads due to the extra level of interpretation
involving the JVM.
Another interesting development for reflective support in Java was intro-
duced in the form of a class library called Javassist [36]. Javassist supports
load-time structural reflection in Java. Javassist takes a simple approach,
providing sets of classes to: read compiled Java byte-code; create new
byte-code; add or change the methods or the name of a class in the compiled
byte-code; and, finally, load the compiled byte-code into the JVM for
execution. Once loaded into the JVM, a class cannot be changed thereafter.
To illustrate the potential of this approach, a simple example is presented
(as quoted in [36]). Consider a class Calendar that implements an interface
Writable provided by a third party as shown below.
class Calendar implements Writable {
    public void write(PrintStream s) { ... }
}
The class Calendar implements method write declared in the interface
Writable. Suppose that the third party changes the class name Writable to
Printable and the method name from write to print. This would necessitate
changing the Calendar class code as follows:
class Calendar implements Printable {
    public void write(PrintStream s) { ... }
    public void print() { write(System.out); }
}
In a real-world scenario, a change like this might mean changing huge
amounts of code, which can be impractical. If Java supported structural
reflection, it would be possible to change the interface name to Printable and
make similar changes to the method write. The Javassist class library allows
this to be done at class-loading time.
Reification and reflection in Javassist are done by creating an object of
CtClass (compile-time class), which can read byte-code from a compiled Java
class file. The CtClass object provides methods such as toBytecode(),
addMethod(), addField() and setBody() to generate new byte-code, add a
new method, add a new field to a class, and change the body of an existing
method, respectively [36].
Behavioural reflection in Javassist is implemented using software hooks.
Intermediate hooks are inserted into the methods of a reflective class. When
a hooked method is called, the call is intercepted by the hook and is
then handled by a meta-level class, which might change the behaviour if
required [36]. Similar to Javassist, another system, called linguistic reflection,
was developed; although it allowed dynamic creation of a new class, it did
not allow changes to an existing class definition [36].
OpenAda [106] provides compile time structural reflection to the standard
Ada 95 programming language. OpenAda makes use of the pragma Metaclass
specification construct in the application source code – files with extension .oa
– that specifies to the compiler which type or object or method to translate
for reflection [106]. A simple example (as quoted in [106]) would be:
pragma Metaclass(Verbose.Object);

with OpenAda.MOP;
...
procedure Verbose is
   type Object is new OpenAda.MOP.Class with private;
   ...
   procedure Translate_Procedure_Body
     ( This    : in out Object;
       Input   : in out OpenAda.Syntax.Procedure_Body.Node;
       Control : in out OpenAda.Syntax.Visitation_Controls );
   ...
private
   type Object is new OpenAda.MOP.Class with null record;
end Verbose;
This example depicts an overriding of procedure Translate_Procedure_Body
inherited from type Class. OpenAda provides the programmer with several
packages such as OpenAda.Syntax, OpenAda.MOP (Meta-Object Protocol),
etc. to support reflection [106]. It makes use of Dynamic Link Libraries
(DLL) in Microsoft Windows OS and Shared Object Libraries (*.so files) in
UNIX variants to support dynamic loading of the meta-classes. To achieve
behavioural reflection, OpenAda provides a simple set of packages that allow
the behaviour of a method to be changed at runtime. The methods provided,
Reflect and Reify, help with interception and introspection
when required [106].
The OpenAda compiler translates the OpenAda source code into standard
Ada 95 compatible source code which the user can compile and execute using
any standard Ada compiler. However, this may introduce certain limitations to
the features that depend on the underlying Ada compiler. The next subsection
describes some existing reflective middlewares that allow dynamic changes to
their services.
2.4.2 Reflective Middlewares
A middleware is a software layer that sits between the applications and an
OS, mediating interactions between them. Middlewares provide a standard
interface to the applications by hiding complex details of the underlying OS in-
terface. The complexity includes features such as remote method invocation,
network communication protocols and cryptography [70]. Middlewares are
generally deployed in distributed computing environments and network com-
munication systems. Existing middleware technologies include CORBA [127],
Java-based J2EE [4] and the .Net framework [5].
Reflective middlewares [70] use a reflection mechanism to adapt their ser-
vices in order to accommodate changing requirements of applications. Such
middlewares use reification and meta-level components to bring about fine-
grained changes to their services. They also provide programmers with an
interface to explicitly control a change to a specific service.
Most reflective middlewares provide support for interception. Interceptors
are used to support added functionality such as fault tolerance, cryptography,
runtime monitoring and the collection of system statistics. Some of the reflective
middlewares proposed in the past include DynamicTAO [71], Open ORB [24],
OpenCORBA [75] and mChaRM [34].
DynamicTAO [71], an extension of the C++ TAO Object Request Bro-
ker (ORB) [71], allows runtime reconfiguration of the ORB's internal engine
and of the applications using the ORB. It uses ComponentConfigurators
to represent the dependency relationships between the different ORBs, the ORB
components and the application components. On receiving a request to re-
place a component in the system, the middleware checks its dependencies
on the other components using the attached ComponentConfigurator. Dy-
namicTAO allows runtime loading and unloading of modules by exporting a
meta-interface.
The Open ORB [24] middleware was developed independently at the same
time as DynamicTAO. Its main aim was to support applications with
dynamic requirements. The Open ORB platform can be configured to include
appropriate components using a component model which allows hierarchical
composition and distribution.
In Open ORB, the base-level consists of the components implementing the
normal middleware services while the meta-level exports these implementa-
tions to the programmer to enable inspection and adaptation. A base-level
component can have its own private set of meta-level components which are
referred to as the component’s meta-space. Each meta-space is further parti-
tioned into various meta-space models that provide different views of the plat-
form implementation and can be independently reified. Open ORB defines four
meta-space models grouped according to the distinction between structural
and behavioural reflection. The Interfaces and Architecture meta-space mod-
els support structural reflection whereas the Interception meta-space model
supports behavioural reflection. Prototype implementations of the Open ORB
have tested the suitability of the architecture for distributed multimedia ap-
plications. Initial experiments indicated that Open ORB performed on a par
with Orbacus [24], a commercial ORB, and around 10% slower than GOPI [24].
OpenCORBA [75] adds reflection support to standard CORBA. It has been
implemented in NeoClasstalk [101], a Smalltalk-like [50] reflective language
based on meta-classes. In OpenCORBA, the behaviour of a CORBA service
is changed by replacing the meta-class of a class that provides that service.
Quarterware [110] is another reflective middleware platform that provides
a component framework for the ORB mechanisms. With the use of a reflective
interface, the programmers can plug custom components into the framework.
Quarterware supports multiple middleware standards such as CORBA, Java
RMI (Remote Method Invocation) and MPI (Message Passing Interface).
The multi-Channel Reification Model (mChaRM) [34] is a reflective mid-
dleware which enables explicit control over multi-channel communication using
a communication-based reification approach. The model allows interception of
calls to the methods in communication channels in order to inspect and adapt
their structure or behaviour.
In general, reflective middlewares make extensive use of interception, in-
tercepting calls from or to the applications to bring about the required change.
Many reflective middleware techniques have since been adopted into standard
middlewares. For instance, CORBA includes a standard for portable inter-
ceptors and Java includes the Core Reflection API. These architectures are
suitable for distributed computing, which requires application portability. The
use of such middlewares for efficient application-specific resource management
has not been fully exploited.
However, the middleware layer between the OS resource management and
the applications adds a level of indirection to the system. Fur-
thermore, middleware provides a standard interface to the applications and
handles all the complexity pertaining to the low-level OS interface which dif-
fers from one OS to another. With each OS implementing different policies,
middlewares are limited by what features the underlying OS can provide. The
next subsection discusses some of the existing reflective OSs.
2.4.3 Reflective OSs
To accommodate reflection in OSs, a direct analogy can be drawn with the
implementation of reflection in programming languages. Essentially, an OS should
provide a mechanism by which reflection is achieved. This additional func-
tionality in the OS may introduce some overhead into the system. However,
this overhead could be justified by the additional flexibility provided to the
application and the resulting performance gain. The overhead should be zero
or minimal if applications do not make use of reflection.
For systems that wish to provide an efficient fine-grained dynamic
adaptation mechanism, a reflective OS should allow each OS module to have
its own set of meta-level entities and to share the associated information and
functionality. Individual functionality allows distinct policies for different
applications (e.g. distinct scheduling policies), whilst shared function-
ality allows shared facilities (e.g. efficient IPC). The following subsections
discuss existing reflective OSs: ApertOS [128], Chameleon [27],
2K [33, 69, 72] and Spring [115, 116].
ApertOS
ApertOS [28,59,128,129] is one of the first-generation object-oriented reflective
OSs and was designed particularly for use in mobile and distributed comput-
ing environments. It implements a reflective object-oriented framework that
provides support for object migration. The framework introduces the concept
of separating an object from its meta-object, implemented in ApertOS
particularly to aid object migration. Here, an object is consid-
ered to encapsulate: a state, some methods which access its state and a virtual
processor which executes its methods. A meta-object is an object which de-
fines the behaviour of a particular object. For instance, the virtual processor of
an object can be viewed as a meta-object.
In the reflective framework, everything that is shared and protected is
an object. Each object belongs to a particular meta-space consisting of one
or several meta-objects. Figure 2.4 (reproduced from [128]) shows the rela-
tionship between various objects, meta-objects and the respective meta-spaces
they belong to. As an object evolves through its lifetime, its requirements
change. If the meta-space that it belongs to does not support the new require-
ments, then the object can migrate to a different meta-space that provides
the required support. This is particularly useful in the mobile communication
environment where, at one instant, an object might be using a local protocol
for communication and, at the next, might need to use an inter-connection
protocol.

Figure 2.4: Object/Meta-Object Separation and Meta-Hierarchy [128]
The aim of ApertOS is not to adapt its resource management policies, but
to provide support for objects in the system (e.g. the application processes)
to choose the required policies by selecting and migrating to the respective
meta-spaces. Thus, every system module in ApertOS is implemented as a
meta-object and belongs to one or more defined meta-spaces.
In a way, ApertOS itself can be considered a large object using multiple
meta-spaces, each consisting of multiple meta-objects. These meta-objects use
other meta-objects forming a meta-hierarchy. For instance, a meta-object
which implements segmentation in virtual memory, uses another meta-object
which implements paging. The paging meta-object would in turn use a meta-
object which implements the physical memory management.
Objects in ApertOS can migrate to a different meta-space using the
“canSpeak()” method. Each object executing in the system is associated with
a context. ApertOS provides a standard means to compose individual execu-
tion environments for each application. The existence of objects, their state
and object migration are handled by a core module called the MetaCore. The
MetaCore does not belong to any meta-space; it forms the main communication
bridge between the objects and the meta-objects within different meta-spaces.
ApertOS was implemented for the SONY PWS1550 and MC68030 pro-
cessors. The evaluation showed that it spent 40% of its processing time in
saving, finding and restoring system context [128], i.e. the overhead
for reflection in ApertOS was high. Also, ApertOS allowed only a single re-
flective module per meta-level, preventing multiple applications from having
different reflective functionality. Nevertheless, ApertOS provided a new way
of dynamically specialising an OS [128].
Bryce et al. [28] introduced a new pre-emptive hierarchical scheduler,
replacing the existing non-preemptive one in ApertOS to improve its perfor-
mance. It was shown that application performance improved by up to five
times [28].
Chameleon
Chameleon [27] is an object-oriented OS that shares the same philosophical
approach as ApertOS. Based on a µ-kernel architecture, it was mainly designed
for soft real-time multimedia applications. In order to provide better adapt-
ability, Chameleon introduced new concepts such as AbstractCPU, brokers,
and the broker interface hierarchy. Furthermore, techniques such as dynamic
class binding served as a basis for all system modules. Chameleon incorporates
an event-driven model that allows new events to be defined and dynamically
introduced into a running system.
Similar to ApertOS, Chameleon has a hierarchical meta-object structure
wherein the meta-objects actively communicate amongst each other to support
reflection. Consequently, Chameleon exhibited overheads similar to those
associated with ApertOS.
2K
2K [33, 69, 72] is a reflective, component-based distributed OS that uses a re-
flective ORB – DynamicTAO [71] – for dynamic customisation. It incorporates
a middleware layer to admit on-the-fly customisation by dynamically loading
new components into the system. The system software includes models of its
own structure, state and behaviour by using reification. This allows the sys-
tem components to access the system state and check if they need to adapt.
The reflective ORB model provides code update mechanisms to allow dynamic
replacement of system and application components [72].
2K adopts a network centric model in which all the entities, users, the
various system components and devices exist in a network. Each entity has a
network-wide identity, profile and dependencies upon other network entities.
When configuring a particular service, the entities constituting that particular
service are assembled together.
The system configures itself automatically and loads a minimum set of com-
ponents required for executing the user applications. Any further components
are downloaded and configured from the network as and when required. The
philosophy is based upon a “what you need is what you get” (WYNIWYG)
model [69, 72].
In order to achieve this, 2K reifies inter-component dependency. The sys-
tem and the application components need to fulfil an explicit representation
requirement before they can execute. For example, an Internet browser could
specify that it depends upon components implementing an X-Window system,
a local file service, the TCP/IP protocol, and the Java virtual machine [72].
The main motivation for 2K was to manage variation in the environment
(e.g. fluctuations in network bandwidth, connectivity, protocols, error rate)
and the evolution of software and hardware (e.g. software version updates and
hardware reconfigurations) [72]. Adaptation in 2K is driven by environmen-
tal and system software or hardware changes and not by application-specific
requirements. The 2K OS is essentially an OS with a built-in reflective middle-
ware framework (i.e. Dynamic TAO). The dynamic customisation takes place
in the middleware layer.
Spring
Spring [116–118] is a distributed network OS developed to work in a networked
multi-processor environment. Spring uses certain properties of reflection, but
it cannot be considered a completely reflective OS. Reflection in Spring
is used to share information and to represent the system state at any
given time. After prior analysis, information pertaining to an application's
characteristics (e.g. deadline, period, etc.) is placed in the process control
blocks (PCB) of the corresponding application process.
Spring researchers developed three integrated languages for its support.
First, in order to efficiently specify the reflective information within the ap-
plications, high-level programming languages – Spring-C [88] and Real-Time
Concurrent C (RTCC) [52] – were developed. These languages allowed pro-
grammers to specify reflective information such as period, deadline, etc. Each
application in Spring must be programmed in either Spring-C or RTCC.
Second, the System Description Language (SDL) [89] was designed and
implemented to provide the implementation details needed for detailed and
accurate timing analyses. SDL is used to
specify details such as the nodes in a network, the memory layout of the
system, the bus characteristics, etc.
Third, a notational language – Fault Tolerant Entities for Real-Time
(FERT) [25] for specifying fault-tolerant requirements on a task-by-task ba-
sis was designed. FERT allows the designer to treat each FERT object as
a fault-tolerant entity with protection boundaries. Initially, the FERT ob-
jects have no timing and redundancy constraints. The designer then specifies
a set of application modules as part of a single entity. These modules rep-
resent the user-level code for redundant operations. Furthermore, a FERT
designer can also specify one or more adaptive control policies which inter-
act with the scheduling and analysis algorithms (both off-line and online)
to provide dynamic guarantees.
Spring does not provide any mechanism for dynamic adaptation of OS
policies. It encapsulates the static application requirements into the process’s
PCB and does not accommodate any dynamic changes to these requirements.
2.5 Summary
This chapter discussed resource constraints in real-time embedded systems
along with existing OS resource management and specialisation techniques.
The CPU and memory were the two main system resources discussed in this
chapter. Efficient management of these resources by an RTOS is the key to
providing better application support. There exist several resource management
policies for managing the CPU and memory. However, each policy has its own
failure or inefficient-use scenarios. Most policies are generic in nature, provide
average-case support and do not adapt to application-specific requirements.
The OS specialisation techniques help customise certain parts of an OS such
as resource management policies by adapting them to meet the application-
specific requirements. Most techniques are static in nature, i.e. the system is
statically adapted to the given requirements without considering the dynamic
application requirements.
The reflection mechanism, mainly found in programming languages [105],
can be considered a specialisation technique that can help bring about
dynamic changes to OS policies. A reflection mechanism can help an OS
dynamically adapt to the application-specific requirements at runtime.
The use of reflective middleware technology provides applications with an
easy-to-use interface while also allowing dynamic customisation. Most reflective
middlewares provide support for distributed and pervasive computing. They
do not focus on providing application-specific resource management.
On the other hand, reflective OSs such as ApertOS [128], Chameleon [27],
etc. provide support for reflection within the OS itself allowing the system
to undergo changes at runtime. The OS is divided into several reflective ob-
jects which interact with each other using a meta-object protocol. Objects,
including application processes, are grouped to form meta-spaces. An object
in meta-space A can migrate to meta-space B if the meta-space B implements
a feature required by the object. This feature could be a resource management
policy or the implementation of a specific algorithm.
Existing reflective OSs do not provide explicit support for application-
specific resource management. Furthermore, by allowing dynamic customisa-
tion of all components, they increase system complexity and thereby the
overhead due to reflection.
In order to provide application-specific resource management in systems
with constrained-resources, an OS should adapt or change its policies at run-
time according to the application requirements. Most OS specialisation tech-
niques [37, 48] support customisation of a single resource. However, resources
are often inter-dependent, such that a change made to one resource's
management policy could affect another.
The reflection mechanism provides support to bring about dynamic changes
in OS policies. Existing reflective approaches are too complex and focus on
issues other than application-specific resource management. There is a need
for a reflection-based mechanism in an OS that can adapt or change the resource
management policies of the OS according to runtime application require-
ments.
Chapter 3
Reflection in RTOS for Efficient Resource Management
This chapter proposes the generic reflective framework for an RTOS. The
reflection mechanism is modified for use in the context of an RTOS such that
it has little or no overhead in the system. The framework allows fine-grained
changes to the RTOS’s resource management policies by obtaining application-
specific resource requirements from applications and the system modules alike.
This helps build efficient and adaptive resource management modules that
dynamically adapt/change their behaviour according to application-specific
requirements. Also, the implementation and evaluation of a prototype µ-kernel
– DAMROS [95,96], as an instantiation of the framework, is described.
The chapter is organised as follows: the next section discusses existing
properties of the reflection mechanism and presents modifications to it for use
in an RTOS context. This section describes the process of reification, the role
of the kernel, categorisation of information and the in-kernel reflection inter-
face. Section 3.2 presents the generic reflective framework for an RTOS using
the modified reflection mechanism. Section 3.3 describes the implementation
of a prototype µ-kernel – DAMROS [96] along with two example reflective
system modules: a reflective CPU scheduler and a reflective virtual memory
manager. Finally, section 3.4 presents experimental results of applications
using the reflective framework and the two example reflective system modules
implemented in DAMROS.
3.1 Modifications to Reflection Mechanism
On the one hand, the RTOS needs to identify application resource require-
ments and accordingly adapt its resource management policies. On the other
hand, applications need a mechanism to specify their resource requirements to
the RTOS. A reflection mechanism helps bring about dynamic changes in the
behaviour of the RTOS policies and establishes an information exchange path-
way between applications and the RTOS. The approach taken is not to make
the entire RTOS reflective; rather, only the required resource management
modules or applications are made reflective. Furthermore, a resource management
module or an application may choose not to be reflective at all.
Most implementations of reflection mechanisms, in both programming lan-
guages and reflective OSs, use implicit reification [27, 36, 72, 106, 128],
i.e. the mechanism implicitly reifies anything and everything
in the system. This generates an enormous amount of information at runtime,
imposing considerable overhead on the reflective subsystem.
In reflective OSs [27,72,128], the mechanism is implemented with the inten-
tion of transparently allowing dynamic changes to all components in the system.
By default, every component in the system, either a resource management
module or a device driver, is part of the reflective mechanism in the OS. The
main goal is to provide utmost flexibility for runtime changes to the system
and not to provide efficient application-specific resource management support.
Such a fully-fledged implementation of reflection has significant benefits in
terms of the flexibility offered, but it is not essential for efficient resource
management. This thesis aims to use minimal properties of reflection in a way
that is applicable only to the participating OS resource management modules
and the applications, such that the mechanism introduces little or no overhead
to the ones that do not participate.
Reflection depends heavily on the reification of information between the
various base-level and meta-level components [105, 128]. For instance, in
an RTOS context, the flow of reified information is from application to appli-
cation, application to system modules and between different system modules.
It is important that an RTOS moderates and controls the flow of this
information. Such control allows it to restrict any illegitimate use and
also to change the information, if required, for the benefit of the application.
This requires certain modifications to the process of reification. The next
subsection describes the modifications needed to the conventional reification
process.
3.1.1 Modifications to the Process of Reification
It is not necessary to reify all the information available in the system. Conse-
quently, the following changes are made to the reification process:
• Rather than passing the reified information directly to the concerned
meta-level components, it is first passed to the kernel. This helps the
kernel to moderate and have control over the reified information.
• Traditionally, a particular meta-level component receives only the in-
formation reified by its base-level component. The change implies that
not only can one or more meta-level components receive reified infor-
mation from multiple base-level components, but they can also receive
it from non-reflective applications. This is very useful because resource
information is not confined to a particular base-level component.
By using multiple sources, it is possible to obtain more relevant and
accurate information.
• Any information reified is stored in the kernel and passed to the meta-
level components only when explicitly requested. This helps reduce the
communication overhead that might have been caused by the transfer of
unnecessary information between the kernel and the meta-level compo-
nents.
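The modified reification process described by these changes can be sketched as follows. This is a minimal illustration only; KernelCore, reify and request are hypothetical names, not part of any concrete kernel interface:

```python
# Sketch of the modified reification process: reified information is
# stored in the kernel and handed to meta-level components only on an
# explicit request. All names are hypothetical.

class KernelCore:
    def __init__(self):
        self._store = {}            # resourceID -> list of reified items

    def reify(self, resource_id, info):
        # Base-level components (or non-reflective applications) push
        # information to the kernel rather than to a meta-level directly,
        # so the kernel can moderate and control the flow.
        self._store.setdefault(resource_id, []).append(info)

    def request(self, resource_id):
        # A meta-level component explicitly pulls the stored information;
        # nothing is transferred until it asks, avoiding needless traffic.
        return self._store.pop(resource_id, [])

kernel = KernelCore()
kernel.reify("cpu", {"pid": 3, "deadline_ms": 20})
kernel.reify("cpu", {"pid": 7, "deadline_ms": 5})

# The scheduler's meta-level requests CPU information when it needs it.
cpu_info = kernel.request("cpu")
assert len(cpu_info) == 2
assert kernel.request("cpu") == []   # store drained after the request
```

Note that, because the store is keyed by resource rather than by reifying component, several meta-levels can draw on information contributed by many base-levels, as the second modification above requires.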
Figure 3.1 shows the new process of reification involving the flow of in-
formation through the kernel. The kernel core in the figure represents the
minimal part of a kernel that includes support for reflection. Note that the
applications as well as the base-level components, either of system modules
or the applications, can reify information which is then stored in the kernel.
This information, when requested, is transferred to the respective meta-level
components. The dotted lines show the traditional method of reification where
information is passed directly to the meta-level component. The next subsec-
tion describes the role of the kernel in the modified mechanism.
3.1.2 Role of the Kernel
Figure 3.2(a) shows the information exchange mechanism between the base-
level and the meta-levels of a conventional reflection tower.

Figure 3.1: Reification through the Kernel

In the modified reflection tower (see figure 3.2(b)), information reified by any base-level com-
ponent (either application or resource management base-level) is passed to
the RTOS kernel instead of the meta-level. The kernel acts as an information
base for all the meta-levels which then explicitly request for certain category
of information. During this process, the kernel may change the information if
required for the benefit of the entire system. This method allows information
to be shared with not just one meta-level but amongst multiple meta-levels.
Also, many base-levels can share a single meta-level eliminating the need for
redundant meta-levels for each and every base-level in the system.
3.1.3 Component Privileges
In the modified reflection mechanism, it is possible that a meta-level belonging
to one base-level component can affect a change in another base-level compo-
nent. This is accomplished by assigning privileges to both the base-level
and meta-level components.

Figure 3.2: Modifications to Reflection

Privilege assignment is similar to process privi-
leges found in OSs such as Linux [19] where processes with ‘root’ privilege are
superior to the normal user processes.
Base-level Privileges
During initialisation, the application processes and the base-level components
(of applications and system modules) are assigned either an ‘application’ or a
‘system’ privilege by the kernel. These privileges allow the kernel to distinguish
between the application and system base-level components. The system base-
level components are considered superior to those of the applications, i.e. the
kernel assigns a greater importance-level to the information reified by a base-
level having a ‘system’ privilege. The kernel maintains a list of initialised
base-level components in the system.
Each resource in the system has a unique identification number – resour-
ceID. A base-level component is uniquely identified by the resource it repre-
sents (e.g. CPU or memory) and thus, has an associated resourceID. Further-
more, each base-level provides the kernel with a list containing resourceID,
meta-level privilege pairs. This list is used to give ‘read’ or ‘write’ access over
the code/data of that base-level to the requesting meta-level component.
Meta-level Privileges
Each base-level component grants a ‘read’ or ‘write’ privilege to a requesting
meta-level component. During initialisation, a meta-level component requests
the kernel for a privilege over one or more base-level components. If assigned a
‘write’ privilege, the meta-level can affect a change in the base-level component
irrespective of whether it is the meta-level component for that base-level or not.
However, the meta-level components with a ‘read’ privilege can only query the
kernel for the information reified by the particular base-level component.
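The privilege checks described above can be sketched as follows. The class and method names are hypothetical, chosen only to make the read/write distinction concrete:

```python
# Sketch of base-level/meta-level privileges. A meta-level with 'write'
# privilege over a base-level may change it; one with 'read' privilege
# may only query the reified information. Names are hypothetical.

class PrivilegeError(Exception):
    pass

class Kernel:
    def __init__(self):
        self._grants = {}     # (meta_id, resource_id) -> 'read' | 'write'
        self._reified = {}    # resource_id -> reified info

    def grant(self, meta_id, resource_id, privilege):
        assert privilege in ("read", "write")
        self._grants[(meta_id, resource_id)] = privilege

    def query(self, meta_id, resource_id):
        # Any granted privilege suffices to read reified information.
        if (meta_id, resource_id) not in self._grants:
            raise PrivilegeError("no privilege over this base-level")
        return self._reified.get(resource_id)

    def change(self, meta_id, resource_id, new_info):
        # Only a 'write' privilege permits affecting the base-level.
        if self._grants.get((meta_id, resource_id)) != "write":
            raise PrivilegeError("'write' privilege required")
        self._reified[resource_id] = new_info

k = Kernel()
k.grant("sched_meta", "cpu", "write")
k.grant("monitor_meta", "cpu", "read")

k.change("sched_meta", "cpu", {"policy": "EDF"})
assert k.query("monitor_meta", "cpu") == {"policy": "EDF"}

try:
    k.change("monitor_meta", "cpu", {"policy": "RM"})  # read-only
except PrivilegeError:
    pass
else:
    raise AssertionError("read-only meta-level must not write")
```

The same check is what lets a meta-level with 'write' privilege affect a base-level other than its own, while confining 'read' meta-levels to queries.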
Each meta-level component must represent at least one resource identified
by the resourceID. The kernel associates a list of meta-levels along with their
privileges to each base-level component such that it can easily identify the
destination meta-level component when a particular base-level reifies informa-
tion. Similarly, a list of meta-levels is also associated with each non-reflective
application process. The next subsection describes the process in the kernel
for assigning an importance-level to each piece of reified information.
3.1.4 InfoLevel for Reified Information
Any newly reified information is validated by the kernel against existing in-
formation and the state of the system at that time; it is then categorised
accordingly and assigned an importance-level – the infoLevel. Stored
information is categorised with respect to the resource it belongs to. If a piece
of information belongs to more than one resource, a different infoLevel is
assigned for each resource. Information with the highest infoLevel is delivered
first to the requesting meta-level component. The next subsection describes
the process of categorising the reified information.
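Delivery ordered by infoLevel can be sketched with a per-resource priority queue. This is an illustration only; a real kernel would use its own queue structures, and the names below are hypothetical:

```python
import heapq

# Sketch: reified information is stored per resource with an assigned
# infoLevel; requests deliver the highest-infoLevel item first.
# Names and levels are illustrative.

class InfoStore:
    def __init__(self):
        self._queues = {}                     # resourceID -> heap

    def store(self, resource_id, info_level, info):
        # Negate the level so the largest infoLevel pops first
        # (heapq implements a min-heap).
        heapq.heappush(self._queues.setdefault(resource_id, []),
                       (-info_level, info))

    def deliver(self, resource_id):
        queue = self._queues.get(resource_id)
        if not queue:
            return None
        _, info = heapq.heappop(queue)
        return info

store = InfoStore()
store.store("cpu", 1, "priority hint")
store.store("cpu", 5, "deadline miss")       # more important
assert store.deliver("cpu") == "deadline miss"
assert store.deliver("cpu") == "priority hint"
```

Keeping one queue per resource mirrors the categorisation described above: the same piece of information may appear in two queues with different infoLevels.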
3.1.5 Categorisation of Reified Information
Distinguishing useful information from non-useful information is the key to
reducing reification overhead. At runtime, an enormous amount of information
is generated by the base-levels as well as the application processes. It is not
practical to use all the reified information. Thus, information needs to be
categorised according to the resources, and some of it discarded or ignored.
Often, information pertaining to one meta-level is also useful to other meta-
levels. Furthermore, a piece of information relevant to several meta-levels might
be more important for one meta-level than for another. Thus, it is essential to
assign a different importance-level to each category a piece of information belongs to.
Categories are represented by the resourceIDs of the resources. This is
obtained by looking at the meta-level list associated with the respective reify-
ing component. Thus, any reified information can be categorised according to
the resource(s) (e.g. CPU, memory) represented by the meta-levels. The fol-
lowing subsections categorise the information and its use pertaining to: CPU,
memory and other resources.
Information for the CPU resource
Any information that corresponds to a change in the scheduling order, or the
execution time of the processes in a system falls in this category. The following
information belongs to the CPU (scheduling) category:
• process priority: the scheduler uses this information to select the next
runnable process. This information is vital for the CPU scheduler.
• process deadline: this is another vital piece of information which can be
used by the scheduler to raise or lower the priority of a process if using
priority based scheduling.
• scheduling policy: this information suggests the scheduling policy that
is to be used. All the application processes will be affected as a direct
result of any change brought about due to this information. However, in
such a case the kernel should intervene and disallow the change.
There exist several other kinds of information which help in the efficient manage-
ment of the CPU. The above list can be used as general guidance applicable
to any real-time application process but is by no means complete. Additions
to the list are implementation-specific.
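As an illustration of how a meta-level scheduler component might consume such CPU-category information, the following sketch raises a process's priority as its deadline approaches. The policy and all names are hypothetical, not the scheduler actually implemented in DAMROS:

```python
# Sketch: a meta-level scheduler component consuming reified deadline
# information to adjust process priorities. Hypothetical policy.

def adjust_priority(base_priority, now_ms, deadline_ms):
    """Raise priority (larger = more urgent) as the deadline nears."""
    slack = deadline_ms - now_ms
    if slack <= 0:
        return base_priority + 10   # deadline imminent or missed: boost hard
    if slack < 5:
        return base_priority + 5    # very little slack: boost
    return base_priority            # plenty of slack: leave unchanged

# Reified information for two processes in the CPU category.
procs = [
    {"pid": 1, "priority": 2, "deadline_ms": 100},
    {"pid": 2, "priority": 2, "deadline_ms": 12},
]
now = 10
for p in procs:
    p["priority"] = adjust_priority(p["priority"], now, p["deadline_ms"])

# Process 2 (slack 2 ms) is now preferred over process 1 (slack 90 ms).
assert procs[1]["priority"] > procs[0]["priority"]
```

This is exactly the kind of adjustment the deadline item above enables when the scheduler uses priority-based scheduling.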
Information for the Memory resource
The memory resource also includes the possible use of virtual memory techniques such as paging. Any information that suggests the use of memory, whether by executing a piece of code or by accessing data, belongs to this category.
The following is a list of the most important information belonging to this
category:
• read-access: it is difficult for an RTOS to predict the memory access patterns of an application process. Thus, information that suggests a memory read access is valuable for efficient memory management. For instance, in a paged system, this information could be used to keep available only those pages that are actually being read, while moving unused pages to the swap-space.
• write-access: this information is similar to read-access but has additional impact. For instance, in an RTOS implementing the copy-on-write feature [19], a memory write access would trigger a copy operation. Knowing in advance when an application will perform a memory write helps the RTOS make more accurate memory allocations.
• allocation: this information can be obtained implicitly or explicitly from
the system. It provides the RTOS and other reflective modules with
information about a process’s memory usage statistics.
• deallocation: works in conjunction with the allocation information and helps keep track of the amount of memory in use by a process at any given time.
• reservation: asks the RTOS to reserve a certain region of memory, such that the pages belonging to this region are always physically available for access.
There is no limit on the kind or category of information an application process can provide. The above categories should be treated only as a guideline and may differ between implementations.
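The way a paging module might act on these hints can be sketched as follows. The hint names, the page-table representation and the pinning behaviour are all illustrative assumptions, not part of any implementation described here.

```c
#include <stddef.h>

/* Illustrative sketch: acting on the memory hints listed above. */
enum mem_hint { MEM_READ, MEM_WRITE, MEM_ALLOC, MEM_FREE, MEM_RESERVE };

#define NPAGES 16
static int resident[NPAGES];   /* 1 = page kept in physical memory */
static int reserved[NPAGES];   /* 1 = never eligible for swap-out  */

static void apply_mem_hint(enum mem_hint h, size_t page)
{
    switch (h) {
    case MEM_READ:
    case MEM_WRITE:
        resident[page] = 1;                   /* fault the page in early */
        break;
    case MEM_RESERVE:
        reserved[page] = 1;                   /* pin: always available   */
        resident[page] = 1;
        break;
    case MEM_FREE:
        if (!reserved[page]) resident[page] = 0;  /* eligible for swap   */
        break;
    case MEM_ALLOC:
        break;  /* would only update per-process usage statistics here */
    }
}
```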
Information for Other Resources
Although this thesis focuses mainly on the CPU and memory resources, a similar categorisation can be made for other resources in the system. Furthermore, the management code for another resource may choose to receive information reified for the CPU and memory. A resource management module can make more accurate decisions by knowing the status of other resources in the system. For instance, a memory management module may defer handling a page-fault if it knows that the faulting process's CPU budget has expired; switching to the next process rather than handling the page-fault immediately prevents the ready process from waiting unnecessarily.
Power – particularly in the case of battery-operated embedded devices – is another resource for which information pertaining to other resources is very useful.
The next subsection describes the flow of reified information through the
kernel.
3.1.6 Flow of Reified Information
Figure 3.2(c) shows the stages involved in the flow of information through the kernel in both directions: base-level to meta-level and vice versa. Each meta-level component (of an application or a resource management module) is assigned a privilege which governs the information it can access. For instance, an application meta-level entity is not allowed to obtain sensitive information about other application processes; thus, its meta-level component would not be granted the ‘read’ privilege by the scheduler’s base-level component. The next subsection describes the information flow from the base-level to the meta-level.
Base-level to Meta-level
The flow of information from a base-level to a meta-level is divided into two
main phases. In the first phase, information reified by a base-level is passed to
the kernel where it is validated against the privileges of the sending application
process or the base-level component. An importance-level is assigned to the
information depending on the privilege of the sending component and the
resource the information is relevant for.
The kernel then checks for available free memory in the system. If there is no free memory to store the information, then the information is either discarded (if its importance-level is lower than that of the existing information) or existing information with a lower importance-level is deleted to make room for the new entry.
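The store-or-discard decision just described can be sketched as follows. The fixed-size table and the function name are assumptions standing in for the kernel's actual store.

```c
/* Sketch of the phase-one admission decision: when no free slot
 * remains, new information is stored only by evicting the entry of
 * lowest importance among those strictly less important than it. */
#define SLOTS 4

typedef struct { int used; int infoLevel; } slot_t;

/* Returns the slot used to store the new information, or -1 if it
 * was discarded because everything stored is at least as important. */
static int admit(slot_t store[], int infoLevel)
{
    int victim = -1;
    for (int i = 0; i < SLOTS; i++) {
        if (!store[i].used) { victim = i; break; }      /* free slot    */
        if (store[i].infoLevel < infoLevel &&
            (victim < 0 || store[i].infoLevel < store[victim].infoLevel))
            victim = i;                /* least-important evictable entry */
    }
    if (victim < 0)
        return -1;                     /* new information is discarded    */
    store[victim].used = 1;
    store[victim].infoLevel = infoLevel;
    return victim;
}
```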
Once the kernel decides to store the information, it determines the effects the information could have on the system. For example, consider an application process that has requested a higher priority. Once the meta-level gets this information, it might request the kernel to grant the application a higher priority. If this request is granted, it directly affects the other processes executing in the system. Hence, the kernel checks the system state at the time of reification itself, making suitable changes to the information if required (e.g. lowering the requested priority). In this way, care is taken that information reified by one base-level cannot adversely affect other processes or system modules.
In the second phase, the destination meta-level component(s) of the reified
information are determined and accordingly the information is categorised.
Note that information reified by one base-level might be useful to several meta-
level components. In the above example, information about raising a process’s
priority would be useful to the reflective scheduler module as well. Hence,
the kernel categorises the reified information by attaching a list of probable
meta-level(s) to it.
This information is then held in kernel-space until the concerned meta-level(s) request it, until the application that reified the information exits the system, or until it is overwritten by information with a higher infoLevel. The meta-level component may choose to query the information either periodically (e.g. every 10 milliseconds) or intermittently (e.g. after each base-level change it requests). This decision is made by the application or system developer implementing the meta-level code. The next subsection describes the flow of information from the meta-level to a base-level.
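The two querying styles just described can be sketched as a simple predicate that a meta-level might evaluate on each tick; the mode names and tick source are hypothetical.

```c
/* Illustrative sketch of the periodic vs. on-change querying choice
 * left to the developer of the meta-level code. */
typedef enum { QUERY_PERIODIC, QUERY_ON_CHANGE } query_mode_t;

/* Returns 1 when the meta-level should introspect the kernel now. */
static int should_query(query_mode_t mode, unsigned now_ms,
                        unsigned last_ms, unsigned period_ms,
                        int change_pending)
{
    if (mode == QUERY_ON_CHANGE)
        return change_pending;            /* poll after a requested change */
    return now_ms - last_ms >= period_ms; /* e.g. every 10 ms              */
}
```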
Meta-level to Base-level
The flow of information in the form of requests/commands from a meta-level
to a base-level is moderated by the kernel. Any change that a meta-level wants
to bring about has to be passed as a request to the kernel. Again, the kernel validates the request against the privileges assigned to the requesting meta-level, and then analyses the effects the change might have on the entire system. For example, changing a process’s priority as described above will affect the process scheduling order.
Furthermore, depending on the system state at the time of the request, the kernel may make changes to the request itself or to the way it is handled. For instance, consider that the meta-level component of an application requests a higher priority in the system. The kernel cannot bring about this change by changing the application’s base-level component. In this case, the kernel lets the request be handled by the meta-level of the reflective scheduler, which then manipulates the process priorities in its base-level scheduler to effect the change.
The kernel is involved in every activity inside the reflection tower and plays an important role in maintaining system integrity. This additional level of indirection in the flow of reified information might seem to impose considerable overhead on the system. However, since reified information remains in the kernel at any given time and is passed to the meta-level only on explicit request, most of the communication overhead is avoided. Furthermore, the kernel can exercise complete control over the reification process and over reflection as a whole. This also allows the kernel to analyse the existing state of the system and manipulate information, if necessary, for the benefit of the entire system. The key is that the kernel is able to discard unwanted information much earlier in the reflection process, avoiding any additional penalties it might otherwise incur. The modified reflection tower forms the basis of the reflective framework explained later in section 3.2. The next subsection describes the
support for reflection in the form of an in-kernel reflection interface.
3.1.7 In-kernel Reflection Interface
Most OS kernels store, or have knowledge of, the current system state and the occurrence of future events, such as the process that executes next or the expiry of a timer. In addition, applications have information about their own resource requirements and execution behaviour. Providing a mechanism to exchange or share this valuable information is not, by itself, enough to achieve efficient resource management: the kernel also needs to provide additional mechanisms that let the resource management modules query information at will and bring about changes at runtime. The following facilities constitute the in-kernel reflection interface:
• reification: an ability for the reflective system modules, applications
and the kernel alike to reify information. The process of reification via
the kernel has been explained in detail earlier in section 3.1.1. Along with
the information that is to be reified, the interface should also capture
the type of information (e.g. memory allocation, CPU requirement, etc.).
This helps the kernel categorise and assign an importance-level to it.
• introspection: an interface for the reflective modules to inspect the
reified information. This could be in the form of a simple function call
in the case of a single address space RTOS or the use of a system call.
The calling meta-level component would use this interface to query reified
information.
• interception: an interface or mechanism for the meta-level components to intercept the base-level. The interception interface works at function-level granularity, allowing the meta-level components to intercept functions present in a base-level component. A meta-level could either request the kernel to intercept calls to a particular function found within a fixed region of code, or to intercept all calls to the function in the entire base-level. The amount of flexibility, or any additional functionality provided by this interface, is implementation-specific.
• code change: an ability to add, change or install code into the RTOS or the applications. If possible, this interface should make use of the existing dynamic loading features of the OS. The meta-level components should be able to add new code into the base-level such that the new code either complements the existing base-level functionality or completely replaces it. Again, whether the newly added code is dynamically relocated or linked at compile time is implementation-specific.
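As a toy model of the first two facilities, a reify/introspect pair over a small in-kernel store might look like the sketch below. The function names, store layout and size limits are assumptions, since the thesis deliberately leaves the concrete interface implementation-specific.

```c
#include <string.h>
#include <stddef.h>

/* Toy single-slot store per (resource, information type) pair,
 * standing in for the kernel's reified-information store. */
#define MAX_RES  4
#define MAX_TYPE 8
#define SLOT_LEN 32

static struct {
    int valid;
    unsigned char buf[SLOT_LEN];
    size_t len;
} store[MAX_RES][MAX_TYPE];

/* reification: submit typed information to the kernel store */
int reflect_reify(int res, int type, const void *data, size_t len)
{
    if (res < 0 || res >= MAX_RES || type < 0 || type >= MAX_TYPE ||
        len > SLOT_LEN)
        return -1;
    memcpy(store[res][type].buf, data, len);
    store[res][type].len = len;
    store[res][type].valid = 1;
    return 0;
}

/* introspection: query reified information for a given category;
 * returns the number of bytes copied, or -1 if nothing is stored. */
int reflect_introspect(int res, int type, void *out, size_t outlen)
{
    if (res < 0 || res >= MAX_RES || type < 0 || type >= MAX_TYPE)
        return -1;
    if (!store[res][type].valid || store[res][type].len > outlen)
        return -1;
    memcpy(out, store[res][type].buf, store[res][type].len);
    return (int)store[res][type].len;
}
```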
Figure 3.3 shows the use of the in-kernel reflection interface between the
base-level and the meta-level components of a system module or an application.
The kernel is shown to occupy the area between the vertical dotted lines. On
the left-hand side of the kernel are the base-level components including the
applications and system modules and on the right-hand side are the respective
meta-level components. There is no specific order in which the interface is to
be used and the components may either be in an application address space or
the kernel address space depending on the implementation of the OS.
Information reified by the base-level components is stored within the kernel
and sent to the corresponding meta-level component(s) on explicit request.
Similarly, requests from the meta-level to intercept the base-level code or to
install new code into the base-level are passed to the kernel. At any stage, the
[Figure: base-level code of applications and system modules on the left reifies information into the kernel; the corresponding meta-level code on the right reads reified data, intercepts base-level functions and installs code via the kernel.]
Figure 3.3: In-kernel Reflection Interface
kernel has complete control over the information as well as the changes being
made.
3.1.8 Summary
In summary, the conventional reflection mechanism provides promising features but offers little support for central control over information in an RTOS context, and the communication overhead associated with the traditional reification process is not acceptable for real-time performance. The modified reflection mechanism gives the RTOS kernel complete control over reflection. In particular, for soft real-time systems, this means the system can maintain its integrity: no change occurs without the knowledge of the kernel. Reifying via the kernel, so that information is transmitted to the meta-level components only on request, reduces the communication overhead.
There are no guidelines defined for the use of reflection in an RTOS context. What is needed is a reflective framework for an RTOS that lays down a well-defined structure, with useful guidelines, for the development of efficient resource management modules. Such modules could take advantage of reflection and provide the required application-specific support. The next section describes the generic reflective framework for an RTOS.
3.2 Generic Reflective RTOS Framework
The RTOS framework is based on the modified reflection mechanism described in section 3.1. Most commercially available RTOSs are based on either a µ-kernel or a monolithic OS architecture, and there is a significant difference between the two. In a µ-kernel, the system modules – commonly called servers – run as independent processes, similar to the application processes, in either a single or independent address spaces [109]. In a monolithic kernel, however, the system modules are compiled into a single kernel module that operates at specified time intervals (e.g. at a scheduling instance, on a timer interrupt, etc.) [109]. The µ-kernel architecture is modular in nature and can be specialised more easily. The generic reflective framework is applicable to both architectures and is defined as an open framework with no restriction on how it is implemented. The system developer is free to add features to the framework so as to adapt it to the needs of a particular system.
The framework assumes that the underlying OS has the notion of time
such that the framework is aware of the passage of time. The OS may provide
a simple system call such as gettime() which returns the current time in the system. The time may be either an integer value (e.g. clock ticks) or a value in a particular time format (e.g. hh:mm:ss).
The key aspect of the framework is the RTOS kernel. It is the centre-
point for all interactions within the system. The kernel should have reflection
support (as described in section 3.1) that allows the resource management
modules to adapt to application-specific requirements at runtime. The next
subsection describes the core elements of the framework.
3.2.1 Core Elements of the Framework
It is not essential to include all properties of reflection in the framework. This
section describes the core elements (particularly with respect to reflection)
that the framework must include in order to provide the required application-
specific resource management. Under the framework, the kernel must include
support for the following features:
• explicit reification: support for explicit reification, i.e. the source is compiled with explicit reification calls so that only the required information is reified at runtime. These reification calls can either be added manually by the developer or inserted automatically using compiler-assisted techniques. The reification interface adheres to the process explained in section 3.1.1.
• introspection: an interface in the kernel to allow the meta-level com-
ponents to query reified information at will. A simple function call or a
system call interface can be used for this purpose.
• function interception: this interface allows a meta-level component to intercept a function. The core interception mechanism should make it possible:

1. to intercept all calls to a function, or a given number of calls found within a specified region of code,

2. to intercept a function such that control is transferred to the intercepting function provided by the meta-level before the original function code executes,

3. to intercept a function such that control is completely transferred to the intercepting function and the original function is never executed; the intercepting function thus replaces the original function’s functionality.
• causal connection or link: an interface through which a meta-level component can form a causal connection with data in the base-level. This means that a change made by a meta-level component to the causally connected data is reflected on the actual data in the base-level. In the case of a single address space RTOS, this could be achieved by providing the meta-level with a C-like pointer to the data. In the case of a multi-address space RTOS, the causal link facility should at least be supported for data belonging to the same address space.
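In a single address space, the causal link can indeed be as simple as handing the meta-level a pointer into base-level data, so that writes through the pointer are immediately visible at the base-level. The sketch below is illustrative; the structure and function names are assumptions.

```c
/* Minimal sketch of a causal link in a single address space:
 * the meta-level receives a pointer to base-level data, and any
 * write through it is instantly reflected at the base-level. */
typedef struct {
    int pid;
    int priority;       /* base-level scheduling data */
} base_proc_t;

/* the kernel grants the meta-level a causally connected view */
static int *causal_link_priority(base_proc_t *p)
{
    return &p->priority;
}
```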
The next subsection describes the optional elements of the framework that
may or may not be implemented.
3.2.2 Optional Elements of the Framework
In addition to the core elements, the framework may choose to extend or im-
plement some additional elements. The following guidelines provide extended
features to the existing core elements and some additional elements that can
be included in the framework:
• selective introspection: allowing the meta-level to specify the information it is particularly interested in, such that the kernel automatically alerts it once such information is reified. This would also help the kernel assign a better importance-level to the information. The meta-level could set a timeout for such information, after which the kernel would no longer alert it.
• enhanced interception: in addition to the core interception features described in the previous subsection, the following could also be implemented:

1. the ability to intercept a function such that control is transferred to the intercepting function provided by the meta-level after the original function code has executed,

2. the ability to intercept calls to a function rather than the function itself. This would allow the function to behave differently depending on where it is called from. The interface could intercept all, or a given number of, calls to a function within defined code boundaries (i.e. limited by start and end addresses).
• install new code: the ability to install new code into the kernel, either to add extra functionality or to complement the existing functionality. The implementation could use techniques such as dynamic loading and relocation to provide this facility. If the extra functionality is already pre-compiled into the module, then it could be added or removed using the replacement facility provided by the core interception element.
The next sections describe the model for constructing reflective system modules and reflective applications using both the core and optional elements of the framework.
3.2.3 Reflective System Modules
Each reflective system module in the framework is separated into two entities: a base-level and a meta-level. The base-level component implements a standard resource management policy. For instance, the base-level of a reflective CPU scheduler could implement a fixed-priority scheduling policy.
The meta-level component analyses reified information in the kernel pertaining to its base-level (using the introspection interface) and identifies the need for a change in the base-level. This change could be a change of policy (e.g. the EDF scheduling policy instead of FP) or a minor change to data structure(s) in the base-level (e.g. changing a process’s priority). Whether a meta-level component executes as a separate process, or only when information pertaining to it has been reified, is implementation-specific. Ideally, when not required, the meta-level component should remain idle (i.e. consume no CPU time).
Figure 3.4 shows the general structure of a reflective system module. The base-level component sets up meta-level privileges during initialisation and reifies information at runtime. For example, the base-level of a reflective CPU scheduler would reify the status of the current process, information about the current scheduling policy, process priorities, etc. The meta-level component can intercept a base-level function, install new code or establish a causal link in order to change or adapt the base-level’s behaviour at runtime.
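The runtime policy change described above can be sketched with a function pointer standing in for causally connected base-level data; all names here are illustrative, and the two policies are deliberately minimal.

```c
/* Sketch: a meta-level swaps its base-level scheduling policy at
 * runtime by re-pointing a policy hook inside the base-level. */
typedef struct {
    int  pid;
    int  priority;
    long deadline;
} proc_t;

static int pick_fp(const proc_t *p, int n)    /* fixed priority   */
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (p[i].priority > p[best].priority) best = i;
    return best;
}

static int pick_edf(const proc_t *p, int n)   /* earliest deadline */
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (p[i].deadline < p[best].deadline) best = i;
    return best;
}

/* base-level policy hook the meta-level can re-point at runtime */
static int (*sched_pick)(const proc_t *, int) = pick_fp;
```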
It is possible to share a single meta-level component between two or more
[Figure: the reflective system module’s base-level code reifies data into the kernel; its meta-level code reads the reified data, requests interception or code installation, and maintains a causal link with the base-level.]
Figure 3.4: Structure of a Reflective System Module
reflective system modules. The meta-level component in this case simply obtains privileges from the respective base-level components during its initialisation. Sharing a meta-level component amongst two or more base-levels saves the memory that additional meta-level components would require. It is suggested that this facility be used only when the base-level components involved are closely related. For example, in the case of power management, the base-level components of resources such as memory and the CPU could share their meta-level component with the base-level of the power management module. This would provide more information about each resource for efficient power management. However, the possibility and impact of a shared meta-level component has not been explored in this thesis. Note that figure 3.4 does not show shared meta-level(s).
3.2.4 Reflective Applications
Applications are the primary users of the facilities an RTOS provides: the better the support they obtain from the RTOS, the better their performance. The main motivation of this work is not to enable the implementation of reflective applications but to provide efficient resource management support by using the information available within the applications and the system as a whole. However, similarly to the reflective system modules, the framework also supports the implementation of reflective applications.
With the use of privileges, the kernel restricts the way applications use the reflection interface. With the assigned application privilege, an application meta-level component cannot perform the following operations:
• access information reified by the resource management modules (e.g. the system’s process queue) unless given explicit privilege by the respective base-level component of the resource management module. An application is, however, able to access the process queue containing its own child processes/threads.
• install new code into, or replace code in, any of the system-module base-level components when the change affects other applications in the system.
• share a meta-level component amongst multiple application or system base-levels. This ensures complete isolation between different applications.
Other than the above restrictions, the structure of reflective applications is similar to that of the reflective system modules.
There is, however, one more difference: irrespective of whether an application has a meta-level component or not, it is still allowed to reify information.
In this case, the reified information, when stored in the kernel, can be used by
a reflective system module.
Reification plays an important role in extracting valuable information from
the applications. By reifying application-specific resource requirements to the
system, applications in a way control and adapt the RTOS’s reflective resource
management policies. The next subsection describes the meta object protocol
for the reflective modules.
3.2.5 Meta Object Protocol for Reflective Components
The Meta Object Protocol (MOP) provides a basic set of rules/guidelines
that decide how the reflective components (e.g. meta-levels) will operate and
interact with each other in the system. The following are the MOP guidelines
defined for the reflective framework:
• what and how much is reified: the factors affecting this decision are the importance-level of the currently reified information and the availability of memory. The approach is to store as much of the reified information as possible, in order of importance, discarding information with a lower infoLevel to minimise memory utilisation. At runtime, the base-level component uses the reification interface to reify all the relevant information; depending on the availability of memory, the kernel chooses either to store or to discard some of it (more details in section 3.1.4).
• number of allowed meta-levels: ideally, one meta-level component per base-level would suffice. The framework nevertheless supports multiple meta-level components operating one on top of the other; in practice, there is no limit on the number of meta-levels a reflective system module/application can have at any given time. This is achieved by letting the first-order meta-level component act as a base-level component to the second-order meta-level, and so on.
• interaction between different meta-levels: similarly to supporting multiple meta-level components, a meta-level ‘A’ can interact with a meta-level ‘B’ by acting as its base-level component and using reification to pass information to ‘B’. Meta-level ‘B’ can in turn act as the base-level of ‘A’ and reify information to it. This phenomenon can be termed a cyclic reflective tower, in which the meta-level components can interact with and change each other. The interaction between meta-level components must be explicitly configured by granting the required privileges to each other during initialisation, as described in section 3.1.3.
In order to verify the reflective framework in a practical OS context, a prototype RTOS has been implemented. The following sections describe this prototype – DAMROS – which implements two reflective system modules: a reflective CPU scheduler and a reflective virtual memory manager (paging).
3.3 Prototype Implementation – DAMROS
This section presents the prototype implementation of the reflective framework in a home-grown micro-kernel RTOS, DAMROS [95, 96]. DAMROS stands for Dynamically Adaptive Micro-reflective Real-time Operating System. The generic reflective framework allows applications to express their specific resource requirements via the process of reification.
Based on a µ-kernel architecture, DAMROS is implemented as a single address space RTOS for the Intel x86 CPU architecture [61]. It supports virtual memory paging and implements a two-level CPU scheduler. The main goal of implementing DAMROS is to test the reflective framework for application-specific resource management, in particular for the CPU and memory resources. For the prototype, the development of DAMROS was restricted to the implementation of a two-level CPU scheduler, a paged memory management subsystem, and a few device drivers providing an interface to run the experiments.
The core kernel consists of the reflection interface, a minimalist lower-level scheduler (whose main objective is to schedule the various system modules when required, including the higher-level application scheduler) and interrupt handling routines (e.g. the timer interrupt). All the system modules (e.g. the CPU scheduler, memory manager, etc.) execute as separate individual system processes/threads.
In order to support the framework, DAMROS implements a gettime() function which can be used to obtain the current time in the system. The value returned (of data type time_t) is the 64-bit value of the CPU’s time-stamp counter (i.e. the Intel RDTSC machine instruction [38]). Furthermore, DAMROS implements timers with which the processes/threads in the system can sleep until, or be woken up at, a particular time in the future.
Since DAMROS is a single address space RTOS, applications execute in the same address space as the OS. There is no distinction between a process and a thread; the two terms are used interchangeably. The next subsection describes the implementation of the reflection interface in the kernel.
3.3.1 Reflection Interface in the Kernel
According to the reflective framework, the in-kernel reflection interface allows applications and the system modules to reify information from the base-level as well as from the applications; to introspect and intercept the base-level; and to install new code into it. The interface is implemented as two separate components: the rManager, which manages reification and introspection, and the iManager, which manages interception and the installation of code. Before describing the interfaces offered by each component, the next subsection describes the different types of information and the support for reification in DAMROS.
Support for Reification
In DAMROS, each resource is assigned a unique ID (an identification number) which is used when reifying information about that resource. The IDs assigned to the CPU and memory resources are 1 and 2 respectively. DAMROS, implemented in the C language, defines the C constants CPU and MEMORY with values 1 and 2 respectively. Information related to a resource is further categorised by information type, represented as infoType.
Since DAMROS is a single address space OS, reification of information is
accomplished using a direct function call interface. The base-level component
of a reflective module or an application prepares and reifies a data structure
containing the required information. The C language data structure used for
reification is as follows:
typedef struct {
process_id_t pid;
int resourceID;
int infoType;
unsigned data;
void *dataPtr;
int infoLevel;
time_t time;
} reify_t;
In the above data structure, pid represents the unique ID assigned to a process/thread in DAMROS, and resourceID represents one of the resources in the system (i.e. CPU or MEMORY). The significance of the other fields is described in the following subsections. For each resource, a number of different information types (i.e. infoType values) are defined.
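As a hypothetical example of filling this structure, the following sketch reifies a relative deadline for the CPU resource. The reify_t definition is repeated for self-containment, the time type is renamed to avoid clashing with the standard time_t, and the DEADLINE constant's numeric value and the helper's name are assumptions (the infoLevel of 22 follows the list below).

```c
#include <stdint.h>
#include <string.h>

typedef int      process_id_t;   /* assumed representation of a pid  */
typedef uint64_t damros_time_t;  /* stand-in for DAMROS's 64-bit time_t */

typedef struct {                 /* as defined in the text above */
    process_id_t pid;
    int resourceID;
    int infoType;
    unsigned data;
    void *dataPtr;
    int infoLevel;
    damros_time_t time;
} reify_t;

enum { CPU = 1, MEMORY = 2 };    /* resource IDs from the text */
#define DEADLINE 3               /* illustrative constant; the real
                                    value is not given in the text */

/* Fill reify_t for a deadline relative to the current time:
 * dataPtr points at the 64-bit time value and data = 1 marks it
 * as relative (per the DEADLINE infoType description). */
static reify_t make_deadline(process_id_t pid, damros_time_t *ticks)
{
    reify_t r;
    memset(&r, 0, sizeof r);
    r.pid        = pid;
    r.resourceID = CPU;
    r.infoType   = DEADLINE;
    r.infoLevel  = 22;           /* infoLevel for DEADLINE */
    r.dataPtr    = ticks;
    r.data       = 1;            /* 1 = relative to current time */
    return r;
}
```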
CPU infoTypes
Each infoType (an integer constant) has an associated infoLevel field which
depicts the importance of the information. A greater infoLevel value gives
greater importance to the information. For the CPU, DAMROS defines the
following infoTypes (the associated infoLevel value is shown in brackets):
• HI_PRIORITY (infoLevel = 22): used when a process requires a higher priority. If granted, the process obtains a priority one level higher than its existing priority.
• LO_PRIORITY (infoLevel = 21): used when a process requires a lower priority. If granted, the process obtains a priority one level lower than its existing priority.
• CHILD_FP (infoLevel = 25): used by a process to request a Fixed Priority (FP) scheduling policy for scheduling its child threads. For threads with equal priorities, DAMROS uses a pre-emptive FP scheduler that switches between the equal-priority threads in the manner of an RR scheduler. Thus, while all child threads have equal priorities, the existing RR scheduler is used until one of the threads changes its priority.
• NO_CHILD_FP (infoLevel = 24): used by a process to undo the effect of the above request. DAMROS ignores and discards this information if the process is not using the FP scheduling policy.
• DEADLINE (infoLevel = 22): used by a process/thread to specify a time deadline when using the EDF scheduling policy. In this case, the dataPtr field of reify_t points to the 64-bit time_t value. A value of 0 or 1 in the data field indicates whether the given time is absolute or relative to the current time in the system.
• CHILD_EDF (infoLevel = 25): used by a process to request an Earliest-Deadline-First (EDF) scheduling policy for scheduling its child threads.
• NO_CHILD_EDF (infoLevel = 24): used by a process to undo the effect of the above request. DAMROS ignores and discards this information if the process is not using the EDF scheduling policy.
• CHILD_FCFS (infoLevel = 20): used by a process to request a First-Come-First-Serve (FCFS) scheduling policy for scheduling its child threads.
• NO_CHILD_FCFS (infoLevel = 19): used by a process to undo the effect of the CHILD_FCFS request. DAMROS ignores and discards this information if the process is not using the FCFS scheduling policy.
• CHILD_RR (infoLevel = 15): used by a process to request a round-robin scheduling policy for scheduling its child threads.
• CHILD_SCHED (infoLevel = 30): used by a process to request a user-defined (UD) scheduling policy to schedule its child threads. There are two possibilities: (1) the application process implements its own UD scheduler code, or (2) the process wants to use a UD scheduler implemented elsewhere in the system – either in the scheduler's meta-level or in a different process. In case (1), the address of the UD scheduler is placed in the dataPtr field of the reify_t structure and the data field is set to 0. In case (2), the UD scheduler can only be used if the component implementing it allows it; the dataPtr field then points to a string that uniquely identifies the UD scheduler function implemented elsewhere, and the data field is set to 1. Case (2) is described in more detail in section 3.3.3.
• NO_CHILD_SCHED (infoLevel = 29): used by a process to undo the effect of the CHILD_SCHED request, i.e. it disables the UD scheduling policy for the child threads of the calling process. DAMROS ignores this information if no UD scheduler is active for the child threads.
• PROCESS_QUEUE (infoLevel = 35): this infoType is reified by the base-level scheduler. It is used by a UD scheduler to obtain the process queue from the base-level scheduler via the requestInfo() interface (described in section 3.3.2). The process queue thus returned consists of only the child threads of the application implementing the UD scheduler.
DAMROS implements a RR scheduling policy at the base-level with a time quantum of 5 ms. For the infoTypes beginning with 'NO_' (e.g. NO_CHILD_FP), the previously installed scheduler is replaced by the default RR scheduler. Note that an infoType NO_CHILD_RR would have no effect and hence does not exist. The infoLevel values are used by the rManager to prioritise information; the assignment of infoLevels is such that information pertaining to UD scheduling gets higher priority than the rest.
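The ordering just described can be made concrete by collecting the infoLevel values listed above into constants. This is a sketch only: the LVL_ names are illustrative, and the integer codes of the infoTypes themselves are not given in the text.

```c
/* CPU infoLevel values as listed above (LVL_ names are illustrative). */
enum cpu_info_level {
    LVL_CHILD_RR       = 15,
    LVL_NO_CHILD_FCFS  = 19,
    LVL_CHILD_FCFS     = 20,
    LVL_LO_PRIORITY    = 21,
    LVL_HI_PRIORITY    = 22,
    LVL_DEADLINE       = 22,   /* duplicate values are legal in C */
    LVL_NO_CHILD_FP    = 24,
    LVL_NO_CHILD_EDF   = 24,
    LVL_CHILD_FP       = 25,
    LVL_CHILD_EDF      = 25,
    LVL_NO_CHILD_SCHED = 29,
    LVL_CHILD_SCHED    = 30,
    LVL_PROCESS_QUEUE  = 35    /* highest: reified by the base-level */
};
```

Note how every UD-scheduling infoLevel exceeds those of the fixed policies, matching the prioritisation rule above.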
Memory infoTypes
For memory, particularly with respect to paging, DAMROS defines the follow-
ing infoTypes :
• MEM_READ (infoLevel = 25): used by a process to specify that a particular region of memory is being read by the application. The dataPtr field of the reify_t structure holds the starting location of the access and the data field holds the size (in bytes) of the read operation from that location.
• MEM_WRITE (infoLevel = 24): used by a process to specify that a particular region of memory is being written to by the application. The range of the memory region written to is specified in the same way as for MEM_READ.
• MEM_ALLOC (infoLevel = 20): mainly used by the base-level code of the memory manager, but can also be used by an application. It informs the meta-level component of the memory manager about a memory allocation. The range of the allocated memory region is specified in the same way as for MEM_READ.
• MEM_FREE (infoLevel = 19): similar to MEM_ALLOC, but specifies that a memory region has been freed.
• KEEP_ALIVE (infoLevel = 22): used by a process to request that a certain region of virtual memory always remain resident in physical memory.
• ALLOW_DEATH (infoLevel = 21): used by a process to voluntarily suggest that a certain virtual memory region be removed from physical memory (i.e. moved to swap space).
• LRU (infoLevel = 15): used by a process to request an LRU paging policy in the system.
• LFU (infoLevel = 27): used by a process to request an LFU (Least Frequently Used) paging policy in the system.
• NO_LFU (infoLevel = 26): used by a process to undo the effects of an LFU request. DAMROS ignores and discards this information if the process is not using an LFU policy.
• MRU (infoLevel = 29): used by a process to request an MRU (Most Recently Used) paging policy in the system.
• NO_MRU (infoLevel = 28): used by a process to undo the effects of an MRU request. DAMROS ignores and discards this information if the process is not using an MRU policy.
• UD_POLICY (infoLevel = 31): used by a process to request that a UD paging policy be used to manage its memory pages. As with CHILD_SCHED, the same two possibilities apply, and the corresponding fields in reify_t are set as for CHILD_SCHED.
• NO_UD_POLICY (infoLevel = 30): used by a process to undo the effects of a UD_POLICY request. DAMROS ignores and discards this information if the process is not using a UD policy.
• PAGE_TABLE (infoLevel = 35): this infoType is reified by the base-level of the memory manager to expose the page table belonging to an application in the system. The information is used by a UD paging policy to obtain the page table belonging to its application via the requestInfo() interface (described next). The page table thus returned consists of only the pages allocated to the application implementing the UD policy.
Each infoType has a unique integer value. When information is reified, the reify_t data structure containing the appropriate values is passed to the kernel via a system call; in DAMROS, a system call is a direct function call. This data structure is stored in the kernel and passed to the respective meta-level component(s) on request. The next subsection describes the interfaces provided by the rManager component.
3.3.2 The rManager
The rManager component provides several interfaces for the applications and
the system modules to take advantage of the reification mechanism in the
kernel.
The implementation of each interface in terms of the C language specifica-
tion is described as follows:
Interface reify():
SYNOPSIS:
int reify(int infoType, ...);
DESCRIPTION:
Any component in the system, reflective or not – including the applications, the system modules and the kernel itself – can use this interface to reify resource-related information. For example, an application process can reify a request for a higher priority using the following call (C language representation):
reify(HI_PRIORITY);
where HI_PRIORITY is the infoType.
In DAMROS, the reify interface is a C function that accepts a variable number of arguments (using the C stdarg mechanism), so that the component reifying information can supply additional information as required. For instance, to reify a request to install a UD scheduler, the application process must provide the location of the UD scheduler's code. The usage of reify in this context is as follows:
reify(CHILD_SCHED, &UD_scheduler);
where CHILD_SCHED is the infoType and &UD_scheduler is the address of the scheduler code. The function reify() uses the specified infoType, retrieves any additional information from the argument list, and prepares the reify_t data structure.
int reify(int infoType, ...)
{
va_list args;
reify_t *info;
va_start(args, infoType);
info = (reify_t *) malloc(sizeof(reify_t));
info->pid = getpid();
info->infoType = infoType;
info->resourceID = getresourceID(infoType);
info->infoLevel = getinfolevel(infoType);
info->time = gettime();
switch(infoType)
{
case ...
...
...
case CHILD_SCHED:
/* store the address location of UD scheduler */
info->dataPtr = (void *)va_arg(args, unsigned);
break;
...
...
}
va_end(args);
/* pass the information to the rManager */
return rManager_save(info);
}
Figure 3.5: Code Snippet of Reify Interface
Figure 3.5 shows the code snippet of the reify interface handling the infoType CHILD_SCHED. Here, getpid() returns the ID of the calling process and getresourceID() returns the resource ID associated with the passed infoType. In this example, since the infoType CHILD_SCHED corresponds to the CPU scheduler, the resource ID returned by getresourceID() would be 1, i.e. the defined constant CPU. The functions getinfolevel() and gettime() return the infoLevel for a given infoType and the current system time, respectively.
The newly prepared data structure (reify_t) is passed to rManager_save(), where it is either stored or discarded. In the rManager, each piece of reified information carries an infoLevel field signifying its importance. Figure 3.6 lists the code snippet of rManager_save().
Depending on the availability of memory and the infoLevel, the rManager either stores or discards the information. Each resource in DAMROS is assigned 8 KB of memory for storing reified information; the function memFree() returns the memory available for the corresponding resource. When no memory is available, the rManager accommodates the new information (if it has a higher infoLevel) by discarding already stored information with a lower infoLevel value.
The function getLeastInfoLevel() returns the information with the least infoLevel value amongst the information already reified for that resource. If the returned information has a greater infoLevel than the newly reified information, the new information is discarded instead. Each stored item is assigned a unique ID, which is returned by the saveInfo() function.
int rManager_save(reify_t *info)
{
reify_t *curInfo;
int id;
...
/* check if memory available */
if (memFree(info->resourceID) < sizeof(reify_t)) {
/* get stored info with least infoLevel
for the concerned resource (e.g. memory) */
curInfo = getLeastInfoLevel(info->resourceID);
/* if infoLevel of reified info is lower */
if(curInfo->infoLevel > info->infoLevel) {
/* discard the newly reified information */
discardInfo(info);
return -1;
}
/* discard info with a lower or equal infoLevel */
discardInfo(curInfo);
}
/* save the newly reified information */
id = saveInfo(info);
return id;
}
Figure 3.6: rManager: Saving Reified Information
Interface requestInfo():
SYNOPSIS:
int requestInfo( int resourceID
, int infoType
, time_t after
, time_t before
, reify_t* info);
DESCRIPTION:
This interface is used by a meta-level component to obtain reified information stored in the rManager. During system initialisation, each meta-level is assigned a privilege list of <resourceID, access privilege> pairs, granting it read or write access to the resource identified by each resourceID. As per the framework, a base-level provides the kernel with this list during its initialisation, and the kernel assigns the list to the base-level's meta-level component. The rManager checks the privilege of the requesting component before providing it with the requested information.
When calling requestInfo(), it is not necessary for the requesting component to specify all the function parameters. The parameter resourceID must be specified, whereas the rest are optional. When called with only the resourceID, the function copies the resource's reified information with the highest infoLevel into the info parameter.
By specifying the infoType parameter, the function provides the most recently reified information with the given infoType. If the requested information does not exist, the function returns '-1' with the info parameter set to NULL. Using the parameters after and before, individually or together, a meta-level can query information reified after a given time, before a given time, or within a given time window.
Furthermore, a meta-level can use requestInfo() to request a causal link to data present in the base-level. In this case, instead of copying information into the info parameter, a unique ID representing the data structure in the base-level is returned. This ID is then passed to linkData() to establish the causal link. If the causal link is not granted, the function returns '-2'.
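The selection rules described above – a mandatory resourceID, optional infoType and time-window filters, the highest infoLevel by default, and the most recent entry when an infoType is given – can be sketched as a lookup over the stored records. The names and types below are illustrative, not DAMROS's actual internals:

```c
#include <stddef.h>

/* Illustrative record; the real rManager stores reify_t structures. */
typedef struct {
    int  resourceID, infoType, infoLevel;
    long time;
} rec_t;

/* resourceID is mandatory; pass infoType = -1 and after/before = 0 to
 * ignore those filters.  Without an infoType the highest infoLevel
 * wins; with one, the most recently reified match wins.  Returns NULL
 * when nothing matches (the real interface returns -1 instead). */
static const rec_t *select_info(const rec_t *store, size_t n,
                                int resourceID, int infoType,
                                long after, long before)
{
    const rec_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const rec_t *r = &store[i];
        if (r->resourceID != resourceID)            continue;
        if (infoType != -1 && r->infoType != infoType) continue;
        if (after  > 0 && r->time <= after)         continue;
        if (before > 0 && r->time >= before)        continue;
        if (!best) { best = r; continue; }
        if (infoType != -1 ? (r->time > best->time)
                           : (r->infoLevel > best->infoLevel))
            best = r;
    }
    return best;
}
```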
Interface linkData():
SYNOPSIS:
void* linkData(int ID);
DESCRIPTION:
This interface allows a meta-level to form a causal link with data present in the base-level. Such a link lets the meta-level directly inspect, analyse or modify base-level data without incurring any extra overhead in the system. A call to linkData() must always be preceded by a requestInfo() call, which lets the rManager know what data the calling component wants to link to.
The rManager authenticates the request against the privileges, prepares the causal link and provides the calling component with a unique identifier (ID) representing the request. This ID is then passed to linkData() to form the actual causal link.
An example use of a causal link is as follows: the meta-level code of a reflective CPU scheduler could establish a causal link with the scheduler's process queue. Any change made by the meta-level to the order of the processes in this queue would then affect their scheduling without the knowledge of the base-level. Note that, in order to form a causal link, the base-level code must reify the data and assign a 'write' privilege to the meta-level that requests the link. DAMROS follows the guidelines described in section 3.1.3 to assign privileges to the meta-level components.
For process synchronisation, access to the data is protected by a mutex [19]. The base-level code must provide a mutex along with the data during reification: the dataPtr field in the reify_t structure holds the pointer to the data being reified and the data field holds the mutex. The mutex inherits the meta-level privilege list of <resourceID, access privilege> pairs from the base-level, so that only a meta-level with a 'write' privilege can lock the mutex and use the data.
Interface unlinkData():
SYNOPSIS:
void unlinkData(int ID);
DESCRIPTION:
This interface is used by meta-level code to close/invalidate a previously established causal link. The function uses the given ID to locate the reified data and the associated mutex. The meta-level's entry is disabled in the mutex's privilege list, so that the calling meta-level can no longer lock the mutex to use the data. The entry is re-enabled by calling linkData() again.
Note that the applications, system modules and the kernel are compiled together to run in a single address space, so the above interfaces are equally visible to the applications and the system modules. The reflection interface in DAMROS therefore imposes no additional execution-time penalty from IPC (Inter-Process Communication) mechanisms. Race conditions are avoided by the use of mutexes where necessary.
The next subsection describes the implementation of the iManager component.
3.3.3 The iManager
The iManager component provides interception and code installation interfaces to the system modules as well as the applications. The implementation of each interface in terms of the C language specification is as follows:
Interface allowIntercept():
SYNOPSIS:
int allowIntercept( void *function
, char *func_name
, int resourceID);
DESCRIPTION:
This interface is used by base-level code during its initialisation to allow the interception of a particular function (parameter function) by the meta-level component representing the given resourceID. If the given resourceID is ZERO, any thread in the system is allowed to intercept the function. The iManager verifies and, if required, assigns a 'write' privilege to the corresponding entry in its meta-level privilege list. The interface assigns a unique ID to the function, so that a meta-level or a thread can later use either this ID or the string given in the func_name parameter to intercept the function. This is used for case (2) of both the CHILD_SCHED and UD_POLICY infoTypes of the CPU and memory resources.
Interface interceptAllowed():
SYNOPSIS:
int interceptAllowed( char *func_name
, int func_ID
, void *new_func);
DESCRIPTION:
This interface is used by a meta-level or a thread to intercept a function that is present in a different base-level or implemented by a different process in the system. The component implementing the function must have allowed it to be intercepted using the allowIntercept() interface. Only one of the parameters func_name and func_ID is provided to identify the function to be intercepted. The parameter new_func specifies the location of the new function; once intercepted, control is transferred to new_func. The implementation of interception is described in detail under the interceptCall() interface below.
Interface interceptCall():
SYNOPSIS:
int interceptCall( void *orig_func
, void *new_func
, void *start
, void *end
, int ncalls
, int nparams);
DESCRIPTION:
This interface is used by the meta-level to intercept a given function (parameter orig_func) in the base-level. Before a function is intercepted, the iManager checks that the requesting meta-level has a 'write' privilege (granted via an allowIntercept() call). In DAMROS, there are two methods by which a meta-level can intercept a base-level function:
Method #1:
In this method, after interception, the control is transferred to the function
at location represented by the new func parameter. The original function is
never executed unless explicitly called from within the meta-level function. A
typical use of this type of interception is:
id = interceptCall( &base_level_function
, &meta_level_function
, NULL
, NULL
, 0
, 0);
The above call intercepts the function base_level_function() and transfers control to the function meta_level_function(); the remaining parameters are set to the values shown above.
The operation of interception can be illustrated with an example. Consider that the functions base_level_function() and meta_level_function(), used in the interceptCall(), are located at addresses 0x08048517 and 0x08068000 respectively. For the Intel x86 architecture [61], the machine code at the start of base_level_function() is typically as follows:
0x08048517: 55 => PUSH EBP (1 byte)
0x08048518: 89 E5 => MOV EBP,ESP (2 bytes)
0x0804851A: 83 EC 38 => SUB ESP, 0X38 (3 bytes)
0x0804851D: XX XX XX
...
In the above code, 55 is the opcode for instruction PUSH EBP stored at
address 0x08048517 and the total space taken by this instruction is 1 byte.
In order to intercept this function, a JMP instruction (opcode = E9) is written at the start of the function so that control is transferred to meta_level_function(). The format of the JMP instruction is:
JMP <signed 32-bit displacement>
i.e. E9 <signed displacement> in machine code.
Using this displacement, the control jumps to an effective address calcu-
lated as follows:
Effective Address = (Next Instruction Address) +
(<signed displacement>)
The effective address in the example needs to point to the function meta_level_function(), i.e. address location 0x08068000. Adding the JMP instruction, which is 5 bytes long, at address 0x08048517 makes the next instruction address 0x0804851C. The displacement for the JMP instruction is therefore 0x08068000 − 0x0804851C = 0x0001FAE4. Thus, the above code is changed as follows:
0x08048517: E9 E4 FA 01 00 => JMP 0x08068000 (5 bytes)
0x0804851C: 90 => NOP (1 byte)
0x0804851D: XX XX XX
...
Instructions with opcode value 90 (a no-operation instruction) are inserted to pad the remaining bytes of the overwritten instructions. The iManager creates a function stub on-the-fly by generating the following code:
0x08050000: 55 => PUSH EBP (1 byte)
0x08050001: 89 E5 => MOV EBP,ESP (2 bytes)
0x08050003: 83 EC 38 => SUB ESP, 0X38 (3 bytes)
0x08050006: E9 12 85 FF FF => JMP 0x0804851D (5 bytes)
0x0805000B: C3 => RET (1 byte)
Here, the instructions from address 0x08050000 up to address 0x08050006 are the instructions of the original base_level_function() that were replaced. Note the JMP instruction added at address 0x08050006: it jumps to location 0x0804851D in base_level_function(), i.e. the code that has been preserved after interception. The location of this newly generated code is associated with a unique ID for this particular interception operation.
The meta-level can obtain this location using the helper function void *getOriginalCall(int ID). Thus, a meta-level can explicitly execute the intercepted function, base_level_function() in the example, as follows:
void meta_level_function(int param1, int param2)
{
void (*original_function)(int, int);
...
original_function = (void (*)(int, int)) getOriginalCall(interceptID);
original_function(param1, param2);
}
In the above code snippet, interceptID is a variable known to the meta-level that holds the unique interception ID returned by the interceptCall() interface. Note that, in DAMROS, meta_level_function() has the flexibility to call the intercepted function at any point – at the start, in the middle or towards the end of its own execution. However, if it does not call the base-level function, meta_level_function() effectively replaces the functionality of the base-level function.
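The displacement arithmetic used throughout this example is simply the target address minus the address of the next instruction; for the 5-byte E9 (JMP) and E8 (CALL) rel32 forms it can be expressed as a small helper. This is a sketch for illustration, not DAMROS code:

```c
#include <stdint.h>

/* Signed 32-bit displacement for the 5-byte JMP (E9) / CALL (E8)
 * rel32 forms: target minus the address of the next instruction,
 * which lies 5 bytes past the start of the instruction. */
static int32_t rel32(uint32_t insn_addr, uint32_t target)
{
    return (int32_t)(target - (insn_addr + 5));
}
```

For instance, rel32(0x08050006, 0x0804851D) yields the negative displacement 0xFFFF8512 encoded little-endian as 12 85 FF FF in the stub above.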
Method #2:
In this method, rather than intercepting a function as a whole, calls to a
particular function from within a fixed region of code can be intercepted.
A typical call would be as follows:
id = interceptCall( &base_level_function
, &meta_level_function
, &some_function
, NULL
, 2
, 0);
The above call intercepts calls to base_level_function() made from within the function some_function(). The code starting at the location of some_function() is scanned for calls to base_level_function(); when the parameter end is set to NULL, the scan stops at a return instruction (e.g. RET in Intel x86 assembly [38]), otherwise it continues until the location pointed to by end is reached. If the parameter ncalls has a value greater than ZERO, only the first ncalls calls are intercepted (i.e. 2 in the above call).
The two functions can have a different number of arguments, but the interceptor function (meta_level_function()) must have an equal or smaller number of parameters than the intercepted function (base_level_function()). The parameter nparams of interceptCall() specifies the number of parameters required by the interceptor function; the iManager rejects the request if this number is greater than the number of parameters of the intercepted function. The parameter count of the intercepted function is determined by decoding the machine-code instructions preceding a call to it, since the parameters are pushed onto the stack using PUSH instructions before the call. If the given nparams is less than the number of parameters of the intercepted function, the PUSH instructions for the extra parameters are replaced by NOP instructions. If nparams is 0, no changes are made.
In Intel x86 architecture, the assembly code for a function call instruction
is represented as follows:
CALL <signed 32-bit displacement>
i.e. E8 <signed displacement> in machine code.
Here, E8 is the opcode for a CALL instruction and the displacement is used
to calculate the effective address of the function [38, 61] as described before.
For the same example as above, consider that the parameter start points to location 0x00106C00. The iManager needs to replace calls to base_level_function() with calls to meta_level_function(). It scans the machine code starting from location 0x00106C00 until it detects a CALL instruction, represented as follows:
...
0x00106C7A: E8 98 18 F4 07 => CALL 0x08048517 (5 bytes)
0x00106C7F: XX XX
...
Here, the instruction at location 0x00106C7A is a CALL instruction (opcode = E8). The next instruction starts at address 0x00106C7F. Thus, the effective address of the called function is 0x00106C7F + 0x07F41898 = 0x08048517, which is the location of base_level_function().
In order to transfer control to meta_level_function(), the displacement needs to be changed:
Displacement = 0x08068000 - 0x00106C7F
= 0x07F61381
The code is changed so that the CALL instruction targets meta_level_function() instead of base_level_function(). The instruction stream after the change looks as follows:
...
0x00106C7A: E8 81 13 F6 07 => CALL 0x08068000 (5 bytes)
0x00106C7F: XX XX
...
This process of replacement continues until the required number of calls has been intercepted. The interceptCall() interface returns a unique ID to the calling meta-level component, which can be used in the future to refer to this particular interception operation.
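The scan-and-patch loop described in this method can be sketched over a flat byte buffer. This is illustrative only: the real iManager walks live code, honours the end parameter, and must respect instruction boundaries, which a naive byte-wise scan does not.

```c
#include <stdint.h>
#include <stddef.h>

/* Scan a flat byte buffer for 5-byte E8 CALL instructions whose
 * effective target equals old_target, and redirect up to ncalls of
 * them (all, if ncalls is 0) to new_target.  `base` is the address
 * code[0] is assumed to occupy.  Returns the number of calls patched. */
static int patch_calls(uint8_t *code, size_t len, uint32_t base,
                       uint32_t old_target, uint32_t new_target, int ncalls)
{
    int patched = 0;
    for (size_t i = 0; i + 5 <= len; ) {
        if (code[i] == 0xE8) {
            /* decode the little-endian rel32 displacement */
            uint32_t disp = (uint32_t)code[i + 1]
                          | ((uint32_t)code[i + 2] << 8)
                          | ((uint32_t)code[i + 3] << 16)
                          | ((uint32_t)code[i + 4] << 24);
            uint32_t next = base + (uint32_t)i + 5;  /* next instruction */
            if (next + disp == old_target) {         /* unsigned wraparound
                                                        handles negatives */
                uint32_t nd = new_target - next;     /* new displacement */
                code[i + 1] =  nd        & 0xFF;
                code[i + 2] = (nd >> 8)  & 0xFF;
                code[i + 3] = (nd >> 16) & 0xFF;
                code[i + 4] = (nd >> 24) & 0xFF;
                patched++;
                if (ncalls > 0 && patched == ncalls)
                    break;
                i += 5;
                continue;
            }
        }
        i++;  /* byte-wise scan, as the text describes */
    }
    return patched;
}
```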
Interface uninterceptCall():
SYNOPSIS:
int uninterceptCall( int ID
, boolean keepAlive);
DESCRIPTION:
This interface is used by meta-level code to undo the effects of an interceptCall(). The iManager restores the machine code to its original state. If the parameter keepAlive is set to TRUE, the iManager retains the saved interception information. This lets a meta-level re-intercept a function much faster in the future by using the same ID, eliminating the time required to scan the underlying machine code.
Interface reinterceptCall():
SYNOPSIS:
int reinterceptCall(int ID);
DESCRIPTION:
This interface is used by a meta-level to re-intercept a previously un-intercepted function whose information was retained via the keepAlive parameter of uninterceptCall(). On receiving this request, the iManager checks that the calling meta-level is the owner of the ID and immediately changes the underlying machine code.
Interface installCode():
SYNOPSIS:
int installCode( int resourceID
, void *function
, char *codename);
DESCRIPTION:
This interface is used by a meta-level component to install new code into the system or an application (e.g. a user-defined scheduling policy). It informs the iManager of the existence of a function that can act as a replacement for the base-level of the given resourceID. The function location is specified by the parameter function and is uniquely identified by the string in the parameter codename. The iManager maintains a list containing each function's location, its codename and the resourceID it corresponds to.
Normally, a base-level component or an application thread can reify and use a function it implements itself as an alternative resource management module; for instance, a base-level can reify a request to use a function UD_scheduler() as the scheduler for its child threads, as described in the reification section above. The installCode() interface, in contrast, allows a base-level to use a function that it does not implement and which is present in a meta-level. Once a meta-level installs code using this interface, a base-level or an application can request to use it via its unique name (parameter codename). This interface is similar to allowIntercept(), but it allows the use of a function implemented in the meta-level rather than the base-level.
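The list maintained by the iManager can be sketched as a small registry of <resourceID, function, codename> entries. The names and sizes here are illustrative, not DAMROS's actual implementation:

```c
#include <string.h>

#define MAX_INSTALLED 16   /* illustrative capacity */

/* Illustrative registry entry, mirroring the <location, codename,
 * resourceID> list the iManager is described as maintaining. */
struct installed {
    int         resourceID;
    void       *function;
    const char *codename;
    int         used;
};

static struct installed table[MAX_INSTALLED];

/* Record an installed function; the slot index stands in for the
 * unique ID.  Returns -1 when the table is full. */
static int install_code(int resourceID, void *function, const char *codename)
{
    for (int i = 0; i < MAX_INSTALLED; i++) {
        if (!table[i].used) {
            table[i] = (struct installed){ resourceID, function, codename, 1 };
            return i;
        }
    }
    return -1;
}

/* Delete the entry matching the given resourceID and codename;
 * returns 0 on success, -1 when no such entry exists. */
static int uninstall_code(int resourceID, const char *codename)
{
    for (int i = 0; i < MAX_INSTALLED; i++) {
        if (table[i].used && table[i].resourceID == resourceID
            && strcmp(table[i].codename, codename) == 0) {
            table[i].used = 0;
            return 0;
        }
    }
    return -1;
}
```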
Interface uninstallCode():
SYNOPSIS:
int uninstallCode( int resourceID
, char *codename);
DESCRIPTION:
This interface is used by the meta-level to un-install the code previously in-
stalled using installCode(). The iManager deletes the entry represented by
the given resourceID and codename from the list of installed functions.
In summary, the rManager and the iManager components provide differ-
ent interfaces to the applications as well as the system modules to exchange
information and bring about runtime changes to the resource management
modules. The following section describes the design and implementation of
a reflective CPU scheduler that makes use of these interfaces. DAMROS im-
plements two reflective system modules: reflective CPU scheduler (described
next) and reflective virtual memory manager (described in section 3.3.5).
3.3.4 Reflective CPU Scheduler (VRHS)
The design of the reflective CPU scheduler uses the framework provided by
DAMROS. This section describes a Virtual Reflective Hierarchical Scheduler
(VRHS) model [94] in which threads of a common parent are grouped together
and scheduled by a custom application-specific scheduling policy.
Generally, applications are either developed without knowledge of the RTOS they will execute upon, or developed with a specific RTOS in mind. Consequently, several assumptions may be made about the RTOS early in the development cycle. One such assumption concerns the scheduling policy implemented in the RTOS: the scheduling policy in use controls the timing behaviour of the real-time application threads, so an application developed for one RTOS would behave differently when executed on a different RTOS implementing a different scheduling policy.
The motivation for a hierarchical application-specific scheduling model is to avoid this behavioural impact and to allow application-specific scheduling policies to schedule the threads. This preserves the application's timing behaviour no matter how it was developed: the parent application thread can install a UD scheduling policy to schedule its child threads.
The VRHS model implemented in DAMROS is designed as a two-level
scheduler (see figure 3.7). The lower-level scheduler, called the ‘System sched-
uler ’, implements a fixed priority scheduling policy. The system scheduler is
built into the DAMROS kernel. The higher-level scheduler, called the ‘Appli-
cation scheduler ’, executes as an independent system process/thread which is
scheduled by the system scheduler. The next subsection describes the system
scheduler in more detail.
[Figure: the system scheduler at the lower level schedules application schedulers, which together form a virtual scheduling hierarchy.]
Figure 3.7: Two-level Scheduler in DAMROS
[Figure: the rScheduler (meta-level) sits above the base-level application scheduler in the base kernel core, which also holds application reified data. The rScheduler uses requestInfo() to obtain reified information, linkData() to form a causal link to the process queue, interceptCall() to transfer control on intercepted calls and change behaviour after interception, and installCode() to install user-defined code, replacing the static default scheduling policy.]
Figure 3.8: Structure of Reflective CPU Scheduler Module
System Scheduler
The lower-level system scheduler mainly schedules two important system modules – the application scheduler and its meta-level component, called the 'rScheduler'. The system scheduler uses an FP scheduling policy, with the rScheduler having a higher priority than the application scheduler so that the meta-level always makes its changes prior to the execution of the base-level (the application scheduler).
Furthermore, the system scheduler executes the rScheduler thread only when relevant information pertaining to the application scheduler has been reified. All application threads are scheduled by the application scheduler.
While the system scheduler remains unchanged for the entire lifetime of the system, the application scheduler undergoes many changes (brought about by the rScheduler). Figure 3.8 shows the model of the reflective scheduler module. The rScheduler has access to multiple schedulers and can replace the base-level scheduler at runtime using the interceptCall() interface. The next subsection explains the operation of the application scheduler and the rScheduler component in more detail.
Application Scheduler
DAMROS supports a hierarchical scheduling mechanism, with the base-level code (the application scheduler) implementing a round-robin (RR) scheduling policy with a 5 ms time quantum. Traditional hierarchical scheduling approaches use a tree-based structure in which each intermediate node represents a scheduler. The leaf nodes represent the application threads to be executed. A node at a higher level schedules a scheduler at a lower level, eventually scheduling the application threads [39, 99].
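The tree structure used by traditional hierarchical schedulers can be sketched as follows. The node layout and field names are illustrative assumptions, not definitions taken from DAMROS or the cited schemes.

```c
#include <stddef.h>

/* Hypothetical node of a tree-based scheduler hierarchy. */
typedef struct sched_node {
    const char         *name;       /* e.g. "FP", "EDF", "RR" */
    struct sched_node  *parent;     /* scheduler one level up */
    struct sched_node **children;   /* lower-level schedulers or threads */
    int                 n_children;
    int                 is_leaf;    /* leaf nodes are application threads */
} sched_node_t;

/* Depth of a node: the root (system scheduler) sits at depth 0. */
int sched_depth(const sched_node_t *n)
{
    int d = 0;
    while (n->parent != NULL) {
        n = n->parent;
        d++;
    }
    return d;
}

/* Small demonstration: system scheduler -> application scheduler -> thread. */
int demo_depth(void)
{
    sched_node_t root = { "system",      NULL,  NULL, 0, 0 };
    sched_node_t app  = { "application", &root, NULL, 0, 0 };
    sched_node_t leaf = { "T1",          &app,  NULL, 0, 1 };
    return sched_depth(&leaf);
}
```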
Using reflection, in DAMROS, it is possible to have a virtual hierarchy
of schedulers whilst still maintaining a simple two-level scheduler structure.
DAMROS currently implements the following schedulers: FP, RR, EDF and
FCFS. These schedulers are used by the rScheduler to replace the existing
application scheduler if required at runtime. DAMROS does not support dy-
namic loading. All schedulers have to be preloaded.
The rScheduler (meta-level component of the application scheduler) com-
ponent is designed to run as an independent system thread having a higher
priority than the base-level application scheduler thread. At the time when the
application scheduler needs to schedule the newly added scheduler, the rSched-
uler simply replaces the application scheduler with the new scheduler code.
This action makes the system scheduler transparently schedule the new sched-
uler instead of the application scheduler. Note that neither the application
scheduler nor the system scheduler knows about this change. The rScheduler
then reverts to the original scheduler when the new scheduler is no longer
required to run.
When information pertaining to the CPU is reified, the rManager activates
the rScheduler thread which obtains the reified information and makes appro-
priate changes to the base-level application scheduler if required. The change
may require either the manipulation of base-level data structures or replac-
ing the application scheduler itself (e.g. using linkData() and interceptCall()).
The operation of rScheduler is explained in more detail later. The next section
describes the implementation of the universal run queue in DAMROS.
Universal Run Queue
To facilitate the operation of the rScheduler, DAMROS implements a Universal
Run Queue (URQ) that contains all the runnable threads, maintained in the
order of their execution. All threads in the system are linked together to form
a family tree hierarchy making it easier to trace parent and child relationships.
Furthermore, each thread is associated with a scheduler and threads belonging
to a common parent having the same scheduler are grouped together forming
a process queue for that scheduler. The application scheduler has access to
one such process queue containing only the threads that it is responsible for.
The system scheduler keeps track of the process queue currently in use by the
application scheduler.
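The URQ bookkeeping described above can be sketched with the following structures; the field names and the flat list layout are illustrative assumptions rather than the actual DAMROS definitions.

```c
#include <stddef.h>

/* Hypothetical per-thread record in the Universal Run Queue. */
typedef struct urq_thread {
    int                tid;
    int                sched_id;    /* scheduler this thread belongs to */
    struct urq_thread *parent;      /* family-tree link */
    struct urq_thread *next_in_q;   /* next thread in the process queue */
} urq_thread_t;

/* Count the threads grouped under one scheduler's process queue. */
int process_queue_length(const urq_thread_t *head, int sched_id)
{
    int n = 0;
    for (; head != NULL; head = head->next_in_q)
        if (head->sched_id == sched_id)
            n++;
    return n;
}

/* Demonstration using the example below: threads T5..T7 of application
   A2 are grouped under one scheduler (id 2 is an arbitrary label). */
int demo_edf_queue_length(void)
{
    urq_thread_t t7 = { 7, 2, NULL, NULL };
    urq_thread_t t6 = { 6, 2, NULL, &t7 };
    urq_thread_t t5 = { 5, 2, NULL, &t6 };
    return process_queue_length(&t5, 2);
}
```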
The representation in the URQ can be illustrated with an example. Sup-
pose that there are three applications executing in the system. Threads T1 to
T4 belong to application A1, threads T5 to T7 belong to application A2 and
threads T8 to T12 belong to application A3. It is known that threads T1 to T4
are best scheduled using the FP scheduling policy, while threads T5 to T7 require
the EDF scheduling policy and threads T8 to T12 require the FCFS scheduling
policy. Thus, in all, there are 12 threads executing in the system. For simplicity,
it is assumed that none of the threads use shared resources or block on I/O.
Figure 3.9: URQ: Representation of Threads
Figure 3.9 shows the representation of the URQ for the above example.
The CPU bandwidth is equally divided amongst different applications in the
system. If the application uses a different scheduler for its threads then the al-
located CPU bandwidth is distributed to the threads depending on the schedul-
ing policy.
As per the example, the application scheduler should schedule the lower-
level schedulers – either FP, EDF or FCFS – which in turn schedule the cor-
responding application threads.
Distribution of CPU bandwidth
The VRHS model distributes the CPU bandwidth amongst each scheduler used
by the application threads. Using a timer in DAMROS, the rScheduler thread
gets activated when the CPU budget allotted to a scheduler is exhausted. The
CPU bandwidth in terms of CPU time for the above example is distributed
as follows: there are 3 different applications in the system. If scheduled using
RR scheduler, each application thread is executed for 5 ms in a RR order.
The CPU bandwidth is to be divided equally amongst all three applications.
Thus, an application with the least number of threads is considered. In this
case, it is application A2. The total CPU bandwidth to be allocated to each
scheduler is calculated as the product of the time quantum of RR scheduler
and the smallest number of threads in an application; i.e. each scheduler in the
second level is allocated 5 ms × 3 = 15 ms.
Depending on the scheduling policy, the allocated CPU bandwidth might
be used by all, a few or only one of the threads attached to that particular
scheduler. Note that the execution of the scheduler code at any level is not
accounted for when distributing the CPU bandwidth.
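As a concrete check of the arithmetic above, the budget rule can be sketched in C. The function name and millisecond units are illustrative, not part of DAMROS.

```c
/* Sketch of the bandwidth rule described in the text: each second-level
   scheduler receives (RR time quantum) x (smallest thread count among
   the applications). */
int scheduler_budget_ms(int rr_quantum_ms, const int *threads_per_app, int n_apps)
{
    int min = threads_per_app[0];
    for (int i = 1; i < n_apps; i++)
        if (threads_per_app[i] < min)
            min = threads_per_app[i];
    return rr_quantum_ms * min;
}
```

For the running example (applications with 4, 3 and 5 threads, a 5 ms quantum), this yields the 15 ms figure derived above.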
Shared Resources
When using different schedulers, the framework restricts the sharing of
resources to threads belonging to a single application; that is, threads in
different applications using different scheduling policies cannot share resources
amongst each other.
Threads of the same application that are scheduled by a single scheduler can
share resources. Each shared resource is associated with a mutex for thread
synchronisation.
synchronisation. In this case, each mutex has an associated list consisting of
the threads that are allowed to lock it. In DAMROS, when a thread blocks
on a shared resource, the thread that currently holds the mutex is executed
at high priority until it releases the mutex. This is done by the rScheduler.
When a thread A tries to lock an already-locked mutex, it is blocked and the
rScheduler thread is activated by setting the appropriate flag. If the mutex
is currently locked by thread B, then irrespective of the scheduler being
used by the application, the rScheduler manipulates the process queue of the
scheduler such that thread B uses the allocated CPU bandwidth until it releases
the mutex. On release of the lock, the previously blocked thread A becomes
runnable and the rScheduler is activated to reset all the changes it made to
the process queue.
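The queue manipulation performed by the rScheduler when a thread blocks on a mutex can be sketched as a simple move-to-front on the process queue. The structure and function names are illustrative assumptions; the real DAMROS implementation is not shown in the text.

```c
#include <stddef.h>

/* Hypothetical entry in a scheduler's process queue. */
typedef struct pq_thread {
    int               tid;
    struct pq_thread *next;
} pq_thread_t;

/* Move the mutex holder to the head of the queue so it consumes the
   allocated CPU bandwidth until it unlocks; returns the new head. */
pq_thread_t *boost_mutex_holder(pq_thread_t *head, int holder_tid)
{
    pq_thread_t *prev = NULL, *cur = head;
    while (cur != NULL && cur->tid != holder_tid) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL || prev == NULL)
        return head;          /* not found, or already at the head */
    prev->next = cur->next;   /* unlink the holder */
    cur->next = head;         /* relink it at the front */
    return cur;
}

/* Demonstration: thread 3 holds the mutex and is moved to the front. */
int demo_boost(void)
{
    pq_thread_t c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    pq_thread_t *h = boost_mutex_holder(&a, 3);
    return h->tid;
}
```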
Operation of VRHS Model
Figure 3.10 shows the virtual structure of the schedulers in the VRHS model
that schedule the application threads for the previous example. The system
scheduler invokes only one lower-level scheduler (the application scheduler) at
any given time. The rScheduler changes the base-level application scheduler
code such that the scheduler next required to schedule the threads is directly
invoked by the system scheduler, avoiding an additional level of indirection
through the application scheduler (root node of the hierarchy).
The rManager activates the rScheduler thread when information concern-
ing the CPU is reified. In figure 3.10, threads T8 to T12 are shown ready to
[Figure: the system scheduler in the kernel invokes the application scheduler; the rScheduler thread intercepts it so that the FP, EDF, FCFS or UD policy directly schedules its own threads (T1–T4 under the FP policy, T5–T7 under the EDF policy, T8–T12 under the FCFS policy, shown ready).]
Figure 3.10: Operation of the VRHS Model
be scheduled for the first time in the system. At this point, the rScheduler
thread replaces the application scheduler with the FCFS scheduler using the
interceptCall() interface and sets a timer that expires after a time equal to the
CPU budget of the FCFS scheduler. The system scheduler schedules the
application scheduler, which now implements the FCFS scheduling policy. Note
that neither the system scheduler nor the application scheduler is aware of this
change.
The rScheduler gets activated for two reasons: one is when informa-
tion is reified (activated by the rManager) and the other is when the ap-
plication scheduler needs to be changed (expiry of timer). Any new sched-
uler implementation is added to the system using either installCode() or
reify(CHILD SCHED, &UD scheduler) call. The VRHS model makes it simple
to add schedulers to the hierarchy.
Using this model, a scheduler hierarchy of any depth can be virtually cre-
ated. In the previous example, suppose that thread T9 spawns child threads
T13 to T16 and introduces a UD (application-specific) scheduling policy to
schedule its child threads. In this case, the CPU budget is recalculated for the
threads of the FCFS scheduler alone such that the CPU budget of T9 is shared
by its child threads as well. Appropriate changes are made to the URQ. Just
before scheduling the threads T13 to T16, the rScheduler changes the FCFS
policy at the application scheduler to the new UD scheduler. Now there ex-
ists a three-level virtual hierarchy of schedulers in the model. However, the
rScheduler always maintains a two-level scheduler hierarchy at any time.
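One plausible reading of the budget recalculation above, under the assumption that each thread initially receives an equal slice of its scheduler's budget and that a parent's slice is divided equally among its children, can be sketched as follows. The formula and the microsecond units are illustrative guesses; the thesis does not specify the exact recalculation.

```c
/* Hypothetical recalculation: when a thread spawns n_children with
   their own UD scheduler, the parent's per-thread slice of its
   scheduler's budget is shared equally among the children. */
int child_slice_us(int scheduler_budget_us, int n_threads_in_scheduler,
                   int n_children)
{
    int parent_slice = scheduler_budget_us / n_threads_in_scheduler;
    return parent_slice / n_children;
}
```

For the example, the FCFS scheduler's 15 ms budget over five threads gives T9 a 3 ms slice, which T13 to T16 would then share at 750 µs each under this assumption.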
Minimising Context Switches
Both rScheduler and the base-level application scheduler threads should use
as little time as possible. The implementation of the rScheduler thread is such
that it is not dependent on any external parameters (e.g. shared resources) and
can execute to completion in a single shot (without interruption). Thus, rather
than context switching to it, the system scheduler does a simple procedure
call to jump directly to the thread’s code, eliminating any context-switching
overhead associated with the execution of the rScheduler thread.
All the in-built schedulers in DAMROS are similarly designed to be ex-
ecuted by a procedural call instead of context switching to them. Thus, if
all the scheduling policies in use by the application threads are the in-built
ones, then the VRHS model incurs no context switch overhead due to the
various schedulers in the hierarchy. However, when a UD scheduler is used,
the rScheduler sets a contextSwitch flag in DAMROS, such that the system
scheduler context switches to the application scheduler instead of the usual
procedural call.
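The dispatch decision just described can be sketched as follows: in-built schedulers are entered via a plain procedure call, while a set contextSwitch flag forces a full context switch for UD schedulers. The counters and function names are illustrative; only the contextSwitch flag is named in the text.

```c
/* Sketch of the dispatch choice made by the system scheduler. */
typedef void (*sched_fn_t)(void);

static int context_switch_flag = 0;  /* set by rScheduler for UD schedulers */
static int switches_taken = 0;       /* instrumentation for the demo */
static int direct_calls = 0;

/* Stand-in for a real context switch; here it just runs the scheduler. */
static void fake_context_switch(sched_fn_t f) { switches_taken++; f(); }

void dispatch_application_scheduler(sched_fn_t sched)
{
    if (context_switch_flag)
        fake_context_switch(sched);  /* UD scheduler: full switch */
    else {
        direct_calls++;
        sched();                     /* in-built scheduler: plain call */
    }
}

static void dummy_sched(void) { }

/* Demonstration: one direct call, then one context switch. */
int demo_dispatch(void)
{
    context_switch_flag = 0;
    dispatch_application_scheduler(dummy_sched);
    context_switch_flag = 1;
    dispatch_application_scheduler(dummy_sched);
    return direct_calls * 10 + switches_taken;
}
```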
Operation of the rScheduler
Figure 3.11 lists the pseudo code of the rScheduler thread. The reified infor-
mation pertaining to the CPU is processed, adding or removing schedulers to
the hierarchy or manipulating the base-level data structures. Later, the URQ
is checked to see if the next scheduler to be active is different from the current
scheduler. This is true when the rScheduler is activated by an expired timer,
in which case the application scheduler is replaced using the interceptCall()
interface. The pseudo-code in the figure is self-explanatory. The next subsection
discusses some of the issues related to VRHS or the hierarchical scheduling
schemes in general.
Issues with Scheduler Behaviour
Due to the hierarchical scheduling structure, various kinds of schedulers
are active in the hierarchy. It can be difficult to determine the state or
behaviour of the system at any given point in time or to accurately distribute
CPU bandwidth amongst all co-existing schedulers and the threads in the sys-
tem. This issue is often attributed to the scheduler composition problem [99]
which arises from the incorrect choice of schedulers that coexist in the system.
Also, in the hierarchical model the distribution of CPU bandwidth amongst
the application threads varies and is dependent on the type of schedulers com-
posed in the hierarchy. Furthermore, this distribution is in direct proportion
to the scheduler type used at each level. Though the composition of certain
scheduling policies is flawed [99], the VRHS model still allows the existence
void rScheduler(void)
{
reify_t info;
void *next_scheduler;
/* read all reified information and perform any operations
if required */
while ( requestInfo(CPU, 0, 0, 0, &info) != -1 ) {
...
if(isSchedulerAdd(info.type)){
/* if reified info requires adding a new scheduler then
group the corresponding threads forming a new process
queue for the new scheduler and setup the URQ
*/
}
else if(isSchedulerRemove(info.type)){
/* if reified info requires removing a scheduler then
regroup the corresponding threads adding them to the
process queue of the scheduler one level up the
hierarchy and accordingly setup the URQ.
*/
}
else if(isDataChange(info.type)){
/* make sure the requesting thread is not a parent
application thread, then change thread priority
or deadline if FP or EDF policy is used.
Otherwise ignore.
*/
}
...
} // while loop
/* Analyse the hierarchy and determine
the next level scheduler. */
next_scheduler = next_level_scheduler(_current_apps_scheduler);
/* change scheduler only if required */
if ( next_scheduler != _current_apps_scheduler ) {
/* Using DAMROS reflection API intercept the
current scheduler and change it to next scheduler.
NOTE: this call changes the program code
*/
interceptCall(_current_apps_scheduler, next_scheduler, NULL, NULL, 0, 0);
_current_apps_scheduler = next_scheduler;
} // end if (interception)
/* relinquish CPU and let
application scheduler execute */
} // end rScheduler
Figure 3.11: Pseudo-code of rScheduler Thread
of such schedulers without incurring a high penalty on other application threads
not scheduled using those particular schedulers. The problem can be addressed
by the use of a budgeting system as employed in [12, 30, 102]. In this case, all
threads share a certain amount of CPU bandwidth that is negotiated using a
contract with the main scheduler in the system. The scheduler then prioritises
the threads for scheduling and sets timers to expire when the budget of the
currently executing thread is exhausted.
The VRHS model is presented as a simple example showing the potential
of the framework and the amount of flexibility it offers. The next subsection
describes the sample implementation of an application-specific UD scheduler.
Application-Specific UD Scheduler
This subsection provides guidelines for the development of application-specific
UD schedulers that can be accommodated into the VRHS model. Figure 3.12
shows a flow chart representing a typical application-specific scheduler.
The shaded blocks shown are application-specific and need to be imple-
mented by the application developer, while the rest of the blocks are provided
by VRHS model. Block 1 contains the code that requests the process queue. If
granted, the queue consists of only the threads belonging to the application of
the requesting scheduler. The UD scheduler requests the process queue each
time it is invoked. Block 2 checks if the request for process queue has been
granted. If not granted, then the Failure recovery routine implemented by the
application developer is executed (block 4). This code could either do a retry
using timeout (shown as dotted line in figure) or could simply shutdown the
application.
If the request is granted, a causal link with the requested process queue
[Figure: flow chart of a typical application-specific UD scheduler. Block 1: request the application-specific process queue. Block 2: request granted? If not, block 4: failure recovery (with an optional retry, shown as a dotted line). If granted, block 3: link to the process queue; block 5: make the scheduling decision using the queue; block 6: context switch to the selected thread (end).]
Figure 3.12: Application-specific UD Scheduler Blocks
is established in block 3. The code in block 5 (provided by the application
developer) uses the process queue to determine the next thread to be scheduled
using an UD scheduling policy. Block 6 context switches to the selected thread.
This structure, if used by the UD schedulers, ensures uniformity across all
UD schedulers in the system and helps preserve system integrity. By hiding away
the complexity, the reflective framework lets the application developer concen-
trate on the UD scheduling policy alone.
Figure 3.13 lists the pseudo code of a UD scheduler implementing the
FCFS scheduling policy. This UD scheduler schedules all the ready threads
in FCFS order. If the first thread that entered the system is not ready, the
scheduler executes the next ready thread. In the pseudo code, requestInfo() call
void fcfs_scheduler(void)
{
    process_queue_t *p;
    thread_t *next_thread;
    int id = 0, retry = 0;

try_again:
    /* request the process queue */
    id = requestInfo(CPU, PROCESS_QUEUE, 0, 0, NULL);

    /* request granted if id is greater than 0 */
    if ( id > 0 )
    {
        /* causal link to the requested data */
        p = (process_queue_t *)linkData(id);

        /* FCFS scheduling policy:
           find the first READY thread in the queue */
        for(next_thread = p->head;
            next_thread != NULL;
            next_thread = next_thread->next_in_q)
        {
            /* check if the thread is ready to RUN */
            if(next_thread->state == THREAD_READY){
                break;
            }
        }

        /* activate rScheduler to change the application
           scheduler if no thread is ready to run */
        if(next_thread == NULL){
            activate_rscheduler();
        }

        /* context switch to the selected thread */
        context_switch_to(next_thread);
    }
    else {
        /* kill the application after 5 retry attempts */
        if( ++retry == 5 ){
            kill_application();
        }
        goto try_again;
    }
} // end of FCFS scheduler
Figure 3.13: User-Defined FCFS Scheduler
requests the rManager to establish a causal link to the application’s process
queue. The rManager validates this request, checks if the calling component
is the scheduler for the application process using the URQ and returns an ID
if successful. The scheduler uses a ‘linkData(id)’ call to form a causal link
with the process queue. The remaining code carries out the FCFS scheduling
policy by switching to a ready thread in FCFS order. Note that most of the
complexity of dispatching, timing/event management, etc. is hidden from the
UD scheduler making it easier to implement and maintain.
In order to use this scheduler for its child threads, the parent application
thread would use the reify(CHILD SCHED, &fcfs scheduler) call. The next
subsection describes the design and implementation of the reflective memory
management system in DAMROS.
3.3.5 Reflective Memory Management System
(RMMS)
Memory in real-time embedded systems is an important resource and needs
to be managed efficiently. This section presents the reflective memory man-
agement system (RMMS) implemented in DAMROS. Memory requirements of
complex embedded applications (e.g. multimedia applications) vary dynami-
cally at run-time. DAMROS implements a paged virtual memory management
scheme. The size of each memory page is set to 4 KB.
With paging, code/data residing in consecutive virtual memory pages need
not be physically contiguous in memory, eliminating external memory
fragmentation. The use of auxiliary memory as swap space allows applications
with larger memory requirements to use more memory than is actually (phys-
ically) available. However, the page-swap operations associated with paging
are considered to incur significant performance penalties, which is why paging
has not been widely deployed in embedded systems. This thesis contends
that, by proper use of information about the memory access patterns
of applications, it is possible to efficiently manage memory and reduce the
associated paging overhead.
In a paged system, the main operation of a memory manager (MM) is to
allocate memory to a requesting system module/application. In DAMROS,
memory is allocated in pages, i.e. the size of memory allocation is in multiples
of a page. When allocating a new page, if there is no free memory page
available for allocation then the memory manager selects an already allocated
page (called the victim page) for allocation by moving its contents to the swap
space. This process is called page swapping, in which the contents of a memory
page are moved to the swap space or vice-versa.
If a process tries to access a swapped page, then the system generates a
page-fault transferring the control to the RTOS’s page-fault handler routine.
The page-fault handler copies the contents of the swapped page back into
memory. This operation may also cause another memory page to be swapped
if there are no free memory pages.
Since the auxiliary memory device is slower than the main memory, each
page swap operation costs a certain number of CPU cycles or CPU time. Thus,
for better system performance, it is important that the Memory Management
(MM) module minimises the total number of page-faults (i.e. page swap op-
erations). Ideally, the MM module should select a victim page such that, once
swapped, the page is not be accessed in the immediate future. This ideal case
is nearly impossible to implement. This is because, it is difficult if not im-
possible to predict, at runtime, the memory access patterns of all application
128
threads running in the system. Furthermore, it cannot be ascertained in ad-
vance whether a particular victim page would be required in the immediate
future.
There are several page replacement policies such as Least Recently Used
(LRU) [120], Most Recently Used (MRU) [120], etc. and various optimisations
to these policies which an MM can implement. However, none of these policies
satisfies the ideal requirement; most provide only average-case support.
Using the reflection mechanism, it is possible to obtain reified information
about the memory access patterns of the applications. Such information can
describe accesses to a particular memory location, or suggest a UD
policy. The MM module can adapt the underlying paging policy accordingly
to provide better support. The RMMS module makes use of the reflection
framework in DAMROS [96] to either dynamically adapt the page replace-
ment policy or change it to use a different policy depending on application
requirements.
The RMMS model is shown in figure 3.14 (see upper-right corner). It con-
sists of a base-level MM component implementing a standard page replacement
policy and a meta-level component. The meta-level uses the requestInfo()
interface to obtain information about memory accesses that is reified either by
the applications or by the base-level MM module.
The base-level maintains a global page list containing each memory page
along with the time it was last accessed, the frequency of access and a page
flag representing whether the page is new/old or has been marked as a victim
for swapping. Furthermore, pages belonging to different applications are also
grouped together to form individual page tables for each application. A meta-
level can simply make changes to the ordering of the pages in this global page
[Figure: the currently executing application process reifies information via reify(); within DAMROS, the meta-level code of the RMMS effects changes in the base-level memory manager, alongside the application scheduler.]
Figure 3.14: Structure of the RMMS Model
list to adapt the existing paging policy. The base-level would not be aware of
such a change and would continue its normal operation.
The left-hand side of figure 3.14 shows an application process reifying
information. The reified information could be any of the following: an access
to a memory region, a memory region that is no longer required, an allocation
of memory, or a request for a change in policy.
If information about a future memory access reaches the RMMS meta-level
component, it analyses the global page list to check whether the corresponding
memory pages are marked as victim pages (for reclamation) by the base-level.
If true, then it changes the page flag accordingly to avoid the unnecessary page
swap operation that could have resulted in multiple page-faults.
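The victim-flag adjustment described above can be sketched as follows: when a future access to a region is reified, the meta-level clears the victim mark on any overlapping pages so they are not swapped out. The flag bit, structure layout and names are assumptions for illustration.

```c
/* Hypothetical global-page-list entry and victim flag. */
#define PAGE_VICTIM 0x1
#define PAGE_SIZE   4096UL

typedef struct gpage {
    unsigned long addr;   /* page-aligned base address */
    unsigned int  flags;  /* victim/new/old marks */
} gpage_t;

/* Clear the victim flag on every page overlapping [start, start+len);
   returns the number of pages rescued from swapping. */
int unmark_victims(gpage_t *pages, int n, unsigned long start, unsigned long len)
{
    int cleared = 0;
    for (int i = 0; i < n; i++) {
        if (pages[i].addr + PAGE_SIZE > start &&
            pages[i].addr < start + len &&
            (pages[i].flags & PAGE_VICTIM)) {
            pages[i].flags &= ~PAGE_VICTIM;
            cleared++;
        }
    }
    return cleared;
}
```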
In DAMROS, the RMMS base-level implements a clock-based least re-
cently used (LRU) page replacement policy similar to the Linux OS [54] which
henceforth referred to as the LRU policy. DAMROS maintains statistical
information that reflects the current memory usage of application threads
executing in the system. The meta-level component is scheduled by the sys-
tem scheduler as a high priority kernel thread (one priority lower than the
rScheduler). It is activated by the rManager when information pertaining to
memory is reified. Other than manipulating the global page list, the meta-
level intercepts the base-level to replace it with application-specific UD page
replacement policies as and when required.
The meta-level of RMMS module in DAMROS is called rVMM. Following
are the rules that govern the functioning of rVMM :
• an application can request a change in the paging policy. This change,
if granted by the rVMM, is applicable only to the memory pages
of the concerned application; i.e., an application can only make changes
to the pages it uses and not to the pages used by any other thread.
• an application can request that a particular region of memory be
locked or unlocked, so that it is either never swapped out or made
available for swapping by the MM code. This is similar to the mlock()
mechanism found in the Linux OS [19].
Figure 3.15 shows the structure of the RMMS module which is similar to
the reflective scheduler module shown in figure 3.8. DAMROS has built-in
implementations of MRU and LFU page replacement policies along with the
LRU policy at the base-level.
The applications would typically reify information about their memory
access patterns to rVMM as follows:
reify(MEM_READ, &my_var, 256);
[Figure: the base-level virtual memory manager (static default VMM policy, MRU, optimised and user-defined paging policies) sits on the base kernel core, which also holds the application reified data; the meta-level rVMM uses reify(), installCode(), requestInfo(), interceptCall() and linkData() to intercept calls, maintain a causal link to the page tables and change base-level behaviour.]
Figure 3.15: Reflective Memory Management System (RMMS)
The above reification call suggests that 256 bytes of data, starting at the
location pointed to by my_var, are being read by the application. The reify
interface prepares the reify_t data structure such that the dataPtr field
contains the starting memory location (i.e. the address of my_var) and the
data field contains the size of the access (i.e. 256 in this case). The rVMM
is activated by the rManager when such information is reified. When executed,
it adjusts the page flags accordingly.
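The way the reify interface might populate the record can be sketched as follows. The dataPtr and data fields come from the text; the type field and the overall layout of reify_t are assumptions for illustration.

```c
/* Hypothetical reified-information record; only dataPtr and data are
   named in the text, the rest is an assumed layout. */
typedef struct reify {
    int   type;     /* e.g. MEM_READ */
    void *dataPtr;  /* starting address of the access */
    long  data;     /* size of the access in bytes */
} reify_t;

#define MEM_READ 1

/* Build the record that reify(MEM_READ, &my_var, 256) would produce. */
reify_t make_mem_read_hint(void *addr, long size)
{
    reify_t r;
    r.type = MEM_READ;
    r.dataPtr = addr;  /* e.g. &my_var */
    r.data = size;     /* e.g. 256 */
    return r;
}
```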
It is difficult for the base-level MM to keep track of which memory pages
were recently accessed, as there is no hardware support that records the time
of access for a particular memory page. Thus, the MM policy approximates
the page-access history and applies the respective policy to select victim pages.
By updating the access information in the global page list, the rVMM helps
the base-level make more accurate decisions. Furthermore, if an application
requires a different paging policy to be used to manage its pages, the rVMM
can use any of the built-in policies or a UD paging policy to replace the base-
level policy. The next subsection describes the implementation of a typical
application-specific UD paging policy.
Application-specific Paging Policy
Figure 3.16 shows the operation of RMMS when using a UD policy for an
application. Application X uses a UD paging policy. During initialisation,
the application reifies a request to change the paging policy to the UD policy
that it implements. The rManager activates the rVMM thread which makes
the appropriate changes in the base-level MM. On the next page-fault, caused
by application X, the rVMM thread gets activated. It checks if the page-fault
is to be handled by the UD policy. If true, control is transferred to the UD
policy code rather than the base-level LRU code. The UD policy has access
to a page list consisting of all the pages allocated to the faulting application.
The UD policy determines the next victim page from the page-list and
returns control back to the rVMM module. The selected page is moved to
the swap space. While handling a page-fault, the execution time of UD policy
is accounted against the scheduler’s CPU budget and the system interrupts
remain enabled. Once this budget expires, the rScheduler changes to a different
scheduler (if in use). It is not possible for a malicious UD policy to use the
CPU forever.
[Figure: a page-fault on one of the pages used by Application X is caught by the hardware MMU and the system page-fault handler; the RMMS module in the DAMROS kernel up-calls the application-specific UD policy in the application program code, which selects the target victim page from the pages (A, B, C, …, X) used by Application X.]
Figure 3.16: Operation of the RMMS model
To maintain uniformity amongst different application-specific UD policies,
the applications must adhere to the following guidelines. A typical application-
specific UD paging policy consists of an initialisation phase and a decision
phase. In the initialisation phase, the application initialises all the data struc-
tures (i.e. page table). In the decision phase, the UD code selects a victim
page to be moved to the swap space. The following subsection describes this
with an example implementation of UD paging policy.
Example User-defined paging policy
Figure 3.17 shows the C style pseudo code of a UD paging policy implemented
in the application. This code is executed when a page-fault occurs on a memory
page belonging to this application. It uses the requestInfo() interface to request
the page table of the application. The policy retries the page table request for
five times and, if it is not granted, kills the application. However, if the request is
granted, it links to the page table. The UD policy has access to information
about each page used by the application such as whether a page is currently
in memory or swap space or whether a page is marked as a victim page. This
is done by associating a flag field with each page in the page table. Setting
a particular bit in this flag accordingly reflects the status of the respective
page. Any change made by the UD policy to the page table affects how the
base-level policy reclaims pages from this particular page-table.
The function setVictim() operates on a given page’s flag marking it as a
victim page. The rVMM enforces the following fairness policy: when the UD
policy requests a page table, the rManager activates the rVMM thread,
which keeps a record of the number of victim pages already marked in the
page table. When the UD policy returns control to it, the number of victim
pages in the page table must still be equal to or greater than the previous number.
If this is not the case then the rVMM kills the application and reclaims all its
memory pages. Thus, it is not possible for a UD policy to keep all its pages in
memory. Due to this check, the operation of one UD policy does not adversely
affect other applications in the system.
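The fairness check described above can be sketched in a few lines; the flag bit and function names are illustrative assumptions.

```c
/* Hypothetical victim bit in a per-page flags word. */
#define PG_VICTIM 0x1

/* Count the victim-marked pages in an application's page table. */
int count_victims(const unsigned int *page_flags, int n)
{
    int v = 0;
    for (int i = 0; i < n; i++)
        if (page_flags[i] & PG_VICTIM)
            v++;
    return v;
}

/* Returns 1 if the UD policy violated fairness by unmarking victims,
   in which case the rVMM would kill the application. */
int fairness_violation(int victims_before, int victims_after)
{
    return victims_after < victims_before;
}
```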
Furthermore, it is possible for the RMMS module to benefit from the infor-
mation available to the CPU scheduler as well. The next subsection describes
how such information could be used to benefit each resource management
module.
Use of RMMS in Scheduling
Using reflection, it is possible for the meta-levels of both the RMMS module
and the reflective CPU scheduler to interact with each other and exchange
information. The CPU scheduler can act as a base-level to RMMS and reify
void UD_paging(void)
{
    page_table_t *p;
    int id = 0, retry = 0;

try_again:
    /* request the page table */
    id = requestInfo(MEMORY, PAGE_TABLE, 0, 0, NULL);

    /* request granted if id is greater than 0 */
    if ( id > 0 ) {
        /* causal link to the requested data */
        p = (page_table_t *)linkData(id);

        /* UD paging policy */
        ...
        /* from the page table, select a victim page */
        ...
        /* set the selected page as victim */
        setVictim(selected_page);
    }
    else {
        /* kill the application after 5 retry attempts */
        if( ++retry == 5 ){
            kill_application();
        }
        goto try_again;
    }
} // end of UD paging policy
Figure 3.17: Application-specific UD Paging Policy
information to it. For instance, a reflective CPU scheduler can reify applica-
tion timing information to the RMMS module (e.g. remaining CPU budget,
deadline, etc.). The cost of swapping pages in and out of memory is high in
terms of both time and power as compared to a context switch if the pages
required by next application thread/process are already present in memory.
The RMMS module could use such information to determine if the cost of
swapping a page is higher than simply reducing the remaining budget of the
application thread/process and requesting the scheduler to perform a context
switch to another process.
Similarly, the scheduler can acquire information from the RMMS module
regarding an application’s memory usage to check if its pages are in memory
before context switching to it. Note that DAMROS is a single address space
RTOS supporting reflective hierarchical scheduling. The context of this discus-
sion is scheduling of multiple threads/processes of a single application in order
to efficiently use the CPU budget allotted to the scheduler. The interaction of
one scheduler with the RMMS module does not affect any other scheduler or
application threads/processes in the system.
The implementation of the reflective CPU scheduler (VRHS model) and the
reflective virtual memory (RMMS module) allow experiments to be performed
on DAMROS to evaluate the generic reflective framework. This is discussed
in the next section.
3.4 Evaluation
Compiling DAMROS using gcc (version 3.2.2) for the Intel x86 architecture [61]
produces an image of size 83 KB, including the framework, two reflective
system modules, device drivers and the test applications. The hardware
used to conduct the experiments included an embedded single-board computer
with a Cyrix MediaGX (233 MHz) processor and 64 MB of SDRAM. DAMROS
was configured to use only the first 4 MB of RAM. On an actual embedded
system, a flash memory or other similar auxiliary memory device would be
used for the swap space. For simplicity, DAMROS uses the upper 48 MB of
RAM (i.e. 16 MB onwards) as the swap space. Note that the application
timings in this case are much faster than they would be on an actual system.
However, all experiments use the same setup to guarantee uniformity in the
recorded timings.
The objective of this evaluation is to verify the operation of the reflective
framework, to show the degree of flexibility offered for application development,
and to demonstrate support for application-specific resource requirements at
runtime. More detailed experiments, which compare against other approaches,
are presented in the following chapters.
3.4.1 Timing Analysis
Table 3.1 lists the maximum time taken by each interface in the reflection
framework along with the page-fault handler routine in DAMROS. The time
was measured using the time-stamp counter of the processor. The timing
measurements depend on the hardware being used (i.e. the CPU clock speed).
Nevertheless, the figures are indicative of the relative performance of the in-
terface in DAMROS.
3.4.2 Changing Application Behaviour
This section demonstrates the ability of the framework to allow applications
to change their behaviour at runtime. Two application threads T1 and T2
RTOS Function          Max. time, t in µs

reify()                0 ≤ t ≤ 1
requestInfo()          1 ≤ t ≤ 2
linkData()             0 ≤ t ≤ 1
unlinkData()           0 ≤ t ≤ 1
interceptCall()        1 ≤ t ≤ 2
uninterceptCall()      0 ≤ t ≤ 1
reinterceptCall()      0 ≤ t ≤ 1
allowIntercept()       0 ≤ t ≤ 1
interceptAllowed()     1 ≤ t ≤ 2
installCode()          4 ≤ t ≤ 5
uninstallCode()        1 ≤ t ≤ 2
page-fault handler     0 ≤ t ≤ 4
Table 3.1: Measured Execution Times of DAMROS Interfaces
were implemented. Thread T1 calls a function read_packet() in an infinite
loop. The function read_packet() implements a particular algorithm or protocol
to read data from a memory buffer. This function is critical to the
functioning of the application. Before entering the loop, T1 uses the
allowIntercept(&read_packet, "read_packet", 0) call to allow the interception of
function read_packet() by any component or thread in the system.

Assume that, in the future, the function read_packet() needs to be changed due
to a bug or a change in requirements. Also, suppose that the change is to
be patched at runtime without having to stop, recompile and rerun the thread.
This is particularly the case when the application has been deployed on board
a satellite or a Mars exploration vehicle, for instance.
Thus, in order to fix this issue without having to stop, recompile and rerun
thread T1, an independent thread T2 containing the new implementation
of the function read_packet() is developed off-line. When executed
in the system, this thread intercepts the function read_packet() in thread T1 using the
interceptAllowed("read_packet", 0, &new_function) call. Control is now
transferred to the new implementation in thread T2, replacing the original
functionality.
From the data collected over 1000 samples, it was found that, on average,
it took 4 µs for thread T2 to effect the change in thread T1. The time was
measured from the moment interceptAllowed() was called in thread T2
until it returned.
In a similar way, an application can effect a change to various other
attributes such as the priority, the paging policy, the scheduling policy, etc. It was
found that the measured maximum execution time to effect any such change
was no more than 30 µs in DAMROS (note: a combination of installCode(),
reify() and other interface calls can take more time, due to the privilege checks
performed). The next subsection presents a detailed evaluation of VRHS,
the reflective CPU scheduler.
3.4.3 Evaluation of VRHS
The evaluation is divided into two parts: one using preliminary tests and the
other using detailed experiments.
Preliminary Tests
[Case 1] Consider an application with threads T1 and T2, each consisting of a loop
with a maximum of 10,000 iterations. Table 3.2 shows the results obtained
after executing T1 and T2 simultaneously under normal conditions (i.e. with no
reflective scheduler). In table 3.2, Start indicates the time (in seconds) at which an
application thread started executing in the system; End indicates the time when
the application thread finished execution; and Lifespan (i.e. End − Start)
indicates the time the application thread spent in the system (this does not mean
that the application thread was executing throughout its lifespan). The
time measured here is relative to the time the RTOS was initialised. For instance,
in table 3.2, T1 starts 0.134 s after the RTOS was initialised.
It is evident that both applications take almost the same time to complete,
with T2 finishing 0.065 s later than T1. This is because T1 started execution
ahead of T2. The results were obtained as an average of several samples, with
DAMROS implementing an RR scheduling policy. Only these two application
threads were running in the system at any given time.
Process   Start   End     Lifespan

T1        0.134   1.284   1.150
T2        0.139   1.354   1.215
Table 3.2: No Reflection, Basic RR Scheduler
[Case 2] In this case, the reflective scheduler is initialised and the application
threads T1 and T2 are rerun. This time, however, an FP scheduling
policy is used and thread T1 requests a higher priority using the
reify(HI_PRIORITY) call. The rScheduler module obtains this information
from the rManager, replaces the base-level scheduler with an FP scheduler,
and assigns a higher priority to thread T1.
Process   Start   End     Lifespan

T1        0.136   0.827   0.691
T2        0.827   1.607   0.780
Table 3.3: Reflection with One High Priority Application
On average, it took no more than 30 µs for the rScheduler to effect the
required change once thread T1 had reified the request. From table 3.3, it is
observed that the lifespan of thread T1 is drastically reduced from 1.150 s to
only 0.691 s, and that of thread T2 is also reduced, from 1.215 s to 0.780 s. This
is because thread T2 is the only thread executing in the system once thread T1
has finished executing. However, note that the end time of thread T2 is 1.607 s
as compared to 1.354 s in the previous case (see table 3.2). It is delayed by
0.253 s because thread T1 executed at a higher priority.
[Case 3] This case reverses the above scenario, in that thread T2 requests
a higher priority while thread T1 executes as normal. Initially, the RR scheduler
is used, since both threads have equal priority. Thus, thread T1 starts execution
and, on the next context switch, thread T2 starts executing. It then reifies the
high-priority request. The rScheduler module makes similar changes to those above,
so that thread T2 gets the higher priority this time.
Process   Start   End     Lifespan

T1        0.122   1.590   1.468
T2        0.127   0.912   0.785
Table 3.4: Reflection with One High Priority Application
From table 3.4, it is observed that the lifespan of thread T2 is reduced
from 1.215 s to 0.785 s. The reason thread T2 does not show figures similar
to those of thread T1 in table 3.3 is that T1 had already executed for at least one
time quantum before T2 started executing. Thread T1 spent 0.785 s waiting for
execution, thereby incurring a total delay of 0.306 s as compared to its normal
execution in table 3.2.
[Case 4] In this case, thread T1 runs as normal, whereas thread T2 is modified
to request a higher priority only during the execution of a particular
section of its code (i.e. during a critical section). This is typical behaviour
in priority-based real-time systems that use the priority ceiling protocol to control
access to a shared resource or a critical section [29]. To imitate this behaviour,
the code for thread T2 was modified such that it requests a higher priority
just before it enters the original loop. Then, after executing half way through
the loop, it requests a lower priority (see the pseudo-code in figure 3.18).
The rScheduler module gives T2 a higher priority, and later switches back
to the default RR policy when thread T2 requests a lower priority. Note that the
rScheduler replaces the fixed-priority scheduling policy with the default RR
policy only when all threads have equal priorities (which is true in this case).
Thus, the order of execution of both threads in the system is as follows:
thread T1 starts executing first. Later, when thread T2 is scheduled, it is
executed on a higher priority until it is half way through the loop. At this
point, both threads T1 and T2 execute at the same priority and are scheduled
by the default RR policy until they finish execution. Table 3.5 shows that
thread T2 finishes execution in 1.187s whereas in the normal case it would
have taken 1.215s (as per table 3.2). Thread T2 finishes its execution 0.028s
faster, while thread T1 takes an additional 0.318s to complete. The delay in
the execution time of T1 can be attributed to its waiting time when thread T2
was executing on a high priority.
[Case 5] In order to evaluate scheduling policies for multiple threads belonging
to different applications, two separate applications, A1 and A2, were
developed. Each application spawns three independent child threads. All child
threads perform a similar operation: printing their corresponding thread IDs
void thread_T2(void)
{
int i = 0;
reify(HI_PRIORITY);
while( i < 10000)
{
if(i < 5000 ){
/* execute code in critical section */
...
}
...
/* lower the priority */
if (i == 5000) {
reify(LO_PRIORITY);
}
...
i = i + 1;
}
}
Figure 3.18: Pseudo-code for Thread T2
over 500 iterations of a loop.
Process   Start   End     Lifespan

T1        0.122   1.590   1.468
T2        0.127   1.314   1.187

Table 3.5: Reflection with One High Priority and Other Varying Priority Application
Assume that the child threads C11, C12 and C13, belonging to the parent
thread A1, are to be scheduled using the default RR scheduling policy, and that the
child threads C21, C22 and C23, belonging to the parent thread A2, are to be
scheduled using an FCFS scheduling policy. Also assume that threads C11 to
C13 entered the system before threads C21 to C23.

Under normal circumstances, all threads would be scheduled using the
default RR policy. In that case, the scheduling order of the threads would be:
A1, A2, C11, C12, C13, C21, C22 and C23.
Using the framework, the application thread A2 requests an FCFS
scheduling policy for its child threads using the
reify(CHILD_FCFS) call. The rScheduler module identifies this application-specific
scheduling requirement and installs an FCFS scheduler for threads
C21 to C23. The CPU bandwidth is divided equally between the two applications'
child-thread sets. Thus, each child-thread set gets 15 ms (5 ms × 3) of CPU
time after A1 and A2 are scheduled by the RR scheduler.
After execution, the resulting scheduling order was observed to be: A1,
A2, C11, C12, C13, C21 (for 15 ms); then again A1, A2, C11, C12, C13 and C21,
and so on until C21 finishes execution. Thread C22 then starts executing in a similar
manner, and so on until all threads finish execution.
Summary
In summary, the above test cases showed the degree of flexibility offered
by the framework in DAMROS. It is evident that applications are able
to adapt or bring about changes in the CPU scheduling policy. Compared
to the overhead (a few microseconds) of reifying information in the system,
the gain in application performance is significant. As observed, even a slight
change in priority can make a big difference to the execution times of an
application. More detailed experiments, involving the use of different schedulers
in the VRHS model, are described in the next subsection.
Detailed Experiments
It is difficult to simulate a dynamically changing reflective hierarchical scheduling
model such as VRHS. It is also not possible to perform the evaluation
on the basis of a discrete-event simulation model, a deterministic model or a
queueing model [90, 109]. The VRHS model is composed of various traditional
scheduling policies (e.g. FCFS, FP, EDF, etc.) along with application-specific
UD policies (if used). All such policies co-exist in a single system
and are used to schedule different groups of application threads. Note that
the evaluation of VRHS does not provide any performance metrics for the
schedulers themselves. The evaluation is based on the following criteria:
• Flexibility offered,

• Scalability of the system,

• Performance of the system,

• Overheads incurred.
The following are the definitions of the terms used in the experiments:
• Execution time: the time a thread spends in the system executing on
the CPU; denoted as ε(t), where t is the executing thread.

• Wait time: the time spent by a thread waiting from the moment it entered
the system until it first executes on the CPU; denoted as ω(t), where t is
the thread. In the literature, this is also known as the Response time.

• Turn around time: the time taken by a thread from the moment it entered
the system to the moment it leaves the system; denoted as TTRnd(t),
where t is the thread.

• Start-time: the time at which a thread first starts its execution; denoted
as St, where t is the thread.

• End-time: the time at which a thread finishes its execution and leaves
the system; denoted as Et, where t is the thread.
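Writing $A_t$ for the time thread $t$ enters the system ($A_t$ is an assumed symbol, introduced here only to relate the definitions above), the quantities satisfy:

```latex
\[
\omega(t) = S_t - A_t, \qquad
T_{TRnd}(t) = E_t - A_t = \omega(t) + (E_t - S_t), \qquad
\epsilon(t) \le E_t - S_t .
\]
```

For example, treating the recorded Start of thread T1 in table 3.2 as its entry time gives $T_{TRnd}(T_1) = 1.284 - 0.134 = 1.150$ s, its Lifespan in that table.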
The following subsections present the results of three different experiments
conducted to test the VRHS model in DAMROS. Experiment #1 uses only
two schedulers in the hierarchy with only one application affecting a change
in the system; experiment #2 uses three different schedulers in the hierarchy
with two applications affecting a change; and finally, experiment #3 emulates
an MPEG decoder application that uses an application-specific UD schedul-
ing policy. Towards the end, a detailed discussion of performance and the
overheads incurred by VRHS is presented.
Experiment #1
Two applications were implemented and executed as threads A1 and A2,
each having equal execution time. The application thread A1 spawned 4 child
threads, T1 to T4, which have the same execution time as the parent thread
A1. Equal execution time is ensured by using common code across all the
threads. The application threads were executed in DAMROS for the following
test cases:
[Test Case #1] : Using the RR policy at the root node, applications A1
(including its child threads) and A2 were executed with no changes made to
the VRHS scheduling model. In total, there were 6 threads executing in the
system: 2 application threads, A1 and A2, and the 4 child threads of A1, T1
to T4. The measured average wait time (ω(t)) and average turn around
time (TTRnd(t)) of each thread were 0.438 ms and 16.188 ms respectively.
This case represents the default behaviour, without using any features of
the VRHS model.
[Test Case #2] : In this test case, the application thread A1 changes the
scheduling policy of its child threads to the FCFS scheduling policy using
the reify(CHILD_FCFS) call. On execution, the avg. ω(t) and avg.
TTRnd(t) for the threads were 4.896 ms and 12.146 ms respectively. Note the
increase in the value of ω(t): only one child thread executes
to completion while its siblings wait in the ready queue, as per the FCFS policy.
[Test Case #3] : In this test case, the scheduling policy for the child threads
T1 to T4 was changed from FCFS to FP, such that they all have equal
priority, but higher than the parent application threads. For this, instead of
using reify(CHILD_FCFS), the application thread A1 uses a reify(CHILD_FP)
call. On execution, the avg. ω(t) and avg. TTRnd(t) were measured to
be 0.438 ms and 14.688 ms respectively. Note that the avg. ω(t) is equal to
that measured for the default case. This is because all the child threads had
equal priority and were executed in an order similar to that of the RR scheduling
policy. This shows that the reflection interface operates non-intrusively, with
little or no overhead. The avg. TTRnd(t) improved in this case as compared to
the default case, because the child threads had a higher priority than their
parent application thread A1.
[Test Case #4] : In the previous test, since all the child threads had equal
priority, the thread that executed first was the first to leave the system.
In this test case, the code was modified such that thread T1 has a higher
priority than the rest of the threads, using a reify(HI_PRIORITY) call
in thread T1. This results in thread T1 being executed to completion
before any of its siblings start execution. In this case, the measured avg. ω(t)
and avg. TTRnd(t) were 1.813 ms and 13.438 ms respectively.
[Test Case #5] : In order to verify the correctness of the VRHS model,
thread T3 was modified to have the same priority as thread T1. Now,
threads T1 and T3 have an equal but higher priority than threads T2 and T4.
The measured avg. ω(t) and avg. TTRnd(t) in this case were 3.167 ms and
12.542 ms respectively. There is an increase in ω(t), since threads T2 and T4
wait for threads T1 and T3 to finish execution.
[Test Case #6] : In this test case, the code was modified such that thread
T3 has the highest priority, followed by threads T2 and T4, while
thread T1 has the lowest priority of all. The measured avg. ω(t) and
avg. TTRnd(t) in this case were 2.417 ms and 12.583 ms respectively. Here,
ω(t) was observed to be lower than in the previous test case, because
only the lowest-priority thread T1 had a lengthy wait time.
[Test Case #7] : In this test case, consider that thread T1 has a critical
section that is to be executed at the highest priority. Thread T1
executed at the lowest priority in the previous test case. Using the reflection
interface, it is possible to request a higher priority for thread T1 only while it
executes the critical-section code. This was achieved using the same code
as in the preliminary tests above.

In this test case, the measured avg. ω(t) and avg. TTRnd(t) were
2.396 ms and 10.458 ms respectively. There was a significant speed-up in
terms of TTRnd(t) (10.458 ms). This is because the lowest-priority thread
T1 executed parts of its code (the critical section) at the highest priority,
resulting in all the threads finishing early. This behaviour is also evident from
the lower value of ω(t) (2.396 ms).
Summary
In summary, it is observed that the changes brought in by the application
thread A1 affect only its child threads and not A2; i.e., the VRHS model
produces only local effects and does not spread any overheads to other
threads in the system. Each scheduler in the virtual hierarchy operates in
complete isolation and does not know about the existence of the other schedulers.
Figure 3.19: Results of Experiment #1
The test cases show that applications can make use of the available reflection
interface to satisfy their application-specific requirements. Figure 3.19
shows a bar-graph representation of the avg. ω(t) and avg. TTRnd(t)
values for all the test cases.
Experiment #2
The VRHS model was tested using two separate schedulers for scheduling
different sets of application threads. The test application threads A1 and A2
each spawn 4 child threads and introduce a different scheduling policy for their
set of child threads. The application code was similar to that of experiment #1. In
this experiment, 10 threads were executed: 2 parent application threads, A1
and A2, and the 4 child threads of each application, T1 to T4 and T5 to T8.
[Test Case #1] : In this test case, all application threads were executed
using the default RR scheduling policy. This case is representative of the
default behaviour, with no change brought in by the VRHS model. The measured
avg. ω(t) and avg. TTRnd(t) were 0.688 ms and 25.338 ms respectively.
[Test Case #2] : Threads A1 and A2 were both modified to use FCFS
scheduling policy to schedule the corresponding child threads. On execution,
the measured avg. ω(t) and the avg. TTRnd(t) were 8.6 ms and 17.863 ms
respectively.
[Test Case #3] : The application thread A2 was modified to use an FP
scheduling policy while thread A1 used the FCFS policy. On execution,
the measured avg. ω(t) and the avg. TTRnd(t) were 5.15 ms and 19.75 ms
respectively.
[Test Case #4] : In this test case, the scheduling policies were swapped.
i.e., FCFS policy was used for the child threads of A2 and an FP policy
for those of thread A1. On execution, the measured avg. ω(t) and the avg.
TTRnd(t) were 5.1 ms and 19.725 ms respectively.
[Test Case #5] : In this test case, using the same schedulers as above,
thread T2 is made to have the highest priority and thread T4 the lowest among
other threads. On execution, the measured avg. ω(t) and the avg. TTRnd(t)
were 5.588 ms and 17.925 ms respectively.
[Test Case #6] : This test case used an FP policy to schedule the child
threads of both applications. On execution, the measured avg. ω(t) and the
avg. TTRnd(t) were 0.75 ms and 22.7 ms respectively.
[Test Case #7] : In this test case, thread T2 is assigned the highest priority
and thread T3 the lowest. Similarly, thread T6 is assigned the highest priority
and thread T8 the lowest. On execution, the measured avg. ω(t) and the avg.
TTRnd(t) were 1.936 ms and 18.288 ms respectively.
Figure 3.20: Results of Experiment #2
Summary
Figure 3.20 shows the bar graph representation of the avg. ω(t) and the avg.
TTRnd(t) for all the above test cases. It is evident from the better avg. TTRnd(t),
that an application can improve its performance if it gets application-specific
resource (CPU in this case) management support from the RTOS. Clearly,
the reflection framework provides this application-specific support. The next
subsection describes an experiment involving a multi-threaded MPEG decoder
application whose performance is shown to improve with the use of a UD
scheduler.
Experiment #3 (Application-specific)
An application emulating the behaviour of a multi-threaded MPEG [51]
decoder was implemented. The parent application thread buffers the incoming
MPEG video stream and activates one of the decoder threads, which decodes
an MPEG frame of a particular type (i.e. either an I, a P or a B frame [51]).
The implementation made the following assumptions:
• a constant bandwidth for the in-coming MPEG video stream either from
a network resource or a local storage disk,
• a constant decoding time per frame for each of the decoder threads,
• an MPEG video stream consisting of the following frame pattern (3
different scenes):
(IPBBPBB) (IPBBPBBPBB) (IPBBPBBPBB)
The test video stream has 27 frames constituting 3 different scenes.
The application was tested for concurrent decoding of these 3 scenes. The
parent application thread invoked 27 decoder threads, one to decode each frame
in the stream. However, due to the inherent frame dependencies, it was not
possible to decode each frame independently; i.e. not all the decoder threads
could be ready for execution at any given time. Furthermore, threads that
decoded the frames belonging to a particular scene would execute in a known
execution order, i.e. decoding one frame at a time. With 3 different scenes in
[Plot: time elapsed (in ms, 0–500) against the random arrival of threads (0–35), showing the Arrival Time and End Time of each thread.]
Figure 3.21: Using RR Scheduler
the video stream, there could only be a maximum of 3 active decoder threads
decoding a frame independently.
To determine the effects of using a UD policy on other applications executing
in the system, another application, App, consisting of 4 child threads, was also
executed. There were thus two different applications, each consisting of
multiple threads, executing in the system.

During normal execution of both applications using the default
RR scheduling policy, it was observed that the MPEG decoder application
showed poor performance. Figure 3.21 plots the start-time and end-time of
all application threads, including application App's threads. Threads 3 to 7
in the figure belong to application App, thread 32 is the parent application
thread of the MPEG decoder, and the rest are the decoder threads.
[Plot: time elapsed (in ms, 0–500) against the random arrival of threads (0–35), showing the Arrival Time and End Time of each thread.]
Figure 3.22: Using UD Scheduler
Next, the MPEG application was modified to introduce an application-specific
UD scheduler into the VRHS model using the reify(CHILD_SCHED,
&UD_scheduler) call. The UD scheduler made use of information about
the MPEG data arrival times and kept track of the scene each frame belonged
to. This information allowed the UD scheduler to schedule the corresponding
threads using a priority-based scheduling policy.

Both applications were executed again, with the MPEG application using
its application-specific UD scheduler. The results (see figure 3.22) show a
considerable improvement in the performance of the MPEG application: the
decoder threads completed execution much earlier than in the previous case.
Summary
In summary, this experiment showed that applications whose requirements are
[Plot: time elapsed (in ms, 0–500) against the random arrival of threads (0–35), comparing the Default and Application Specific schedulers.]
Figure 3.23: RR Vs UD Scheduler
not satisfied by the existing policies in an RTOS can make use of the framework
to introduce an application-specific UD policy. Comparing the end-times of the
threads scheduled using the default RR policy against those scheduled using the
application-specific UD policy, it can be seen that the decoder threads were
scheduled at the right time by the application-specific UD scheduler (see
figure 3.23). Further, the MPEG decoder application showed much better
performance using its own UD scheduler.
Note that the start-times of application App's threads show an average
delay of 4 ms. This is caused by the overhead added by
the reflection interface in DAMROS for handling the reification process. However,
by using another scheduler (e.g. FP, or an application-specific policy) for
application App, it is possible to improve its performance. The following
subsection discusses the performance and related overheads of VRHS compared
to other hierarchical scheduling schemes.
Performance and Overheads of VRHS
Unlike the traditional hierarchical approaches where the performance of the
system deteriorates with the addition of extra schedulers in the hierarchy,
the performance of VRHS is not affected by the presence of extra schedulers.
This is because, the system scheduler does not context switch between several
schedulers in the hierarchy. It only context switches to an UD scheduler, else a
direct procedure call interface is used avoiding much of the context switching
overhead. For experiment #2, which used two schedulers, if traditional ap-
proaches (e.g. MaRTE OS API [103], SFQ [55] method, etc.) were to be used,
then they would have incurred context switching overhead to switch between
the schedulers in the hierarchy.
Generally, in the traditional approaches, the number of context switches
increases with each additional level of the scheduling hierarchy. An exception
to this is the HLS [100] implementation which, like VRHS, does not context
switch between schedulers. However, HLS adds significant overhead to the
context switch time itself (an 11.7 µs context switch time, as compared
to 7.10 µs in the Windows 2000 kernel without HLS, on a 500 MHz Pentium III
machine [99]). Such an overhead affects all application threads being scheduled
in the system. Furthermore, the HLS model incurs an overhead of 0.96 µs for
each additional level in the scheduling hierarchy [99].
The overheads incurred by most hierarchical scheduling models are due to
the context switches between the various schedulers in the hierarchy: the lower
the number of schedulers in the hierarchy, the lower the context-switching
overhead. To get an idea of the scale of this overhead, consider Linux [19] and
NetBSD [124], for instance. The context switch time of Linux on a 500 MHz
G3 processor was 89 µs [124], and that of the SA model developed to run on
a NetBSD system on similar hardware was found to be 225 µs [124]. If this
time is multiplied by the number of schedulers in the hierarchy, the overhead
can have a massive impact on overall system performance.
In VRHS, only the required scheduler remains active at any given time
eliminating the need for several context switches. However, there is a one-
time overhead of the rScheduler thread that executes before an application
scheduler. The maximum observed execution time of the rScheduler thread
is 10 µs. Hence, the VRHS model has a one-time overhead of nearly 10 µs
irrespective of the number of schedulers in the hierarchy. The only real context
switch that occurs is initiated by the application scheduler to switch to the
next ready thread (when using built-in schedulers).
Also, the framework incurs an additional overhead in terms of memory.
Memory is required to store and retrieve reified information. For all of the
above test cases, it was found that the maximum memory used for storage of
reified information was around 200 bytes at any given time. This is negligible
compared to the flexibility offered.
Given the amount of flexibility provided by the VRHS model and the ease
of using the reflection framework, the approach can be considered better
than the traditional ones. The next section describes the experiments
performed to evaluate the reflective memory management system (RMMS).
3.4.4 Evaluation of RMMS
In order to simulate a memory-constrained system in a controlled
experiment, the total number of free memory pages available in the system was
reduced to 64; i.e., the available free memory in DAMROS was now only
262 KB. Although the experiments use a small amount of memory, the
approach is nevertheless scalable to real systems. Reducing the memory
size in this way makes the experiments easier to perform and analyse.
Also, most OSs maintain a certain number of memory pages that are
always kept free. The total memory of 262 KB available to the applications
in the following experiments excludes these maintained free pages; DAMROS
can allocate all 64 memory pages to the applications. Similar to the evaluation
of the VRHS model, the evaluation of RMMS is divided into two parts:
preliminary tests and detailed experiments.
Preliminary Tests
Test applications A1 and A2 were developed. When executed, the
application thread A1 first allocates all the available free memory to itself. Later,
it reads the allocated memory (one byte at a time) in a loop. This action
emulates periodic sequential page access in the system. The application
thread A2 behaves in a similar manner, but allocates only 10 memory
pages to itself. Both threads require 1 memory page each for their code and
data segments.

Thread A1 is initiated before thread A2. It occupies all the available free
memory pages, such that when thread A2 executes there is no free memory.
In order to allocate 10 memory pages to thread A2, the RMMS module must
select 10 victim pages to be swapped out. In total, there were 12 page swap
Figure 3.24: Static Vs Reflective LRU (page-fault comparison; no. of page faults against trial runs)
operations: 2 page swaps to initialise thread A2’s code and data segments
(each using one page) and 10 page swaps to allocate A2’s memory pages. At
this point, it is important that the RMMS module choose the right pages to
be swapped out.
In this scenario, without using reflection, each of the traditional LRU and MRU page replacement policies was tested. This provides a baseline to compare against the use of reflection later. The number of page faults generated over the total lifespan of both applications was recorded in each of the following cases:
[Test Case #1] : This test case used the traditional LRU policy without
reflection. On an average there were 109 page-faults generated in the system
(see upper dark lines in figure 3.24).
Next, the reflective interface was used in the RMMS to optimise memory
utilisation and to reduce the number of page faults. The rVMM module was
initialised and the base-level page-tables were reified to rVMM. Also, application thread A1 was modified to reify information suggesting that it would use the first 10 pages allocated to it, using the reify(MEM_READ, &memory_data, (10 * PAGE_SIZE)) call. Thread A2 also reified similar information. The RMMS
base-level used the LRU paging policy while the rVMM manipulated the page
flags to reflect the reified information. On an average, 95 page-faults were
observed in this case (see lower dark lines in the figure 3.24).
[Test Case #2] : This test case executed the same application threads under
normal conditions (without using reflection) but using an MRU policy instead
of LRU. On an average, 2404 page-faults were generated over the 1000 test
runs of the applications.
Using the reflection framework in the same way as above, with the RMMS base-level using an MRU paging policy, only 145 page-faults (avg.) were generated.
The number of page-faults generated in the system was significantly reduced by using the framework. When applications reify memory usage information, the rVMM forms a causal connection with the page-table using the linkData() call. It then marks the page flags such that the pages that will be used by the applications remain in memory (i.e. they are not swapped out by the base-level).
The worst-case time to handle a page-fault was observed to be nearly 4 µs. Thus, using reflection with the LRU policy, a reduction of 14 page-faults saves 56 µs of valuable CPU time. Similarly, for the MRU policy a massive saving of 9.04 ms of CPU time is achieved. The reify() interface accounts for an overhead of nearly 1 µs, with an additional 3 µs overhead added by the rVMM component to bring about the change. Even after subtracting this 4 µs overhead from the above figures, the total savings of 52 µs for LRU and 9.036 ms for MRU remain significant. The next subsection describes a more detailed experiment using several different test cases.
a more detailed experiment using several different test cases.
Detailed Experiments
For more accurate and controlled measurements, the number of free pages
available to the applications was further reduced to 32, i.e. only 131 KB of memory was made available to the applications. Note that, in this case, the
free memory excludes the memory pages used by the application code/data
segments. To maintain uniformity in the values, a fixed test application set
was used across all the experiments.
Test Application Set
The test application set consists of applications A1 and A2. Both applications require 20 memory pages each. Thread A1 randomly accesses its memory
pages in an infinite loop whereas thread A2 (also in an infinite loop) sequen-
tially accesses its pages. The total pages available in the system is only 32, but
the total pages required by both applications is 40 pages. The execution of
both applications would generate an increasingly large number of page-faults in
the system. Once the application threads start executing, the number of page-faults generated in the system is recorded after every 1,000 context switches. A
total of 100 such readings are recorded before killing the application threads.
The following experiments use different paging policies along with reflection.
Experiment #1
This experiment used a global page replacement strategy with no knowledge of
the individual application’s memory usage. A victim page was selected on the
Figure 3.25: Experiment #1: Page-faults (Case #1; LRU, MRU and LFU; no. of page faults against context switches)
basis of its global usage statistics collected over a period of time. Figure 3.25
shows the corresponding page-fault graph after executing the test application
set with LRU, MRU and LFU page replacement policies respectively.
The LRU paging policy showed relatively poor performance. This is because LRU is a recency-based policy and does not track the frequency of usage. Application thread A2 has a loop-based sequential access pattern in which, after accessing the 20th page, it starts again from the 1st. Towards
the end of A2’s access loop, the LRU policy would have marked the 1st page
as the least recently used page. If a page-fault occurred while the thread was
accessing the last page in the sequence, then the LRU policy would reclaim
the 1st page. This would further result in a series of page-faults as the thread
proceeds to re-iterate its memory access through the loop.
The total number of page-faults generated when using LRU, MRU and
LFU policies were observed to be 83,750, 69,199 and 66,669 respectively. If
application thread A2 was allowed to specify its memory usage pattern, then
by reification it may be possible to reduce the number of page-faults.
Experiment #2
In this experiment, the page replacement strategy was modified such that a
selected victim page did not belong to the thread that caused the page-fault.
For instance, in the test application set, if thread A1 caused a page fault, then
the selected victim page would belong to thread A2. In this case, the number
of page-faults generated corresponding to each paging policy (i.e. LRU, MRU
and LFU) was observed to be 133,330, 133,334 and 133,327 respectively. All three paging policies performed poorly compared to the previous experiment. Clearly, the global page replacement strategy generated far fewer page-faults than the one used in this case.
Experiment #3
In this experiment, the page replacement strategy was modified to select a
victim page which belonged to the application thread that caused the page-
fault. The total page-faults thus generated for this strategy, corresponding to each paging policy (i.e. LRU, MRU and LFU), were 66,666, 66,665 and 66,636 respectively. This strategy generated fewer page-faults than both of the above experiments. Perhaps if an application-specific
UD paging policy is used, there could be a further reduction in the number of
page-faults.
Figure 3.26: Page-faults for RMMS (Reflective LRU, Reflective MRU, Reflective LFU and App. Specific; no. of page faults against context switches)
Experiment #4 (UD paging policy)
The RMMS module allows applications to introduce an application-specific
UD paging policy into the system. To show the significance of reflection in the RMMS model, a UD policy was used for thread A2. Also, thread A2 was modified to reify its memory access pattern at runtime using the reify(MEM_READ, &variable, PAGE_SIZE) call before accessing a page.
The UD policy keeps track of the recent memory usage pattern of thread
A2. Furthermore, it is customised to support the thread’s sequential loop-
based access. Thus, on a page-fault, the UD policy selects a page (from A2’s
page-table) that would not be used immediately in the future. For instance, if
a page-fault occurs when A2 is accessing the 20th page, the UD policy selects
the 19th page instead of the 1st page.
Along with the use of a UD policy, more tests were performed using
the rVMM module, which applied the reified information with the existing paging policies (LRU, MRU and LFU) for A2 alone. Application thread A1, with its random memory access, used the LFU page replacement policy. The graph in
figure 3.26 shows the page-fault graph for each paging policy using the reflection framework along with the UD policy. The LFU policy was used globally in
the RMMS base-level which handled the page-faults generated by thread A1.
Amongst the four different test cases, the first uses an LRU policy for A2, the second an MRU, the third an LFU and the fourth a UD policy. The total page-faults generated in each case were 65,266, 56,766, 56,066 and 45,766 respectively.
As compared to the previous experiments, there is a significant reduction
in the number of page-faults generated. The UD policy showed the best results amongst all the reflective policies. The use of reflection with the traditional policies also showed a significant reduction in page-faults.
In the above experiments, the RMMS module incurred a memory over-
head of nearly 2 KB. This memory was used to store information reified by
the applications and system components. However, this overhead is negligible
in comparison to the amount of flexibility offered and the significant reduction
in page-faults. Comparing the number of page-faults generated using the tra-
ditional paging policies (LRU, MRU and LFU) against the application-specific
UD policy, the UD policy shows 31%–65% reduction in the number of page-
faults. Thus, by allowing applications to introduce custom paging policies into the system, the RMMS provides the required application-specific support.
3.5 Summary
In summary, this chapter presented the generic reflective framework for an
RTOS. The traditional reification process in reflection was modified such that
information reified is stored in the RTOS kernel and later passed to the meta-
level component on explicit request. The design and implementation of DAM-
ROS, a reflective RTOS, implementing the reflective framework was also de-
scribed.
The implementation of two reflective resource management modules in DAMROS was described: a reflective CPU scheduler (VRHS model) and a reflective memory management system (RMMS). Several experiments to evaluate
each reflective resource management module were presented. The experimen-
tal results showed improvement in application performance with minimal or
negligible overheads in terms of time and memory. Both VRHS and RMMS
modules were shown to be flexible enough to accommodate application-specific
resource requirements pertaining to the CPU and memory.
It is evident that reification in the reflection framework plays an important
role in adapting the system policies according to the application requirements.
The next chapter presents a case study for virtual memory to investigate dif-
ferent methods of adding reification calls into application source code. Instead
of relying on the applications to explicitly reify information, the chapter in-
troduces a method of automatically inserting reification calls into application
source code, which inform the RTOS of an application's memory usage patterns at runtime.
Chapter 4
Support for Reification: a Case Study
In order to satisfy their resource requirements, applications need to reify the
resource usage information to the RTOS. This helps the RTOS to adapt its
resource management policies accordingly and satisfy the application require-
ments. The generic reflective framework presented in the previous chapter
requires applications to explicitly reify information. There are various meth-
ods to support reification. One such method is the insertion of reification calls
into application source code at compile-time. Other methods could involve
the analysis of source code to identify resource usage patterns and insertion of
reification calls at certain points either manually or automatically.
This chapter uses virtual memory management (paging) as a case study
to show the significance of reification and the methods to support it. Paging
allows applications with memory requirements greater than the physically available memory to run. The case study will make use of reification
and accordingly adapt the OS’s paging module.
This case study considers applications with greater memory requirements
and that exhibit a loop-based memory access. A mechanism that exploits such
memory access to automatically add reification calls and to dynamically adapt
the paging policy is described.
The chapter is organised as follows. A brief introduction to the paging
model used in this chapter is presented in section 4.1. Two simple reification
calls to specify memory usage are discussed in section 4.2. This is followed by
the description of three methods of inserting reification calls: manual method
(section 4.4) – used by the application programmer to manually insert calls;
automatic method (section 4.5) – to automatically identify data locality and
insert appropriate calls; hybrid method (section 4.6) – a mixture of both manual and automatic methods. The design of an OS paging mechanism called CASP is presented in section 4.7, which uses the reified information to optimise the virtual memory subsystem. Further, to simulate the behaviour
of CASP along with reification within the applications, the implementation of
an on-the-fly virtual memory simulator (PROTON) is described in section 4.9.
Finally, simulation results involving benchmark applications with the CASP
mechanism are presented in section 4.11.
4.1 Paging Model
This section describes the paging model used in this chapter. The model is
based on the following assumptions:
• there exists hardware support (e.g. MMU) to trap page-faults and trans-
fer control to the OS’s page-fault handler,
• there exists a fixed amount of physical memory, M, which can be divided into exactly n unique equal-sized pages,
• the system does not support multiple sized pages,
• the OS implements a demand paging system [19],
• memory is allocated in multiples of a page (i.e. one or more) and is virtually contiguous, while physically this may or may not be the case,
• there exists a secondary auxiliary storage device (e.g. a hard disk) that
acts as a swap space [19].
Definition 4.1a A page-fault in a demand paged system occurs when the
page requested by a process is not present in physical memory. Each applica-
tion process uses a data structure called a page table that maps the process’s
virtual pages to the physical ones. The hardware checks this page table and
maps the memory requests accordingly. In cases where there is no entry in the
page table to map the requested virtual page, the hardware reports a page-fault
to the OS.
In a demand paged system, the actual memory pages are allocated to
application processes only when accessed for the first time. Figure 4.1 is a
diagrammatic representation of the paging model. The hardware traps page-
faults to the OS page-fault handler routine. This page-fault handler routine
analyses the information provided by hardware and transfers control to the
page replacement code if needed. The page replacement code is responsible
for all the paging activity in the system such as reclaiming unused pages from
memory, bringing back evicted pages from the swap-space, etc. Depending on the type of page-fault, the page-fault handler does one of the following:
• a page from memory is moved to swap-space; this is called a page-out or swap-out operation,
Figure 4.1: OS Paging Model
• a page from swap-space is moved back into memory; this is called a page-in or swap-in operation,
• in case of demand paging, a new page is allocated to the process accessing
a virtual page for the first time.
Definition 4.1b A page-fault is said to be a minor page-fault when the
page requested has not been allocated and there exists a free page in memory
that can be allocated to the requesting process. This is true for demand ZERO
pages [90]. If the cost to allocate a page in memory is Calloc, then the cost to
handle a minor page-fault,
Cminor ≈ Calloc
Definition 4.1c A page-fault is said to be a major page-fault in two
different scenarios: one is when no free page exists in memory for allocation,
in which case an existing memory page needs to be paged-out to make space
for allocation; and the other is when a previously allocated page could not
be found because it was previously paged-out by the page replacement code,
in which case it needs to be paged-in. Sometimes a page-in operation might
cost an additional page-out operation if there is no free page in memory to
page-in. Thus, a major page-fault can either cause only one page operation
(page-in) or it might cause an additional operation (page-out). Assuming the
cost of page-in and page-out is nearly equal, say Cpage, then the cost to handle
a major page-fault is given as,
Cmajor ≈ (Cpage + Calloc) : 1 page op.
Cmajor ≈ (2 × Cpage + Calloc) : 2 page ops.
Disk read and write operations have always been expensive compared to memory reads and writes. Hence, the cost to allocate a new page in memory, i.e. Calloc, is much lower than the cost to page-in/page-out, i.e. Cpage.
Thus, it can be inferred that:
Cmajor ≫ Cminor (4.1)
Algorithm 1 describes the operation of a page-fault handler routine. On
a page-fault a page-fault handler checks if a page has been allocated. If not,
it allocates a new page. Also, it checks if a page has been paged-out by the
page replacement code. In this case, it allocates a new page in memory and pages-in the old page from the swap space. Note that, in both cases, the
newly allocated page needs to be added to the internal page-list(s) of the OS.
The OS maintains one or more page-lists to keep track of all the pages in
memory. Finally, the page-fault handler maps the page into the page table of
the requesting process.
Procedure Page-fault Handler (fault_address)
Begin
    if page_allocated(fault_address) is FALSE then
        newpage := allocate_page()
        add_page_to_list(newpage)
    else if paged_out(fault_address) is TRUE then
        newpage := allocate_page()
        page_in(newpage, fault_address)
        add_page_to_list(newpage)
    set_mapping(fault_address, newpage)
End Procedure

Algorithm 1: Page-fault Handler Routine
Let ψ = {P1, P2, ..., Pn} be the set of all ‘n’ (allocated + free) pages in memory. Let φ = {P1, P2, ..., Pf}, φ ⊆ ψ, be the set of ‘f’ free pages and ω = {P1, P2, ..., Pm}, ω ⊆ ψ, be the set of ‘m’ allocated pages. Now, ψ = φ ∪ ω.
The function allocate_page() in algorithm 1 may cause a page-out operation if no free pages are available for allocation, i.e. a page Pi ∈ φ is selected if f ≠ 0; otherwise it requests the page replacement code to page-out a page Pi ∈ ω.
Depending on the replacement policy used, one or more pages may be
identified as a candidate for eviction by the page replacement code. In case of
LRU, each page has an associated reference field ‘Ref ’ which is marked or in-
cremented whenever a process accesses that page. The page replacement code
for LRU selects a page Pi ∀ Pi ∈ ω which was least recently used (determined
by the value for Ref and position of the page in the LRU stack) [109].
The time spent by an OS in paging during the execution of a process affects
the process’s turn around time (Tτ ). Tτ for a process can thus be divided into
user-time and system-time. User-time (Uτ ) is the time for which a process
actually executes its code and system-time (Sτ ) is the time for which the OS
executes code either on behalf of the process (e.g. system calls) or is involved
in paging. Thus, Tτ = Uτ + Sτ.
The paging activity of an OS depends on the page replacement policy being
used and the system load at the time of process execution. Other system tasks, such as the execution of system calls, can be thought of as taking constant time compared to paging. Thus, Sτ = Pτ + Oτ, where Pτ is the time taken by an OS in paging activity and Oτ is considered to be the constant time taken for
other OS activities. In an OS with global page replacement policy, page-faults
caused by one process in the system affect the turn around time of another.
Thus, the time taken for the paging activity can be summarised as:
Pτ ∝ (ρ · Cminor + η · Cmajor)
where ρ = no. of minor page-faults and
η = no. of major page-faults.
The turn around time, Tτ of a process can thus be summarised as:
Tτ ∝ {Uτ + (ρ · Cminor + η · Cmajor) +Oτ} (4.2)
From eqn. (4.1) and (4.2), it is clear that an increase in ‘η’ affects Tτ more than an equal increase in ‘ρ’; i.e. Tτ can be improved if the number of major page-faults, ‘η’, is reduced. Clearly, an efficient mechanism that reduces the number of page-faults (mainly ‘η’) can substantially reduce the system time, thereby improving the application's execution time.
4.2 Reification Calls for Paging
This section describes reification calls that will be inserted into application
source code. The reification calls for paging should identify the memory ac-
cesses in the source and reify this information to the reflection framework. The
intention here is to ensure that the memory being accessed is always present in physical memory during its access. An OS paging mechanism could use such reified
information to lock and release memory pages allocated to the corresponding
virtual memory addresses being reified. A detailed description of such an OS
mechanism called CASP [97] is given in section 4.10 later.
For simplicity, two simple names have been chosen to represent the reifica-
tion calls: keep() – suggests that a memory region will be accessed in the near
future and discard() – suggests that a memory region will not be accessed in
the near future. Essentially, both calls are wrappers around the original reify()
call. It is not necessary to define separate reification calls for each and every
type of information that needs to be reified. In this case, keep() and discard()
help in better understanding. Also, wrapper functions provide a better level
of abstraction improving code readability without adding any code penalties.
The following subsections explain the two calls in more detail.
4.2.1 keep(<address>, <size>)
This call captures memory access information of the application under consid-
eration. It indicates to the meta-level of the paging module that a particular
virtual memory region will soon be accessed by the application. keep() can be
defined as a C constant as follows:
#define keep(address, size) reify(KEEP_ALIVE, address, size)
This call returns a unique identifier that is associated with the memory region. The identifier can later be used by the discard() reification call to refer to the previously reified memory region.
4.2.2 discard(<id>)
The discard() reification call indicates that a virtual memory region will not be accessed in the near future, i.e. the paging module may move it to the swap space. Note that the region can still be accessed at any time, just not immediately, so the paging module cannot completely get rid of it.
Similar to keep(), discard() is defined as follows:
#define discard(id) reify(ALLOW_DEATH, id)
Note that discard() does not specify a memory address or size. It uses the unique identifier returned by one of the keep() calls to refer to the corresponding memory region. This requires the application programmer to always precede a discard() call with a corresponding keep() call. The next section explains how these two reification calls can be effectively inserted into the application source code to benefit from the reflection framework.
4.3 Inserting Reification Calls
Insertion of reification calls into the application source code will provide valu-
able runtime information to the RTOS about an application’s memory access
patterns. Three methods of insertion are described: manual method of insertion – calls are inserted by the application programmer; automatic
method of insertion – calls are automatically inserted into the application
source code using a software tool (requiring no intervention from the appli-
cation programmer); hybrid method of insertion – calls are inserted using a
mixture of manual and automatic methods. The methods use large memory-region accesses within loops to identify data locality and insert reification calls around them.
The following sections describe the three methods using a sample appli-
cation – ‘scan’. This is a micro-benchmark application that allocates itself
100 MB of virtual memory and loops 5 times – in each loop iteration reading
all allocated memory (a byte at a time) in a sequential order.
The C source code representation of ‘scan’ is shown in figure 4.2. Notice that the inner loop sequentially accesses a large amount of memory (i.e. reads
the array memptr of size 100 MB). Assuming that the available physical mem-
ory is only 64 MB, ‘scan’ stresses the virtual memory subsystem generating
a worst-case scenario for traditional page replacement policies. Such applica-
tions that use more memory than is physically available are generally termed
as out-of-core applications [26].
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char temp;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops) {
        for (index = 0; index < size; index++) {
            temp = memptr[index];
        }
        loops--;
    }
}

Figure 4.2: Benchmark Application – ‘scan’
4.4 Manual Insertion Method
The application programmer who has sufficient knowledge of the application’s
data size and its usage in the source code is able to accurately insert reification
calls in the source at the time of programming. Manual insertion of calls into
the application source can be time consuming and error prone if the person
inserting them is not the application developer. It is best to add reification
calls during application development. Otherwise the application programmer
needs to:
1. know the application source language,
2. understand the application behaviour,
3. determine data access points or in other words the data hot-spots.
Although the above criteria may seem daunting to the application programmer, the end result can nevertheless be very satisfying, particularly in the case of out-of-core applications executed on resource-limited portable embedded systems, where efficient utilisation of resources has a significant impact on the overall performance of the system.
For the sample application – ‘scan’, the programmer would make limited
modifications by splitting the inner loop into several loops (4 in this case) and
adding reification calls around them (see code in figure 4.3). The programmer
can test the performance of the modified source and accordingly vary the
location of the reification calls or the number of split loops to achieve better
performance.
The call to keep(memptr + index, size/4) suggests that the memory region
of size ‘size/4 ’ starting at the virtual address ‘memptr + index’ will be accessed
immediately in future.
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char temp;
    char *memptr;
    int id;

    memptr = (char *) malloc(size);
    while (loops) {
        id = keep(memptr, size/4);
        for (index = 0; index < (size/4); index++) {
            temp = memptr[index];
        }
        discard(id);

        id = keep(memptr + index, size/4);
        for (; index < (size/2); index++) {
            temp = memptr[index];
        }
        discard(id);

        id = keep(memptr + index, size/4);
        for (; index < (3*(size/4)); index++) {
            temp = memptr[index];
        }
        discard(id);

        id = keep(memptr + index, size/4);
        for (; index < size; index++) {
            temp = memptr[index];
        }
        discard(id);

        loops--;
    }
}

Figure 4.3: Manual Insertion for ‘scan’
4.5 Automatic Insertion Method
In order to assist the RTOS to adapt its policy from a memory management
point of view, the RTOS needs to know about the application’s memory re-
quirements and its access patterns during execution. Reification calls need to
be added into the application at points where the application process accesses
or allocates memory to itself.
Previous work in this area considered several compiler based techniques for
inserting custom memory management hints [26, 82]. The compiler directed
memory management [82] analyses application code at compile time for loops
consisting of accesses to data arrays, inserting primitives such as LOCK, UN-
LOCK and ALLOCATE into the code to control the allocation of memory for
the corresponding arrays at run-time.
Brown et al. [26] proposed a similar approach using compiler-inserted pre-fetch and release hints to manage physical memory. Brown’s approach
used a run-time software layer to queue hints in application space and then
send them across to the underlying OS. These techniques assume that the
underlying OS supports allocation of memory on demand and also provides
an efficient lock/release mechanism. The scope of the reification process is much wider and is not restricted to virtual memory management; this chapter uses virtual memory as a case study to describe the usage of, and support for, reification in the reflective framework.
The process of automatically detecting regions having large memory ac-
cesses in the application source code can be particularly hard and restrictive
without much information (i.e. the control flow graph or a pre-execution trace
of the application) about the application. The method described in this sec-
tion uses only the application source code and no other information to insert
reification calls.
The automatic insertion method exploits loop-based sequential memory accesses. It parses the application source, detects loops with large data accesses, splits each such loop into multiple smaller loops while maintaining the original application behaviour, and then inserts reification calls around these loops to provide information to the RTOS. The process is similar to manual insertion but is done automatically. The next subsection explains the automatic method for applications written in the C language.
4.5.1 Automatic Insertion for C Language
The C language, which is widely used for embedded application development, was chosen for analysis. Note that in C an algorithm can be expressed in many different ways (e.g. the use of pointers instead of arrays, or the use of ‘for’ loops instead of ‘while’ loops). To counter this, the CIL (C Intermediate Language) tool set [53], which can transform such C source code into a uniform C source representation, has been used. For instance, CIL transforms all loop constructs (for, do-while, etc.) into while loops, and all data accesses and declarations are represented uniformly so that there is no difference between a pointer reference and an array reference (i.e. ‘a[i]’ is transformed into ‘(*(a + i))’).
A tool – ‘cloop’ – has been developed to parse the CIL-transformed C source code, detect loops with large amounts of memory accesses and insert the reification calls if required [97].
Figure 4.4 shows the process involved in automatic insertion of reification
Figure 4.4: Steps Involved in Automatic Insertion
calls. With respect to paging, reification calls will be inserted to specify the
application’s memory access patterns to the RTOS at runtime. The tool –
cloop is specialised to detect loops with large amounts of data accesses and
insert reification calls.
In the sample application ‘scan’, reification calls are inserted to suggest
immediate and non-immediate accesses to certain memory regions accessed
within loops. Note that an OS mechanism (in a system with only 64 MB)
would not be able to lock the memory pages if the information reified by ‘scan’
suggests that a memory region of 100 MB will be accessed. In other words,
reification calls need to be inserted more intelligently than by simply specifying
the memory access. This involves taking into account the amount of available
physical memory and reducing the amount of memory locked by an
application at any given time. Since this case study targets loops, it is
logical to split the loops such that each split loop accesses a smaller portion of
the memory region.
The following relation is used to determine the minimum size of memory
region to be locked and/or the number of split loops. A minimum watermark
for the amount of physical memory always to be free is set to ‘y%’ of the total
amount of memory (Mtotal) in the system. If Dsize represents the size of the
memory region being accessed (determined by the data size of the variable as
well as the loop bound) and Mfree represents the total free memory, then the
minimum amount of memory region that the OS mechanism needs to lock, i.e.
Dlock is given by:
Dlock = min( ⌊x% × Dsize⌋, Mfree − y% × Mtotal )    (4.3)
where x is the minimum percentage of the memory region to be accessed and
y is the percentage of total memory that needs to be free.
For the sample application, assume that the minimum watermark y is set
to 10% and that the minimum amount of data to be accessed, x, is set to
25%. Equation 4.3 then becomes:

Dlock = min( ⌊0.25 × Dsize⌋, Mfree − 0.1 × Mtotal )    (4.4)
The mechanism would then lock either 25% of the data size or the available
free memory less a 10% watermark of total memory, whichever is smaller.
The number of split loops is then given by ⌊Dsize/Dlock⌋.
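The arithmetic above can be checked with a small C helper (illustrative only; the function name and whole-MB integer units are assumptions, not part of the thesis code):

```c
#include <assert.h>

/* Sketch of equation 4.3 with all sizes in whole MB; the function name
 * and the integer arithmetic are illustrative assumptions. */
static long compute_dlock(long d_size, long m_free, long m_total,
                          long x_pct, long y_pct)
{
    long by_data = (x_pct * d_size) / 100;            /* floor(x% * Dsize)   */
    long by_mem  = m_free - (y_pct * m_total) / 100;  /* Mfree - y% * Mtotal */
    return by_data < by_mem ? by_data : by_mem;
}

/* With the 'scan' values, Dsize = 100 MB, Mfree = 58 MB, Mtotal = 64 MB,
 * x = 25 and y = 10: compute_dlock(100, 58, 64, 25, 10) yields 25 MB,
 * giving 100/25 = 4 split loops, matching the worked example. */
```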
The cloop tool uses a two-stage process:

• for a loop with a known loop bound, loop splitting techniques [16] are
used to split the loop into separate individual loops. This depends on
the loop bound and the size of the data being accessed. Appropriate
keep() and discard() reification calls are then inserted between the
split loops.

Figure 4.5: Pass-1 of the cloop Tool

Figure 4.6: Pass-2 of the cloop Tool
• for a loop with an unknown loop bound, a separate function (called the
checkpoint function) containing the split loops with the reification calls
is created, and a conditional statement is inserted before the original
loop. The loop bound is checked at run-time, invoking the new function
containing the reification calls if the loop bound and data size are large
enough to benefit from them. This is determined by comparing the free
memory available at that time against the size of the memory being
accessed. In order to benefit from the reification calls, an application
with unknown loop bounds should access data that is either larger than
physical memory or at least larger than the available free memory.
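The unknown-bound transformation described in the second case might look roughly as follows (a hand-written sketch, not actual cloop output; the threshold test, the checkpoint function name and the keep()/discard() stubs are all assumptions):

```c
static int keep_calls;  /* counts keep() calls, for illustration only */

/* Stubs standing in for the CASP reification calls. */
static int  keep(char *addr, long size) { (void)addr; (void)size; return ++keep_calls; }
static void discard(int id)             { (void)id; }

/* Checkpoint function: the split loops with reification calls inserted. */
static void scan_checkpoint(char *memptr, long size, long d_lock)
{
    for (long off = 0; off < size; off += d_lock) {
        int id = keep(memptr + off, d_lock);           /* lock next portion */
        for (long index = off; index < off + d_lock && index < size; index++)
            memptr[index] = 0;                         /* original loop body */
        discard(id);                                   /* release it again  */
    }
}

/* Conditional statement inserted before the original loop: run the
 * checkpoint version only if the region is large enough to benefit
 * (here: larger than the available free memory). */
void scan_inner(char *memptr, long size, long m_free, long d_lock)
{
    if (size > m_free) {
        scan_checkpoint(memptr, size, d_lock);   /* reified version */
    } else {
        for (long index = 0; index < size; index++)
            memptr[index] = 0;                   /* original loop   */
    }
}
```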
Figure 4.5 shows the flowchart of the pass-1 phase of the cloop tool. In this
phase, cloop parses the CIL-transformed source code and builds a meaningful
internal representation to help find loops with large memory accesses. The
flowchart of the pass-2 phase is shown in figure 4.6. In this phase, cloop
identifies the target loops and inserts the reification calls. Using
relation 4.3 above, a loop is either split (in the case of known loop bounds) or
has conditional statements added (in the case of unknown loop bounds).
For the sample application, it is assumed that the system has 64 MB of
physical memory and at the time of execution the available free memory is
58 MB (this is an ideal value when using a freshly booted Linux system). The
code in figure 4.7 shows the CIL transformation of the original ‘scan’ source
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops != 0) {
        index = 0;
        while (index != (size - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        loops = loops - 1;
    }
}
Figure 4.7: CIL Transformation of ‘scan’
code. Note that the inner ‘for’ loop has been converted to a ‘while’ loop so
that all loops in the source are uniform. Also, the term ‘*(memptr + index)’
accesses the index-th element of the data array memptr (i.e. memptr[index]).
This source is parsed to detect loops and determine data access points (Pass
1 of cloop). For ‘scan’, the inner loop is selected as the prime candidate for
splitting. Using equation 4.4, Dlock = min( ⌊0.25 × 100⌋ = 25, 58 − 0.1 ×
64 = 51.6 ), i.e. the minimum size of the data region to be locked is 25 MB
(i.e. ⌊size/4⌋). Thus, the inner loop needs to be split into 4 (⌊100/25⌋) similar
loops. Since the loop bound for the inner loop is known, it is split into 4
smaller loops and the reification calls inserted. The transformed source code
of ‘scan’ is shown in figure 4.8.
Note that the variable ‘index’ is not re-initialised at the start of each split
loop. This allows data access to continue exactly as in the original loop (before
splitting). In each split loop, only the required amount of memory (⌊size/4⌋)
is used, as determined above.
In this particular example, both manual and automatic methods produce
very similar transformations. However, if ‘scan’ dynamically varied the loop
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;
    int id;

    memptr = (char *) malloc(size);
    while (loops != 0) {
        index = 0;
        id = keep(memptr, size/4);
        while (index != (size/4 - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        id = keep(memptr + index, size/4);
        while (index != (size/2 - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        id = keep(memptr + index, size/4);
        while (index != (3*size/4 - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        id = keep(memptr + index, size/4);
        while (index != (size - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        loops = loops - 1;
    }
}
Figure 4.8: Automatic Method for ‘scan’
bounds using parametric values, then the automatic method would generate
a different transformation. A conditional statement would be inserted before
the inner loop to execute a new function containing the split loops if the loop
bound of the inner loop exceeded the limit determined by equation 4.4.
4.5.2 Comparison of Manual and Automatic Insertion
Generally, the manual insertion process is considered more effective than
the automatic method. This is because the application programmer is able
to accurately insert reification calls even at non-loop-based data access points.
For instance, consider a large amount of memory being accessed in
parts across several function routines. Since the automatic insertion method
only observes memory accesses at certain fixed locations in the source (i.e.
loops in this case study), it fails to recognise this scattered memory access.
The application programmer, on the other hand, would know that the memory
access is scattered across several function routines and can thus add reification
calls encompassing these function routines.
For the application ‘scan’, both manual and automatic insertion methods
produce the same results. This is because the application code iterates over
a fixed-size memory array within a loop with a known loop bound. If the
code were changed so that the loop bound and the array size were passed to
‘scan’ via function parameters, manually inserted reification calls would not
yield the same results for all array sizes and loop bounds. If these values depend
on certain runtime events, then the programmer is unable to accurately insert
the reification calls.
After executing the two versions of the application scan in Linux, it was
found that the one using automatic insertion method finished execution nearly
125 seconds earlier than the one using manual insertion method. Nevertheless,
this is not true for all applications. For example, the MPEG decoder appli-
cation which had data accesses scattered across function boundaries showed
better execution time using the manual insertion method. A more detailed
analysis of both insertion methods for different applications is provided in the
next chapter.
Generally, manual insertion is a slow, time-consuming process and its
accuracy depends on the programmer’s skill in detecting memory accesses
and inserting reification calls accordingly. To conclude, both manual and
automatic methods have their associated pros and cons. This leads to the
hybrid method of insertion (described next), which combines the best of both
methods to yield better results.
4.6 Hybrid Insertion Method
Although the automatic method produces acceptable results, there are known
failure scenarios. It is not possible to automatically add accurate reifica-
tion calls in all applications, particularly where memory is accessed
across function boundaries. Consider, for example, the MPEG decoder appli-
cation [107]. In this application, a large amount of MPEG data is read into
memory and later decoded by several decoding functions depending on the
frame type (e.g. an I, B or P frame) [51]. Each decoding function is
responsible for decoding a particular kind of frame, consuming part of the
MPEG data in the process. In an MPEG stream consisting of several different
frames, data is consumed across various function boundaries. The automatic
method of insertion fails to identify this kind of scattered data access.
On the other hand, the manual method of insertion only produces the best
results if the application programmer is aware of such data access and adds
appropriate reification calls within function boundaries. Complete reliance
on the application programmer can have adverse effects as well. Due to
application complexity, the programmer may fail to insert certain key
reification calls that could make a large difference. Thus, there exist trade-
offs between both methods, prompting a combined hybrid approach.
In the hybrid approach, instead of manually analysing the entire applica-
tion source code, the programmer initially uses the automatic tool (cloop) to
analyse the application source and insert reification calls. The cloop tool can
be configured to output important information about the application source,
for instance the allocation/de-allocation of memory, the location of looped
memory accesses, etc. This provides the programmer with useful information
early on, pointing him/her at specific locations in the source code which can
then be analysed manually for insertion. Thus, if the application source code
is large and complex, the programmer using the hybrid method needs to
analyse only a fraction of it. By mixing the two methods, the programmer
can at least speed up the insertion process for looped memory accesses. The
next section describes CASP, an OS paging mechanism that makes use of the
keep() and discard() calls.
4.7 Design of CASP Mechanism
This section presents the design of a Co-operative Application-Specific Paging
(CASP) [97] mechanism in an OS. CASP makes use of the information pro-
vided by the reification calls inserted using one of the above methods. Note
that in the previous chapter DAMROS was a single address space OS, but
the design of CASP supports multiple address spaces as well.

Figure 4.9: Design of CASP Mechanism

The model of
the CASP mechanism is as shown in figure 4.9. Reification calls corresponding
to memory access/usage in the form of keep() and discard() calls are placed in
the application source code. The operation of CASP is divided into an appli-
cation level component called CASPapp and an in-kernel OS component called
CASPos. CASPos acts as a meta-level component of the OS paging module.
Both components are described in the following subsections.
4.7.1 CASPapp Component
The CASPapp component consists of a runtime library attached to the appli-
cation code. The library uses an OS system call interface to pass information
(in the form of reify() calls) to the CASPos component. A keep() call suggests
a memory region to be locked for use and a discard() call suggests unlocking a
previously used memory region. For example, an application process uses
keep(<Address>, <Size>) to suggest to CASP that it will access the mem-
ory pages mapped for virtual addresses ranging between Address and (Address
+ Size). The CASPapp component passes this information using reify() calls.
The reified information is picked up by the CASPos component, which uses re-
flection to lock the pages in memory along with techniques such as pre-paging
and page-isolation (described later). A call to discard(ID) suggests that the
CASPos component unlock the previously locked pages.
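The CASPapp library wrappers could look as follows (a sketch only: the reify() signature, the operation codes and the MEMORY resource ID are assumptions modelled on the text, not the actual DAMROS interface):

```c
/* Hypothetical reification record passed from CASPapp to CASPos. */
enum mem_op { OP_KEEP, OP_DISCARD };

struct mem_reify {
    enum mem_op op;
    void       *addr;
    long        size;
    int         id;
};

#define MEMORY 1   /* assumed resourceID for the paging meta-component */

/* Stub for the framework's system-call entry point. */
static int reify(int resource_id, struct mem_reify *info)
{
    (void)resource_id; (void)info;
    return 0;
}

static int next_id = 1;

/* keep(): suggest that [addr, addr+size) be locked in memory. */
int keep(void *addr, long size)
{
    struct mem_reify r = { OP_KEEP, addr, size, next_id++ };
    reify(MEMORY, &r);
    return r.id;              /* handle for a later discard() */
}

/* discard(): suggest that a previously kept region be unlocked. */
void discard(int id)
{
    struct mem_reify r = { OP_DISCARD, 0, 0, id };
    reify(MEMORY, &r);
}
```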
4.7.2 CASPos Component
The CASPos component is activated by the rManager when an application pro-
cess uses the keep() or discard() reification calls. After receiving the information,
i.e. the memory Address and Size, the process’s address space is checked to
see if the memory pages have already been allocated or if any pages in the
given memory region need to be paged in. Accordingly, pages are allocated or
paged in from the swap space and mapped into the process’s page table. This
operation is called pre-paging.
Algorithm 1 lists the pseudo code of the page-fault handler routine in an
OS. Note the use of the add_page_to_list() function, which adds a newly
allocated page into the OS-maintained page lists. When a page at a
particular virtual address is not found, the page-fault handler routine either
allocates a new page or pre-pages it from the swap space.

The CASPos component has a similar operation. In order to lock pages
in memory, the CASPos component uses a technique called page isolation
(explained in the next subsection) such that the locked pages are not placed
in the OS-maintained page lists. The only difference between the page-fault
handler routine and the CASPos component is that the page-fault handler
routine adds the allocated/pre-paged pages into the OS-maintained page lists
(via add_page_to_list() as in algorithm 1) whereas CASPos does not.
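The page-fault path and its interception point can be illustrated with a toy fragment (Algorithm 1 itself is not reproduced here; all types and helpers below are simplified stand-ins, not OS code):

```c
#include <stddef.h>

struct page { int frame; };

static struct page frames[8];
static int next_frame;
static int os_list_len;          /* pages on the OS-maintained lists */

/* Simplified stand-ins for the real OS helpers. */
static struct page *find_page(long vaddr)     { (void)vaddr; return NULL; }
static struct page *alloc_page(void)          { return &frames[next_frame++]; }
static void add_page_to_list(struct page *p)  { (void)p; os_list_len++; }

/* Simplified page-fault handler: allocate (or pre-page) the missing
 * page, then publish it on the OS page lists. CASPos differs only in
 * that it intercepts add_page_to_list() and isolates the page instead. */
void page_fault_handler(long vaddr)
{
    struct page *p = find_page(vaddr);
    if (p == NULL) {
        p = alloc_page();        /* or pre-page from the swap space */
        add_page_to_list(p);     /* interception target for CASPos  */
    }
}
```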
One of the advantages of the generic reflective framework is the ability
to re-use existing code. The CASPos component makes use of the interception
mechanism to re-use the existing code of the page-fault handler routine [95,
96] (see section 4.7.4). This mechanism allows the CASPos component to
intercept calls to add_page_to_list() in the page-fault handler routine such that
control is transferred to a page_isolation() routine instead. In a sense,
the CASPos component uses the information provided by the reification calls
to adjust the working set memory image of the application by pre-fetching,
locking/releasing and pre-swapping pages in and out of memory. This helps to
keep the memory pages in physical memory whenever the application
accesses them. The next subsection describes the page-isolation technique used
by the CASPos component for an efficient, non-intrusive page locking operation.
4.7.3 Page-isolation Technique
The page-isolation technique is relatively simple in operation. Algorithm 2
lists the pseudo code of the page-isolation routine. It is assumed that the OS
maintains two page lists (similar to Linux [19]): an active page list, consisting
of all pages that are in use, and an inactive page list, consisting of the
remaining pages. The page-isolation routine determines the list a page belongs
to and removes the page from that particular list. The page, thus removed, is
completely isolated from the OS-maintained page lists. This process is termed
page-isolation.
In order to keep track of the isolated pages, the CASPos component main-
tains a separate page list specific to each application that uses the reification
calls. This page list is stored in the OS address space and is not accessible to
the applications. It also incurs no extra memory overhead, since it occupies
the same amount of memory as if the pages were stored in the original OS page
lists.

The isolated page list is later emptied by adding the pages back into the
respective OS page lists when the corresponding application terminates, when
it uses a discard() call or when there is no free memory available for other
application processes.
Checks have been added in the CASPos component so that a single application
is not allowed to lock all available memory for itself. Even if a programmer
greedily adds keep() reification calls in order to lock more memory, CASPos
only locks the first Nfree/Nprocess pages, where Nfree is the number of
available free pages and Nprocess is the number of different application
processes running in the system. Each keep() call is time-stamped so that,
when there is no more free memory available in the system, the CASPos
component recovers pages starting from the oldest isolated page list.
Procedure Page-Isolation(page)
Begin
    if page_in_active_list(page) is TRUE then
        remove_from_active_list(page)
    else
        remove_from_inactive_list(page)
    add_page_to_isolated_list(page)
End Procedure

Algorithm 2: Page-isolation Routine
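A minimal C rendering of Algorithm 2, assuming simple doubly-linked page lists with an owner pointer (the list layout is illustrative, not the actual Linux structures):

```c
#include <stddef.h>

struct page {
    struct page  *prev, *next;
    struct plist *owner;              /* list this page currently sits on */
};

struct plist { struct page *head; };

struct plist active_list, inactive_list, isolated_list;

/* Unlink a page from whichever list owns it. */
static void list_remove(struct page *p)
{
    if (p->prev) p->prev->next = p->next;
    else         p->owner->head = p->next;
    if (p->next) p->next->prev = p->prev;
    p->prev = p->next = NULL;
}

/* Push a page onto the front of a list and record the new owner. */
static void list_push(struct plist *l, struct page *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head) l->head->prev = p;
    l->head  = p;
    p->owner = l;
}

/* Algorithm 2: remove the page from the active or inactive list
 * (whichever holds it) and move it to the per-application isolated list. */
void page_isolation(struct page *p)
{
    list_remove(p);
    list_push(&isolated_list, p);
}
```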
CASP operates non-intrusively with the existing page replacement code
and thus has virtually no side-effects. Since the isolated pages do not exist in
the OS page lists, they are never considered as candidates for reclamation by
the OS’s page replacement code. This could, in fact, speed up the reclamation
process, since the code has fewer candidates for page reclamation.
CASP achieves page locking without knowledge of the original page re-
placement code, making it a generic approach that can easily operate on top of
any existing page replacement policy. The next subsection describes the use
of the reflection framework in CASP.
4.7.4 Use of the Reflection Framework
CASP uses the interception mechanism built into the generic reflective frame-
work to re-use the existing page replacement code of the OS. The mechanism
allows unwanted function calls within a particular routine to be intercepted
and control transferred to another routine instead. This promotes code
reusability and also eliminates code redundancy.

During OS initialisation, the CASPos component sets itself as the meta-level
component of the resource represented by the resourceID MEMORY. Thus,
whenever the reflection framework receives information reified for the MEMORY
resource, the rManager activates the CASPos component. Also during ini-
tialisation, CASPos intercepts calls to add_page_to_list() in the page-fault han-
dler routine once and immediately unintercepts them by setting keepAlive to
TRUE.
When the CASPos component starts pre-paging, it uses reinterceptCall()
to intercept calls to add_page_to_list() before calling the page-fault handler rou-
tine. This results in control being transferred to the page-isolation routine
when the page-fault handler routine calls add_page_to_list(). By skipping
the execution of add_page_to_list(), the pre-paged pages are prevented
from being added into the OS page lists. At the same time, by executing the
page_isolation() function instead, these pages are added into the isolated page
list of the corresponding application. The next section discusses the evaluation
strategy for CASP.
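The re-interception sequence can be mimicked with a function pointer (only an analogy: the real framework patches the call site inside the page-fault handler, and reinterceptCall()/keepAlive are the framework's own primitives, whose exact signatures are not shown here):

```c
static int isolated_pages, listed_pages;

static void page_isolation_stub(void)   { isolated_pages++; }
static void add_page_to_list_impl(void) { listed_pages++; }

/* The page-fault handler calls through this pointer; redirecting it
 * models what the interception mechanism does to the real call site. */
static void (*add_page_to_list)(void) = add_page_to_list_impl;

static void page_fault_handler(void) { add_page_to_list(); }

/* CASPos pre-paging: re-intercept, re-use the handler, then restore. */
void caspos_prepage(void)
{
    add_page_to_list = page_isolation_stub;   /* reinterceptCall()   */
    page_fault_handler();                     /* re-used OS code     */
    add_page_to_list = add_page_to_list_impl; /* restore normal path */
}
```

With the redirect in place, pre-paged pages reach the isolated list instead of the OS page lists; once restored, ordinary faults behave as before.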
4.8 Evaluation Strategy
DAMROS is a single address space OS, whereas CASP is designed to support
multiple address spaces. It would therefore be interesting to see the applicability
of the framework and of CASP to an OS supporting multiple address spaces.
Rather than implementing a new OS or changing DAMROS, it would be ideal
to implement just the core elements of the framework, along with the CASP
mechanism, in a commodity multi-address space OS. However, before implementing
the framework and CASP in a commodity OS, evaluation using simulation is
considered in this chapter.
Existing virtual memory simulators cannot be customised to add CASP
capabilities, and they are also generally slow (in terms of simulation time).
The following sections present existing virtual memory simulation techniques,
followed by a description of PROTON [93], a home-grown, customisable,
on-the-fly virtual memory simulator. Later, in the evaluation section,
PROTON is used to simulate CASP along with applications that use the
reification calls.
4.9 Virtual Memory Simulation
This section surveys the existing virtual memory (VM) simulation techniques.
The VM simulation techniques can be classified into two main categories:
Trace-driven simulation and On-the-fly simulation.
4.9.1 Trace-driven Simulation
In the trace-driven approach, a complete memory reference trace of the given
workload executing upon real hardware is obtained. This trace is gener-
ally recorded as a disk file or transmitted via a communication medium (e.g.
Ethernet). The memory reference trace so obtained is processed by a simple
VM simulator implementing the required paging policy (e.g. LRU). Several
solutions, such as Laplace [66] and kVMTrace [67], exist to obtain memory
reference traces of a system workload.
ATOM [114] is a static code annotation based trace collection tool which
analyses a single application. ATUM [10], on the other hand, uses microcode
to efficiently capture address traces. Since the microcode operates beneath the
OS layer, the captured trace consists of the memory accesses of all the software
components running on the hardware.
Trace-driven approaches are known to generate huge reference traces even
for only a few seconds of workload execution. This introduces several
errors, such as trace discontinuities, time dilation and memory dilation [122].
Such errors are collectively called trace distortions [122].
Kaplan et al. [68] proposed two algorithms, Safely Allowed Drop (SAD)
and Optimal LRU Reduction (OLR), for reducing the trace size by several
factors. Using this approach, the simulation error in terms of the number of
page faults for the CLOCK and SEGQ (segmented queue) replacement policies
was under 3%. However, since the trace reduction algorithms discard information
which is not required by an LRU policy, this approach only applies to LRU-
based policies. The next subsection discusses the existing on-the-fly simulation
techniques.
4.9.2 On-the-fly Simulation
On-the-fly VM simulation techniques simulate the memory references
alongside the execution of an application process. The approach involves an-
notating the application code to call a simulator function for each memory
access instruction; this function simulates the memory reference in the simu-
lator. Although this method eliminates the need for recording and handling
huge memory reference traces, it adds considerable runtime overhead.
MemSpy [83] is one such simulator, which annotates an application’s assembly
code. Typically, MemSpy exhibited a slow-down factor in the range of about
20 to 60 for the simulation of a direct-mapped data cache of size 128 KB [122].
Fast-Cache [74] is an on-the-fly simulator based on an abstraction called
‘active memory’. It exhibits a slow-down factor in the range of about 2 to 7
for the simulation of direct-mapped data caches of sizes 16 KB to 1 MB.
Most of the existing on-the-fly simulators support single application pro-
cesses only, making it possible neither to determine overall system performance
nor to predict the effects of a particular application on an existing system
workload. PROTON, on the other hand, can simulate multiple applications,
making it possible to determine the VM performance of the entire system.
Eggers et al. [43] present techniques for the efficient placement of in-
strumentation trace points, or annotations, into application assembly code.
The approach is particularly focused on shared-memory multiprocessor ar-
chitectures. PROTON focuses on reducing code annotations in a high-level
language source rather than in assembly code. This gives it more flexibility
and makes the process portable to any platform (particularly if the application
source is written in the C language).
The next section presents the design and implementation of PROTON, which
is used to simulate the reification calls as well as the CASP mechanism.
4.10 PROTON Virtual Memory Simulator
The PROTON [93] simulator has been specifically designed for flexibility, easy
customisability and support for the simulation of multiple applications at once.
Another objective of PROTON is to improve upon VM simulation time.
Figure 4.10 shows the model of PROTON. The implementation uses the
POSIX [98] library for handling multiple threads. Similar to other on-the-fly
VM simulators, PROTON annotates the application source. The annotations
call a PROTON function at the point of memory reference in the application.
Adding annotations in the high-level language helps to better analyse the data
access pattern and optimise the placement of annotations. The following subsec-
tions explain the optimised code annotation technique and the operation of
PROTON in more detail.
4.10.1 PROTON Annotations
The application source code, written in a high-level language (C in this case),
is annotated in three phases. In phase #1, PROTON parses the high-level
language application source and builds an internal representation of the dynamic
data flow of the application, similar to the one generated by the cloop
tool. Phase #2 detects memory accesses and inserts suitable annotations using
the optimal placement technique (explained in the next subsection). In phase
#3, all the memory allocation/deallocation functions (i.e. malloc(), free(), etc.)
of the underlying C library are intercepted such that PROTON can trap
dynamic memory allocations and simulate them.
Optimal Placement of Annotations
This subsection uses the application ‘scan’ (figure 4.2) as an example for plac-
ing annotations. Traditional annotation methods involve annotating the
assembly code of an application. These methods utilise little or no information
regarding looped sequential access; neither can such methods detect the
size of a dynamic memory allocation (e.g. the size of memptr in scan).
Figure 4.10: PROTON Design Model

If PROTON were to use similar techniques, the resulting annotated
application source would look like the listing in figure 4.11. The annotation
‘sim_mem_access(&memptr[index], 1, READ)’ indicates to the simula-
tor that the application is reading 1 byte from the memory location pointed
to by memptr[index]. Note that this example only shows annotations for the
dynamic memory allocations; the variable temp is stored on the stack and its
accesses are not annotated here. It is evident that the annotation
for the inner loop can easily be inserted before the start of the loop such that
a single annotation is enough to represent the entire memory access of the
variable memptr in the loop.
Such optimised placement of annotations is only possible by analysing
application source written in a high-level language. Phase #2 of PROTON
detects this kind of memory reference and inserts a single combined annotation
outside the loop. In the case of ‘scan’, by adding a single annotation outside the
inner loop, the annotation is called only 5 times instead of (5 × 104,857,600)
times.
Each annotation is a function call, which requires the storage and re-
trieval of the processor flags and registers on the stack. Thus, heavy use of
such annotations would incur a substantial runtime overhead, potentially slow-
ing down the simulation process. By optimal placement of the annotations,
PROTON minimises the number of annotations (i.e. function calls), thereby
reducing the associated overhead (see the code in figure 4.12).
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops) {
        for (index = 0; index < (size - 1); index++) {
            sim_mem_access(&memptr[index], 1, READ);
            temp = memptr[index];
        }
        loops--;
    }
}
Figure 4.11: ‘scan’ with Traditional Annotation
Nevertheless, such a placement technique may introduce an element of error
into the simulation. For example, in the case of ‘scan’, the placement strategy is
accurate when memptr is the only variable being accessed in the loop. How-
ever, if other variables are also accessed in the loop, these intermittent
accesses would affect the state of the virtual memory subsystem, resulting in
a different set of paging operations. For instance, consider the following code
statement:

memptr[index] = strptr[index] + memptr[index];

If this statement is added to the inner loop of ‘scan’, then the memory
locations strptr[index] and memptr[index] are first read from memory, the sum
calculated and the result written back to the location memptr[index]. In
this case, it is not appropriate to insert three annotations outside the loop: two
indicating read accesses to the variables memptr and strptr using the READ
option and one indicating a write access to memptr using the WRITE option
to sim_mem_access(). For all such cases, PROTON uses the traditional
approach of adding an annotation for each memory access inside the loop.
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops) {
        sim_mem_access(memptr, size, READ);
        for (index = 0; index < (size - 1); index++) {
            temp = memptr[index];
        }
        loops--;
    }
}
Figure 4.12: ‘scan’ with PROTON Annotations
A preliminary analysis of the MiBench [56] applications suggested that loop-
based single-variable accesses similar to ‘scan’ are common. For in-
stance, in the bubble sort algorithm all accesses in the loop are confined to
the array being sorted. However, there are some exceptions, such as the
FFT (Fast Fourier Transform) application, whose access pattern depends
on the data at runtime.
PROTON is able to detect dead code in loops. For instance, consider a
loop containing dead code as shown in the following code fragment:
...
condition = FALSE;
for(i=0; i< SIZE; i++)
{
if(condition) {
temp = memptr[i];
}
else {
;
}
}
...
In the above code, the read access to memptr would never be executed.
During phase #1, PROTON builds an internal symbol table with knowl-
edge of statically initialised variables. For the above code fragment, in phase
#2, when analysing the ‘for’ loop, PROTON evaluates the condition of the
‘if ’ statement, which in this case results in FALSE. Hence, no annotations are
added for access to the variable memptr. However, PROTON can only analyse
static conditions (it cannot analyse, for instance, if(func_call(condition))).
In such cases, PROTON adds the annotation inside the dead code. Since the
annotation is within the dead code, it is never executed, thus maintaining the
exact runtime behaviour.
4.10.2 Simulation of Multiple Applications
In order to simulate multiple applications, the annotations from all applica-
tions should be recorded by a common PROTON simulator code base. IPC
mechanisms [90] such as pipes or shared memory could address this, but using
IPC would add additional overhead into the simulation process, making it
much slower. PROTON takes a different approach. During the application
source analysis in phase #1, a minor source modification enables the entire
application workload, consisting of all applications, to run as a single
multi-threaded application which is linked with the PROTON simulator.
Each application is then executed as an independent thread of the resulting
application process. This way, the annotations added into each application
provide a direct function call interface without incurring extra communication
overhead.
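The single-process workload structure can be sketched with pthreads (illustrative only; the renamed entry points and the simulator's internal counter are assumptions, not PROTON's actual code):

```c
#include <stddef.h>
#include <pthread.h>

static long simulated_refs;
static pthread_mutex_t sim_lock = PTHREAD_MUTEX_INITIALIZER;

/* Annotation target: a direct call into the simulator, shared by all
 * application threads in the single address space. */
void sim_mem_access(void *addr, long size, int op)
{
    (void)addr; (void)size; (void)op;
    pthread_mutex_lock(&sim_lock);
    simulated_refs++;                 /* update shared simulator state */
    pthread_mutex_unlock(&sim_lock);
}

/* Each application's main() is renamed and run as a thread. */
static void *app1_main(void *arg) { (void)arg; sim_mem_access(NULL, 64, 0); return NULL; }
static void *app2_main(void *arg) { (void)arg; sim_mem_access(NULL, 32, 0); return NULL; }

long run_workload(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, app1_main, NULL);
    pthread_create(&t2, NULL, app2_main, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return simulated_refs;    /* total references seen by the simulator */
}
```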
Although PROTON supports multi-threaded applications, it does not
guarantee their scheduling behaviour. Since PROTON is built on top of the
POSIX pthreads [98] library, a multi-threaded application can opt to use the
underlying POSIX scheduling framework if available.
If a single application is multi-threaded, then during simulation of multiple
applications, the individual threads of that particular multi-threaded applica-
tion would be executed along with other application threads, all sharing a
single address space. One limitation of this approach is that there is no pro-
tection between the threads belonging to different applications. Thus, a faulty
or erroneous application within the simulated workload can affect the simula-
tion process by interfering with other application threads. It is therefore assumed that
each application has been tested and is bug free. Tools such as Valgrind [108]
can be used to detect memory errors in the application code.
The simulator has been implemented as a shared library which is linked
to the application at runtime. The following parameters are used to configure
PROTON's virtual memory system:
• TOTAL_MEMORY: This parameter sets the amount of physical memory
to simulate.
• PAGE_SIZE: This parameter sets the size of a memory page. PROTON
supports different page sizes, allowing it to simulate virtual memory for
different architectures.
• DISK_READ_TIME: This parameter sets the time required to
read a single page from the disk. PROTON simulates a disk read/write
by executing a delay sequence for the given time.
• DISK_WRITE_TIME: This parameter sets the time required
to write a single page to the disk.
• TOTAL_APPS: This parameter specifies the number of application
threads to be simulated.
• POLICY: This parameter specifies the paging policy to be used for sim-
ulation. PROTON implements the following paging policies [20, 40, 90,
120]:
1. LRU policy,
2. CLOCK policy,
3. MRU policy,
4. FIFO (First-In-First-Out) policy,
5. AGEING policy,
6. LFU policy,
7. MFU (Most Frequently Used) policy,
8. RANDOM policy,
9. USER - specifies that a user-defined policy is in use.
Note that PROTON implements the CASP mechanism described in sec-
tion 4.7 which operates on top of any of the above page replacement
policies.
• MK_TRACE: This parameter enables or disables recording of the memory
reference trace to a disk file.
• GUI: This parameter enables a graphical display of the execution of applications
and dynamic graphs of the paging activity of the system. Use of the
GUI (Graphical User Interface) slows the simulation process and is not
recommended for workloads with lengthy execution times.
A configuration file is used to set up the above simulator parameters. A set
of annotated test applications are compiled and linked together to form one
executable, which when executed starts the PROTON initialisation function.
This function reads the configuration file and initialises the virtual memory
system; PROTON then spawns the application threads and begins simulation. The
simulation either runs continuously until all application threads finish
execution or runs for a fixed amount of time. PROTON then generates the virtual
memory statistics of the application(s). The PROTON virtual memory sys-
tem assumes the following: the cache is disabled, the CPU accesses main
memory directly, and the swap space has unlimited storage.
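The text does not specify the configuration file's syntax; purely as an illustration, the parameters above might be collected in a key=value file along these lines (all values hypothetical):

```
# PROTON configuration (illustrative sketch; the actual file format is
# not specified in the text, and all values are hypothetical)
TOTAL_MEMORY    = 64M      # physical memory to simulate
PAGE_SIZE       = 4096     # size of a memory page in bytes
DISK_READ_TIME  = 8        # time to read one page from disk
DISK_WRITE_TIME = 10       # time to write one page to disk
TOTAL_APPS      = 3        # number of application threads
POLICY          = LRU      # or CLOCK, MRU, FIFO, AGEING, LFU, MFU, RANDOM, USER
MK_TRACE        = 0        # 1 = record the memory reference trace to a disk file
GUI             = 0        # 1 = enable the graphical display (slower)
```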
4.10.3 Implementing UD Paging Policies
The value ‘USER’ for the configuration parameter ‘POLICY’ specifies the use of
a user-defined (UD) page replacement policy. PROTON provides user-hooks which,
when set, are executed at certain paging events during simulation. A new
UD policy needs to use these user-hooks to trap the required paging events.
The following user-hooks are provided in PROTON:
• pgrep_init() and pgrep_exit(): These user-hooks are called by PROTON
to initialise/de-initialise the page replacement policy code.
• pgrep_replace(): This user-hook is called when there is no free mem-
ory available for allocation. The function is expected to return a page
number between 0 and MAX_PAGE_NUMBER (a globally accessi-
ble value) that will be swapped/replaced by PROTON. A UD policy
must implement this hook; otherwise, the PROTON simulator will not
initialise.
• pgrep_alloc() and pgrep_free(): These user-hooks are called after a new
page has been allocated to or freed from an application. The implemen-
tation of these hooks is optional.
• pgrep_access(): This user-hook is called whenever a page is accessed by
an application. This helps the UD paging policy maintain page usage
statistics.
The next section describes simulation experiments using PROTON to de-
termine the effects of reification calls and the CASP mechanism for different
kinds of benchmark applications.
4.11 Simulation Experiments using PROTON
4.11.1 Benchmarks
Experiments were performed using four different kinds of applications: one is
the sample application ‘scan’, two were taken from the embedded benchmark
suite MiBench [56], and one was taken from Brown's work [26]. The data-set size of
these applications was increased such that they required more physical memory
than was available. PROTON was configured to simulate 64 MB of memory
with an unlimited swap-space. Table 4.1 summarises the characteristics of
these applications. Each chosen application exhibits a different memory access
pattern.
Name     Description                       Input Data Set                          Memory required
BSORT    Bubble sort                       2559 records of 40 KB each              100 MB
FFT      Fast Fourier transform            four 32-bit float arrays of size 2^22   64 MB
MATVEC   Matrix vector multiplication      three 32-bit integer arrays             101 MB
SCAN     Example benchmark application     continuous byte array                   100 MB

Table 4.1: Description of Benchmark Applications
BSORT – a bubble sort algorithm that sorts a list of records (approx.
40 KB in size) in ascending order. The access pattern is mainly sequential
with a few random accesses due to record swap operations.
FFT – uses 4 floating point number arrays to perform the fast Fourier
transform. The data access pattern in the main loop of FFT depends on the
value of the data itself, resulting in a random data access pattern.
MATVEC – a matrix vector multiplication application that uses three data
arrays of different sizes to perform intensive matrix multiplication operations
in loops with known loop bounds. Therefore, it was easy to insert reification
calls by both automatic and manual methods. MATVEC was used to test
CASP for locking multiple data sets.
Finally, SCAN – the previously described micro-benchmark application. In
order to evaluate the automatic insertion method under dynamic conditions,
SCAN was modified to use parametric values for the variables ‘loops’ and ‘size’
(see original code in figure 4.2). SCAN generates a worst-case scenario for the
page replacement policies, which helps determine the performance of CASP in
the worst case.
Application   Original (O)           using PROTON (P)       using CASP (C)
              Time      Faults       Time      Faults       Time      Faults
BSORT         20,289    1,248,444    9,115     1,216,539    7,152     1,128,046
FFT           7,832     809          3,616     799          2,855     654
MATVEC        3,215     164,774      2,664     164,761      1,041     103,502
SCAN          1,773     112,385      1,581     112,385      599       80,681

Table 4.2: Single Application Benchmark Results for LRU
The results obtained from PROTON simulation have been validated
against the results of the original simulation method for the same applica-
tions. In the original method, applications were annotated for each and every
memory access and simulated for the respective paging policies. It is observed
that PROTON simulation has minimal simulation error (in terms of the dif-
ference in page fault generation) as compared to the original simulation. This
is further discussed towards the end of this section. The results obtained by
the original simulation method are represented as (O) for each application
workload.
4.11.2 Single Application
Initially, to test the performance of CASP with the use of reification calls
in complete isolation, a single application scenario is considered. Also, the
performance of the PROTON simulator with the described optimisation was com-
pared against the original simulation results (O). Three versions of all the
benchmark applications were produced: (1) PROTON version – the annota-
tions were added into the application source code using the placement tech-
nique; (2) CASP version – similar to (1) but including the reification calls (i.e.
keep() and discard()); and (3) Original version – the annotations were added at
each memory reference in the application source code. The benchmarks were
simulated for the following page replacement policies: LRU, MRU, LFU and
MFU. Each version of the benchmark application was simulated in a single
application scenario for the above page replacement policies.
Table 4.2 lists the simulation time (in seconds) and the number of page-
faults for each of the benchmark applications when simulating an LRU paging
policy. The values under column (O) represent the original simulation, those
under column (P) represent the normal simulation using PROTON and those
under column (C) represent PROTON simulation with the applications using
the CASP mechanism. Since the original simulation annotates each and every
memory reference, its simulation time is considerably larger than the PRO-
TON simulation. The timings of (O) are used as a guide to the maximum
value of the simulation time when comparing the (P) and (C) values.
Applications using CASP generated significantly fewer page faults
(see table 4.2) and exhibited lower simulation times. On average, for an LRU
policy, applications using CASP generated 19% fewer page faults.
MATVEC using CASP performed best amongst all the applications.
This is because the data access pattern of MATVEC is known and it
consists of loops with known bounds. The access pattern of SCAN is also
deterministic and hence it too benefited from CASP. The non-deterministic nature
of the data to be sorted and the intensive swap operations between random
memory locations caused BSORT to perform the worst. FFT also has a nearly
random access pattern; the reification calls in FFT try to lock the part of the
memory being accessed sequentially.
Simulator Performance
For BSORT, PROTON reduced the simulation time by almost
57% (avg.) across all policies (see figure 4.13). BSORT accesses
memory intensively; PROTON inserts common annotations outside the loops,
saving processing time, which results in better simulation time. However, due
to this optimised placement, PROTON incurs a 2% simulation error in terms
of the number of page faults.
Similarly, for FFT, MATVEC and SCAN, PROTON reduces simulation
time by 48%, 28% and 17% respectively (see figures 4.14, 4.15 and 4.16). The
analysis of application source written in a high-level language helps PROTON
identify data locality in the code and optimise the placement of code annota-
tions.
On average, PROTON reduces the simulation time by between 17% and 57%,
with the simulation error ranging from 0% to 2%. Considering the improve-
ment in the simulation time and the flexibility offered by PROTON, a simu-
lation error of 2% is acceptable.
[Figure 4.13: BSORT Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]

[Figure 4.14: FFT Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]

[Figure 4.15: MATVEC Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]

[Figure 4.16: SCAN Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]
4.11.3 Multiple Applications
Another advantage of PROTON is its ability to simulate multiple-applica-
tion workloads. Conventional VM simulators do not support simulation of
multiple applications. Such simulation not only helps system devel-
opers determine the performance of the entire workload, but also allows them to
determine the effects of a particular application on a given workload. Further-
more, simulation using different page replacement policies in PROTON helps
the developer identify the most suitable page replacement policy for a given
workload.
Tables 4.3, 4.4 and 4.5 list the simulation results for an LRU policy in two
and three application scenarios. In the tables, the column ‘App-set’ shows
the combination of applications that were simulated together. To denote this,
the first letter of each application is used. For instance, S-M suggests that
applications SCAN and MATVEC were simulated together. Similarly, S-B-F
suggests that applications SCAN, BSORT and FFT were simulated together.
App-set   using PROTON (P)        using CASP (C)
          Time      Faults        Time      Faults
S-M       7,286     284,159       5,947     123,608
S-B       18,492    1,392,173     12,058    1,062,146
S-F       8,254     114,637       7,947     88,037

Table 4.3: Two Applications Scenario for LRU (1)

App-set   using PROTON (P)        using CASP (C)
          Time      Faults        Time      Faults
M-F       10,145    172,732       9,047     143,233
M-B       20,348    1,392,324     17,094    1,102,068
F-B       21,026    1,218,372     22,542    1,234,016

Table 4.4: Two Applications Scenario for LRU (2)
It is evident from the tables that, on average, the simulation time of the ap-
plications almost doubles when compared to the single application scenario.
However, the number of page-faults varies according to the applications. This
is because the CPU resource is shared amongst several applications
using a scheduling algorithm, whereas the performance of paging is largely
dependent on the page replacement policy being used and an application's
memory access patterns. If the page replacement policy in the OS cannot
be changed, then the type of applications and their respective memory access
patterns dictate the paging performance of the system. Thus, it is important
to be able to determine the effects of certain applications on a given workload.
It was observed that applications with known or deterministic memory
access patterns showed better performance using CASP. The key is to use ac-
curate reification calls in the source code. Since it is difficult to predict the
memory access pattern of applications such as BSORT and FFT, reification
calls were inserted to intermittently lock and release certain memory regions.
Notice that although the applications using CASP show better performance
than the ones not using CASP, the performance improvement is not signifi-
cant. Looking at the results obtained for the two applications scenario, it can
be concluded that CASP depends on accurate reification calls and provides
better support to applications with known memory access patterns. When
the applications FFT and BSORT are simulated together, more page
faults are generated, increasing the simulation times. Since both applications
have nearly random access patterns, the insertion of reification calls has no
significant benefit.
App-set   using PROTON (P)        using CASP (C)
          Time      Faults        Time      Faults
S-B-F     22,394    1,334,428     19,065    1,186,685
S-B-M     25,235    1,502,785     19,376    1,163,424
S-F-M     12,947    282,728       10,031    153,139
B-F-M     13,283    1,394,482     12,997    1,288,604

Table 4.5: Three Applications Scenario for LRU
Similar results were obtained for the three applications scenario, in which
the applications performed best when SCAN and MATVEC were executed
together in a group (see table 4.5). From the above simulations, it can be noted
that even when applications using CASP do not show significant performance
improvement, CASP does not impose large penalties on the system. Also,
CASP provides the best support for applications with deterministic memory access
patterns (e.g. sequential access).
4.11.4 Slow-down Factor
Conventional on-the-fly simulation techniques have been shown to add a slow-down
factor ranging from 20 to 60 [122]. In comparison to (O), PROTON has been
shown to reduce the simulation time by 17%–57%. Thus, PROTON-based
simulation can be considered nearly 17%–57% faster than the
traditional method of full annotations. Furthermore, previously reported sim-
ulators simulated small amounts of memory, ranging between 128 KB and 1 MB;
the use of larger memory may further affect their performance. For the above
simulations, PROTON was configured to use 64 MB of memory.
In summary, the evaluation in this chapter has accomplished two main
objectives. Firstly, it showed that the CASP mechanism helps reduce the
number of page faults by almost 19% when accurate reification calls are used.
Secondly, the PROTON simulator is a better alternative for on-the-fly virtual
memory simulation, one which supports simulation of multiple applications. The
simulator improves simulation time by reducing the required number of code
annotations.
4.12 Summary
This chapter used virtual memory (paging) as a case study to show the sig-
nificance of reification in the reflective framework. The chapter considered
various methods of inserting reification calls into application source code
written in the C language. The design of CASP, an OS paging mechanism that
utilises the information provided by the reification calls, was presented. CASP
efficiently locked/released pages in memory such that the pages that an appli-
cation would access in the immediate future were always present in memory.
The mechanism operates non-intrusively on top of any existing page replace-
ment policy in the OS.
Furthermore, the design and implementation of PROTON, an on-the-fly
virtual memory simulator, was described. Simulation experiments using PRO-
TON showed improvement in the performance of the benchmark applications that
used CASP via the reification calls. It was also shown that the PROTON sim-
ulator performed better than conventional on-the-fly simulators. In the next
chapter, the implementation of CASP in a commodity OS – Linux (2.6.16
kernel) – is described along with its experimental evaluation.
Chapter 5
Implementation of CASP in a Commodity OS (Linux)
In chapter 3, the generic reflective RTOS framework was presented and eval-
uated with a prototype µ-kernel implementation – DAMROS. Chapter 4 pre-
sented a case-study of using reification calls in conjunction with the virtual
memory resource evaluated using a virtual memory simulator – PROTON [93].
An OS mechanism – CASP [97] was proposed in order to provide application-
specific memory management support. The CASP mechanism allows applica-
tions to lock and release memory pages dynamically at runtime via the use of
reification calls. This process helps reduce the number of page faults mainly
caused by incorrect page eviction by the underlying page replacement pol-
icy. It is essential to evaluate the CASP mechanism in a real-world commodity
OS. This chapter presents the implementation and evaluation of the framework
and CASP in Linux (2.6.16 kernel).
Linux is an open source operating system widely used in embedded systems
such as mobile phones, PDAs, media players, etc. The widespread use and the
availability of the kernel source code played a key role in choosing Linux for
implementation of the reflection framework and CASP. The framework and
221
222
CASP has been implemented in two flavours of Linux: one using an LRU-
based paging policy and the other using CART [17] paging policy.
The chapter is organised as follows. The next section provides an overview
of the Linux 2.6.16 kernel, which implements an LRU-based paging policy. The
section also describes the implementation of the CART [17] paging policy in Linux.
Section 5.2 describes the implementation of the reflection framework and
CASP in both flavours of Linux. Finally, section 5.3 presents the evaluation
of CASP using standard benchmark applications in both single and multiple
application scenarios.
5.1 Overview of Linux 2.6.16 Kernel
This section provides a brief overview of the memory management subsystem
in the vanilla Linux 2.6.16 kernel. Memory in Linux is divided into three
different zones: ZONE_DMA – the low memory zone (the first 16 MB, addressable
by legacy ISA DMA devices), mainly used for I/O or DMA (Direct Memory Access) operations;
ZONE_NORMAL – the normal zone above ZONE_DMA, used by the kernel and
applications; and ZONE_HIGHMEM – the high-end memory that is not
permanently mapped into the kernel address space [19, 54]. The memory pages belonging to each zone are stored in
two zone-wise lists – the active list and the inactive list. The active list consists of
the most recently accessed pages and all newly allocated pages.
Unlike in theory [62], Linux does not reclaim pages upon a page-fault. A
special kernel daemon thread, kswapd(), reclaims pages when invoked, depend-
ing on set watermarks. The kswapd() thread tries to maintain a fixed number
of free pages in a zone, determined by the value of the zone
watermark. This thread moves pages in the active list that have not
been recently accessed into the inactive list. While in the inactive list, pages
that are accessed are again marked as accessed by the kernel so that kswapd() moves
them back into the active list. When the ratio of the number of pages in the inac-
tive list to the active list reaches a certain watermark, the kswapd() thread
starts reclaiming unreferenced pages from the inactive list.
The vanilla Linux 2.6.16 kernel implements an LRU-based page replace-
ment policy which can be closely compared with LRU-2Q [54]. In practice it
has been shown that the performance of this replacement policy is close to
LRU [17]. For simplicity, in all further discussions the vanilla Linux kernel
implementing this page replacement policy will be referred to as Linux-LRU.
Note that Linux-LRU makes page replacement decisions solely on the basis of
recency, without using any frequency information; i.e. under heavy system load,
it is possible for the page replacement policy to replace the most frequently
used page. For more information about the Linux kernel and its memory
management subsystem, please refer to [54].
The CASP mechanism has been designed to operate in conjunction with
any existing page replacement policy. In order to test CASP in Linux with
another page replacement policy, a patch consisting of a CART-based [17]
policy implementation in Linux was obtained from Peter Zijlstra¹ [1]. The
next subsection briefly describes the implementation of the CART policy in Linux.
5.1.1 CART Implementation in Linux
The term Linux-CART will be used to refer to the implementation of CART
in Linux. The Linux-CART implementation uses four different page lists: T1,
T2, B1 and B2 for each memory zone.
The pages in T1 are considered to have a short-term utility while the pages
¹ Downloaded from URL: http://programming.kicks-ass.net/kernel-patches/cart/
in T2 have a long-term utility. The CART page replacement policy reclaims
recently unreferenced pages from T2 first and then reclaims similar pages from
T1. This ensures that frequently accessed pages are not reclaimed by CART.
However, this can affect applications with large sequential loop-based access
patterns, whose pages can generally be said to have long-term utility.
The other two page lists: B1 and B2 maintain page-history information
of those pages that were recently reclaimed. A more detailed explanation can
be found in [17]. The paging model of Linux 2.6.16 kernel has been explained
in section 4.1 in chapter 4.
5.2 Implementation in Linux
This section describes the implementation of the reflection framework and
CASP [97] in Linux 2.6.16 kernel. The CASP implementation depends on the
generic reflective framework. The next subsection describes the implementa-
tion of the reflection framework.
5.2.1 Reflection Framework
The core elements of the reflective framework, as described in section 3.2.1,
have been implemented in the Linux kernel. Most of the reflection code has
been ported from DAMROS. The interface to these elements remain the same
as in DAMROS. The Linux kernel has been modified to implement the follow-
ing functions:
• reify()
• requestInfo()
• interceptCall()
• uninterceptCall()
• linkData()
Linux is a multi-address-space OS. Thus, unlike in DAMROS, the above functions
cannot be called directly by the applications. For this purpose, a system call
has been implemented which passes information between the application space
and the kernel space. This is discussed further in the next subsection.
The implementations of interceptCall() and linkData() support the inter-
ception of a function and the formation of a causal link to data belonging to
a common address space; i.e. an application is able to intercept functions
implemented in its own address space. However, since Linux is a monolithic
OS, kernel code, including all system modules, resides in a single kernel address
space. Thus, a meta-level of a system module can intercept any function or
causally link to any data in the kernel address space.
The implementation of the interception mechanism makes similar changes
to the underlying machine code as explained in section 3.3.3. Linux uses cer-
tain hardware features to set read-only, read/write or execution permissions on
memory pages. For instance, in Linux, a memory page containing the applica-
tion code has read-only and execution permissions set such that no process in
the system can change its contents. The implementation of interceptCall() in
Linux temporarily changes set read/write permissions to such memory pages,
then changes the machine code and resets the original permissions.
Similar to the implementation in DAMROS, information reified in Linux
is stored in the kernel and passed on to the requesting meta-level modules.
Also, the implementation of the framework is specific to the Intel x86 archi-
tecture [61].
5.2.2 CASP Mechanism
CASP has been implemented in two different flavours of Linux: one using the
original LRU-based policy and the other using the CART page replacement
policy. CASP consists of two components: CASPapp operates in the appli-
cation space and CASPos operates in the kernel space. The reification calls
keep() and discard(), defined in chapter 4, for virtual memory have been im-
plemented as an application library – CASPapp. A new system call has been
implemented in Linux to facilitate the communication between CASPapp and
CASPos components.
In both flavours of Linux, handle_mm_fault() is the main page-fault han-
dler routine. The functions pagevec_add() and pgrep_add() are used for
adding a page to the page list in Linux-LRU and Linux-CART respectively.
During the process of OS initialisation, the interception code of CASPos
scans the handle_mm_fault() routine, recording the locations of calls to the
pagevec_add() routine. Later, while pre-paging, CASP requests the interception
of all the recorded calls. This action makes the interception mechanism replace
the underlying machine code that calls the routine pagevec_add() with
code that calls the routine page_isolate() instead. Typically, there is only one
call to pagevec_add() in the handle_mm_fault() routine. Since the location
of this call is marked at the beginning of OS initialisation, the actual cost of
interception during pre-paging is only a few microseconds. This is negligible
compared to the cost of the paging subsystem in general.
After CASP finishes pre-paging, it un-intercepts the calls back to the
pagevec_add() routine (i.e. resets the machine code to its original state). The
implementation of the framework is common to both flavours of Linux.
Finally, the page-isolation mechanism depends on the page replacement
policy implemented in the Linux kernel. This is because each replacement
policy maintains different page lists from which the locked pages need to be iso-
lated. The following subsections describe the implementation of the page-isolation
routine in each Linux flavour.
5.2.3 Page-isolation in Linux-LRU
Similar to algorithm 2 in chapter 4, the page-isolation routine in Linux-
LRU removes a page from either the active list or the inactive list, depending on where
it resides during isolation, and then adds this page to the corresponding appli-
cation's isolated page list. When isolated pages are discarded, CASPos inserts
these pages into the inactive list, making them the most likely candidates for
reclamation.
5.2.4 Page-isolation in Linux-CART
The Linux-CART implementation maintains four page lists – T1, T2, B1 and B2.
Since the pages in B1 and B2 do not reside in physical memory, only T1
and T2 are of interest here. Thus, the page-isolation routine in Linux-
CART removes a page from either T1 or T2, depending on where it resides
during isolation, and then adds this page to the corresponding application's
isolated page list. When isolated pages are discarded, CASPos inserts these
pages into T2, making them the most likely candidates for reclamation.
The implementation of CASP makes use of the information generated by
the reification calls inserted in the applications and works non-intrusively, with-
out affecting the normal operation of the existing page replacement code.
The efficient implementation of the interception mechanism ensures code re-
usability without incurring high penalties in either space or time.
The next section describes the experimental evaluation of CASP in Linux.
5.3 Experimental Evaluation
5.3.1 Hardware Platform
All the experiments were performed on an embedded Cyrix MediaGX 233 MHz
processor-based system with 64 MB SDRAM memory and 128 MB Linux swap
partition on a 7200 RPM IDE disk drive. For each benchmark application,
three versions were produced: (1) manually inserted reification calls, (2) au-
tomatically inserted reification calls and (3) manually inserted Linux mlock()
primitives. Version (3) is the same as (1) except that CASP's keep() and discard()
are replaced by Linux's mlock() and munlock() primitives. The workload (sin-
gle or multiple benchmark applications) was executed on a freshly booted test
platform running the corresponding Linux flavour, maintaining the same en-
vironment to obtain accurate measurements.
5.3.2 Benchmark Applications
In order to test the performance of CASP, different kinds of benchmark appli-
cations were selected. The selection of benchmarks was based on the following
criteria:
• Memory usage: the application must be out-of-core – i.e. it must use
more memory than is physically available.
• Access pattern: the applications selected should have different types of
memory access patterns (e.g. sequential, random, etc.).
• Embedded: the application or part of the application code must be
applicable to embedded systems, including real-time systems.
• Linux: the application should compile and execute on the implemented
Linux flavours.
Several kinds of applications were surveyed and, finally, five applications
were chosen for this evaluation. The benchmark applications used were: three from
MiBench [56] (an embedded applications benchmark suite); one from Brown et
al. [26]; and the application ‘scan’ (described in chapter 4). The data-set
size of these applications was increased such that they required more physical
memory than was available. No other modification was made to the application
source. Table 5.1 summarises the application characteristics. Each benchmark
application has a different memory access pattern:
MAD – an MPEG decoder application which sequentially decodes data into
a fixed-size buffer. The buffer is used by several functions, each consuming
part of the data. It is not possible to analyse the data locality of MAD using
the automatic insertion method. Reification calls were inserted manually to lock and
release parts of the buffer as the data was consumed, and were placed around function
calls rather than loops.
FFT – uses 4 floating point number arrays to perform the fast Fourier
transform. The data access pattern in the main loop of FFT depends on the
value of the data itself, resulting in a random access pattern.
FFT-I – inverse FFT with similar code to FFT. It has mostly non-
sequential data access but includes small sections with sequential data access.
FFT-I is used to determine the effects of CASP on applications with small
sections of sequential access.
MATVEC – a matrix vector multiplication that uses three data arrays
of different sizes accessed within loops with known loop bounds. The known
loop bounds enable insertion of reification calls by both automatic and manual
methods. MATVEC was used to test CASP for locking multiple data sets.
SCAN – see section 4.3 in chapter 4. To evaluate the automatic
insertion method under dynamic conditions, SCAN was modified to use dynamic
values for the variables ‘loops’ and ‘size’ (see alg. 4.2). SCAN generates
a worst-case scenario for the page replacement policies, which helps determine
the performance of CASP in the worst case.
Tests executed the benchmark applications in both single-application and
multiple-application scenarios. The experiments show results for the manual
and automatic insertion methods only: the hybrid method is essentially a
combination of the two, so for a given application its results would be the
better of the manual and automatic results.
Name     Description                      Input Data Set                          Memory required
MAD      MPEG layer I, II & III decoder   128 kbps, 96.25 min. MP3 data           93 MB
FFT      Fast Fourier transform           four 32-bit float arrays of size 2^22   64 MB
FFT-I    Inverse Fast Fourier transform   four 32-bit float arrays of size 2^22   64 MB
MATVEC   Matrix vector mult. application  three 32-bit integer arrays             101 MB
SCAN     Example benchmark application    continuous byte array                   100 MB

Table 5.1: Benchmark Applications
5.3.3 Single Application Scenario
This subsection presents the experimental results of executing each bench-
mark application on both flavours of Linux in a single application scenario.
Tables 5.2, 5.3, 5.4 and 5.5 list the execution time (in seconds), the number
of minor and major page faults, and the resident memory set size (RSS, in
pages) for each benchmark application. The following subsections describe the
paging performance, in terms of the number of major/minor page-faults, and
the memory usage of each application in turn.
                Original (O)                       using mlock (L)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,899  27,944   1,685    12,429    1,795  22,506   1,071    12,116
FFT            343  21,828     833     6,849    1,323  22,060     915    11,644
FFT-I          403  22,538     940     6,582    1,347  22,652   1,193    11,019
MATVEC       2,256  143,056  101,174  13,057    2,707  178,820  126,467  14,235
SCAN           638  78,721   31,315   12,778      860  100,939  32,847   13,431

Table 5.2: Single Application Performance in Linux-LRU (1)

                using CASP manual (M)              using CASP automatic (A)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,740  22,087     793     8,124    1,786  22,511   1,682    11,204
FFT            342  21,099     776     6,758      349  22,234     970     6,743
FFT-I          342  21,126     773     6,395      352  23,431   1,178     6,939
MATVEC       1,860  175,327  81,311   13,063    2,139  158,330  95,896   13,132
SCAN           544  94,281   21,975   12,876      417  106,319  15,652   13,084

Table 5.3: Single Application Performance in Linux-LRU (2)
MAD
In MAD, data is consumed across several different functions. Hence, it is
not easy to insert reification calls using the automatic method. The cloop
tool inserts the calls around the memory buffer used to store MPEG data.
Figures 5.1(a) and 5.1(b) plot the occurrence of minor and major page faults
for three versions of MAD executed on Linux-LRU: the original (O), a version
using mlock() (L) and a version using CASP with manual reification calls (M).
                Original (O)                       using mlock (L)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,872  33,605   2,351    10,386    1,794  23,669   1,013    11,010
FFT            352  22,542     865     6,614      511  24,068   10,845    8,678
FFT-I          413  24,724     952     6,619      576  22,792   12,567    9,242
MATVEC       1,419  165,618  75,686   12,834    1,281  152,678  68,977   12,959
SCAN           340  103,731  17,231   12,644      426  111,699  22,452   12,959

Table 5.4: Single Application Performance in Linux-CART (1)

                using CASP manual (M)              using CASP automatic (A)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,787  21,342     788     5,651    1,800  28,205   1,879    12,311
FFT            336  18,857     540     6,596      347  20,578     700     6,774
FFT-I          338  21,067     436     6,431      351  22,266     579     6,651
MATVEC       1,279  111,586  61,826   12,628    1,330  107,649  68,471   12,684
SCAN           368  107,527  14,278   12,939      373  93,568   21,814   12,523

Table 5.5: Single Application Performance in Linux-CART (2)
The x-axis shows the time elapsed during the execution of the application
and the y-axis the number of page-faults at the corresponding time. The
lines in the graphs terminate when the application finishes execution and
exits the system; a shorter line thus indicates a shorter execution time.
Note that the graphs provide data for CASP with manually inserted reification
calls only; the automatically inserted calls produced similar results. MAD
using CASP (M) generated fewer minor and major page faults than (O) and (L).
The steps in the graph correspond to the lock and release of partial regions
of the MPEG data buffer; the CASP curve is uniform and quite predictable in
nature.
Similar results can be seen in figures 5.1(c) and 5.1(d), which plot the
minor and major page faults for Linux-CART. On average, across both flavours
of Linux, MAD-CASP generated 29% and 6% fewer major page faults than MAD and
MAD-MLOCK respectively.

[Figure 5.1: MAD: Results on Linux-LRU and Linux-CART. Panels: (a) minor
page-faults (Linux-LRU); (b) major page-faults (Linux-LRU); (c) minor
page-faults (Linux-CART); (d) major page-faults (Linux-CART); (e) resident
memory set size (Linux-LRU); (f) resident memory set size (Linux-CART).
Curves: original (O), mlock (L), CASP manual (M); the RSS panels also show
the VM-size and per-curve averages.]
The graphs in figures 5.1(e) and 5.1(f) plot the RSS of MAD at any given
time during its execution on both Linux-LRU and Linux-CART respectively.
MAD-CASP (M) uses fewer resident memory pages than the other variants
because the reification calls inserted into MAD help CASP to lock and
release only the required memory pages at runtime; the working set of
MAD-CASP is thus reduced to only the CASP-locked pages in memory. The
consistent steps in the graphs correspond to CASP's lock and release
operations. On average, MAD-CASP uses nearly 4,000 fewer memory pages than
the other variants.
FFT
Application FFT has a mainly random memory access pattern. Manual
reification calls were added in and around the main application loop to
partially lock and release the data arrays at runtime. Since the access
pattern of FFT depends on the value of the data obtained within the loop,
the reification calls only attempt to lock 25% of each data array, starting
from the data value. The graphs in figures 5.2(a) and 5.2(b) plot the
occurrence of minor and major page faults for FFT in Linux-LRU. The
performance of FFT-MLOCK is the worst: it generates more page faults than
the original. CASP, however, performed only slightly better than the
original, mainly because of the unpredictable nature of FFT's memory access
pattern.
Similar graphs have been plotted for Linux-CART in figures 5.2(c)
and 5.2(d). The CASP mechanism performed much better for FFT in Linux-CART
than in Linux-LRU. This is because the CART paging policy maintains
frequency information about memory page accesses in addition to recency
information, so CASP benefits from the underlying page replacement policy,
which helps improve the performance of an application with a random access
pattern. The performance of FFT-MLOCK in Linux-CART was too poor to plot on
the same scale as the other curves and has therefore been omitted.

[Figure 5.2: FFT: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU; curves O, L, M); (c) minor and
(d) major page-faults (Linux-CART; curves O, M); (e) resident memory set
size (Linux-LRU); (f) resident memory size (Linux-CART). The RSS panels
also show the VM-size and per-curve averages.]
The resident memory size of FFT-CASP is almost identical to that of the
original FFT in both Linux-LRU and Linux-CART: due to FFT's random access
pattern, CASP was often observed to release and then re-lock the same pages
in memory (see the graphs in figures 5.2(e) and 5.2(f)).
FFT-I
Application FFT-I is very similar to FFT, differing in the final section of
the code, which calculates the inverse operation and has a sequential memory
access pattern. Apart from reification calls similar to those in FFT, calls
were also added to take advantage of this sequential access in the final
section.
The graphs in figures 5.3(a) and 5.3(b) plot the minor and major page faults
of FFT-I in Linux-LRU. In comparison with the results for FFT, it is evident
that CASP produced better results for FFT-I, because CASP exploits FFT-I's
sequential access pattern in the final section of its code.
[Figure 5.3: FFT-I: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU; curves O, L, M); (c) minor and
(d) major page-faults (Linux-CART; curves O, M); (e), (f) resident memory
set size. The RSS panels also show the VM-size and per-curve averages.]
Similar results were obtained for Linux-CART, as shown in figures 5.3(c)
and 5.3(d): using CASP, FFT-I performed much better in Linux-CART than FFT
did. From these two applications, FFT and FFT-I, it is evident that
reification calls that accurately describe the memory access pattern yield
much better results with CASP. Although CASP improved application
performance for both random and sequential access patterns, the real benefit
of the mechanism is gained by making the reification calls as accurate as
possible.
Again, the resident memory size of FFT-I using CASP was reduced in both
Linux-LRU and Linux-CART (see the graphs in figures 5.3(e) and 5.3(f)).
MATVEC
Application MATVEC uses three different data sets to perform complex
multiplication operations involving matrices, making it CPU intensive.
Owing to its known loop bounds, it was easy to insert reification calls
using both the manual and automatic methods. Figures 5.4(a) and 5.4(b) show
the occurrence of minor and major page faults for MATVEC in Linux-LRU.
MATVEC-CASP generated more minor page faults than the original MATVEC;
however, it generated fewer major page faults.
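The loop-bound-driven placement of reification calls described above can be sketched as follows. Here `casp_lock` and `casp_release` are hypothetical stand-ins for the CASP reification calls of chapter 4 (the real names and signatures are not shown in this chapter); the stubs merely record the requested ranges.

```python
# Sketch of reification-call placement for a loop with known bounds.
# casp_lock/casp_release are HYPOTHETICAL stand-ins for the CASP
# reification calls of chapter 4; these stubs only record the ranges.

locked = []  # (start_row, n_rows) ranges currently "locked"

def casp_lock(start, length):
    """Hint that the pages backing rows [start, start+length) be kept resident."""
    locked.append((start, length))

def casp_release(start, length):
    """Hint that those pages may now be reclaimed."""
    locked.remove((start, length))

def matvec(matrix, vec, block=64):
    """Matrix-vector product, locking one block of rows at a time."""
    n = len(matrix)
    result = [0] * n
    for start in range(0, n, block):       # known loop bounds allow the
        end = min(start + block, n)        # lock to be issued ahead of use
        casp_lock(start, end - start)
        for i in range(start, end):
            result[i] = sum(a * b for a, b in zip(matrix[i], vec))
        casp_release(start, end - start)   # release once the block is consumed
    return result
```

With accurate bounds the lock always covers exactly the data the next iterations will touch, which is what lets CASP shrink the working set without incurring extra faults.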
In Linux, reclaimed pages are stored in a region called the swap-cache
which is present in physical memory. These pages remain in the swap-cache
until they are actually moved to the swap space by a kernel thread [54]. Due
to the intensive memory and CPU operations in MATVEC, memory pages are
more frequently referenced. Instead of immediately swapping out the released
pages to swap space, CASP stores them in the swap-cache. Such pages that
exist in the swap-cache cause only a minor page fault when re-referenced
shortly afterwards; hence the increase in the number of minor page faults
for MATVEC-CASP.

[Figure 5.4: MATVEC: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU; curves O, M); (c) minor and (d) major
page-faults (Linux-CART; curves O, L, M); (e), (f) resident memory set size.
The RSS panels also show the VM-size and per-curve averages.]
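The minor/major fault counts reported throughout these tables can be sampled per process via getrusage(2); a minimal Linux sketch (the 4 MB buffer and its page-stride walk are illustrative choices, not taken from the thesis):

```python
# Read a process's minor/major page-fault counters via getrusage(2),
# exposed on Linux through the standard `resource` module.
import resource

def fault_counts():
    """Return (minor, major) cumulative page-fault counts for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt

minor0, major0 = fault_counts()
buf = bytearray(4 * 1024 * 1024)        # fresh anonymous memory ...
for off in range(0, len(buf), 4096):    # ... touched one page at a time
    buf[off] = 1
minor1, major1 = fault_counts()
# Touching fresh zero-filled pages raises only minor faults; major faults
# occur when a page must be fetched from swap or a file on disk.
```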
Similar results can be seen for MATVEC in Linux-CART (see figures 5.4(c)
and 5.4(d)), although there the execution time of MATVEC-CASP was slightly
longer and it generated more page faults than MATVEC-MLOCK.
Due to the intensive memory and CPU operations in MATVEC, the RSS of
MATVEC-CASP in both Linux-LRU and Linux-CART remains close to that of the
original application. The large variation in the curves shows the intensity
of the memory operations in MATVEC (see figures 5.4(e) and 5.4(f)).
SCAN
Application SCAN generates a worst-case scenario for most traditional page
replacement policies by stressing the virtual memory subsystem to its limits.
As with MATVEC-CASP, in Linux-LRU SCAN-CASP generated more minor page faults
and fewer major page faults than the original. The substantial reduction in
the number of major page faults resulted in better performance for SCAN-CASP
(see figures 5.5(a) and 5.5(b)).
Similar results were seen in Linux-CART. However, SCAN-CASP had a slightly
longer execution time than the original: the intensive memory operations of
SCAN continuously generated a large number of page faults, stressing the
lock and release operations of CASP. Thus, although SCAN-CASP generated
fewer major page faults, it incurred a minor overhead which increased its
execution time (see figures 5.5(c) and 5.5(d)).
The graphs in figures 5.5(e) and 5.5(f) show the intensity at which SCAN
accesses memory pages in both Linux-LRU and Linux-CART respectively.
[Figure 5.5: SCAN: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU); (c) minor and (d) major page-faults
(Linux-CART); (e), (f) resident memory set size (VM-size 25,933 pages).
Curves: O, L, M, with per-curve averages in the RSS panels.]
CASP strives to bring the resident memory size down to an acceptable level,
incurring a minor overhead in the process.
Overall Performance
The number of page-faults (both minor and major) and the execution times
of the benchmark applications were measured in a single application scenario.
Since the cost to handle a major page-fault is greater compared to that of a
minor page-fault, reducing the number of major page-faults will reduce the
paging overhead and improve the execution time of the application. Irrespec-
tive of the execution times, reduction in paging overhead is beneficial to the
overall system. This is particularly true in the case of an embedded system
with limited memory resource.
The graphs in figures 5.6(a), 5.7(a) and figures 5.6(b), 5.7(b) show the num-
ber of page-faults and the execution times of the corresponding benchmark ap-
plications executed individually in both Linux-LRU and Linux-CART. Shown
for each application are four bars: the original application (O), the application
using Linux’s mlock() primitives (L), the application using CASP with manual
insertion (M) and the application using CASP with automatic insertion (A).
Each bar is divided into two parts: in figures 5.6(a) and 5.7(a), the top
part shows the number of minor page-faults and the bottom part the number of
major page-faults; in figures 5.6(b) and 5.7(b), the top part shows the user
time and the bottom part the system time.
FFT has a nearly random data access pattern – difficult for the automatic
method to identify data locality. Manual reification calls were inserted at
small regions of sequential access. mlock() imposed a large overhead due to
the random access pattern – it thrashes. Thus, although (L) generates only a
few more page-faults, its execution time is much larger.

[Figure 5.6: Summary of Results for Linux-LRU. (a) Normalised total
page-faults (minor/major) and (b) normalised execution times (user/system
time), with four bars per application: O, L, M, A.]

[Figure 5.7: Summary of Results for Linux-CART. Same layout as figure 5.6.]

[Figure 5.8: Results for Multiple Applications (Linux-LRU). (a) Two-
application and (b) all-application workloads: normalised page-faults and
execution times.]
FFT-I, similar to FFT but with sequential data access in parts, showed
better results using CASP with manually inserted reification calls. Although
minor page-faults increased for the automatic method (A), the number of
major page-faults was still tolerable. This shows that CASP can also be used
for applications with random data access patterns.
For MATVEC, the aggressive loop-based multiplication operations on several
data arrays led to a greater number of page-faults. Known loop bounds made
it easy to insert reification calls by both methods. The results for (M) and
(A) indicate better performance than (L) and (O) in both major page-faults
and execution times: the manual insertion method reduced MATVEC's execution
time by nearly 18% and generated 20% fewer major page-faults.
Using CASP, SCAN generated fewer major page-faults (for both (M) and (A))
than (O) and (L). The automatic insertion method performed better than the
manual method because it added run-time conditional loop-bound checks for
the appropriate use of reification calls. Pages reclaimed from the inactive
list are stored in the swap-cache before being written out to swap space.
Since both SCAN and MATVEC aggressively walk through their data, stressing
the page replacement code, recently reclaimed pages are soon recalled into
the active list; if such pages still reside in the swap-cache, fetching them
causes only a minor page-fault. CASP releases isolated pages after discard()
is used, so such pages remain in the swap-cache when recalled. Hence the
observed increase in the number of minor page-faults for (M) and (A).
Across all individually executed benchmark applications in Linux-LRU, on
average, manually inserted reification calls generated 22% fewer major
page-faults, improving execution time by 13%; automatically inserted
reification calls generated 15.13% fewer major page-faults, improving
execution time by 9%. Clearly, the manual method outperformed the automatic
method, but the latter still yielded better results than the mlock()
primitives.
5.3.4 Multiple Applications Scenario
Two or more applications were executed simultaneously, with one of them
using CASP. Due to the similarity of the CASP implementation in both
flavours of Linux, the multiple-application experiments were executed on
Linux-LRU only. The workloads (see table 5.6) were: (1) TWO-SO – two
original SCAN processes; (2) TWO-SL – one original SCAN process and one SCAN
using mlock(); (3) TWO-SM – one original SCAN process and one SCAN with
manual reification calls; (4) TWO-SA – one original SCAN process and one
SCAN with automatic reification calls; (5) ALL-O – the original versions of
all benchmark applications; (6) ALL-1M – all original applications, with
SCAN using manual reification calls.
Workload   Time    Minor    Major    RSS
TWO-SO     2,240   223,925  33,953   (6,149 + 6,310)
TWO-SL     1,839   219,703  35,174   (7,341 + 5,532)
TWO-SM     1,612   218,514  28,728   (8,600 + 4,885)
TWO-SA     1,778   233,767  30,394   (7,489 + 5,292)
ALL-O      15,341  606,334  133,020  22,514
ALL-1M     13,454  621,451  109,003  20,019

Table 5.6: Results for Multiple Applications
The RSS for workloads (1) to (4) is divided into two parts, the first
representing the RSS of the original process. Figures 5.8(a) and 5.8(b) show
the major page-faults and execution times for the two-application and
all-application workloads. The two-application benchmark shows four bars: O,
L, M and A, representing two original processes (O), one mlock() process
(L), one CASP process with manual reification calls (M) and one CASP process
with automatic reification calls (A) respectively. Page-fault results are
shown as striped bars and execution times as dark bars. The all-application
benchmark shows two bars: O and M, representing all original processes (O)
and one SCAN process using CASP with manual reification calls (M)
respectively.
Note that since the results are normalised for both page-faults and
execution times, each figure plots both quantities against the same y-axis;
the bars for page-faults and execution times therefore cannot be compared
with each other.
The experiments enabled us to determine the effect of a single CASP process
on the entire system workload. Using mlock() in more than one application
resulted in out-of-memory errors, because mlock() was not designed for
dynamic locking. The experiments were thus limited to the use of locking
(either mlock() or CASP) in only one application per workload.
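For reference, mlock(2) pins a fixed range until it is explicitly unlocked, and the per-process RLIMIT_MEMLOCK limit applies to the whole locked range at once rather than to a shifting working set. A small Linux-only sketch via ctypes (error handling matters, since locking may be refused under a tight limit):

```python
# Pin and unpin one anonymous page with mlock(2)/munlock(2) via ctypes.
# Unlike CASP's runtime lock/release of partial regions, mlock holds the
# whole range until munlock -- the static behaviour discussed above.
import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
PAGE = mmap.PAGESIZE

def lock_unlock_one_page():
    """Try to mlock one page; return True on success, False if refused."""
    buf = mmap.mmap(-1, PAGE)  # one anonymous page
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE)) != 0:
        # Typically EPERM/ENOMEM when RLIMIT_MEMLOCK is exhausted.
        print("mlock refused:", os.strerror(ctypes.get_errno()))
        return False
    libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))
    return True
```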
The results show that CASP with manually inserted calls generated 15% fewer
major page-faults and reduced execution times by 28% in the two-application
scenario, and generated 18% fewer major page-faults and reduced execution
times by 12% in the all-application scenario. When the system ran out of
memory, CASP automatically released isolated pages back into the OS page
list. As explained in chapter 4, multiple applications using CASP are
therefore unlikely to degrade the performance of other application processes
by locking memory pages.
5.3.5 Memory Usage
By isolating pages from the global page lists, CASP adjusts to the current
process’s working-set and reduces its resident memory set size (RSS). Ta-
bles 5.2 and 5.3 list the average RSS values for individual benchmark execu-
tions in Linux-LRU. For MAD with manually inserted reification calls, CASP
reduces its RSS by 35%. Under stressed conditions, applications using CASP
have been shown to use slightly more RSS than (O) (see the results for
MATVEC and SCAN); however, in comparison with (L), the average RSS of (M) is
almost 24% lower. Note that a lower RSS for applications using CASP causes
less interruption to other processes in the system; in fact, it frees memory
for them to use. For instance, the two-application results in table 5.6 show
that the application using CASP uses fewer resident memory pages while the
original application uses more. This indirectly resulted in fewer major
page-faults for the original application as well, and also reduced its
execution time.
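The per-process RSS figures used throughout this chapter can be sampled on Linux from /proc/&lt;pid&gt;/statm, which reports sizes in pages; a minimal sketch:

```python
# Sample a process's resident set size (in pages) from /proc/<pid>/statm.
def rss_pages(pid="self"):
    with open(f"/proc/{pid}/statm") as f:
        fields = f.read().split()
    # statm fields (all in pages): size resident shared text lib data dt
    return int(fields[1])
```

Multiplying by the page size (4 KB here) converts this to bytes; polling it periodically during a run reproduces RSS curves of the kind shown in figures 5.1 to 5.5.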
5.3.6 Space Overhead
Insertion of reification calls increases the application code size. Table 5.7 lists
the compiled image size of each benchmark application. Manually inserted
reification calls added an overhead ranging from 0.4% to 21% over the original
application code, whilst automatic insertion added 0.8% to 23%. The overhead
is nearly equivalent to one extra memory page (assuming a 4KB page size),
which is negligible compared to the reduction in RSS and page-faults.
Benchmark  Original (O)  mlock (L)  Manual (M)  Automatic (A)
MAD        414,477       414,617    416,117     419,415
FFT        511,261       520,778    513,544     515,215
FFT-I      511,325       521,390    513,864     515,664
MATVEC     9,038         10,592     10,338      10,802
SCAN       8,912         9,729      10,751      10,997

Table 5.7: Benchmark Code Size (bytes)
Further reduction in code size could be achieved by optimising the CASPapp
library. The CASPos component mostly re-uses the existing memory management
code of the Linux kernel; additional code was added to implement the
framework, a new system call and the page-isolation routine. The final
kernel image (‘bzImage’) for both Linux-LRU and Linux-CART grew by only
0.6%. The CASP implementation in Linux has not been fully optimised and
still includes the reverse-mapping code [54] required for mlock(); removing
such unwanted code could further reduce the kernel size.
Linux version  Original   with CASP
Linux-LRU      2,046,074  2,059,147
Linux-CART     2,047,190  2,060,479
Table 5.8: Linux Kernel Image Sizes (in bytes)
Summarising, CASP reduced major page-faults (by 22% for single applications
and 18% for all applications together) as well as the RSS. The manual
insertion method was more accurate and performed better than the automatic
method. When paging is used to support out-of-core embedded applications,
CASP helps reduce the inherent paging overheads and improves application
execution times (by 13% for single applications and 12% for all applications
together). Furthermore, the CASP mechanism was shown to perform better than
the existing mlock() primitives found in Linux.
5.4 Summary
This chapter presented the implementation and evaluation of the reflection
framework and CASP mechanism in the Linux 2.6.16 kernel. CASP allowed
adaptation of Linux’s virtual memory management subsystem according to
application-specific memory requirements. CASP operates non-intrusively on
top of existing page replacement policies and uses reification calls inserted
into the application source to efficiently lock/release memory pages at
runtime. CASP
has been compared against the existing Linux system call mlock(). Evaluation
showed that applications using CASP generated fewer page-faults, required
fewer resident memory pages, and improved their overall execution times.
Furthermore, applications using manually inserted reification calls performed
better than those with automatic insertion.
Chapter 6
Conclusion
This chapter concludes the research work presented in this thesis. The
overall thesis contribution is presented in section 6.1, along with some
identified applications and limitations of the work in section 6.2. Future
research initiatives and directions are discussed in section 6.3. Finally,
section 6.4 presents the concluding remarks.
6.1 Thesis Contribution
Chapter 1 presented the central hypothesis of this thesis:
“Conventional CPU scheduling and memory management policies
in RTOS provide generic support that do not, in general, allow
application-specific resource control. This thesis contends that
application-specific control of processor scheduling and memory
management will provide better application support thereby im-
proving application performance. This thesis proposes a generic
reflective framework in the RTOS to efficiently capture application-
specific requirements and bring about fine-grained changes in the
resource management policies. The use of explicit reification in
application source code to specify the resource requirements will
provide better application support and improve performance”
The main objective of this research work was to prove the above hypothesis
by: showing that the generic resource management policies in the existing OSs
do not provide application-specific support; proposing and implementing a re-
flective OS framework that captures application-specific CPU and memory
requirements and accordingly adapts its policies; and proving that the pro-
posed framework helps to provide application-specific resource management
support to the applications and improves their performance.
In this respect, chapters 1 and 2 provided numerous examples of increasing
application resource requirements that are supported only by average-case
resource management policies in the OS. It was made clear that applications
need more control over OS resource management and the ability to adapt OS
policies to application-specific requirements. Moreover, the experiments
carried out in chapters 3, 4 and 5 showed that, under normal conditions,
applications relying on the generic resource management policies of the OS
often showed only average performance.
Chapter 2 reviewed the existing reflection mechanisms in programming languages,
middlewares and OSs. It emphasised the use of reflection mechanisms
to bring about runtime changes in the behaviour of a system. Chapter 3 put
forth modifications to the reflection mechanism, particularly to the process of
reification, and proposed a generic reflective OS framework. Later in that chapter,
the implementation of the proposed reflective framework in DAMROS, a
prototype RTOS, was described. DAMROS was implemented as a single address
space OS. It implemented a reflective CPU scheduler (the VRHS model)
and a reflective virtual memory manager (RMMS), both of which used the framework
to adapt or change their policies at runtime. Several experiments were carried
out to show the ability of the framework to bring about runtime changes in
these resource management policies. The experiments proved that, by providing
application-specific support, it is possible to improve application
performance.
Chapter 4 presented methods that could be used to provide support for explicit
reification in the framework. The chapter used virtual memory paging as
a case study and described three methods of inserting reification calls into
application source code: manual, automatic and hybrid. It presented CASP,
an OS mechanism that uses the reification calls to adapt the paging policy of
an OS. Later in that chapter, the implementation of PROTON, an on-the-fly
virtual memory simulator, was described. Simulation experiments involving
standard benchmark applications together with the reification calls and CASP
showed a significant improvement in paging performance, i.e. a reduction in the
total number of page faults generated.
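To make the manual method concrete, the sketch below shows where such a reification call might be placed in application code. The call name casp_reify, its arguments and its stub body are hypothetical, invented here for illustration; the actual CASP interface would trap into the OS rather than record the hint locally.

```c
#include <stddef.h>

/* Hypothetical reification call: the name casp_reify, its arguments and
 * the stub body are illustrative only, not the actual CASP interface.
 * The idea is that the application tells the OS how it is about to
 * access a memory region, so the paging policy can adapt accordingly. */
enum access_hint { HINT_SEQUENTIAL, HINT_RANDOM };

static enum access_hint last_hint = HINT_RANDOM;

static int casp_reify(void *base, size_t len, enum access_hint hint)
{
    (void)base; (void)len;
    last_hint = hint;   /* a real implementation would trap into the OS */
    return 0;
}

#define N (1 << 20)
static double data[N];  /* large array standing in for out-of-core data */

double sum_sequential(void)
{
    /* Manual method: the developer inserts the call immediately before
     * the sequential scan, so the pager can prefetch ahead of it. */
    casp_reify(data, sizeof data, HINT_SEQUENTIAL);

    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += data[i];
    return s;
}
```

In the automatic method the same call would be inserted by tooling rather than by hand, and the hybrid method combines the two.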
To verify the scalability of the framework and the CASP mechanism, chapter 5
presented an implementation in Linux. Two different flavours of the Linux
2.6.16 kernel, implementing the core elements of the reflective framework along
with the CASP mechanism operating on top of the LRU and CART [17] page
replacement policies, were described. The experiments in chapter 5 were
executed in a multiple address space OS (Linux) and involved benchmark
applications with reification calls inserted using the manual and automatic
methods. The results showed significant performance improvement for
applications using the framework, whereas under normal conditions applications
relying on the plain LRU and CART policies showed only average paging results.
In particular, this chapter proved that it is possible to improve application
performance and virtual memory management by using the framework to adapt
the underlying OS paging policy. The next section discusses the applications and
some identified limitations of this work.
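For reference, the LRU baseline that the experiments compare against can be sketched in a few lines. This is an illustrative in-memory model of the policy (page numbers and per-frame use stamps only), not the Linux 2.6.16 implementation.

```c
/* Minimal LRU page-replacement sketch: on a miss with no free frame,
 * the frame whose last use is oldest is evicted. Illustrative only. */
#define NFRAMES 3

static int frames[NFRAMES];
static unsigned long stamp[NFRAMES]; /* last-use time of each frame */
static unsigned long now;            /* logical clock */
static int used;                     /* frames currently occupied */

/* Returns 1 on a page fault, 0 on a hit. */
int lru_access(int page)
{
    now++;
    for (int i = 0; i < used; i++)
        if (frames[i] == page) { stamp[i] = now; return 0; }  /* hit */

    int victim = 0;
    if (used < NFRAMES) {
        victim = used++;             /* fill a free frame first */
    } else {
        for (int i = 1; i < NFRAMES; i++)
            if (stamp[i] < stamp[victim])
                victim = i;          /* evict least recently used */
    }
    frames[victim] = page;
    stamp[victim] = now;
    return 1;                        /* fault */
}
```

A mechanism like CASP operates above such a baseline, using application-supplied hints to override the pure recency ordering where appropriate.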
6.2 Applications and Limitations
Many different kinds of applications can make use of the reflective framework
in an RTOS. In industry, the development process of a product is distributed,
in that the applications are developed independently of an OS and vice
versa. This process may lead to integration problems during system deployment,
when the applications and the OS are integrated into the target system.
Using the work presented in this thesis, applications can adapt the OS policies
according to their specific requirements, thereby gaining better support.
Application developers need not worry about the target OS or the
resource management policies it implements. If the target OS implements
the reflective framework, developers can introduce application-specific UD
policies into the target OS, thereby eliminating integration issues related
to resource conflicts. This gives developers early confidence in the final
behaviour of the application.
The evidence presented in the thesis, particularly for virtual memory
management, showed better support for out-of-core applications with sequential
memory access patterns. The experiments presented in chapters 3, 4 and 5 also
used applications with non-sequential memory access patterns; although the
performance of such applications did improve, the improvement was not significant.
One limitation of the CASP mechanism described in chapter 4 is
that it does not consider shared memory pages. This could be addressed in
future work.
6.3 Future Directions
There are several directions in which this work can be carried forward. In
particular, it needs further evaluation with respect to resources
other than the CPU and memory. Efficient power management, for instance, could
be a particularly interesting application of this work. Another interesting
topic would be to study the effects of variations in paging activity on
process execution times.
This thesis presented the reflective framework as an open
framework not restricted by any particular API; an implementation may
choose a suitable API depending on its specific requirements. However,
a POSIX-style [98] API could help establish a common interface to the
framework, making applications portable across all of its implementations.
The DAMROS implementation already presents some
standard interfaces to the framework, but with only the CPU and memory
resources in mind. Future work could extend and standardise these interfaces to
accommodate other system resources.
The reflective framework supports sharing of meta-level components
amongst several different base-level components. Future work could explore
the possibility and impact of one or more shared meta-level components in the
reflective framework.
There are several reification calls an application could use to provide
information to the OS. This thesis described calls pertaining to the CPU and
memory resources. Future work could also identify key reification calls
associated with other resources in the system.
Many non-embedded systems such as desktop computers increasingly run
applications with real-time requirements. Also, some existing desktop
applications, such as graphical (picture or video) editors, use large amounts
of memory, often causing the system to thrash. This work could be further
explored in the context of applications for the non-embedded world.
The CASP mechanism could be extended to support applications using
shared memory pages. The reflective framework in the RTOS and the CASP
mechanism have been evaluated in the context of a single processor; the
model could be further extended to support multi-core CPUs and distributed or
SMP systems.
6.4 Concluding Remarks
This thesis emphasised the importance of resources in resource-constrained
embedded systems and highlighted the need for an OS to adapt its policies to
support application-specific resource requirements. The thesis focused mainly
on the CPU and virtual memory resources in the context of soft real-time
embedded systems.
Existing RTOSs provide average-case resource management support and do
not take into account dynamic changes in an application’s resource requirements.
As a first initiative, the proposed reflective framework, implemented in
an RTOS, allowed the CPU and virtual memory management policies to be
adapted or changed at runtime according to application requirements.
The applicability of the approach was tested by implementing
the framework in both a single and a multiple address space OS. In each
case, applications using the framework were able to adapt OS policies, which
significantly improved their performance.
Paging, with its associated page swap overheads, is generally regarded as
unsuitable for soft real-time embedded systems. This thesis showed that,
by using an application-specific paging mechanism, it is possible to reduce the
associated overheads, and thus made a successful attempt to show that
paging may be a viable approach.
As the complexity of systems increases, with more and more applications
being deployed onto single platforms, there is an ever-increasing need for an
RTOS to manage its resources efficiently. It is not possible for a single resource
management policy to satisfy the dynamic demands of all applications running
in a system. This thesis has taken one step towards providing application-specific
resource management support in an RTOS, particularly for the CPU and
memory resources. However, much remains to be accomplished.
Bibliography
[1] The Linux Kernel Mailing List (linux-kernel@vger.kernel.org). Online: http://www.lkml.org.
[2] Infineon Technologies. XC167CI 16-bit Single-Chip Microcontroller datasheets, October 2002.
[3] CompuLab. CM-X255 (ARMCORE-GX) Embedded Computer Module reference guide, April 2005.
[4] Sun Microsystems. Java™ 2 Platform Enterprise Edition, v1.4 API Specification, http://java.sun.com/j2ee/1.4/docs/api/, 2003.
[5] Microsoft Corporation. .NET Framework 3.5, 2007, http://msdn2.microsoft.com/en-us/library/w0x726c2(vs.90).aspx.
[6] ARM710T Datasheet, ARM DDI 0086B, ARM Ltd., UK, July 1998. Online: http://www.arm.com/documentation/ARMProcessorCores/.
[7] Virtualization Products, VMWare, Inc., Palo Alto, CA, USA. Online: http://www.vmware.com/.
[8] Samsung D840 Specifications, Samsung Electronics Co. Ltd., 2007, http://uk.samsungmobile.com/mobile/SGH-D840/spec.
[9] Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., and Young, M. Mach: A New Kernel Foundation for UNIX Development. Tech. rep., Computer Science Department, Carnegie Mellon University, August 1986.
[10] Agarwal, A., Sites, R. L., and Horowitz, M. ATUM: a new technique for capturing address traces using microcode. In ISCA ’86: Proceedings of the 13th Annual International Symposium on Computer Architecture (Los Alamitos, CA, USA, 1986), IEEE Computer Society Press, pp. 119–127.
[11] Cheng, A. M. K. Real-Time Systems: Scheduling, Analysis and Verification. Wiley-Interscience, August 2002.
[12] Aldea, M., Bernat, G., Broster, I., Burns, A., Dobrin, R., Drake, J. M., Fohler, G., Gai, P., Harbour, M. G., Guidi, G., Gutiérrez, J., Lennvall, T., Lipari, G., Martínez, J., Medina, J., Palencia, J., and Trimarchi, M. FSF: A Real-Time Scheduling Architecture Framework. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2006 (San Jose, CA, USA, April 2006).
[13] Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Burnett, N. C., Denehy, T. E., Engle, T. J., Gunawi, H. S., Nugent, J. A., and Popovici, F. I. Transforming Policies into Mechanisms with Infokernel. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), ACM Press, pp. 90–105.
[14] Audsley, N., Gao, R., Patil, A., and Usher, P. Efficient OS Resource Management for Distributed Embedded Real-Time Systems. In Proceedings of the Workshop on Operating Systems Platforms for Embedded Real-Time Applications (Dresden, Germany, July 2006).
[15] Austin, T., Blaauw, D., Mahlke, S., Mudge, T., Chakrabarti, C., and Wolf, W. Mobile Supercomputers. IEEE Computer 35, 5 (May 2004), 81–83.
[16] Bacon, D. F., Graham, S. L., and Sharp, O. J. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420.
[17] Bansal, S., and Modha, D. S. CAR: Clock with Adaptive Replacement. In FAST ’04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2004), USENIX Association, pp. 187–200.
[18] Barve, R. D., Grove, E. F., and Vitter, J. S. Application-Controlled Paging for a Shared Cache. SIAM Journal on Computing 29, 4 (2000), 1290–1303.
[19] Beck, M., Böhme, H., Dziadzka, M., Kunitz, U., Magnus, R., and Verworner, D. Linux Kernel Internals, second ed. Addison–Wesley, 1998.
[20] Belady, L. A. A Study of Replacement Algorithms for a Virtual-Storage Computer. IBM Systems Journal 5, 2 (1966), 78–101.
[21] Bernat, G., Burns, A., and Llamosi, A. Weakly hard real-time systems. IEEE Transactions on Computers 50, 4 (2001), 308–321.
[22] Bershad, B. N., Chambers, C., Eggers, S., Maeda, C., McNamee, D., Pardyak, P., Savage, S., and Sirer, E. G. SPIN: an Extensible Microkernel for Application-specific Operating System Services. SIGOPS Operating Systems Review 29, 1 (1995), 74–77.
[23] Bilgic, A. M., and Hemmert, J. W. The Algorithmic Driving Force. Infineon Technologies, 2006.
[24] Blair, G. S., Coulson, G., Andersen, A., Blair, L., Clarke, M., Costa, F., Duran-Limon, H., Fitzpatrick, T., Johnston, L., Moreira, R., Parlavantzas, N., and Saikoski, K. Reflective Middleware: The Design and Implementation of Open ORB 2. IEEE Distributed Systems Online (see http://www.computer.org/dsonline), 6 (September 2001).
[25] Bondavalli, A., Stankovic, J., and Strigini, L. Adaptable Fault Tolerance for Real-Time Systems. In Proceedings of the 3rd International Workshop on Responsive Computer Systems (September 1993).
[26] Brown, A. D., and Mowry, T. C. Taming the Memory Hogs: Using Compiler-Inserted Releases to Manage Physical Memory Intelligently. In Proceedings of the Fourth Operating Systems Design and Implementation Conference (OSDI) (October 2000), p. 72.
[27] Bryce, R. W. Chameleon, a dynamically extensible and configurable object-oriented operating system. PhD thesis, Victoria, B.C., Canada, 2003. Adviser: G. C. Shoja.
[28] Bryce, R. W., Murata, K., Shoja, G. C., and Manning, E. G. Porting and enhancements of a real-time object-oriented operating system. In Proceedings of the PacRim ’95 Conference (May 1995), IEEE.
[29] Burns, A., and Wellings, A. Real-Time Systems and Programming Languages, second ed. Addison–Wesley, 1997.
[30] Campos, J. L., Gutiérrez, J. J., and Harbour, M. G. Interchangeable Scheduling Policies in Real-Time Middleware for Distribution. In Proceedings of the 11th International Conference on Reliable Software Technologies, Ada-Europe (Porto, Portugal, 2006), pp. 227–240.
[31] Candea, G. M., and Jones, M. B. Vassal: Loadable Scheduler Support for Multi-Policy Scheduling. In Proceedings of the Second USENIX Windows NT Symposium (August 1998), pp. 157–166.
[32] Carr, S., McKinley, K. S., and Tseng, C.-W. Compiler Optimizations for Improving Data Locality. In ASPLOS-VI: Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 1994), ACM Press, pp. 252–262.
[33] Carvalho, D., Kon, F., Ballesteros, F., Roman, M., Campbell, R., and Mickunas, D. Management of Execution Environments in 2K. In Proceedings of the Seventh International Conference on Parallel and Distributed Systems (ICPADS’2000) (July 2000), IEEE Computer Society, pp. 479–485.
[34] Cazzola, W., and Ancona, M. mChaRM: A reflective middleware for communications-based reflection. Tech. Rep. DISI-TR-00-09, Università degli Studi di Milano, Milan, Italy, May 2000.
[35] Cheriton, D. R., and Duda, K. J. A Caching Model of Operating System Kernel Functionality. In Proceedings of the 1st Symposium on Operating Systems Design and Implementation (November 1994), ACM Press, pp. 179–194.
[36] Chiba, S. Load-Time Structural Reflection in Java. Lecture Notes in Computer Science 1850 (2000), 313.
[37] Cox, M., and Ellsworth, D. Application-Controlled Demand Paging for Out-of-Core Visualization. In VIS ’97: Proceedings of the 8th Conference on Visualization (Los Alamitos, CA, USA, 1997), IEEE Computer Society Press, pp. 235–ff.
[38] Crawford, J. H., and Gelsinger, P. P. Programming the 80386. SYBEX, 1987.
[39] de Lara, E., Wallach, D. S., and Zwaenepoel, W. HATS: Hierarchical Adaptive Transmission Scheduling for Multi-Application Adaptation. In Proceedings of the 2002 Multimedia Computing and Networking Conference (MMCN’02) (San Jose, CA, January 2002).
[40] Denning, P. J. The Working Set Model for Program Behavior. In SOSP ’67: Proceedings of the First ACM Symposium on Operating System Principles (New York, USA, 1967), ACM Press, pp. 15.1–15.12.
[41] Denys, G., Piessens, F., and Matthijs, F. A Survey of Customizability in Operating Systems Research. ACM Computing Surveys (CSUR) 34 (December 2002).
[42] Doller, E. Flash Memory Trends and Technologies. Intel Developer Forum, MEMS001, http://download.intel.com/idf/us/docs/PS_MEMS001.pdf, 2006.
[43] Eggers, S. J., Keppel, D. R., Koldinger, E. J., and Levy, H. M. Techniques for efficient inline tracing on a shared-memory multiprocessor. In SIGMETRICS ’90: Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1990), ACM Press, pp. 37–47.
[44] O’Neil, E. J., O’Neil, P. E., and Weikum, G. The LRU-K page replacement algorithm for database disk buffering. In Proceedings of the ACM SIGMOD International Conference on Management of Data (1993), pp. 297–306.
[45] Endo, Y., Gwertzman, J., Seltzer, M., Small, C., Smith, K. A., and Tang, D. VINO: The 1994 Fall Harvest. Tech. Rep. TR-34-94, Center for Research in Computing Technology, Harvard University, December 1994.
[46] Engler, D. R., Gupta, S. K., and Kaashoek, M. F. AVM: Application-level Virtual Memory. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V) (May 1995), p. 72.
[47] Engler, D. R., Kaashoek, M. F., and O’Toole, Jr., J. Exokernel: an Operating System Architecture for Application-level Resource Management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (1995), ACM Press, pp. 251–266.
[48] Feizabadi, S., Li, P., Ravindran, B., and Suhaib, S. A Formally Verified Application-Level Framework for Real-Time Scheduling on POSIX Real-Time Operating Systems. IEEE Transactions on Software Engineering 30, 9 (2004), 613–629.
[49] Fiat, A., and Rosen, Z. Experimental studies of access graph based heuristics: beating the LRU standard? In SODA ’97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (Philadelphia, PA, USA, 1997), Society for Industrial and Applied Mathematics, pp. 63–72.
[50] Foote, B., and Johnson, R. E. Reflective facilities in Smalltalk-80. ACM SIGPLAN Notices 24, 10 (1989), 327–335.
[51] Gall, D. L. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM 34, 4 (1991), 46–58.
[52] Gehani, N., and Ramamritham, K. Real-Time Concurrent C: A Language for Programming Dynamic Real-Time Systems. Real-Time Systems 3, 4 (December 1991).
[53] Necula, G. C., McPeak, S., Rahul, S. P., and Weimer, W. CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs. In Proceedings of the International Conference on Compiler Construction (CC 2002) (2002), pp. 213–228.
[54] Gorman, M. Understanding the Linux Virtual Memory Manager. Prentice Hall, April 2004.
[55] Goyal, P., Guo, X., and Vin, H. M. A Hierarchical CPU Scheduler for Multimedia Operating Systems. In Proceedings of the Second Symposium on Operating Systems Design and Implementation (Seattle, WA, October 1996), USENIX Association, pp. 107–121.
[56] Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE 4th Annual Workshop on Workload Characterization (December 2001).
[57] Hand, S. M. Self-Paging in the Nemesis Operating System. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (New Orleans, Louisiana, USA, February 1999), pp. 73–86.
[58] Harty, K., and Cheriton, D. R. Application-Controlled Physical Memory using External Page-Cache Management. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (New York, USA, 1992), vol. 27, ACM Press, pp. 187–197.
[59] Itoh, J.-I., Lea, R., and Yokote, Y. Using meta-objects to support optimisation in the Apertos operating system. In COOTS ’95: Proceedings of the USENIX Conference on Object-Oriented Technologies (Berkeley, CA, USA, 1995), USENIX Association.
[60] Infineon Technologies AG, 81726 München, Germany. TriCore 32-bit Unified Processor Core Embedded Applications Binary Interface (EABI), February 2007.
[61] Intel Corporation. Intel x86 Processor Family - Developer’s Manuals Vol. I, II and III, December 1998.
[62] Peterson, J. L., and Silberschatz, A. Operating System Concepts. Addison–Wesley, 1988.
[63] Jiang, S., Chen, F., and Zhang, X. CLOCK-Pro: an Effective Improvement of the CLOCK Replacement. In Proceedings of the 2005 USENIX Annual Technical Conference (USENIX ’05) (Berkeley, CA, USA, April 2005), USENIX Association.
[64] Jiang, S., and Zhang, X. LIRS: an Efficient Low Inter-reference Recency Set Replacement Policy to Improve Buffer Cache Performance. In SIGMETRICS ’02: Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2002), ACM Press, pp. 31–42.
[65] Johnson, T., and Shasha, D. 2Q: a Low Overhead High Performance Buffer Management Replacement Algorithm. In Proceedings of the Twentieth International Conference on Very Large Databases (Santiago, Chile, 1994), pp. 439–450.
[66] Kaplan, S. F. Collecting whole-system reference traces of multiprogrammed and multithreaded workloads. In WOSP ’04: Proceedings of the 4th International Workshop on Software and Performance (New York, NY, USA, 2004), ACM Press, pp. 228–237.
[67] Kaplan, S. F. Complete or fast reference trace collection for simulating multiprogrammed workloads: choose one. In SIGMETRICS ’04/Performance ’04: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2004), ACM Press, pp. 420–421.
[68] Kaplan, S. F., Smaragdakis, Y., and Wilson, P. R. Trace reduction for virtual memory simulations. In SIGMETRICS ’99: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1999), ACM Press, pp. 47–58.
[69] Kon, F., Campbell, R. H., Mickunas, M. D., Nahrstedt, K., and Ballesteros, F. J. 2K: A Distributed Operating System for Dynamic Heterogeneous Environments. In Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing (HPDC’9) (Pittsburgh, August 2000), pp. 201–208.
[70] Kon, F., Costa, F., Blair, G., and Campbell, R. H. The case for reflective middleware. Communications of the ACM 45, 6 (2002), 33–38.
[71] Kon, F., Roman, M., Liu, P., Mao, J., Yamane, T., Magalhães, C., and Campbell, R. H. Monitoring, security, and dynamic configuration with the dynamicTAO reflective ORB. In Middleware ’00: IFIP/ACM International Conference on Distributed Systems Platforms (Secaucus, NJ, USA, 2000), Springer-Verlag New York, Inc., pp. 121–143.
[72] Kon, F., Singhai, A., Campbell, R. H., Carvalho, D., Moore, R., and Ballesteros, F. J. 2K: A Reflective, Component-Based Operating System for Rapidly Changing Environments. In ECOOP ’98 Workshop on Reflective Object-Oriented Programming and Systems (Brussels, Belgium, July 1998).
[73] Krueger, K., Loftesness, D., Vahdat, A., and Anderson, T. Tools for the Development of Application-Specific Virtual Memory Management. In Proceedings of the OOPSLA ’93 Conference on Object-Oriented Programming Systems, Languages and Applications (1993), pp. 48–64.
[74] Lebeck, A. R., and Wood, D. A. Active memory: a new abstraction for memory system simulation. ACM Transactions on Modeling and Computer Simulation 7, 1 (1997), 42–77.
[75] Ledoux, T. OpenCORBA: A reflective open broker. In Proceedings of Reflection ’99 (July 1999), Springer-Verlag, pp. 197–214.
[76] Lee, D., Choi, J., Kim, J.-H., Noh, S. H., Min, S. L., Cho, Y., and Kim, C. S. LRFU (Least Recently/Frequently Used) Replacement Policy: A Spectrum of Block Replacement Policies. Tech. Rep. SNU-CE-AN-96-004, Seoul National University, March 1996.
[77] Lee, D., Choi, J., Kim, J. H., Noh, S. H., Min, S. L., Cho, Y., and Kim, C. S. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies. IEEE Transactions on Computers 50, 12 (2001), 1352–1361.
[78] Liedtke, J. L4 Reference Manual (486, Pentium, Pentium Pro). Tech. rep., GMD-German National Research Center for Information Technology, September 1996.
[79] Liu, C. L., and Layland, J. W. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM 20, 1 (January 1973), 46–61.
[80] Lund, K., and Goebel, V. Adaptive Disk Scheduling in a Multimedia DBMS. In MULTIMEDIA ’03: Proceedings of the 11th ACM International Conference on Multimedia (2003), ACM Press, pp. 65–74.
[81] Malenfant, J., Jaques, M., and Demers, F.-N. A Tutorial on Behavioral Reflection and its Implementation. In Proceedings of the Reflection 96 Conference (San Francisco, California, USA, April 1996), G. Kiczales, Ed., pp. 1–20.
[82] Malkawi, M., and Patel, J. Compiler Directed Memory Management Policy for Numerical Programs. In SOSP ’85: Proceedings of the Tenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 1985), ACM Press, pp. 97–106.
[83] Martonosi, M., Gupta, A., and Anderson, T. MemSpy: analyzing memory system bottlenecks in programs. In SIGMETRICS ’92/PERFORMANCE ’92: Proceedings of the 1992 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1992), ACM Press, pp. 1–12.
[84] Matsuoka, S., Ogawa, H., Shimura, K., Kimura, Y., Hotta, K., and Takagi, H. OpenJIT - A Reflective Java JIT Compiler. In Proceedings of the OOPSLA ’98 Workshop on Reflective Programming in C++ and Java (November 1998), pp. 16–20.
[85] McNamee, D., and Armstrong, K. Extending the Mach External Pager Interface To Accommodate User-Level Page Replacement Policies. In Proceedings of the USENIX Association Mach Workshop (1990), pp. 17–29.
[86] Megiddo, N., and Modha, D. S. ARC: A Self-Tuning, Low Overhead Replacement Cache. In FAST ’03: Proceedings of the 2nd USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2003), USENIX Association, pp. 115–130.
[87] Nieh, J., and Lam, M. S. The Design, Implementation and Evaluation of SMART: A Scheduler for Multimedia Applications. In SOSP ’97: Proceedings of the 16th ACM Symposium on Operating Systems Principles (October 1997), ACM Press, pp. 184–197.
[88] Niehaus, D. Program Representation and Translation for Predictable Real-Time Systems. In Proceedings of the IEEE Real-Time Systems Symposium (December 1991), pp. 43–52.
[89] Niehaus, D., Stankovic, J., and Ramamritham, K. The Spring System Description Language. Tech. Rep. UMASS TR-93-08, University of Massachusetts Amherst, 1993.
[90] Nutt, G. Operating Systems, third ed. Addison–Wesley, 2004.
[91] O’Neil, E. J., O’Neil, P. E., and Weikum, G. An optimality proof of the LRU-K page replacement algorithm. Journal of the ACM 46, 1 (1999), 92–112.
[92] Patel, K., Smith, B. C., and Rowe, L. A. Performance of a Software MPEG Video Decoder. In MULTIMEDIA ’93: Proceedings of the First ACM International Conference on Multimedia (1993), ACM Press, pp. 75–82.
[93] Patil, A. PROTON: a customisable on-the-fly Virtual Memory Simulator. Tech. Rep. YCS-2007-420, University of York, York, UK, 2007.
[94] Patil, A. VRHS: an Application Specific Reflective Hierarchical Scheduler. Tech. Rep. YCS-2007-419, University of York, York, UK, 2007.
[95] Patil, A., and Audsley, N. An Application Adaptive Generic Module-based Reflective Framework for Real-time Operating Systems. In Proceedings of the 25th IEEE Real-Time Systems Symposium, Work in Progress Session (Lisbon, Portugal, December 2004).
[96] Patil, A., and Audsley, N. Implementing Application-Specific RTOS Policies using Reflection. In Proceedings of the 11th IEEE Real-Time and Embedded Technology and Applications Symposium (San Francisco, 2005), pp. 438–447.
[97] Patil, A., and Audsley, N. Efficient Page lock/release mechanism in OS for out-of-core Embedded Applications. In Proceedings of the 13th IEEE Real-Time and Embedded Computing Systems and Applications Symposium (Daegu, Korea, August 2007), pp. 81–88.
[98] POSIX.1. IEEE Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) [C Language]. Tech. rep., IEEE Std 1003.1-1988, 1988.
[99] Regehr, J., and Stankovic, J. A. HLS: A Framework for Composing Soft Real-Time Schedulers. In Proceedings of the 22nd IEEE Real-Time Systems Symposium (RTSS’01) (London, UK, December 2001), IEEE Computer Society, pp. 3–14.
[100] Regehr, J. D. Using Hierarchical Scheduling to Support Soft Real-Time Applications in General-Purpose Operating Systems. PhD thesis, University of Virginia, May 2001.
[101] Rivard, F. A New Smalltalk Kernel Allowing Both Explicit and Implicit Metaclass Programming. In Proceedings of OOPSLA ’96, Workshop: Extending the Smalltalk Language (October 1996).
[102] Rivas, M. A., and Harbour, M. G. Application-Defined Scheduling in Ada. ACM Ada Letters XXII, 4 (December 2002), 77–84.
[103] Rivas, M. A., and Harbour, M. G. POSIX-Compatible Application-Defined Scheduling in MaRTE OS. In Proceedings of the 14th Euromicro Conference on Real-Time Systems (June 2002), IEEE Computer Society, pp. 67–75.
[104] Rivas, M. A., and Harbour, M. G. Proposal of Application-Defined Scheduling Interface. Proposal submitted for consideration by the Real-time POSIX Working Group, July 2002. URL: http://marte.unican.es/appsched-proposal.pdf.
[105] Rogers, P. Software Fault Tolerance, Reflection and the Ada Programming Language. PhD thesis, University of York, UK, October 2003.
[106] Rogers, P., and Wellings, A. J. OpenAda: A Metaobject Protocol for Ada 95.
[107] Rowe, L. A., Patel, K. D., Smith, B. C., and Liu, K. MPEG Video in Software: Representation, Transmission and Playback. In Proceedings of the Symposium on Electronic Imaging Science & Technology (February 1994).
[108] Seward, J., and Nethercote, N. Using Valgrind to detect undefined value errors with bit-precision. In Proceedings of the USENIX ’05 Annual Technical Conference (April 2005).
[109] Silberschatz, A., Galvin, P. B., and Gagne, G. Operating System Concepts, sixth ed. John Wiley & Sons, Inc., 2002.
[110] Singhai, A. Quarterware: a middleware toolkit of software RISC components. PhD thesis, Champaign, IL, USA, 1999. Adviser: Roy H. Campbell.
[111] Smaragdakis, Y., Kaplan, S., and Wilson, P. EELRU: Simple and Effective Adaptive Page Replacement. In SIGMETRICS ’99: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1999), ACM Press, pp. 122–133.
[112] Smith, B. C. Reflection and Semantics in a Procedural Language. PhD thesis, Massachusetts Institute of Technology, January 1982.
[113] Spencer, B., Wilson, L., and Doering, R. The Semiconductor Technology Roadmap. Tech. rep., Future Fab International, December 2005.
[114] Srivastava, A., and Eustace, A. ATOM: a system for building customized program analysis tools. In PLDI ’94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (New York, NY, USA, 1994), ACM Press, pp. 196–205.
[115] Stankovic, J. A. Reflective Real-Time Systems. Tech. Rep. 93-56, University of Massachusetts, 1993.
[116] Stankovic, J. A., and Ramamritham, K. The Spring Kernel: a New Paradigm for Real-Time Operating Systems. SIGOPS Operating Systems Review 23, 3 (1989), 54–71.
[117] Stankovic, J. A., and Ramamritham, K. The Spring Kernel: a New Paradigm for Real-Time Operating Systems. SIGOPS Operating Systems Review 23, 3 (1989), 54–71.
[118] Stankovic, J. A., and Ramamritham, K. A Reflective Architecture for Real-Time Operating Systems. Prentice-Hall, Inc., 1995.
[119] Stonebraker, M. Operating System Support for Database Management. Communications of the ACM 24, 7 (1981), 412–418.
[120] Tanenbaum, A. S., and Woodhull, A. S. Operating Systems: Design and Implementation, second ed. Prentice Hall, 1997.
[121] Turley, J. Operating Systems on the Rise. Embedded Systems Design, http://www.embedded.com/columns/surveys/187203732, 2006.
[122] Uhlig, R. A., and Mudge, T. N. Trace-driven memory simulation: a survey. ACM Computing Surveys 29, 2 (1997), 128–170.
[123] Venkatachalam, V., and Franz, M. Power Reduction Techniques for Microprocessor Systems. ACM Computing Surveys 37, 3 (2005), 195–237.
[124] Williams, N. J. An Implementation of Scheduler Activations on the NetBSD Operating System. In Proceedings of the FREENIX Track: USENIX Annual Technical Conference (June 2002).
[125] Winwood, S., and Heiser, G. Flexible Scheduling Mechanisms in L4. Tech. rep., University of New South Wales, Australia, November 2000.
[126] Wolf, W. Computers as Components: Principles of Embedded Computing System Design. Morgan Kaufmann, July 2005.
[127] Yang, Z., and Duddy, K. CORBA: a platform for distributed object computing. SIGOPS Operating Systems Review 30, 2 (1996), 4–31.
[128] Yokote, Y. The Apertos Reflective Operating System: The Concept and Its Implementation. In Conference Proceedings on Object-Oriented Programming Systems, Languages, and Applications (1992), ACM Press, pp. 414–434.
[129] Yokote, Y., and Tokoro, M. The new structure of an operating system: the Apertos approach. In Proceedings of the 5th Workshop on ACM SIGOPS European Workshop (New York, NY, USA, 1992), ACM.
[130] Zhu, M.-Y., Luo, L., and Xiong, G.-Z. The Minimal Model of Operating Systems. ACM SIGOPS Operating Systems Review 35 (July 2001).
[131] Zuberi, K. M., Pillai, P., and Shin, K. G. EMERALDS: a Small-Memory Real-Time Microkernel. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (1999), ACM Press, pp. 277–299.