APPLICATION-SPECIFIC RESOURCE
MANAGEMENT IN
REAL-TIME OPERATING SYSTEMS
By
Ameet Patil
SUBMITTED FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT THE
UNIVERSITY OF YORK
YORK, UK
SEPTEMBER 2007
© Copyright by Ameet Patil, 2007
To Parents and my wife Sushma
Table of Contents
Table of Contents i
List of Tables v
List of Figures vii
List of Abbreviations and Symbols xi
Abstract xix
Acknowledgements xxi
Declaration xxiii
1 Introduction 1
1.1 Technological Growth Versus Application Complexity . . . . . 3
1.1.1 Resource Constraints . . . . . . . . . . . . . . . . . . . 7
1.2 Resource Management in RTOS . . . . . . . . . . . . . . . . . 8
1.2.1 Existing Approaches to Efficient Resource Management 9
1.3 Reflection Mechanism . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Thesis Proposition . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Resource Management and Operating System Specialisation 19
2.1 Real-time Embedded Systems . . . . . . . . . . . . . . . . . . 19
2.1.1 Types of Real-time Systems . . . . . . . . . . . . . . . 20
2.1.2 Categorising Real-time Embedded Systems . . . . . . . 21
2.1.3 Application-specific RTOS Specialisation . . . . . . . . 23
2.1.4 Resource-Constrained Real-time Embedded Systems . . 27
2.2 Resource Management in an OS . . . . . . . . . . . . . . . . . 28
2.2.1 CPU Resource . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.2 Memory Resource . . . . . . . . . . . . . . . . . . . . . 32
2.3 Operating System Specialisation . . . . . . . . . . . . . . . . . 39
2.3.1 Specialisation of OS policies . . . . . . . . . . . . . . . 40
2.4 Reflection Mechanisms . . . . . . . . . . . . . . . . . . . . . . 41
2.4.1 Reflective Programming Languages . . . . . . . . . . . 44
2.4.2 Reflective Middlewares . . . . . . . . . . . . . . . . . . 48
2.4.3 Reflective OSs . . . . . . . . . . . . . . . . . . . . . . . 51
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3 Reflection in RTOS for Efficient Resource Management 61
3.1 Modifications to Reflection Mechanism . . . . . . . . . . . . . 62
3.1.1 Modifications to the Process of Reification . . . . . . . 63
3.1.2 Role of the Kernel . . . . . . . . . . . . . . . . . . . . 64
3.1.3 Component Privileges . . . . . . . . . . . . . . . . . . 65
3.1.4 Infolevel for Reified Information . . . . . . . . . . . . . 68
3.1.5 Categorisation of Reified Information . . . . . . . . . . 68
3.1.6 Flow of Reified Information . . . . . . . . . . . . . . . 72
3.1.7 In-kernel Reflection Interface . . . . . . . . . . . . . . . 75
3.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Generic Reflective RTOS Framework . . . . . . . . . . . . . . 78
3.2.1 Core Elements of the Framework . . . . . . . . . . . . 79
3.2.2 Optional Elements of the Framework . . . . . . . . . . 80
3.2.3 Reflective System Modules . . . . . . . . . . . . . . . . 82
3.2.4 Reflective Applications . . . . . . . . . . . . . . . . . . 84
3.2.5 Meta Object Protocol for Reflective Components . . . 85
3.3 Prototype Implementation – DAMROS . . . . . . . . . . . . . 86
3.3.1 Reflection Interface in the Kernel . . . . . . . . . . . . 88
3.3.2 The rManager . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.3 The iManager . . . . . . . . . . . . . . . . . . . . . . . 102
3.3.4 Reflective CPU Scheduler (VRHS) . . . . . . . . . . . 111
3.3.5 Reflective Memory Management System (RMMS) . . . 126
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.4.1 Timing Analysis . . . . . . . . . . . . . . . . . . . . . . 137
3.4.2 Changing Application Behaviour . . . . . . . . . . . . 137
3.4.3 Evaluation of VRHS . . . . . . . . . . . . . . . . . . . 139
3.4.4 Evaluation of RMMS . . . . . . . . . . . . . . . . . . . 159
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4 Support for Reification: a Case Study 169
4.1 Paging Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.2 Reification Calls for Paging . . . . . . . . . . . . . . . . . . . 176
4.2.1 keep(<address>, <size>) . . . . . . . . . . . . . . . . 176
4.2.2 discard(<id>) . . . . . . . . . . . . . . . . . . . . . . . 177
4.3 Inserting Reification Calls . . . . . . . . . . . . . . . . . . . . 177
4.4 Manual Insertion Method . . . . . . . . . . . . . . . . . . . . 179
4.5 Automatic Insertion Method . . . . . . . . . . . . . . . . . . . 181
4.5.1 Automatic Insertion for C Language . . . . . . . . . . 182
4.5.2 Comparison of Manual and Automatic Insertion . . . . 190
4.6 Hybrid Insertion Method . . . . . . . . . . . . . . . . . . . . . 191
4.7 Design of CASP Mechanism . . . . . . . . . . . . . . . . . . . 192
4.7.1 CASPapp Component . . . . . . . . . . . . . . . . . . . 194
4.7.2 CASPos Component . . . . . . . . . . . . . . . . . . . 194
4.7.3 Page-isolation Technique . . . . . . . . . . . . . . . . . 196
4.7.4 Use of the Reflection Framework . . . . . . . . . . . . 197
4.8 Evaluation Strategy . . . . . . . . . . . . . . . . . . . . . . . . 198
4.9 Virtual Memory Simulation . . . . . . . . . . . . . . . . . . . 199
4.9.1 Trace-driven Simulation . . . . . . . . . . . . . . . . . 199
4.9.2 On-the-fly Simulation . . . . . . . . . . . . . . . . . . . 200
4.10 PROTON Virtual Memory Simulator . . . . . . . . . . . . . . 201
4.10.1 PROTON Annotations . . . . . . . . . . . . . . . . . . 202
4.10.2 Simulation of Multiple Applications . . . . . . . . . . . 206
4.10.3 Implementing UD Paging Policies . . . . . . . . . . . . 209
4.11 Simulation Experiments using PROTON . . . . . . . . . . . . 210
4.11.1 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 210
4.11.2 Single Application . . . . . . . . . . . . . . . . . . . . 213
4.11.3 Multiple Applications . . . . . . . . . . . . . . . . . . . 215
4.11.4 Slow-down Factor . . . . . . . . . . . . . . . . . . . . . 218
4.12 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
5 Implementation of CASP in a Commodity OS (Linux) 221
5.1 Overview of Linux 2.6.16 Kernel . . . . . . . . . . . . . . . . . 222
5.1.1 CART Implementation in Linux . . . . . . . . . . . . . 223
5.2 Implementation in Linux . . . . . . . . . . . . . . . . . . . . . 224
5.2.1 Reflection Framework . . . . . . . . . . . . . . . . . . . 224
5.2.2 CASP Mechanism . . . . . . . . . . . . . . . . . . . . . 226
5.2.3 Page-isolation in Linux-LRU . . . . . . . . . . . . . . . 227
5.2.4 Page-isolation in Linux-CART . . . . . . . . . . . . . . 227
5.3 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . 228
5.3.1 Hardware Platform . . . . . . . . . . . . . . . . . . . . 228
5.3.2 Benchmark Applications . . . . . . . . . . . . . . . . . 228
5.3.3 Single Application Scenario . . . . . . . . . . . . . . . 230
5.3.4 Multiple Applications Scenario . . . . . . . . . . . . . 245
5.3.5 Memory Usage . . . . . . . . . . . . . . . . . . . . . . 247
5.3.6 Space Overhead . . . . . . . . . . . . . . . . . . . . . . 247
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
6 Conclusion 251
6.1 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . 251
6.2 Applications and Limitations . . . . . . . . . . . . . . . . . . 254
6.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 255
6.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . 256
Bibliography 259
List of Tables
3.1 Measured Execution Times of DAMROS Interfaces . . . . . . 138
3.2 No Reflection, Basic RR Scheduler . . . . . . . . . . . . . . . 140
3.3 Reflection with One High Priority Application . . . . . . . . . 140
3.4 Reflection with One High Priority Application . . . . . . . . . 141
3.5 Reflection with One High Priority and Other Varying Priority
Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.1 Description of Benchmark Applications . . . . . . . . . . . . . 211
4.2 Single Application Benchmark Results for LRU . . . . . . . . 212
4.3 Two Applications Scenario for LRU (1) . . . . . . . . . . . . . 216
4.4 Two Applications Scenario for LRU (2) . . . . . . . . . . . . . 216
4.5 Three Applications Scenario for LRU . . . . . . . . . . . . . . 217
5.1 Benchmark Applications . . . . . . . . . . . . . . . . . . . . . 230
5.2 Single Application Performance in Linux-LRU (1) . . . . . . . 231
5.3 Single Application Performance in Linux-LRU (2) . . . . . . . 231
5.4 Single Application Performance in Linux-CART (1) . . . . . . 232
5.5 Single Application Performance in Linux-CART (2) . . . . . . 232
5.6 Results for Multiple Applications . . . . . . . . . . . . . . . . 245
5.7 Benchmark Code Size (bytes) . . . . . . . . . . . . . . . . . . 248
5.8 Linux Kernel Image Sizes (in bytes) . . . . . . . . . . . . . . . 248
List of Figures
1.1 Choice of RTOS for Embedded System Implementation [121] . 2
1.2 Trends in Application Complexity and Processor Speed [23] . 4
1.3 Projected Trends in Mobile Application Complexity [23] . . . 5
1.4 Need for Greater Secondary Storage [42] . . . . . . . . . . . . 7
2.1 MPEG Input Streams for Decoding . . . . . . . . . . . . . . . 25
2.2 Hierarchical Scheduling Structure . . . . . . . . . . . . . . . . 29
2.3 Tower of Reflection (Reproduced from [81,105]) . . . . . . . . 42
2.4 Object/Meta-Object Separation and Meta-Hierarchy [128] . . 53
3.1 Reification through the Kernel . . . . . . . . . . . . . . . . . . 65
3.2 Modifications to Reflection . . . . . . . . . . . . . . . . . . . . 66
3.3 In-kernel Reflection Interface . . . . . . . . . . . . . . . . . . . 77
3.4 Structure of a Reflective System Module . . . . . . . . . . . . 83
3.5 Code Snippet of Reify Interface . . . . . . . . . . . . . . . . . 96
3.6 rManager: Saving Reified Information . . . . . . . . . . . . . . 98
3.7 Two-level Scheduler in DAMROS . . . . . . . . . . . . . . . . 112
3.8 Structure of Reflective CPU Scheduler Module . . . . . . . . . 113
3.9 URQ: Representation of Threads . . . . . . . . . . . . . . . . 116
3.10 Operation of the VRHS Model . . . . . . . . . . . . . . . . . . 119
3.11 Pseudo-code of rScheduler Thread . . . . . . . . . . . . . . . . 122
3.12 Application-specific UD Scheduler Blocks . . . . . . . . . . . . 124
3.13 User-Defined FCFS Scheduler . . . . . . . . . . . . . . . . . . 125
3.14 Structure of the RMMS Model . . . . . . . . . . . . . . . . . . 129
3.15 Reflective Memory Management System (RMMS) . . . . . . . 131
3.16 Operation of the RMMS model . . . . . . . . . . . . . . . . . 133
3.17 Application-specific UD Paging Policy . . . . . . . . . . . . . 135
3.18 Pseudo-code for Thread T2 . . . . . . . . . . . . . . . . . . . . 143
3.19 Results of Experiment #1 . . . . . . . . . . . . . . . . . . . . 150
3.20 Results of Experiment #2 . . . . . . . . . . . . . . . . . . . . 152
3.21 Using RR Scheduler . . . . . . . . . . . . . . . . . . . . . . . . 154
3.22 Using UD Scheduler . . . . . . . . . . . . . . . . . . . . . . . 155
3.23 RR Vs UD Scheduler . . . . . . . . . . . . . . . . . . . . . . . 156
3.24 Static Vs Reflective LRU . . . . . . . . . . . . . . . . . . . . . 160
3.25 Experiment #1: Page-faults . . . . . . . . . . . . . . . . . . . 163
3.26 Page-faults for RMMS . . . . . . . . . . . . . . . . . . . . . . 165
4.1 OS Paging Model . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.2 Benchmark Application – ‘scan’ . . . . . . . . . . . . . . . . . 178
4.3 Manual Insertion for ‘scan’ . . . . . . . . . . . . . . . . . . . . 180
4.4 Steps Involved in Automatic Insertion . . . . . . . . . . . . . . 183
4.5 Pass-1 of the cloop Tool . . . . . . . . . . . . . . . . . . . . . 185
4.6 Pass-2 of the cloop Tool . . . . . . . . . . . . . . . . . . . . . 186
4.7 CIL Transformation of ‘scan’ . . . . . . . . . . . . . . . . . . 188
4.8 Automatic Method for ‘scan’ . . . . . . . . . . . . . . . . . . . 189
4.9 Design of CASP Mechanism . . . . . . . . . . . . . . . . . . . 193
4.10 PROTON Design Model . . . . . . . . . . . . . . . . . . . . . 203
4.11 ‘scan’ with Traditional Annotation . . . . . . . . . . . . . . . 204
4.12 ‘scan’ with PROTON Annotations . . . . . . . . . . . . . . . 205
4.13 BSORT Simulation . . . . . . . . . . . . . . . . . . . . . . . . 215
4.14 FFT Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 215
4.15 MATVEC Simulation . . . . . . . . . . . . . . . . . . . . . . . 215
4.16 SCAN Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 215
5.1 MAD: Results on Linux-LRU and Linux-CART . . . . . . . . 233
5.2 FFT: Results on Linux-LRU and Linux-CART . . . . . . . . . 235
5.3 FFT-I: Results on Linux-LRU and Linux-CART . . . . . . . . 237
5.4 MATVEC: Results on Linux-LRU and Linux-CART . . . . . . 239
5.5 SCAN: Results on Linux-LRU and Linux-CART . . . . . . . . 241
5.6 Summary of Results for Linux-LRU . . . . . . . . . . . . . . . 243
5.7 Summary of Results for Linux-CART . . . . . . . . . . . . . . 243
5.8 Results for Multiple Applications . . . . . . . . . . . . . . . . 243
List of Abbreviations and
Symbols
Abbreviation Meaning
AGEING AGEING is a page replacement policy.
APEX APEX is a two-level disk scheduler.
API Application Programming Interface
ARC Adaptive Replacement Cache page replacement policy.
ATOM ATOM is a static code annotation based trace collection tool.
ATUM ATUM uses microcode to efficiently capture address traces.
BSORT Bubble sort algorithm/application
CAR CLOCK with Adaptive Replacement combines the advantages
of CLOCK and ARC policies.
CART CART is an extension of the CAR policy with a temporal filter
that improves upon the defects of CAR/ARC.
CASP Cooperative Application-Specific Paging mechanism.
CIL C Intermediate Language tool chain.
CLOCK CLOCK is an easy-to-implement page replacement policy.
CLOS CLOS is a programming language that supports reflection.
CORBA Common Object Request Broker Architecture is a standard
enabling software components written in multiple computer
languages and running on multiple computers to work to-
gether.
CPU Central Processing Unit.
DAMROS Dynamically Adaptive Micro-Reflective Operating System.
DBMS DataBase Management System.
DLL Dynamic Link Library.
DMA Direct Memory Access is a feature of modern computers that
allows certain hardware subsystems within the computer to
access system memory for reading and/or writing indepen-
dently of the central processing unit.
DSP Digital Signal Processor.
DVD Digital Versatile Disc is a popular optical disc storage media
format.
EDF Earliest-Deadline-First scheduling policy.
EELRU Early Eviction Least Recently Used page replacement policy.
FCFS First-Come-First-Serve policy.
FERT A notational language – Fault Tolerant Entities for Real-Time
for specifying fault-tolerant requirements on a task-by-task
basis.
FFT Fast Fourier Transformation algorithm/application.
FIFO First-In-First-Out policy.
FP Fixed Priority scheduling policy.
GOPI GOPI is a middleware layer.
GPS Global Positioning System.
GUI Graphical User Interface.
HATS An adaptive hierarchical scheduling scheme using the puppeteer
system for scheduling network bandwidth.
HLS The Hierarchical Loadable Scheduler is a hierarchical schedul-
ing scheme.
ID An Identification number.
iManager The component that manages the interception mechanism in the
DAMROS operating system.
IP Internet Protocol.
IPC Inter-Process Communication.
J2EE Java 2 Platform, Enterprise Edition.
JIT Just-In-Time.
JVM Java Virtual Machine.
KB Kilobytes.
LFU Least Frequently Used page replacement policy.
LIRS Low Inter-reference Recency Set page replacement policy.
LISP List Processing Language is a programming language favoured
for Artificial Intelligence research.
LRFU Least Recently/Frequently Used page replacement policy.
LRU Least Recently Used page replacement policy.
MAD MPEG decoder application.
MATVEC Matrix Vector multiplication application.
MAX Maximum value.
MB Megabytes.
MCU Micro-Controller Unit.
MFU Most Frequently Used page replacement policy.
µ-kernel Micro-kernel operating system architecture.
MM Memory Management.
MMU Memory Management Unit.
MOP Meta Object Protocol.
MPEG The Moving Picture Experts Group, commonly referred to as
simply MPEG, is a working group of ISO/IEC charged with
the development of video and audio encoding standards.
MPI Message Passing Interface is both a specification and its implemen-
tation that allows many computers to communicate with one an-
other.
MRU Most Recently Used page replacement policy.
ms Milliseconds.
µs Microseconds.
OLR Optimal LRU Reduction is an algorithm to reduce the mem-
ory access trace size.
ORB Object Request Broker.
OS Operating System.
PC Personal Computer.
PCB Process Control Block.
PDA Personal Digital Assistant.
POSIX Portable Operating System Interface is the collective name of
a family of related standards specified by the IEEE to define
the application programming interface for software compati-
ble with variants of the Unix operating system, although the
standard can apply to any operating system.
PREMO Page REplacing Memory Object is a pager in the Mach operating
system that executes in the application address space.
PROTON PROTON is a virtual memory simulator to simulate multiple
applications workload.
QNX QNX is a commercial POSIX-compliant Unix-like real-time
operating system, aimed primarily at the embedded systems
market.
RAID Redundant Arrays of Independent Disks is a technology that
employs the simultaneous use of two or more hard disk drives
to achieve greater levels of performance, reliability, and/or
larger data volume sizes.
RAM Random Access Memory.
RISC Reduced Instruction Set Computer represents a CPU design.
RM Rate Monotonic scheduling policy.
rManager The component that manages the reification process in the
DAMROS operating system.
RMI Remote Method Invocation.
RMMS Reflective Memory Management System.
RPM Revolutions Per Minute.
RR Round Robin scheduling policy.
RSS Resident memory Set Size.
RTCC Real-Time Concurrent C is a high-level programming lan-
guage.
RTOS Real-Time Operating System.
SA The Scheduler Activations (SA) model is an API that provides
a kernel interface and scheduler up-call mechanism to support
the hierarchical scheduling scheme.
SAD Safely Allowed Drop is an algorithm to reduce LRU memory
trace size.
SCAN SCAN is a micro-benchmark application to stress the virtual
memory subsystem.
SDL System Description Language is a description language used
to specify details of a network node in the Spring operating
system.
SDRAM Synchronous Dynamic Random Access Memory.
SEGQ Segmented Queue page replacement policy.
SFQ Start-time Fair Queueing is a scheduling policy used to sched-
ule the intermediate nodes of a hierarchical scheduler.
SMART SMART is an optimised scheduling scheme that adapts to the
working set of applications.
SMP Symmetric multiprocessing.
SMS Short Message Service is a communications protocol allowing
the interchange of short text messages between mobile tele-
phone devices.
SRAM Static Random Access Memory.
TCP Transmission Control Protocol.
UD User-Defined policy.
URL Uniform Resource Locator.
URQ Universal Run Queue.
VM Virtual Memory.
VRHS Virtually Reflective Hierarchical Scheduler.
WYNIWYG What You Need Is What You Get.
Calloc The cost of allocating a page in memory.
CASPapp The application component of CASP mechanism.
CASPos The operating system component of CASP mechanism.
Cmajor The cost to handle a major page-fault.
Cminor The cost to handle a minor page-fault.
Cpage The cost of a page-in or page-out operation.
Dlock The minimum amount of memory region (in bytes) that the
OS mechanism needs to lock.
Dsize The size of the memory region (in bytes) being accessed.
ε(t) The time a thread spends in the system executing on the
central processing unit.
Et The time at which a thread finishes its execution and leaves
the system.
Mfree The total free memory at any given time.
Mtotal The total memory in the system.
Nfree The number of available free pages.
Nprocess The number of different application processes running in the
system.
ω(t) The time spent by a thread waiting from the time it entered
the system until it first executes on the CPU.
Oτ The constant time taken by the operating system to perform
activities other than paging.
Pτ The time taken by the operating system to perform paging
activity.
St The time at which a thread first starts its execution.
Sτ The time spent executing the operating system code to per-
form system activities.
Tτ The turn around time of an application process.
TTRnd(t) The time taken by a thread from the time it entered the sys-
tem to the time it leaves the system.
Uτ The time spent executing the application code.
Abstract
Complex applications impose greater resource demands on resource-limited soft real-time
embedded systems. The Real-time Operating System (RTOS) should efficiently manage
the system's resources amongst several such applications. Built for the general case, rather
than to meet application-specific requirements, the RTOS is unable to meet the dynamic
resource demands of the applications. It provides only average-case support for the increasing
application resource requirements, leading to poor application performance. On the other hand,
applications that may be able to predict their resource requirements at runtime have no
control over the RTOS's resource management policies (e.g. CPU, memory, etc.).
In particular, giving applications control over the processor scheduling and memory
management will provide efficient resource management support. In order to provide
such application-specific resource management support, this thesis proposes a reflective
RTOS framework that allows fine-grained changes to the RTOS’s resource management
policies. Reification calls, inserted into the application source code, inform the RTOS about
application-specific resource requirements. The reflection framework uses this information
to adapt the RTOS policies accordingly.
The proposed RTOS framework has been implemented in a prototype µ-kernel DAM-
ROS and also in a commodity OS Linux (2.6.16 kernel). The experiments performed to
evaluate the reflection framework, along with the use of reification calls in the context of
virtual memory (paging), have shown significant improvement in paging performance. The total
number of page-faults in the system was reduced by 22.3% and the application performance
improved by 12.5%.
Acknowledgements
There are many people without whom the work in this thesis would not have
been completed. I would like to express my sincere gratitude to them all. In
particular, my supervisor, Dr. Neil Audsley for his constant encouragement,
guidance and support throughout my years at the university.
I am very grateful to friends and colleagues in the Real-time Systems Group
at York for their vital feedback on my work. In particular, I would like to thank
Adam Betts, Anant Kapdi, Rachel Baker, Ian Broster, Rui Gao, Micheal Ward
and Andrew Borg for their help and encouragement. Thanks to Professor Andy
Wellings for his helpful comments during the assessment process.
Many thanks to my childhood friends Kiran Baloji, Ganesh Jannu and
Prasad Kori for their constant support and for making special those all im-
portant breaks. Special thanks goes to my teacher Jayashree Bhagoji in India
without whom it would not have been possible. She has been instrumental in
shaping my career.
I am indebted to my lovely wife Sushma for her continued support through-
out. Finally, to my parents, my sister Sneha and my brother-in-law Ravi and
all my relatives in India, my gratitude for their patience, love and support.
This work has been as much an effort on their part as it was on mine.
York, UK Ameet Patil
September, 2007
Declaration
I hereby declare that, unless otherwise stated in the text, the research work
presented in this thesis is original and undertaken by myself, Ameet Patil,
between October 2003 and September 2007 under the guidance of my supervi-
sor, Dr. Neil Audsley. I have acknowledged external sources where necessary
through bibliographic referencing. Parts of this thesis have previously been
published as technical reports or conference papers as listed below.
The generic reflective framework for RTOS presented in chapter 3 was ini-
tially published as a work-in-progress paper at the IEEE Real-time Systems
Symposium (RTSS) [95]. The reflection framework along with the µ-kernel
implementation – DAMROS appeared as a full paper in the IEEE Real-time
and Embedded Technology and Applications Symposium (RTAS) [96]. Part
of this work was also published at the International Workshop on Operating
System Platforms for Embedded Real-Time Systems (OSPERT) [14]. The
hierarchical scheduling model – VRHS was published as a technical report [94].
Chapter 4 presents the design and experiments using the on-the-fly simulator
– PROTON that was published as a technical report [93]. The methods of in-
serting application hints, the design of CASP and its implementation in Linux
appeared as a full paper in the IEEE Real-time and Embedded Computing
Systems and Applications Symposium (RTCSA) [97].
Chapter 1
Introduction
The use of a real-time operating system (RTOS) within soft real-time em-
bedded systems aids runtime resource management and supports application
development via the layer of abstraction provided by the RTOS. Almost 71%
of the embedded real-time systems designed in the year 2006 used an RTOS
(see figure 1.1), of which 50% used a commercial RTOS [121]. In systems
with limited resources that support complex applications having varying re-
source requirements, an RTOS should manage the system resources efficiently
to provide application-specific resource management support.
Applications in soft real-time embedded systems are becoming increasingly
complex. The mobile telecommunications industry typifies this rising appli-
cation complexity. Successive product generations place increased demands
upon the target platform [113]. Many computationally intensive applications
such as software radio, cryptography, augmented reality, speech recognition
and mobile applications such as e-mail and word processing are making their
way into future mobile platforms [15]. In [15], it is estimated that in order to
support the above applications, a platform would require about 16 times as
much computing “horsepower” as a 2-GHz Intel Pentium 4 processor.
Figure 1.1: Choice of RTOS for Embedded System Implementation [121]
The complexity and sophistication of the CPUs within many current soft
real-time embedded systems have increased to meet application demands.
System platforms ranging from tiny micro-controller devices, for example the
Infineon XC167CI [2] (containing a 40MHz CPU, 12KB Random Access
Memory (RAM) and 256KB flash), to embedded systems building blocks such
as computer-on-modules, typified by the CM-X255 [3] (containing Intel's XScale
(Arm) PXA255 CPU at up to 400MHz, 64MB SRAM and 512MB flash), are
quite sophisticated and provide advanced features such as virtual memory.
In systems that use an RTOS, efficient resource management is important
in order to support complex applications. Most commercial RTOSs use generic
resource management policies which provide average case support [41,121]. For
instance, to manage virtual memory by paging (by using a secondary storage
device, paging allows applications to use more memory than is physically avail-
able), the most commonly used generic page replacement policy in RTOSs such
as Embedded Linux is the least recently used (LRU) policy [9, 54, 86].
Different applications have different memory access patterns and differ-
ent memory requirements. The LRU policy uses only recency information of
memory page accesses to determine which pages are least used by application
processes. It does not consider other information such as a page’s frequency
of access, etc. For applications with varying memory demands and long-term
memory page access patterns, LRU may not provide the best support.
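The recency-only behaviour of LRU can be made concrete with a minimal sketch (illustrative only, not the thesis implementation; all names here are hypothetical): each frame records the page it holds and the logical time of its last access, and on a fault the frame with the oldest access time is evicted, however frequently its page has been used.

```c
#define NFRAMES 3   /* number of physical page frames (hypothetical) */
#define EMPTY   -1

/* Illustrative LRU sketch: per-frame page number and last-access time. */
static int  frame_page[NFRAMES];
static long frame_last[NFRAMES];
static long now;    /* logical access clock */

void lru_init(void)
{
    for (int i = 0; i < NFRAMES; i++) {
        frame_page[i] = EMPTY;
        frame_last[i] = 0;      /* empty frames look "oldest" */
    }
    now = 0;
}

/* Simulate one memory access; returns 1 on a page-fault, 0 on a hit. */
int lru_access(int page)
{
    int victim = 0;

    now++;
    for (int i = 0; i < NFRAMES; i++) {
        if (frame_page[i] == page) {
            frame_last[i] = now;        /* hit: refresh recency only */
            return 0;
        }
        if (frame_last[i] < frame_last[victim])
            victim = i;                 /* least recently used so far */
    }
    /* Fault: evict the LRU frame, regardless of access frequency. */
    frame_page[victim] = page;
    frame_last[victim] = now;
    return 1;
}
```

With three frames, the reference string 1, 2, 3, 1, 4, 2 incurs five faults: page 2 (last touched at time 2) is evicted for page 4 because only recency is consulted, so it must fault back in immediately afterwards. A policy that also tracked frequency could make a different choice here.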
This thesis proposes a framework for an RTOS to allow runtime changes
to the resource management policies so as to adapt to the application-specific
resource requirements. Resource management in RTOS is examined in the
context of soft real-time embedded systems that use complex CPUs. The rest
of this chapter discusses the importance of resource management in an RTOS
and existing support for applications with dynamic resource requirements (sec-
tions 1.1 and 1.2). The thesis proposal and the contributions are detailed in
sections 1.4 and 1.5. Section 1.6 describes the thesis outline.
1.1 Technological Growth Versus Application
Complexity
In the mobile handset industry, the application complexity has surpassed the
growth in processor speed. Figure 1.2 shows the rise in application complexity
with respect to the first (1G), second (2G) and the third generation (3G) of
mobile handsets. The figure shows that the growth in the processor perfor-
mance has fallen behind the rising application complexity [23]. To meet this
challenge, the industry is driving the development of more powerful CPUs
involving multiple CPU cores.

Figure 1.2: Trends in Application Complexity and Processor Speed [23]

The TriCore [60] microcontroller from Infineon is the first single-core 32-bit
MCU-DSP (microcontroller – digital signal processor) architecture optimised
for real-time embedded systems. It unifies the best of three worlds: the
real-time capabilities of microcontrollers, the computational power of DSPs,
and the price/performance benefits of RISC load-store architectures.
The use of more processing power leads to additional power consumption,
which affects portable, battery-operated devices such as mobile handsets.
Unfortunately, battery technology has not improved as significantly as CPU
technology, restricting the up-time of portable devices to only a few hours [23].
For instance, the talk time of the latest Samsung D840 [8], a multimedia-rich
mobile handset, is up to only 2.8 hours. This does not include normal usage
of the phone, i.e. sending multimedia messages, recording pictures/videos,
playing games, playing music/videos, downloading Internet content and
sending/receiving emails. Under normal use it would require a recharge almost
every day.

Figure 1.3: Projected Trends in Mobile Application Complexity [23]
Figure 1.3 (reproduced from [23]) outlines the observed and predicted
growth in the functionality incorporated in a mobile handset in the previous
and coming years. The first generation (1G in figure 1.2) mobile handsets in
the 1980s were only capable of making voice calls, requiring relatively little
processing power. Later generations introduced a text messaging service, popularly
known as SMS (Short Messaging Service) or Texting. This allowed mobile
users to exchange text messages without having to make voice calls. PC- and
console-based software games slowly made their way into mobile handsets.
Today’s mobile handsets are filled with advanced features. They are Internet-
ready (web browser, chat clients like Yahoo Messenger), multimedia-rich (MP3
playback, voice/still image/video recording, graphics-rich games) and support
various wireless technologies (WiFi, Bluetooth). Many of them can also be
used as a PDA (Personal Digital Assistant).
Such an increase in functionality usually translates into the implementation
of multiple single- or multi-threaded applications. Running more applications
increases the computation and storage requirements, placing greater demands
on the system resources. Memory is one such resource whose demand increases
with application complexity. Mobile handsets, for instance, support the use of
an additional secondary storage device in the form of Flash memory cards. The
usage of flash memory technology in mobile handsets is becoming common (see
figure 1.4). In year 2006 alone, more than 20% of the handsets included nearly
512MB of flash memory [42].
This trend is also apparent in other areas of embedded systems. For
instance, car manufacturers are introducing vehicles with GPS (Global Po-
sitioning System) navigation, real-time traffic reports, satellite radio, DVD
playback, MP3 player, voice-controlled operations, hard-drive music storage
and many such capabilities all in one integrated unit [15]. To provide such
functionality, either multiple complex applications or multiple software com-
ponents integrated to form a single application are deployed in the system.
The role of an RTOS is to efficiently share the system resources such as CPU,
memory, etc. amongst the different applications without affecting the overall
system performance.

Figure 1.4: Need for Greater Secondary Storage [42]
1.1.1 Resource Constraints
Systems that are battery operated and portable often have several resource
constraints. The following are some of the commonly found ones [15]:
• Power: limited by battery capacity. The system (e.g. a mobile
handset) should use minimal power since it operates on battery;
• CPU: the greater a CPU’s processing ability, the more power it con-
sumes [123]. There is therefore a trade-off between CPU speed and
power requirements. Proper sharing of the CPU resource by the RTOS
results in better power utilisation [123];
• Memory: main memory is an important resource for embedded ap-
plications. Primary memory is not only expensive, but the more of it is used
the more power it consumes. The rise in application code size, together with
the increase in memory required for data storage and processing, leads to
greater memory utilisation. The RTOS needs to share memory efficiently
amongst the different applications.
Adhering to such constraints requires that the applications and the un-
derlying RTOS manage the system resources efficiently. Mismanagement of
resources or provision of generic resource management policies in the RTOS
results in resource conflicts often leading to poor application performance [15].
1.2 Resource Management in RTOS
A conventional RTOS is built without the knowledge of applications that would
execute upon it – i.e. the RTOS is built for the general case, rather than to
meet application-specific requirements. Whilst a few custom built RTOSs that
support a single or fixed set of critical applications (e.g. in the aviation indus-
try) use specialised resource management policies, most commercial RTOSs
supporting the entire embedded applications domain (including mobile ap-
plications) implement generic resource management policies. These generic
policies do not identify runtime application-specific resource requirements,
and consequently may not provide the best possible support.
With little or no application-specific support from the generic policies, the
system shows poor performance, forcing developers to disable the use of
certain advanced features such as virtual memory [37,58]. Paging is often dis-
abled because the generic page replacement policies generate considerable
page-swap overhead that affects system performance. Thus, rather
than trying to provide application-specific paging support, the feature is com-
pletely disabled. For example, the ARM microprocessors (ARM7 10T), which
have been widely deployed in embedded systems, have a full MMU (Memory
Management Unit) with support for virtual memory [6], yet generic page re-
placement policies lead to poor virtual memory management because they may
not be able to identify the specific memory requirements of the applications.
Application resource requirements can be dynamic and non-deterministic in
nature, depending upon runtime factors. The performance of soft real-time
embedded systems that are constrained by limited resources may be further
degraded by the use of generic resource management policies in the RTOS.
No single resource management policy can equally satisfy the dynamic
resource requirements of all the applications in the system. The next subsec-
tion describes existing approaches and techniques for providing better resource
management in an RTOS.
1.2.1 Existing Approaches to Efficient Resource Management
Existing approaches are coarse-grained – i.e. they involve applications chang-
ing their own functionality by altering the processes that are executed, whilst
leaving the RTOS unchanged. However, many subtle changes in behaviour
can be achieved without altering the process functionality, rather by modify-
ing the resource management policies and the individual parameters governing
them in an RTOS. Such fine-grained change would allow the same application
functionality to be executed, but perhaps at different times or rates, poten-
tially using different resources. For example, in response to changes in the
environment an application may need to change the rates at which individual
processes are executed, or perhaps the resources that individual processes use.
On the one hand, to satisfy the dynamic requirements of the applications,
the RTOS needs to adapt to changing application behaviour and re-
source requirements. On the other hand, it is the applications, not the RTOS,
that are in a better position to predict their actual behaviour and resource
requirements at runtime. There is therefore a need for the applications to
control and change the way the RTOS manages resources.
Giving the applications complete control over the RTOS resource manage-
ment policies is not an ideal solution. This is because in a multi-programmed
environment a change brought in by one application can have adverse effects
on other applications. Although the application designers are in a better
position to know the resource requirements of the application, for overall
system predictability and safety the control over managing resources should
remain with the underlying RTOS.
Furthermore, by sharing information between the applications and the
RTOS, the resource management policies may be able to adapt or change in
order to support the application resource requirements. It might be possible to
achieve this using reflection [105,112,115,118,128]. The next section describes
the reflection mechanism, first introduced in programming languages, which
allows information to be shared and fine-grained changes to be made to either
code or data at runtime.
1.3 Reflection Mechanism
The mechanism by which an application becomes ‘self-aware’ and changes
itself accordingly, either to alter its behaviour or to improve its performance,
is called reflection [105,112,115,118,128].
The reflection mechanism originated in programming languages such as
Smalltalk, CLOS and Lisp. Many modern programming languages have also
been extended to support reflection [81]. For example, extensions (in the form
of library packages) to programming languages such as Ada, Java and C++
have been developed to support reflection [105].
In order to achieve reflection, an application needs to be aware of many as-
pects of its design and implementation, e.g. its data structures, language con-
structs/semantics, run-time support system (or virtual machine). The mech-
anism by which this information is made available to an application is called
reification [105].
A reflective entity (e.g. the application) is divided into a base-level and
a meta-level component. The base-level component consists of the general
application functionality, or main application code, and reifies information
to the meta-level component. The meta-level component uses the
reified information to adjust the required application functionality at runtime
by changing the code or data in the base-level.
The process of reification can either be implicit or explicit. Implicit reifica-
tion is built into the development model where essential information is auto-
matically reified by the use of language constructs or compiler techniques. Ex-
plicit reification requires the application developer to explicitly add reification
calls into the application source code. Such reification calls could essentially
reify the runtime application-specific resource requirements. However, since
all resource management code resides in the RTOS, the application meta-level
component will not be able to help the application by changing its base-level
code or data.
This thesis proposes a generic reflective framework built into an RTOS
to obtain reified application information in order to bring about fine-grained
changes to the resource management policies. The next section describes the
thesis proposition.
1.4 Thesis Proposition
In order to support applications requiring fine-grained change to RTOS’s re-
source management policies, an RTOS must provide mechanisms that enable
such a change, whilst maintaining the predictability required by the real-time
application in terms of time and resource usage. A generic framework in the
RTOS that sets out guidelines for the resource management policies to auto-
matically handle changes is required.
The central hypothesis of this thesis is:
“Conventional CPU scheduling and memory management policies
in an RTOS provide generic support that does not, in general, al-
low application-specific resource control. This thesis contends that
application-specific control of processor scheduling and memory
management will provide better application support, thereby im-
proving application performance. This thesis proposes a generic
reflective framework in the RTOS to efficiently capture application-
specific resource requirements and bring about fine-grained changes
in the resource management policies. The use of explicit reification
in application source code to specify the resource requirements will
provide better application support and improve performance.”
Using the reflection mechanism [112, 115, 118, 128], a generic reflective
RTOS framework has been proposed that enables the flow/exchange of valu-
able information – (1) within the RTOS between the kernel and several re-
source management modules; and (2) between the RTOS and the application
processes.
Further, with the use of reification (by inserting reification calls into the
application source code), the RTOS is able to gain valuable insight on the
application-specific resource requirements at runtime. This information is then
combined with existing information collected within the RTOS kernel and
forwarded to the concerned resource management module(s). These module(s)
then make fine-grained changes to their policies so as to accommodate the
current application requirements.
1.5 Contribution
This thesis proposes an approach of using a reflection mechanism built into
the RTOS in the form of a generic framework to support application-specific
resource management. At runtime, the approach uses explicit reification to
identify and communicate the application-specific resource requirements to
the RTOS. In particular, this thesis mainly focuses on resource management
pertaining to CPU scheduling and the virtual memory paging technique.
As a first step, a generic reflective RTOS framework has been proposed that
establishes communication paths between applications and the RTOS kernel;
and between the resource management modules and the kernel. The frame-
work allows the reflective resource management modules to make fine-grained
changes to the code/data affecting the resource management or completely
change the resource management policy in use. Under the framework the re-
source management modules can also choose to be non-reflective (in which case
the framework imposes no or minimal overhead onto the respective modules).
An initial prototype µ-kernel, DAMROS [95,96], has been developed in or-
der to implement and verify the proposed reflective framework. Two reflective
system modules, a reflective CPU scheduler and a reflective virtual memory
module, have also been implemented within DAMROS. Several experiments
involving the two reflective resource management modules and custom built
artificial benchmark applications have been performed. The applications are
shown to dynamically adapt the RTOS’s resource management policies ac-
cording to application-specific resource requirements, which resulted in better
application performance.
This thesis describes the use of explicit reification for virtual memory as
a case study in order to capture runtime resource usage information from the
applications.
Three methods of inserting reification calls are described: manual,
automatic and hybrid. For automatic insertion of memory usage reifi-
cation calls, a tool called cloop has been implemented. The tool searches the
application source code for regions with large amounts of data access (data
hot-spots) and inserts reification calls around them. The reification calls inform
the RTOS framework about the application’s future memory requirements and
usage patterns.
In this case study, a simple and efficient Operating System (OS) paging
mechanism called CASP [97] that uses the reified information within the
framework is presented. The case study shows that using the framework, it
is possible to implement a simple reflective module that operates on top of
an existing resource management policy in the system. CASP uses the ‘page-
isolation’ technique that allows it to transparently lock memory pages without
affecting the normal operation of the OS.
An on-the-fly virtual memory simulator, PROTON [93], has been imple-
mented to verify the benefits of explicit reification in the context of virtual
memory. PROTON supports virtual memory simulation for a multiple-
application workload, which helps evaluate overall system performance by
simulating the entire workload (multiple applications) at a time; no existing
virtual memory simulator can simulate multiple applications. Different page
replacement policies can be plugged into PROTON, allowing a system engi-
neer to test and verify the effects of various page replacement policies on the
application workload prior to implementation in an RTOS. Using the simula-
tor, a system engineer can gain valuable insight into an application’s paging
performance before its deployment. PROTON has been used for simulation
experiments involving CASP with applications using explicit reification calls.
Finally, the implementation of the core framework and the CASP mech-
anism in a commodity OS, Linux (2.6.16 kernel), is presented. Experiments
involving several embedded benchmark applications chosen from MiBench [56]
(an embedded benchmark suite) show the effectiveness of CASP and the
framework. The results show a considerable reduction in paging overhead and
a significant improvement in application performance. In Linux, CASP has
been implemented and evaluated in the context of two different page replace-
ment policies: the LRU/CLOCK-based [54] policy and the CART [17] policy.
1.6 Outline
This thesis is organised as follows: the next chapter introduces existing work –
describing constraints on real-time embedded systems, the OS design architec-
tures, resource management in an RTOS pertaining to the CPU and memory
resource, existing OS specialisation techniques and the reflection mechanism
as an OS specialisation technique. The chapter also presents a survey of exist-
ing use of reflection in programming languages, middleware technologies and
reflective OSs.
Complex applications have varying resource requirements that are often
non-deterministic in nature. Thus, any information pertaining to resource
usage and requirements of the applications could be quite valuable to the re-
source management policies of the RTOS. Chapter 3 investigates and proposes
a reflection-based generic RTOS framework that allows runtime adaptation
of the resource management policies depending on application requirements.
Experiments involving a prototype implementation of a µ-kernel, DAMROS,
along with two example reflective system modules, a reflective CPU sched-
uler and a reflective memory management module, are performed to verify the
effectiveness of the RTOS framework.
Further, in chapter 4, a case study on virtual memory (paging) is carried
out to illustrate the various methods of reification in the framework. Three
different methods of inserting reification calls into the application source code:
manual, automatic and hybrid methods are described. The design of another
OS mechanism, CASP [97], for virtual memory management (paging) which
works on top of existing page replacement policies is described. Experiments
involving the use of reification calls along with the CASP mechanism via sim-
ulation show considerable improvement in the performance of paging as well
as the applications.
The scalability of the reflective framework and the CASP mechanism are
then investigated in a commodity OS. Chapter 5 describes implementation of
the reflective framework and the CASP mechanism in two flavours of Linux,
one using an LRU/CLOCK [54] based page replacement policy and the other
using a CART [17] page replacement policy. Experiments involving the frame-
work and CASP are performed on both flavours of Linux and the results
compared against conventional applications and an existing solution based
on Linux’s mlock() primitives [19].
Finally, chapter 6 presents conclusions to the work in this thesis along with
a detailed layout for future work.
Chapter 2
Resource Management and Operating System Specialisation
This chapter provides a detailed study of the existing technology and ap-
proaches involving operating systems, resource management and OS speciali-
sation techniques. The chapter is organised as follows. The next section pro-
vides the background on real-time systems emphasising resource constraints
and the specialisation required to support application-specific requirements.
Section 2.2 discusses existing techniques and policies used in OS resource man-
agement, particularly for the CPU and memory. In section 2.3, OS specialisa-
tion techniques to accommodate increasing application resource demands are
discussed. Finally, in section 2.4, reflection mechanisms are discussed in the
context of programming languages, middlewares and operating systems.
2.1 Real-time Embedded Systems
The last few decades have seen the use of computers in many diverse shapes
and forms in our day-to-day life. Computers are used in devices ranging from
coffee machines to highly sophisticated flight control systems in aircraft. The
systems that embed computer hardware and software for a particular purpose
or application (e.g. a ticketing machine) are called embedded systems [29].
Real-time systems are those in which the time at which a result is pro-
duced is as important as the logical result of the computation itself [29]. Em-
bedded systems requiring such timing behaviour are called real-time embedded
systems. Examples of real-time embedded systems include ticketing machines,
coffee machines, washing machines, automotive anti-lock braking system, in-
dustrial robots, space station control systems, battery operated devices, wire-
less telecommunication systems, aircraft, military defence systems, medical
systems, etc.
2.1.1 Types of Real-time Systems
The cost of an error or failure varies between real-time systems.
For example, if a coffee machine fails or delays delivering a coffee, then the
user can wait a little longer or easily have the machine fixed. However,
if the flight control system of an aircraft misbehaves or fails in flight,
then the end result could be catastrophic. Depending on this factor, real-time
systems are classified into different types:
• Soft real-time systems [11],
• Hard real-time systems [11],
• Weakly-hard real-time systems [21].
Real-time tasks in a system are characterised by several timing constraints
such as deadline, inter-arrival time and jitter [29]. Reliability and predictabil-
ity are the two main characteristics of a real-time system. Soft real-time sys-
tems are those that can occasionally afford to miss a deadline and still be
functional. A typical example is an MP3 player where it is acceptable to have
minor disruptions in sound every now and then caused by delays in decoding.
Hard real-time systems cannot tolerate faults: a deadline miss
in such a system may have catastrophic effects. For example,
the cost of a failure in a flight control system is severe compared to that
in an MP3 player. Such systems may still miss a deadline provided that it
happens in a known, predictable way [29]. Weakly-hard real-time systems are
those real-time systems that can tolerate a clearly specified degree of missed
deadlines [21].
2.1.2 Categorising Real-time Embedded Systems
Many real-time embedded systems that provide complex functionality use
multiple application processes, which in turn can be either single- or multi-
threaded. It is common to find an OS in real-time embedded systems
for efficient management of resources so as to provide better support for such
complex applications. An OS is the main software program that acts as a
bridge between the underlying hardware and the applications that execute
upon it [62]. It is responsible for sharing the available system resources (e.g.
CPU, memory, networks, etc.) amongst the application tasks/processes [120].
In order to support complex real-time applications the OS needs to satisfy
their resource requirements.
Not all real-time embedded systems make use of an OS. In general, real-time
embedded systems can be categorised as follows [130]:
• systems without an OS: these systems are relatively simple and have
everything hardwired into them. Making a change to such systems can
be a time-consuming and difficult process.
• systems that use a simple OS: the OS is mainly involved in monitoring
system activity or it only supports a simple application without much
complexity.
• systems that use a commercial general purpose RTOS [41,131]: such sys-
tems are identified by the characteristics of the RTOS they deploy (e.g.
VxWorks, pSOSystems, OS-9, QNX, Windows CE, etc.). These RTOSs
support different kinds of application timing characteristics. However,
such RTOSs are general purpose OSs that provide average case resource
management support and are not optimised for any particular appli-
cation. For optimisation, such OSs need to be manually tweaked and
configured.
• systems that use a custom built RTOS: increasing software complexity
requires better OS abstractions. Commercial RTOSs are general pur-
pose and tend to provide unnecessary additional features, adding to
system complexity. For critical applications found in hard real-time sys-
tems, the use of a custom built RTOS is common. This ensures that the
RTOS is optimised for the applications concerned and provides the best
possible support. However, such an RTOS is tied to those applications,
and it is not feasible to change the OS each time the application
requirements change.
In recent years, commercial RTOSs have become quite flexible allowing
them to be configured for a required specification. Most embedded systems
(real-time or non real-time) deploy a commercial RTOS rather than using
a home-grown (custom built) one, thus saving the costs of additional
development and maintenance.
2.1.3 Application-specific RTOS Specialisation
The increasing demand for additional functionality increases the system com-
plexity. Applications whose resource requirements depend on certain run-
time stimuli are heavily dependent on the resource management support pro-
vided by the RTOS. In systems with constrained resources, such applica-
tions put greater resource demands on the underlying RTOS requiring cus-
tom application-specific policies. Ideally, an RTOS needs to adapt its resource
management policies according to runtime resource requirements of the appli-
cations.
Changes to RTOS policies can be both static and dynamic in nature. Static
changes are easier to handle in that they only require rebuilding the OS with
the code for new policies. However, dynamic changes are difficult to handle
since the changes must be applied at runtime. As an example of a
dynamic change, consider an application whose output depends on certain
environmental or external factors. Depending on a particular external stimulus,
the application may require a change in its priority (a change in the RTOS’s
CPU scheduling policy) or a different policy to manage its memory
(a change in the RTOS’s memory management policy).
Furthermore, an RTOS treats each resource independently. For instance,
the CPU is independently managed by a CPU scheduler whilst memory is
managed by the memory management module. However, often resources have
an inter-dependency pattern amongst themselves such that a change made to
the management of one resource affects the other.
As an example, consider a complex real-time MPEG [51] video decoder
application. An MPEG video stream consists of several frames. A frame is a
single still image in a MPEG video stream, a group of which produce a motion
video. In the MPEG video standard, depending on the coding method, there
are several types of frames [51, 107]:
• I-Frame: stands for intra-coded. This type of frame is coded independently
of other frames. It is considered the starting point for decoding any
MPEG video stream, and these frames can be randomly indexed in an
MPEG video stream.
• P-Frame: stands for predicted. This type of frame is coded with reference
to a past frame (either an I- or a P-frame). To decode this frame, the
reference frame must be decoded first. These frames are referenced by
future P- and B-frames.
• B-Frame: stands for bi-directional, also called an interpolated frame. This
frame is coded with reference to both past and future frames (either
I- or P-frames). These frames are never themselves used as references and
provide maximum video compression.
Frames can be further divided into entities called macro-blocks, but these are
beyond the scope of this discussion. A group of related frames constitutes a scene;
a valid sequence of frames such as IBBPBBPBBP forms a scene. An MPEG
video stream consists of several scenes, each containing a different set of frames.
Frames in two different scenes are not related to each other and can be decoded
in parallel.
Figure 2.1(a) shows a valid MPEG input stream where a group of frames
collectively belong to a particular scene and a sequence always starts with
an I-frame. However, to improve the decoding time, the future P-frames are
transmitted ahead of time. This is done so that both past and future refer-
ence frames have already been decoded when decoding of a B-frame begins.
Figure 2.1: MPEG Input Streams for Decoding
Thus, the real frame transmission pattern for the original input stream may
be represented as shown in figure 2.1(b).
The performance of an MPEG decoder application depends on the band-
width of the input MPEG video stream, the complexity of the scene currently
being decoded and the number of different frame types it contains. Generally,
a well-encoded MPEG video stream contains more B-frames than
frames of the other types [92]. However, it should be noted that the
frame decode time for each type of frame may vary depending on the scene
complexity.
Returning to the example, if the MPEG decoder application is multi-
threaded such that it decodes different unrelated scenes in parallel, then a
traditional scheduling policy may not schedule the threads efficiently on the
CPU, which may in turn affect the memory management subsystem. This is be-
cause the decoder threads require considerable memory to store the decoded
as well as the to-be-decoded frames. In memory-constrained em-
bedded systems, this results in all the available memory being used up, bringing
the memory manager into action. In a system implementing virtual memory
(e.g. paging), the memory manager evicts some of the memory pages belong-
ing to another thread and allocates the freed pages to the requesting
thread.
Traditionally, CPU schedulers (e.g. fixed priority (FP), earliest-deadline-
first (EDF), etc.) operate on information that is either fixed offline (e.g.
priority) or pertains only to the CPU [62,90]. As a result, the CPU
scheduler may schedule the thread whose pages have just been evicted,
causing a series of page faults. This process continues until the system starts
thrashing [90]. Such experiences often lead to the rejection of paging
as a viable solution for embedded systems.
However, the problem here is not with paging. It is the CPU scheduler,
unaware of the memory manager’s operation, that causes the problem. With
proper cooperation and integration of several resource management modules, it
is possible to share valuable information, thereby avoiding the risk of conflicts
that lead to poor system performance.
For better overall performance, an OS should provide the best possible sup-
port to the resource requirements of applications. The resource requirements
of complex real-time applications depend on several factors, making them
non-deterministic and dynamic in nature. An OS needs to adapt its resource
management policies at runtime to provide the required support. Most com-
mercial RTOSs only support static compile time specialisation.
2.1.4 Resource-Constrained Real-time Embedded Systems
Many real-time embedded systems operate in a constrained environment with
limited resources (e.g. limited processing power, memory and
power). In such systems, it is even more important for an OS to efficiently
manage the available resources amongst the varying requirements of complex
applications. The use of average-case, general purpose resource management
policies generally results in poor application performance.
To achieve the functionality required within these constraints, real-time
embedded applications need to dynamically change their own behaviour and
that of the underlying OS. Usual approaches [13,98,131] are coarse-grained,
involving applications changing their own functionality by altering the pro-
cesses that are executed, whilst leaving the OS unchanged. However, many
subtle changes of behaviour can be achieved without altering process function-
ality, rather by modifying the resource management policies of the OS. Such
fine-grained changes allow the same application functionality to be executed,
but perhaps at different times or rates, potentially using different resources.
For example, in response to changes in the environment an application may
need to change the rates at which individual processes are executed, or perhaps
the resources that individual processes use. The following sections discuss re-
source management (for CPU and memory) in an OS and some existing OS
specialisation techniques.
2.2 Resource Management in an OS
The resource requirements of complex applications vary dynamically at run-
time, requiring an OS to manage system resources efficiently. Adaptive resource
management in the OS is key to providing application-specific support. This
section discusses existing techniques and approaches in OS resource manage-
ment for the main system resources: the CPU and memory.
2.2.1 CPU Resource
A CPU scheduler in an OS is responsible for managing the CPU. Its main ob-
jective is to share the CPU amongst several different competing applications
depending on certain criteria or requirements. There are several scheduling poli-
cies, each using a unique or mixed set of scheduling criteria. For instance, the fixed
priority (FP) [79] scheduling policy uses a priority-based scheme to schedule
processes on a CPU, i.e. the process with the highest priority gets to exe-
cute first. Similarly, the earliest-deadline-first (EDF) scheduling policy uses a
deadline-based scheme, i.e. the process with the earliest deadline gets to exe-
cute first. In order to support application-specific requirements, the OS should
be capable of changing the scheduling scheme or criteria dynamically at run-
time. The concept of using a hierarchical scheduling scheme might be suitable
for this purpose. The following subsection discusses hierarchical scheduling in
detail.
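To make the contrast between the two criteria concrete, they can be sketched in a few lines of Java. The Task model and the ready-queue representation below are assumptions made for this sketch only, not part of any particular RTOS.

```java
import java.util.*;

// Illustrative contrast: the same ready queue dispatched under the FP
// and EDF criteria. The Task model is an assumption for this sketch.
public class SchedulingCriteria {
    public static class Task {
        public final String name;
        public final int priority;  // higher value = more urgent under FP
        public final int deadline;  // earlier deadline = more urgent under EDF
        public Task(String name, int priority, int deadline) {
            this.name = name; this.priority = priority; this.deadline = deadline;
        }
    }

    /** Fixed priority: dispatch the ready task with the highest priority. */
    public static Task pickFP(List<Task> ready) {
        return Collections.max(ready, Comparator.comparingInt((Task t) -> t.priority));
    }

    /** Earliest deadline first: dispatch the task with the earliest deadline. */
    public static Task pickEDF(List<Task> ready) {
        return Collections.min(ready, Comparator.comparingInt((Task t) -> t.deadline));
    }
}
```

The same ready queue thus yields different dispatch decisions under the two policies – exactly the kind of criterion a specialisable OS should be able to switch at runtime.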
Hierarchical Scheduling
Hierarchical scheduling has been widely adopted in OSs. Most OSs implement
a two-level scheduler – one to schedule kernel threads and the other for
application/user threads. Additionally, hierarchical schedulers are also used
in virtual machine implementations (e.g. Java Virtual Machine, VMWare [7],
etc.), network layers, disk scheduling, etc. This section presents existing tech-
niques used to design and implement hierarchical schedulers.
Most hierarchical schedulers proposed in the past are based on a fixed tree
type structure. Figure 2.2 shows a typical hierarchical scheduling structure.
The scheduler at the root node implements a traditional scheduling policy (e.g.
the FP policy, as shown in the figure), those at the leaf nodes implement
application-specific policies (e.g. EDF, Rate Monotonic (RM), etc.), while those
at the intermediate nodes implement either a traditional policy, an optimised
policy, or a Start-time Fair Queueing (SFQ) policy [55] to schedule the
next-level schedulers (either the intermediate nodes or the leaf nodes).
[Figure: a root-node FP scheduler above intermediate nodes and leaf-node
FP/EDF/RM schedulers, which in turn schedule threads T1–T12.]
Figure 2.2: Hierarchical Scheduling Structure
The operation of a hierarchical scheduling scheme is as follows: the sched-
uler at the root node schedules the next level scheduler – either an intermediate
or a leaf node scheduler depending on the depth of the scheduling hierarchy
– making the decision with respect to the policy it implements. In the same
way, if the next level scheduler is an intermediate node, then it schedules the
next level scheduler – either another intermediate or a leaf node scheduler. Fi-
nally, the scheduler at the leaf node schedules the actual thread/process that
takes over the CPU until the next scheduling/pre-emption point. Note that
the complexity of this approach increases with the depth of the scheduling
tree, and it substantially increases the time required to make a scheduling
decision. For this reason, hierarchical scheduling schemes have been shown to
incur considerable overhead [31, 55, 99].
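The root-to-leaf dispatch described above can be sketched as follows. This is a minimal, illustrative model: the Scheduler interface, the fixed-priority root and the comparator-based leaf policies are assumptions made for exposition, not the design of any of the schedulers surveyed in this section.

```java
import java.util.*;

// Minimal two-level scheduling hierarchy: a fixed-priority root node
// dispatches to leaf schedulers, each applying its own policy to its
// own threads. Names and policies are illustrative assumptions.
public class HierarchicalSketch {
    public interface Scheduler {
        String scheduleNext();
        int priority();
    }

    // Leaf scheduler: orders its threads with a per-policy comparator
    // (a stand-in here for EDF/RM-style ordering).
    public static class Leaf implements Scheduler {
        private final int prio;
        private final List<String> threads;
        private final Comparator<String> policy;
        public Leaf(int prio, List<String> threads, Comparator<String> policy) {
            this.prio = prio; this.threads = new ArrayList<>(threads); this.policy = policy;
        }
        public int priority() { return prio; }
        public String scheduleNext() { return Collections.min(threads, policy); }
    }

    // Root scheduler: fixed priority over its child schedulers, then the
    // chosen child makes the final (leaf-level) decision.
    public static String dispatch(List<Scheduler> children) {
        Scheduler next = Collections.max(children,
                Comparator.comparingInt(Scheduler::priority));
        return next.scheduleNext();
    }
}
```

A deeper hierarchy simply repeats this dispatch step at each intermediate level, which is where the decision-time overhead noted above comes from.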
However, the amount of overhead and the efficiency of a hierarchical scheduler
depend on its implementation. For example, a hierarchical scheduler that uses
the SFQ [55] algorithm to schedule the intermediate nodes incurs considerable
time overhead in switching to the required scheduler at the leaf-node level:
although it provides the flexibility for various heterogeneous schedulers to
co-exist in the same system, it does so at a considerable cost in time [55].
Similarly, Vassal [31] is a multi-policy scheduling model that implements a
two-level scheduling solution in Windows NT, allowing applications to introduce
a custom scheduler into the system. The Vassal scheduling scheme
was tested with the large time quantum (greater than 1 ms) supported by
Windows NT, making it infeasible for high-resolution real-time threads [31]. Also,
only one application-defined scheduler can co-exist with the native Windows
NT scheduler. The system was put forth as a tool-kit to build and experiment
with new scheduling policies and did not address issues related to
application-specific scheduling.
The Hierarchical Loadable Scheduler (HLS) [100] is another solution, sim-
ilar to Vassal, where the schedulers are loaded into the kernel at runtime
as drivers. HLS implemented on Windows 2000 kernel imposed considerable
overhead due to context switch time. The context switch time on a 500MHz
Pentium III machine was noted to be 11.7µs in Windows 2000 with HLS as
compared to 7.10µs in the actual Windows 2000 release version [99]. It has
also been noted that HLS adds 0.96µs overhead to the context switch time for
each additional level in the scheduling hierarchy [99].
The SMART [87] scheduler uses an optimised scheduling scheme that
adapts to the working set of applications. SMART provides a time sharing
policy when no real-time threads are running. In the case where both types
of application threads – real-time and non real-time – exist in the system,
SMART uses an optimised scheduling policy [87].
The Scheduler Activations (SA) model [124] implemented in the NetBSD
OS is essentially an Application Programming Interface (API) that provides a
kernel interface and scheduler up-call mechanism (‘sa upcall()’ [124]) to sup-
port the hierarchical scheduling scheme. This model generates huge overheads
– a context switch time of 225µs on a 500MHz G3 processor – making it
unsuitable for real-time use.
MaRTE OS [103] provides an API to support application-defined scheduling.
Applications in MaRTE OS are able to introduce application-specific
scheduling policies into the system, where they co-exist in the hierarchical
structure. However, this approach has also proved to generate enough overhead
to make it slower than the traditional policies [103]. A formal proposal has been
made to include the application-defined scheduling mechanism in the real-time
POSIX standard [104].
Many other hierarchical schedulers have been proposed, such as APEX [80]
– an adaptive two-level disk scheduler for multimedia database
management systems (DBMS) – and HATS [39] – an adaptive hierarchical
scheduler built on the puppeteer [39] system for scheduling network bandwidth.
A scheduler is the main component of an RTOS which is responsible for
distributing the CPU bandwidth amongst different threads/processes in the
system. The time taken to make a scheduling decision is critical to system
performance. In most hierarchical scheduling models, along with the appli-
cation threads/processes, the CPU bandwidth is also shared amongst various
intermediate schedulers. Delays in making scheduling decisions increase the
time spent in the intermediate schedulers, thereby affecting the execution time
of application threads/processes.
2.2.2 Memory Resource
Like the CPU, memory is an important resource in an embedded
system [126]: applications cannot execute without it, and memory requirements
differ from application to application. The key is to support applications with
greater memory requirements even in memory-constrained systems. This
section provides background on the main virtual
memory technique – paging. Paging has been a topic of interest for several
decades. Though many page replacement policies have been proposed, each
has its own advantages and disadvantages. Previous work related to paging
can be classified into three categories: page replacement policies, extensible
and application-controlled paging, and compiler-assisted paging mechanisms.
Page Replacement Policies
LRU- and CLOCK-based policies are the most widely accepted and are used
in most commercial OSs, e.g. Linux [19, 54] and Mach [9, 85]. Because its
paging decisions are based purely on recency, the LRU policy fails to keep in
memory those pages that are frequently accessed over a long period of time. The proposed
improvements to LRU include: LRFU [76, 77], EELRU [111], LRU-K [44,
91], 2Q [65], and more [49]. The CLOCK replacement policy is easier to
implement than LRU and requires less book-keeping. It has been shown that
the performance of CLOCK approximates that of LRU [17].
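To illustrate the lighter book-keeping of CLOCK, the following is a minimal sketch of the classic second-chance algorithm, assuming a fixed set of page frames and one reference bit per resident page; real kernels add many refinements on top of this core.

```java
import java.util.*;

// Minimal sketch of the CLOCK (second-chance) policy: one reference bit
// per resident page; on a fault the hand sweeps, clearing set bits, and
// evicts the first page whose bit is already clear.
public class ClockSketch {
    private final int capacity;
    private final List<Integer> frames = new ArrayList<>();
    private final Map<Integer, Boolean> refBit = new HashMap<>();
    private int hand = 0;

    public ClockSketch(int capacity) { this.capacity = capacity; }

    /** Accesses page p; returns true on a hit, false on a page fault. */
    public boolean access(int p) {
        if (refBit.containsKey(p)) {            // hit: set the reference bit
            refBit.put(p, true);
            return true;
        }
        if (frames.size() < capacity) {         // free frame: no eviction needed
            frames.add(p);
            refBit.put(p, false);
            return false;
        }
        while (refBit.get(frames.get(hand))) {  // sweep: give second chances
            refBit.put(frames.get(hand), false);
            hand = (hand + 1) % capacity;
        }
        refBit.remove(frames.get(hand));        // evict the unreferenced victim
        frames.set(hand, p);                    // load the new page in its frame
        refBit.put(p, false);
        hand = (hand + 1) % capacity;
        return false;
    }
}
```

Note that recently referenced pages survive a sweep, which is how CLOCK approximates LRU without maintaining an ordered list.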
The Adaptive Replacement Cache (ARC) [86] policy builds upon LRU
eliminating some of its disadvantages. For example: unlike LRU, ARC also
captures the frequency features of the workload. Also, ARC is not polluted
by scan (a sequence of one-time use only page requests), a well-known failure
condition of the LRU policy [86].
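The scan-pollution weakness that ARC addresses can be demonstrated with a strict LRU cache. The toy model below, built on Java's LinkedHashMap in access order, shows a one-time sequential scan evicting pages that were being reused; it models plain LRU, not ARC itself.

```java
import java.util.*;

// Toy demonstration of LRU "scan pollution": a one-time sequential scan
// evicts pages that were being reused, because strict LRU tracks
// recency only. This models plain LRU, not ARC.
public class LruScanDemo {
    /** Replays an access trace through a fixed-capacity LRU page set
        and returns the pages still resident afterwards. */
    public static Set<Integer> residentAfter(final int capacity, int[] accesses) {
        LinkedHashMap<Integer, Boolean> lru =
            new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) { // access-order map
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > capacity;   // evict the least-recently-used page
                }
            };
        for (int page : accesses) lru.put(page, Boolean.TRUE);
        return new HashSet<>(lru.keySet());
    }
}
```

The frequently reused pages are displaced by pages that will never be touched again – precisely the failure condition ARC's frequency tracking avoids.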
CAR [17] (i.e. CLOCK with Adaptive Replacement) combines the advantages
of the CLOCK and ARC [86] policies. However, in both ARC and
CAR, two consecutive page hits are enough to pass the test for a page's
long-term utility, even though the page may never be used again [17]. Most
file-handling applications (e.g. file search, databases, etc.) access the same
pages in quick succession and never access them again.
CART [17], an extension of CAR with a temporal filter, improves upon
this defect of CAR/ARC. It uses four page lists – T1, T2, B1 and B2. Lists
T1 and T2 contain pages currently present in memory, whereas lists B1 and
B2 maintain history information about pages that have been recently
reclaimed. Pages in T1 are considered to have short-term utility while pages
in T2 have long-term utility. CART [17] imposes more stringent constraints
than CAR/ARC when deciding a page's long-term utility.
CLOCK-PRO [63] is an improved version of CLOCK combining the advantages
of CLOCK and the LIRS [64] policy; the latter was proposed for better
buffer-cache performance. CLOCK-PRO maintains a circular list of pages with
three clock hands. HAND_hot points to the hot page (a page which is newly
allocated or recently accessed) with the largest recency; any hot pages swept by
this hand turn into cold pages (not recently accessed). HAND_cold points
to the last resident cold page (i.e. the one furthest from the head of the list).
HAND_test points to the last cold page which is in its test period; this hand
is used to terminate the test period of cold pages, and the non-resident cold
pages swept by it leave the circular list for reclamation.
In addition to the above, several other page replacement policies have been
proposed in the past. The next subsection discusses some existing extensible
and application-controlled paging mechanisms offered by OSs.
Extensible and Application-Controlled Paging
In a µ-kernel [120], system modules such as the CPU scheduler, memory man-
ager, etc. can be executed as independent user-space processes. Many ex-
tensible memory management solutions make use of the µ-kernel architecture
to extend the existing paging mechanism, either to use a different policy or to
implement an application-specific one.
The Mach [9] OS provides the user with a certain level of control over pag-
ing of the application concerned. The external pager interface of Mach allows
applications to use their own functions for moving pages to and from the sec-
ondary storage or the swap space. However, Mach does not allow applications
to choose their own page replacement policies.
McNamee et al. [85] extended Mach's external pager interface to allow
applications to use their own page replacement policies. These new pagers,
called the PREMO (Page REplacing Memory Object) pagers, executed in ap-
plication address space. Every virtual memory region allocated in the system
is represented as a memory object. Each PREMO pager is responsible for the
pages belonging to one or more such memory objects. On a page-fault, the
Mach pager uses a global policy to select a page for reclamation and checks
if the selected page belongs to one of the memory objects associated with a
PREMO pager. If true, then the selected page is put back into the page list
and control transferred to the PREMO pager. The PREMO pager would then
return a page from a memory object it governs. Finally, this page is reclaimed
by the Mach pager. The method is shown to add considerable communication
overhead in the system [85].
The VINO [45] OS enables applications to override some or all operations
within the MemoryResource objects to specialise their behaviour. The ‘appli-
cation kernels’ in the V++ Cache kernel [35] are allowed to cache address-
space objects and handle page-faults on these objects.
Applications in Aegis [46], an implementation of the exokernel [47], use
‘library operating systems’ instead, to implement their own memory man-
agement sub-system. The self-paging mechanism in Nemesis [57] presents a
mechanism mainly targeting continuous media applications. An application
in Nemesis [57] is responsible for handling all of its page-faults on its own.
Nemesis provides Quality of Service (QoS) guarantees in terms of the amount
of guaranteed physical memory space and the disk bandwidth available for
the requesting application. On a page-fault, the context is saved in the ap-
plication domain and later handled by the application responsible when it is
scheduled. Other application-controlled paging solutions [18, 37, 58, 73] are
similar to Nemesis [57] in that the applications control all the paging activity
including page-fault handling.
SPIN [22] is an extensible µ-kernel that is capable of loading user-defined
extensions called Spindles (SPIN Dynamically Loadable Extensions) as and
when required by the applications. Spindles are implemented using Modula-3,
a safe programming language, to ensure safety of the system.
Resources in SPIN are managed at two levels – (1) the primary system al-
locator looks after the major system resources such as memory, CPU, etc. and
(2) the secondary user allocator manages the resources already allocated by
the system allocator [22]. SPIN [22] provides user-level extensions for paging
by allowing registration of an event handler for memory management events.
The L4 [78,125] µ-kernel supports recursive construction of address spaces.
On initialisation, a user-level pager in L4 takes hold of the entire physical
memory, σ0. This pager then allocates address spaces to tasks on a first-come-first-served
basis. It can also divide the address space it holds, σ0, into two
parts, σ1 and σ2, such that it remains responsible only for σ1 and has
another pager look after σ2.
Extensible and application controlled paging solutions tend to complicate
the generic page replacement code of an OS and add considerable overhead to
the system [18, 37, 57, 58]. Such solutions have several side effects: some rely
completely on the application programmer to handle paging accurately, and
some impose a performance penalty on the normal operation of the OS's paging
policy, which affects other applications not using the scheme [37, 45, 47, 58].
There is a trade-off between the OS handling everything for the applications
and allowing applications to perform some of the OS operations (paging).
Several critical applications such as databases, RAID servers, garbage col-
lectors in virtual machines, etc. tend to implement their own paging mecha-
nisms due to insufficient support from the OS [57,109,119]. The next subsec-
tion discusses some existing compiler assisted paging mechanisms.
Compiler Assisted Paging Mechanisms
The compiler assisted memory management policy in [82] analyses code, at
compile time, for loops consisting of accesses to array-based data. The com-
piler then inserts primitives LOCK, UNLOCK and ALLOCATE into the com-
piled code to control allocation of memory space for the respective arrays at
run-time. This method assumes that the underlying OS supports allocation
of memory on demand and can lock/unlock pages in memory dynamically at
run-time.
More recently, Brown et al. [26] proposed a similar compiler-assisted paging
solution that uses compiler-inserted prefetch and release hints to manage
physical memory more intelligently. The main focus of this work is on the
insertion of hints into application source code. A run-time layer queues all
hinted requests, for either prefetch or release operations, and later passes them
on to the OS. It is assumed that the OS already supports prefetch and release
operations on memory pages. It is shown that this method adds considerable
overhead to the system increasing the application execution times [26].
A case study presented in this thesis for virtual memory management uses
similar techniques. The approach presented in this thesis uses a reflection
framework to obtain information about application memory access patterns
and accordingly prefetch or release pages from/to the swap space. There are
two advantages to this approach: firstly, it avoids the bottleneck of queued
requests as in [26]; secondly, only the most recent information is used to
prefetch or release pages.
Linux [19] provides memory lock and unlock primitives to certain privileged
applications in the form of ‘mlock()’ and ‘munlock()’ system calls. Applica-
tion processes use ‘mlock()’ to lock a range of memory in the virtual address
space such that, no matter which policy is implemented, the physical pages
mapped to these virtual addresses will not be reclaimed by the page replace-
ment code. Under Linux, the mlock() primitive only locks the virtual memory
pages associated with the application process making it possible for a physical
page to move within different page-lists (e.g. from active to inactive page-
list) [54]. Just before paging-out, pages are reverse mapped to their virtual
pages to check if they have been locked [54]. This process adds considerable
overhead on the page reclamation process.
Linux also implements a system call called ‘madvise’ which can be used
by applications to notify the OS of their probable memory access patterns.
The implementation of ‘madvise’ in the Linux kernel uses such notification
mainly to tune the extent of disk read-ahead pages. In Linux, each new access
to a page in secondary storage (e.g. the disk) makes the kernel read ahead
several pages into its cache for future use. Applications with sequential
memory access patterns typically have iterative, loop-based memory accesses;
read-ahead tuned via ‘madvise’ can still cause unnecessary page reads, for
example when an application reaches the end of its data-access loop and starts
accessing the memory pages from the beginning again.
Other work [32] in this area involves the use of various loop transformation
techniques such as loop permutation, loop fusion, loop distribution, etc. to
achieve data locality in terms of both temporal and spatial reuse of cache lines.
The next section discusses some existing OS specialisation techniques.
2.3 Operating System Specialisation
Specialisation makes either the entire OS or certain of its resource management
policies adapt to application-specific requirements. This provides better
support to applications and helps them achieve better performance.
A general purpose OS implements generic resource management policies
that do not support all applications alike. For example, consider that an
OS implements a high-performance graphics algorithm to support graphics
intensive applications. This functionality would rarely be used by a non-
graphical application.
Specialisation of an OS also depends on the OS architecture. A monolithic
kernel contains all system modules, statically compiled, to form a single chunk
of code. This provides less flexibility for OS specialisation. Addition of nu-
merous different kinds of resource management policies into the kernel would
make the OS code larger. However, modern monolithic OSs such as Linux
are more modular in nature, allowing additional functionality to be added at
runtime by dynamically loading the required modules into the kernel.
On the other hand, a µ-kernel can be easily specialised. At runtime, the
system modules can either be changed, replaced or extended as and when
needed. The L4 µ-kernel provides recursive layers of abstraction, which is
good for specialisation [78]. An exokernel [47] gives the applications complete
control over the OS policies making it ideal for specialisation. However, the
redundant OS libraries impose unnecessary memory overhead in the system.
An exokernel can support all kinds of OS specialisations.
Apart from using the inherent features of the kernel design, it is possible to
use some external techniques to dynamically specialise an OS. Policy-based
resource management specialisation techniques divide the bigger problem into
smaller chunks that can be dealt with individually. The next subsection pro-
vides more insight on the specialisation of OS policies.
2.3.1 Specialisation of OS policies
Each resource in a system is different and needs to be managed differently.
However, information pertaining to one resource might also be useful for man-
aging another resource.
There are numerous CPU scheduling policies proposed in the past: Rate
Monotonic (RM), Earliest Deadline First (EDF), Round Robin (RR), etc. A
scheduling policy may or may not be the most suitable policy for a particu-
lar kind of workload. Specialising this policy depending on the information
obtained at runtime will provide a mechanism to dynamically adapt the pol-
icy for better application support. The applications may want to choose the
scheduling policy themselves. Furthermore, the applications may desire that a
custom built scheduling policy be used by the OS. The possibilities are endless.
However, at some point a decision has to be taken on how far a policy
can be specialised. This is called the granularity [41] of specialisation: the
more fine-grained the control applications have over the specialisation aspects,
the more specialisable the OS is. Specialisation of a policy should be designed
in a way that suits the requirements correctly. If designed properly, any or all
resource management policies of an OS can be specialised.
The approach taken by the Infokernel [13] is to transform OS policies into
mechanisms. The Infokernel gives out information to the applications about
the policies it provides. The applications make use of this information to adapt
themselves in order to gain optimal performance from the OS policies [13].
Real-time applications have stringent resource requirements. Adapting the
OS policies to meet these requirements will provide better application support.
However, the Infokernel takes a different approach: rather than adapting the
OS policies to suit application requirements, it changes/adapts the applications
themselves. The adapted applications no longer adhere to their original
resource requirements and could potentially exhibit completely different
behaviour. This thesis aims to retain the original application behaviour and
focuses on improving the existing resource management support by adapting
the OS policies.
The reflection mechanism [112, 115], often found in programming languages,
can be considered a specialisation technique which, when used in an OS context,
might provide better support for dynamic OS adaptation. The next section
describes reflection in more detail.
2.4 Reflection Mechanisms
A conventional program runs through a predefined, deterministic execution
path; any behavioural change to this path requires the code to be changed,
recompiled and executed again. The ability of an application program
to examine itself at runtime is called ‘self-awareness’ or ‘introspection’ [105].
Using introspection, an application can query its status, check data structures,
etc. at runtime. The mechanism by which an application becomes ‘self-aware’
and changes itself accordingly – either to change its behaviour or to improve its
performance – is called Reflection [112].

Figure 2.3: Tower of Reflection (Reproduced from [81,105])
In order to achieve reflection, an application needs to be aware of many
aspects of its design and implementation, e.g. its data structures, language
constructs/semantics, runtime support system (or virtual machine). The pro-
cess by which this information is made available to an application is called
reification [105].
Reflective systems are generally made up of a ‘base-level’ component and
one or more ‘meta-level’ components or entities operating one above the other
(see figure 2.3). The base-level represents the application program code, with
the meta-level being a model of the base-level that analyses the reified in-
formation. One meta-level component can have further meta-levels above it
resulting in a reflection tower (see figure 2.3).
Generally, the meta-levels are causally connected with each other, such
that a change made by one component is reflected everywhere. Using causal
connection, it is possible for a meta-level to change the behaviour of an ap-
plication without the knowledge of its base-level component. The meta-level
achieves this by intercepting and changing the behaviour of certain function
calls to/from the base-level. The change could be in the form of changing the
value of a data structure or changing the base-level code itself.
For example, consider a check-pointing approach to fault-tolerance. This
functionality can be brought into a system by introducing a suitable meta-
level entity [105]. The calls to all write operations on the check-pointed data
objects are intercepted by the meta-level which then performs the actual check-
pointing of the data-object: storing a copy of the data elsewhere. Once this is
done, the write operation continues as expected in the base-level. The result
as far as the application is concerned is a write operation to the data-object –
it is unaware of the check-pointing. Note that using reflection at runtime, it is
possible to dynamically change – the objects that are check-pointed, the fault-
tolerance mechanism used, etc. – all without the knowledge of the application’s
base-level component.
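In a language such as Java, this style of meta-level interception can be approximated with a dynamic proxy: every call to a write operation is intercepted, the old value is check-pointed, and the call then proceeds as normal, without the caller's knowledge. The Store interface and its method names below are hypothetical, chosen only for this sketch.

```java
import java.lang.reflect.*;
import java.util.*;

// Sketch of meta-level interception for check-pointing, using a Java
// dynamic proxy. The Store interface and its methods are hypothetical
// names used only for this illustration.
public class CheckpointDemo {
    public interface Store {
        void write(String key, String value);
        String read(String key);
    }

    public static class MapStore implements Store {
        private final Map<String, String> data = new HashMap<>();
        public void write(String key, String value) { data.put(key, value); }
        public String read(String key) { return data.get(key); }
    }

    /** Wraps a Store so that every write() first saves the old value;
        the base level remains unaware of the check-pointing. */
    public static Store checkpointed(final Store base, final Map<String, String> checkpoints) {
        return (Store) Proxy.newProxyInstance(
            Store.class.getClassLoader(),
            new Class<?>[] { Store.class },
            new InvocationHandler() {
                public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
                    if (m.getName().equals("write"))        // intercept write operations only
                        checkpoints.put((String) args[0], base.read((String) args[0]));
                    return m.invoke(base, args);            // then proceed as normal
                }
            });
    }
}
```

Swapping the handler at runtime would change which objects are check-pointed, or the mechanism used, without touching the base-level code.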
Depending on the information reified and the level of change, reflection can
be classified into two main types:
• Structural Reflection,
• Behavioural Reflection.
Structural Reflection: is the ability of a programming language to provide
reification of the program structure including any abstract data structures.
For example, a meta-level entity in structural reflection can query all
components/objects of a class, add/delete objects, or even change their
data-type [81]. Structural reflection was first introduced in logic programming
for languages such as Smalltalk-80 [50] and LISP [81].
Behavioural Reflection: is the ability of a programming language to
provide reification of the language semantics and implementation along
with the data and implementation of the runtime system [81]. Behavioural
reflection is difficult to achieve. The meta-level has complete control over the
base-level to bring about any change to aspects such as the way functions
are called, the values of data being written or read, etc. The
next sub-section describes the use of and support for reflection in programming
languages.
2.4.1 Reflective Programming Languages
The reflection mechanism originated in programming languages such as
Smalltalk-80 [50], CLOS and LISP, and many modern programming languages
have been extended to support reflection [81]. For example, extensions
(in the form of library packages) to modern programming languages such as
Ada [106], Java [36] and C++ [106] have been developed to provide support
for reflection [105]. Java, for instance, has a reflection API that provides
facilities to introspect the data structures (e.g. a class) used in a program.
However, Java’s ability to alter program behaviour is very limited: through
the API one can only get/set a field, invoke a method, or instantiate a new
class [36].
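These introspection facilities can be seen in a few lines using only the standard java.lang.reflect API; the Sensor class is a made-up example.

```java
import java.lang.reflect.*;

// Introspecting and manipulating a class at runtime with the standard
// java.lang.reflect API. The Sensor class is a made-up example.
public class IntrospectionDemo {
    public static class Sensor {
        public int threshold = 10;
        public boolean above(int reading) { return reading > threshold; }
    }

    public static boolean probe() {
        try {
            Sensor s = new Sensor();
            Class<?> c = s.getClass();
            Field f = c.getField("threshold");           // look up a field by name
            f.setInt(s, 50);                             // ...and set it reflectively
            Method m = c.getMethod("above", int.class);  // look up a method by signature
            return (Boolean) m.invoke(s, 60);            // ...and invoke it: 60 > 50
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that the API reads and invokes existing members but cannot, for instance, add a method or rename a class – the limitation the systems below set out to remove.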
To overcome these limitations and to provide more reflective support, OpenJava
[106] was introduced. The OpenJava compiler is essentially a macro-translation
parser that translates OpenJava source code into regular Java source
code that exhibits reflection [106]; thereafter, it uses the facilities provided
by the Java Virtual Machine (JVM). OpenJava is considered a result of
the lessons learnt from OpenC++ [106].
OpenC++ [106] uses a low-level parse tree approach instead of using
OpenJava’s strong typed object interface to the syntactic structure of the
source [106]. It supports compile time structural reflection while behavioural
reflection is supported through meta-classes written by meta-level program-
mers.
OpenJIT [84] is a reflective Java just-in-time (JIT) compiler. The OpenJIT
compiler allows class-specific customisations. Mostly written in Java with a
few small Java Native Interface (JNI) stubs for JVM introspection, OpenJIT
also has a few C-level runtime routines. The OpenJIT compiler checks and
modifies itself during execution of a Java application, thereby adapting to the
runtime application-specific requirements [84]. Since most of it is written in
Java, it imposes performance overheads due to the extra level of interpretation
involving the JVM.
Another interesting development for reflective support in Java was intro-
duced in the form of a class library called Javassist [36]. Javassist supports
load-time structural reflection in Java. Javassist takes a simple approach,
providing sets of classes to: read compiled Java byte-code; create new
byte-code; add or change the methods or the name of a class in the compiled
byte-code; and, finally, load the compiled byte-code into the JVM for
execution. Once loaded into the JVM, a class cannot be changed thereafter.
To illustrate the potential of this approach, a simple example is presented
(as quoted in [36]). Consider a class Calendar that implements an interface
Writable provided by a third party as shown below.
class Calendar implements Writable {
    public void write(PrintStream s) { ... }
}
The class Calendar implements method write declared in the interface
Writable. Suppose that the third party changes the class name Writable to
Printable and the method name from write to print. This would necessitate
changing the Calendar class code as follows:
class Calendar implements Printable {
    public void write(PrintStream s) { ... }
    public void print() { write(System.out); }
}
In a real-world scenario, a change like this might mean changing huge
amounts of code, which can be impractical. If Java supported structural
reflection, it would be possible to change the interface name to Printable and
make similar changes to the method write. The Javassist class library allows
this to be done at class-loading time.
Reification and reflection in Javassist are done by creating an object of
CtClass (compile-time class), which can read byte-code from a compiled Java
class file. The CtClass object provides methods such as toBytecode(),
addMethod(), addField() and setBody() to generate new byte-code, add a
new method, add a new field to a class, and change the body of an existing
method, respectively [36].
Behavioural reflection in Javassist is implemented using software hooks.
Intermediate hooks are inserted into the methods of a reflective class. When
a hooked method is called, the call is intercepted by the hook and is
then handled by a meta-level class, which might change the behaviour if
required [36]. Similar to Javassist, another system, called linguistic reflection,
was developed; although it allowed dynamic creation of a new class, it did
not allow changes to an existing class definition [36].
OpenAda [106] provides compile time structural reflection to the standard
Ada 95 programming language. OpenAda makes use of the pragma Metaclass
specification construct in the application source code – files with extension .oa
– that specifies to the compiler which type or object or method to translate
for reflection [106]. A simple example (as quoted in [106]) would be:
pragma Metaclass(Verbose.Object);

with OpenAda.MOP;
...
procedure Verbose is
   type Object is new OpenAda.MOP.Class with private;
   ...
   procedure Translate_Procedure_Body
     ( This    : in out Object;
       Input   : in out OpenAda.Syntax.Procedure_Body.Node;
       Control : in out OpenAda.Syntax.Visitation_Controls );
   ...
private
   type Object is new OpenAda.MOP.Class with null record;
end Verbose;
This example depicts an overriding of procedure Translate_Procedure_Body
inherited from type Class. OpenAda provides the programmer with several
packages such as OpenAda.Syntax, OpenAda.MOP (Meta-Object Protocol),
etc. to support reflection [106]. It makes use of Dynamic Link Libraries
(DLL) in Microsoft Windows OS and Shared Object Libraries (*.so files) in
UNIX variants to support dynamic loading of the meta-classes. To achieve
behavioural reflection, OpenAda provides a simple set of packages that allow
the behaviour of a method to be changed at runtime. The methods provided,
Reflect and Reify, help with interception and introspection
when required [106].
The OpenAda compiler translates the OpenAda source code into standard
Ada 95 compatible source code which the user can compile and execute using
any standard Ada compiler. However, this may introduce certain limitations to
the features that depend on the underlying Ada compiler. The next subsection
describes some existing reflective middlewares that allow dynamic changes to
their services.
2.4.2 Reflective Middlewares
A middleware is a software layer that sits between the applications and an
OS, mediating interactions between them. Middlewares provide a standard
interface to the applications by hiding complex details of the underlying OS in-
terface. The complexity includes features such as remote method invocation,
network communication protocols and cryptography [70]. Middlewares are
generally deployed in distributed computing environments and network com-
munication systems. Existing middleware technologies include CORBA [127],
Java-based J2EE [4] and the .Net framework [5].
Reflective middlewares [70] use a reflection mechanism to adapt their ser-
vices in order to accommodate changing requirements of applications. Such
middlewares use reification and meta-level components to bring about fine-
grained changes to their services. They also provide programmers with an
interface to explicitly control a change to a specific service.
Most reflective middlewares provide support for interception. Interceptors
are used to support added functionality such as fault tolerance, cryptography,
runtime monitoring and the collection of system statistics. Some of the reflective
middlewares proposed in the past include DynamicTAO [71], Open ORB [24],
OpenCORBA [75] and mChaRM [34].
DynamicTAO [71], an extension of the C++ TAO Object Request Bro-
ker (ORB) [71], allows runtime reconfiguration of the ORB's internal engine
and of the applications using the ORB. It uses ComponentConfigurators
to represent the dependency relationships between the different ORBs, the ORB
components and the application components. On receiving a request to re-
place a component in the system, the middleware checks its dependencies
on the other components using the attached ComponentConfigurator. Dy-
namicTAO allows runtime loading and unloading of modules by exporting a
meta-interface.
The Open ORB [24] middleware was developed independently at the same
time as DynamicTAO. Its main aim was to support applications with
dynamic requirements. The Open ORB platform can be configured to include
appropriate components using a component model which allows hierarchical
composition and distribution.
In Open ORB, the base-level consists of the components implementing the
normal middleware services while the meta-level exports these implementa-
tions to the programmer to enable inspection and adaptation. A base-level
component can have its own private set of meta-level components which are
referred to as the component’s meta-space. Each meta-space is further parti-
tioned into various meta-space models that provide different views of the plat-
form implementation and can be independently reified. Open ORB defines four
meta-space models grouped according to the distinction between structural
and behavioural reflection. The Interfaces and Architecture meta-space mod-
els support structural reflection whereas the Interception meta-space model
supports behavioural reflection. Prototype implementations of the Open ORB
have tested the suitability of the architecture for distributed multimedia ap-
plications. Initial experiments indicated that Open ORB performed on a par
with Orbacus [24], a commercial ORB, and around 10% slower than GOPI [24].
OpenCORBA [75] adds reflection support to standard CORBA. It has been
implemented in NeoClasstalk [101], a Smalltalk-like [50] reflective language
based on meta-classes. In OpenCORBA, the behaviour of a CORBA service
is changed by replacing the meta-class of a class that provides that service.
Quarterware [110] is another reflective middleware platform that provides
a component framework for the ORB mechanisms. With the use of a reflective
interface, the programmers can plug custom components into the framework.
Quarterware supports multiple middleware standards such as CORBA, Java
RMI (Remote Method Invocation) and MPI (Message Passing Interface).
The multi-Channel Reification Model (mChaRM) [34] is a reflective mid-
dleware which enables explicit control over multi-channel communication using
a communication-based reification approach. The model allows interception of
calls to the methods in communication channels in order to inspect and adapt
their structure or behaviour.
In general, reflective middlewares make extensive use of interception, in-
tercepting calls from or to the applications to bring about the required change.
Many reflective middleware techniques have since been adopted into standard
middlewares. For instance, CORBA includes a standard for portable inter-
ceptors and Java includes the Core Reflection API. These architectures are
suitable for distributed computing, which requires application portability. The
use of such middlewares for efficient application-specific resource management
has not been fully exploited.
However, the middleware layer between the OS resource management and
the applications adds a level of indirection to the system. Fur-
thermore, middleware provides a standard interface to the applications and
handles all the complexity pertaining to the low-level OS interface which dif-
fers from one OS to another. With each OS implementing different policies,
middlewares are limited by what features the underlying OS can provide. The
next subsection discusses some of the existing reflective OSs.
2.4.3 Reflective OSs
To accommodate reflection in OSs, a direct analogy can be drawn with the
implementation of reflection in programming languages. Essentially, an OS should
provide a mechanism by which reflection is achieved. This additional func-
tionality in the OS may introduce some overhead into the system. However,
this overhead could be justified by the additional flexibility provided to the
application and the resulting performance gain. The overhead should be zero
or minimal if applications do not make use of reflection.
For systems that wish to provide an efficient fine-grained dynamic
adaptation mechanism, a reflective OS should allow each OS module to have
its own set of meta-level entities and to share the associated information and
functionality. Individual functionality allows distinct policies for different
applications (e.g. distinct scheduling policies), whilst shared function-
ality allows shared facilities (e.g. efficient IPC). The following subsections
discuss existing reflective OSs: ApertOS [128], Chameleon [27],
2K [33, 69, 72] and Spring [115, 116].
ApertOS
ApertOS [28,59,128,129] is one of the first-generation object-oriented reflective
OSs and was designed particularly for use in mobile and distributed comput-
ing environments. It implements a reflective object-oriented framework that
provides support for object migration. The framework introduces the concept
of separating an object from its meta-object, implemented in ApertOS
particularly to aid object migration. Here, an object is consid-
ered to encapsulate: a state, some methods which access its state and a virtual
processor which executes its methods. A meta-object is an object which de-
fines the behaviour of a particular object. For instance, the virtual processor of
an object can be viewed as a meta-object.
In the reflective framework, everything that is shared and protected is
an object. Each object belongs to a particular meta-space consisting of one
or several meta-objects. Figure 2.4 (reproduced from [128]) shows the rela-
tionship between various objects, meta-objects and the respective meta-spaces
they belong to. As an object evolves through its lifetime, its requirements
change. If the meta-space that it belongs to does not support the new require-
ments, then the object can migrate to a different meta-space that provides
the required support. This is particularly useful in the mobile communication
environment where, at one instant, an object might be using a local protocol
for communication and, at the next, might need to use an inter-connection
protocol.

Figure 2.4: Object/Meta-Object Separation and Meta-Hierarchy [128]
The aim of ApertOS is not to adapt its resource management policies, but
to provide support for objects in the system (e.g. the application processes)
to choose the required policies by selecting and migrating to the respective
meta-spaces. Thus, every system module in ApertOS is implemented as a
meta-object and belongs to one or more defined meta-spaces.
In a way, ApertOS itself can be considered a large object using multiple
meta-spaces, each consisting of multiple meta-objects. These meta-objects use
other meta-objects forming a meta-hierarchy. For instance, a meta-object
which implements segmentation in virtual memory, uses another meta-object
which implements paging. The paging meta-object would in turn use a meta-
object which implements the physical memory management.
Objects in ApertOS can migrate to a different meta-space using the
“canSpeak()” method. Each object executing in the system is associated with
a context. ApertOS provides a standard means to compose individual execu-
tion environments for each application. The existence of objects, their state
and object migration are handled by a core module called the MetaCore. The
MetaCore does not belong to any meta-space; it forms the main communication
bridge between the objects and the meta-objects within different meta-spaces.
ApertOS was implemented for the SONY PWS1550 and MC68030 pro-
cessors. The evaluation showed that it spent 40% of its processing time in
saving, finding and restoring system context [128], i.e. the overhead
for reflection in ApertOS was high. Also, ApertOS allowed only a single re-
flective module per meta-level, preventing multiple applications from having
different reflective functionality. Nevertheless, ApertOS provided a new way
of dynamically specialising an OS [128].
Bryce et al. [28] introduced a new pre-emptive hierarchical scheduler,
replacing the existing non-preemptive one in ApertOS to improve its perfor-
mance. It was shown that application performance improved by up to five
times [28].
Chameleon
Chameleon [27] is an object-oriented OS that shares the same philosophical
approach as ApertOS. Based on a µ-kernel architecture, it was mainly designed
for soft real-time multimedia applications. In order to provide better adapt-
ability, Chameleon introduced new concepts such as AbstractCPU, brokers,
and the broker interface hierarchy. Furthermore, techniques such as dynamic
class binding served as a basis for all system modules. Chameleon incorporates
an event-driven model that allows new events to be defined and dynamically
introduced into a running system.
Similar to ApertOS, Chameleon has a hierarchical meta-object structure
wherein the meta-objects actively communicate amongst each other to support
reflection. Consequently, Chameleon exhibited overheads similar to those
associated with ApertOS.
2K
2K [33, 69, 72] is a reflective, component-based distributed OS that uses a re-
flective ORB – DynamicTAO [71] – for dynamic customisation. It incorporates
a middleware layer to admit on-the-fly customisation by dynamically loading
new components into the system. The system software includes models of its
own structure, state and behaviour by using reification. This allows the sys-
tem components to access the system state and check if they need to adapt.
The reflective ORB model provides code update mechanisms to allow dynamic
replacement of system and application components [72].
2K adopts a network centric model in which all the entities, users, the
various system components and devices exist in a network. Each entity has a
network-wide identity, profile and dependencies upon other network entities.
When configuring a particular service, the entities constituting that particular
service are assembled together.
The system configures itself automatically and loads a minimum set of com-
ponents required for executing the user applications. Any further components
are downloaded and configured from the network as and when required. The
philosophy is based upon a “what you need is what you get” (WYNIWYG)
model [69, 72].
In order to achieve this, 2K reifies inter-component dependency. The sys-
tem and the application components need to fulfil an explicit representation
requirement before they can execute. For example, an Internet browser could
specify that it depends upon components implementing an X-Window system,
a local file service, the TCP/IP protocol, and the Java virtual machine [72].
The main motivation for 2K was to manage variation in the environment
(e.g. fluctuations in network bandwidth, connectivity, protocols, error rate)
and the evolution of software and hardware (e.g. software version updates and
hardware reconfigurations) [72]. Adaptation in 2K is driven by environmen-
tal and system software or hardware changes and not by application-specific
requirements. The 2K OS is essentially an OS with a built-in reflective middle-
ware framework (i.e. Dynamic TAO). The dynamic customisation takes place
in the middleware layer.
Spring
Spring [116–118] is a distributed network OS developed to work in a networked
multi-processor environment. Spring uses certain properties of reflection, but
it cannot be considered a completely reflective OS. Reflection in Spring
is used to share information and to represent the system state at any
given time. After prior analysis, information pertaining to an application's
characteristics (e.g. deadline, period, etc.) is placed in the process control
blocks (PCB) of the corresponding application process.
Spring researchers developed three integrated languages for its support.
First, in order to efficiently specify the reflective information within the ap-
plications, high-level programming languages – Spring-C [88] and Real-Time
Concurrent C (RTCC) [52] – were developed. These languages allowed pro-
grammers to specify reflective information such as period, deadline, etc. Each
application in Spring must be programmed in either Spring-C or RTCC.
Second, the System Description Language (SDL) [89] was designed and
implemented to provide the implementation details needed for detailed and
accurate timing analyses. SDL is used to
specify details such as the nodes in a network, the memory layout of the
system, the bus characteristics, etc.
Third, a notational language – Fault Tolerant Entities for Real-Time
(FERT) [25] for specifying fault-tolerant requirements on a task-by-task ba-
sis was designed. FERT allows the designer to treat each FERT object as
a fault-tolerant entity with protection boundaries. Initially, the FERT ob-
jects have no timing and redundancy constraints. The designer then specifies
a set of application modules as part of a single entity. These modules rep-
resent the user-level code for redundant operations. Furthermore, a FERT
designer can also specify one or more adaptive control policies which inter-
act with the scheduling and analysis algorithms (both off-line and online)
to provide dynamic guarantees.
Spring does not provide any mechanism for dynamic adaptation of OS
policies. It encapsulates the static application requirements into the process’s
PCB and does not accommodate any dynamic changes to these requirements.
2.5 Summary
This chapter discussed resource constraints in real-time embedded systems
along with existing OS resource management and specialisation techniques.
The CPU and memory were the two main system resources discussed in this
chapter. Efficient management of these resources by an RTOS is the key to
providing better application support. There exist several resource management
policies for managing the CPU and memory. However, each policy has its own
failure or inefficient-use scenarios. Most policies are generic in nature, provide
average-case support and do not adapt to application-specific requirements.
The OS specialisation techniques help customise certain parts of an OS such
as resource management policies by adapting them to meet the application-
specific requirements. Most techniques are static in nature, i.e. the system is
statically adapted to the given requirements without considering the dynamic
application requirements.
The reflection mechanism, mainly found in programming languages [105],
can be considered a specialisation technique that can help bring about
dynamic changes to OS policies. A reflection mechanism can help an OS
dynamically adapt to the application-specific requirements at runtime.
The use of reflective middleware technology provides applications with an
easy-to-use interface while also allowing dynamic customisation. Most reflective
middlewares provide support for distributed and pervasive computing. They
do not focus on providing application-specific resource management.
On the other hand, reflective OSs such as ApertOS [128], Chameleon [27],
etc. provide support for reflection within the OS itself allowing the system
to undergo changes at runtime. The OS is divided into several reflective ob-
jects which interact with each other using a meta-object protocol. Objects,
including application processes, are grouped to form meta-spaces. An object
in meta-space A can migrate to meta-space B if the meta-space B implements
a feature required by the object. This feature could be a resource management
policy or the implementation of a specific algorithm.
Existing reflective OSs do not provide explicit support for application-
specific resource management. Furthermore, by allowing dynamic customisa-
tion of all components, they increase system complexity and thereby the
overhead due to reflection.
In order to provide application-specific resource management in systems
with constrained-resources, an OS should adapt or change its policies at run-
time according to the application requirements. Most OS specialisation tech-
niques [37, 48] support customisation of a single resource. However, resources
are often inter-dependent, such that a change made to one resource's
management policy could affect another.
The reflection mechanism provides support to bring about dynamic changes
in OS policies. Existing reflective approaches are too complex and focus on
issues other than application-specific resource management. There is a need
for a reflection-based mechanism in an OS that can adapt or change the resource
management policies of the OS according to runtime application require-
ments.
Chapter 3
Reflection in RTOS for Efficient Resource Management
This chapter proposes the generic reflective framework for an RTOS. The
reflection mechanism is modified for use in the context of an RTOS such that
it has little or no overhead in the system. The framework allows fine-grained
changes to the RTOS’s resource management policies by obtaining application-
specific resource requirements from applications and the system modules alike.
This helps build efficient and adaptive resource management modules that
dynamically adapt/change their behaviour according to application-specific
requirements. Also, the implementation and evaluation of a prototype µ-kernel
– DAMROS [95,96], as an instantiation of the framework, is described.
The chapter is organised as follows: the next section discusses existing
properties of the reflection mechanism and presents modifications to it for use
in an RTOS context. This section describes the process of reification, the role
of the kernel, categorisation of information and the in-kernel reflection inter-
face. Section 3.2 presents the generic reflective framework for an RTOS using
the modified reflection mechanism. Section 3.3 describes the implementation
of a prototype µ-kernel – DAMROS [96] along with two example reflective
system modules: a reflective CPU scheduler and a reflective virtual memory
manager. Finally, section 3.4 presents experimental results of applications
using the reflective framework and the two example reflective system modules
implemented in DAMROS.
3.1 Modifications to Reflection Mechanism
On the one hand, the RTOS needs to identify application resource require-
ments and accordingly adapt its resource management policies. On the other
hand, applications need a mechanism to specify their resource requirements to
the RTOS. A reflection mechanism helps bring about dynamic changes in the
behaviour of the RTOS policies and establishes an information exchange path-
way between applications and the RTOS. The approach taken is not to make
the entire RTOS reflective; rather, only the required resource management
modules or applications are made reflective. Furthermore, a resource management
module or an application may choose not to be reflective at all.
Most implementations of reflection mechanisms, in both programming lan-
guages and reflective OSs, use implicit reification [27, 36, 72, 106, 128],
i.e. the mechanism implicitly reifies anything and everything
in the system. This generates an enormous amount of information at runtime,
imposing considerable overhead on the reflective subsystem.
In reflective OSs [27,72,128], the mechanism is implemented with the inten-
tion of transparently allowing dynamic changes to all components in the system.
By default, every component in the system, either a resource management
module or a device driver, is part of the reflective mechanism in the OS. The
main goal is to provide utmost flexibility for runtime changes to the system
and not to provide efficient application-specific resource management support.
Such a fully-fledged implementation of reflection has significant benefits in
terms of the flexibility offered, but it is not essential for efficient resource
management. This thesis aims to use minimal properties of reflection in a way
that is applicable only to the participating OS resource management modules
and the applications, such that the mechanism introduces little or no overhead
to the ones that do not participate.
Reflection depends heavily on the reification of information between the
various base-level and meta-level components [105, 128]. For instance, in
an RTOS context, the flow of reified information is from application to appli-
cation, application to system modules and between different system modules.
It is important that an RTOS moderates and controls the flow of this
information. Such control allows it to restrict any illegitimate use and
also to change the information, if required, for the benefit of the application.
This requires certain modifications to the process of reification. The next
subsection describes the modifications needed to the conventional reification
process.
3.1.1 Modifications to the Process of Reification
It is not necessary to reify all the information available in the system. Conse-
quently, the following changes are made to the reification process:
• Rather than passing the reified information directly to the concerned
meta-level components, it is first passed to the kernel. This helps the
kernel to moderate and have control over the reified information.
• Traditionally, a particular meta-level component receives only the in-
formation reified by its base-level component. The change implies that
not only can one or more meta-level components receive reified infor-
mation from multiple base-level components, but they can also receive
it from non-reflective applications. This is very useful because resource
information is not confined to a particular base-level component.
By using multiple sources, it is possible to obtain more relevant and
accurate information.
• Any information reified is stored in the kernel and passed to the meta-
level components only when explicitly requested. This helps reduce the
communication overhead that might have been caused by the transfer of
unnecessary information between the kernel and the meta-level compo-
nents.
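The modified reification process described by these changes can be sketched as follows. This is a minimal illustration only; KernelCore, reify and request are hypothetical names, not part of any concrete kernel interface:

```python
# Sketch of the modified reification process: reified information is
# stored in the kernel and handed to meta-level components only on an
# explicit request. All names are hypothetical.

class KernelCore:
    def __init__(self):
        self._store = {}            # resourceID -> list of reified items

    def reify(self, resource_id, info):
        # Base-level components (or non-reflective applications) push
        # information to the kernel rather than to a meta-level directly,
        # so the kernel can moderate and control the flow.
        self._store.setdefault(resource_id, []).append(info)

    def request(self, resource_id):
        # A meta-level component explicitly pulls the stored information;
        # nothing is transferred until it asks, avoiding needless traffic.
        return self._store.pop(resource_id, [])

kernel = KernelCore()
kernel.reify("cpu", {"pid": 3, "deadline_ms": 20})
kernel.reify("cpu", {"pid": 7, "deadline_ms": 5})

# The scheduler's meta-level requests CPU information when it needs it.
cpu_info = kernel.request("cpu")
assert len(cpu_info) == 2
assert kernel.request("cpu") == []   # store drained after the request
```

Note that, because the store is keyed by resource rather than by reifying component, several meta-levels can draw on information contributed by many base-levels, as the second modification above requires.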
Figure 3.1 shows the new process of reification involving the flow of in-
formation through the kernel. The kernel core in the figure represents the
minimal part of a kernel that includes support for reflection. Note that the
applications as well as the base-level components, either of system modules
or the applications, can reify information which is then stored in the kernel.
This information, when requested, is transferred to the respective meta-level
components. The dotted lines show the traditional method of reification where
information is passed directly to the meta-level component. The next subsec-
tion describes the role of the kernel in the modified mechanism.
3.1.2 Role of the Kernel
Figure 3.2(a) shows the information exchange mechanism between the base-
level and the meta-levels of a conventional reflection tower.

Figure 3.1: Reification through the Kernel

In the modified reflection tower (see figure 3.2(b)), information reified by any base-level com-
ponent (either application or resource management base-level) is passed to
the RTOS kernel instead of the meta-level. The kernel acts as an information
base for all the meta-levels which then explicitly request for certain category
of information. During this process, the kernel may change the information if
required for the benefit of the entire system. This method allows information
to be shared with not just one meta-level but amongst multiple meta-levels.
Also, many base-levels can share a single meta-level eliminating the need for
redundant meta-levels for each and every base-level in the system.
3.1.3 Component Privileges
In the modified reflection mechanism, it is possible that a meta-level belonging
to one base-level component can affect a change in another base-level compo-
nent. This is accomplished by assigning privileges to both the base-level
and meta-level components.

Figure 3.2: Modifications to Reflection

Privilege assignment is similar to process privi-
leges found in OSs such as Linux [19] where processes with ‘root’ privilege are
superior to the normal user processes.
Base-level Privileges
During initialisation, the application processes and the base-level components
(of applications and system modules) are assigned either an ‘application’ or a
‘system’ privilege by the kernel. These privileges allow the kernel to distinguish
between the application and system base-level components. The system base-
level components are considered superior to those of the applications, i.e. the
kernel assigns a greater importance-level to the information reified by a base-
level having a ‘system’ privilege. The kernel maintains a list of initialised
base-level components in the system.
Each resource in the system has a unique identification number – resour-
ceID. A base-level component is uniquely identified by the resource it repre-
sents (e.g. CPU or memory) and thus, has an associated resourceID. Further-
more, each base-level provides the kernel with a list containing resourceID,
meta-level privilege pairs. This list is used to give ‘read’ or ‘write’ access over
the code/data of that base-level to the requesting meta-level component.
Meta-level Privileges
Each base-level component grants a ‘read’ or ‘write’ privilege to a requesting
meta-level component. During initialisation, a meta-level component requests
the kernel for a privilege over one or more base-level components. If assigned a
‘write’ privilege, the meta-level can affect a change in the base-level component
irrespective of whether it is the meta-level component for that base-level or not.
However, the meta-level components with a ‘read’ privilege can only query the
kernel for the information reified by the particular base-level component.
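The privilege checks described above can be sketched as follows. The class and method names are hypothetical, chosen only to make the read/write distinction concrete:

```python
# Sketch of base-level/meta-level privileges. A meta-level with 'write'
# privilege over a base-level may change it; one with 'read' privilege
# may only query the reified information. Names are hypothetical.

class PrivilegeError(Exception):
    pass

class Kernel:
    def __init__(self):
        self._grants = {}     # (meta_id, resource_id) -> 'read' | 'write'
        self._reified = {}    # resource_id -> reified info

    def grant(self, meta_id, resource_id, privilege):
        assert privilege in ("read", "write")
        self._grants[(meta_id, resource_id)] = privilege

    def query(self, meta_id, resource_id):
        # Any granted privilege suffices to read reified information.
        if (meta_id, resource_id) not in self._grants:
            raise PrivilegeError("no privilege over this base-level")
        return self._reified.get(resource_id)

    def change(self, meta_id, resource_id, new_info):
        # Only a 'write' privilege permits affecting the base-level.
        if self._grants.get((meta_id, resource_id)) != "write":
            raise PrivilegeError("'write' privilege required")
        self._reified[resource_id] = new_info

k = Kernel()
k.grant("sched_meta", "cpu", "write")
k.grant("monitor_meta", "cpu", "read")

k.change("sched_meta", "cpu", {"policy": "EDF"})
assert k.query("monitor_meta", "cpu") == {"policy": "EDF"}

try:
    k.change("monitor_meta", "cpu", {"policy": "RM"})  # read-only
except PrivilegeError:
    pass
else:
    raise AssertionError("read-only meta-level must not write")
```

The same check is what lets a meta-level with 'write' privilege affect a base-level other than its own, while confining 'read' meta-levels to queries.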
Each meta-level component must represent at least one resource identified
by the resourceID. The kernel associates a list of meta-levels along with their
privileges to each base-level component such that it can easily identify the
destination meta-level component when a particular base-level reifies informa-
tion. Similarly, a list of meta-levels is also associated with each non-reflective
application process. The next subsection describes the process in the kernel
for assigning an importance-level to each piece of reified information.
3.1.4 InfoLevel for Reified Information
Any newly reified information is validated by the kernel against existing in-
formation and the state of the system at that time; it is then categorised
accordingly and assigned an importance-level – the infoLevel. Stored
information is categorised with respect to the resource it belongs to. If a piece
of information belongs to more than one resource, a different infoLevel is
assigned for each resource. Information with the highest infoLevel is delivered
first to the requesting meta-level component. The next subsection describes
the process of categorising the reified information.
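Delivery ordered by infoLevel can be sketched with a per-resource priority queue. This is an illustration only; a real kernel would use its own queue structures, and the names below are hypothetical:

```python
import heapq

# Sketch: reified information is stored per resource with an assigned
# infoLevel; requests deliver the highest-infoLevel item first.
# Names and levels are illustrative.

class InfoStore:
    def __init__(self):
        self._queues = {}                     # resourceID -> heap

    def store(self, resource_id, info_level, info):
        # Negate the level so the largest infoLevel pops first
        # (heapq implements a min-heap).
        heapq.heappush(self._queues.setdefault(resource_id, []),
                       (-info_level, info))

    def deliver(self, resource_id):
        queue = self._queues.get(resource_id)
        if not queue:
            return None
        _, info = heapq.heappop(queue)
        return info

store = InfoStore()
store.store("cpu", 1, "priority hint")
store.store("cpu", 5, "deadline miss")       # more important
assert store.deliver("cpu") == "deadline miss"
assert store.deliver("cpu") == "priority hint"
```

Keeping one queue per resource mirrors the categorisation described above: the same piece of information may appear in two queues with different infoLevels.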
3.1.5 Categorisation of Reified Information
Distinguishing useful information from non-useful information is the key to
reducing reification overhead. At runtime, an enormous amount of information
is generated by the base-levels as well as the application processes. It is not
practical to use all the reified information. Thus, information needs to be
categorised according to the resources, and some of it discarded or ignored.
Often, information pertaining to one meta-level is also useful to other meta-
levels. Furthermore, a piece of information relevant to several meta-levels might
be more important for one meta-level than for another. Thus, it is essential to
assign a different importance-level to each category a piece of information belongs to.
Categories are represented by the resourceIDs of the resources. This is
obtained by looking at the meta-level list associated with the respective reify-
ing component. Thus, any reified information can be categorised according to
the resource(s) (e.g. CPU, memory) represented by the meta-levels. The fol-
lowing subsections categorise the information and its use pertaining to: CPU,
memory and other resources.
Information for the CPU resource
Any information that corresponds to a change in the scheduling order, or the
execution time of the processes in a system falls in this category. The following
information belongs to the CPU (scheduling) category:
• process priority: the scheduler uses this information to select the next
runnable process. This information is vital for the CPU scheduler.
• process deadline: this is another vital piece of information which can be
used by the scheduler to raise or lower the priority of a process if using
priority based scheduling.
• scheduling policy: this information suggests the scheduling policy that
is to be used. All the application processes will be affected as a direct
result of any change brought about due to this information. However, in
such a case the kernel should intervene and disallow the change.
There exist several other kinds of information which help in the efficient manage-
ment of the CPU. The above list can be used as general guidance applicable
to any real-time application process but is by no means complete. Additions
to the list are implementation-specific.
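As an illustration of how a meta-level scheduler component might consume such CPU-category information, the following sketch raises a process's priority as its deadline approaches. The policy and all names are hypothetical, not the scheduler actually implemented in DAMROS:

```python
# Sketch: a meta-level scheduler component consuming reified deadline
# information to adjust process priorities. Hypothetical policy.

def adjust_priority(base_priority, now_ms, deadline_ms):
    """Raise priority (larger = more urgent) as the deadline nears."""
    slack = deadline_ms - now_ms
    if slack <= 0:
        return base_priority + 10   # deadline imminent or missed: boost hard
    if slack < 5:
        return base_priority + 5    # very little slack: boost
    return base_priority            # plenty of slack: leave unchanged

# Reified information for two processes in the CPU category.
procs = [
    {"pid": 1, "priority": 2, "deadline_ms": 100},
    {"pid": 2, "priority": 2, "deadline_ms": 12},
]
now = 10
for p in procs:
    p["priority"] = adjust_priority(p["priority"], now, p["deadline_ms"])

# Process 2 (slack 2 ms) is now preferred over process 1 (slack 90 ms).
assert procs[1]["priority"] > procs[0]["priority"]
```

This is exactly the kind of adjustment the deadline item above enables when the scheduler uses priority-based scheduling.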
Information for the Memory resource
The memory resource also includes the possible use of virtual memory techniques such as paging. Any information that suggests the use of memory, whether by executing a piece of code or by accessing data, belongs to this category.
The following is a list of the most important information belonging to this
category:
• read-access: it is difficult for an RTOS to predict the memory access patterns of an application process. Thus, information that suggests a memory read access is valuable for efficient memory management. For instance, in a paged system, this information could be used to keep available only those pages that are actually being read, while moving unused pages to the swap-space.
• write-access: this information is similar to read-access but has additional impact. For instance, in an RTOS implementing the copy-on-write feature [19], a memory write access would trigger a copy operation. Knowing in advance when an application will perform a memory write helps the RTOS make more accurate memory allocations.
• allocation: this information can be obtained implicitly or explicitly from
the system. It provides the RTOS and other reflective modules with
information about a process’s memory usage statistics.
• deallocation: works in conjunction with the allocation information and helps keep track of the amount of memory in use by a process at any given time.
• reservation: asks the RTOS to reserve a certain region of memory, such that the pages belonging to this region are always physically available for access.
There is no limit on the kind or category of information an application process can provide. The above categories should be treated only as a guideline and may differ between implementations.
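The way a paging module might act on these hints can be sketched as follows. The hint names, the page-table representation and the pinning behaviour are all illustrative assumptions, not part of any implementation described here.

```c
#include <stddef.h>

/* Illustrative sketch: acting on the memory hints listed above. */
enum mem_hint { MEM_READ, MEM_WRITE, MEM_ALLOC, MEM_FREE, MEM_RESERVE };

#define NPAGES 16
static int resident[NPAGES];   /* 1 = page kept in physical memory */
static int reserved[NPAGES];   /* 1 = never eligible for swap-out  */

static void apply_mem_hint(enum mem_hint h, size_t page)
{
    switch (h) {
    case MEM_READ:
    case MEM_WRITE:
        resident[page] = 1;                   /* fault the page in early */
        break;
    case MEM_RESERVE:
        reserved[page] = 1;                   /* pin: always available   */
        resident[page] = 1;
        break;
    case MEM_FREE:
        if (!reserved[page]) resident[page] = 0;  /* eligible for swap   */
        break;
    case MEM_ALLOC:
        break;  /* would only update per-process usage statistics here */
    }
}
```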
Information for Other Resources
Although this thesis focuses mainly on the CPU and memory resources, a similar categorisation can be made for other resources in the system. Furthermore, the management code for another resource may choose to receive information reified for the CPU and memory. A resource management module can make more accurate decisions by knowing the status of other resources in the system. For instance, a memory management module may defer handling a page-fault if it knows that the faulting process's CPU budget has expired; switching to the next process rather than handling the page-fault immediately prevents the ready process from waiting unnecessarily.
Power – particularly in the case of battery-operated embedded devices – is another resource for which information pertaining to other resources is very useful.
The next subsection describes the flow of reified information through the
kernel.
3.1.6 Flow of Reified Information
Figure 3.2(c) shows the stages involved in the flow of information through the kernel in both directions: base-level to meta-level and vice versa. Each meta-level component (of an application or a resource management module) is assigned a privilege which governs the information it can access. For instance, an application meta-level entity is not allowed to obtain sensitive information about other application processes; thus, its meta-level component would not be granted the ‘read’ privilege by the scheduler’s base-level component. The next subsection describes the information flow from the base-level to the meta-level.
Base-level to Meta-level
The flow of information from a base-level to a meta-level is divided into two
main phases. In the first phase, information reified by a base-level is passed to
the kernel where it is validated against the privileges of the sending application
process or the base-level component. An importance-level is assigned to the
information depending on the privilege of the sending component and the
resource the information is relevant for.
The kernel then checks for available free memory in the system. If there is no free memory to store the information, then the information is either discarded (if its importance-level is lower than that of the existing information) or existing information with a lower importance-level is deleted to make room for the new entry.
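The store-or-discard decision just described can be sketched as follows. The fixed-size table and the function name are assumptions standing in for the kernel's actual store.

```c
/* Sketch of the phase-one admission decision: when no free slot
 * remains, new information is stored only by evicting the entry of
 * lowest importance among those strictly less important than it. */
#define SLOTS 4

typedef struct { int used; int infoLevel; } slot_t;

/* Returns the slot used to store the new information, or -1 if it
 * was discarded because everything stored is at least as important. */
static int admit(slot_t store[], int infoLevel)
{
    int victim = -1;
    for (int i = 0; i < SLOTS; i++) {
        if (!store[i].used) { victim = i; break; }      /* free slot    */
        if (store[i].infoLevel < infoLevel &&
            (victim < 0 || store[i].infoLevel < store[victim].infoLevel))
            victim = i;                /* least-important evictable entry */
    }
    if (victim < 0)
        return -1;                     /* new information is discarded    */
    store[victim].used = 1;
    store[victim].infoLevel = infoLevel;
    return victim;
}
```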
Once the kernel decides to store the information, it determines the effects the information could have on the system. For example, consider an application process that has requested a higher priority. Once the meta-level gets this information, it might request the kernel to grant the application a higher priority. If this request is granted, it directly affects the other processes executing in the system. Hence, the kernel checks the system state at the time of reification itself, making suitable changes to the information if required (e.g. lowering the requested priority). In this way, care is taken that information reified by one base-level cannot adversely affect other processes or system modules.
In the second phase, the destination meta-level component(s) of the reified
information are determined and accordingly the information is categorised.
Note that information reified by one base-level might be useful to several meta-
level components. In the above example, information about raising a process’s
priority would be useful to the reflective scheduler module as well. Hence,
the kernel categorises the reified information by attaching a list of probable
meta-level(s) to it.
This information is then held in kernel-space until the concerned meta-level(s) request it, until the application that reified the information exits the system, or until it is overwritten by information with a higher infoLevel. The meta-level component may choose to query the information either periodically (e.g. every 10 milliseconds) or intermittently (e.g. after each base-level change it requests). This decision is made by the application or system developer implementing the meta-level code. The next subsection describes the flow of information from the meta-level to a base-level.
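The two querying styles just described can be sketched as a simple predicate that a meta-level might evaluate on each tick; the mode names and tick source are hypothetical.

```c
/* Illustrative sketch of the periodic vs. on-change querying choice
 * left to the developer of the meta-level code. */
typedef enum { QUERY_PERIODIC, QUERY_ON_CHANGE } query_mode_t;

/* Returns 1 when the meta-level should introspect the kernel now. */
static int should_query(query_mode_t mode, unsigned now_ms,
                        unsigned last_ms, unsigned period_ms,
                        int change_pending)
{
    if (mode == QUERY_ON_CHANGE)
        return change_pending;            /* poll after a requested change */
    return now_ms - last_ms >= period_ms; /* e.g. every 10 ms              */
}
```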
Meta-level to Base-level
The flow of information in the form of requests/commands from a meta-level
to a base-level is moderated by the kernel. Any change that a meta-level wants
to bring about has to be passed as a request to the kernel. Again, the kernel validates the request against the privileges assigned to the requesting meta-level, and then analyses the effects the change might have on the entire system. For example, changing a process’s priority as described above will affect the process scheduling order.
Furthermore, depending on the system state at the time of the request, the kernel may make changes to the request itself or to the way it is handled. For instance, consider that the meta-level component of an application requests a higher priority in the system. The kernel cannot bring about this change by changing the application’s base-level component. In this case, the kernel lets the request be handled by the meta-level of the reflective scheduler, which then manipulates the process priorities in its base-level scheduler to effect the change.
The kernel is involved in every activity inside the reflection tower and plays an important role in maintaining system integrity. This additional level of indirection in the flow of reified information might seem to impose considerable overhead on the system. However, since reified information remains in the kernel at any given time and is passed to the meta-level only on explicit request, most of the communication overhead is avoided. Furthermore, the kernel can exercise complete control over the reification process and over reflection as a whole. This also allows the kernel to analyse the existing state of the system and manipulate information, if necessary, for the benefit of the entire system. The key is that the kernel is able to discard unwanted information much earlier in the reflection process, avoiding any additional penalties it might otherwise incur. The modified reflection tower forms the basis of the reflective framework explained later in section 3.2. The next subsection describes the
support for reflection in the form of an in-kernel reflection interface.
3.1.7 In-kernel Reflection Interface
Most OS kernels store, or have knowledge of, the current system state and the occurrence of future events, such as the process that executes next or the expiry of a timer. In addition, applications have information about their own resource requirements and execution behaviour. Providing a mechanism to exchange or share this valuable information is not, by itself, enough to achieve efficient resource management: the kernel also needs to provide additional mechanisms that let the resource management modules query information at will and bring about changes at runtime. The following facilities constitute the in-kernel reflection interface:
• reification: an ability for the reflective system modules, applications
and the kernel alike to reify information. The process of reification via
the kernel has been explained in detail earlier in section 3.1.1. Along with
the information that is to be reified, the interface should also capture
the type of information (e.g. memory allocation, CPU requirement, etc.).
This helps the kernel categorise and assign an importance-level to it.
• introspection: an interface for the reflective modules to inspect the
reified information. This could be in the form of a simple function call
in the case of a single address space RTOS or the use of a system call.
The calling meta-level component would use this interface to query reified
information.
• interception: an interface or mechanism for the meta-level components to intercept the base-level. The interception interface works at function-level granularity, allowing the meta-level components to intercept functions present in a base-level component. A meta-level could either request the kernel to intercept calls to a particular function found within a fixed region of code, or to intercept all calls to the function in the entire base-level. The amount of flexibility, or any additional functionality provided by this interface, is implementation-specific.
• code change: an ability to add, change or install code into the RTOS or the applications. If possible, this interface should make use of the existing dynamic loading features of the OS. The meta-level components should be able to add new code into the base-level such that the new code either complements the existing base-level functionality or completely replaces it. Again, whether the newly added code is dynamically relocated or linked at compile time is implementation-specific.
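As a toy model of the first two facilities, a reify/introspect pair over a small in-kernel store might look like the sketch below. The function names, store layout and size limits are assumptions, since the thesis deliberately leaves the concrete interface implementation-specific.

```c
#include <string.h>
#include <stddef.h>

/* Toy single-slot store per (resource, information type) pair,
 * standing in for the kernel's reified-information store. */
#define MAX_RES  4
#define MAX_TYPE 8
#define SLOT_LEN 32

static struct {
    int valid;
    unsigned char buf[SLOT_LEN];
    size_t len;
} store[MAX_RES][MAX_TYPE];

/* reification: submit typed information to the kernel store */
int reflect_reify(int res, int type, const void *data, size_t len)
{
    if (res < 0 || res >= MAX_RES || type < 0 || type >= MAX_TYPE ||
        len > SLOT_LEN)
        return -1;
    memcpy(store[res][type].buf, data, len);
    store[res][type].len = len;
    store[res][type].valid = 1;
    return 0;
}

/* introspection: query reified information for a given category;
 * returns the number of bytes copied, or -1 if nothing is stored. */
int reflect_introspect(int res, int type, void *out, size_t outlen)
{
    if (res < 0 || res >= MAX_RES || type < 0 || type >= MAX_TYPE)
        return -1;
    if (!store[res][type].valid || store[res][type].len > outlen)
        return -1;
    memcpy(out, store[res][type].buf, store[res][type].len);
    return (int)store[res][type].len;
}
```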
Figure 3.3 shows the use of the in-kernel reflection interface between the
base-level and the meta-level components of a system module or an application.
The kernel is shown to occupy the area between the vertical dotted lines. On
the left-hand side of the kernel are the base-level components including the
applications and system modules and on the right-hand side are the respective
meta-level components. There is no specific order in which the interface is to
be used and the components may either be in an application address space or
the kernel address space depending on the implementation of the OS.
Information reified by the base-level components is stored within the kernel
and sent to the corresponding meta-level component(s) on explicit request.
Similarly, requests from the meta-level to intercept the base-level code or to
install new code into the base-level are passed to the kernel. At any stage, the
[Figure: base-level code of applications and system modules on the left reifies information into the kernel; the corresponding meta-level code on the right reads reified data, intercepts base-level functions and installs code via the kernel.]
Figure 3.3: In-kernel Reflection Interface
kernel has complete control over the information as well as the changes being
made.
3.1.8 Summary
In summary, the conventional reflection mechanism provides promising features but offers little support for central control over information in an RTOS context, and the communication overhead associated with the traditional reification process is not acceptable for real-time performance. The modified reflection mechanism gives the RTOS kernel complete control over reflection. In particular, for soft real-time systems, this means the system can maintain its integrity: no change occurs without the knowledge of the kernel. Reifying via the kernel, so that information is transmitted to the meta-level components only on request, reduces the communication overhead.
There are no guidelines defined for the use of reflection in an RTOS context. What is needed is a reflective framework for an RTOS that lays down a well-defined structure, with useful guidelines, for the development of efficient resource management modules. Such modules could take advantage of reflection and provide the required application-specific support. The next section describes the generic reflective framework for an RTOS.
3.2 Generic Reflective RTOS Framework
The RTOS framework is based on the modified reflection mechanism described in section 3.1. Most commercially available RTOSs are based on either a µ-kernel or a monolithic OS architecture, and there is a significant difference between the two. In a µ-kernel, the system modules – commonly called servers – run as independent processes, similar to the application processes, in either a single or independent address spaces [109]. In a monolithic kernel, however, the system modules are compiled into a single kernel module that operates at specified time intervals (e.g. at a scheduling instance, on a timer interrupt, etc.) [109]. The µ-kernel architecture is modular in nature and can be specialised more easily. The generic reflective framework is applicable to both architectures and is defined as an open framework with no restriction on how it is implemented. The system developer is free to add features to the framework so as to adapt it to the needs of a particular system.
The framework assumes that the underlying OS has the notion of time
such that the framework is aware of the passage of time. The OS may provide
a simple system call such as gettime() which returns the current time in the system. The time may be either an integer value (e.g. clock ticks) or a value in a particular time format (e.g. hh:mm:ss).
The key aspect of the framework is the RTOS kernel. It is the centre-
point for all interactions within the system. The kernel should have reflection
support (as described in section 3.1) that allows the resource management
modules to adapt to application-specific requirements at runtime. The next
subsection describes the core elements of the framework.
3.2.1 Core Elements of the Framework
It is not essential to include all properties of reflection in the framework. This
section describes the core elements (particularly with respect to reflection)
that the framework must include in order to provide the required application-
specific resource management. Under the framework, the kernel must include
support for the following features:
• explicit reification: support for explicit reification, i.e. the source is compiled with explicit reification calls so that only the required information is reified at runtime. These reification calls can either be added manually by the developer or inserted automatically using compiler-assisted techniques. The reification interface adheres to the process explained in section 3.1.1.
• introspection: an interface in the kernel to allow the meta-level com-
ponents to query reified information at will. A simple function call or a
system call interface can be used for this purpose.
• function interception: this interface allows a meta-level component to intercept a function. The core interception mechanism should make it possible:

1. to intercept all calls to a function, or a given number of calls found within a specified region of code,

2. to intercept a function such that control is transferred to the intercepting function provided by the meta-level before the original function code executes,

3. to intercept a function such that control is completely transferred to the intercepting function and the original function is never executed; the intercepting function thus replaces the original function’s functionality.
• causal connection or link: an interface through which a meta-level component can form a causal connection with data in the base-level. This means that a change made by a meta-level component to the causally connected data is reflected on the actual data in the base-level. In the case of a single address space RTOS, this could be achieved by providing the meta-level with a C-like pointer to the data. In the case of a multi-address space RTOS, the causal link facility should at least be supported for data belonging to the same address space.
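In a single address space, the causal link can indeed be as simple as handing the meta-level a pointer into base-level data, so that writes through the pointer are immediately visible at the base-level. The sketch below is illustrative; the structure and function names are assumptions.

```c
/* Minimal sketch of a causal link in a single address space:
 * the meta-level receives a pointer to base-level data, and any
 * write through it is instantly reflected at the base-level. */
typedef struct {
    int pid;
    int priority;       /* base-level scheduling data */
} base_proc_t;

/* the kernel grants the meta-level a causally connected view */
static int *causal_link_priority(base_proc_t *p)
{
    return &p->priority;
}
```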
The next subsection describes the optional elements of the framework that
may or may not be implemented.
3.2.2 Optional Elements of the Framework
In addition to the core elements, the framework may choose to extend or im-
plement some additional elements. The following guidelines provide extended
features to the existing core elements and some additional elements that can
be included in the framework:
• selective introspection: allowing the meta-level to specify the information it is particularly interested in, such that the kernel automatically alerts it once such information is reified. This would also help the kernel assign a better importance-level to the information. The meta-level could set a timeout for such information, after which the kernel would no longer alert it.
• enhanced interception: in addition to the core interception features described in the previous subsection, the following could also be implemented:

1. the ability to intercept a function such that control is transferred to the intercepting function provided by the meta-level after the original function code has executed,

2. the ability to intercept calls to a function rather than the function itself. This would allow the function to behave differently depending on where it is called from. The interface could intercept all, or a given number of, calls to a function within defined code boundaries (i.e. limited by start and end addresses).
• install new code: the ability to install new code into the kernel, either to add extra functionality or to complement the existing functionality. The implementation could use techniques such as dynamic loading and relocation to provide this facility. If the extra functionality is already pre-compiled into the module, then it could be added or removed using the replacement facility provided by the core interception element.
The next sections describe the model for constructing reflective system modules and reflective applications using both the core and optional elements of the framework.
3.2.3 Reflective System Modules
Each reflective system module in the framework is separated into two entities: a base-level and a meta-level. The base-level component implements a standard resource management policy. For instance, the base-level of a reflective CPU scheduler could implement a fixed-priority scheduling policy.
The meta-level component analyses reified information in the kernel pertaining to its base-level (using the introspection interface) and identifies the need for a change in the base-level. This change could be a change of policy (e.g. the EDF scheduling policy instead of FP) or a minor change to data structure(s) in the base-level (e.g. changing a process’s priority). Whether a meta-level component executes as a separate process, or only when information pertaining to it has been reified, is implementation-specific. Ideally, when not required, the meta-level component should remain idle (i.e. consume no CPU time).
Figure 3.4 shows the general structure of a reflective system module. The base-level component sets up meta-level privileges during initialisation and reifies information at runtime. For example, the base-level of a reflective CPU scheduler would reify the status of the current process, information about the current scheduling policy, process priorities, etc. The meta-level component can intercept a base-level function, install new code or establish a causal link in order to change or adapt the base-level’s behaviour at runtime.
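The runtime policy change described above can be sketched with a function pointer standing in for causally connected base-level data; all names here are illustrative, and the two policies are deliberately minimal.

```c
/* Sketch: a meta-level swaps its base-level scheduling policy at
 * runtime by re-pointing a policy hook inside the base-level. */
typedef struct {
    int  pid;
    int  priority;
    long deadline;
} proc_t;

static int pick_fp(const proc_t *p, int n)    /* fixed priority   */
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (p[i].priority > p[best].priority) best = i;
    return best;
}

static int pick_edf(const proc_t *p, int n)   /* earliest deadline */
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (p[i].deadline < p[best].deadline) best = i;
    return best;
}

/* base-level policy hook the meta-level can re-point at runtime */
static int (*sched_pick)(const proc_t *, int) = pick_fp;
```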
It is possible to share a single meta-level component between two or more
[Figure: the reflective system module’s base-level code reifies data into the kernel; its meta-level code reads the reified data, requests interception or code installation, and maintains a causal link with the base-level.]
Figure 3.4: Structure of a Reflective System Module
reflective system modules. The meta-level component in this case simply obtains privileges from the respective base-level components during its initialisation. Sharing a meta-level component amongst two or more base-levels saves the memory that additional meta-level components would require. It is suggested that this facility be used only when the base-level components involved are closely related. For example, in the case of power management, the base-level components of resources such as memory and the CPU could share their meta-level component with the base-level of the power management module. This would provide more information about each resource for efficient power management. However, the possibility and impact of a shared meta-level component has not been explored in this thesis. Note that figure 3.4 does not show shared meta-level(s).
3.2.4 Reflective Applications
Applications are the primary users of the facilities an RTOS provides: the better the support they obtain from the RTOS, the better their performance. The main motivation of this work is not to enable the implementation of reflective applications but to provide efficient resource management support by using the information available within the applications and the system as a whole. However, similarly to the reflective system modules, the framework also supports the implementation of reflective applications.
With the use of privileges, the kernel restricts the way applications use the reflection interface. With the assigned application privilege, an application meta-level component cannot perform the following operations:
• access information reified by the resource management modules (e.g. the system’s process queue) unless given explicit privilege by the respective base-level component of the resource management module. An application is, however, able to access the process queue containing its own child processes/threads.
• install new code into, or replace code in, any of the system-module base-level components when the change affects other applications in the system.
• share a meta-level component amongst multiple application or system base-levels. This ensures complete isolation between different applications.
Other than the above restrictions, the structure of reflective applications is similar to that of the reflective system modules.
There is, however, one more difference: irrespective of whether an application has a meta-level component or not, it is still allowed to reify information.
In this case, the reified information, when stored in the kernel, can be used by
a reflective system module.
Reification plays an important role in extracting valuable information from
the applications. By reifying application-specific resource requirements to the
system, applications in a way control and adapt the RTOS’s reflective resource
management policies. The next subsection describes the meta object protocol
for the reflective modules.
3.2.5 Meta Object Protocol for Reflective Components
The Meta Object Protocol (MOP) provides a basic set of rules/guidelines
that decide how the reflective components (e.g. meta-levels) will operate and
interact with each other in the system. The following are the MOP guidelines
defined for the reflective framework:
• what and how much is reified: the factors affecting this decision are the importance-level of the currently reified information and the availability of memory. The approach is to store as much of the reified information as possible, in order of importance, discarding information with a lower infoLevel to minimise memory utilisation. At runtime, the base-level component uses the reification interface to reify all the relevant information; depending on the availability of memory, the kernel chooses either to store or to discard some of it (more details in section 3.1.4).
• number of allowed meta-levels: ideally, one meta-level component per base-level would suffice. The framework nevertheless supports multiple meta-level components operating one on top of the other; in practice, there is no limit on the number of meta-levels a reflective system module/application can have at any given time. This is achieved by letting the first-order meta-level component act as a base-level component to the second-order meta-level, and so on.
• interaction between different meta-levels: similarly to supporting multiple meta-level components, a meta-level ‘A’ can interact with a meta-level ‘B’ by acting as its base-level component and using reification to pass information to ‘B’. Meta-level ‘B’ can in turn act as the base-level of ‘A’ and reify information to it. This phenomenon can be termed a cyclic reflective tower, in which the meta-level components can interact with and change each other. The interaction between meta-level components must be explicitly configured by granting the required privileges to each other during initialisation, as described in section 3.1.3.
In order to verify the reflective framework in a practical OS context, a prototype RTOS has been implemented. The following sections describe this prototype – DAMROS – which implements two reflective system modules: a reflective CPU scheduler and a reflective virtual memory manager (paging).
3.3 Prototype Implementation – DAMROS
This section presents the prototype implementation of the reflective framework in a home-grown micro-kernel RTOS, DAMROS [95, 96]. DAMROS stands for Dynamically Adaptive Micro-reflective Real-time Operating System. The generic reflective framework allows applications to express their specific resource requirements via the process of reification.
Based on a µ-kernel architecture, DAMROS is implemented as a single address space RTOS for the Intel x86 CPU architecture [61]. It supports virtual memory paging and implements a two-level CPU scheduler. The main goal of implementing DAMROS is to test the reflective framework for application-specific resource management, in particular for the CPU and memory resources. For the prototype, the development of DAMROS was restricted to the implementation of a two-level CPU scheduler, a paged memory management subsystem, and a few device drivers providing an interface to run the experiments.
The core kernel consists of the reflection interface, a minimalist lower-level scheduler (whose main objective is to schedule the various system modules when required, including the higher-level application scheduler) and interrupt handling routines (e.g. the timer interrupt). All the system modules (e.g. the CPU scheduler, memory manager, etc.) execute as separate individual system processes/threads.
In order to support the framework, DAMROS implements a gettime() function which can be used to obtain the current time in the system. The value returned (of data type time_t) is the 64-bit value of the CPU’s time-stamp counter (i.e. the Intel RDTSC machine instruction [38]). Furthermore, DAMROS implements timers with which the processes/threads in the system can sleep until, or be woken up at, a particular time in the future.
Since DAMROS is a single address space RTOS, applications execute in the same address space as the OS. There is no distinction between a process and a thread; the two terms are used interchangeably. The next subsection describes the implementation of the reflection interface in the kernel.
3.3.1 Reflection Interface in the Kernel
According to the reflective framework, the in-kernel reflection interface allows applications and the system modules to reify information from the base-level as well as from the applications; to introspect and intercept the base-level; and to install new code into it. The interface is implemented as two separate components: the rManager, which manages reification and introspection, and the iManager, which manages interception and the installation of code. Before describing the interfaces offered by each component, the next subsection describes the different types of information and the support for reification in DAMROS.
Support for Reification
In DAMROS, each resource is assigned a unique ID (an identification number) which is used when reifying information about that resource. The IDs assigned to the CPU and memory resources are 1 and 2 respectively. DAMROS, implemented in the C language, defines the C constants CPU and MEMORY with values 1 and 2 respectively. Information related to a resource is further categorised by information type, represented as infoType.
Since DAMROS is a single address space OS, reification of information is
accomplished using a direct function call interface. The base-level component
of a reflective module or an application prepares and reifies a data structure
containing the required information. The C language data structure used for
reification is as follows:
typedef struct {
process_id_t pid;
int resourceID;
int infoType;
unsigned data;
void *dataPtr;
int infoLevel;
time_t time;
} reify_t;
In the above data structure, pid represents the unique ID assigned to a process/thread in DAMROS, and resourceID represents one of the resources in the system (i.e. CPU or MEMORY). The significance of the other fields is described in the following subsections. For each resource, a number of different information types (i.e. infoType values) are defined.
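As a hypothetical example of filling this structure, the following sketch reifies a relative deadline for the CPU resource. The reify_t definition is repeated for self-containment, the time type is renamed to avoid clashing with the standard time_t, and the DEADLINE constant's numeric value and the helper's name are assumptions (the infoLevel of 22 follows the list below).

```c
#include <stdint.h>
#include <string.h>

typedef int      process_id_t;   /* assumed representation of a pid  */
typedef uint64_t damros_time_t;  /* stand-in for DAMROS's 64-bit time_t */

typedef struct {                 /* as defined in the text above */
    process_id_t pid;
    int resourceID;
    int infoType;
    unsigned data;
    void *dataPtr;
    int infoLevel;
    damros_time_t time;
} reify_t;

enum { CPU = 1, MEMORY = 2 };    /* resource IDs from the text */
#define DEADLINE 3               /* illustrative constant; the real
                                    value is not given in the text */

/* Fill reify_t for a deadline relative to the current time:
 * dataPtr points at the 64-bit time value and data = 1 marks it
 * as relative (per the DEADLINE infoType description). */
static reify_t make_deadline(process_id_t pid, damros_time_t *ticks)
{
    reify_t r;
    memset(&r, 0, sizeof r);
    r.pid        = pid;
    r.resourceID = CPU;
    r.infoType   = DEADLINE;
    r.infoLevel  = 22;           /* infoLevel for DEADLINE */
    r.dataPtr    = ticks;
    r.data       = 1;            /* 1 = relative to current time */
    return r;
}
```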
CPU infoTypes
Each infoType (an integer constant) has an associated infoLevel field which
depicts the importance of the information. A greater infoLevel value gives
greater importance to the information. For the CPU, DAMROS defines the
following infoTypes (the associated infoLevel value is shown in brackets):
• HI_PRIORITY (infoLevel = 22): used when a process requires a higher priority. If granted, the process obtains a priority one level higher than its existing priority.
• LO_PRIORITY (infoLevel = 21): used when a process requires a lower priority. If granted, the process obtains a priority one level lower than its existing priority.
• CHILD_FP (infoLevel = 25): used by a process to request a Fixed Priority (FP) scheduling policy for scheduling its child threads. For threads with equal priorities, DAMROS uses a pre-emptive FP scheduler that switches between the equal-priority threads in the manner of an RR scheduler. Thus, while all child threads have equal priorities, the existing RR scheduler is used until one of the threads changes its priority.
• NO_CHILD_FP (infoLevel = 24): used by a process to undo the effect of the above request. DAMROS ignores and discards this information if the process is not using the FP scheduling policy.
• DEADLINE (infoLevel = 22): used by a process/thread to specify a time deadline when using the EDF scheduling policy. In this case, the dataPtr field of reify_t points to the 64-bit time_t value. A value of 0 or 1 in the data field indicates whether the given time is absolute or relative to the current time in the system.
• CHILD_EDF (infoLevel = 25): used by a process to request an Earliest-Deadline-First (EDF) scheduling policy for scheduling its child threads.
• NO_CHILD_EDF (infoLevel = 24): used by a process to undo the effect of the above request. DAMROS ignores and discards this information if the process is not using the EDF scheduling policy.
• CHILD_FCFS (infoLevel = 20): used by a process to request a First-Come-First-Serve (FCFS) scheduling policy for scheduling its child threads.
• NO_CHILD_FCFS (infoLevel = 19): used by a process to undo the effect of the CHILD_FCFS request. DAMROS ignores and discards this information if the process is not using the FCFS scheduling policy.
• CHILD_RR (infoLevel = 15): used by a process to request a round-robin scheduling policy for scheduling its child threads.
• CHILD_SCHED (infoLevel = 30): used by a process to request a user-defined (UD) scheduling policy to schedule its child threads. There are two possibilities: (1) the application process implements its own UD scheduler code, or (2) the process wants to use a UD scheduler implemented elsewhere in the system – either in the scheduler's meta-level or in a different process. In case (1), the address of the UD scheduler is placed in the dataPtr field of the reify_t structure and the data field is set to 0. In case (2), the UD scheduler can only be used if the component implementing it allows it; the dataPtr field then points to a string that uniquely identifies the UD scheduler function implemented elsewhere, and the data field is set to 1. Case (2) is described in more detail in section 3.3.3.
• NO_CHILD_SCHED (infoLevel = 29): used by a process to undo the effect of the CHILD_SCHED request, i.e. it disables the UD scheduling policy for the child threads of the calling process. DAMROS ignores this information if no UD scheduler is active for the child threads.
• PROCESS_QUEUE (infoLevel = 35): this infoType is reified by the base-level scheduler. It is used by a UD scheduler to obtain the process queue from the base-level scheduler via the requestInfo() interface (described in section 3.3.2). The process queue thus returned consists of only the child threads of the application implementing the UD scheduler.
DAMROS implements a RR scheduling policy at the base-level with a time quantum of 5 ms. For the infoTypes beginning with 'NO_' (e.g. NO_CHILD_FP), the previously installed scheduler is replaced by the default RR scheduler. Note that an infoType NO_CHILD_RR would have no effect and hence does not exist. The infoLevel values are used by the rManager to prioritise information; the assignment of infoLevels is such that information pertaining to UD scheduling gets higher priority than the rest.
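The ordering just described can be made concrete by collecting the infoLevel values listed above into constants. This is a sketch only: the LVL_ names are illustrative, and the integer codes of the infoTypes themselves are not given in the text.

```c
/* CPU infoLevel values as listed above (LVL_ names are illustrative). */
enum cpu_info_level {
    LVL_CHILD_RR       = 15,
    LVL_NO_CHILD_FCFS  = 19,
    LVL_CHILD_FCFS     = 20,
    LVL_LO_PRIORITY    = 21,
    LVL_HI_PRIORITY    = 22,
    LVL_DEADLINE       = 22,   /* duplicate values are legal in C */
    LVL_NO_CHILD_FP    = 24,
    LVL_NO_CHILD_EDF   = 24,
    LVL_CHILD_FP       = 25,
    LVL_CHILD_EDF      = 25,
    LVL_NO_CHILD_SCHED = 29,
    LVL_CHILD_SCHED    = 30,
    LVL_PROCESS_QUEUE  = 35    /* highest: reified by the base-level */
};
```

Note how every UD-scheduling infoLevel exceeds those of the fixed policies, matching the prioritisation rule above.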
Memory infoTypes
For memory, particularly with respect to paging, DAMROS defines the follow-
ing infoTypes :
• MEM_READ (infoLevel = 25): used by a process to specify that a particular region of memory is being read by the application. The dataPtr field of the reify_t structure holds the starting location of the access and the data field holds the size (in bytes) of the read operation from that location.
• MEM_WRITE (infoLevel = 24): used by a process to specify that a particular region of memory is being written to by the application. The range of the memory region written to is specified in the same way as for MEM_READ.
• MEM_ALLOC (infoLevel = 20): mainly used by the base-level code of the memory manager, but can also be used by an application. It informs the meta-level component of the memory manager about a memory allocation. The range of the allocated memory region is specified in the same way as for MEM_READ.
• MEM_FREE (infoLevel = 19): similar to MEM_ALLOC, but specifies that a memory region has been freed.
• KEEP_ALIVE (infoLevel = 22): used by a process to request that a certain region of virtual memory always remain resident in physical memory.
• ALLOW_DEATH (infoLevel = 21): used by a process to voluntarily suggest that a certain virtual memory region be removed from physical memory (i.e. moved to swap space).
• LRU (infoLevel = 15): used by a process to request an LRU paging policy in the system.
• LFU (infoLevel = 27): used by a process to request an LFU (Least Frequently Used) paging policy in the system.
• NO_LFU (infoLevel = 26): used by a process to undo the effects of an LFU request. DAMROS ignores and discards this information if the process is not using an LFU policy.
• MRU (infoLevel = 29): used by a process to request an MRU (Most Recently Used) paging policy in the system.
• NO_MRU (infoLevel = 28): used by a process to undo the effects of an MRU request. DAMROS ignores and discards this information if the process is not using an MRU policy.
• UD_POLICY (infoLevel = 31): used by a process to request that a UD paging policy be used to manage its memory pages. As with CHILD_SCHED, the same two possibilities apply, and the corresponding fields in reify_t are set as for CHILD_SCHED.
• NO_UD_POLICY (infoLevel = 30): used by a process to undo the effects of a UD_POLICY request. DAMROS ignores and discards this information if the process is not using a UD policy.
• PAGE_TABLE (infoLevel = 35): this infoType is reified by the base-level of the memory manager to expose the page table belonging to an application in the system. The information is used by a UD paging policy to obtain the page table belonging to its application via the requestInfo() interface (described next). The page table thus returned consists of only the pages allocated to the application implementing the UD policy.
Each infoType has a unique integer value. When information is reified, the reify_t data structure containing the appropriate values is passed to the kernel via a system call; in DAMROS, a system call is a direct function call. This data structure is stored in the kernel and passed to the respective meta-level component(s) on request. The next subsection describes the interfaces provided by the rManager component.
3.3.2 The rManager
The rManager component provides several interfaces for the applications and
the system modules to take advantage of the reification mechanism in the
kernel.
The implementation of each interface in terms of the C language specifica-
tion is described as follows:
Interface reify():
SYNOPSIS:
int reify(int infoType, ...);
DESCRIPTION:
Any component in the system, reflective or not – including the applications, the system modules and the kernel itself – can use this interface to reify resource-related information. For example, an application process can reify a request for a higher priority using the following call (C language representation):
reify(HI_PRIORITY);
where HI_PRIORITY is the infoType.
In DAMROS, the reify interface is a C function that accepts a variable number of arguments (using the C stdarg mechanism), so that the component reifying information can supply additional information as required. For instance, to reify a request to install a UD scheduler, the application process must provide the location of the UD scheduler's code. The usage of reify in this context is as follows:
reify(CHILD_SCHED, &UD_scheduler);
where CHILD_SCHED is the infoType and &UD_scheduler is the address of the scheduler code. The function reify() uses the specified infoType, retrieves any additional information from the argument list, and prepares the reify_t data structure.
int reify(int infoType, ...)
{
va_list args;
reify_t *info;
va_start(args, infoType);
info = (reify_t *) malloc(sizeof(reify_t));
info->pid = getpid();
info->infoType = infoType;
info->resourceID = getresourceID(infoType);
info->infoLevel = getinfolevel(infoType);
info->time = gettime();
switch(infoType)
{
case ...
...
...
case CHILD_SCHED:
/* store the address location of UD scheduler */
info->dataPtr = (void *)va_arg(args, unsigned);
break;
...
...
}
va_end(args);
/* pass the information to the rManager */
return rManager_save(info);
}
Figure 3.5: Code Snippet of Reify Interface
Figure 3.5 shows the code snippet of the reify interface handling the infoType CHILD_SCHED. Here, getpid() returns the ID of the calling process and getresourceID() returns the resource ID associated with the passed infoType. In this example, since the infoType CHILD_SCHED corresponds to the CPU scheduler, the resource ID returned by getresourceID() would be 1, i.e. the defined constant CPU. The functions getinfolevel() and gettime() return the infoLevel for a given infoType and the current system time, respectively.
The newly prepared data structure (reify_t) is passed to rManager_save(), where it is either stored or discarded. In the rManager, each piece of reified information carries an infoLevel field signifying its importance. Figure 3.6 lists the code snippet of rManager_save().
Depending on the availability of memory and the infoLevel, the rManager either stores or discards the information. Each resource in DAMROS is assigned 8 KB of memory for storing reified information; the function memFree() returns the memory available for the corresponding resource. When no memory is available, the rManager accommodates the new information (if it has a higher infoLevel) by discarding already stored information with a lower infoLevel value.
The function getLeastInfoLevel() returns the information with the least infoLevel value amongst the information already reified for that resource. If the returned information has a greater infoLevel than the newly reified information, the new information is discarded instead. Each stored item is assigned a unique ID, which is returned by the saveInfo() function.
int rManager_save(reify_t *info)
{
reify_t *curInfo;
int id;
...
/* check if memory available */
if (memFree(info->resourceID) < sizeof(reify_t)) {
/* get stored info with least infoLevel
for the concerned resource (e.g. memory) */
curInfo = getLeastInfoLevel(info->resourceID);
/* if infoLevel of reified info is lower */
if(curInfo->infoLevel > info->infoLevel) {
/* discard the newly reified information */
discardInfo(info);
return -1;
}
/* discard info with a lower or equal infoLevel */
discardInfo(curInfo);
}
/* save the newly reified information */
id = saveInfo(info);
return id;
}
Figure 3.6: rManager: Saving Reified Information
Interface requestInfo():
SYNOPSIS:
int requestInfo( int resourceID
, int infoType
, time_t after
, time_t before
, reify_t* info);
DESCRIPTION:
This interface is used by a meta-level component to obtain reified information stored in the rManager. During system initialisation, each meta-level is assigned a privilege list of <resourceID, access privilege> pairs, granting it read or write access to the resource identified by each resourceID. As per the framework, a base-level provides the kernel with this list during its initialisation, and the kernel assigns the list to the base-level's meta-level component. The rManager checks the privilege of the requesting component before providing it with the requested information.
When calling requestInfo(), it is not necessary for the requesting component to specify all the function parameters. The parameter resourceID must be specified, whereas the rest are optional. When called with only the resourceID, the function copies the resource's reified information with the highest infoLevel into the info parameter.
By specifying the infoType parameter, the function provides the most recently reified information with the given infoType. If the requested information does not exist, the function returns '-1' with the info parameter set to NULL. Using the parameters after and before, individually or together, a meta-level can query information reified after a given time, before a given time, or within a given time window.
Furthermore, a meta-level can use requestInfo() to request a causal link to data present in the base-level. In this case, instead of copying information into the info parameter, a unique ID representing the data structure in the base-level is returned. This ID is then passed to linkData() to establish the causal link. If the causal link is not granted, the function returns '-2'.
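The selection rules described above – a mandatory resourceID, optional infoType and time-window filters, the highest infoLevel by default, and the most recent entry when an infoType is given – can be sketched as a lookup over the stored records. The names and types below are illustrative, not DAMROS's actual internals:

```c
#include <stddef.h>

/* Illustrative record; the real rManager stores reify_t structures. */
typedef struct {
    int  resourceID, infoType, infoLevel;
    long time;
} rec_t;

/* resourceID is mandatory; pass infoType = -1 and after/before = 0 to
 * ignore those filters.  Without an infoType the highest infoLevel
 * wins; with one, the most recently reified match wins.  Returns NULL
 * when nothing matches (the real interface returns -1 instead). */
static const rec_t *select_info(const rec_t *store, size_t n,
                                int resourceID, int infoType,
                                long after, long before)
{
    const rec_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        const rec_t *r = &store[i];
        if (r->resourceID != resourceID)            continue;
        if (infoType != -1 && r->infoType != infoType) continue;
        if (after  > 0 && r->time <= after)         continue;
        if (before > 0 && r->time >= before)        continue;
        if (!best) { best = r; continue; }
        if (infoType != -1 ? (r->time > best->time)
                           : (r->infoLevel > best->infoLevel))
            best = r;
    }
    return best;
}
```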
Interface linkData():
SYNOPSIS:
void* linkData(int ID);
DESCRIPTION:
This interface allows a meta-level to form a causal link with data present in the base-level. Such a link lets the meta-level directly inspect, analyse or modify base-level data without incurring any extra overhead in the system. A call to linkData() must always be preceded by a requestInfo() call, which lets the rManager know what data the calling component wants to link to.
The rManager authenticates the request against the privileges, prepares the causal link and provides the calling component with a unique identifier (ID) representing the request. This ID is then passed to linkData() to form the actual causal link.
An example use of a causal link is as follows: the meta-level code of a reflective CPU scheduler could establish a causal link with the scheduler's process queue. Any change made by the meta-level to the order of the processes in this queue would then affect their scheduling without the knowledge of the base-level. Note that, in order to form a causal link, the base-level code must reify the data and assign a 'write' privilege to the meta-level that requests the link. DAMROS follows the guidelines described in section 3.1.3 to assign privileges to the meta-level components.
For process synchronisation, access to the data is protected by a mutex [19]. The base-level code must provide a mutex along with the data during reification: the dataPtr field in the reify_t structure holds the pointer to the data being reified and the data field holds the mutex. The mutex inherits the meta-level privilege list of <resourceID, access privilege> pairs from the base-level, so that only a meta-level with a 'write' privilege can lock the mutex and use the data.
Interface unlinkData():
SYNOPSIS:
void unlinkData(int ID);
DESCRIPTION:
This interface is used by meta-level code to close/invalidate a previously established causal link. The function uses the given ID to locate the reified data and the associated mutex. The meta-level's entry is disabled in the mutex's privilege list, so that the calling meta-level can no longer lock the mutex to use the data. The entry is re-enabled by calling linkData() again.
Note that the applications, system modules and the kernel are compiled together to run in a single address space, so the above interfaces are equally visible to the applications and the system modules. The reflection interface in DAMROS therefore imposes no additional execution-time penalty from IPC (Inter-Process Communication) mechanisms. Race conditions are avoided by the use of mutexes where necessary.
The next subsection describes the implementation of the iManager component.
3.3.3 The iManager
The iManager component provides interception and code installation interfaces to the system modules as well as the applications. The implementation of each interface in terms of the C language specification is as follows:
Interface allowIntercept():
SYNOPSIS:
int allowIntercept( void *function
, char *func_name
, int resourceID);
DESCRIPTION:
This interface is used by base-level code during its initialisation to allow the interception of a particular function (parameter function) by the meta-level component representing the given resourceID. If the given resourceID is ZERO, any thread in the system is allowed to intercept the function. The iManager verifies and, if required, assigns a 'write' privilege to the corresponding entry in its meta-level privilege list. The interface assigns a unique ID to the function, so that a meta-level or a thread can later use either this ID or the string given in the func_name parameter to intercept the function. This is used for case (2) of both the CHILD_SCHED and UD_POLICY infoTypes of the CPU and memory resources.
Interface interceptAllowed():
SYNOPSIS:
int interceptAllowed( char *func_name
, int func_ID
, void *new_func);
DESCRIPTION:
This interface is used by a meta-level or a thread to intercept a function that is present in a different base-level or implemented by a different process in the system. The component implementing the function must have allowed it to be intercepted using the allowIntercept() interface. Only one of the parameters func_name and func_ID is provided to identify the function to be intercepted. The parameter new_func specifies the location of the new function; once intercepted, control is transferred to new_func. The implementation of interception is described in detail under the interceptCall() interface below.
Interface interceptCall():
SYNOPSIS:
int interceptCall( void *orig_func
, void *new_func
, void *start
, void *end
, int ncalls
, int nparams);
DESCRIPTION:
This interface is used by the meta-level to intercept a given function (parameter orig_func) in the base-level. Before a function is intercepted, the iManager checks that the requesting meta-level has a 'write' privilege (granted via an allowIntercept() call). In DAMROS, there are two methods by which a meta-level can intercept a base-level function:
Method #1:
In this method, after interception, the control is transferred to the function
at location represented by the new func parameter. The original function is
never executed unless explicitly called from within the meta-level function. A
typical use of this type of interception is:
id = interceptCall( &base_level_function
, &meta_level_function
, NULL
, NULL
, 0
, 0);
The above call intercepts the function base_level_function() and transfers control to the function meta_level_function(); the remaining parameters are set to the values shown above.
The operation of interception can be illustrated with an example. Consider that the functions base_level_function() and meta_level_function(), used in the interceptCall(), are located at addresses 0x08048517 and 0x08068000 respectively. For the Intel x86 architecture [61], the machine code at the start of base_level_function() is typically as follows:
0x08048517: 55 => PUSH EBP (1 byte)
0x08048518: 89 E5 => MOV EBP,ESP (2 bytes)
0x0804851A: 83 EC 38 => SUB ESP, 0X38 (3 bytes)
0x0804851D: XX XX XX
...
In the above code, 55 is the opcode for instruction PUSH EBP stored at
address 0x08048517 and the total space taken by this instruction is 1 byte.
In order to intercept this function, a JMP instruction (opcode = E9) is written at the start of the function so that control is transferred to meta_level_function(). The format of the JMP instruction is:
JMP <signed 32-bit displacement>
i.e. E9 <signed displacement> in machine code.
Using this displacement, the control jumps to an effective address calcu-
lated as follows:
Effective Address = (Next Instruction Address) +
(<signed displacement>)
The effective address in the example needs to point to the function meta_level_function(), i.e. address location 0x08068000. Adding the JMP instruction, which is 5 bytes long, at address 0x08048517 makes the next instruction address 0x0804851C. The displacement for the JMP instruction is therefore 0x08068000 − 0x0804851C = 0x0001FAE4. Thus, the above code is changed as follows:
0x08048517: E9 E4 FA 01 00 => JMP 0x08068000 (5 bytes)
0x0804851C: 90 => NOP (1 byte)
0x0804851D: XX XX XX
...
Instructions with opcode value 90 (a no-operation instruction) are inserted to pad the remaining bytes of the overwritten instructions. The iManager creates a function stub on-the-fly by generating the following code:
0x08050000: 55 => PUSH EBP (1 byte)
0x08050001: 89 E5 => MOV EBP,ESP (2 bytes)
0x08050003: 83 EC 38 => SUB ESP, 0X38 (3 bytes)
0x08050006: E9 12 85 FF FF => JMP 0x0804851D (5 bytes)
0x0805000B: C3 => RET (1 byte)
Here, the instructions from address 0x08050000 up to address 0x08050006 are the instructions of the original base_level_function() that were replaced. Note the JMP instruction added at address 0x08050006: it jumps to location 0x0804851D in base_level_function(), i.e. the code that has been preserved after interception. The location of this newly generated code is associated with a unique ID for this particular interception operation.
The meta-level can obtain this location using the helper function void *getOriginalCall(int ID). Thus, a meta-level can explicitly execute the intercepted function, base_level_function() in the example, as follows:
void meta_level_function(int param1, int param2)
{
void (*original_function)(int, int);
...
original_function = (void (*)(int, int)) getOriginalCall(interceptID);
original_function(param1, param2);
}
In the above code snippet, interceptID is a variable known to the meta-level that holds the unique interception ID returned by the interceptCall() interface. Note that, in DAMROS, meta_level_function() has the flexibility to call the intercepted function at any point – at the start, in the middle or towards the end of its own execution. However, if it does not call the base-level function, meta_level_function() effectively replaces the functionality of the base-level function.
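The displacement arithmetic used throughout this example is simply the target address minus the address of the next instruction; for the 5-byte E9 (JMP) and E8 (CALL) rel32 forms it can be expressed as a small helper. This is a sketch for illustration, not DAMROS code:

```c
#include <stdint.h>

/* Signed 32-bit displacement for the 5-byte JMP (E9) / CALL (E8)
 * rel32 forms: target minus the address of the next instruction,
 * which lies 5 bytes past the start of the instruction. */
static int32_t rel32(uint32_t insn_addr, uint32_t target)
{
    return (int32_t)(target - (insn_addr + 5));
}
```

For instance, rel32(0x08050006, 0x0804851D) yields the negative displacement 0xFFFF8512 encoded little-endian as 12 85 FF FF in the stub above.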
Method #2:
In this method, rather than intercepting a function as a whole, calls to a
particular function from within a fixed region of code can be intercepted.
A typical call would be as follows:
id = interceptCall( &base_level_function
, &meta_level_function
, &some_function
, NULL
, 2
, 0);
The above call intercepts calls to base_level_function() made from within the function some_function(). The code starting at the location of some_function() is scanned for calls to base_level_function(); when the parameter end is set to NULL, the scan stops at a return instruction (e.g. RET in Intel x86 assembly [38]), otherwise it continues until the location pointed to by end is reached. If the parameter ncalls has a value greater than ZERO, only the first ncalls calls are intercepted (i.e. 2 in the above call).
The two functions can have a different number of arguments, but the interceptor function (meta_level_function()) must have an equal or smaller number of parameters than the intercepted function (base_level_function()). The parameter nparams of interceptCall() specifies the number of parameters required by the interceptor function; the iManager rejects the request if this number is greater than the number of parameters of the intercepted function. The parameter count of the intercepted function is determined by decoding the machine-code instructions preceding a call to it, since the parameters are pushed onto the stack using PUSH instructions before the call. If the given nparams is less than the number of parameters of the intercepted function, the PUSH instructions for the extra parameters are replaced by NOP instructions. If nparams is 0, no changes are made.
In Intel x86 architecture, the assembly code for a function call instruction
is represented as follows:
CALL <signed 32-bit displacement>
i.e. E8 <signed displacement> in machine code.
Here, E8 is the opcode for a CALL instruction and the displacement is used
to calculate the effective address of the function [38, 61] as described before.
For the same example as above, consider that the parameter start points to location 0x00106C00. The iManager needs to replace calls to base_level_function() with calls to meta_level_function(). It scans the machine code starting from location 0x00106C00 until it detects a CALL instruction, represented as follows:
...
0x00106C7A: E8 98 18 F4 07 => CALL 0x08048517 (5 bytes)
0x00106C7F: XX XX
...
Here, the instruction at location 0x00106C7A is a CALL instruction (opcode = E8). The next instruction starts at address 0x00106C7F. Thus, the effective address of the called function is 0x00106C7F + 0x07F41898 = 0x08048517, which is the location of base_level_function().
In order to transfer control to meta_level_function(), the displacement needs to be changed:
Displacement = 0x08068000 - 0x00106C7F
= 0x07F61381
The code is changed so that the CALL instruction targets meta_level_function() instead of base_level_function(). The instruction stream after the change looks as follows:
...
0x00106C7A: E8 81 13 F6 07 => CALL 0x08068000 (5 bytes)
0x00106C7F: XX XX
...
This process of replacement continues until the required number of calls has been intercepted. The interceptCall() interface returns a unique ID to the calling meta-level component, which can be used in the future to refer to this particular interception operation.
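The scan-and-patch loop described in this method can be sketched over a flat byte buffer. This is illustrative only: the real iManager walks live code, honours the end parameter, and must respect instruction boundaries, which a naive byte-wise scan does not.

```c
#include <stdint.h>
#include <stddef.h>

/* Scan a flat byte buffer for 5-byte E8 CALL instructions whose
 * effective target equals old_target, and redirect up to ncalls of
 * them (all, if ncalls is 0) to new_target.  `base` is the address
 * code[0] is assumed to occupy.  Returns the number of calls patched. */
static int patch_calls(uint8_t *code, size_t len, uint32_t base,
                       uint32_t old_target, uint32_t new_target, int ncalls)
{
    int patched = 0;
    for (size_t i = 0; i + 5 <= len; ) {
        if (code[i] == 0xE8) {
            /* decode the little-endian rel32 displacement */
            uint32_t disp = (uint32_t)code[i + 1]
                          | ((uint32_t)code[i + 2] << 8)
                          | ((uint32_t)code[i + 3] << 16)
                          | ((uint32_t)code[i + 4] << 24);
            uint32_t next = base + (uint32_t)i + 5;  /* next instruction */
            if (next + disp == old_target) {         /* unsigned wraparound
                                                        handles negatives */
                uint32_t nd = new_target - next;     /* new displacement */
                code[i + 1] =  nd        & 0xFF;
                code[i + 2] = (nd >> 8)  & 0xFF;
                code[i + 3] = (nd >> 16) & 0xFF;
                code[i + 4] = (nd >> 24) & 0xFF;
                patched++;
                if (ncalls > 0 && patched == ncalls)
                    break;
                i += 5;
                continue;
            }
        }
        i++;  /* byte-wise scan, as the text describes */
    }
    return patched;
}
```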
Interface uninterceptCall():
SYNOPSIS:
int uninterceptCall( int ID
, boolean keepAlive);
DESCRIPTION:
This interface is used by meta-level code to undo the effects of an interceptCall(). The iManager restores the machine code to its original state. If the parameter keepAlive is set to TRUE, the iManager retains the saved interception information. This lets a meta-level re-intercept a function much faster in the future by using the same ID, eliminating the time required to scan the underlying machine code.
Interface reinterceptCall():
SYNOPSIS:
int reinterceptCall(int ID);
DESCRIPTION:
This interface is used by a meta-level to re-intercept a previously un-intercepted function whose information was retained via the keepAlive parameter of uninterceptCall(). On receiving this request, the iManager checks that the calling meta-level is the owner of the ID and immediately changes the underlying machine code.
Interface installCode():
SYNOPSIS:
int installCode( int resourceID
, void *function
, char *codename);
DESCRIPTION:
This interface is used by a meta-level component to install new code into the system or an application (e.g. a user-defined scheduling policy). It informs the iManager of the existence of a function that can act as a replacement for the base-level of the given resourceID. The function location is specified by the parameter function and is uniquely identified by the string in the parameter codename. The iManager maintains a list containing each function's location, its codename and the resourceID it corresponds to.
Normally, a base-level component or an application thread can reify and use a function it implements itself as an alternative resource management module; for instance, a base-level can reify a request to use a function UD_scheduler() as the scheduler for its child threads, as described in the reification section above. The installCode() interface, in contrast, allows a base-level to use a function that it does not implement and which is present in a meta-level. Once a meta-level installs code using this interface, a base-level or an application can request to use it via its unique name (parameter codename). This interface is similar to allowIntercept(), but it allows the use of a function implemented in the meta-level rather than the base-level.
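The list maintained by the iManager can be sketched as a small registry of <resourceID, function, codename> entries. The names and sizes here are illustrative, not DAMROS's actual implementation:

```c
#include <string.h>

#define MAX_INSTALLED 16   /* illustrative capacity */

/* Illustrative registry entry, mirroring the <location, codename,
 * resourceID> list the iManager is described as maintaining. */
struct installed {
    int         resourceID;
    void       *function;
    const char *codename;
    int         used;
};

static struct installed table[MAX_INSTALLED];

/* Record an installed function; the slot index stands in for the
 * unique ID.  Returns -1 when the table is full. */
static int install_code(int resourceID, void *function, const char *codename)
{
    for (int i = 0; i < MAX_INSTALLED; i++) {
        if (!table[i].used) {
            table[i] = (struct installed){ resourceID, function, codename, 1 };
            return i;
        }
    }
    return -1;
}

/* Delete the entry matching the given resourceID and codename;
 * returns 0 on success, -1 when no such entry exists. */
static int uninstall_code(int resourceID, const char *codename)
{
    for (int i = 0; i < MAX_INSTALLED; i++) {
        if (table[i].used && table[i].resourceID == resourceID
            && strcmp(table[i].codename, codename) == 0) {
            table[i].used = 0;
            return 0;
        }
    }
    return -1;
}
```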
Interface uninstallCode():
SYNOPSIS:
int uninstallCode( int resourceID
, char *codename);
DESCRIPTION:
This interface is used by the meta-level to un-install the code previously in-
stalled using installCode(). The iManager deletes the entry represented by
the given resourceID and codename from the list of installed functions.
In summary, the rManager and the iManager components provide differ-
ent interfaces to the applications as well as the system modules to exchange
information and bring about runtime changes to the resource management
modules. The following section describes the design and implementation of
a reflective CPU scheduler that makes use of these interfaces. DAMROS im-
plements two reflective system modules: reflective CPU scheduler (described
next) and reflective virtual memory manager (described in section 3.3.5).
3.3.4 Reflective CPU Scheduler (VRHS)
The design of the reflective CPU scheduler uses the framework provided by
DAMROS. This section describes a Virtual Reflective Hierarchical Scheduler
(VRHS) model [94] in which threads of a common parent are grouped together
and scheduled by a custom application-specific scheduling policy.
Generally, applications are either developed without knowledge of the RTOS they will execute upon, or developed with a specific RTOS in mind. Consequently, several assumptions may be made about the RTOS early in the development cycle. One such assumption concerns the scheduling policy implemented in the RTOS: the scheduling policy in use controls the timing behaviour of the real-time application threads, so an application developed for one RTOS would behave differently when executed on a different RTOS implementing a different scheduling policy.
The motivation for a hierarchical application-specific scheduling model is to avoid this behavioural impact and to allow application-specific scheduling policies to schedule the threads. This preserves the application's timing behaviour no matter how it was developed: the parent application thread can install a UD scheduling policy to schedule its child threads.
The VRHS model implemented in DAMROS is designed as a two-level
scheduler (see figure 3.7). The lower-level scheduler, called the ‘System sched-
uler ’, implements a fixed priority scheduling policy. The system scheduler is
built into the DAMROS kernel. The higher-level scheduler, called the ‘Appli-
cation scheduler ’, executes as an independent system process/thread which is
scheduled by the system scheduler. The next subsection describes the system
scheduler in more detail.
[Figure: the system scheduler at the lower level schedules application schedulers, which together form a virtual scheduling hierarchy.]
Figure 3.7: Two-level Scheduler in DAMROS
[Figure: the rScheduler (meta-level) sits above the base-level application scheduler in the base kernel core, which also holds application reified data. The rScheduler uses requestInfo() to obtain reified information, linkData() to form a causal link to the process queue, interceptCall() to transfer control on intercepted calls and change behaviour after interception, and installCode() to install user-defined code, replacing the static default scheduling policy.]
Figure 3.8: Structure of Reflective CPU Scheduler Module
System Scheduler
The lower-level system scheduler mainly schedules two important system modules – the application scheduler and its meta-level component, called the 'rScheduler'. The system scheduler uses an FP scheduling policy, with the rScheduler having a higher priority than the application scheduler so that the meta-level always makes its changes prior to the execution of the base-level (the application scheduler).
Furthermore, the system scheduler executes the rScheduler thread only when relevant information pertaining to the application scheduler has been reified. All application threads are scheduled by the application scheduler.
While the system scheduler remains unchanged for the entire lifetime of the system, the application scheduler undergoes many changes (brought about by the rScheduler). Figure 3.8 shows the model of the reflective scheduler module. The rScheduler has access to multiple schedulers and can replace the base-level scheduler at runtime using the interceptCall() interface. The next subsection explains the operation of the application scheduler and the rScheduler component in more detail.
Application Scheduler
DAMROS supports a hierarchical scheduling mechanism, with the base-level code (the application scheduler) implementing a round-robin (RR) scheduling policy with a 5 ms time quantum. Traditional hierarchical scheduling approaches use a tree-based structure in which each intermediate node represents a scheduler. The leaf nodes represent the application threads to be executed. A node at a higher level schedules a scheduler at a lower level, eventually scheduling the application threads [39, 99].
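The tree structure used by traditional hierarchical schedulers can be sketched as follows. The node layout and field names are illustrative assumptions, not definitions taken from DAMROS or the cited schemes.

```c
#include <stddef.h>

/* Hypothetical node of a tree-based scheduler hierarchy. */
typedef struct sched_node {
    const char         *name;       /* e.g. "FP", "EDF", "RR" */
    struct sched_node  *parent;     /* scheduler one level up */
    struct sched_node **children;   /* lower-level schedulers or threads */
    int                 n_children;
    int                 is_leaf;    /* leaf nodes are application threads */
} sched_node_t;

/* Depth of a node: the root (system scheduler) sits at depth 0. */
int sched_depth(const sched_node_t *n)
{
    int d = 0;
    while (n->parent != NULL) {
        n = n->parent;
        d++;
    }
    return d;
}

/* Small demonstration: system scheduler -> application scheduler -> thread. */
int demo_depth(void)
{
    sched_node_t root = { "system",      NULL,  NULL, 0, 0 };
    sched_node_t app  = { "application", &root, NULL, 0, 0 };
    sched_node_t leaf = { "T1",          &app,  NULL, 0, 1 };
    return sched_depth(&leaf);
}
```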
Using reflection, in DAMROS, it is possible to have a virtual hierarchy
of schedulers whilst still maintaining a simple two-level scheduler structure.
DAMROS currently implements the following schedulers: FP, RR, EDF and
FCFS. These schedulers are used by the rScheduler to replace the existing
application scheduler if required at runtime. DAMROS does not support dy-
namic loading. All schedulers have to be preloaded.
The rScheduler (meta-level component of the application scheduler) com-
ponent is designed to run as an independent system thread having a higher
priority than the base-level application scheduler thread. At the time when the
application scheduler needs to schedule the newly added scheduler, the rSched-
uler simply replaces the application scheduler with the new scheduler code.
This action makes the system scheduler transparently schedule the new sched-
uler instead of the application scheduler. Note that neither the application
scheduler nor the system scheduler knows about this change. The rScheduler
then reverts to the original scheduler when the new scheduler is no longer
required to run.
When information pertaining to the CPU is reified, the rManager activates
the rScheduler thread which obtains the reified information and makes appro-
priate changes to the base-level application scheduler if required. The change
may require either the manipulation of base-level data structures or replac-
ing the application scheduler itself (e.g. using linkData() and interceptCall()).
The operation of rScheduler is explained in more detail later. The next section
describes the implementation of the universal run queue in DAMROS.
Universal Run Queue
To facilitate the operation of the rScheduler, DAMROS implements a Universal
Run Queue (URQ) that contains all the runnable threads, maintained in the
order of their execution. All threads in the system are linked together to form
a family tree hierarchy making it easier to trace parent and child relationships.
Furthermore, each thread is associated with a scheduler and threads belonging
to a common parent having the same scheduler are grouped together forming
a process queue for that scheduler. The application scheduler has access to
one such process queue containing only the threads that it is responsible for.
The system scheduler keeps track of the process queue currently in use by the
application scheduler.
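The URQ bookkeeping described above can be sketched with the following structures; the field names and the flat list layout are illustrative assumptions rather than the actual DAMROS definitions.

```c
#include <stddef.h>

/* Hypothetical per-thread record in the Universal Run Queue. */
typedef struct urq_thread {
    int                tid;
    int                sched_id;    /* scheduler this thread belongs to */
    struct urq_thread *parent;      /* family-tree link */
    struct urq_thread *next_in_q;   /* next thread in the process queue */
} urq_thread_t;

/* Count the threads grouped under one scheduler's process queue. */
int process_queue_length(const urq_thread_t *head, int sched_id)
{
    int n = 0;
    for (; head != NULL; head = head->next_in_q)
        if (head->sched_id == sched_id)
            n++;
    return n;
}

/* Demonstration using the example below: threads T5..T7 of application
   A2 are grouped under one scheduler (id 2 is an arbitrary label). */
int demo_edf_queue_length(void)
{
    urq_thread_t t7 = { 7, 2, NULL, NULL };
    urq_thread_t t6 = { 6, 2, NULL, &t7 };
    urq_thread_t t5 = { 5, 2, NULL, &t6 };
    return process_queue_length(&t5, 2);
}
```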
The representation in the URQ can be illustrated with an example. Sup-
pose that there are three applications executing in the system. Threads T1 to
T4 belong to application A1, threads T5 to T7 belong to application A2 and
threads T8 to T12 belong to application A3. It is known that threads T1 to T4
are best scheduled using the FP scheduling policy, while threads T5 to T7 require
the EDF scheduling policy and threads T8 to T12 require the FCFS scheduling
policy. Thus, in all, there are 12 threads executing in the system. For simplicity,
it is assumed that none of the threads use shared resources or block on I/O.
Figure 3.9: URQ: Representation of Threads
Figure 3.9 shows the representation of the URQ for the above example.
The CPU bandwidth is equally divided amongst different applications in the
system. If the application uses a different scheduler for its threads then the al-
located CPU bandwidth is distributed to the threads depending on the schedul-
ing policy.
As per the example, the application scheduler should schedule the lower-
level schedulers – either FP, EDF or FCFS – which in turn schedule the cor-
responding application threads.
Distribution of CPU bandwidth
The VRHS model distributes the CPU bandwidth amongst each scheduler used
by the application threads. Using a timer in DAMROS, the rScheduler thread
gets activated when the CPU budget allotted to a scheduler is exhausted. The
CPU bandwidth in terms of CPU time for the above example is distributed
as follows: there are 3 different applications in the system. If scheduled using
RR scheduler, each application thread is executed for 5 ms in a RR order.
The CPU bandwidth is to be divided equally amongst all three applications.
Thus, an application with the least number of threads is considered. In this
case, it is application A2. The total CPU bandwidth to be allocated to each
scheduler is calculated as the product of the time quantum of RR scheduler
and the smallest number of threads in an application; i.e. each scheduler in the
second level is allocated 5 ms × 3 = 15 ms.
Depending on the scheduling policy, the allocated CPU bandwidth might
be used by all, a few or only one of the threads attached to that particular
scheduler. Note that the execution of the scheduler code at any level is not
accounted for when distributing the CPU bandwidth.
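As a concrete check of the arithmetic above, the budget rule can be sketched in C. The function name and millisecond units are illustrative, not part of DAMROS.

```c
/* Sketch of the bandwidth rule described in the text: each second-level
   scheduler receives (RR time quantum) x (smallest thread count among
   the applications). */
int scheduler_budget_ms(int rr_quantum_ms, const int *threads_per_app, int n_apps)
{
    int min = threads_per_app[0];
    for (int i = 1; i < n_apps; i++)
        if (threads_per_app[i] < min)
            min = threads_per_app[i];
    return rr_quantum_ms * min;
}
```

For the running example (applications with 4, 3 and 5 threads, a 5 ms quantum), this yields the 15 ms figure derived above.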
Shared Resources
When using different schedulers, the framework restricts the sharing of
resources to threads belonging to a single application; that is, threads in
different applications using different scheduling policies cannot share resources
amongst each other.
Threads of the same application that are scheduled by a single scheduler can
share resources. Each shared resource is associated with a mutex for thread
synchronisation.
synchronisation. In this case, each mutex has an associated list consisting of
the threads that are allowed to lock it. In DAMROS, when a thread blocks
on a shared resource, the thread that currently holds the mutex is executed
at high priority until it releases the mutex. This is done by the rScheduler.
When a thread A tries to lock an already-locked mutex, it is blocked and the
rScheduler thread is activated by setting the appropriate flag. If the mutex
is currently locked by thread B, then irrespective of the scheduler being
used by the application, the rScheduler manipulates the process queue of the
scheduler such that thread B uses the allocated CPU bandwidth until it releases
the mutex. On release of the lock, the previously blocked thread A becomes
runnable and the rScheduler is activated to reset all the changes it made to
the process queue.
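The queue manipulation performed by the rScheduler when a thread blocks on a mutex can be sketched as a simple move-to-front on the process queue. The structure and function names are illustrative assumptions; the real DAMROS implementation is not shown in the text.

```c
#include <stddef.h>

/* Hypothetical entry in a scheduler's process queue. */
typedef struct pq_thread {
    int               tid;
    struct pq_thread *next;
} pq_thread_t;

/* Move the mutex holder to the head of the queue so it consumes the
   allocated CPU bandwidth until it unlocks; returns the new head. */
pq_thread_t *boost_mutex_holder(pq_thread_t *head, int holder_tid)
{
    pq_thread_t *prev = NULL, *cur = head;
    while (cur != NULL && cur->tid != holder_tid) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL || prev == NULL)
        return head;          /* not found, or already at the head */
    prev->next = cur->next;   /* unlink the holder */
    cur->next = head;         /* relink it at the front */
    return cur;
}

/* Demonstration: thread 3 holds the mutex and is moved to the front. */
int demo_boost(void)
{
    pq_thread_t c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    pq_thread_t *h = boost_mutex_holder(&a, 3);
    return h->tid;
}
```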
Operation of VRHS Model
Figure 3.10 shows the virtual structure of the schedulers in the VRHS model
that schedule the application threads for the previous example. The system
scheduler invokes only one lower-level scheduler (the application scheduler) at
any given time. The rScheduler changes the base-level application scheduler
code such that the scheduler next required to schedule the threads is directly
invoked by the system scheduler, avoiding an additional level of indirection
through the application scheduler (root node of the hierarchy).
The rManager activates the rScheduler thread when information concern-
ing the CPU is reified. In figure 3.10, threads T8 to T12 are shown ready to
[Figure: the system scheduler in the kernel invokes the application scheduler; the rScheduler thread intercepts it so that the FP, EDF, FCFS or UD policy directly schedules its own threads (T1–T4 under the FP policy, T5–T7 under the EDF policy, T8–T12 under the FCFS policy, shown ready).]
Figure 3.10: Operation of the VRHS Model
be scheduled for the first time in the system. At this point, the rScheduler
thread replaces the application scheduler with the FCFS scheduler using the
interceptCall() interface and sets a timer that expires after a time equal to the
CPU budget of the FCFS scheduler. The system scheduler schedules the
application scheduler, which now implements the FCFS scheduling policy. Note
that neither the system scheduler nor the application scheduler is aware of this
change.
The rScheduler gets activated for two reasons: one is when informa-
tion is reified (activated by the rManager) and the other is when the ap-
plication scheduler needs to be changed (expiry of timer). Any new sched-
uler implementation is added to the system using either installCode() or
reify(CHILD SCHED, &UD scheduler) call. The VRHS model makes it simple
to add schedulers to the hierarchy.
Using this model, a scheduler hierarchy of any depth can be virtually cre-
ated. In the previous example, suppose that thread T9 spawns child threads
T13 to T16 and introduces a UD (application-specific) scheduling policy to
schedule its child threads. In this case, the CPU budget is recalculated for the
threads of the FCFS scheduler alone such that the CPU budget of T9 is shared
by its child threads as well. Appropriate changes are made to the URQ. Just
before scheduling the threads T13 to T16, the rScheduler changes the FCFS
policy at the application scheduler to the new UD scheduler. Now there ex-
ists a three-level virtual hierarchy of schedulers in the model. However, the
rScheduler always maintains a two-level scheduler hierarchy at any time.
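One plausible reading of the budget recalculation above, under the assumption that each thread initially receives an equal slice of its scheduler's budget and that a parent's slice is divided equally among its children, can be sketched as follows. The formula and the microsecond units are illustrative guesses; the thesis does not specify the exact recalculation.

```c
/* Hypothetical recalculation: when a thread spawns n_children with
   their own UD scheduler, the parent's per-thread slice of its
   scheduler's budget is shared equally among the children. */
int child_slice_us(int scheduler_budget_us, int n_threads_in_scheduler,
                   int n_children)
{
    int parent_slice = scheduler_budget_us / n_threads_in_scheduler;
    return parent_slice / n_children;
}
```

For the example, the FCFS scheduler's 15 ms budget over five threads gives T9 a 3 ms slice, which T13 to T16 would then share at 750 µs each under this assumption.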
Minimising Context Switches
Both rScheduler and the base-level application scheduler threads should use
as little time as possible. The implementation of the rScheduler thread is such
that it is not dependent on any external parameters (e.g. shared resources) and
can execute to completion in a single shot (without interruption). Thus, rather
than context switching to it, the system scheduler does a simple procedure
call to jump directly to the thread’s code, eliminating any context-switching
overhead associated with the execution of the rScheduler thread.
All the in-built schedulers in DAMROS are similarly designed to be ex-
ecuted by a procedural call instead of context switching to them. Thus, if
all the scheduling policies in use by the application threads are the in-built
ones, then the VRHS model incurs no context switch overhead due to the
various schedulers in the hierarchy. However, when a UD scheduler is used,
the rScheduler sets a contextSwitch flag in DAMROS, such that the system
scheduler context switches to the application scheduler instead of the usual
procedural call.
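The dispatch decision just described can be sketched as follows: in-built schedulers are entered via a plain procedure call, while a set contextSwitch flag forces a full context switch for UD schedulers. The counters and function names are illustrative; only the contextSwitch flag is named in the text.

```c
/* Sketch of the dispatch choice made by the system scheduler. */
typedef void (*sched_fn_t)(void);

static int context_switch_flag = 0;  /* set by rScheduler for UD schedulers */
static int switches_taken = 0;       /* instrumentation for the demo */
static int direct_calls = 0;

/* Stand-in for a real context switch; here it just runs the scheduler. */
static void fake_context_switch(sched_fn_t f) { switches_taken++; f(); }

void dispatch_application_scheduler(sched_fn_t sched)
{
    if (context_switch_flag)
        fake_context_switch(sched);  /* UD scheduler: full switch */
    else {
        direct_calls++;
        sched();                     /* in-built scheduler: plain call */
    }
}

static void dummy_sched(void) { }

/* Demonstration: one direct call, then one context switch. */
int demo_dispatch(void)
{
    context_switch_flag = 0;
    dispatch_application_scheduler(dummy_sched);
    context_switch_flag = 1;
    dispatch_application_scheduler(dummy_sched);
    return direct_calls * 10 + switches_taken;
}
```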
Operation of the rScheduler
Figure 3.11 lists the pseudo code of the rScheduler thread. The reified infor-
mation pertaining to the CPU is processed, adding or removing schedulers to
the hierarchy or manipulating the base-level data structures. Later, the URQ
is checked to see if the next scheduler to be active is different from the current
scheduler. This is true when the rScheduler is activated by an expired timer,
in which case the application scheduler is replaced using the interceptCall()
interface. The pseudo-code in the figure is self-explanatory. The next subsection
discusses some of the issues related to VRHS or the hierarchical scheduling
schemes in general.
Issues with Scheduler Behaviour
Due to the hierarchical scheduling structure, various kinds of schedulers
are active in the hierarchy. It can be difficult to determine the state or
behaviour of the system at any given point in time or to accurately distribute
CPU bandwidth amongst all co-existing schedulers and the threads in the sys-
tem. This issue is often attributed to the scheduler composition problem [99]
which arises from the incorrect choice of schedulers that coexist in the system.
Also, in the hierarchical model the distribution of CPU bandwidth amongst
the application threads varies and is dependent on the type of schedulers com-
posed in the hierarchy. Furthermore, this distribution is in direct proportion
to the scheduler type used at each level. Though the composition of certain
scheduling policies is flawed [99], the VRHS model still allows the existence
void rScheduler(void)
{
reify_t info;
void *next_scheduler;
/* read all reified information and perform any operations
if required */
while ( requestInfo(CPU, 0, 0, 0, &info) != -1 ) {
...
if(isSchedulerAdd(info.type)){
/* if reified info requires adding a new scheduler then
group the corresponding threads forming a new process
queue for the new scheduler and setup the URQ
*/
}
else if(isSchedulerRemove(info.type)){
/* if reified info requires removing a scheduler then
regroup the corresponding threads adding them to the
process queue of the scheduler one level up the
hierarchy and accordingly setup the URQ.
*/
}
else if(isDataChange(info.type)){
/* make sure the requesting thread is not a parent
application thread, then change thread priority
or deadline if FP or EDF policy is used.
Otherwise ignore.
*/
}
...
} // while loop
/* Analyse the hierarchy and determine
the next level scheduler. */
next_scheduler = next_level_scheduler(_current_apps_scheduler);
/* change scheduler only if required */
if ( next_scheduler != _current_apps_scheduler ) {
/* Using DAMROS reflection API intercept the
current scheduler and change it to next scheduler.
NOTE: this call changes the program code
*/
interceptCall(_current_apps_scheduler, next_scheduler, NULL, NULL, 0, 0);
_current_apps_scheduler = next_scheduler;
} // end if (interception)
/* relinquish CPU and let
application scheduler execute */
} // end rScheduler
Figure 3.11: Pseudo-code of rScheduler Thread
of such schedulers without incurring a high penalty on other application threads
not scheduled using those particular schedulers. The problem can be addressed
by the use of a budgeting system as employed in [12, 30, 102]. In this case, all
threads share a certain amount of CPU bandwidth that is negotiated using a
contract with the main scheduler in the system. The scheduler then prioritises
the threads for scheduling and sets timers to expire when the budget of the
currently executing thread is exhausted.
The VRHS model is presented as a simple example showing the potential
of the framework and the amount of flexibility it offers. The next subsection
describes the sample implementation of an application-specific UD scheduler.
Application-Specific UD Scheduler
This subsection provides guidelines for the development of application-specific
UD schedulers that can be accommodated into the VRHS model. Figure 3.12
shows a flow chart representing a typical application-specific scheduler.
The shaded blocks shown are application-specific and need to be imple-
mented by the application developer, while the rest of the blocks are provided
by VRHS model. Block 1 contains the code that requests the process queue. If
granted, the queue consists of only the threads belonging to the application of
the requesting scheduler. The UD scheduler requests the process queue each
time it is invoked. Block 2 checks if the request for process queue has been
granted. If not granted, then the Failure recovery routine implemented by the
application developer is executed (block 4). This code could either do a retry
using timeout (shown as dotted line in figure) or could simply shutdown the
application.
If the request is granted, a causal link with the requested process queue
[Figure: flow chart of a typical application-specific UD scheduler. Block 1: request the application-specific process queue. Block 2: request granted? If not, block 4: failure recovery (with an optional retry, shown as a dotted line). If granted, block 3: link to the process queue; block 5: make the scheduling decision using the queue; block 6: context switch to the selected thread (end).]
Figure 3.12: Application-specific UD Scheduler Blocks
is established in block 3. The code in block 5 (provided by the application
developer) uses the process queue to determine the next thread to be scheduled
using an UD scheduling policy. Block 6 context switches to the selected thread.
This structure, if used by the UD schedulers, ensures uniformity across all
UD schedulers in the system and helps preserve system integrity. By hiding away
the complexity, the reflective framework lets the application developer concen-
trate on the UD scheduling policy alone.
Figure 3.13 lists the pseudo code of a UD scheduler implementing the
FCFS scheduling policy. This UD scheduler schedules all the ready threads
in FCFS order. If the first thread that entered the system is not ready, the
scheduler executes the next ready thread. In the pseudo code, requestInfo() call
void fcfs_scheduler(void)
{
    process_queue_t *p;
    thread_t *next_thread;
    int id = 0, retry = 0;

try_again:
    /* request the process queue */
    id = requestInfo(CPU, PROCESS_QUEUE, 0, 0, NULL);

    /* request granted if id is greater than 0 */
    if ( id > 0 )
    {
        /* causal link to the requested data */
        p = (process_queue_t *)linkData(id);

        /* FCFS scheduling policy:
           find the first READY thread in the queue */
        for(next_thread = p->head;
            next_thread != NULL;
            next_thread = next_thread->next_in_q)
        {
            /* check if the thread is ready to RUN */
            if(next_thread->state == THREAD_READY){
                break;
            }
        }

        /* activate rScheduler to change the application
           scheduler if no thread is ready to run */
        if(next_thread == NULL){
            activate_rscheduler();
        }

        /* context switch to the selected thread */
        context_switch_to(next_thread);
    }
    else {
        /* kill the application after 5 retry attempts */
        if( ++retry == 5 ){
            kill_application();
        }
        goto try_again;
    }
} // end of FCFS scheduler
Figure 3.13: User-Defined FCFS Scheduler
requests the rManager to establish a causal link to the application’s process
queue. The rManager validates this request, checks if the calling component
is the scheduler for the application process using the URQ and returns an ID
if successful. The scheduler uses a ‘linkData(id)’ call to form a causal link
with the process queue. The remaining code carries out the FCFS scheduling
policy by switching to a ready thread in FCFS order. Note that most of the
complexity of dispatching, timing/event management, etc. is hidden from the
UD scheduler making it easier to implement and maintain.
In order to use this scheduler for its child threads, the parent application
thread would use the reify(CHILD SCHED, &fcfs scheduler) call. The next
subsection describes the design and implementation of the reflective memory
management system in DAMROS.
3.3.5 Reflective Memory Management System
(RMMS)
Memory in real-time embedded systems is an important resource and needs
to be managed efficiently. This section presents the reflective memory man-
agement system (RMMS) implemented in DAMROS. Memory requirements of
complex embedded applications (e.g. multimedia applications) vary dynami-
cally at run-time. DAMROS implements a paged virtual memory management
scheme. The size of each memory page is set to 4 KB.
With paging, code/data residing in consecutive virtual memory pages need
not be physically contiguous in memory, eliminating external memory
fragmentation. The use of auxiliary memory as swap space allows applications
with larger memory requirements to use more memory than is actually (phys-
ically) available. However, the page-swap operations associated with paging
are considered to incur significant performance penalties, which is why paging
has not been widely deployed in embedded systems. This thesis contends
that, by proper use of information about the memory access patterns
of applications, it is possible to efficiently manage memory and reduce the
associated paging overhead.
In a paged system, the main operation of a memory manager (MM) is to
allocate memory to a requesting system module/application. In DAMROS,
memory is allocated in pages, i.e. the size of memory allocation is in multiples
of a page. When allocating a new page, if there is no free memory page
available for allocation then the memory manager selects an already allocated
page (called the victim page) for allocation by moving its contents to the swap
space. This process is called page swapping, in which the contents of a memory
page are moved to the swap space or vice-versa.
If a process tries to access a swapped page, then the system generates a
page-fault transferring the control to the RTOS’s page-fault handler routine.
The page-fault handler copies the contents of the swapped page back into
memory. This operation may also cause another memory page to be swapped
if there are no free memory pages.
Since the auxiliary memory device is slower than the main memory, each
page swap operation costs a certain number of CPU cycles or CPU time. Thus,
for better system performance, it is important that the Memory Management
(MM) module minimises the total number of page-faults (i.e. page swap op-
erations). Ideally, the MM module should select a victim page such that, once
swapped, the page is not be accessed in the immediate future. This ideal case
is nearly impossible to implement. This is because, it is difficult if not im-
possible to predict, at runtime, the memory access patterns of all application
128
threads running in the system. Furthermore, it cannot be ascertained in ad-
vance whether a particular victim page would be required in the immediate
future.
There are several page replacement policies such as Least Recently Used
(LRU) [120], Most Recently Used (MRU) [120], etc. and various optimisations
to these policies which an MM can implement. However, none of these policies
satisfies the ideal requirement; most provide only average-case support.
Using the reflection mechanism, it is possible to obtain reified information
about the memory access patterns of the applications. Such information can
describe accesses to a particular memory location, or suggest a UD
policy. The MM module can adapt the underlying paging policy accordingly
to provide better support. The RMMS module makes use of the reflection
framework in DAMROS [96] to either dynamically adapt the page replace-
ment policy or change it to use a different policy depending on application
requirements.
The RMMS model is shown in figure 3.14 (see upper-right corner). It con-
sists of a base-level MM component implementing a standard page replacement
policy and a meta-level component. The meta-level uses the requestInfo()
interface to obtain information about memory accesses that is reified either by
the applications or by the base-level MM module.
The base-level maintains a global page list containing each memory page
along with the time it was last accessed, the frequency of access and a page
flag representing whether the page is new/old or has been marked as a victim
for swapping. Furthermore, pages belonging to different applications are also
grouped together to form individual page tables for each application. A meta-
level can simply make changes to the ordering of the pages in this global page
[Figure: the currently executing application process reifies information via reify(); within DAMROS, the meta-level code of the RMMS effects changes in the base-level memory manager, alongside the application scheduler.]
Figure 3.14: Structure of the RMMS Model
list to adapt the existing paging policy. The base-level would not be aware of
such a change and would continue its normal operation.
The left-hand side of figure 3.14 shows an application process reifying
information. The reified information could be any of the following: an access
to a memory region, a memory region that is no longer required, an allocation
of memory, or a request for a change in policy.
If information about a future memory access reaches the RMMS meta-level
component, it analyses the global page list to check whether the corresponding
memory pages are marked as victim pages (for reclamation) by the base-level.
If true, then it changes the page flag accordingly to avoid the unnecessary page
swap operation that could have resulted in multiple page-faults.
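The victim-flag adjustment described above can be sketched as follows: when a future access to a region is reified, the meta-level clears the victim mark on any overlapping pages so they are not swapped out. The flag bit, structure layout and names are assumptions for illustration.

```c
/* Hypothetical global-page-list entry and victim flag. */
#define PAGE_VICTIM 0x1
#define PAGE_SIZE   4096UL

typedef struct gpage {
    unsigned long addr;   /* page-aligned base address */
    unsigned int  flags;  /* victim/new/old marks */
} gpage_t;

/* Clear the victim flag on every page overlapping [start, start+len);
   returns the number of pages rescued from swapping. */
int unmark_victims(gpage_t *pages, int n, unsigned long start, unsigned long len)
{
    int cleared = 0;
    for (int i = 0; i < n; i++) {
        if (pages[i].addr + PAGE_SIZE > start &&
            pages[i].addr < start + len &&
            (pages[i].flags & PAGE_VICTIM)) {
            pages[i].flags &= ~PAGE_VICTIM;
            cleared++;
        }
    }
    return cleared;
}
```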
In DAMROS, the RMMS base-level implements a clock-based least re-
cently used (LRU) page replacement policy similar to the Linux OS [54] which
henceforth referred to as the LRU policy. DAMROS maintains statistical
information that reflects the current memory usage of application threads
executing in the system. The meta-level component is scheduled by the sys-
tem scheduler as a high priority kernel thread (one priority lower than the
rScheduler). It is activated by the rManager when information pertaining to
memory is reified. Other than manipulating the global page list, the meta-
level intercepts the base-level to replace it with application-specific UD page
replacement policies as and when required.
The meta-level of RMMS module in DAMROS is called rVMM. Following
are the rules that govern the functioning of rVMM :
• an application can request a change in the paging policy. This change,
if granted by the rVMM, is applicable only to the memory pages
of the concerned application; i.e., an application can only make changes
to the pages it uses and not to the pages used by any other thread.
• an application can request that a particular region of memory be
locked or unlocked, so that it is either never swapped out or made
available for swapping by the MM code. This is similar to the mlock()
mechanism found in the Linux OS [19].
Figure 3.15 shows the structure of the RMMS module which is similar to
the reflective scheduler module shown in figure 3.8. DAMROS has built-in
implementations of MRU and LFU page replacement policies along with the
LRU policy at the base-level.
The applications would typically reify information about their memory
access patterns to rVMM as follows:
reify(MEM_READ, &my_var, 256);
[Figure: the base-level virtual memory manager (static default VMM policy, MRU, optimised and user-defined paging policies) sits on the base kernel core, which also holds the application reified data; the meta-level rVMM uses reify(), installCode(), requestInfo(), interceptCall() and linkData() to intercept calls, maintain a causal link to the page tables and change base-level behaviour.]
Figure 3.15: Reflective Memory Management System (RMMS)
The above reification call suggests that 256 bytes of data, starting at the
location pointed to by my_var, are being read by the application. The reify
interface prepares the reify_t data structure such that the dataPtr field
contains the starting memory location (i.e. the address of my_var) and the
data field contains the size of the access (i.e. 256 in this case). The rVMM
is activated by the rManager when such information is reified. When executed,
it adjusts the page flags accordingly.
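The way the reify interface might populate the record can be sketched as follows. The dataPtr and data fields come from the text; the type field and the overall layout of reify_t are assumptions for illustration.

```c
/* Hypothetical reified-information record; only dataPtr and data are
   named in the text, the rest is an assumed layout. */
typedef struct reify {
    int   type;     /* e.g. MEM_READ */
    void *dataPtr;  /* starting address of the access */
    long  data;     /* size of the access in bytes */
} reify_t;

#define MEM_READ 1

/* Build the record that reify(MEM_READ, &my_var, 256) would produce. */
reify_t make_mem_read_hint(void *addr, long size)
{
    reify_t r;
    r.type = MEM_READ;
    r.dataPtr = addr;  /* e.g. &my_var */
    r.data = size;     /* e.g. 256 */
    return r;
}
```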
It is difficult for the base-level MM to keep track of which memory pages
were recently accessed, as there is no hardware support that records the time
of access for a particular memory page. Thus, the MM policy approximates
the page-access history and applies the respective policy to select victim pages.
By updating the access information in the global page list, the rVMM helps
the base-level make more accurate decisions. Furthermore, if an application
requires a different paging policy to be used to manage its pages, the rVMM
can use any of the built-in policies or a UD paging policy to replace the base-
level policy. The next subsection describes the implementation of a typical
application-specific UD paging policy.
Application-specific Paging Policy
Figure 3.16 shows the operation of RMMS when using a UD policy for an
application. Application X uses a UD paging policy. During initialisation,
the application reifies a request to change the paging policy to the UD policy
that it implements. The rManager activates the rVMM thread which makes
the appropriate changes in the base-level MM. On the next page-fault, caused
by application X, the rVMM thread gets activated. It checks if the page-fault
is to be handled by the UD policy. If true, control is transferred to the UD
policy code rather than the base-level LRU code. The UD policy has access
to a page list consisting of all the pages allocated to the faulting application.
The UD policy determines the next victim page from the page-list and
returns control back to the rVMM module. The selected page is moved to
the swap space. While handling a page-fault, the execution time of UD policy
is accounted against the scheduler’s CPU budget and the system interrupts
remain enabled. Once this budget expires, the rScheduler changes to a different
scheduler (if in use). It is not possible for a malicious UD policy to use the
CPU forever.
[Figure: a page-fault on one of the pages used by Application X is caught by the hardware MMU and the system page-fault handler; the RMMS module in the DAMROS kernel up-calls the application-specific UD policy in the application program code, which selects the target victim page from the pages (A, B, C, …, X) used by Application X.]
Figure 3.16: Operation of the RMMS model
To maintain uniformity amongst different application-specific UD policies,
the applications must adhere to the following guidelines. A typical application-
specific UD paging policy consists of an initialisation phase and a decision
phase. In the initialisation phase, the application initialises all the data struc-
tures (i.e. page table). In the decision phase, the UD code selects a victim
page to be moved to the swap space. The following subsection describes this
with an example implementation of UD paging policy.
Example User-defined paging policy
Figure 3.17 shows the C style pseudo code of a UD paging policy implemented
in the application. This code is executed when a page-fault occurs on a memory
page belonging to this application. It uses the requestInfo() interface to request
the page table of the application. The policy retries the page table request for
five times and, if it is not granted, kills the application. However, if the request is
granted, it links to the page table. The UD policy has access to information
about each page used by the application such as whether a page is currently
in memory or swap space or whether a page is marked as a victim page. This
is done by associating a flag field with each page in the page table. Setting
a particular bit in this flag accordingly reflects the status of the respective
page. Any change made by the UD policy to the page table affects how the
base-level policy reclaims pages from this particular page-table.
The function setVictim() operates on a given page’s flag marking it as a
victim page. The rVMM enforces the following fairness policy: when the UD
policy requests a page table, the rManager activates the rVMM thread,
which keeps a record of the number of victim pages already marked in the
page table. When the UD policy returns control to it, the number of victim
pages in the page table must still be equal to or greater than the previous number.
If this is not the case then the rVMM kills the application and reclaims all its
memory pages. Thus, it is not possible for a UD policy to keep all its pages in
memory. Due to this check, the operation of one UD policy does not adversely
affect other applications in the system.
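The fairness check described above can be sketched in a few lines; the flag bit and function names are illustrative assumptions.

```c
/* Hypothetical victim bit in a per-page flags word. */
#define PG_VICTIM 0x1

/* Count the victim-marked pages in an application's page table. */
int count_victims(const unsigned int *page_flags, int n)
{
    int v = 0;
    for (int i = 0; i < n; i++)
        if (page_flags[i] & PG_VICTIM)
            v++;
    return v;
}

/* Returns 1 if the UD policy violated fairness by unmarking victims,
   in which case the rVMM would kill the application. */
int fairness_violation(int victims_before, int victims_after)
{
    return victims_after < victims_before;
}
```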
Furthermore, it is possible for the RMMS module to benefit from the infor-
mation available to the CPU scheduler as well. The next subsection describes
how such information could be used to benefit each resource management
module.
Use of RMMS in Scheduling
Using reflection, it is possible for the meta-levels of both the RMMS module
and the reflective CPU scheduler to interact with each other and exchange
information. The CPU scheduler can act as a base-level to RMMS and reify
void UD_paging(void)
{
    page_table_t *p;
    int id = 0, retry = 0;

try_again:
    /* request the page table */
    id = requestInfo(MEMORY, PAGE_TABLE, 0, 0, NULL);

    /* request granted if id is greater than 0 */
    if ( id > 0 ) {
        /* causal link to the requested data */
        p = (page_table_t *)linkData(id);

        /* UD paging policy */
        ...
        /* from the page table, select a victim page */
        ...
        /* set the selected page as victim */
        setVictim(selected_page);
    }
    else {
        /* kill the application after 5 retry attempts */
        if( ++retry == 5 ){
            kill_application();
        }
        goto try_again;
    }
} // end of UD paging policy
Figure 3.17: Application-specific UD Paging Policy
information to it. For instance, a reflective CPU scheduler can reify applica-
tion timing information to the RMMS module (e.g. remaining CPU budget,
deadline, etc.). The cost of swapping pages in and out of memory is high in
terms of both time and power as compared to a context switch if the pages
required by next application thread/process are already present in memory.
The RMMS module could use such information to determine if the cost of
swapping a page is higher than simply reducing the remaining budget of the
application thread/process and requesting the scheduler to perform a context
switch to another process.
Similarly, the scheduler can acquire information from the RMMS module
regarding an application’s memory usage to check if its pages are in memory
before context switching to it. Note that DAMROS is a single address space
RTOS supporting reflective hierarchical scheduling. The context of this discus-
sion is scheduling of multiple threads/processes of a single application in order
to efficiently use the CPU budget allotted to the scheduler. The interaction of
one scheduler with the RMMS module does not affect any other scheduler or
application threads/processes in the system.
The implementation of the reflective CPU scheduler (VRHS model) and the
reflective virtual memory (RMMS module) allow experiments to be performed
on DAMROS to evaluate the generic reflective framework. This is discussed
in the next section.
3.4 Evaluation
Compiling DAMROS using gcc (version 3.2.2) for the Intel x86 architecture [61]
produces an image of size 83 KB, including the framework, two reflective
system modules, device drivers and the test applications. The hardware
used to conduct the experiments included an embedded single-board computer
with a Cyrix MediaGX (233 MHz) processor and 64 MB of SDRAM. DAMROS
was configured to use only the first 4 MB of RAM. On an actual embedded
system, a flash memory or other similar auxiliary memory device would be
used for the swap space. For simplicity, DAMROS uses the upper 48 MB of
RAM (i.e. 16 MB onwards) as the swap space. Note that the application
timings in this case are much faster than they would be on an actual system.
However, all experiments use the same setup to guarantee uniformity in the
recorded timings.
The objective of this evaluation is to verify the operation of the reflective
framework, to show the degree of flexibility offered for application development,
and to demonstrate support for application-specific resource requirements at
runtime. More detailed experiments, which compare against other approaches,
are presented in the following chapters.
3.4.1 Timing Analysis
Table 3.1 lists the maximum time taken by each interface in the reflection
framework along with the page-fault handler routine in DAMROS. The time
was measured using the time-stamp counter of the processor. The timing
measurements depend on the hardware being used (i.e. the CPU clock speed).
Nevertheless, the figures are indicative of the relative performance of the in-
terface in DAMROS.
3.4.2 Changing Application Behaviour
This section demonstrates the ability of the framework to allow applications
to change their behaviour at runtime. Two application threads T1 and T2
RTOS Function          Max. time, t in µs

reify()                0 ≤ t ≤ 1
requestInfo()          1 ≤ t ≤ 2
linkData()             0 ≤ t ≤ 1
unlinkData()           0 ≤ t ≤ 1
interceptCall()        1 ≤ t ≤ 2
uninterceptCall()      0 ≤ t ≤ 1
reinterceptCall()      0 ≤ t ≤ 1
allowIntercept()       0 ≤ t ≤ 1
interceptAllowed()     1 ≤ t ≤ 2
installCode()          4 ≤ t ≤ 5
uninstallCode()        1 ≤ t ≤ 2
page-fault handler     0 ≤ t ≤ 4
Table 3.1: Measured Execution Times of DAMROS Interfaces
were implemented. Thread T1 calls a function read_packet() in an infinite
loop. The function read_packet() implements a particular algorithm or protocol
to read data from a memory buffer. This function is critical to the
functioning of the application. Before entering the loop, T1 uses the
allowIntercept(&read_packet, "read_packet", 0) call to allow the interception of
function read_packet() by any component or thread in the system.

Assume that, in the future, the function read_packet() needs to be changed due
to a bug or a change in requirements. Also, suppose that the change is to
be patched at runtime without having to stop, recompile and rerun the thread.
This is particularly the case when the application has been deployed on board
a satellite or a Mars exploration vehicle, for instance.
Thus, in order to fix this issue without having to stop, recompile and rerun
thread T1, an independent thread T2 containing the new implementation
of the function read_packet() is developed off-line. When executed
in the system, this thread intercepts the function read_packet() in thread T1 using the
interceptAllowed("read_packet", 0, &new_function) call. Control is now
transferred to the new implementation in thread T2, replacing the original
functionality.
From the data collected over 1000 samples, it was found that, on average,
it took 4 µs for thread T2 to effect the change in thread T1. The time was
measured from the moment interceptAllowed() was called in thread T2
until it returned.
In a similar way, an application can effect a change to various other
attributes such as the priority, the paging policy, the scheduling policy, etc. It was
found that the measured maximum execution time to effect any such change
was no more than 30 µs in DAMROS (note: a combination of installCode(),
reify() and other interface calls can take more time, due to the privilege checks
performed). The next subsection presents a detailed evaluation of VRHS,
the reflective CPU scheduler.
3.4.3 Evaluation of VRHS
The evaluation is divided into two parts: one using preliminary tests and the
other using detailed experiments.
Preliminary Tests
[Case 1] Consider an application with threads T1 and T2, each consisting of a loop
with a maximum of 10,000 iterations. Table 3.2 shows the results obtained
after executing T1 and T2 simultaneously under normal conditions (i.e. with no
reflective scheduler). In table 3.2, Start indicates the time (in seconds) at which an
application thread started executing in the system; End indicates the time when
the application thread finished execution; and Lifespan (i.e. End − Start)
indicates the time the application thread spent in the system (this does not mean
that the application thread was executing throughout its lifespan). The
time measured here is relative to the time the RTOS was initialised. For instance,
in table 3.2, T1 starts 0.134 s after the RTOS was initialised.
It is evident that both applications take almost the same time to complete,
with T2 finishing 0.065 s later than T1. This is because T1 started execution
ahead of T2. The results were obtained as an average of several samples, with
DAMROS implementing an RR scheduling policy. Only these two application
threads were running in the system at any given time.
Process   Start   End     Lifespan

T1        0.134   1.284   1.150
T2        0.139   1.354   1.215
Table 3.2: No Reflection, Basic RR Scheduler
[Case 2] In this case, the reflective scheduler is initialised and the application
threads T1 and T2 are rerun. This time, however, an FP scheduling
policy is used and thread T1 requests a higher priority using the
reify(HI_PRIORITY) call. The rScheduler module obtains this information
from the rManager, replaces the base-level scheduler with an FP scheduler,
and assigns a higher priority to thread T1.
Process   Start   End     Lifespan

T1        0.136   0.827   0.691
T2        0.827   1.607   0.780
Table 3.3: Reflection with One High Priority Application
On average, it took no more than 30 µs for the rScheduler to effect the
required change once thread T1 had reified the request. From table 3.3, it is
observed that the lifespan of thread T1 is drastically reduced from 1.150 s to
only 0.691 s, and that of thread T2 is also reduced, from 1.215 s to 0.780 s. This
is because thread T2 is the only thread executing in the system once thread T1
has finished executing. However, note that the end time of thread T2 is 1.607 s
as compared to 1.354 s in the previous case (see table 3.2). It is delayed by
0.253 s because thread T1 executed at a higher priority.
[Case 3] This case reverses the above scenario, in that thread T2 requests
a higher priority while thread T1 executes as normal. Initially, the RR scheduler
is used, since both threads have equal priority. Thus, thread T1 starts execution
and, on the next context switch, thread T2 starts executing. It then reifies the
high-priority request. The rScheduler module makes similar changes to those above,
so that thread T2 gets the higher priority this time.
Process   Start   End     Lifespan

T1        0.122   1.590   1.468
T2        0.127   0.912   0.785
Table 3.4: Reflection with One High Priority Application
From table 3.4, it is observed that the lifespan of thread T2 is reduced
from 1.215 s to 0.785 s. The reason thread T2 does not show figures similar
to those of thread T1 in table 3.3 is that T1 had already executed for at least one
time quantum before T2 started executing. Thread T1 spent 0.785 s waiting for
execution, thereby incurring a total delay of 0.306 s as compared to its normal
execution in table 3.2.
[Case 4] In this case, thread T1 runs as normal, whereas thread T2 is modified
to request a higher priority only during the execution of a particular
section of its code (i.e. during a critical section). This is typical behaviour
in priority-based real-time systems that use the priority ceiling protocol to control
access to a shared resource or a critical section [29]. To imitate this behaviour,
the code for thread T2 was modified such that it requests a higher priority
just before it enters the original loop. Then, after executing half way through
the loop, it requests a lower priority (see the pseudo-code in figure 3.18).
The rScheduler module gives T2 a higher priority, and later switches back
to the default RR policy when thread T2 requests a lower priority. Note that the
rScheduler replaces the fixed-priority scheduling policy with the default RR
policy only when all threads have equal priorities (which is true in this case).
Thus, the order of execution of both threads in the system is as follows:
thread T1 starts executing first. Later, when thread T2 is scheduled, it is
executed on a higher priority until it is half way through the loop. At this
point, both threads T1 and T2 execute at the same priority and are scheduled
by the default RR policy until they finish execution. Table 3.5 shows that
thread T2 finishes execution in 1.187s whereas in the normal case it would
have taken 1.215s (as per table 3.2). Thread T2 finishes its execution 0.028s
faster, while thread T1 takes an additional 0.318s to complete. The delay in
the execution time of T1 can be attributed to its waiting time when thread T2
was executing on a high priority.
[Case 5] In order to evaluate scheduling policies for multiple threads belonging
to different applications, two separate applications, A1 and A2, were
developed. Each application spawns three independent child threads. All child
threads perform a similar operation: printing their corresponding thread IDs
void thread_T2(void)
{
int i = 0;
reify(HI_PRIORITY);
while( i < 10000)
{
if(i < 5000 ){
/* execute code in critical section */
...
}
...
/* lower the priority */
if (i == 5000) {
reify(LO_PRIORITY);
}
...
i = i + 1;
}
}
Figure 3.18: Pseudo-code for Thread T2
over 500 iterations of a loop.
Process   Start   End     Lifespan

T1        0.122   1.590   1.468
T2        0.127   1.314   1.187

Table 3.5: Reflection with One High Priority and Other Varying Priority Application
Assume that the child threads C11, C12 and C13, belonging to the parent
thread A1, are to be scheduled using the default RR scheduling policy, and that the
child threads C21, C22 and C23, belonging to the parent thread A2, are to be
scheduled using an FCFS scheduling policy. Also assume that threads C11 to
C13 entered the system before threads C21 to C23.

Under normal circumstances, all threads would be scheduled using the
default RR policy. In that case, the scheduling order of the threads would be:
A1, A2, C11, C12, C13, C21, C22 and C23.
Using the framework, the application thread A2 requests an FCFS
scheduling policy for its child threads using the
reify(CHILD_FCFS) call. The rScheduler module identifies this application-specific
scheduling requirement and installs an FCFS scheduler for threads
C21 to C23. The CPU bandwidth is divided equally between the two applications'
child-thread sets. Thus, each child-thread set gets 15 ms (5 ms × 3) of CPU
time after A1 and A2 are scheduled by the RR scheduler.
After execution, the resulting scheduling order was observed to be: A1,
A2, C11, C12, C13, C21 (for 15 ms); then again A1, A2, C11, C12, C13 and C21,
and so on until C21 finishes execution. Thread C22 then starts executing in a similar
manner, and so on until all threads finish execution.
Summary
In summary, the above test cases showed the degree of flexibility offered
by the framework in DAMROS. It is evident that applications are able
to adapt or bring about changes in the CPU scheduling policy. Compared
to the overhead (a few microseconds) of reifying information in the system,
the gain in application performance is significant. As observed, even a slight
change in priority can make a big difference to the execution times of an
application. More detailed experiments, involving the use of different schedulers
in the VRHS model, are described in the next subsection.
Detailed Experiments
It is difficult to simulate a dynamically changing reflective hierarchical scheduling
model such as VRHS. It is also not possible to perform the evaluation
on the basis of a discrete-event simulation model, a deterministic model or a
queueing model [90, 109]. The VRHS model is composed of various traditional
scheduling policies (e.g. FCFS, FP, EDF, etc.) along with application-specific
UD policies (if used). All such policies co-exist in a single system
and are used to schedule different groups of application threads. Note that
the evaluation of VRHS does not provide any performance metrics for the
schedulers themselves. The evaluation is based on the following criteria:
• Flexibility offered,

• Scalability of the system,

• Performance of the system,

• Overheads incurred.
The following are the definitions of the terms used in the experiments:
• Execution time: the time a thread spends in the system executing on
the CPU; denoted as ε(t), where t is the executing thread.

• Wait time: the time spent by a thread waiting from the moment it entered
the system until it first executes on the CPU; denoted as ω(t), where t is
the thread. In the literature, this is also known as the Response time.

• Turn around time: the time taken by a thread from the moment it entered
the system to the moment it leaves the system; denoted as TTRnd(t),
where t is the thread.

• Start-time: the time at which a thread first starts its execution; denoted
as St, where t is the thread.

• End-time: the time at which a thread finishes its execution and leaves
the system; denoted as Et, where t is the thread.
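Writing $A_t$ for the time thread $t$ enters the system ($A_t$ is an assumed symbol, introduced here only to relate the definitions above), the quantities satisfy:

```latex
\[
\omega(t) = S_t - A_t, \qquad
T_{TRnd}(t) = E_t - A_t = \omega(t) + (E_t - S_t), \qquad
\epsilon(t) \le E_t - S_t .
\]
```

For example, treating the recorded Start of thread T1 in table 3.2 as its entry time gives $T_{TRnd}(T_1) = 1.284 - 0.134 = 1.150$ s, its Lifespan in that table.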
The following subsections present the results of three different experiments
conducted to test the VRHS model in DAMROS. Experiment #1 uses only
two schedulers in the hierarchy with only one application affecting a change
in the system; experiment #2 uses three different schedulers in the hierarchy
with two applications affecting a change; and finally, experiment #3 emulates
an MPEG decoder application that uses an application-specific UD schedul-
ing policy. Towards the end, a detailed discussion of performance and the
overheads incurred by VRHS is presented.
Experiment #1
Two applications were implemented and executed as threads A1 and A2,
each having equal execution time. The application thread A1 spawned 4 child
threads, T1 to T4, which have the same execution time as the parent thread
A1. Equal execution time is ensured by using common code across all the
threads. The application threads were executed in DAMROS for the following
test cases:
[Test Case #1] : Using the RR policy at the root node, applications A1
(including its child threads) and A2 were executed with no changes made to
the VRHS scheduling model. In total, there were 6 threads executing in the
system: 2 application threads, A1 and A2, and the 4 child threads of A1, T1
to T4. The measured average wait time (ω(t)) and average turn around
time (TTRnd(t)) of each thread were 0.438 ms and 16.188 ms respectively.
This case represents the default behaviour, without using any features of
the VRHS model.
[Test Case #2] : In this test case, the application thread A1 changes the
scheduling policy of its child threads to the FCFS scheduling policy using
the reify(CHILD_FCFS) call. On execution, the avg. ω(t) and avg.
TTRnd(t) for the threads were 4.896 ms and 12.146 ms respectively. Note the
increase in the value of ω(t): only one child thread executes
to completion while its siblings wait in the ready queue, as per the FCFS policy.
[Test Case #3] : In this test case, the scheduling policy for the child threads
T1 to T4 was changed from FCFS to FP, such that they all have equal
priority, but higher than the parent application threads. For this, instead of
using reify(CHILD_FCFS), the application thread A1 uses a reify(CHILD_FP)
call. On execution, the avg. ω(t) and avg. TTRnd(t) were measured to
be 0.438 ms and 14.688 ms respectively. Note that the avg. ω(t) is equal to
that measured for the default case. This is because all the child threads had
equal priority and were executed in an order similar to that of the RR scheduling
policy. This shows that the reflection interface operates non-intrusively, with
little or no overhead. The avg. TTRnd(t) improved in this case as compared to
the default case, because the child threads had a higher priority than their
parent application thread A1.
[Test Case #4] : In the previous test, since all the child threads had equal
priority, the thread that executed first was the first to leave the system.
In this test case, the code was modified such that thread T1 has a higher
priority than the rest of the threads, using a reify(HI_PRIORITY) call
in thread T1. This results in thread T1 being executed to completion
before any of its siblings start execution. In this case, the measured avg. ω(t)
and avg. TTRnd(t) were 1.813 ms and 13.438 ms respectively.
[Test Case #5] : In order to verify the correctness of the VRHS model,
thread T3 was modified to have the same priority as thread T1. Now,
threads T1 and T3 have an equal but higher priority than threads T2 and T4.
The measured avg. ω(t) and avg. TTRnd(t) in this case were 3.167 ms and
12.542 ms respectively. There is an increase in ω(t), since threads T2 and T4
wait for threads T1 and T3 to finish execution.
[Test Case #6] : In this test case, the code was modified such that thread
T3 has the highest priority, followed by threads T2 and T4, while
thread T1 has the lowest priority of all. The measured avg. ω(t) and
avg. TTRnd(t) in this case were 2.417 ms and 12.583 ms respectively. Here,
ω(t) was observed to be lower than in the previous test case, because
only the lowest-priority thread T1 had a lengthy wait time.
[Test Case #7] : In this test case, consider that thread T1 has a critical
section that is to be executed at the highest priority. Thread T1
executed at the lowest priority in the previous test case. Using the reflection
interface, it is possible to request a higher priority for thread T1 only while it
executes the critical-section code. This was achieved using the same code
as in the preliminary tests above.

In this test case, the measured avg. ω(t) and avg. TTRnd(t) were
2.396 ms and 10.458 ms respectively. There was a significant speed-up in
terms of TTRnd(t) (10.458 ms). This is because the lowest-priority thread
T1 executed parts of its code (the critical section) at the highest priority,
resulting in all the threads finishing early. This behaviour is also evident from
the lower value of ω(t) (2.396 ms).
Summary
In summary, it is observed that the changes brought in by the application
thread A1 affect only its child threads and not A2; i.e., the VRHS model
produces only local effects and does not spread any overheads to other
threads in the system. Each scheduler in the virtual hierarchy operates in
complete isolation and does not know about the existence of the other schedulers.
Figure 3.19: Results of Experiment #1
The test cases show that applications can make use of the available reflection
interface to satisfy their application-specific requirements. Figure 3.19
shows a bar-graph representation of the avg. ω(t) and avg. TTRnd(t)
values for all the test cases.
Experiment #2
The VRHS model was tested using two separate schedulers for scheduling
different sets of application threads. The test application threads A1 and A2
each spawn 4 child threads and introduce a different scheduling policy for their
set of child threads. The application code was similar to that of experiment #1. In
this experiment, 10 threads were executed: 2 parent application threads, A1
and A2, and the 4 child threads of each application, T1 to T4 and T5 to T8.
[Test Case #1] : In this test case, all application threads were executed
using the default RR scheduling policy. This case is representative of the
default behaviour, with no change brought in by the VRHS model. The measured
avg. ω(t) and avg. TTRnd(t) were 0.688 ms and 25.338 ms respectively.
[Test Case #2] : Threads A1 and A2 were both modified to use FCFS
scheduling policy to schedule the corresponding child threads. On execution,
the measured avg. ω(t) and the avg. TTRnd(t) were 8.6 ms and 17.863 ms
respectively.
[Test Case #3] : The application thread A2 was modified to use an FP
scheduling policy while thread A1 used the FCFS policy. On execution,
the measured avg. ω(t) and the avg. TTRnd(t) were 5.15 ms and 19.75 ms
respectively.
[Test Case #4] : In this test case, the scheduling policies were swapped.
i.e., FCFS policy was used for the child threads of A2 and an FP policy
for those of thread A1. On execution, the measured avg. ω(t) and the avg.
TTRnd(t) were 5.1 ms and 19.725 ms respectively.
[Test Case #5] : In this test case, using the same schedulers as above,
thread T2 is made to have the highest priority and thread T4 the lowest among
other threads. On execution, the measured avg. ω(t) and the avg. TTRnd(t)
were 5.588 ms and 17.925 ms respectively.
[Test Case #6] : This test case used an FP policy to schedule the child
threads of both applications. On execution, the measured avg. ω(t) and the
avg. TTRnd(t) were 0.75 ms and 22.7 ms respectively.
[Test Case #7] : In this test case, thread T2 is assigned the highest priority
and thread T3 the lowest. Similarly, thread T6 is assigned the highest priority
and thread T8 the lowest. On execution, the measured avg. ω(t) and the avg.
TTRnd(t) were 1.936 ms and 18.288 ms respectively.
Figure 3.20: Results of Experiment #2
Summary
Figure 3.20 shows the bar graph representation of the avg. ω(t) and the avg.
TTRnd(t) for all the above test cases. It is evident from the better avg. TTRnd(t),
that an application can improve its performance if it gets application-specific
resource (CPU in this case) management support from the RTOS. Clearly,
the reflection framework provides this application-specific support. The next
subsection describes an experiment involving a multi-threaded MPEG decoder
application whose performance is shown to improve with the use of a UD
scheduler.
Experiment #3 (Application-specific)
An application emulating the behaviour of a multi-threaded MPEG [51]
decoder was implemented. The parent application thread buffers the incoming
MPEG video stream and activates one of the decoder threads, which decodes
an MPEG frame of a particular type (i.e. either an I, a P or a B frame [51]).
The implementation made the following assumptions:
• a constant bandwidth for the in-coming MPEG video stream either from
a network resource or a local storage disk,
• a constant decoding time per frame for each of the decoder threads,
• an MPEG video stream consisting of the following frame pattern (3
different scenes):
(IPBBPBB) (IPBBPBBPBB) (IPBBPBBPBB)
The test video stream has 27 frames constituting 3 different scenes.
The application was tested for concurrent decoding of these 3 scenes. The
parent application thread invoked 27 decoder threads, one to decode each frame
in the stream. However, due to the inherent frame dependencies, it was not
possible to decode each frame independently; i.e. not all the decoder threads
could be ready for execution at any given time. Furthermore, threads that
decoded the frames belonging to a particular scene would execute in a known
execution order, i.e. decoding one frame at a time. With 3 different scenes in
[Plot: time elapsed (in ms, 0–500) against the random arrival of threads (0–35), showing the Arrival Time and End Time of each thread.]
Figure 3.21: Using RR Scheduler
the video stream, there could only be a maximum of 3 active decoder threads
decoding a frame independently.
To determine the effects of using a UD policy on other applications executing
in the system, another application, App, consisting of 4 child threads, was also
executed. There were thus two different applications, each consisting of
multiple threads, executing in the system.

During normal execution of both applications using the default
RR scheduling policy, it was observed that the MPEG decoder application
showed poor performance. Figure 3.21 plots the start-time and end-time of
all application threads, including application App's threads. Threads 3 to 7
in the figure belong to application App, thread 32 is the parent application
thread of the MPEG decoder, and the rest are the decoder threads.
[Plot: time elapsed (in ms, 0–500) against the random arrival of threads (0–35), showing the Arrival Time and End Time of each thread.]
Figure 3.22: Using UD Scheduler
Next, the MPEG application was modified to introduce an application-specific
UD scheduler into the VRHS model using the reify(CHILD_SCHED,
&UD_scheduler) call. The UD scheduler made use of information about
the MPEG data arrival times and kept track of the scene each frame belonged
to. This information allowed the UD scheduler to schedule the corresponding
threads using a priority-based scheduling policy.

Both applications were executed again, with the MPEG application using
its application-specific UD scheduler. The results (see figure 3.22) show a
considerable improvement in the performance of the MPEG application: the
decoder threads completed execution much earlier than in the previous case.
Summary
In summary, this experiment showed that applications whose requirements are
[Plot: time elapsed (in ms, 0–500) against the random arrival of threads (0–35), comparing the Default and Application Specific schedulers.]
Figure 3.23: RR Vs UD Scheduler
not satisfied by the existing policies in an RTOS can make use of the framework
to introduce an application-specific UD policy. Comparing the end-times of the
threads scheduled using the default RR policy against those scheduled using the
application-specific UD policy, it can be seen that the decoder threads were
scheduled at the right time by the application-specific UD scheduler (see
figure 3.23). Further, the MPEG decoder application showed much better
performance using its own UD scheduler.
Note that the start-times of application App's threads show an average
delay of 4 ms. This is caused by the overhead added by
the reflection interface in DAMROS for handling the reification process. However,
by using another scheduler (e.g. FP, or an application-specific policy) for
application App, it is possible to improve its performance. The following
subsection discusses the performance and related overheads of VRHS compared
to other hierarchical scheduling schemes.
Performance and Overheads of VRHS
Unlike the traditional hierarchical approaches where the performance of the
system deteriorates with the addition of extra schedulers in the hierarchy,
the performance of VRHS is not affected by the presence of extra schedulers.
This is because, the system scheduler does not context switch between several
schedulers in the hierarchy. It only context switches to an UD scheduler, else a
direct procedure call interface is used avoiding much of the context switching
overhead. For experiment #2, which used two schedulers, if traditional ap-
proaches (e.g. MaRTE OS API [103], SFQ [55] method, etc.) were to be used,
then they would have incurred context switching overhead to switch between
the schedulers in the hierarchy.
Generally, in the traditional approaches, the number of context switches
increases with each additional level of the scheduling hierarchy. An exception
to this is the HLS [100] implementation which, like VRHS, does not context
switch between schedulers. However, HLS adds significant overhead to the
context switch time itself (an 11.7 µs context switch time, as compared
to 7.10 µs in the Windows 2000 kernel without HLS, on a 500 MHz Pentium III
machine [99]). Such an overhead affects all application threads being scheduled
in the system. Furthermore, the HLS model incurs an overhead of 0.96 µs for
each additional level in the scheduling hierarchy [99].
The overheads incurred by most hierarchical scheduling models are due to
the context switches between the various schedulers in the hierarchy: the lower
the number of schedulers in the hierarchy, the lower the context-switching
overhead. To get an idea of the scale of this overhead, consider Linux [19] and
NetBSD [124], for instance. The context switch time of Linux on a 500 MHz
G3 processor was 89 µs [124], and that of the SA model developed to run on
a NetBSD system on similar hardware was found to be 225 µs [124]. If this
time is multiplied by the number of schedulers in the hierarchy, the overhead
can have a massive impact on overall system performance.
In VRHS, only the required scheduler remains active at any given time
eliminating the need for several context switches. However, there is a one-
time overhead of the rScheduler thread that executes before an application
scheduler. The maximum observed execution time of the rScheduler thread
is 10 µs. Hence, the VRHS model has a one-time overhead of nearly 10 µs
irrespective of the number of schedulers in the hierarchy. The only real context
switch that occurs is initiated by the application scheduler to switch to the
next ready thread (when using built-in schedulers).
Also, the framework incurs an additional overhead in terms of memory.
Memory is required to store and retrieve reified information. For all of the
above test cases, it was found that the maximum memory used for storage of
reified information was around 200 bytes at any given time. This is negligible
compared to the flexibility offered.
Given the amount of flexibility provided by the VRHS model and the ease
of using the reflection framework, the approach can be considered better
than the traditional ones. The next section describes the experiments
performed to evaluate the reflective memory management system (RMMS).
3.4.4 Evaluation of RMMS
In order to simulate a memory-constrained system in a controlled
experiment, the total number of free memory pages available in the system was
reduced to 64; i.e., the available free memory in DAMROS was now only
262 KB. Although the experiments use a small amount of memory, the
approach is nevertheless scalable to real systems. Reducing the memory
size in this way makes the experiments easier to perform and analyse.
Also, most OSs maintain a certain number of memory pages that are
always kept free. The total memory of 262 KB available to the applications
in the following experiments excludes these maintained free pages; DAMROS
can allocate all 64 memory pages to the applications. Similar to the evaluation
of the VRHS model, the evaluation of RMMS is divided into two parts:
preliminary tests and detailed experiments.
Preliminary Tests
Test applications A1 and A2 were developed. When executed, the
application thread A1 first allocates all the available free memory to itself. Later,
it reads the allocated memory (one byte at a time) in a loop. This action
emulates periodic sequential page access in the system. The application
thread A2 behaves in a similar manner, but allocates only 10 memory
pages to itself. Both threads require 1 memory page each for their code and
data segments.

Thread A1 is initiated before thread A2. It occupies all the available free
memory pages, such that when thread A2 executes there is no free memory.
In order to allocate 10 memory pages to thread A2, the RMMS module must
select 10 victim pages to be swapped out. In total, there were 12 page swap
Figure 3.24: Static Vs Reflective LRU (page-fault comparison; no. of page faults against trial runs)
operations: 2 page swaps to initialise thread A2’s code and data segments
(each using one page) and 10 page swaps to allocate A2’s memory pages. At
this point, it is important that the RMMS module choose the right pages to
be swapped out.
In this scenario, without using reflection, each of the traditional LRU and MRU page replacement policies was tested. This provides a baseline to compare against the use of reflection later. The number of page faults generated over the total lifespan of both applications was recorded in each of the following cases:
[Test Case #1] : This test case used the traditional LRU policy without
reflection. On an average there were 109 page-faults generated in the system
(see upper dark lines in figure 3.24).
Next, the reflective interface was used in the RMMS to optimise memory
utilisation and to reduce the number of page faults. The rVMM module was
initialised and the base-level page-tables were reified to rVMM. Also, application thread A1 was modified to reify information suggesting that it would use the first 10 pages allocated to it, using the reify(MEM_READ, &memory_data, (10 * PAGE_SIZE)) call. Thread A2 also reified similar information. The RMMS
base-level used the LRU paging policy while the rVMM manipulated the page
flags to reflect the reified information. On an average, 95 page-faults were
observed in this case (see lower dark lines in the figure 3.24).
[Test Case #2] : This test case executed the same application threads under
normal conditions (without using reflection) but using an MRU policy instead
of LRU. On an average, 2404 page-faults were generated over the 1000 test
runs of the applications.
Using the reflection framework in the same way as above, with the RMMS base-level using an MRU paging policy, only 145 page-faults (avg.) were generated.
The number of page-faults generated in the system was significantly reduced by using the framework. When applications reify memory usage information, the rVMM forms a causal connection with the page-table using the linkData() call. It then marks the page flags such that the pages that will be used by the applications remain in memory (i.e. they are not swapped out by the base-level).
The worst-case time to handle a page-fault was observed to be nearly 4 µs. Thus, using reflection with the LRU policy, a reduction of 14 page-faults saves 56 µs of valuable CPU time. Similarly, for the MRU policy a massive saving of 9.04 ms of CPU time is achieved. The reify() interface accounts for an overhead of nearly 1 µs, with an additional 3 µs overhead added by the rVMM component to bring about the change. Even after subtracting this 4 µs overhead from the above figures, the total savings of 52 µs for LRU and 9.036 ms for MRU remain significant. The next subsection describes a more detailed experiment using several different test cases.
a more detailed experiment using several different test cases.
Detailed Experiments
For more accurate and controlled measurements, the number of free pages
available to the applications was further reduced to 32, i.e. only 131 KB of memory was made available to the applications. Note that, in this case, the
free memory excludes the memory pages used by the application code/data
segments. To maintain uniformity in the values, a fixed test application set
was used across all the experiments.
Test Application Set
The test application set consists of applications A1 and A2. Both applications require 20 memory pages each. Thread A1 randomly accesses its memory
pages in an infinite loop whereas thread A2 (also in an infinite loop) sequen-
tially accesses its pages. The total pages available in the system is only 32, but
the total pages required by both applications is 40 pages. The execution of
both applications would generate an increasingly large number of page-faults in
the system. Once the application threads start executing, the number of page-faults generated in the system is recorded after every 1,000 context switches. A
total of 100 such readings are recorded before killing the application threads.
The following experiments use different paging policies along with reflection.
Experiment #1
This experiment used a global page replacement strategy with no knowledge of
the individual application’s memory usage. A victim page was selected on the
Figure 3.25: Experiment #1: Page-faults (Case #1; LRU, MRU and LFU; no. of page faults against context switches)
basis of its global usage statistics collected over a period of time. Figure 3.25
shows the corresponding page-fault graph after executing the test application
set with LRU, MRU and LFU page replacement policies respectively.
The LRU paging policy showed relatively poor performance. This is because LRU is a recency-based policy and does not track the frequency of usage. Application thread A2 has a loop-based sequential access pattern in which, after accessing the 20th page, it starts again from the 1st. Towards
the end of A2’s access loop, the LRU policy would have marked the 1st page
as the least recently used page. If a page-fault occurred while the thread was
accessing the last page in the sequence, then the LRU policy would reclaim
the 1st page. This would further result in a series of page-faults as the thread
proceeds to re-iterate its memory access through the loop.
The total number of page-faults generated when using LRU, MRU and
LFU policies were observed to be 83,750, 69,199 and 66,669 respectively. If
application thread A2 was allowed to specify its memory usage pattern, then
by reification it may be possible to reduce the number of page-faults.
Experiment #2
In this experiment, the page replacement strategy was modified such that a
selected victim page did not belong to the thread that caused the page-fault.
For instance, in the test application set, if thread A1 caused a page fault, then
the selected victim page would belong to thread A2. In this case, the number
of page-faults generated corresponding to each paging policy (i.e. LRU, MRU
and LFU) was observed to be 133,330, 133,334 and 133,327 respectively. All three paging policies performed poorly compared to the previous experiment. Clearly, the global page replacement strategy generated far fewer page-faults than the one used in this case.
Experiment #3
In this experiment, the page replacement strategy was modified to select a
victim page which belonged to the application thread that caused the page-
fault. The total page-faults thus generated for this strategy, corresponding to each paging policy (i.e. LRU, MRU and LFU), were 66,666, 66,665 and 66,636 respectively. This strategy generated fewer page-faults than both of the above experiments. Perhaps if an application-specific
UD paging policy is used, there could be a further reduction in the number of
page-faults.
Figure 3.26: Page-faults for RMMS (Reflective LRU, Reflective MRU, Reflective LFU and App. Specific; no. of page faults against context switches)
Experiment #4 (UD paging policy)
The RMMS module allows applications to introduce an application-specific
UD paging policy into the system. To show the significance of reflection in the RMMS model, a UD policy was used for thread A2. Also, thread A2 was modified to reify its memory access pattern at runtime using the reify(MEM_READ, &variable, PAGE_SIZE) call before accessing a page.
The UD policy keeps track of the recent memory usage pattern of thread
A2. Furthermore, it is customised to support the thread’s sequential loop-
based access. Thus, on a page-fault, the UD policy selects a page (from A2’s
page-table) that would not be used immediately in the future. For instance, if
a page-fault occurs when A2 is accessing the 20th page, the UD policy selects
the 19th page instead of the 1st page.
Along with the use of a UD policy, more tests were performed using
the rVMM module, which applied the reified information with the existing paging policies (LRU, MRU and LFU) for A2 alone. Application thread A1, with its random memory access, used the LFU page replacement policy. The graph in
figure 3.26 shows the page-fault graph for each paging policy using the reflection framework along with the UD policy. The LFU policy was used globally in
the RMMS base-level which handled the page-faults generated by thread A1.
Amongst the four different test cases, the first uses an LRU policy for A2, the second an MRU, the third an LFU and the fourth a UD policy. The total page-faults generated in each case were 65,266, 56,766, 56,066 and 45,766 respectively.
As compared to the previous experiments, there is a significant reduction
in the number of page-faults generated. The UD policy showed the best results amongst all the reflective policies. The use of reflection with the traditional policies also showed a significant reduction in page-faults.
In the above experiments, the RMMS module incurred a memory over-
head of nearly 2 KB. This memory was used to store information reified by
the applications and system components. However, this overhead is negligible
in comparison to the amount of flexibility offered and the significant reduction
in page-faults. Comparing the number of page-faults generated using the tra-
ditional paging policies (LRU, MRU and LFU) against the application-specific
UD policy, the UD policy shows 31%–65% reduction in the number of page-
faults. Thus, by allowing applications to introduce custom paging policies into the system, the RMMS provides the required application-specific support.
3.5 Summary
In summary, this chapter presented the generic reflective framework for an
RTOS. The traditional reification process in reflection was modified such that
information reified is stored in the RTOS kernel and later passed to the meta-
level component on explicit request. The design and implementation of DAM-
ROS, a reflective RTOS, implementing the reflective framework was also de-
scribed.
The implementation of two reflective resource management modules in DAMROS was described: a reflective CPU scheduler (VRHS model) and a reflective memory management system (RMMS). Several experiments to evaluate
each reflective resource management module were presented. The experimen-
tal results showed improvement in application performance with minimal or
negligible overheads in terms of time and memory. Both VRHS and RMMS
modules were shown to be flexible enough to accommodate application-specific
resource requirements pertaining to the CPU and memory.
It is evident that reification in the reflection framework plays an important
role in adapting the system policies according to the application requirements.
The next chapter presents a case study for virtual memory to investigate dif-
ferent methods of adding reification calls into application source code. Instead
of relying on the applications to explicitly reify information, the chapter in-
troduces a method of automatically inserting reification calls into application
source code, which inform the RTOS of an application's memory usage patterns at runtime.
Chapter 4
Support for Reification: a Case Study
In order to satisfy their resource requirements, applications need to reify the
resource usage information to the RTOS. This helps the RTOS to adapt its
resource management policies accordingly and satisfy the application require-
ments. The generic reflective framework presented in the previous chapter
requires applications to explicitly reify information. There are various meth-
ods to support reification. One such method is the insertion of reification calls
into application source code at compile-time. Other methods could involve
the analysis of source code to identify resource usage patterns and insertion of
reification calls at certain points either manually or automatically.
This chapter uses virtual memory management (paging) as a case study
to show the significance of reification and the methods to support it. Paging
allows applications with memory requirements greater than the physically available memory to run. The case study will make use of reification
and accordingly adapt the OS’s paging module.
This case study considers applications with greater memory requirements
and that exhibit a loop-based memory access. A mechanism that exploits such
memory access to automatically add reification calls and to dynamically adapt
the paging policy is described.
The chapter is organised as follows. A brief introduction to the paging
model used in this chapter is presented in section 4.1. Two simple reification
calls to specify memory usage are discussed in section 4.2. This is followed by
the description of three methods of inserting reification calls: manual method
(section 4.4) – used by the application programmer to manually insert calls;
automatic method (section 4.5) – to automatically identify data locality and
insert appropriate calls; hybrid method (section 4.6) – a mixture of both manual and automatic methods. The design of an OS paging mechanism called CASP is presented in section 4.7, which uses the reified information to optimise the virtual memory subsystem. Further, to simulate the behaviour
of CASP along with reification within the applications, the implementation of
an on-the-fly virtual memory simulator (PROTON) is described in section 4.9.
Finally, simulation results involving benchmark applications with the CASP
mechanism are presented in section 4.11.
4.1 Paging Model
This section describes the paging model used in this chapter. The model is
based on the following assumptions:
• there exists hardware support (e.g. MMU) to trap page-faults and trans-
fer control to the OS’s page-fault handler,
• there exists a fixed amount of physical memory, M, which can be divided into exactly n unique equal-sized pages,
• the system does not support multiple sized pages,
• the OS implements a demand paging system [19],
• memory is allocated in multiples of a page (i.e. one or more) and is virtually contiguous, while physically this may or may not be the case,
• there exists a secondary auxiliary storage device (e.g. a hard disk) that
acts as a swap space [19].
Definition 4.1a A page-fault in a demand paged system occurs when the
page requested by a process is not present in physical memory. Each applica-
tion process uses a data structure called a page table that maps the process’s
virtual pages to the physical ones. The hardware checks this page table and
maps the memory requests accordingly. In cases where there is no entry in the
page table to map the requested virtual page, the hardware reports a page-fault
to the OS.
In a demand paged system, the actual memory pages are allocated to
application processes only when accessed for the first time. Figure 4.1 is a
diagrammatic representation of the paging model. The hardware traps page-
faults to the OS page-fault handler routine. This page-fault handler routine
analyses the information provided by hardware and transfers control to the
page replacement code if needed. The page replacement code is responsible
for all the paging activity in the system such as reclaiming unused pages from
memory, bringing back evicted pages from the swap-space, etc. Depending on the type of page-fault, the page-fault handler does one of the following:
• a page from memory is moved to swap-space; this is called a page-out or swap-out operation,
Figure 4.1: OS Paging Model
• a page from swap-space is moved back into memory; this is called a page-in or swap-in operation,
• in case of demand paging, a new page is allocated to the process accessing
a virtual page for the first time.
Definition 4.1b A page-fault is said to be a minor page-fault when the
page requested has not been allocated and there exists a free page in memory
that can be allocated to the requesting process. This is true for demand ZERO
pages [90]. If the cost to allocate a page in memory is Calloc, then the cost to
handle a minor page-fault,
Cminor ≈ Calloc
Definition 4.1c A page-fault is said to be a major page-fault in two
different scenarios: one is when no free page exists in memory for allocation,
in which case an existing memory page needs to be paged-out to make space
for allocation; and the other is when a previously allocated page could not
be found because it was previously paged-out by the page replacement code,
in which case it needs to be paged-in. Sometimes a page-in operation might
cost an additional page-out operation if there is no free page in memory to
page-in. Thus, a major page-fault can either cause only one page operation
(page-in) or it might cause an additional operation (page-out). Assuming the
cost of page-in and page-out is nearly equal, say Cpage, then the cost to handle
a major page-fault is given as,
Cmajor ≈ (Cpage + Calloc) : 1 page op.
Cmajor ≈ (2 × Cpage + Calloc) : 2 page ops.
Disk read and write operations have always been expensive compared to memory reads and writes. Hence, the cost to allocate a new page in memory, i.e. Calloc, is much lower than the cost to page-in/page-out, i.e. Cpage.
Thus, it can be inferred that:
Cmajor ≫ Cminor (4.1)
Algorithm 1 describes the operation of a page-fault handler routine. On
a page-fault a page-fault handler checks if a page has been allocated. If not,
it allocates a new page. Also, it checks if a page has been paged-out by the
page replacement code. In this case, it allocates a new page in memory and pages-in the old page from the swap space. Note that, in both cases, the
newly allocated page needs to be added to the internal page-list(s) of the OS.
The OS maintains one or more page-lists to keep track of all the pages in
memory. Finally, the page-fault handler maps the page into the page table of
the requesting process.
Procedure Page-fault Handler (fault_address)
Begin
    if page_allocated(fault_address) is FALSE then
        newpage := allocate_page()
        add_page_to_list(newpage)
    else if paged_out(fault_address) is TRUE then
        newpage := allocate_page()
        page_in(newpage, fault_address)
        add_page_to_list(newpage)
    set_mapping(fault_address, newpage)
End Procedure

Algorithm 1: Page-fault Handler Routine
Let ψ = {P1, P2, ..., Pn} be the set of all ‘n’ (allocated + free) pages in memory. Let φ = {P1, P2, ..., Pf}, φ ⊆ ψ, be the set of ‘f’ free pages and ω = {P1, P2, ..., Pm}, ω ⊆ ψ, be the set of ‘m’ allocated pages. Now, ψ = φ ∪ ω.
The function allocate_page() in algorithm 1 may cause a page-out operation if no free pages are available for allocation, i.e. a page Pi ∈ φ is selected if f ≠ 0; otherwise it requests the page replacement code to page-out a page Pi ∈ ω.
Depending on the replacement policy used, one or more pages may be
identified as a candidate for eviction by the page replacement code. In case of
LRU, each page has an associated reference field ‘Ref ’ which is marked or in-
cremented whenever a process accesses that page. The page replacement code
for LRU selects a page Pi ∀ Pi ∈ ω which was least recently used (determined
by the value for Ref and position of the page in the LRU stack) [109].
The time spent by an OS in paging during the execution of a process affects
the process’s turn around time (Tτ ). Tτ for a process can thus be divided into
user-time and system-time. User-time (Uτ ) is the time for which a process
actually executes its code and system-time (Sτ ) is the time for which the OS
executes code either on behalf of the process (e.g. system calls) or is involved
in paging. Thus, Tτ = Uτ + Sτ.
The paging activity of an OS depends on the page replacement policy being
used and the system load at the time of process execution. Other system tasks, such as the execution of system calls, can be thought of as taking constant time compared to paging. Thus, Sτ = Pτ + Oτ, where Pτ is the time taken by an OS in paging activity and Oτ is considered to be the constant time taken for
other OS activities. In an OS with global page replacement policy, page-faults
caused by one process in the system affect the turn around time of another.
Thus, the time taken for the paging activity can be summarised as:
Pτ ∝ (ρ · Cminor + η · Cmajor)
where ρ = no. of minor page-faults and
η = no. of major page-faults.
The turn around time, Tτ of a process can thus be summarised as:
Tτ ∝ {Uτ + (ρ · Cminor + η · Cmajor) +Oτ} (4.2)
From eqn. (4.1) and (4.2), it is clear that an increase in ‘η’ affects Tτ more than an equal increase in ‘ρ’; i.e. Tτ can be improved if the number of major page-faults, ‘η’, is reduced. Clearly, an efficient mechanism that reduces the number of page-faults (mainly ‘η’) can substantially reduce the system time, thereby improving the application's execution time.
4.2 Reification Calls for Paging
This section describes reification calls that will be inserted into application
source code. The reification calls for paging should identify the memory ac-
cesses in the source and reify this information to the reflection framework. The
intention here is to ensure that the memory being accessed is always present in physical memory during its access. An OS paging mechanism could use such reified
information to lock and release memory pages allocated to the corresponding
virtual memory addresses being reified. A detailed description of such an OS
mechanism called CASP [97] is given in section 4.10 later.
For simplicity, two simple names have been chosen to represent the reifica-
tion calls: keep() – suggests that a memory region will be accessed in the near
future and discard() – suggests that a memory region will not be accessed in
the near future. Essentially, both calls are wrappers around the original reify()
call. It is not necessary to define separate reification calls for each and every
type of information that needs to be reified. In this case, keep() and discard()
help in better understanding. Also, wrapper functions provide a better level
of abstraction improving code readability without adding any code penalties.
The following subsections explain the two calls in more detail.
4.2.1 keep(<address>, <size>)
This call captures memory access information of the application under consid-
eration. It indicates to the meta-level of the paging module that a particular
virtual memory region will soon be accessed by the application. keep() can be
defined as a C constant as follows:
#define keep(address, size) reify(KEEP_ALIVE, address, size)
This call returns a unique identifier that is associated with the memory region. The identifier can later be used by the discard() reification call to refer to the previously reified memory region.
4.2.2 discard(<id>)
The discard() reification call indicates that a virtual memory region will not be accessed in the near future, i.e. the paging module may move it to the swap space. Note that the region can still be accessed at any time, just not immediately, so the paging module cannot completely get rid of it.
Similar to keep(), discard() is defined as follows:
#define discard(id) reify(ALLOW_DEATH, id)
Note that discard() does not specify a memory address or size. It uses the unique identifier returned by one of the keep() calls to refer to the corresponding memory region. This requires the application programmer to always precede a discard() call with a corresponding keep() call. The next section explains how these two reification calls can be effectively inserted into the application source code to benefit from the reflection framework.
4.3 Inserting Reification Calls
Insertion of reification calls into the application source code will provide valu-
able runtime information to the RTOS about an application’s memory access
patterns. Three methods of insertion are described: manual method of insertion – calls are inserted by the application programmer; automatic
method of insertion – calls are automatically inserted into the application
source code using a software tool (requiring no intervention from the appli-
cation programmer); hybrid method of insertion – calls are inserted using a
mixture of manual and automatic methods. The methods use large memory-region accesses within loops to identify data locality and insert reification calls around them.
The following sections describe the three methods using a sample appli-
cation – ‘scan’. This is a micro-benchmark application that allocates itself
100 MB of virtual memory and loops 5 times – in each loop iteration reading
all allocated memory (a byte at a time) in a sequential order.
The C source code representation of ‘scan’ is shown in figure 4.2. Notice that the inner loop sequentially accesses a large amount of memory (i.e. reads
the array memptr of size 100 MB). Assuming that the available physical mem-
ory is only 64 MB, ‘scan’ stresses the virtual memory subsystem generating
a worst-case scenario for traditional page replacement policies. Such applica-
tions that use more memory than is physically available are generally termed
as out-of-core applications [26].
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char temp;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops) {
        for (index = 0; index < size; index++) {
            temp = memptr[index];
        }
        loops--;
    }
}

Figure 4.2: Benchmark Application – ‘scan’
4.4 Manual Insertion Method
The application programmer who has sufficient knowledge of the application’s
data size and its usage in the source code is able to accurately insert reification
calls in the source at the time of programming. Manual insertion of calls into
the application source can be time consuming and error prone if the person
inserting them is not the application developer. It is best to add reification
calls during application development. Otherwise the application programmer
needs to:
1. know the application source language,
2. understand the application behaviour,
3. determine data access points or in other words the data hot-spots.
Although the above criteria may seem daunting to the application programmer, the end result can nevertheless be very satisfying, particularly in the case of out-of-core applications executed on resource-limited portable embedded systems, where efficient utilisation of resources has a significant impact on the overall performance of the system.
For the sample application – ‘scan’, the programmer would make limited
modifications by splitting the inner loop into several loops (4 in this case) and
adding reification calls around them (see code in figure 4.3). The programmer
can test the performance of the modified source and accordingly vary the
location of the reification calls or the number of split loops to achieve better
performance.
The call to keep(memptr + index, size/4) suggests that the memory region
of size ‘size/4 ’ starting at the virtual address ‘memptr + index’ will be accessed
immediately in future.
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char temp;
    char *memptr;
    int id;

    memptr = (char *) malloc(size);
    while (loops) {
        id = keep(memptr, size/4);
        for (index = 0; index < (size/4); index++) {
            temp = memptr[index];
        }
        discard(id);

        id = keep(memptr + index, size/4);
        for (; index < (size/2); index++) {
            temp = memptr[index];
        }
        discard(id);

        id = keep(memptr + index, size/4);
        for (; index < (3*(size/4)); index++) {
            temp = memptr[index];
        }
        discard(id);

        id = keep(memptr + index, size/4);
        for (; index < size; index++) {
            temp = memptr[index];
        }
        discard(id);

        loops--;
    }
}

Figure 4.3: Manual Insertion for ‘scan’
4.5 Automatic Insertion Method
In order to assist the RTOS to adapt its policy from a memory management
point of view, the RTOS needs to know about the application’s memory re-
quirements and its access patterns during execution. Reification calls need to
be added into the application at points where the application process accesses
or allocates memory to itself.
Previous work in this area considered several compiler based techniques for
inserting custom memory management hints [26, 82]. The compiler directed
memory management [82] analyses application code at compile time for loops
consisting of accesses to data arrays, inserting primitives such as LOCK, UN-
LOCK and ALLOCATE into the code to control the allocation of memory for
the corresponding arrays at run-time.
Brown et al. [26] proposed a similar approach using compiler-inserted pre-fetch and release hints to manage physical memory. Brown’s approach
used a run-time software layer to queue hints in application space and then
send them across to the underlying OS. These techniques assume that the
underlying OS supports allocation of memory on demand and also provides
an efficient lock/release mechanism. The scope of the reification process is much wider and is not restricted to virtual memory management; this chapter uses virtual memory as a case study to describe the usage of, and support for, reification in the reflective framework.
The process of automatically detecting regions having large memory ac-
cesses in the application source code can be particularly hard and restrictive
without much information (i.e. the control flow graph or a pre-execution trace
of the application) about the application. The method described in this sec-
tion uses only the application source code and no other information to insert
reification calls.
The automatic insertion method exploits loop-based sequential memory accesses. It parses the application source, detects loops with large data accesses, splits each such loop into multiple smaller loops while maintaining the original application behaviour, and then inserts reification calls around these loops to provide information to the RTOS. The process is similar to manual insertion but is done automatically. The next subsection explains the automatic method for applications written in the C language.
4.5.1 Automatic Insertion for C Language
The C language, which is widely used for embedded application development, was chosen for analysis. Note that in C an algorithm can be expressed in many different ways (e.g. the use of pointers instead of arrays, or the use of ‘for’ loops instead of ‘while’ loops). To counter this, the CIL (C Intermediate Language) tool set [53], which can transform such C source code into a uniform C source representation, has been used. For instance, CIL transforms all loop constructs (for, do-while, etc.) into while loops, and all data accesses and declarations are represented uniformly so that there is no difference between a pointer reference and an array reference (i.e. ‘a[i]’ is transformed into ‘(*(a + i))’).
A tool – ‘cloop’ – has been developed to parse the CIL-transformed C source code, detect loops with large amounts of memory accesses and insert the reification calls if required [97].
Figure 4.4 shows the process involved in automatic insertion of reification
Figure 4.4: Steps Involved in Automatic Insertion
calls. With respect to paging, reification calls will be inserted to specify the
application’s memory access patterns to the RTOS at runtime. The tool –
cloop is specialised to detect loops with large amounts of data accesses and
insert reification calls.
In the sample application ‘scan’, reification calls are inserted to suggest
immediate and non-immediate accesses to certain memory regions accessed
within loops. Note that an OS mechanism (in a system with only 64 MB)
would not be able to lock the memory pages if the information reified by ‘scan’
suggests that a memory region of 100 MB will be accessed. In other words,
reification calls need to be inserted more intelligently than by simply specifying
the memory access. This involves taking into account the amount of available
physical memory and reducing the amount of memory locked by an
application at any given time. Since this case study targets loops, it is
logical to split the loops such that each split loop accesses a smaller portion of
the memory region.
The following relation is used to determine the minimum size of memory
region to be locked and/or the number of split loops. A minimum watermark
for the amount of physical memory always to be free is set to ‘y%’ of the total
amount of memory (Mtotal) in the system. If Dsize represents the size of the
memory region being accessed (determined by the data size of the variable as
well as the loop bound) and Mfree represents the total free memory, then the
minimum amount of memory region that the OS mechanism needs to lock, i.e.
Dlock is given by:
Dlock = min( ⌊x% × Dsize⌋, Mfree − y% × Mtotal )    (4.3)
where x is the minimum percentage of the memory region to be accessed and
y is the percentage of total memory that needs to be free.
For the sample application, assume that the minimum watermark y is set
to 10% and that the minimum amount of data to be accessed, x, is set to
25%. Equation 4.3 then becomes:

Dlock = min( ⌊0.25 × Dsize⌋, Mfree − 0.1 × Mtotal )    (4.4)
The mechanism would then lock either 25% of the data size or the available
free memory less a 10% watermark of total memory, whichever is smaller.
The number of split loops is then given by ⌊Dsize/Dlock⌋.
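The arithmetic above can be checked with a small C helper (illustrative only; the function name and whole-MB integer units are assumptions, not part of the thesis code):

```c
#include <assert.h>

/* Sketch of equation 4.3 with all sizes in whole MB; the function name
 * and the integer arithmetic are illustrative assumptions. */
static long compute_dlock(long d_size, long m_free, long m_total,
                          long x_pct, long y_pct)
{
    long by_data = (x_pct * d_size) / 100;            /* floor(x% * Dsize)   */
    long by_mem  = m_free - (y_pct * m_total) / 100;  /* Mfree - y% * Mtotal */
    return by_data < by_mem ? by_data : by_mem;
}

/* With the 'scan' values, Dsize = 100 MB, Mfree = 58 MB, Mtotal = 64 MB,
 * x = 25 and y = 10: compute_dlock(100, 58, 64, 25, 10) yields 25 MB,
 * giving 100/25 = 4 split loops, matching the worked example. */
```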
The cloop tool uses a two-stage process:

• for a loop with a known loop bound, loop splitting techniques [16] are
used to split the loop into separate individual loops. This depends on
the loop bound and the size of the data being accessed. Appropriate
keep() and discard() reification calls are then inserted between the
split loops.

Figure 4.5: Pass-1 of the cloop Tool

Figure 4.6: Pass-2 of the cloop Tool
• for a loop with an unknown loop bound, a separate function (called the
checkpoint function) containing the split loops with the reification calls
is created, and a conditional statement is inserted before the original
loop. The loop bound is checked at run-time, invoking the new function
containing the reification calls if the loop bound and data size are large
enough to benefit from them. This is determined by comparing the free
memory available at that time against the size of the memory being
accessed. In order to benefit from the reification calls, an application
with unknown loop bounds should access data that is either larger than
physical memory or at least larger than the available free memory.
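The unknown-bound transformation described in the second case might look roughly as follows (a hand-written sketch, not actual cloop output; the threshold test, the checkpoint function name and the keep()/discard() stubs are all assumptions):

```c
static int keep_calls;  /* counts keep() calls, for illustration only */

/* Stubs standing in for the CASP reification calls. */
static int  keep(char *addr, long size) { (void)addr; (void)size; return ++keep_calls; }
static void discard(int id)             { (void)id; }

/* Checkpoint function: the split loops with reification calls inserted. */
static void scan_checkpoint(char *memptr, long size, long d_lock)
{
    for (long off = 0; off < size; off += d_lock) {
        int id = keep(memptr + off, d_lock);           /* lock next portion */
        for (long index = off; index < off + d_lock && index < size; index++)
            memptr[index] = 0;                         /* original loop body */
        discard(id);                                   /* release it again  */
    }
}

/* Conditional statement inserted before the original loop: run the
 * checkpoint version only if the region is large enough to benefit
 * (here: larger than the available free memory). */
void scan_inner(char *memptr, long size, long m_free, long d_lock)
{
    if (size > m_free) {
        scan_checkpoint(memptr, size, d_lock);   /* reified version */
    } else {
        for (long index = 0; index < size; index++)
            memptr[index] = 0;                   /* original loop   */
    }
}
```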
Figure 4.5 shows the flowchart of the pass-1 phase of the cloop tool. In this
phase, cloop parses the CIL-transformed source code and builds a meaningful
internal representation to help find loops with large memory accesses. The
flowchart of the pass-2 phase is shown in figure 4.6. In this phase, cloop
identifies the target loops and inserts the reification calls. Using
relation 4.3 above, a loop is either split (in the case of known loop bounds) or
has conditional statements added (in the case of unknown loop bounds).
For the sample application, it is assumed that the system has 64 MB of
physical memory and at the time of execution the available free memory is
58 MB (this is an ideal value when using a freshly booted Linux system). The
code in figure 4.7 shows the CIL transformation of the original ‘scan’ source
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops != 0) {
        index = 0;
        while (index != (size - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        loops = loops - 1;
    }
}
Figure 4.7: CIL Transformation of ‘scan’
code. Note that the inner ‘for’ loop has been converted to a ‘while’ loop so
that all loops in the source are uniform. Also, the term ‘*(memptr + index)’
accesses the index-th element of the data array memptr (i.e. memptr[index]).
This source is parsed to detect loops and determine data access points (Pass
1 of cloop). For ‘scan’, the inner loop is selected as the prime candidate for
splitting. Using equation 4.4, Dlock = min( ⌊0.25 × 100⌋ = 25, 58 − 0.1 ×
64 = 51.6 ), i.e. the minimum size of the data region to be locked is 25 MB
(i.e. ⌊size/4⌋). Thus, the inner loop needs to be split into 4 (⌊100/25⌋) similar
loops. Since the loop bound for the inner loop is known, it is split into 4
smaller loops and the reification calls inserted. The transformed source code
of ‘scan’ is shown in figure 4.8.
Note that the variable ‘index’ is not re-initialised at the start of each split
loop. This allows data access to continue exactly as in the original loop (before
splitting). In each split loop, only the required amount of memory (⌊size/4⌋)
is used, as determined above.
In this particular example, both manual and automatic methods produce
very similar transformations. However, if ‘scan’ dynamically varied the loop
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;
    int id;

    memptr = (char *) malloc(size);
    while (loops != 0) {
        index = 0;
        id = keep(memptr, size/4);
        while (index != (size/4 - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        id = keep(memptr + index, size/4);
        while (index != (size/2 - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        id = keep(memptr + index, size/4);
        while (index != (3*size/4 - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        id = keep(memptr + index, size/4);
        while (index != (size - 1)) {
            temp = *(memptr + index);
            index = index + 1;
        }
        discard(id);
        loops = loops - 1;
    }
}
Figure 4.8: Automatic Method for ‘scan’
bounds using parametric values, then the automatic method would generate
a different transformation. A conditional statement would be inserted before
the inner loop to execute a new function containing the split loops if the loop
bound of the inner loop exceeded the limit determined by equation 4.4.
4.5.2 Comparison of Manual and Automatic Insertion
Generally, the manual insertion process is considered more effective than
the automatic method. This is because the application programmer is able
to accurately insert reification calls even at non-loop-based data access points.
For instance, consider a large amount of memory being accessed in
parts across several function routines. Since the automatic insertion method
only observes memory accesses at certain fixed locations in the source (i.e.
loops in this case study), it fails to recognise this scattered memory access.
The application programmer, on the other hand, would know that the memory
access is scattered across several function routines and can thus add reification
calls encompassing these function routines.
For the application ‘scan’, both manual and automatic insertion methods
produce the same results. This is because the application code iterates over
a fixed-size memory array within a loop with a known loop bound. If the
code were changed so that the loop bound and the array size were passed to
‘scan’ via function parameters, manually inserted reification calls would not
yield the same results for all array sizes and loop bounds. If these values depend
on certain runtime events, then the programmer is unable to accurately insert
the reification calls.
After executing the two versions of the application scan in Linux, it was
found that the one using automatic insertion method finished execution nearly
125 seconds earlier than the one using manual insertion method. Nevertheless,
this is not true for all applications. For example, the MPEG decoder appli-
cation which had data accesses scattered across function boundaries showed
better execution time using the manual insertion method. A more detailed
analysis of both insertion methods for different applications is provided in the
next chapter.
Generally, manual insertion is a slow, time-consuming process and its
accuracy depends on the programmer’s skill in detecting memory accesses
and inserting reification calls accordingly. To conclude, both manual and
automatic methods have their associated pros and cons. This leads to the
hybrid method of insertion (described next), which combines the best of both
methods to yield better results.
4.6 Hybrid Insertion Method
Although the automatic method produces acceptable results, there are known
failure scenarios. It is not possible to automatically add accurate reifica-
tion calls in all applications, particularly where memory is accessed
across function boundaries. Consider, for example, the MPEG decoder appli-
cation [107]. In this application, a large amount of MPEG data is read into
memory and later decoded by several decoding functions depending on the
frame type (e.g. an I, B or P frame) [51]. Each decoding function is
responsible for decoding a particular kind of frame, consuming part of the
MPEG data in the process. In an MPEG stream consisting of several different
frames, data is consumed across various function boundaries. The automatic
method of insertion fails to identify this kind of scattered data access.
On the other hand, the manual method of insertion only produces the best
results if the application programmer is aware of such data access and adds
appropriate reification calls within function boundaries. Complete reliance
on the application programmer can have adverse effects as well. Due to
application complexity, the programmer may fail to insert certain key
reification calls that could make a large difference. Thus, there exist trade-
offs between both methods, prompting a combined hybrid approach.
In the hybrid approach, instead of manually analysing the entire applica-
tion source code, the programmer initially uses the automatic tool (cloop) to
analyse the application source and insert reification calls. The cloop tool can
be configured to output important information about the application source,
for instance the allocation/de-allocation of memory, the location of looped
memory accesses, etc. This provides the programmer with useful information
early on, pointing him/her at specific locations in the source code which can
then be analysed manually for insertion. Thus, if the application source code
is large and complex, the programmer using the hybrid method needs to
analyse only a fraction of it. By mixing the two methods, the programmer
can at least speed up the insertion process for looped memory accesses. The
next section describes CASP, an OS paging mechanism that makes use of the
keep() and discard() calls.
4.7 Design of CASP Mechanism
This section presents the design of a Co-operative Application-Specific Paging
(CASP) [97] mechanism in an OS. CASP makes use of the information pro-
vided by the reification calls inserted using one of the above methods. Note
that in the previous chapter DAMROS was a single address space OS, but
the design of CASP supports multiple address spaces as well.

Figure 4.9: Design of CASP Mechanism

The model of
the CASP mechanism is as shown in figure 4.9. Reification calls corresponding
to memory access/usage in the form of keep() and discard() calls are placed in
the application source code. The operation of CASP is divided into an appli-
cation level component called CASPapp and an in-kernel OS component called
CASPos. CASPos acts as a meta-level component of the OS paging module.
Both components are described in the following subsections.
4.7.1 CASPapp Component
The CASPapp component consists of a runtime library attached to the appli-
cation code. The library uses an OS system call interface to pass information
(in the form of reify() calls) to the CASPos component. A keep() call suggests
a memory region to be locked for use and a discard() call suggests unlocking a
previously used memory region. For example, an application process uses
keep(<Address>, <Size>) to suggest to CASP that it will access the mem-
ory pages mapped for virtual addresses ranging between Address and (Address
+ Size). The CASPapp component passes this information using reify() calls.
The reified information is picked up by the CASPos component, which uses re-
flection to lock the pages in memory along with techniques such as pre-paging
and page-isolation (described later). A call to discard(ID) suggests that the
CASPos component unlock the previously locked pages.
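The CASPapp library wrappers could look as follows (a sketch only: the reify() signature, the operation codes and the MEMORY resource ID are assumptions modelled on the text, not the actual DAMROS interface):

```c
/* Hypothetical reification record passed from CASPapp to CASPos. */
enum mem_op { OP_KEEP, OP_DISCARD };

struct mem_reify {
    enum mem_op op;
    void       *addr;
    long        size;
    int         id;
};

#define MEMORY 1   /* assumed resourceID for the paging meta-component */

/* Stub for the framework's system-call entry point. */
static int reify(int resource_id, struct mem_reify *info)
{
    (void)resource_id; (void)info;
    return 0;
}

static int next_id = 1;

/* keep(): suggest that [addr, addr+size) be locked in memory. */
int keep(void *addr, long size)
{
    struct mem_reify r = { OP_KEEP, addr, size, next_id++ };
    reify(MEMORY, &r);
    return r.id;              /* handle for a later discard() */
}

/* discard(): suggest that a previously kept region be unlocked. */
void discard(int id)
{
    struct mem_reify r = { OP_DISCARD, 0, 0, id };
    reify(MEMORY, &r);
}
```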
4.7.2 CASPos Component
The CASPos component is activated by the rManager when an application pro-
cess uses the keep() or discard() reification calls. After receiving the information,
i.e. the memory Address and Size, the process’s address space is checked to
see if the memory pages have already been allocated or if any pages in the
given memory region need to be paged in. Accordingly, pages are allocated or
paged in from the swap space and mapped into the process’s page table. This
operation is called pre-paging.
Algorithm 1 lists the pseudo code of the page-fault handler routine in an
OS. Note the use of the add_page_to_list() function, which adds a newly
allocated page into the OS-maintained page lists. When a page at a
particular virtual address is not found, the page-fault handler routine either
allocates a new page or pre-pages it from the swap space.

The CASPos component has a similar operation. In order to lock pages
in memory, the CASPos component uses a technique called page isolation
(explained in the next subsection) such that the locked pages are not placed
in the OS-maintained page lists. The only difference between the page-fault
handler routine and the CASPos component is that the page-fault handler
routine adds the allocated/pre-paged pages into the OS-maintained page lists
(via add_page_to_list() as in algorithm 1) whereas CASPos does not.
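The page-fault path and its interception point can be illustrated with a toy fragment (Algorithm 1 itself is not reproduced here; all types and helpers below are simplified stand-ins, not OS code):

```c
#include <stddef.h>

struct page { int frame; };

static struct page frames[8];
static int next_frame;
static int os_list_len;          /* pages on the OS-maintained lists */

/* Simplified stand-ins for the real OS helpers. */
static struct page *find_page(long vaddr)     { (void)vaddr; return NULL; }
static struct page *alloc_page(void)          { return &frames[next_frame++]; }
static void add_page_to_list(struct page *p)  { (void)p; os_list_len++; }

/* Simplified page-fault handler: allocate (or pre-page) the missing
 * page, then publish it on the OS page lists. CASPos differs only in
 * that it intercepts add_page_to_list() and isolates the page instead. */
void page_fault_handler(long vaddr)
{
    struct page *p = find_page(vaddr);
    if (p == NULL) {
        p = alloc_page();        /* or pre-page from the swap space */
        add_page_to_list(p);     /* interception target for CASPos  */
    }
}
```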
One of the advantages of the generic reflective framework is the ability
to re-use existing code. The CASPos component makes use of the interception
mechanism to re-use the existing code of the page-fault handler routine [95,
96] (see section 4.7.4). This mechanism allows the CASPos component to
intercept calls to add_page_to_list() in the page-fault handler routine such that
control is transferred to a page_isolation() routine instead. In a sense,
the CASPos component uses the information provided by the reification calls
to adjust the working set memory image of the application by pre-fetching,
locking/releasing and pre-swapping pages in and out of memory. This helps to
keep the memory pages in physical memory whenever the application
accesses them. The next subsection describes the page-isolation technique used
by the CASPos component for an efficient, non-intrusive page locking operation.
4.7.3 Page-isolation Technique
The page-isolation technique is relatively simple in operation. Algorithm 2
lists the pseudo code of the page-isolation routine. It is assumed that the OS
maintains two page lists (similar to Linux [19]): an active page list, consisting
of all pages that are in use, and an inactive page list, consisting of the
remaining pages. The page-isolation routine determines the list a page belongs
to and removes the page from that particular list. The page, thus removed, is
completely isolated from the OS-maintained page lists. This process is termed
page-isolation.
In order to keep track of the isolated pages, the CASPos component main-
tains a separate page list specific to each application that uses the reification
calls. This page list is stored in the OS address space and is not accessible to
the applications. It also incurs no extra memory overhead, since it occupies
the same amount of memory as if the pages were stored in the original OS page
lists.

The isolated page list is later emptied by adding the pages back into the
respective OS page lists when the corresponding application terminates, when
it uses a discard() call or when there is no free memory available for other
application processes.
Checks have been added in the CASPos component so that a single application
is not allowed to lock all available memory for itself. Even if a programmer
greedily adds keep() reification calls in order to lock more memory, CASPos
only locks the first Nfree/Nprocess pages, where Nfree is the number of
available free pages and Nprocess is the number of different application
processes running in the system. Each keep() call is time-stamped so that,
when there is no more free memory available in the system, the CASPos
component recovers pages starting from the oldest isolated page list.
Procedure Page-Isolation(page)
Begin
    if page_in_active_list(page) is TRUE then
        remove_from_active_list(page)
    else
        remove_from_inactive_list(page)
    add_page_to_isolated_list(page)
End Procedure

Algorithm 2: Page-isolation Routine
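A minimal C rendering of Algorithm 2, assuming simple doubly-linked page lists with an owner pointer (the list layout is illustrative, not the actual Linux structures):

```c
#include <stddef.h>

struct page {
    struct page  *prev, *next;
    struct plist *owner;              /* list this page currently sits on */
};

struct plist { struct page *head; };

struct plist active_list, inactive_list, isolated_list;

/* Unlink a page from whichever list owns it. */
static void list_remove(struct page *p)
{
    if (p->prev) p->prev->next = p->next;
    else         p->owner->head = p->next;
    if (p->next) p->next->prev = p->prev;
    p->prev = p->next = NULL;
}

/* Push a page onto the front of a list and record the new owner. */
static void list_push(struct plist *l, struct page *p)
{
    p->prev = NULL;
    p->next = l->head;
    if (l->head) l->head->prev = p;
    l->head  = p;
    p->owner = l;
}

/* Algorithm 2: remove the page from the active or inactive list
 * (whichever holds it) and move it to the per-application isolated list. */
void page_isolation(struct page *p)
{
    list_remove(p);
    list_push(&isolated_list, p);
}
```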
CASP operates non-intrusively with the existing page replacement code
and thus has virtually no side-effects. Since the isolated pages do not exist in
the OS page lists, they are never considered as candidates for reclamation by
the OS’s page replacement code. This could, in fact, speed up the reclamation
process, since the code has fewer candidates for page reclamation.
CASP achieves page locking without knowledge of the original page re-
placement code, making it a generic approach that can easily operate on top of
any existing page replacement policy. The next subsection describes the use
of the reflection framework in CASP.
4.7.4 Use of the Reflection Framework
CASP uses the interception mechanism built into the generic reflective frame-
work to re-use the existing page replacement code of the OS. The mechanism
allows unwanted function calls within a particular routine to be intercepted
and control transferred to another routine instead. This promotes code
reusability and also eliminates code redundancy.

During OS initialisation, the CASPos component sets itself as the meta-level
component of the resource represented by the resourceID MEMORY. Thus,
whenever the reflection framework receives information reified for the MEMORY
resource, the rManager activates the CASPos component. Also during ini-
tialisation, CASPos intercepts calls to add_page_to_list() in the page-fault han-
dler routine once and immediately unintercepts them by setting keepAlive to
TRUE.
When the CASPos component starts pre-paging, it uses reinterceptCall()
to intercept calls to add_page_to_list() before calling the page-fault handler rou-
tine. This results in control being transferred to the page-isolation routine
when the page-fault handler routine calls add_page_to_list(). By skipping
the execution of add_page_to_list(), the pre-paged pages are prevented
from being added into the OS page lists. At the same time, by executing the
page_isolation() function instead, these pages are added into the isolated page
list of the corresponding application. The next section discusses the evaluation
strategy for CASP.
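The re-interception sequence can be mimicked with a function pointer (only an analogy: the real framework patches the call site inside the page-fault handler, and reinterceptCall()/keepAlive are the framework's own primitives, whose exact signatures are not shown here):

```c
static int isolated_pages, listed_pages;

static void page_isolation_stub(void)   { isolated_pages++; }
static void add_page_to_list_impl(void) { listed_pages++; }

/* The page-fault handler calls through this pointer; redirecting it
 * models what the interception mechanism does to the real call site. */
static void (*add_page_to_list)(void) = add_page_to_list_impl;

static void page_fault_handler(void) { add_page_to_list(); }

/* CASPos pre-paging: re-intercept, re-use the handler, then restore. */
void caspos_prepage(void)
{
    add_page_to_list = page_isolation_stub;   /* reinterceptCall()   */
    page_fault_handler();                     /* re-used OS code     */
    add_page_to_list = add_page_to_list_impl; /* restore normal path */
}
```

With the redirect in place, pre-paged pages reach the isolated list instead of the OS page lists; once restored, ordinary faults behave as before.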
4.8 Evaluation Strategy
DAMROS is a single address space OS, whereas CASP is designed to support
multiple address spaces. It would therefore be interesting to see the applicability
of the framework and of CASP to an OS supporting multiple address spaces.
Rather than implementing a new OS or changing DAMROS, it would be ideal
to implement just the core elements of the framework, along with the CASP
mechanism, in a commodity multi-address space OS. However, before implementing
the framework and CASP in a commodity OS, evaluation using simulation is
considered in this chapter.
Existing virtual memory simulators cannot be customised to add CASP
capabilities, and they are also generally slow (in terms of simulation time).
The following sections present existing virtual memory simulation techniques,
followed by a description of PROTON [93], a home-grown, customisable,
on-the-fly virtual memory simulator. Later, in the evaluation section,
PROTON is used to simulate CASP along with applications that use the
reification calls.
4.9 Virtual Memory Simulation
This section surveys the existing virtual memory (VM) simulation techniques.
The VM simulation techniques can be classified into two main categories:
Trace-driven simulation and On-the-fly simulation.
4.9.1 Trace-driven Simulation
In the trace-driven approach, a complete memory reference trace of the given
workload executing upon real hardware is obtained. This trace is gener-
ally recorded as a disk file or transmitted via a communication medium (e.g.
Ethernet). The memory reference trace so obtained is processed by a simple
VM simulator implementing the required paging policy (e.g. LRU). Several
solutions, such as Laplace [66] and kVMTrace [67], exist to obtain memory
reference traces of a system workload.
ATOM [114] is a static code annotation based trace collection tool which
analyses a single application. ATUM [10], on the other hand, uses microcode
to efficiently capture address traces. Since the microcode operates beneath the
OS layer, the captured trace consists of the memory accesses of all the software
components running on the hardware.
Trace-driven approaches are known to generate huge reference traces even
for only a few seconds of workload execution. This introduces several
errors, such as trace discontinuities, time dilation and memory dilation [122].
Such errors are collectively called trace distortions [122].
Kaplan et al. [68] proposed two algorithms, Safely Allowed Drop (SAD)
and Optimal LRU Reduction (OLR), for reducing the trace size by several
factors. Using this approach, the simulation error in terms of the number of
page faults for the CLOCK and SEGQ (segmented queue) replacement policies
was under 3%. However, since the trace reduction algorithms discard information
which is not required by an LRU policy, this approach only applies to LRU-
based policies. The next subsection discusses the existing on-the-fly simulation
techniques.
4.9.2 On-the-fly Simulation
On-the-fly VM simulation techniques simulate the memory references
alongside the execution of an application process. The approach involves an-
notating the application code to call a simulator function for each memory
access instruction; this function simulates the memory reference in the simu-
lator. Although this method eliminates the need for recording and handling
huge memory reference traces, it adds considerable runtime overhead.
MemSpy [83] is one such simulator, which annotates an application’s assembly
code. Typically, MemSpy exhibited a slow-down factor in the range of about
20 to 60 for the simulation of a direct-mapped data cache of size 128 KB [122].
Fast-Cache [74] is an on-the-fly simulator based on an abstraction called
‘active memory’. It exhibits a slow-down factor in the range of about 2 to 7
for the simulation of direct-mapped data caches of sizes 16 KB to 1 MB.
Most of the existing on-the-fly simulators support single application pro-
cesses only, making it possible neither to determine overall system performance
nor to predict the effects of a particular application on an existing system
workload. PROTON, on the other hand, can simulate multiple applications,
making it possible to determine the VM performance of the entire system.
Eggers et al. [43] present techniques for the efficient placement of in-
strumentation trace points, or annotations, into application assembly code.
The approach is particularly focused on shared-memory multiprocessor ar-
chitectures. PROTON focuses on reducing code annotations in a high-level
language source rather than in assembly code. This gives it more flexibility
and makes the process portable to any platform (particularly if the application
source is written in the C language).
The next section presents the design and implementation of PROTON, which
is used to simulate the reification calls as well as the CASP mechanism.
4.10 PROTON Virtual Memory Simulator
The PROTON [93] simulator has been specifically designed for flexibility, easy
customisability and support for the simulation of multiple applications at once.
Another objective of PROTON is to improve upon VM simulation time.
Figure 4.10 shows the model of PROTON. The implementation uses the
POSIX [98] library for handling multiple threads. Similar to other on-the-fly
VM simulators, PROTON annotates the application source. The annotations
call a PROTON function at the point of memory reference in the application.
Adding annotations in the high-level language helps to better analyse the data
access pattern and optimise the placement of annotations. The following subsec-
tions explain the optimised code annotation technique and the operation of
PROTON in more detail.
4.10.1 PROTON Annotations
The application source code, written in a high-level language (C in this case),
is annotated in three phases. In phase #1, PROTON parses the high-level
language application source and builds an internal representation of the dynamic
data flow of the application, similar to the one generated by the cloop
tool. Phase #2 detects memory accesses and inserts suitable annotations using
the optimal placement technique (explained in the next subsection). In phase
#3, all the memory allocation/deallocation functions (i.e. malloc(), free(), etc.)
of the underlying C library are intercepted such that PROTON can trap
dynamic memory allocations and simulate them.
Optimal Placement of Annotations
This subsection uses the application ‘scan’ (figure 4.2) as an example for plac-
ing annotations. Traditional annotation methods involve annotating the
assembly code of an application. These methods utilise little or no information
regarding looped sequential access; neither can such methods detect the
size of a dynamic memory allocation (e.g. the size of memptr in scan).
Figure 4.10: PROTON Design Model

If PROTON were to use similar techniques, the resulting annotated
application source would look like the listing in figure 4.11. The annotation
‘sim_mem_access(&memptr[index], 1, READ)’ indicates to the simula-
tor that the application is reading 1 byte from the memory location pointed
to by memptr[index]. Note that this example only shows annotations for the
dynamic memory allocations; the variable temp is stored on the stack and its
accesses are not annotated here. It is evident that the annotation
for the inner loop can easily be inserted before the start of the loop such that
a single annotation is enough to represent the entire memory access of the
variable memptr in the loop.
Such optimised placement of annotations is only possible by analysing
application source written in a high-level language. Phase #2 of PROTON
detects this kind of memory reference and inserts a single combined annotation
outside the loop. In the case of ‘scan’, by adding a single annotation outside the
inner loop, the annotation is called only 5 times instead of (5 × 104,857,600)
times.
Each annotation is a function call, which requires the storage and re-
trieval of the processor flags and registers on the stack. Thus, heavy use of
such annotations would incur a substantial runtime overhead, potentially slow-
ing down the simulation process. By optimal placement of the annotations,
PROTON minimises the number of annotations (i.e. function calls), thereby
reducing the associated overhead (see the code in figure 4.12).
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops) {
        for (index = 0; index < (size - 1); index++) {
            sim_mem_access(&memptr[index], 1, READ);
            temp = memptr[index];
        }
        loops--;
    }
}
Figure 4.11: ‘scan’ with Traditional Annotation
Nevertheless, such a placement technique may introduce an element of error
into the simulation. For example, in the case of ‘scan’, the placement strategy is
accurate when memptr is the only variable being accessed in the loop. How-
ever, if other variables are also accessed in the loop, these intermittent
accesses would affect the state of the virtual memory subsystem, resulting in
a different set of paging operations. For instance, consider the following code
statement:

memptr[index] = strptr[index] + memptr[index];

If this statement is added to the inner loop of ‘scan’, then the memory
locations strptr[index] and memptr[index] are first read from memory, the sum
calculated and the result written back to the location memptr[index]. In
this case, it is not appropriate to insert three annotations outside the loop: two
indicating read accesses to the variables memptr and strptr using the READ
option and one indicating a write access to memptr using the WRITE option
to sim_mem_access(). For all such cases, PROTON uses the traditional
approach of adding an annotation for each memory access inside the loop.
void scan()
{
    int index, loops = 5, size = 100 * MB;
    char *memptr;

    memptr = (char *) malloc(size);
    while (loops) {
        sim_mem_access(memptr, size, READ);
        for (index = 0; index < (size - 1); index++) {
            temp = memptr[index];
        }
        loops--;
    }
}
Figure 4.12: ‘scan’ with PROTON Annotations
A preliminary analysis of the MiBench [56] applications suggested that loop-
based single-variable accesses similar to ‘scan’ are common. For in-
stance, in the bubble sort algorithm all accesses in the loop are confined to
the array being sorted. However, there are some exceptions, such as the
FFT (Fast Fourier Transform) application, whose access pattern depends
on the data at runtime.
PROTON is able to detect dead code in loops. For instance, consider a
loop containing dead code as shown in the following code fragment:
...
condition = FALSE;
for(i=0; i< SIZE; i++)
{
if(condition) {
temp = memptr[i];
}
else {
;
}
}
...
In the above code, the read access to memptr would never be executed.
During phase #1, PROTON builds an internal symbol table with knowl-
edge of statically initialised variables. For the above code fragment, in phase
#2, when analysing the ‘for’ loop, PROTON evaluates the condition of the
‘if ’ statement, which in this case results in FALSE. Hence, no annotations are
added for access to the variable memptr. However, PROTON can only analyse
static conditions (it cannot analyse, for instance, if(func_call(condition))).
In such cases, PROTON adds the annotation inside the dead code. Since the
annotation is within the dead code, it is never executed, thus maintaining the
exact runtime behaviour.
4.10.2 Simulation of Multiple Applications
In order to simulate multiple applications, the annotations from all applica-
tions should be recorded by a common PROTON simulator code base. IPC
mechanisms [90] such as pipes or shared memory could address this, but using
IPC would add additional overhead into the simulation process, making it
much slower. PROTON takes a different approach. During the application
source analysis in phase #1, a minor source modification enables the entire
application workload, consisting of all applications, to run as a single
multi-threaded application which is linked with the PROTON simulator.
Each application is then executed as an independent thread of the resulting
application process. This way, the annotations added into each application
provide a direct function call interface without incurring extra communication
overhead.
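The single-process workload structure can be sketched with pthreads (illustrative only; the renamed entry points and the simulator's internal counter are assumptions, not PROTON's actual code):

```c
#include <stddef.h>
#include <pthread.h>

static long simulated_refs;
static pthread_mutex_t sim_lock = PTHREAD_MUTEX_INITIALIZER;

/* Annotation target: a direct call into the simulator, shared by all
 * application threads in the single address space. */
void sim_mem_access(void *addr, long size, int op)
{
    (void)addr; (void)size; (void)op;
    pthread_mutex_lock(&sim_lock);
    simulated_refs++;                 /* update shared simulator state */
    pthread_mutex_unlock(&sim_lock);
}

/* Each application's main() is renamed and run as a thread. */
static void *app1_main(void *arg) { (void)arg; sim_mem_access(NULL, 64, 0); return NULL; }
static void *app2_main(void *arg) { (void)arg; sim_mem_access(NULL, 32, 0); return NULL; }

long run_workload(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, app1_main, NULL);
    pthread_create(&t2, NULL, app2_main, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return simulated_refs;    /* total references seen by the simulator */
}
```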
Although PROTON supports multi-threaded applications, it does not
guarantee their scheduling behaviour. Since PROTON is built on top of the
POSIX pthreads [98] library, a multi-threaded application can opt to use the
underlying POSIX scheduling framework if available.
If a single application is multi-threaded, then during simulation of multiple
applications, the individual threads of that particular multi-threaded applica-
tion would be executed along with other application threads, all sharing a
single address space. One limitation of this approach is that there is no pro-
tection between the threads belonging to different applications. Thus, a faulty
or erroneous application within the simulated workload can affect the simula-
tion process by interfering with other application threads. It is therefore assumed that
each application has been tested and is bug free. Tools such as Valgrind [108]
can be used to detect memory errors in the application code.
The simulator has been implemented as a shared library which is linked
to the application at runtime. The following parameters are used to configure
PROTON's virtual memory system:
• TOTAL_MEMORY: This parameter sets the amount of physical memory
to simulate.
• PAGE_SIZE: This parameter sets the size of a memory page. PROTON
supports different page sizes, allowing it to simulate virtual memory for
different architectures.
• DISK_READ_TIME: This parameter sets the time required to
read a single page from the disk. PROTON simulates a disk read/write
by executing a delay sequence for the given time.
• DISK_WRITE_TIME: This parameter sets the time required
to write a single page to the disk.
• TOTAL_APPS: This parameter specifies the number of application
threads to be simulated.
• POLICY: This parameter specifies the paging policy to be used for sim-
ulation. PROTON implements the following paging policies [20, 40, 90,
120]:
1. LRU policy,
2. CLOCK policy,
3. MRU policy,
4. FIFO (First-In-First-Out) policy,
5. AGEING policy,
6. LFU policy,
7. MFU (Most Frequently Used) policy,
8. RANDOM policy,
9. USER - specifies that a user-defined policy is in use.
Note that PROTON implements the CASP mechanism described in sec-
tion 4.7 which operates on top of any of the above page replacement
policies.
• MK_TRACE: This parameter enables or disables recording of the memory
reference trace to a disk file.
• GUI: This parameter enables a graphical display of the execution of applications
and dynamic graphs of the paging activity of the system. Use of the
GUI (Graphical User Interface) slows the simulation process and is not
recommended for workloads with lengthy execution times.
A configuration file is used to set up the above simulator parameters. A set
of annotated test applications are compiled and linked together to form one
executable, which when executed starts the PROTON initialisation function.
This function reads the configuration file and initialises the virtual memory
system; PROTON then spawns the application threads and begins simulation. The
simulation either runs continuously until all application threads finish
execution or runs for a fixed amount of time. PROTON then generates the virtual
memory statistics of the application(s). The PROTON virtual memory sys-
tem assumes the following: the cache is disabled, the CPU accesses main
memory directly, and the swap space has unlimited storage.
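The text does not specify the configuration file's syntax; purely as an illustration, the parameters above might be collected in a key=value file along these lines (all values hypothetical):

```
# PROTON configuration (illustrative sketch; the actual file format is
# not specified in the text, and all values are hypothetical)
TOTAL_MEMORY    = 64M      # physical memory to simulate
PAGE_SIZE       = 4096     # size of a memory page in bytes
DISK_READ_TIME  = 8        # time to read one page from disk
DISK_WRITE_TIME = 10       # time to write one page to disk
TOTAL_APPS      = 3        # number of application threads
POLICY          = LRU      # or CLOCK, MRU, FIFO, AGEING, LFU, MFU, RANDOM, USER
MK_TRACE        = 0        # 1 = record the memory reference trace to a disk file
GUI             = 0        # 1 = enable the graphical display (slower)
```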
4.10.3 Implementing UD Paging Policies
The value ‘USER’ for the configuration parameter ‘POLICY’ specifies the use of
a user-defined (UD) page replacement policy. PROTON provides user-hooks which,
when set, are executed at certain paging events during simulation. A new
UD policy needs to use these user-hooks to trap the required paging events.
The following user-hooks are provided in PROTON:
• pgrep_init() and pgrep_exit(): These user-hooks are called by PROTON
to initialise/de-initialise the page replacement policy code.
• pgrep_replace(): This user-hook is called when there is no free mem-
ory available for allocation. The function is expected to return a page
number between 0 and MAX_PAGE_NUMBER (a globally accessi-
ble value) that will be swapped/replaced by PROTON. A UD policy
must implement this hook; otherwise, the PROTON simulator will not
initialise.
• pgrep_alloc() and pgrep_free(): These user-hooks are called after a new
page has been allocated to or freed from an application. The implemen-
tation of these hooks is optional.
• pgrep_access(): This user-hook is called whenever a page is accessed by
an application. This helps the UD paging policy maintain page usage
statistics.
The next section describes simulation experiments using PROTON to de-
termine the effects of reification calls and the CASP mechanism for different
kinds of benchmark applications.
4.11 Simulation Experiments using PROTON
4.11.1 Benchmarks
Experiments were performed using four different kinds of applications: one is
the sample application ‘scan’, two were taken from the embedded benchmark
suite MiBench [56], and one was taken from Brown's work [26]. The data-set size of
these applications was increased such that they required more physical memory
than was available. PROTON was configured to simulate 64 MB of memory
with an unlimited swap-space. Table 4.1 summarises the characteristics of
these applications. Each chosen application exhibits a different memory access
pattern.
Name     Description                       Input Data Set                          Memory required
BSORT    Bubble sort                       2559 records of 40 KB each              100 MB
FFT      Fast Fourier transform            four 32-bit float arrays of size 2^22   64 MB
MATVEC   Matrix vector multiplication      three 32-bit integer arrays             101 MB
SCAN     Example benchmark application     continuous byte array                   100 MB

Table 4.1: Description of Benchmark Applications
BSORT – a bubble sort algorithm that sorts a list of records (approx.
40 KB in size) in ascending order. The access pattern is mainly sequential
with a few random accesses due to record swap operations.
FFT – uses 4 floating point number arrays to perform the fast Fourier
transform. The data access pattern in the main loop of FFT depends on the
value of the data itself, resulting in a random data access pattern.
MATVEC – a matrix vector multiplication application that uses three data
arrays of different sizes to perform intensive matrix multiplication operations
in loops with known loop bounds. Therefore, it was easy to insert reification
calls by both automatic and manual methods. MATVEC was used to test
CASP for locking multiple data sets.
Finally, SCAN – the previously described micro-benchmark application. In
order to evaluate the automatic insertion method under dynamic conditions,
SCAN was modified to use parametric values for the variables ‘loops’ and ‘size’
(see original code in figure 4.2). SCAN generates a worst-case scenario for the
page replacement policies, which helps determine the performance of CASP in
the worst case.
Application   Original (O)           using PROTON (P)       using CASP (C)
              Time      Faults       Time      Faults       Time      Faults
BSORT         20,289    1,248,444    9,115     1,216,539    7,152     1,128,046
FFT           7,832     809          3,616     799          2,855     654
MATVEC        3,215     164,774      2,664     164,761      1,041     103,502
SCAN          1,773     112,385      1,581     112,385      599       80,681

Table 4.2: Single Application Benchmark Results for LRU
The results obtained from PROTON simulation have been validated
against the results of the original simulation method for the same applica-
tions. In the original method, applications were annotated for each and every
memory access and simulated for the respective paging policies. It is observed
that PROTON simulation has minimal simulation error (in terms of the dif-
ference in page fault generation) as compared to the original simulation. This
is further discussed towards the end of this section. The results obtained by
the original simulation method are represented as (O) for each application
workload.
4.11.2 Single Application
Initially, to test the performance of CASP with the use of reification calls
in complete isolation, a single application scenario is considered. Also, the
performance of the PROTON simulator with the described optimisation was com-
pared against the original simulation results (O). Three versions of all the
benchmark applications were produced: (1) PROTON version – the annota-
tions were added into the application source code using the placement tech-
nique; (2) CASP version – similar to (1) but including the reification calls (i.e.
keep() and discard()); and (3) Original version – the annotations were added at
each memory reference in the application source code. The benchmarks were
simulated for the following page replacement policies: LRU, MRU, LFU and
MFU. Each version of the benchmark application was simulated in a single
application scenario for the above page replacement policies.
Table 4.2 lists the simulation time (in seconds) and the number of page-
faults for each of the benchmark applications when simulating an LRU paging
policy. The values under column (O) represent the original simulation, those
under column (P) represent the normal simulation using PROTON and those
under column (C) represent PROTON simulation with the applications using
the CASP mechanism. Since the original simulation annotates each and every
memory reference, its simulation time is considerably larger than the PRO-
TON simulation. The timings of (O) are used as a guide to the maximum
value of the simulation time when comparing the (P) and (C) values.
Applications using CASP generated significantly fewer page faults
(see table 4.2) and exhibited lower simulation times. On average, for an LRU
policy, applications using CASP generated 19% fewer page faults.
MATVEC using CASP performed best amongst all the applications.
This is because the data access pattern of MATVEC is known and it
consists of loops with known bounds. The access pattern of SCAN is also
deterministic and hence it too benefited from CASP. The non-deterministic nature
of the data to be sorted and the intensive swap operations between random
memory locations caused BSORT to perform the worst. FFT also has a nearly
random access pattern; the reification calls in FFT try to lock the part of the
memory being accessed sequentially.
Simulator Performance
For BSORT, PROTON reduced the simulation time by almost
57% (avg.) across all policies (see figure 4.13). BSORT accesses
memory intensively; PROTON inserts common annotations outside the loops,
saving processing time, which results in better simulation time. However, due
to this optimised placement, PROTON incurs a 2% simulation error in terms
of the number of page faults.
Similarly, for FFT, MATVEC and SCAN, PROTON reduces simulation
time by 48%, 28% and 17% respectively (see figures 4.14, 4.15 and 4.16). The
analysis of application source written in a high-level language helps PROTON
identify data locality in the code and optimise the placement of code annota-
tions.
On average, PROTON reduces the simulation time by between 17% and 57%,
with the simulation error ranging from 0% to 2%. Considering the improve-
ment in the simulation time and the flexibility offered by PROTON, a simu-
lation error of 2% is acceptable.
[Figure 4.13: BSORT Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]

[Figure 4.14: FFT Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]

[Figure 4.15: MATVEC Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]

[Figure 4.16: SCAN Simulation – normalised execution times under LRU, MRU, LFU and MFU for the Original, Proton and CASP versions]
4.11.3 Multiple Applications
Another advantage of PROTON is its ability to simulate multiple-applica-
tion workloads. Conventional VM simulators do not support simulation of
multiple applications. Such simulation not only helps system devel-
opers determine the performance of the entire workload, but also allows them to
determine the effects of a particular application on a given workload. Further-
more, simulation using different page replacement policies in PROTON helps
the developer identify the most suitable page replacement policy for a given
workload.
Tables 4.3, 4.4 and 4.5 list the simulation results for an LRU policy in two
and three application scenarios. In the tables, the column ‘App-set’ shows
the combination of applications that were simulated together. To denote this,
the first letter of each application is used. For instance, S-M suggests that
applications SCAN and MATVEC were simulated together. Similarly, S-B-F
suggests that applications SCAN, BSORT and FFT were simulated together.
App-set   using PROTON (P)        using CASP (C)
          Time      Faults        Time      Faults
S-M       7,286     284,159       5,947     123,608
S-B       18,492    1,392,173     12,058    1,062,146
S-F       8,254     114,637       7,947     88,037

Table 4.3: Two Applications Scenario for LRU (1)

App-set   using PROTON (P)        using CASP (C)
          Time      Faults        Time      Faults
M-F       10,145    172,732       9,047     143,233
M-B       20,348    1,392,324     17,094    1,102,068
F-B       21,026    1,218,372     22,542    1,234,016

Table 4.4: Two Applications Scenario for LRU (2)
It is evident from the tables that, on average, the simulation time of the ap-
plications almost doubles when compared to the single application scenario.
However, the number of page-faults varies according to the applications. This
is because the CPU resource is shared amongst several applications
using a scheduling algorithm, whereas the performance of paging is largely
dependent on the page replacement policy being used and an application's
memory access patterns. If the page replacement policy in the OS cannot
be changed, then the type of applications and their respective memory access
patterns dictate the paging performance of the system. Thus, it is important
to be able to determine the effects of certain applications on a given workload.
It was observed that applications with known or deterministic memory
access patterns showed better performance using CASP. The key is to use ac-
curate reification calls in the source code. Since it is difficult to predict the
memory access pattern of applications such as BSORT and FFT, reification
calls were inserted to intermittently lock and release certain memory regions.
Notice that although the applications using CASP show better performance
than the ones not using CASP, the performance improvement is not signifi-
cant. Looking at the results obtained for the two applications scenario, it can
be concluded that CASP depends on accurate reification calls and provides
better support to applications with known memory access patterns. When
the applications FFT and BSORT are simulated together, more page
faults are generated, increasing the simulation times. Since both applications
have nearly random access patterns, the insertion of reification calls has no
significant benefit.
App-set   using PROTON (P)        using CASP (C)
          Time      Faults        Time      Faults
S-B-F     22,394    1,334,428     19,065    1,186,685
S-B-M     25,235    1,502,785     19,376    1,163,424
S-F-M     12,947    282,728       10,031    153,139
B-F-M     13,283    1,394,482     12,997    1,288,604

Table 4.5: Three Applications Scenario for LRU
Similar results were obtained for the three applications scenario, in which
the applications performed best when SCAN and MATVEC were executed
together in a group (see table 4.5). From the above simulations, it can be noted
that even when applications using CASP do not show significant performance
improvement, CASP does not impose large penalties on the system. Also,
CASP provides the best support for applications with deterministic memory access
patterns (e.g. sequential access).
4.11.4 Slow-down Factor
Conventional on-the-fly simulation techniques have been shown to add a slow-down
factor ranging from 20 to 60 [122]. In comparison to (O), PROTON has been
shown to reduce the simulation time by 17%–57%. Thus, PROTON-based
simulation can be considered nearly 17%–57% faster than the
traditional method of full annotations. Furthermore, previously reported sim-
ulators simulated small amounts of memory, ranging between 128 KB and 1 MB;
the use of larger memory may further affect their performance. For the above
simulations, PROTON was configured to use 64 MB of memory.
In summary, the evaluation in this chapter has accomplished two main
objectives. Firstly, it showed that the CASP mechanism helps reduce the
number of page faults by almost 19% when accurate reification calls are used.
Secondly, the PROTON simulator is a better alternative for on-the-fly virtual
memory simulation, one which supports simulation of multiple applications. The
simulator improves simulation time by reducing the required number of code
annotations.
4.12 Summary
This chapter used virtual memory (paging) as a case study to show the sig-
nificance of reification in the reflective framework. The chapter considered
various methods of inserting reification calls into application source code
written in the C language. The design of CASP, an OS paging mechanism that
utilises the information provided by the reification calls, was presented. CASP
efficiently locked/released pages in memory such that the pages that an appli-
cation would access in the immediate future were always present in memory.
The mechanism operates non-intrusively on top of any existing page replace-
ment policy in the OS.
Furthermore, the design and implementation of PROTON, an on-the-fly
virtual memory simulator, was described. Simulation experiments using PRO-
TON showed improvement in the performance of the benchmark applications that
used CASP via the reification calls. It was also shown that the PROTON sim-
ulator performed better than conventional on-the-fly simulators. In the next
chapter, the implementation of CASP in a commodity OS – Linux (2.6.16
kernel) – is described along with its experimental evaluation.
Chapter 5
Implementation of CASP in a Commodity OS (Linux)
In chapter 3, the generic reflective RTOS framework was presented and eval-
uated with a prototype µ-kernel implementation – DAMROS. Chapter 4 pre-
sented a case-study of using reification calls in conjunction with the virtual
memory resource evaluated using a virtual memory simulator – PROTON [93].
An OS mechanism – CASP [97] was proposed in order to provide application-
specific memory management support. The CASP mechanism allows applica-
tions to lock and release memory pages dynamically at runtime via the use of
reification calls. This process helps reduce the number of page faults mainly
caused by incorrect page eviction by the underlying page replacement pol-
icy. It is essential to evaluate the CASP mechanism in a real-world commodity
OS. This chapter presents the implementation and evaluation of the framework
and CASP in Linux (2.6.16 kernel).
Linux is an open source operating system widely used in embedded systems
such as mobile phones, PDAs, media players, etc. The widespread use and the
availability of the kernel source code played a key role in choosing Linux for
implementation of the reflection framework and CASP. The framework and
221
222
CASP has been implemented in two flavours of Linux: one using an LRU-
based paging policy and the other using CART [17] paging policy.
The chapter is organised as follows. The next section provides an overview
of the Linux 2.6.16 kernel, which implements an LRU-based paging policy. The
section also describes the implementation of the CART [17] paging policy in Linux.
Section 5.2 describes the implementation of the reflection framework and
CASP in both flavours of Linux. Finally, section 5.3 presents the evaluation
of CASP using standard benchmark applications in both single and multiple
application scenarios.
5.1 Overview of Linux 2.6.16 Kernel
This section provides a brief overview of the memory management subsystem
in the vanilla Linux 2.6.16 kernel. Memory in Linux is divided into three
different zones: ZONE_DMA – the low memory zone (the first 16 MB, addressable
by legacy ISA DMA devices), mainly used for I/O or DMA (Direct Memory Access) operations;
ZONE_NORMAL – the normal zone above ZONE_DMA, used by the kernel and
applications; and ZONE_HIGHMEM – the high-end memory that is not
permanently mapped into the kernel address space [19, 54]. The memory pages belonging to each zone are stored in
two zone-wise lists – the active list and the inactive list. The active list consists of
the most recently accessed pages and all newly allocated pages.
Unlike in theory [62], Linux does not reclaim pages upon a page-fault. A
special kernel daemon thread, kswapd(), reclaims pages when invoked, depend-
ing on set watermarks. The kswapd() thread tries to maintain a fixed number
of free pages in a zone, determined by the value of the zone
watermark. This thread moves pages in the active list that have not
been recently accessed into the inactive list. While in the inactive list, pages
that are accessed are again marked as accessed by the kernel so that kswapd() moves
them back into the active list. When the ratio of the number of pages in the inac-
tive list to the active list reaches a certain watermark, the kswapd() thread
starts reclaiming unreferenced pages from the inactive list.
The vanilla Linux 2.6.16 kernel implements an LRU-based page replace-
ment policy which can be closely compared with LRU-2Q [54]. In practice it
has been shown that the performance of this replacement policy is close to
LRU [17]. For simplicity, in all further discussions the vanilla Linux kernel
implementing this page replacement policy will be referred to as Linux-LRU.
Note that Linux-LRU makes page replacement decisions solely on the basis of
recency, without using any frequency information; i.e. under heavy system load,
it is possible for the page replacement policy to replace the most frequently
used page. For more information about the Linux kernel and its memory
management subsystem, please refer to [54].
The CASP mechanism has been designed to operate in conjunction with
any existing page replacement policy. In order to test CASP in Linux with
another page replacement policy, a patch consisting of a CART-based [17]
policy implementation in Linux was obtained from Peter Zijlstra¹ [1]. The
next subsection briefly describes the implementation of the CART policy in Linux.
5.1.1 CART Implementation in Linux
The term Linux-CART will be used to refer to the implementation of CART
in Linux. The Linux-CART implementation uses four different page lists: T1,
T2, B1 and B2 for each memory zone.
The pages in T1 are considered to have a short-term utility while the pages
¹ Downloaded from URL: http://programming.kicks-ass.net/kernel-patches/cart/
in T2 have a long-term utility. The CART page replacement policy reclaims
recently unreferenced pages from T2 first and then reclaims similar pages from
T1. This ensures that frequently accessed pages are not reclaimed by CART.
However, this can affect applications with large sequential loop-based access
patterns, whose pages can generally be said to have long-term utility.
The other two page lists: B1 and B2 maintain page-history information
of those pages that were recently reclaimed. A more detailed explanation can
be found in [17]. The paging model of Linux 2.6.16 kernel has been explained
in section 4.1 in chapter 4.
5.2 Implementation in Linux
This section describes the implementation of the reflection framework and
CASP [97] in Linux 2.6.16 kernel. The CASP implementation depends on the
generic reflective framework. The next subsection describes the implementa-
tion of the reflection framework.
5.2.1 Reflection Framework
The core elements of the reflective framework, as described in section 3.2.1,
have been implemented in the Linux kernel. Most of the reflection code has
been ported from DAMROS. The interface to these elements remain the same
as in DAMROS. The Linux kernel has been modified to implement the follow-
ing functions:
• reify()
• requestInfo()
• interceptCall()
• uninterceptCall()
• linkData()
Linux is a multi-address-space OS. Thus, unlike in DAMROS, the above functions
cannot be called directly by the applications. For this purpose, a system call
has been implemented which passes information between the application space
and the kernel space. This is discussed further in the next subsection.
The implementations of interceptCall() and linkData() support the inter-
ception of a function and the formation of a causal link to data belonging to
a common address space; i.e. an application is able to intercept functions
implemented in its own address space. However, since Linux is a monolithic
OS, kernel code, including all system modules, resides in a single kernel address
space. Thus, a meta-level of a system module can intercept any function or
causally link to any data in the kernel address space.
The implementation of the interception mechanism makes similar changes
to the underlying machine code as explained in section 3.3.3. Linux uses cer-
tain hardware features to set read-only, read/write or execution permissions on
memory pages. For instance, in Linux, a memory page containing the applica-
tion code has read-only and execution permissions set such that no process in
the system can change its contents. The implementation of interceptCall() in
Linux temporarily changes set read/write permissions to such memory pages,
then changes the machine code and resets the original permissions.
Similar to the implementation in DAMROS, information reified in Linux
is stored in the kernel and passed on to the requesting meta-level modules.
Also, the implementation of the framework is specific to the Intel x86 archi-
tecture [61].
5.2.2 CASP Mechanism
CASP has been implemented in two different flavours of Linux: one using the
original LRU-based policy and the other using the CART page replacement
policy. CASP consists of two components: CASPapp operates in the appli-
cation space and CASPos operates in the kernel space. The reification calls
keep() and discard(), defined in chapter 4, for virtual memory have been im-
plemented as an application library – CASPapp. A new system call has been
implemented in Linux to facilitate the communication between CASPapp and
CASPos components.
In both flavours of Linux, handle_mm_fault() is the main page-fault han-
dler routine. The functions pagevec_add() and pgrep_add() are used for
adding a page to the page list in Linux-LRU and Linux-CART respectively.
During the process of OS initialisation, the interception code of CASPos
scans the handle_mm_fault() routine, recording the locations of calls to the
pagevec_add() routine. Later, while pre-paging, CASP requests the interception
of all the recorded calls. This action makes the interception mechanism replace
the underlying machine code that calls the routine pagevec_add() with
code that calls the routine page_isolate() instead. Typically, there is only one
call to pagevec_add() in the handle_mm_fault() routine. Since the location
of this call is marked at the beginning of OS initialisation, the actual cost of
interception during pre-paging is only a few microseconds. This is negligible
compared to the cost of the paging subsystem in general.
After CASP finishes pre-paging, it un-intercepts the calls back to the
pagevec_add() routine (i.e. resets the machine code to its original state). The
implementation of the framework is common to both flavours of Linux.
Finally, the page-isolation mechanism depends on the page replacement
policy implemented in the Linux kernel. This is because each replacement
policy maintains different page lists from which the locked pages need to be iso-
lated. The following subsections describe the implementation of the page-isolation
routine in each Linux flavour.
5.2.3 Page-isolation in Linux-LRU
Similar to algorithm 2 in chapter 4, the page-isolation routine in Linux-
LRU removes a page from either the active list or the inactive list, depending on where
it resides during isolation, and then adds this page to the corresponding appli-
cation's isolated page list. When isolated pages are discarded, CASPos inserts
these pages into the inactive list, making them the most likely candidates for
reclamation.
5.2.4 Page-isolation in Linux-CART
The Linux-CART implementation maintains four page lists – T1, T2, B1 and B2.
Since the pages in B1 and B2 do not reside in physical memory, only T1
and T2 are of interest here. Thus, the page-isolation routine in Linux-
CART removes a page from either T1 or T2, depending on where it resides
during isolation, and then adds this page to the corresponding application's
isolated page list. When isolated pages are discarded, CASPos inserts these
pages into T2, making them the most likely candidates for reclamation.
The implementation of CASP makes use of the information generated by
the reification calls inserted in the applications and works non-intrusively, with-
out affecting the normal operation of the existing page replacement code.
The efficient implementation of the interception mechanism ensures code re-
usability without incurring high penalties in either space or time.
The next section describes the experimental evaluation of CASP in Linux.
5.3 Experimental Evaluation
5.3.1 Hardware Platform
All the experiments were performed on an embedded Cyrix MediaGX 233 MHz
processor-based system with 64 MB SDRAM memory and 128 MB Linux swap
partition on a 7200 RPM IDE disk drive. For each benchmark application,
three versions were produced: (1) manually inserted reification calls, (2) au-
tomatically inserted reification calls and (3) manually inserted Linux mlock()
primitives. Version (3) is the same as (1) except that CASP's keep() and discard()
are replaced by Linux's mlock() and munlock() primitives. The workload (sin-
gle or multiple benchmark applications) was executed on a freshly booted test
platform running the corresponding Linux flavour, maintaining the same en-
vironment to obtain accurate measurements.
5.3.2 Benchmark Applications
In order to test the performance of CASP, different kinds of benchmark appli-
cations were selected. The selection of benchmarks was based on the following
criteria:
• Memory usage: the application must be out-of-core – i.e. it must use
more memory than is physically available.
• Access pattern: the applications selected should have different types of
memory access patterns (e.g. sequential, random, etc.).
• Embedded: the application or part of the application code must be
applicable to embedded systems, including real-time systems.
• Linux: the application should compile and execute on the implemented
Linux flavours.
Several kinds of applications were surveyed and, finally, five applications
were chosen for this evaluation. The benchmark applications used were: three from
MiBench [56] (an embedded applications benchmark suite); one from Brown et
al. [26]; and the application ‘scan’ (described in chapter 4). The data-set
size of these applications was increased such that they required more physical
memory than was available. No other modification was made to the application
source. Table 5.1 summarises the application characteristics. Each benchmark
application has a different memory access pattern:
MAD – an MPEG decoder application which sequentially decodes data into
a fixed-size buffer. The buffer is used by several functions, each consuming
part of the data. It is not possible to analyse the data locality of MAD using
the automatic insertion method. Reification calls were inserted manually to lock and
release parts of the buffer as the data was consumed, and were placed around function
calls rather than loops.
FFT – uses 4 floating point number arrays to perform the fast Fourier
transform. The data access pattern in the main loop of FFT depends on the
value of the data itself, resulting in a random access pattern.
FFT-I – inverse FFT with similar code to FFT. It has mostly non-
sequential data access but includes small sections with sequential data access.
FFT-I is used to determine the effects of CASP on applications with small
sections of sequential access.
MATVEC – a matrix vector multiplication that uses three data arrays
of different sizes accessed within loops with known loop bounds. The known
loop bounds enable insertion of reification calls by both automatic and manual
methods. MATVEC was used to test CASP for locking multiple data sets.
SCAN – see section 4.3 in chapter 4. To evaluate the automatic
insertion method under dynamic conditions, SCAN was modified to use dynamic
values for the variables ‘loops’ and ‘size’ (see alg. 4.2). SCAN generates
a worst-case scenario for the page replacement policies, which helps determine
the performance of CASP in the worst case.
Tests executed the benchmark applications in both single-application and
multiple-application scenarios. The experiments show results for the manual
and automatic insertion methods only: the hybrid method is essentially a
combination of the two, so for a given application its results would be the
better of the manual and automatic results.
Name     Description                      Input Data Set                          Memory required
MAD      MPEG layer I, II & III decoder   128 kbps, 96.25 min. MP3 data           93 MB
FFT      Fast Fourier transform           four 32-bit float arrays of size 2^22   64 MB
FFT-I    Inverse Fast Fourier transform   four 32-bit float arrays of size 2^22   64 MB
MATVEC   Matrix vector mult. application  three 32-bit integer arrays             101 MB
SCAN     Example benchmark application    continuous byte array                   100 MB

Table 5.1: Benchmark Applications
5.3.3 Single Application Scenario
This subsection presents the experimental results of executing each bench-
mark application on both flavours of Linux in a single application scenario.
Tables 5.2, 5.3, 5.4 and 5.5 list the execution time (in seconds), the number
of minor and major page faults, and the resident memory set size (RSS, in
pages) for each benchmark application. The following subsections describe the
paging performance, in terms of the number of major/minor page-faults, and
the memory usage of each application in turn.
                Original (O)                       using mlock (L)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,899  27,944   1,685    12,429    1,795  22,506   1,071    12,116
FFT            343  21,828     833     6,849    1,323  22,060     915    11,644
FFT-I          403  22,538     940     6,582    1,347  22,652   1,193    11,019
MATVEC       2,256  143,056  101,174  13,057    2,707  178,820  126,467  14,235
SCAN           638  78,721   31,315   12,778      860  100,939  32,847   13,431

Table 5.2: Single Application Performance in Linux-LRU (1)

                using CASP manual (M)              using CASP automatic (A)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,740  22,087     793     8,124    1,786  22,511   1,682    11,204
FFT            342  21,099     776     6,758      349  22,234     970     6,743
FFT-I          342  21,126     773     6,395      352  23,431   1,178     6,939
MATVEC       1,860  175,327  81,311   13,063    2,139  158,330  95,896   13,132
SCAN           544  94,281   21,975   12,876      417  106,319  15,652   13,084

Table 5.3: Single Application Performance in Linux-LRU (2)
MAD
In MAD, data is consumed across several different functions. Hence, it is
not easy to insert reification calls using the automatic method. The cloop
tool inserts the calls around the memory buffer used to store MPEG data.
Figures 5.1(a) and 5.1(b) plot the occurrence of minor and major page faults
for three versions of MAD executed on Linux-LRU: the original (O), a version
using mlock() (L) and a version using CASP with manual reification calls (M).
                Original (O)                       using mlock (L)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,872  33,605   2,351    10,386    1,794  23,669   1,013    11,010
FFT            352  22,542     865     6,614      511  24,068   10,845    8,678
FFT-I          413  24,724     952     6,619      576  22,792   12,567    9,242
MATVEC       1,419  165,618  75,686   12,834    1,281  152,678  68,977   12,959
SCAN           340  103,731  17,231   12,644      426  111,699  22,452   12,959

Table 5.4: Single Application Performance in Linux-CART (1)

                using CASP manual (M)              using CASP automatic (A)
Application  Time   Minor    Major    RSS       Time   Minor    Major    RSS
MAD          1,787  21,342     788     5,651    1,800  28,205   1,879    12,311
FFT            336  18,857     540     6,596      347  20,578     700     6,774
FFT-I          338  21,067     436     6,431      351  22,266     579     6,651
MATVEC       1,279  111,586  61,826   12,628    1,330  107,649  68,471   12,684
SCAN           368  107,527  14,278   12,939      373  93,568   21,814   12,523

Table 5.5: Single Application Performance in Linux-CART (2)
The x-axis shows the time elapsed during the execution of the application
and the y-axis the number of page-faults at the corresponding time. The
lines in the graphs terminate when the application finishes execution and
exits the system; a shorter line thus indicates a shorter execution time.
Note that the graphs provide data for CASP with manually inserted reification
calls only; the automatically inserted calls produced similar results. MAD
using CASP (M) generated fewer minor and major page faults than (O) and (L).
The steps in the graph correspond to the lock and release of partial regions
of the MPEG data buffer; the CASP curve is uniform and quite predictable in
nature.
Similar results can be seen in figures 5.1(c) and 5.1(d), which plot the
minor and major page faults for Linux-CART. On average, across both flavours
of Linux, MAD-CASP generated 29% and 6% fewer major page faults than MAD and
MAD-MLOCK respectively.

[Figure 5.1: MAD: Results on Linux-LRU and Linux-CART. Panels: (a) minor
page-faults (Linux-LRU); (b) major page-faults (Linux-LRU); (c) minor
page-faults (Linux-CART); (d) major page-faults (Linux-CART); (e) resident
memory set size (Linux-LRU); (f) resident memory set size (Linux-CART).
Curves: original (O), mlock (L), CASP manual (M); the RSS panels also show
the VM-size and per-curve averages.]
The graphs in figures 5.1(e) and 5.1(f) plot the RSS of MAD at any given
time during its execution on both Linux-LRU and Linux-CART respectively.
MAD-CASP (M) uses fewer resident memory pages than the other variants
because the reification calls inserted into MAD help CASP to lock and
release only the required memory pages at runtime; the working set of
MAD-CASP is thus reduced to only the CASP-locked pages in memory. The
consistent steps in the graphs correspond to CASP's lock and release
operations. On average, MAD-CASP uses nearly 4,000 fewer memory pages than
the other variants.
FFT
Application FFT has a mainly random memory access pattern. Manual
reification calls were added in and around the main application loop to
partially lock and release the data arrays at runtime. Since the access
pattern of FFT depends on the value of the data obtained within the loop,
the reification calls only attempt to lock 25% of each data array, starting
from the data value. The graphs in figures 5.2(a) and 5.2(b) plot the
occurrence of minor and major page faults for FFT in Linux-LRU. The
performance of FFT-MLOCK is the worst: it generates more page faults than
the original. CASP, however, performed only slightly better than the
original, mainly because of the unpredictable nature of FFT's memory access
pattern.
Similar graphs have been plotted for Linux-CART in figures 5.2(c)
and 5.2(d). The CASP mechanism performed much better for FFT in Linux-CART
than in Linux-LRU. This is because the CART paging policy maintains
frequency information about memory page accesses in addition to recency
information, so CASP benefits from the underlying page replacement policy,
which helps improve the performance of an application with a random access
pattern. The performance of FFT-MLOCK in Linux-CART was too poor to plot on
the same scale as the other curves and has therefore been omitted.

[Figure 5.2: FFT: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU; curves O, L, M); (c) minor and
(d) major page-faults (Linux-CART; curves O, M); (e) resident memory set
size (Linux-LRU); (f) resident memory size (Linux-CART). The RSS panels
also show the VM-size and per-curve averages.]
The resident memory size of FFT-CASP is almost identical to that of the
original FFT in both Linux-LRU and Linux-CART: due to FFT's random access
pattern, CASP was often observed to release and then re-lock the same pages
in memory (see the graphs in figures 5.2(e) and 5.2(f)).
FFT-I
Application FFT-I is very similar to FFT, differing in the final section of
the code, which calculates the inverse operation and has a sequential memory
access pattern. Apart from reification calls similar to those in FFT, calls
were also added to take advantage of this sequential access in the final
section.
The graphs in figures 5.3(a) and 5.3(b) plot the minor and major page faults
of FFT-I in Linux-LRU. In comparison with the results for FFT, it is evident
that CASP produced better results for FFT-I, because CASP exploits FFT-I's
sequential access pattern in the final section of its code.
[Figure 5.3: FFT-I: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU; curves O, L, M); (c) minor and
(d) major page-faults (Linux-CART; curves O, M); (e), (f) resident memory
set size. The RSS panels also show the VM-size and per-curve averages.]
Similar results were obtained for Linux-CART, as shown in figures 5.3(c)
and 5.3(d): using CASP, FFT-I performed much better in Linux-CART than FFT
did. From these two applications, FFT and FFT-I, it is evident that
reification calls that accurately describe the memory access pattern yield
much better results with CASP. Although CASP improved application
performance for both random and sequential access patterns, the real benefit
of the mechanism is gained by making the reification calls as accurate as
possible.
Again, the resident memory size of FFT-I using CASP was reduced in both
Linux-LRU and Linux-CART (see the graphs in figures 5.3(e) and 5.3(f)).
MATVEC
Application MATVEC uses three different data sets to perform complex
multiplication operations involving matrices, making it CPU intensive.
Owing to its known loop bounds, it was easy to insert reification calls
using both the manual and automatic methods. Figures 5.4(a) and 5.4(b) show
the occurrence of minor and major page faults for MATVEC in Linux-LRU.
MATVEC-CASP generated more minor page faults than the original MATVEC;
however, it generated fewer major page faults.
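The loop-bound-driven placement of reification calls described above can be sketched as follows. Here `casp_lock` and `casp_release` are hypothetical stand-ins for the CASP reification calls of chapter 4 (the real names and signatures are not shown in this chapter); the stubs merely record the requested ranges.

```python
# Sketch of reification-call placement for a loop with known bounds.
# casp_lock/casp_release are HYPOTHETICAL stand-ins for the CASP
# reification calls of chapter 4; these stubs only record the ranges.

locked = []  # (start_row, n_rows) ranges currently "locked"

def casp_lock(start, length):
    """Hint that the pages backing rows [start, start+length) be kept resident."""
    locked.append((start, length))

def casp_release(start, length):
    """Hint that those pages may now be reclaimed."""
    locked.remove((start, length))

def matvec(matrix, vec, block=64):
    """Matrix-vector product, locking one block of rows at a time."""
    n = len(matrix)
    result = [0] * n
    for start in range(0, n, block):       # known loop bounds allow the
        end = min(start + block, n)        # lock to be issued ahead of use
        casp_lock(start, end - start)
        for i in range(start, end):
            result[i] = sum(a * b for a, b in zip(matrix[i], vec))
        casp_release(start, end - start)   # release once the block is consumed
    return result
```

With accurate bounds the lock always covers exactly the data the next iterations will touch, which is what lets CASP shrink the working set without incurring extra faults.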
In Linux, reclaimed pages are stored in a region called the swap-cache
which is present in physical memory. These pages remain in the swap-cache
until they are actually moved to the swap space by a kernel thread [54]. Due
to the intensive memory and CPU operations in MATVEC, memory pages are
more frequently referenced. Instead of immediately swapping out the released
pages to swap space, CASP stores them in the swap-cache. Such pages that
exist in the swap-cache cause only a minor page fault when re-referenced
shortly afterwards; hence the increase in the number of minor page faults
for MATVEC-CASP.

[Figure 5.4: MATVEC: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU; curves O, M); (c) minor and (d) major
page-faults (Linux-CART; curves O, L, M); (e), (f) resident memory set size.
The RSS panels also show the VM-size and per-curve averages.]
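The minor/major fault counts reported throughout these tables can be sampled per process via getrusage(2); a minimal Linux sketch (the 4 MB buffer and its page-stride walk are illustrative choices, not taken from the thesis):

```python
# Read a process's minor/major page-fault counters via getrusage(2),
# exposed on Linux through the standard `resource` module.
import resource

def fault_counts():
    """Return (minor, major) cumulative page-fault counts for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt

minor0, major0 = fault_counts()
buf = bytearray(4 * 1024 * 1024)        # fresh anonymous memory ...
for off in range(0, len(buf), 4096):    # ... touched one page at a time
    buf[off] = 1
minor1, major1 = fault_counts()
# Touching fresh zero-filled pages raises only minor faults; major faults
# occur when a page must be fetched from swap or a file on disk.
```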
Similar results can be seen for MATVEC in Linux-CART (see figures 5.4(c)
and 5.4(d)), although there the execution time of MATVEC-CASP was slightly
longer and it generated more page faults than MATVEC-MLOCK.
Due to the intensive memory and CPU operations in MATVEC, the RSS of
MATVEC-CASP in both Linux-LRU and Linux-CART remains close to that of the
original application. The large variation in the curves shows the intensity
of the memory operations in MATVEC (see figures 5.4(e) and 5.4(f)).
SCAN
Application SCAN generates a worst-case scenario for most traditional page
replacement policies by stressing the virtual memory subsystem to its limits.
As with MATVEC-CASP, in Linux-LRU SCAN-CASP generated more minor page faults
and fewer major page faults than the original. The substantial reduction in
the number of major page faults resulted in better performance for SCAN-CASP
(see figures 5.5(a) and 5.5(b)).
Similar results were seen in Linux-CART. However, SCAN-CASP had a slightly
longer execution time than the original: the intensive memory operations of
SCAN continuously generated a large number of page faults, stressing the
lock and release operations of CASP. Thus, although SCAN-CASP generated
fewer major page faults, it incurred a minor overhead which increased its
execution time (see figures 5.5(c) and 5.5(d)).
The graphs in figures 5.5(e) and 5.5(f) show the intensity at which SCAN
accesses memory pages in both Linux-LRU and Linux-CART respectively.
[Figure 5.5: SCAN: Results on Linux-LRU and Linux-CART. Panels: (a) minor
and (b) major page-faults (Linux-LRU); (c) minor and (d) major page-faults
(Linux-CART); (e), (f) resident memory set size (VM-size 25,933 pages).
Curves: O, L, M, with per-curve averages in the RSS panels.]
CASP strives to bring the resident memory size down to an acceptable level,
incurring a minor overhead in the process.
Overall Performance
The number of page-faults (both minor and major) and the execution times
of the benchmark applications were measured in a single application scenario.
Since the cost to handle a major page-fault is greater compared to that of a
minor page-fault, reducing the number of major page-faults will reduce the
paging overhead and improve the execution time of the application. Irrespec-
tive of the execution times, reduction in paging overhead is beneficial to the
overall system. This is particularly true in the case of an embedded system
with limited memory resource.
The graphs in figures 5.6(a), 5.7(a) and figures 5.6(b), 5.7(b) show the num-
ber of page-faults and the execution times of the corresponding benchmark ap-
plications executed individually in both Linux-LRU and Linux-CART. Shown
for each application are four bars: the original application (O), the application
using Linux’s mlock() primitives (L), the application using CASP with manual
insertion (M) and the application using CASP with automatic insertion (A).
Each bar is divided into two parts: in figures 5.6(a) and 5.7(a), the top
part shows the number of minor page-faults and the bottom part the number of
major page-faults; in figures 5.6(b) and 5.7(b), the top part shows the user
time and the bottom part the system time.
FFT has a nearly random data access pattern – difficult for the automatic
method to identify data locality. Manual reification calls were inserted at
small regions of sequential access. mlock() imposed a large overhead due to
the random access pattern – it thrashes. Thus, although (L) generates only a
few more page-faults, its execution time is much larger.

[Figure 5.6: Summary of Results for Linux-LRU. (a) Normalised total
page-faults (minor/major) and (b) normalised execution times (user/system
time), with four bars per application: O, L, M, A.]

[Figure 5.7: Summary of Results for Linux-CART. Same layout as figure 5.6.]

[Figure 5.8: Results for Multiple Applications (Linux-LRU). (a) Two-
application and (b) all-application workloads: normalised page-faults and
execution times.]
FFT-I, similar to FFT but with sequential data access in parts, showed
better results using CASP with manually inserted reification calls. Although
minor page-faults increased for the automatic method (A), the number of
major page-faults was still tolerable. This shows that CASP can also be used
for applications with random data access patterns.
For MATVEC, the aggressive loop-based multiplication operations on several
data arrays led to a greater number of page-faults. Known loop bounds made
it easy to insert reification calls by both methods. The results for (M) and
(A) indicate better performance than (L) and (O) in both major page-faults
and execution times: the manual insertion method reduced MATVEC's execution
time by nearly 18% and generated 20% fewer major page-faults.
Using CASP, SCAN generated fewer major page-faults (for both (M) and (A))
than (O) and (L). The automatic insertion method performed better than the
manual method because it added run-time conditional loop-bound checks for
the appropriate use of reification calls. Pages reclaimed from the inactive
list are stored in the swap-cache before being written out to swap space.
Since both SCAN and MATVEC aggressively walk through their data, stressing
the page replacement code, recently reclaimed pages are soon recalled into
the active list; if such pages still reside in the swap-cache, fetching them
causes only a minor page-fault. CASP releases isolated pages after discard()
is used, so such pages remain in the swap-cache when recalled. Hence the
observed increase in the number of minor page-faults for (M) and (A).
Across all individually executed benchmark applications in Linux-LRU, on
average, manually inserted reification calls generated 22% fewer major
page-faults, improving execution time by 13%; automatically inserted
reification calls generated 15.13% fewer major page-faults, improving
execution time by 9%. Clearly, the manual method outperformed the automatic
method, but the latter still yielded better results than the mlock()
primitives.
5.3.4 Multiple Applications Scenario
Two or more applications were executed simultaneously, with one of them
using CASP. Due to the similarity of the CASP implementation in both
flavours of Linux, the multiple-application experiments were executed on
Linux-LRU only. The workloads (see table 5.6) were: (1) TWO-SO – two
original SCAN processes; (2) TWO-SL – one original SCAN process and one SCAN
using mlock(); (3) TWO-SM – one original SCAN process and one SCAN with
manual reification calls; (4) TWO-SA – one original SCAN process and one
SCAN with automatic reification calls; (5) ALL-O – the original versions of
all benchmark applications; (6) ALL-1M – all original applications, with
SCAN using manual reification calls.
Workload   Time    Minor    Major    RSS
TWO-SO     2,240   223,925  33,953   (6,149 + 6,310)
TWO-SL     1,839   219,703  35,174   (7,341 + 5,532)
TWO-SM     1,612   218,514  28,728   (8,600 + 4,885)
TWO-SA     1,778   233,767  30,394   (7,489 + 5,292)
ALL-O      15,341  606,334  133,020  22,514
ALL-1M     13,454  621,451  109,003  20,019

Table 5.6: Results for Multiple Applications
The RSS for workloads (1) to (4) is divided into two parts, the first
representing the RSS of the original process. Figures 5.8(a) and 5.8(b) show
the major page-faults and execution times for the two-application and
all-application workloads. The two-application benchmark shows four bars: O,
L, M and A, representing two original processes (O), one mlock() process
(L), one CASP process with manual reification calls (M) and one CASP process
with automatic reification calls (A) respectively. Page-fault results are
shown as striped bars and execution times as dark bars. The all-application
benchmark shows two bars: O and M, representing all original processes (O)
and one SCAN process using CASP with manual reification calls (M)
respectively.
Note that since the results are normalised for both page-faults and
execution times, each figure plots both quantities against the same y-axis;
the bars for page-faults and execution times therefore cannot be compared
with each other.
The experiments enabled us to determine the effect of a single CASP process
on the entire system workload. Using mlock() in more than one application
resulted in out-of-memory errors, because mlock() was not designed for
dynamic locking. The experiments were thus limited to the use of locking
(either mlock() or CASP) in only one application per workload.
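For reference, mlock(2) pins a fixed range until it is explicitly unlocked, and the per-process RLIMIT_MEMLOCK limit applies to the whole locked range at once rather than to a shifting working set. A small Linux-only sketch via ctypes (error handling matters, since locking may be refused under a tight limit):

```python
# Pin and unpin one anonymous page with mlock(2)/munlock(2) via ctypes.
# Unlike CASP's runtime lock/release of partial regions, mlock holds the
# whole range until munlock -- the static behaviour discussed above.
import ctypes, ctypes.util, mmap, os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
PAGE = mmap.PAGESIZE

def lock_unlock_one_page():
    """Try to mlock one page; return True on success, False if refused."""
    buf = mmap.mmap(-1, PAGE)  # one anonymous page
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE)) != 0:
        # Typically EPERM/ENOMEM when RLIMIT_MEMLOCK is exhausted.
        print("mlock refused:", os.strerror(ctypes.get_errno()))
        return False
    libc.munlock(ctypes.c_void_p(addr), ctypes.c_size_t(PAGE))
    return True
```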
The results show that CASP with manually inserted calls generated 15% fewer
major page-faults and reduced execution times by 28% in the two-application
scenario, and generated 18% fewer major page-faults and reduced execution
times by 12% in the all-application scenario. When the system ran out of
memory, CASP automatically released isolated pages back into the OS page
list. As explained in chapter 4, multiple applications using CASP are
therefore unlikely to degrade the performance of other application processes
by locking memory pages.
5.3.5 Memory Usage
By isolating pages from the global page lists, CASP adjusts to the current
process’s working-set and reduces its resident memory set size (RSS). Ta-
bles 5.2 and 5.3 list the average RSS values for individual benchmark execu-
tions in Linux-LRU. For MAD with manually inserted reification calls, CASP
reduces its RSS by 35%. Under stressed conditions, applications using CASP
have been shown to use slightly more RSS than (O) (see the results for
MATVEC and SCAN); however, in comparison with (L), the average RSS of (M) is
almost 24% lower. Note that a lower RSS for applications using CASP causes
less interruption to other processes in the system; in fact, it frees memory
for them to use. For instance, the two-application results in table 5.6 show
that the application using CASP uses fewer resident memory pages while the
original application uses more. This indirectly resulted in fewer major
page-faults for the original application as well, and also reduced its
execution time.
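The per-process RSS figures used throughout this chapter can be sampled on Linux from /proc/&lt;pid&gt;/statm, which reports sizes in pages; a minimal sketch:

```python
# Sample a process's resident set size (in pages) from /proc/<pid>/statm.
def rss_pages(pid="self"):
    with open(f"/proc/{pid}/statm") as f:
        fields = f.read().split()
    # statm fields (all in pages): size resident shared text lib data dt
    return int(fields[1])
```

Multiplying by the page size (4 KB here) converts this to bytes; polling it periodically during a run reproduces RSS curves of the kind shown in figures 5.1 to 5.5.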
5.3.6 Space Overhead
Insertion of reification calls increases the application code size. Table 5.7 lists
the compiled image size of each benchmark application. Manually inserted
reification calls added an overhead ranging from 0.4% to 21% over the original
application code, whilst automatic insertion added 0.8% to 23%. The overhead
is nearly equivalent to one extra memory page (assuming a 4KB page size),
which is negligible compared to the reduction in RSS and page-faults.
Benchmark  Original (O)  mlock (L)  Manual (M)  Automatic (A)
MAD        414,477       414,617    416,117     419,415
FFT        511,261       520,778    513,544     515,215
FFT-I      511,325       521,390    513,864     515,664
MATVEC     9,038         10,592     10,338      10,802
SCAN       8,912         9,729      10,751      10,997

Table 5.7: Benchmark Code Size (bytes)
Further reduction in code size could be achieved by optimising the CASPapp
library. The CASPos component mostly re-uses the existing memory management
code of the Linux kernel; additional code was added to implement the
framework, a new system call and the page-isolation routine. The final
kernel image (‘bzImage’) for both Linux-LRU and Linux-CART grew by only
0.6%. The CASP implementation in Linux has not been fully optimised and
still includes the reverse-mapping code [54] required for mlock(); removing
such unwanted code could further reduce the kernel size.
Linux version  Original   with CASP
Linux-LRU      2,046,074  2,059,147
Linux-CART     2,047,190  2,060,479
Table 5.8: Linux Kernel Image Sizes (in bytes)
Summarising, CASP reduced major page-faults (by 22% for single applications
and 18% for all applications together) as well as the RSS. The manual
insertion method was more accurate and performed better than the automatic
method. When paging is used to support out-of-core embedded applications,
CASP helps reduce the inherent paging overheads and improves application
execution times (by 13% for single applications and 12% for all applications
together). Furthermore, the CASP mechanism was shown to perform better than
the existing mlock() primitives found in Linux.
5.4 Summary
This chapter presented the implementation and evaluation of the reflection
framework and CASP mechanism in the Linux 2.6.16 kernel. CASP allowed
adaptation of Linux’s virtual memory management subsystem according to
application-specific memory requirements. CASP operates non-intrusively on
top of existing page replacement policies and uses reification calls inserted
into the application source to efficiently lock/release memory pages at
runtime. CASP
has been compared against the existing Linux system call mlock(). Evaluation
showed that applications using CASP generated fewer page-faults, required
fewer resident memory pages, and improved their overall execution times.
Furthermore, applications using manually inserted reification calls performed
better than those with automatic insertion.
Chapter 6
Conclusion
This chapter concludes the research work presented in this thesis. The
overall thesis contribution is presented in section 6.1, along with some
identified applications and limitations of the work in section 6.2. Future
research initiatives and directions are discussed in section 6.3. Finally,
section 6.4 presents the concluding remarks.
6.1 Thesis Contribution
Chapter 1 presented the central hypothesis of this thesis:
“Conventional CPU scheduling and memory management policies
in RTOS provide generic support that do not, in general, allow
application-specific resource control. This thesis contends that
application-specific control of processor scheduling and memory
management will provide better application support thereby im-
proving application performance. This thesis proposes a generic
reflective framework in the RTOS to efficiently capture application-
specific requirements and bring about fine-grained changes in the
resource management policies. The use of explicit reification in
application source code to specify the resource requirements will
provide better application support and improve performance”
The main objective of this research work was to prove the above hypothesis
by: showing that the generic resource management policies in the existing OSs
do not provide application-specific support; proposing and implementing a re-
flective OS framework that captures application-specific CPU and memory
requirements and accordingly adapts its policies; and proving that the pro-
posed framework helps to provide application-specific resource management
support to the applications and improves their performance.
In this respect, chapters 1 and 2 provided numerous examples of increasing
application resource requirements that are supported only by average-case
resource management policies in the OS. It was made clear that applications
need more control over OS resource management and the ability to adapt OS
policies to application-specific requirements. Moreover, the experiments
carried out in chapters 3, 4 and 5 showed that, under normal conditions,
applications relying on the generic resource management policies of the OS
often showed only average performance.
Chapter 2 reviewed the existing reflection mechanisms in programming languages,
middlewares and OSs. It emphasised the use of reflection mechanisms
to bring about runtime changes in the behaviour of a system. Chapter 3 put
forth modifications to the reflection mechanism, particularly to the process of
reification, and proposed a generic reflective OS framework. Later in that chapter,
the implementation of the proposed reflective framework in DAMROS, a
prototype RTOS, was described. DAMROS was implemented as a single address
space OS. It implemented a reflective CPU scheduler (the VRHS model)
and a reflective virtual memory manager (RMMS), both of which used the framework
to adapt or change their policies at runtime. Several experiments were carried
out to show the ability of the framework to bring about runtime changes in
these resource management policies. The experiments proved that, by providing
application-specific support, it is possible to improve application
performance.
Chapter 4 presented methods that could be used to provide support for explicit
reification in the framework. The chapter used virtual memory paging as
a case study and described three methods of inserting reification calls into
application source code: manual, automatic and hybrid. It presented CASP,
an OS mechanism that uses the reification calls to adapt the paging policy of
an OS. Later in that chapter, the implementation of PROTON, an on-the-fly
virtual memory simulator, was described. Simulation experiments involving
standard benchmark applications together with the reification calls and CASP
showed a significant improvement in paging performance, i.e. a reduction in the
total number of page faults generated.
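To make the manual method concrete, the sketch below shows where such a reification call might be placed in application code. The call name casp_reify, its arguments and its stub body are hypothetical, invented here for illustration; the actual CASP interface would trap into the OS rather than record the hint locally.

```c
#include <stddef.h>

/* Hypothetical reification call: the name casp_reify, its arguments and
 * the stub body are illustrative only, not the actual CASP interface.
 * The idea is that the application tells the OS how it is about to
 * access a memory region, so the paging policy can adapt accordingly. */
enum access_hint { HINT_SEQUENTIAL, HINT_RANDOM };

static enum access_hint last_hint = HINT_RANDOM;

static int casp_reify(void *base, size_t len, enum access_hint hint)
{
    (void)base; (void)len;
    last_hint = hint;   /* a real implementation would trap into the OS */
    return 0;
}

#define N (1 << 20)
static double data[N];  /* large array standing in for out-of-core data */

double sum_sequential(void)
{
    /* Manual method: the developer inserts the call immediately before
     * the sequential scan, so the pager can prefetch ahead of it. */
    casp_reify(data, sizeof data, HINT_SEQUENTIAL);

    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        s += data[i];
    return s;
}
```

In the automatic method the same call would be inserted by tooling rather than by hand, and the hybrid method combines the two.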
To verify the scalability of the framework and the CASP mechanism, chapter 5
presented an implementation in Linux. Two different flavours of the Linux
2.6.16 kernel, implementing the core elements of the reflective framework along
with the CASP mechanism operating on top of the LRU and CART [17] page
replacement policies, were described. The experiments in chapter 5 were
executed in a multiple address space OS (Linux) and involved benchmark
applications with reification calls inserted using the manual and automatic
methods. The results showed significant performance improvement for
applications using the framework, whereas under normal conditions applications
relying on the plain LRU and CART policies showed only average paging results.
In particular, this chapter proved that it is possible to improve application
performance and virtual memory management by using the framework to adapt
the underlying OS paging policy. The next section discusses the applications and
some identified limitations of this work.
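For reference, the LRU baseline that the experiments compare against can be sketched in a few lines. This is an illustrative in-memory model of the policy (page numbers and per-frame use stamps only), not the Linux 2.6.16 implementation.

```c
/* Minimal LRU page-replacement sketch: on a miss with no free frame,
 * the frame whose last use is oldest is evicted. Illustrative only. */
#define NFRAMES 3

static int frames[NFRAMES];
static unsigned long stamp[NFRAMES]; /* last-use time of each frame */
static unsigned long now;            /* logical clock */
static int used;                     /* frames currently occupied */

/* Returns 1 on a page fault, 0 on a hit. */
int lru_access(int page)
{
    now++;
    for (int i = 0; i < used; i++)
        if (frames[i] == page) { stamp[i] = now; return 0; }  /* hit */

    int victim = 0;
    if (used < NFRAMES) {
        victim = used++;             /* fill a free frame first */
    } else {
        for (int i = 1; i < NFRAMES; i++)
            if (stamp[i] < stamp[victim])
                victim = i;          /* evict least recently used */
    }
    frames[victim] = page;
    stamp[victim] = now;
    return 1;                        /* fault */
}
```

A mechanism like CASP operates above such a baseline, using application-supplied hints to override the pure recency ordering where appropriate.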
6.2 Applications and Limitations
Many different kinds of applications can make use of the reflective framework
in an RTOS. In industry, the development process of a product is distributed,
in that the applications are developed independently of an OS and vice
versa. This process may lead to integration problems during system deployment,
when the applications and the OS are integrated into the target system.
Using the work presented in this thesis, applications can adapt the OS policies
according to their specific requirements, thereby gaining better support.
Application developers need not worry about the target OS or the
resource management policies it implements. If the target OS implements
the reflective framework, developers can introduce application-specific UD
policies into the target OS, thereby eliminating integration issues related
to resource conflicts. This gives developers early confidence in the final
behaviour of the application.
The evidence presented in the thesis, particularly for virtual memory
management, showed better support for out-of-core applications with sequential
memory access patterns. The experiments presented in chapters 3, 4 and 5 also
used applications with non-sequential memory access patterns; although the
performance of such applications did improve, the improvement was not significant.
One limitation of the CASP mechanism described in chapter 4 is
that it does not consider shared memory pages. This could be addressed in
future work.
6.3 Future Directions
There are several directions in which this work can be carried forward. In
particular, it needs further evaluation with respect to resources
other than the CPU and memory. Efficient power management, for instance, could
be a particularly interesting application of this work. Another interesting
topic would be to study the effects of variations in paging activity on
process execution times.
This thesis presented the reflective framework as an open
framework not restricted by any particular API; an implementation may
choose a suitable API depending on its specific requirements. However,
a POSIX-style [98] API could help establish a common interface to the
framework, making applications portable across all of its implementations.
The DAMROS implementation already presents some
standard interfaces to the framework, but with only the CPU and memory
resources in mind. Future work could extend and standardise these interfaces to
accommodate other system resources.
The reflective framework supports sharing of meta-level components
amongst several different base-level components. Future work could explore
the possibility and impact of one or more shared meta-level components in the
reflective framework.
There are several reification calls an application could use to provide
information to the OS. This thesis described calls pertaining to the CPU and
memory resources. Future work could also identify key reification calls
associated with other resources in the system.
Many non-embedded systems such as desktop computers increasingly run
applications with real-time requirements. Also, some existing desktop
applications, such as graphical (picture or video) editors, use large amounts
of memory, often causing the system to thrash. This work could be further
explored in the context of applications for the non-embedded world.
The CASP mechanism could be extended to support applications using
shared memory pages. The reflective framework in the RTOS and the CASP
mechanism have been evaluated in the context of a single processor; the
model could be further extended to support multi-core CPUs and distributed or
SMP systems.
6.4 Concluding Remarks
This thesis emphasised the importance of resources in resource-constrained
embedded systems and highlighted the need for an OS to adapt its policies to
support application-specific resource requirements. The thesis focused mainly
on the CPU and virtual memory resources in the context of soft real-time
embedded systems.
Existing RTOSs provide average-case resource management support and do
not take into account dynamic changes in an application’s resource requirements.
As a first initiative, the proposed reflective framework, implemented in
an RTOS, allowed the CPU and virtual memory management policies to be
adapted or changed at runtime according to application requirements.
The applicability of the approach was tested by implementing
the framework in both a single and a multiple address space OS. In each
case, applications using the framework were able to adapt OS policies, which
significantly improved their performance.
Paging, with its associated page swap overheads, is generally regarded as
unsuitable for soft real-time embedded systems. This thesis showed that,
by using an application-specific paging mechanism, it is possible to reduce the
associated overheads, and thus made a successful attempt to show that
paging may be a viable approach.
As the complexity of systems increases, with more and more applications
being deployed onto single platforms, there is an ever-increasing need for an
RTOS to manage its resources efficiently. It is not possible for a single resource
management policy to satisfy the dynamic demands of all applications running
in a system. This thesis has taken one step towards providing application-specific
resource management support in an RTOS, particularly for the CPU and
memory resources. However, much remains to be accomplished.
Bibliography
[1] The Linux Kernel Mailing List (linux-kernel@vger.kernel.org). Online: http://www.lkml.org.
[2] Infineon Technologies. XC167CI 16-bit Single-Chip Microcontroller datasheets, October 2002.
[3] CompuLab. CM-X255 (ARMCORE-GX) Embedded Computer Module reference guide, April 2005.
[4] Sun Microsystems. Java™ 2 Platform Enterprise Edition, v1.4 API Specification, http://java.sun.com/j2ee/1.4/docs/api/, 2003.
[5] Microsoft Corporation. .NET Framework 3.5, 2007, http://msdn2.microsoft.com/en-us/library/w0x726c2(vs.90).aspx.
[6] ARM710T Datasheet, ARM DDI 0086B, ARM Ltd., UK, July 1998. Online: http://www.arm.com/documentation/ARMProcessorCores/.
[7] Virtualization Products, VMWare, Inc., Palo Alto, CA, USA. Online: http://www.vmware.com/.
[8] Samsung D840 Specifications, Samsung Electronics Co. Ltd., 2007, http://uk.samsungmobile.com/mobile/SGH-D840/spec.
[9] Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., and Young, M. Mach: A New Kernel Foundation for UNIX Development. Tech. rep., Computer Science Department, Carnegie Mellon University, August 1986.
[10] Agarwal, A., Sites, R. L., and Horowitz, M. ATUM: a new technique for capturing address traces using microcode. In ISCA ’86: Proceedings of the 13th Annual International Symposium on Computer Architecture (Los Alamitos, CA, USA, 1986), IEEE Computer Society Press, pp. 119–127.
[11] Cheng, A. M. K. Real-Time Systems: Scheduling, Analysis and Verification. Wiley-Interscience, August 2002.
[12] Aldea, M., Bernat, G., Broster, I., Burns, A., Dobrin, R., Drake, J. M., Fohler, G., Gai, P., Harbour, M. G., Guidi, G., Gutiérrez, J., Lennvall, T., Lipari, G., Martínez, J., Medina, J., Palencia, J., and Trimarchi, M. FSF: A Real-Time Scheduling Architecture Framework. In Proceedings of the 12th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2006 (San Jose, CA, USA, April 2006).
[13] Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Burnett, N. C., Denehy, T. E., Engle, T. J., Gunawi, H. S., Nugent, J. A., and Popovici, F. I. Transforming Policies into Mechanisms with Infokernel. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), ACM Press, pp. 90–105.
[14] Audsley, N., Gao, R., Patil, A., and Usher, P. Efficient OS Resource Management for Distributed Embedded Real-Time Systems. In Proceedings of the Workshop on Operating Systems Platforms for Embedded Real-Time Applications (Dresden, Germany, July 2006).
[15] Austin, T., Blaauw, D., Mahlke, S., Mudge, T., Chakrabarti, C., and Wolf, W. Mobile Supercomputers. IEEE Computer 35, 5 (May 2004), 81–83.
[16] Bacon, D. F., Graham, S. L., and Sharp, O. J. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4 (1994), 345–420.
[17] Bansal, S., and Modha, D. S. CAR: Clock with Adaptive Replacement. In FAST ’04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2004), USENIX Association, pp. 187–200.
[18] Barve, R. D., Grove, E. F., and Vitter, J. S. Application-Controlled Paging for a Shared Cache. SIAM Journal on Computing 29, 4 (2000), 1290–1303.
[19] Beck, M., Böhme, H., Dziadzka, M., Kunitz, U., Magnus, R., and Verworner, D. Linux Kernel Internals, second ed. Addison–Wesley, 1998.
[20] Belady, L. A. A Study of Replacement Algorithms for a Virtual-Storage Computer. IBM Systems Journal 5, 2 (1966), 78–101.
[21] Bernat, G., Burns, A., and Llamosi, A. Weakly hard real-time systems. IEEE Transactions on Computers 50, 4 (2001), 308–321.
[22] Bershad, B. N., Chambers, C., Eggers, S., Maeda, C., McNamee, D., Pardyak, P., Savage, S., and Sirer, E. G. SPIN: an Extensible Microkernel for Application-specific Operating System Services. SIGOPS Operating Systems Review 29, 1 (1995), 74–77.
[23] Bilgic, A. M., and Hemmert, J. W. The Algorithmic Driving Force. Infineon Technologies, 2006.
[24] Blair, G. S., Coulson, G., Andersen, A., Blair, L., Clarke, M., Costa, F., Duran-Limon, H., Fitzpatrick, T., Johnston, L., Moreira, R., Parlavantzas, N., and Saikoski, K. Reflective Middleware: The Design and Implementation of Open ORB 2. IEEE Distributed Systems Online (see http://www.computer.org/dsonline), 6 (September 2001).
[25] Bondavalli, A., Stankovic, J., and Strigini, L. Adaptable Fault Tolerance for Real-Time Systems. In Proceedings of the 3rd International Workshop on Responsive Computer Systems (September 1993).
[26] Brown, A. D., and Mowry, T. C. Taming the Memory Hogs: Using Compiler-Inserted Releases to Manage Physical Memory Intelligently. In Proceedings of the Fourth Operating Systems Design and Implementation Conference (OSDI) (October 2000), p. 72.
[27] Bryce, R. W. Chameleon, a dynamically extensible and configurable object-oriented operating system. PhD thesis, Victoria, B.C., Canada, 2003. Adviser: G. C. Shoja.
[28] Bryce, R. W., Murata, K., Shoja, G. C., and Manning, E. G. Porting and enhancements of a real-time object-oriented operating system. In Proceedings of the PacRim ’95 Conference (May 1995), IEEE.
[29] Burns, A., and Wellings, A. Real-Time Systems and Programming Languages, second ed. Addison–Wesley, 1997.
[30] Campos, J. L., Gutiérrez, J. J., and Harbour, M. G. Interchangeable Scheduling Policies in Real-Time Middleware for Distribution. In Proceedings of the 11th International Conference on Reliable Software Technologies, Ada-Europe (Porto, Portugal, 2006), pp. 227–240.
[31] Candea, G. M., and Jones, M. B. Vassal: Loadable Scheduler Support for Multi-Policy Scheduling. In Proceedings of the Second USENIX Windows NT Symposium (August 1998), pp. 157–166.
[32] Carr, S., McKinley, K. S., and Tseng, C.-W. Compiler Optimizations for Improving Data Locality. In ASPLOS-VI: Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 1994), ACM Press, pp. 252–262.
[33] Carvalho, D., Kon, F., Ballesteros, F., Roman, M., Campbell, R., and Mickunas, D. Management of Execution Environments in 2K. In Proceedings of the Seventh International Conference on Parallel and Distributed Systems (ICPADS’2000) (July 2000), IEEE Computer Society, pp. 479–485.
[34] Cazzola, W., and Ancona, M. mChaRM: A reflective middleware for communications-based reflection. Tech. Rep. DISI-TR-00-09, Università degli Studi di Milano, Milan, Italy, May 2000.
[35] Cheriton, D. R., and Duda, K. J. A Caching Model of Operating System Kernel Functionality. In Proceedings of the 1st Symposium on Operating Systems Design and Implementation (November 1994), ACM Press, pp. 179–194.
[36] Chiba, S. Load-Time Structural Reflection in Java. Lecture Notes in Computer Science 1850 (2000), 313.
[37] Cox, M., and Ellsworth, D. Application-Controlled Demand Paging for Out-of-Core Visualization. In VIS ’97: Proceedings of the 8th Conference on Visualization (Los Alamitos, CA, USA, 1997), IEEE Computer Society Press, pp. 235–ff.
[38] Crawford, J. H., and Gelsinger, P. P. Programming the 80386. SYBEX, 1987.
[39] de Lara, E., Wallach, D. S., and Zwaenepoel, W. HATS: Hierarchical Adaptive Transmission Scheduling for Multi-Application Adaptation. In Proceedings of the 2002 Multimedia Computing and Networking Conference (MMCN’02) (San Jose, CA, January 2002).
[40] Denning, P. J. The Working Set Model for Program Behavior. In SOSP ’67: Proceedings of the First ACM Symposium on Operating System Principles (New York, USA, 1967), ACM Press, pp. 15.1–15.12.
[41] Denys, G., Piessens, F., and Matthijs, F. A Survey of Customizability in Operating Systems Research. ACM Computing Surveys (CSUR) 34 (December 2002).
[42] Doller, E. Flash Memory Trends and Technologies. Intel Developer Forum, MEMS001, http://download.intel.com/idf/us/docs/PS_MEMS001.pdf, 2006.
[43] Eggers, S. J., Keppel, D. R., Koldinger, E. J., and Levy, H. M. Techniques for efficient inline tracing on a shared-memory multiprocessor. In SIGMETRICS ’90: Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1990), ACM Press, pp. 37–47.
[44] O’Neil, E. J., O’Neil, P. E., and Weikum, G. The LRU-K page replacement algorithm for database disk buffering. In Proceedings of the ACM SIGMOD International Conference on Management of Data (1993), pp. 297–306.
[45] Endo, Y., Gwertzman, J., Seltzer, M., Small, C., Smith, K. A., and Tang, D. VINO: The 1994 Fall Harvest. Tech. Rep. TR-34-94, Center for Research in Computing Technology, Harvard University, December 1994.
[46] Engler, D. R., Gupta, S. K., and Kaashoek, M. F. AVM: Application-level Virtual Memory. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V) (May 1995), p. 72.
[47] Engler, D. R., Kaashoek, M. F., and O’Toole, Jr., J. Exokernel: an Operating System Architecture for Application-level Resource Management. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles (1995), ACM Press, pp. 251–266.
[48] Feizabadi, S., Li, P., Ravindran, B., and Suhaib, S. A Formally Verified Application-Level Framework for Real-Time Scheduling on POSIX Real-Time Operating Systems. IEEE Transactions on Software Engineering 30, 9 (2004), 613–629.
[49] Fiat, A., and Rosen, Z. Experimental studies of access graph based heuristics: beating the LRU standard? In SODA ’97: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (Philadelphia, PA, USA, 1997), Society for Industrial and Applied Mathematics, pp. 63–72.
[50] Foote, B., and Johnson, R. E. Reflective facilities in Smalltalk-80. ACM SIGPLAN Notices 24, 10 (1989), 327–335.
[51] Gall, D. L. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM 34, 4 (1991), 46–58.
[52] Gehani, N., and Ramamritham, K. Real-Time Concurrent C: A Language for Programming Dynamic Real-Time Systems. Real-Time Systems 3, 4 (December 1991).
[53] Necula, G. C., McPeak, S., Rahul, S. P., and Weimer, W. CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs. In Proceedings of the International Conference on Compiler Construction (CC 2002) (2002), pp. 213–228.
[54] Gorman, M. Understanding the Linux Virtual Memory Manager. Prentice Hall, April 2004.
[55] Goyal, P., Guo, X., and Vin, H. M. A Hierarchical CPU Scheduler for Multimedia Operating Systems. In Proceedings of the Second Symposium on Operating Systems Design and Implementation (Seattle, WA, October 1996), USENIX Association, pp. 107–121.
[56] Guthaus, M. R., Ringenberg, J. S., Ernst, D., Austin, T. M., Mudge, T., and Brown, R. B. MiBench: A free, commercially representative embedded benchmark suite. In Proceedings of the IEEE 4th Annual Workshop on Workload Characterization (December 2001).
[57] Hand, S. M. Self-Paging in the Nemesis Operating System. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (New Orleans, Louisiana, USA, February 1999), pp. 73–86.
[58] Harty, K., and Cheriton, D. R. Application-Controlled Physical Memory using External Page-Cache Management. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (New York, USA, 1992), vol. 27, ACM Press, pp. 187–197.
[59] Itoh, J.-I., Lea, R., and Yokote, Y. Using meta-objects to support optimisation in the Apertos operating system. In COOTS ’95: Proceedings of the USENIX Conference on Object-Oriented Technologies (Berkeley, CA, USA, 1995), USENIX Association.
[60] Infineon Technologies AG, 81726 München, Germany. TriCore 32-bit Unified Processor Core Embedded Applications Binary Interface (EABI), February 2007.
[61] Intel Corporation. Intel x86 Processor Family - Developer’s Manuals Vol. I, II and III, December 1998.
[62] Peterson, J. L., and Silberschatz, A. Operating System Concepts. Addison–Wesley, 1988.
[63] Jiang, S., Chen, F., and Zhang, X. CLOCK-Pro: an Effective Improvement of the CLOCK Replacement. In Proceedings of the 2005 USENIX Annual Technical Conference (USENIX ’05) (Berkeley, CA, USA, April 2005), USENIX Association.
[64] Jiang, S., and Zhang, X. LIRS: an Efficient Low Inter-reference Recency Set Replacement Policy to Improve Buffer Cache Performance. In SIGMETRICS ’02: Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2002), ACM Press, pp. 31–42.
[65] Johnson, T., and Shasha, D. 2Q: a Low Overhead High Performance Buffer Management Replacement Algorithm. In Proceedings of the Twentieth International Conference on Very Large Databases (Santiago, Chile, 1994), pp. 439–450.
[66] Kaplan, S. F. Collecting whole-system reference traces of multiprogrammed and multithreaded workloads. In WOSP ’04: Proceedings of the 4th International Workshop on Software and Performance (New York, NY, USA, 2004), ACM Press, pp. 228–237.
[67] Kaplan, S. F. Complete or fast reference trace collection for simulating multiprogrammed workloads: choose one. In SIGMETRICS ’04/Performance ’04: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2004), ACM Press, pp. 420–421.
[68] Kaplan, S. F., Smaragdakis, Y., and Wilson, P. R. Trace reduction for virtual memory simulations. In SIGMETRICS ’99: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1999), ACM Press, pp. 47–58.
[69] Kon, F., Campbell, R. H., Mickunas, M. D., Nahrstedt, K., and Ballesteros, F. J. 2K: A Distributed Operating System for Dynamic Heterogeneous Environments. In Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing (HPDC’9) (Pittsburgh, August 2000), pp. 201–208.
[70] Kon, F., Costa, F., Blair, G., and Campbell, R. H. The case for reflective middleware. Communications of the ACM 45, 6 (2002), 33–38.
[71] Kon, F., Roman, M., Liu, P., Mao, J., Yamane, T., Magalhães, C., and Campbell, R. H. Monitoring, security, and dynamic configuration with the dynamicTAO reflective ORB. In Middleware ’00: IFIP/ACM International Conference on Distributed Systems Platforms (Secaucus, NJ, USA, 2000), Springer-Verlag New York, Inc., pp. 121–143.
[72] Kon, F., Singhai, A., Campbell, R. H., Carvalho, D., Moore, R., and Ballesteros, F. J. 2K: A Reflective, Component-Based Operating System for Rapidly Changing Environments. In ECOOP ’98 Workshop on Reflective Object-Oriented Programming and Systems (Brussels, Belgium, July 1998).
[73] Krueger, K., Loftesness, D., Vahdat, A., and Anderson, T. Tools for the Development of Application-Specific Virtual Memory Management. In Proceedings of the OOPSLA ’93 Conference on Object-Oriented Programming Systems, Languages and Applications (1993), pp. 48–64.
[74] Lebeck, A. R., and Wood, D. A. Active memory: a new abstraction for memory system simulation. ACM Transactions on Modeling and Computer Simulation 7, 1 (1997), 42–77.
[75] Ledoux, T. OpenCORBA: A reflective open broker. In Proceedings of Reflection ’99 (July 1999), Springer-Verlag, pp. 197–214.
[76] Lee, D., Choi, J., Kim, J.-H., Noh, S. H., Min, S. L., Cho, Y., and Kim, C. S. LRFU (Least Recently/Frequently Used) Replacement Policy: A Spectrum of Block Replacement Policies. Tech. Rep. SNU-CE-AN-96-004, Seoul National University, March 1996.
[77] Lee, D., Choi, J., Kim, J. H., Noh, S. H., Min, S. L., Cho, Y., and Kim, C. S. LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies. IEEE Transactions on Computers 50, 12 (2001), 1352–1361.
[78] Liedtke, J. L4 Reference Manual (486, Pentium, Pentium Pro). Tech. rep., GMD-German National Research Center for Information Technology, September 1996.
[79] Liu, C. L., and Layland, J. W. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM 20, 1 (January 1973), 46–61.
[80] Lund, K., and Goebel, V. Adaptive Disk Scheduling in a Multimedia DBMS. In MULTIMEDIA ’03: Proceedings of the 11th ACM International Conference on Multimedia (2003), ACM Press, pp. 65–74.
[81] Malenfant, J., Jaques, M., and Demers, F.-N. A Tutorial on Behavioral Reflection and its Implementation. In Proceedings of the Reflection 96 Conference (San Francisco, California, USA, April 1996), G. Kiczales, Ed., pp. 1–20.
[82] Malkawi, M., and Patel, J. Compiler Directed Memory Management Policy for Numerical Programs. In SOSP ’85: Proceedings of the Tenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 1985), ACM Press, pp. 97–106.
[83] Martonosi, M., Gupta, A., and Anderson, T. MemSpy: analyzing memory system bottlenecks in programs. In SIGMETRICS ’92/PERFORMANCE ’92: Proceedings of the 1992 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1992), ACM Press, pp. 1–12.
[84] Matsuoka, S., Ogawa, H., Shimura, K., Kimura, Y., Hotta, K., and Takagi, H. OpenJIT - A Reflective Java JIT Compiler. In Proceedings of the OOPSLA ’98 Workshop on Reflective Programming in C++ and Java (November 1998), pp. 16–20.
[85] McNamee, D., and Armstrong, K. Extending the Mach External Pager Interface To Accommodate User-Level Page Replacement Policies. In Proceedings of the USENIX Association Mach Workshop (1990), pp. 17–29.
[86] Megiddo, N., and Modha, D. S. ARC: A Self-Tuning, Low Overhead Replacement Cache. In FAST ’03: Proceedings of the 2nd USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2003), USENIX Association, pp. 115–130.
[87] Nieh, J., and Lam, M. S. The Design, Implementation and Evaluation of SMART: A Scheduler for Multimedia Applications. In SOSP ’97: Proceedings of the 16th ACM Symposium on Operating Systems Principles (October 1997), ACM Press, pp. 184–197.
[88] Niehaus, D. Program Representation and Translation for Predictable Real-Time Systems. In Proceedings of the IEEE Real-Time Systems Symposium (December 1991), pp. 43–52.
[89] Niehaus, D., Stankovic, J., and Ramamritham, K. The Spring System Description Language. Tech. Rep. UMASS TR-93-08, University of Massachusetts Amherst, 1993.
[90] Nutt, G. Operating Systems, third ed. Addison–Wesley, 2004.
[91] O’Neil, E. J., O’Neil, P. E., and Weikum, G. An optimality proof of the LRU-K page replacement algorithm. Journal of the ACM 46, 1 (1999), 92–112.
[92] Patel, K., Smith, B. C., and Rowe, L. A. Performance of a Software MPEG Video Decoder. In MULTIMEDIA ’93: Proceedings of the First ACM International Conference on Multimedia (1993), ACM Press, pp. 75–82.
[93] Patil, A. PROTON: a customisable on-the-fly Virtual Memory Simulator. Tech. Rep. YCS-2007-420, University of York, York, UK, 2007.
[94] Patil, A. VRHS: an Application Specific Reflective Hierarchical Scheduler. Tech. Rep. YCS-2007-419, University of York, York, UK, 2007.
[95] Patil, A., and Audsley, N. An Application Adaptive Generic Module-based Reflective Framework for Real-time Operating Systems. In Proceedings of the 25th IEEE Real-Time Systems Symposium, Work in Progress Session (Lisbon, Portugal, December 2004).
[96] Patil, A., and Audsley, N. Implementing Application-Specific RTOS Policies using Reflection. In Proceedings of the 11th IEEE Real-Time and Embedded Technology and Applications Symposium (San Francisco, 2005), pp. 438–447.
[97] Patil, A., and Audsley, N. Efficient Page lock/release mechanism in OS for out-of-core Embedded Applications. In Proceedings of the 13th IEEE Real-Time and Embedded Computing Systems and Applications Symposium (Daegu, Korea, August 2007), pp. 81–88.
[98] POSIX.1. IEEE Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) [C Language]. Tech. rep., IEEE Std 1003.1-1988, 1988.
[99] Regehr, J., and Stankovic, J. A. HLS: A Framework for Composing Soft Real-Time Schedulers. In Proceedings of the 22nd IEEE Real-Time Systems Symposium (RTSS’01) (London, UK, December 2001), IEEE Computer Society, pp. 3–14.
[100] Regehr, J. D. Using Hierarchical Scheduling to Support Soft Real-Time Applications in General-Purpose Operating Systems. PhD thesis, University of Virginia, May 2001.
[101] Rivard, F. A New Smalltalk Kernel Allowing Both Explicit and Implicit Metaclass Programming. In Proceedings of OOPSLA ’96, Workshop: Extending the Smalltalk Language (October 1996).
[102] Rivas, M. A., and Harbour, M. G. Application-Defined Scheduling in Ada. ACM Ada Letters XXII, 4 (December 2002), 77–84.
[103] Rivas, M. A., and Harbour, M. G. POSIX-Compatible Application-Defined Scheduling in MaRTE OS. In Proceedings of the 14th Euromicro Conference on Real-Time Systems (June 2002), IEEE Computer Society, pp. 67–75.
[104] Rivas, M. A., and Harbour, M. G. Proposal of Application-Defined Scheduling Interface. Proposal submitted for consideration by the Real-time POSIX Working Group, July 2002. URL: http://marte.unican.es/appsched-proposal.pdf.
[105] Rogers, P. Software Fault Tolerance, Reflection and the Ada Programming Language. PhD thesis, University of York, UK, October 2003.
[106] Rogers, P., and Wellings, A. J. OpenAda: A Metaobject Protocol for Ada 95.
[107] Rowe, L. A., Patel, K. D., Smith, B. C., and Liu, K. MPEG Video in Software: Representation, Transmission and Playback. In Proceedings of the Symposium on Electronic Imaging Science & Technology (February 1994).
[108] Seward, J., and Nethercote, N. Using Valgrind to detect undefined value errors with bit-precision. In Proceedings of the USENIX ’05 Annual Technical Conference (April 2005).
[109] Silberschatz, A., Galvin, P. B., and Gagne, G. Operating System Concepts, sixth ed. John Wiley & Sons, Inc., 2002.
[110] Singhai, A. Quarterware: a middleware toolkit of software RISC components. PhD thesis, Champaign, IL, USA, 1999. Adviser: Roy H. Campbell.
[111] Smaragdakis, Y., Kaplan, S., and Wilson, P. EELRU: Simple and Effective Adaptive Page Replacement. In SIGMETRICS ’99: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 1999), ACM Press, pp. 122–133.
[112] Smith, B. C. Reflection and Semantics in a Procedural Language. PhD thesis, Massachusetts Institute of Technology, January 1982.
[113] Spencer, B., Wilson, L., and Doering, R. The Semiconductor Technology Roadmap. Tech. rep., Future Fab International, December 2005.
[114] Srivastava, A., and Eustace, A. ATOM: a system for building customized program analysis tools. In PLDI ’94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (New York, NY, USA, 1994), ACM Press, pp. 196–205.
[115] Stankovic, J. A. Reflective Real-Time Systems. Tech. Rep. 93-56, University of Massachusetts, 1993.
[116] Stankovic, J. A., and Ramamritham, K. The Spring Kernel: a New Paradigm for Real-Time Operating Systems. SIGOPS Operating Systems Review 23, 3 (1989), 54–71.
[117] Stankovic, J. A., and Ramamritham, K. The Spring Kernel: a New Paradigm for Real-Time Operating Systems. SIGOPS Operating Systems Review 23, 3 (1989), 54–71.
[118] Stankovic, J. A., and Ramamritham, K. A Reflective Architecture for Real-Time Operating Systems. Prentice-Hall, Inc., 1995.
[119] Stonebraker, M. Operating System Support for Database Management. Communications of the ACM 24, 7 (1981), 412–418.
[120] Tanenbaum, A. S., and Woodhull, A. S. Operating Systems: Design and Implementation, second ed. Prentice Hall, 1997.
[121] Turley, J. Operating Systems on the Rise. Embedded Systems Design, http://www.embedded.com/columns/surveys/187203732, 2006.
[122] Uhlig, R. A., and Mudge, T. N. Trace-driven memory simulation: a survey. ACM Computing Surveys 29, 2 (1997), 128–170.
[123] Venkatachalam, V., and Franz, M. Power Reduction Techniques for Microprocessor Systems. ACM Computing Surveys 37, 3 (2005), 195–237.
[124] Williams, N. J. An Implementation of Scheduler Activations on the NetBSD Operating System. In Proceedings of the FREENIX Track: USENIX Annual Technical Conference (June 2002).
[125] Winwood, S., and Heiser, G. Flexible Scheduling Mechanisms in L4. Tech. rep., University of New South Wales, Australia, November 2000.
[126] Wolf, W. Computers as Components: Principles of Embedded Computing System Design. Morgan Kaufmann, July 2005.
[127] Yang, Z., and Duddy, K. CORBA: a platform for distributed object computing. SIGOPS Operating Systems Review 30, 2 (1996), 4–31.
[128] Yokote, Y. The Apertos Reflective Operating System: The Concept and Its Implementation. In Conference Proceedings on Object-Oriented Programming Systems, Languages, and Applications (1992), ACM Press, pp. 414–434.
[129] Yokote, Y., and Tokoro, M. The new structure of an operating system: the Apertos approach. In Proceedings of the 5th Workshop on ACM SIGOPS European Workshop (New York, NY, USA, 1992), ACM.
[130] Zhu, M.-Y., Luo, L., and Xiong, G.-Z. The Minimal Model of Operating Systems. ACM SIGOPS Operating Systems Review 35 (July 2001).
[131] Zuberi, K. M., Pillai, P., and Shin, K. G. EMERALDS: a Small-Memory Real-Time Microkernel. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (1999), ACM Press, pp. 277–299.