
TR-CS-07-01

A Survey of how Virtual Machine and Intelligent Runtime Environments

can Support Cluster Computing

Samuel Chang, Peter Strazdins

February 2007

Joint Computer Science Technical Report Series

Department of Computer Science
Faculty of Engineering and Information Technology

Computer Sciences Laboratory
Research School of Information Sciences and Engineering


This technical report series is published jointly by the Department of Computer Science, Faculty of Engineering and Information Technology, and the Computer Sciences Laboratory, Research School of Information Sciences and Engineering, The Australian National University.

Please direct correspondence regarding this series to:

Technical Reports
Department of Computer Science
Faculty of Engineering and Information Technology
The Australian National University
Canberra ACT 0200
Australia

or send email to:

Technical-DOT-Reports-AT-cs-DOT-anu.edu.au

A list of technical reports, including some abstracts and copies of some full reports, may be found at:

http://cs.anu.edu.au/techreports/

Recent reports in this series:

TR-CS-06-02 Peter Christen. A comparison of personal name matching: Techniques and practical issues. September 2006.

TR-CS-06-01 Stephen M Blackburn, Robin Garner, Chris Hoffmann, Asjad M Khan, Kathryn S McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J Eliot B Moss, Aashish Phansalkar, Darko Stefanovic, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. The DaCapo benchmarks: Java benchmarking development and analysis (Extended Version). September 2006.

TR-CS-05-01 Peter Strazdins. CycleCounter: an efficient and accurate UltraSPARC III CPU simulation module. May 2005.

TR-CS-04-04 C. W. Johnson and Ian Barnes. Redesigning the intermediate course in software design. November 2004.

TR-CS-04-03 Alonso Marquez. Efficient implementation of design patterns in Java programs. February 2004.

TR-CS-04-02 Bill Clarke. Solemn: Solaris emulation mode for Sparc Sulima. February 2004.

A Survey of How Virtual Machine and Intelligent Runtime Environments Can Support Cluster Computing

Samuel Chang
Department of Computer Science
Australian National University

[email protected]

Peter Strazdins
Department of Computer Science
Australian National University

[email protected]

Abstract

The use of cluster computers in high performance computing environments has recently been on the rise and, at present, presents a fertile source of unsolved research problems. With the Australian National University's (ANU) High Performance Computing (HPC) research group planning to undertake two projects in the field of cluster computing very shortly, the aim of this survey is to lay the groundwork for both projects by providing comprehensive, up-to-date information on current research related to virtual machines and intelligent runtime environments for use in the context of a large cluster or grid system. This survey covers a broad range of information sources, namely the proceedings of top conferences, journals, and technical reports published by notable research groups and researchers. We summarize key papers and list research groups undertaking projects that we feel are worth taking note of. Subsequently, the information gathered is synthesized to identify unsolved problems and potential areas of future work. Notable findings presented in this survey are that paravirtualization of operating systems is no longer a necessary requirement for use with the Xen hypervisor, as processor vendors Intel [10] and AMD [1] are building virtualization support into their processors, and that migration of entire operating systems now incurs as little as 100ms of downtime by iteratively pre-copying pages [15]. We have also come to the conclusion that an important first step in the successful development of an intelligent runtime environment is to develop a system which can accurately profile both the hardware that a cluster system is made up of and the applications that would be run on it.

1. Introduction

Supercomputers provide a significant increase in computing power that allows an otherwise unsolvable, computationally intensive problem, such as weather forecasting or the simulation of airplanes in wind tunnels, to be solved in a reasonable amount of time. Cluster¹ computers, which are systems made up of commodity off-the-shelf (COTS) components, are increasingly becoming the tool of choice when it comes to building a supercomputer.

Currently, more than 70% of the top 500 supercomputers are cluster systems [84]. This trend does not come as a surprise, as cluster computers provide a substantial increase in computing power at a relatively inexpensive cost compared to other supercomputing architectures. Furthermore, by using COTS components to build a cluster computer, should hardware in the system fail, parts are easily and cheaply replaced without the need for expensive custom-made and proprietary components.

¹ This survey solely concentrates on high performance clusters as opposed to availability or throughput clusters.

The purpose of this survey is to lay the groundwork for two projects related to cluster computing that will be undertaken in the very near future by the ANU's HPC research group. These projects are briefly summarized below:

Project number one [83], “Virtualization Support for Heterogeneous Clusters”, explores how the Xen virtual machine monitor can be used in a heterogeneous cluster computing environment. This project includes evaluating and reducing the performance overhead of Xen, the configuration and selection of an operating system (OS) tuned to the needs of an application over the cluster, and migrating an application over the cluster by migrating the virtual machines running on it.

The second project [83], “Next Generation Grid Enabled Cluster Computers: Performance Optimization for e-Science”, explores the development of intelligent runtime environments for large cluster and/or grid systems that are capable of selecting where best to execute a user's application code based on cost/benefit considerations. This project would build on the first by utilizing live migration technologies to allow live migration of running application processes between nodes in the cluster, and as a means of tailoring the operating system to particular applications.

We conducted this survey as follows. Firstly, we looked through the proceedings of top conferences² in architecture, operating systems, networks, cluster computing and grid computing, before moving on to literature found in the major IEEE and ACM journals related to the above-mentioned areas. Next, we browsed through technical reports published by relevant research groups, both in industry and academia. Finally, we went through the publication lists of notable academics so as to pick up any potentially interesting papers that may have slipped through the cracks. The reader is directed to Appendix A, which lists all our sources of information.

The paper is organized as follows: Section 2 contains information on virtual machines, with Subsection 2.1 covering virtual machine technology and Subsection 2.2 covering the migration of virtual machines. Next, Section 3 introduces the concept of intelligent runtime environments and the components of which they are comprised. These components, namely scheduling and load balancing, and fault tolerance mechanisms, are covered in Subsections 3.1 and 3.2 respectively. The Open Source Cluster Application Resources (OSCAR) suite used in cluster management is covered in Subsection 3.3. Section 4 contains unsolved problems and suggestions for potential areas of future work. Some literature which does not fall into any of the above categories, but which the reader is highly recommended to read, can be found in Section 5. Finally, Section 6 concludes the paper. Section 7 serves a twofold purpose: it is both our bibliography and a collection of relevant literature, grouped by topic. Appendix A lists all the sources from which we gathered our information, and Appendix B contains a list of research groups and notable projects which we feel are worth keeping track of.

² The reader is advised that some of the papers mentioned in this survey require membership with the ACM Digital Library and/or IEEE Xplore in order to view.

2. Virtual Machines

A virtual machine is an environment whereby a guest operating system is presented with a set of virtualized hardware devices and is managed by a virtual machine monitor (VMM), or hypervisor. The hypervisor is commonly run on a host operating system or, in some cases, directly on bare hardware. Multiple guest operating systems can be run simultaneously on a single physical machine, in which case the hypervisor multiplexes the underlying hardware between them.

2.1 Virtual Machine Technology

Since 2005, virtual machines have again become a hot topic in academia and industry. Venture capital firms are competing to fund startup companies touting virtual machine based technologies, while major hardware manufacturers Intel, AMD, Sun Microsystems and IBM are developing virtualization strategies that target markets with revenues in the billions and growing. In research labs and universities worldwide, researchers are developing approaches based on virtual machines to solve mobility, security and manageability problems [12].

Why this sudden resurgence in virtual machine technology? Some of the many benefits of using virtual machines are: simplicity of service management, application scheduling flexibility, ease of testing and debugging complex systems, hardware multiplexing, enhanced system management capabilities, the ability to run conflicting processes in isolation from each other, heightened system security, live migration, dynamic content distribution, and secure trusted computing.

The main benefits of virtual machine use in the context of HPC cluster systems are:

• The ability to multiplex hardware, thereby granting the user the power to select, change and run multiple operating systems catered to specific applications on a single node.

• The ability to present a similar, virtualized, generic hardware interface to guest operating systems regardless of the underlying hardware.

• The flexibility, for intelligent runtime environments that manage the cluster, to migrate entire operating systems by migrating the virtual machine in which each OS is encapsulated from node to node across the cluster computer. The topic of virtual machine migration is covered thoroughly in Subsection 2.2.

With the ability to multiplex hardware, it is possible to highly customize the operating system and runtime environment for each application. For example, operating systems containing different kernel configurations, system software parameters, loaded modules, as well as memory and disk space, can be booted and run based on the specific needs of an application. Multiplexing of hardware also presents some interesting research scenarios. It allows the possibility of running two operating systems with differing responsibilities on a node, say, a compute OS and a file-serving OS, and the ability to toggle the processing power supplied to each operating system based on its workload.
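To make the compute-OS/file-serving-OS scenario concrete, the following is a minimal sketch of a VMM apportioning CPU capacity between guests in proportion to their workloads. The function name, guest names and the simple proportional-share policy are our own illustrative assumptions, not drawn from any surveyed system:

```python
def cpu_shares(workloads, total_capacity=1.0):
    """Give each guest OS a CPU share proportional to its offered load."""
    total = sum(workloads.values())
    if total == 0:
        # No guest has pending work: split the capacity evenly.
        return {g: total_capacity / len(workloads) for g in workloads}
    return {g: total_capacity * w / total for g, w in workloads.items()}

# A compute OS with three times the workload of a file-serving OS
# receives three quarters of the node's CPU capacity.
shares = cpu_shares({"compute-os": 3.0, "file-os": 1.0})
```

Real hypervisors such as Xen use considerably more sophisticated weight- and credit-based schedulers; this sketch only illustrates the idea of toggling processing power with workload.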

Most cluster systems are heterogeneous systems built from COTS components. The heterogeneity arises from the fact that nodes are added to the cluster system only as funds progressively become available. This presents a problem, as a wide range of hardware and operating systems can be found in the cluster system at any point in time. Managing, let alone using, such a system becomes a serious challenge. However, when virtual machines are employed, the VMM can present to each operating system a generic set of virtualized hardware devices, giving the appearance of a homogeneous system despite heterogeneous underlying hardware.

Also, with new versions of software and packages being frequently released, there is an increasing burden of updating software and, at the same time, possibly needing to reconfigure the cluster system for its use. With virtual machine technology, however, there is the potential to update software by redistributing VM images containing the updated software and seamlessly merging them with the current virtual machine image.

2.1.1 Summary of Key Papers (Virtual Machine Technology)

A must-read paper on virtual machines is “Xen and the Art of Virtualization” [1] by P. Barham, B. Dragovic et al. This paper describes the fundamental design and architecture of the Xen virtual machine monitor, a highly popular VMM used in virtual machine research today. Understanding how Xen was designed and works is fundamental to utilizing it and building on top of it. A point to note is that the Xen hypervisor has undergone several revisions since the publication of the abovementioned paper and, while significant improvements have been made to it, it has also, not surprisingly, become more complex! Recently, CPU manufacturers Intel and AMD have added virtualization support to their CPUs, named VT-x [10] and AMD-V [1] respectively. This additional support allows an unmodified OS to run on Xen, meaning paravirtualization is no longer a necessary requirement.

Worth a quick glance is the paper “Xen and the Art of Repeated Research” [6] by B. Clark, T. Deshane et al. The authors conducted multiple experiments with the Xen VMM and reported their findings, which, to a large extent, support those found in [1]. Some key findings presented in the paper are that Xen can support up to about sixteen servers on an average server-class machine, and that Xen performs surprisingly well when compared to hardware made specifically for virtualization.

The paper “Survey of Virtual Machine Research” [8] by R. Goldberg, though dated, presents core concepts still used in virtual machine research today. Two sections of this paper that are well worth reading are the “Principles” and “Practice” sections. Both provide good introductory material, with the Principles section describing the fundamentals of a VMM and the Practice section describing scenarios in which a virtual machine monitor can be used.

A more recent paper by M. Rosenblum and T. Garfinkel entitled “Virtual Machine Monitors: Current Technology and Future Trends” [12] explains the reasons for the revival of virtualization, describes the current scene of virtual machine technology, and lists the challenges that would be faced in implementing a VMM today. Three highly applicable areas it touches on are the challenges associated with CPU virtualization, memory virtualization and I/O virtualization.


In the paper “When Virtual Is Better than Real” [5], authors P. Chen and B. Noble make a case as to why virtual machines are worthwhile investments in any computing environment. The paper provides three case studies: secure logging, intrusion prevention and detection, and environment migration. The section of most relevance to our research is “Environment migration”, as it provides a good overview of the challenges faced in the migration of virtual machines as well as possible methods of overcoming them. Some of these challenges include the development of a mechanism to preserve the synchronous state of an operating system during migration, and the support of migration between physical machines which do not have identical hardware. Migration between physical machines with different hardware poses a problem because, despite a hypervisor providing a virtualized set of physical devices to the guest operating system, guest operating systems occasionally access physical hardware components directly for improved performance.

As a node in the cluster may contain more than one CPU or a multi-core processor, there is a need to utilize that resource efficiently. That idea is explored in E. Bugnion, S. Devine et al.'s paper “Disco: Running Commodity Operating Systems on Scalable Multiprocessors” [4], which examines the problem of extending modern operating systems to run efficiently on large-scale shared-memory multiprocessors without a large implementation effort. This is achieved by inserting an additional layer of software between the hardware and the operating system, with the additional layer acting as a virtual machine monitor, so that multiple copies of “commodity” operating systems can be run on a single scalable computer. The monitor also allows these commodity systems to efficiently cooperate and share resources with each other. The resulting system contains most of the features of custom scalable operating systems developed specifically for scalable multiprocessor systems, at a fraction of the cost.

S. King, G. Dunlap et al.'s paper “Operating System Support for Virtual Machines” [11] provides a good introduction to the two different types of VMMs in use today, namely the Type I VMM and the Type II VMM. The paper defines a Type I VMM as one which runs directly on bare hardware with guest virtual machines running on top of it, while a Type II VMM requires an additional host operating system layer between the bare hardware and the VMM. The paper then goes on to describe three optimizations to improve the performance of a Type II VMM, which yielded a performance increase of about 14 to 35 percent. This finding is of interest, even though Xen itself is generally classified as a Type I VMM.

The paper by W. Huang, J. Liu et al. entitled “A Case for High Performance Computing with Virtual Machines” [9] provides a strong argument for the use of VM technology in cluster computing systems. It introduces and backs up the potential benefits, including manageability, system security, performance isolation, checkpointing, restarting and live migration. The paper also demonstrates that, despite running in a virtualized environment, HPC applications can achieve almost the same performance as those running in a native, non-virtualized environment. Two techniques used to obtain such good performance are VMM-bypass I/O and scalable VM image management.

2.2 Virtual Machine Migration

There are many reasons for migrating a virtual machine. For example, from the point of view of a system administrator, the ability to migrate an entire virtual machine across hardware simplifies the issue of server maintenance. The administrator could migrate an operating system containing a web server from a primary machine to a secondary machine and take the primary machine offline for servicing.

However, in the context of high performance computing, the main benefit that virtual machine migration offers is the ability to migrate entire operating systems containing running applications to different nodes of a cluster computer. Some benefits of migrating an entire operating system are increased fault tolerance, load balancing, and better performance due to reduced latency. In the case of fault tolerance, a virtual machine running on a node which has failed can be migrated to an available node. Load balancing attempts to distribute the workload evenly throughout the cluster. Lastly, by migrating a virtual machine containing running applications closer to their data sources, latency is reduced, which in turn results in better performance.
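As a toy illustration of the fault-tolerance and load-balancing cases, the sketch below picks a migration target for a displaced VM. The node records, field names and the least-loaded first-fit policy are invented for illustration and do not come from any surveyed system:

```python
def pick_target(nodes, required_mem):
    """Pick the least-loaded healthy node with enough free memory."""
    candidates = [n for n in nodes
                  if n["healthy"] and n["free_mem"] >= required_mem]
    if not candidates:
        return None  # no feasible destination; migration must wait
    return min(candidates, key=lambda n: n["load"])["name"]

# Hypothetical cluster state: node1 has failed, node2 is busy.
nodes = [
    {"name": "node1", "healthy": False, "free_mem": 8, "load": 0.1},
    {"name": "node2", "healthy": True,  "free_mem": 4, "load": 0.9},
    {"name": "node3", "healthy": True,  "free_mem": 8, "load": 0.3},
]
```

A production runtime would also weigh data locality and network topology, which this sketch deliberately omits.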

A logical question to ask is why entire operating systems should be migrated, as opposed to individual processes or applications. The justification for virtual machine migration is presented in the following subsections.


2.2.1 Migrating Processes

While it would be ideal to map a process to a node, or subset of nodes, in a cluster for the entire execution duration of the process, this is unfortunately not always optimal, nor even possible. For example, suboptimality arises when a subset of nodes more suited to the particular process than the ones it is currently executing on becomes available; there would be inefficient use of resources unless execution of the process is migrated to the more suitable subset of nodes. Also, in a very straightforward scenario, should a node fail, processes would no longer be able to execute on it. A webpage that lists some motivations for process migration can be found at [17].

The paper “The Design and Implementation of Zap: A System for Migrating Computing Environments” [21] lists five key reasons for supporting process migration. These are:

• Fault resilience, by migrating processes off faulty hosts

• Data access locality, by migrating processes closer to the data

• Better system response time, by migrating processes closer to users

• Dynamic load balancing, by migrating processes to less loaded hosts

• Improved service availability and administration, by migrating processes before host maintenance so that applications can continue to run with minimal downtime.

Despite these advantages, the migration of individual processes is not widely used today due to significant complications. For example, the necessity of similar operating systems and underlying hardware poses a challenge, as a heterogeneous cluster could be running different hardware and different operating systems. Even if the conditions for the OS and hardware were met, there is still the problem of residues of the process remaining on the original machine, with complicated abstraction layers being required for such migrations to work. Also, should the machine from which the processes were originally migrated fail, there is a high chance that the processes would cease to function.

2.2.2 Migrating Virtual Machines

With virtual machine technology, an entire guest operating system is cleanly encapsulated inside a virtual machine. The narrow interface to the virtualized hardware provided by the hypervisor allows migration of the entire guest operating system to be cleanly executed. An operating system running in a virtual machine can be migrated to execute on an arbitrary node regardless of the underlying hardware. The migration of an entire OS and all of its applications as one unit avoids many of the difficulties faced by process-level migration approaches [15]. Any application formerly running in the guest operating system can resume execution once migration is complete.

2.2.3 Summary of Key Papers (Virtual Machine Migration)

A key sister paper to [3] is “Live Migration of Virtual Machines” [15] by C. Clark, K. Fraser et al. In this paper, the authors describe several mechanisms and strategies for the live migration of virtual machines across clusters using the Xen VMM. An added bonus is that astonishingly good results were obtained, with service downtimes as low as 100ms. This is achieved by using a pre-copy approach in which pages of memory are iteratively copied from the source machine to the destination host, all without ever stopping the execution of the virtual machine being migrated. Towards the end, the virtual machine is paused for a very brief period to copy any remaining pages to the destination, before resuming execution there. However, there are two limitations to this approach. Firstly, wide-area network redirection across subnets is not supported, resulting in a situation similar to that requiring the use of Mobile IP. Secondly, the authors assume that the cluster is running on a network file system; should there be local disks which need to be migrated, the authors state that total migration times may be intolerably extended. We are reassured that our project is headed in the right direction by the authors' statement that “a key challenge is to develop cluster control software which can make informed decisions as to the placement and movement of virtual machines”.
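The iterative pre-copy idea can be modelled in a few lines. This is a deliberately idealized sketch with a fabricated, constant re-dirtying rate, not the actual Xen implementation: pages are copied round by round while the guest keeps running, and the VM is paused only to transfer the final remainder.

```python
def precopy_rounds(total_pages, dirty_fraction=0.1, max_rounds=10):
    """Return (rounds, pages_left_at_pause) for an idealized pre-copy.

    Each round copies everything still dirty; the running guest then
    re-dirties `dirty_fraction` of the copied pages before the next round.
    """
    remaining = total_pages
    rounds = 0
    while rounds < max_rounds and remaining > 1:
        remaining = int(remaining * dirty_fraction)
        rounds += 1
    # `remaining` pages are copied during the brief stop-and-copy pause.
    return rounds, remaining
```

With 10,000 pages and a 10% re-dirtying rate, the working set shrinks geometrically, so only a handful of rounds are needed before the pause transfers a single page; this is why downtime can be so short even for large memory images.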

In the paper “Internet Suspend/Resume” [19], authors M. Kozuch and M. Satyanarayanan propose an approach that allows an operating system to be accessed from any location using a thin client device. The benefit of such a system is that a user would be able to use his individually customized operating system from any location supporting such a system. This is achieved through the unprecedented use of layering virtual machine technology on a distributed file system. One useful idea presented in the paper, called “Proactive State Transfer”, attempts to predict the location from which the system will next be accessed. By predicting a system's access patterns, files can be copied across the network before actual migration occurs, resulting in less time spent during the actual migration of the virtual machine.
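A toy predictor in the spirit of Proactive State Transfer might guess where a suspended machine will next be resumed from the user's past movement history, so that state can be pushed there ahead of time. The history format and frequency-based policy here are our own invention, not the paper's mechanism:

```python
from collections import Counter

def predict_next_host(history, current):
    """Return the host the user most often moves to from `current`."""
    moves = Counter(nxt for prev, nxt in zip(history, history[1:])
                    if prev == current)
    if not moves:
        return None  # no history for this host; nothing to prefetch
    return moves.most_common(1)[0][0]

# Hypothetical suspend/resume trail for one user.
history = ["office", "home", "office", "home", "office", "lab"]
```

Here the predictor would prefetch state to "home" whenever the machine is suspended at "office", since that transition dominates the trail.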

C. Sapuntzakis, R. Chandra et al.'s paper entitled “Optimizing the Migration of Virtual Computers” [22] introduces the concept of a capsule, which is the state of a running computer's disks, memory, CPU registers and I/O devices. Capsules are then used for the migration of virtual machines across different physical machines. There is similarity to [15] in terms of the goal of the paper and some of the techniques used. However, the main drawback of this approach is that the entire VM has to be suspended while migration takes place, resulting in downtime and loss of productivity. Useful migration optimization techniques such as ballooning, demand paging of capsule disks, hash-based compression, hash cache design, finding similar blocks and the HCP protocol can be learnt from the paper.
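The hash-based techniques share a simple core idea: before sending a disk block, send its hash, and skip the data if the receiver already holds a block with that hash. The sketch below illustrates that idea only; the function names are ours, and the real paper's protocol (HCP) is considerably more elaborate:

```python
import hashlib

def blocks_to_send(source_blocks, receiver_hashes):
    """Return indices of blocks the receiver does not already have."""
    def h(block):
        return hashlib.sha256(block).hexdigest()
    return [i for i, b in enumerate(source_blocks)
            if h(b) not in receiver_hashes]

# Hypothetical capsule disk: the receiver already has two of the blocks
# (e.g. from a common OS image), so only the third must cross the network.
blocks = [b"kernel", b"libc", b"app-data"]
known = {hashlib.sha256(b"kernel").hexdigest(),
         hashlib.sha256(b"libc").hexdigest()}
```

Because cluster nodes typically share most of their OS images, this kind of content-addressed transfer can eliminate the bulk of a capsule's volume.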

In the paper “Self-migration of Operating Systems” [18], authors J. Hansen and E. Jul suggest a different approach to migrating virtual machines. Instead of traditional host-driven migration, the authors propose self-migration of virtual machines and claim there are additional benefits when virtual machines are migrated in this manner. For example, there is less overhead incurred from communication between the VMM and the virtual machine, as well as increased security. The network and CPU cost of performing the migration is attributed to the guest OS rather than to the host environment. Portability is another benefit of self-migration: since migration happens without hypervisor involvement, this approach is less dependent on the semantics of the hypervisor and can be ported across different hypervisors and microkernels.

“The Design and Implementation of Zap: A System for Migrating Computing Environments” [21] by S. Osman, D. Subhraveti et al. describes a novel system for the transparent migration of applications. Zap introduces the concept of a pod, a group of processes that is provided a consistent, virtualized view of the underlying system. This consistent view is achieved by placing a thin virtualization layer on top of the operating system. While Zap's emphasis is on the migration of processes, it presents concepts that should be taken into consideration when designing a virtual machine migration system. These concepts, found in Sections 4.2 to 4.5 of the paper, deal with memory virtualization and migration, file system virtualization and migration, device virtualization and migration, and network virtualization and migration.

It is worth bringing to the reader's attention a virtualization feature called Solaris Containers, found in the Sun Solaris 10 operating system from Sun Microsystems [26]. Solaris Containers are described as “a breakthrough approach to virtualization and software partitioning by allowing many private execution environments to be created within a single instance of the Solaris OS. Each environment has its own identity, separate from the underlying hardware. This causes it to behave as if it is running on its own system, making consolidation simple, safe, and secure” [23].

3. Intelligent Runtime Environments

Optimal mapping of processes to nodes, or processors, in a cluster is crucial to obtaining the most from the cluster system. On one hand there is the goal of providing maximal computing power to a running process or application, while on the other is the goal of servicing as many jobs as possible. These goals must be achieved even in the event of node failures. Compounding the problem, subsets of nodes in a computational grid may be owned by different organizations that impose different restrictions and policies on the usage of resources. There is clearly a need for an intelligent runtime environment³ to handle these issues.

3.1 Scheduling and Load Balancing

Many factors can affect the performance of a cluster from the point of view of a process, including the processing power of the node(s) on which it is currently executing as well as the locality of execution of the process with respect to its required data. Generally, the higher the processing power of a node, the faster a process executes, while the closer a process is to its data sources, the lower the latency for communication, resulting in better performance. Also, depending on whether the application is computationally intensive or data intensive, it may be prudent to design a cluster with nodes that cater to each type of application.

³ We define an intelligent runtime environment to be a system that consists of scheduling, load balancing, fault tolerance and other mechanisms.
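The trade-off between processing power and data locality can be captured by a toy cost model: estimated completion time is compute time (work divided by node speed) plus transfer time (data volume divided by link bandwidth). All names, numbers and units below are illustrative assumptions:

```python
def est_time(work_flops, node_speed_flops, data_bytes, bandwidth_bps):
    """Estimated completion time: compute time plus data-transfer time."""
    return work_flops / node_speed_flops + data_bytes / bandwidth_bps

# A fast node far from the data can lose to a slower node close to it.
fast_far  = est_time(1e12, 2e10, 1e10, 1e8)   # 50 s compute + 100 s transfer
slow_near = est_time(1e12, 1e10, 1e10, 1e9)   # 100 s compute + 10 s transfer
```

In this fabricated example the slower, data-local node finishes first, which is exactly the kind of cost/benefit consideration an intelligent runtime environment must weigh.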

Scheduling problems are worsened by the fact that the cluster system is usually heterogeneous, comprising nodes with different processor speeds and resources, for example network connection speeds. Cluster systems may also contain nodes with varying numbers of processors. While attempting to produce a schedule statically is hard, imagine trying to schedule workloads dynamically while multiple jobs are executing on a cluster!
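A minimal sketch of dynamic scheduling on a heterogeneous cluster is a greedy earliest-finish-time policy: each arriving job goes to the node that would complete it soonest, given that node's relative speed and current queue. This is our own illustrative baseline, not any of the policies from the papers surveyed below:

```python
def schedule(jobs, speeds):
    """Greedily place jobs (work amounts) on nodes (name -> speed).

    Returns the per-job placement and the resulting makespan.
    """
    finish = {n: 0.0 for n in speeds}  # time at which each node frees up
    placement = []
    for work in jobs:
        # Earliest-finish-time choice across heterogeneous nodes.
        best = min(speeds, key=lambda n: finish[n] + work / speeds[n])
        finish[best] += work / speeds[best]
        placement.append(best)
    return placement, max(finish.values())
```

Even this simple policy shows why heterogeneity matters: a node twice as fast absorbs twice the work before a slower node becomes the better choice.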

3.1.1 Summary of Key Papers (Scheduling and Load Balancing)

The paper “Parallel Job Scheduling on Multicluster Computing Systems” [30] by J. Abawajy, S. Dandamudi et al. claims that the key to making cluster computing work well is using middleware technologies that can manage the policies, protocols, networks and job scheduling across the interconnected set of computing resources. The authors develop an online dynamic scheduling policy for heterogeneous cluster systems that reportedly performs better than space-sharing and time-sharing policies.

Authors H. Li, D. Groep et al.’s paper “Predicting job start times on clusters” [41] asserts that job start time predictions are useful for guiding resource selection and for balancing workload distribution in computational grids. The paper introduces a system for predicting job start times on clusters based on statistical analysis of historic job traces and simulation of site schedulers.
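
Li et al.’s actual predictor combines statistical analysis of traces with scheduler simulation; as a minimal sketch of the statistical half only, a new job’s queue wait could be estimated from the recorded waits of similar historical jobs. The job fields, similarity rule and quantile choice below are illustrative assumptions, not the paper’s method:

```python
from statistics import quantiles

def predict_start_delay(history, new_job, q=0.9):
    """Estimate queue-wait time for new_job from the waits of
    similar historical jobs (same queue, comparable node count)."""
    similar = [j["wait"] for j in history
               if j["queue"] == new_job["queue"]
               and abs(j["nodes"] - new_job["nodes"]) <= 2]
    if not similar:
        return None       # no comparable history: fall back to simulation
    if len(similar) < 4:
        return max(similar)  # too few samples for a stable quantile
    # conservative estimate: the q-quantile of observed waits
    return quantiles(similar, n=100)[int(q * 100) - 1]

history = [
    {"queue": "batch", "nodes": 8,  "wait": 120},
    {"queue": "batch", "nodes": 8,  "wait": 300},
    {"queue": "batch", "nodes": 10, "wait": 240},
    {"queue": "debug", "nodes": 1,  "wait": 5},
]
print(predict_start_delay(history, {"queue": "batch", "nodes": 8}))  # → 300
```

A production predictor would, as the paper suggests, refine such estimates by simulating the site scheduler’s actual policy.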

The benefit gained from splitting a job across many compute nodes, whether in a large cluster or a grid, depends on the communication patterns between processes, the workload on the communication links and the maximum bandwidth of the links. The paper “A Study on Job Co-Allocation in Multiple HPC Clusters” [45] by J. Qin and M. Bauer aims to understand the impact of communication on multi-processor jobs in order to develop scheduling strategies and job allocation algorithms for multi-cluster grids that can accommodate communication factors.

“Efficient Resource Description and High Quality Selection for Virtual Grids” [40] by Y. Kee, D. Logothetis et al. describes a novel resource selection and binding component that utilizes a resource description language and returns a set of selected and bound resources, otherwise known as a virtual grid. The authors claim that the resource description language introduced in the paper effectively captures application-level resource abstractions.

A dynamic resource management system for cluster systems called Cluster on Demand is explained in the paper “Dynamic Virtual Clusters in a Grid Site Manager” [35] by J. Chase, D. Irwin et al. Going a step further than running a VMM on each individual node, the Cluster on Demand system virtualizes an entire physical cluster, partitioning it into multiple virtual clusters. It allocates virtual clusters from a common server pool, with each virtual cluster being an independent entity with its own working environment. Targeted for use in a large computational grid environment, Cluster on Demand layers on top of existing grid middleware and acts as a site manager with dynamic policy-based allocation abilities. A very interesting question arises from this paper: what are the advantages and disadvantages of running a VMM on each individual node as opposed to aggregating the entire cluster system and running a VMM on top of it?

Selecting the hardware to build a cluster system from the vast range of options available while under budget constraints is not an easy task. In the paper “Robust Processor Allocation for Independent Tasks When Dollar Cost for Processors is a Constraint” [49], authors P. Sugavanam, H. Siegel et al. evaluate a set of six heuristics for identifying a set of machines whose total cost falls within an organization’s budget while meeting the computational needs of the tasks that would be run on it. The parameter the heuristics aim to optimize is the makespan, the time taken to complete a set of tasks. The heuristics are evaluated not only under ideal conditions but also in situations where performance degrades due to unpredictable circumstances. Of the six heuristics evaluated, the GENITOR heuristic performs best, closely followed by the Partition/Merge Greedy Iterative Maximization heuristic. Since, as the authors note, the general problem of optimally mapping tasks to machines in a HPC environment has been shown to be NP-complete, developing heuristics for scheduling and load balancing in intelligent runtime environments should also be considered.

“Xenoservers: Accountable Execution of Untrusted Programs” [47] by D. Reed, I. Pratt et al. introduces the concept of a global public infrastructure that provides execution services while charging the user for that service. The system was designed because the authors felt that many networked applications could benefit from executing closer to the data or services with which they interact, thereby avoiding the long communication latencies and overhead incurred by transferring data over congested or expensive network links. For our purposes, Section 2.2 of the paper, “Accounting and Charging for Resources”, provides good ideas for designing an accounting mechanism used to charge a user for the computing services provided. Candidate accounting mechanisms include counting the CPU cycles spent executing an application, counting the number of context switches caused by an application and tracking the amount of data transmitted over network devices. It may be worth incorporating such a system into an intelligent runtime environment that manages a HPC cluster which may be used commercially.
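
As a minimal sketch of how such an accounting mechanism might look: the counter set follows the mechanisms listed above, but the linear tariff structure and the per-unit prices are invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-unit prices; the paper does not prescribe a tariff.
PRICE = {"cpu_cycles": 1e-12, "context_switches": 1e-6, "net_bytes": 1e-9}

@dataclass
class UsageRecord:
    """Resource counters accumulated for one application or VM."""
    cpu_cycles: int = 0
    context_switches: int = 0
    net_bytes: int = 0

    def charge(self) -> float:
        # Total cost as a simple linear tariff over the counters.
        return (self.cpu_cycles * PRICE["cpu_cycles"]
                + self.context_switches * PRICE["context_switches"]
                + self.net_bytes * PRICE["net_bytes"])

r = UsageRecord(cpu_cycles=5_000_000_000, context_switches=2_000,
                net_bytes=10_000_000)
print(f"{r.charge():.4f}")  # → 0.0170
```

A real system would populate these counters from hypervisor-level instrumentation rather than trusting the guest to report them.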

3.2 Fault Tolerance

Even when processes are scheduled on a cluster optimally, failures of nodes or communication links can cause disruptions, forcing entire execution schedules to be recalculated. This can waste significant time and computing resources. Fault tolerance mechanisms aim to reduce computational wastage and downtime by dealing with such failures as gracefully as possible.

3.2.1 Summary of Key Papers (Fault Tolerance)

In “SPHINX: A Fault-Tolerant System for Scheduling in Dynamic Grid Environments” [55], authors J. In, P. Avery et al. propose a framework that effectively schedules work, in a fault-tolerant way, across a large number of distributed clusters owned by different organizations. This is achieved in spite of the highly dynamic nature of the grid and complex policy issues. The novelty of the system lies in its use of effective resource monitoring and job execution tracking mechanisms to make scheduling and fault tolerance decisions.

The paper “A Resource Manager for Optimal Resource Selection and Fault Tolerance Service in Grids” [57] by H. Lee, S. Chin et al. describes three mechanisms for enabling optimal performance on grid systems. Firstly, the authors introduce a resource manager that finds the optimal set of resources. Secondly, they develop a fault detector that detects the occurrence of resource failures. Finally, a fault manager guarantees that submitted jobs complete, and improves job execution performance through job migration even when failures occur.

Authors G. Lanfermann, G. Allen et al.’s paper “Nomadic Migration: Fault Tolerance in a Disruptive Grid Environment” presents a grid migration framework which provides an application with the ability to seek out and exploit remote computing resources by migrating tasks from site to site, dynamically adapting the application to a changing environment. A “Grid Object”, a collection of the key features of a general grid resource, is used to allow scalable information transfer, while fault tolerance is achieved through the use of a peer-to-peer approach.

3.3 Open Source Cluster Application Resources (OSCAR)

The Open Source Cluster Application Resources (OSCAR) project is a suite of cluster management tools and best practices. It allows a user to easily set up and administer a Beowulf cluster. The creators of OSCAR describe it as a snapshot of the best known methods for building, programming, and using HPC clusters [60].

A potential benefit of using OSCAR is simplified system administration and management of heterogeneous cluster systems. E. Focht’s paper “Heterogeneous Clusters with OSCAR: Infrastructure and Administration” [58] discusses design considerations for heterogeneous clusters, sketches the installation procedure and briefly describes administration issues with complex heterogeneous configurations. Although not directly related to virtual machines or intelligent runtime environments, ease of cluster management would also be highly beneficial to any project involving a cluster system.

Because new versions of operating systems are released frequently, repeatedly setting up test systems for use with OSCAR software becomes tedious. The paper “OSCAR Testing with Xen” [61] by G. Vallee and S. Scott describes the use of the Xen hypervisor to ease the testing of new OSCAR packages over heterogeneous systems with varying hardware and operating systems. This paper is worth reading to learn the strategies the authors used and to evaluate the possibility of adapting similar strategies for use with our projects.

4. Future Areas of Research

Thus far, we have conducted a literature survey covering the main areas related to our research projects. In this section we analyze the gathered information and list some problems which we feel have not been adequately covered in the current research literature and are worth tackling in the future.

4.1 Virtual Device Performance

Despite significant advances made with the Xen VMM and the ability to migrate virtual machines, one remaining weakness of the VMM is the poor performance of virtualized devices. For example, network I/O over a virtualized device is much slower than I/O over the real network device. Authors A. Menon, A. Cox et al.’s paper “Optimizing Network Virtualization in Xen” [27] claims that performance degrades by a factor of 2 to 3 for receive workloads, while transmit workloads degrade by up to a factor of five. Fortunately, the authors go on to present methods to optimize virtual network interfaces by optimizing the implementation of the data transfer path between guest and driver domains and by modifying Xen to provide support for guest operating systems to effectively utilize advanced virtual memory features such as superpages and global page mappings.

A paper similar to the one mentioned above is “Virtualizing I/O Devices on VMware Workstation’s Hosted Virtual Machine Monitor” [28] by J. Sugerman, G. Venkitachalam et al. Although VMware Workstation rather than the Xen VMM was used, the authors also reported poorer performance when using virtualized devices due to additional virtualization overheads. Some suggestions made by the authors to reduce virtualization overheads are: modifying the guest OS, optimizing the guest driver protocol, modifying the host OS and bypassing the host OS.

While the main focus of ANU’s HPC research group is not OS research, it would be worthwhile to invest some time investigating additional optimizations that could reduce the cost of using a virtualized network device with the Xen hypervisor, as the efficiency of virtualized network devices significantly affects the overall performance of a cluster computer.

4.2 Communication Pathways

It is common for nodes in a cluster system to have multiple network interfaces. By aggregating the interfaces, an increase in network performance can be achieved. A paper that deals with this topic is “Performance Enhancement of SMP Clusters with Multiple Network Interfaces using Virtualization” [79] by P. Strazdins, R. Alexander et al. A key finding is that with one communication stream per process pair, configuring streams to run independently across interfaces generally yielded significantly better performance than channel bonding. However, channel bonding gained ground as the number of streams and communication intensity increased. The best performance gains came from configurations using Xen virtualized hosts with VMM-bypass for network I/O.

Even with the latest version of the Xen VMM (3.0.3), there is still no way to bypass the Xen VMM for intranode communication. Intranode bypass would be highly valuable in nodes containing multiple CPUs that need to communicate with each other. Intranode communication between Xen guests on the same node is currently an order of magnitude slower than the native shared memory transport, as shown in [79].

4.3 Network Connectivity after Migration

Virtual machines usually share a bridged connection with the host machine, resulting in a one-to-one mapping of an IP address to a MAC address. When a virtual machine is migrated to a new host computer, the IP address to MAC address mapping is no longer valid, and packets destined for the virtual machine can no longer reach it.

The first approach that comes to mind is Mobile IP [81]. However, Mobile IP is not a very viable solution, as the host node from which a virtual machine has just been migrated would have to be kept online to continually forward packets to the new host node on which the virtual machine now resides. Should the old host node be taken offline, the network connection is broken. Furthermore, if the virtual machine is repeatedly migrated, packets would have to be forwarded through multiple nodes, wasting a significant amount of bandwidth.

An alternative approach, suggested by authors A. Snoeren and H. Balakrishnan in their paper “An End-to-End Approach to Host Mobility” [82], would be to allocate each virtual machine a token that it can use to identify itself against a central DNS server. In this way, end-to-end connectivity is maintained even after migration has occurred. It would be worth exploring the feasibility of such a solution and researching other possible mechanisms for maintaining network connectivity.

4.4 Memory Management

With the possibility of multiple OSes running simultaneously on a single node, efficient memory management, and in particular the reclamation of unused pages, becomes an important aspect of a VMM. There are two benefits to keeping the memory used by a virtual machine to a minimum. The first and most obvious benefit is that other virtual machines gain the option of using the freed memory if required. The second benefit is that the virtual machine migration process is sped up, as a smaller working set needs to be migrated.

“Memory Resource Management in VMware ESX Server” [29] by C. Waldspurger discusses techniques for the management of memory in virtual machines. The three memory management techniques introduced in the paper are: a ballooning technique that reclaims memory from a VM by implicitly causing the guest OS to invoke its own memory management algorithms; an idle memory tax that addresses a problem with share-based management of space-shared resources; and content-based page sharing and hot I/O page remapping techniques that reduce I/O copying overheads in large memory systems.

4.5 File Systems

Consideration must be given to the type of file system that the cluster system uses and the tradeoffs involved. For example, a distributed file system could ease the burden of migrating entire virtual hard disk drive (HDD) files across the network during migration operations, with only the runtime environment of an operating system needing to be migrated. An alternative design may be for each node in the cluster to have its own local storage device, which would better support the possibility of each node doubling as a compute and server node.

A classic paper that the reader is recommended to read is “The Google File System” [80] by S. Ghemawat, H. Gobioff et al. The file system presented is a scalable distributed file system for large distributed data-intensive applications that provides fault tolerance while running on inexpensive commodity hardware. Furthermore, this file system has been shown to be practical and workable, as it is widely deployed within Google as the storage platform for the generation and processing of data.

4.6 Profiling of Hardware and Software

Being able to accurately profile both the hardware of the cluster and the applications that would run on it is crucial to fully utilizing an intelligent runtime environment: the more accurately hardware capabilities and software requirements are understood, the more accurate the mapping of applications to nodes will be. This section discusses in detail some papers related to the profiling of hardware and applications.

“Computational Workload Prediction for Grid Oriented Industrial Applications: The Case of 3D-Image Rendering” [66] by A. Litke, K. Tserpes et al. describes a module for predicting the computational workloads of jobs assigned for execution on grids. The module aims to identify the complexity of a given job and predict the computational power it would require from the grid infrastructure. The module is designed around a trained artificial neural network. Accurately predicting the computational workload of a given application is a significant aspect of accurately profiling software.

The characteristics of the workload on the ASCI Blue-Pacific are described in the paper “The Characteristics of Workload on ASCI Blue-Pacific at Lawrence Livermore National Laboratory” [70] by A. Yoo and M. Jette. The paper presents some interesting findings, for example that there is little correlation between a job’s execution time and its resource demands. Another interesting finding is that most jobs have very small node and memory requirements; small memory requirements strongly favor multiprogramming of parallel jobs.

In the paper “CPU Load Predictions on the Computational Grid” by Y. Zhang, W. Sun et al., a prediction strategy is described that forecasts the future CPU load of individual nodes in a cluster based on previous workload history. The CPU load forecast is calculated independently on each node; when passed to a scheduler, it allows the scheduler to calculate the expected future workload on a particular subset of nodes and thereby make intelligent, optimized decisions about how to schedule a given workload.
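
The paper’s specific forecasting model is not reproduced here; as a simple baseline illustrating the general idea, each node could exponentially smooth its recent load history to produce a one-step-ahead forecast (the smoothing factor below is an arbitrary choice):

```python
def forecast_load(history, alpha=0.5):
    """One-step-ahead CPU load forecast via exponential smoothing:
    each new observation is blended with the running estimate."""
    estimate = history[0]
    for load in history[1:]:
        estimate = alpha * load + (1 - alpha) * estimate
    return estimate

# Recent load samples for one node (fraction of CPU busy).
node_history = [0.20, 0.40, 0.60, 0.50]
print(round(forecast_load(node_history), 3))  # → 0.475
```

A scheduler would collect one such forecast per node and sum them over a candidate subset of nodes to estimate that subset’s expected future workload.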

The performance of a cluster system depends largely on how the data on which it operates is partitioned among its nodes. It is therefore necessary to strive for an optimal partitioning scheme, since a suboptimal scheme incurs unnecessary communication cost overhead between nodes. The paper “Communication Cost Analysis for Parallel Networks” [62] by P. Crandall and M. Quinn mathematically characterizes the communication costs of five partitioning methods: scatter, contiguous point, contiguous row, interleaved and block decomposition. These partitioning methods are analyzed under varying problem sizes and cluster configurations. Cost equations expressing the expected performance of the various options are then established, enabling the reader to make better informed data decomposition decisions.
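
The paper’s actual cost equations are not reproduced here, but the flavor of such analysis can be illustrated with the textbook per-node communication volumes of contiguous-row versus block decomposition for a nearest-neighbour computation on an n x n grid (assuming, for simplicity, that p is a perfect square and that sqrt(p) divides n):

```python
import math

def row_comm(n, p):
    """Boundary cells exchanged by an interior node under a
    contiguous-row decomposition: two full rows of n cells
    (independent of p for interior nodes)."""
    return 2 * n

def block_comm(n, p):
    """Boundary cells exchanged by an interior node under a
    sqrt(p) x sqrt(p) block decomposition: four edges of
    n / sqrt(p) cells each."""
    s = math.isqrt(p)
    return 4 * (n // s)

n, p = 1024, 16
print(row_comm(n, p), block_comm(n, p))  # → 2048 1024
```

As the output suggests, block decomposition exchanges less boundary data per node once p grows, which is the kind of trade-off the paper’s cost equations quantify precisely.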

Authors L. Hu and I. Gorton’s paper “Performance Evaluation for Parallel Systems: A Survey” [64] describes and reviews the various techniques used in evaluating the performance of a parallel software system. These techniques include measurement, simulation and analytical modeling. Other issues, such as the selection of metrics most suited to a particular setup and how to construct a good model, are also covered. For issues dealing with workload characterization, the reader may wish to jump straight to Section 4.2 of the paper, which describes the various steps and techniques associated with it.

Benchmarking and performance evaluation of high performance computing systems is an ongoing process that not only consumes significant staff time but also generates a large amount of data that needs to be stored and analyzed. To help deal with this situation, authors S. Simon, M. Courson et al. describe an automated system that creates an extensive performance profile of a cluster system in their paper “Automated Techniques for Performance Evaluation of Parallel Systems” [67]. Their system is composed of three main components: a data collection and storage module, a data analysis module and a runtime execution module.

The paper “Performance Evaluation of Parallel Systems” [63] by authors P. Cremonesi, E. Rosti et al. provides a thorough analysis of current performance evaluation methodologies developed for use in the evaluation of a cluster system. The paper concentrates on performance case studies of individual cluster system components such as the processor, memory, interconnection network, I/O and the operating system. The multiple case studies the paper contains may be a useful aid in the design of a profiling system.

In the paper “MPIPP: An Automatic Profile-guided Parallel Process Placement Toolset for SMP Clusters and Multiclusters” by H. Chen, W. Chen et al., the authors propose a fully automatic scheme for optimized process placement in cluster systems. They use a profile-guided approach to automatically find an optimized mapping that reduces the cost of point-to-point communication for arbitrary message passing applications. The implemented toolset, the MPI Process Placement toolset, consists of three components: a tool to obtain the communication profile of an MPI application, a tool to obtain the network topology of target clusters, and an algorithm to find an optimized mapping. Of particular interest is the communication profiling component, as our characterization system also plans to incorporate such a component.

The characterization system that our research group plans to design could employ techniques such as hardware performance counters, instrumentation of the MPI library and, to some extent, scalability prediction models. Further details of our project can be found at [69].

4.7 Usage of Economic Theory

Economics has been defined as the study of how agents make choices in the face of constraints. One of the uses of economics is to understand how economies work and what the relations are between the main economic players and institutions. The branch of economics most relevant to the development of an intelligent runtime environment is microeconomics, the study of how individuals, households and firms make decisions to allocate limited resources [76].

Bearing this definition of economics in mind, it is not too far-fetched an idea to spend time investigating economic theory and its applicability to designing algorithms for intelligently managing a large scale cluster or grid system. Such a system has a limited amount of resources while being obliged to allocate those resources to competing parties.

“A Framework for Measuring Supercomputer Productivity” by M. Snir and D. Bader [68] describes using the economic concepts of productivity and utility theory to measure the productivity of a HPC system. The authors describe how these concepts capture essential aspects of a HPC system, such as the importance of time-to-solution and the trade-off between programming time and execution time.

Additional economics-related papers that the interested reader is encouraged to look at are: “Selfish grid computing: game-theoretic modeling and NAS performance results” [72], “An economic-based resource management framework in the grid context” [73], “Employing economics to achieve fairness in usage policing of cooperatively shared computing resources” [74] and “An Economic Approach to Adaptive Resource Management” [75].

A computer science research group that applies economic theory to computer science research is Harvard University’s EconCS Group [89].

5. Highly Recommended Resources

In this section, we cover some papers that, while not directly related to the issues of virtual machines and intelligent runtime environments, are highly relevant to the overall picture of high performance computing, parallel systems, distributed systems and cluster computing.

In “Issues and Directions in Scalable Parallel Computing” [68] by Marc Snir, the author makes an important point which the reader should bear in mind: success in scalable parallel computing depends on our ability to develop programming interfaces that can be conveniently used on systems ranging from a few interconnected workstations to massively parallel computers. He claims that while scalable machines can be built, the question is whether we can program them in a uniform fashion. In short, Snir asserts that the main challenge in developing parallel computing systems lies in the development of scalable software systems.

The authors also recommend the paper “An Overview of Heterogeneous High Performance and Grid Computing” [77] by J. Dongarra and A. Lastovetsky. Though slightly dated, this paper still gives the reader a good introduction to the hardware platforms, typical uses, programming issues, programming systems and applications related to heterogeneous parallel and distributed computing, as well as the state of research in those areas.

6. Conclusion

Through the process of conducting this survey, we found that there has been, to date, a reasonable amount of work published on virtual machines and intelligent runtime environments.

Virtual machine technology appears to be reasonably well defined, and the current trend of virtual machine research seems to be headed in the direction of adding new features to, and optimizing, current VMM technology. As virtual machines become recognized as important components of today’s computing environment, hardware manufacturers are including virtualization support in their processors. With regard to virtual machine technology for use in HPC environments, there are still areas to be worked on. One example would be modifying the Xen VMM to allow bypass of the hypervisor for intranode communication. Another missing capability is keeping network connectivity intact while migration occurs across subnets of a cluster environment.

On the other hand, the area of intelligent runtime environments, and in particular its scheduling component, is still very open ended and a fertile source of research problems. While much work has been published to date, there is still no definitive or widely adopted set of procedures for scheduling workloads. One possible explanation is that, due to the large amount of heterogeneity in the systems that can be built, no single scheduling algorithm or heuristic can cater to all possible scenarios. Similarly, profiling of hardware and software largely remains an unsolved problem and is an area that, while very challenging, would represent a significant breakthrough in HPC research once successfully tackled.

With ANU’s funding, equipment, research history and expertise in this area, we seem to be in a good position to develop world class solutions to relevant research problems.

7. References and Key Papers

This section has a twofold purpose. Firstly, it serves as the bibliography for this survey. Secondly, it doubles as a catalog of additional relevant papers that the authors feel would be useful to the reader. The papers listed are grouped by topic for easy reference.

7.1 Virtual Machine Technology

[1] AMD. AMD Virtualization. http://www.amd.com/us-en/Processors/ProductInformation/0,,30_118_8826_14287,00.html. Accessed Jan 2007.

[2] A. Awadallah and M. Rosenblum. The vMatrix: A Network of Virtual Machine Monitors for Dynamic Content Distribution. In Proceedings of the 10th IEEE Workshop on Future Trends of Distributed Computer Systems, Suzhou, China, 2004.

[3] P. Barham, B. Dragovic et al. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, Oct. 2003, NY, USA.

[4] E. Bugnion, S. Devine et al. Disco: Running Commodity Operating Systems on Scalable Multiprocessors. ACM Transactions on Computer Systems, 15(4):412-447, November 1997.

[5] P. Chen and B. Noble. When Virtual is Better than Real. In Proceedings of the 8th IEEE Workshop on Hot Topics in Operating Systems, May 2001, Germany.

[6] B. Clark, T. Deshane et al. Xen and the Art of Repeated Research. FREENIX 2004, 2004, Boston, USA.

[7] T. Garfinkel, B. Pfaff et al. Terra: A Virtual Machine-Based Platform for Trusted Computing. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003, Bolton Landing, NY, USA.

[8] R. Goldberg. Survey of Virtual Machine Research. IEEE Computer, pages 34-45, June 1974.

[9] W. Huang, J. Liu et al. A Case for High Performance Computing with Virtual Machines. In Proceedings of the 20th ICS, 2006, Queensland, Australia.

[10] Intel Corporation. Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization. http://www.intel.com/technology/itj/2006/v10i3/1-hardware/5-architecture.htm. Accessed Jan 2007.

[11] S. King, G. Dunlap et al. Operating System Support for Virtual Machines. In Proceedings of the 2003 Annual USENIX Technical Conference, June 2003, Texas, USA.

[12] M. Rosenblum and T. Garfinkel. Virtual Machine Monitors: Current Technology and Future Trends. IEEE Computer, 38(5):39-47, 2005.

[13] J. Reumann, A. Mehra et al. Virtual Services: A New Abstraction for Server Consolidation. In Proceedings of the USENIX Annual Technical Conference, General Track, 2000, San Diego, USA.

[14] R. Uhlig, G. Neiger, D. Rodgers, A. L. Santoni, F. C. M. Martins, A. V. Anderson, S. M. Bennett, A. Kagi, F. H. Leung and L. Smith. Intel Virtualization Technology. IEEE Computer, 38(5):48-56, May 2005.

7.2 Virtual Machine Migration

[15] C. Clark, K. Fraser et al. Live Migration of Virtual Machines. In Networked Systems Design and Implementation, 2005.

[16] F. Douglis and J. Ousterhout. Transparent Process Migration: Design Alternatives and the Sprite Implementation. Software Practice and Experience, 21(7), July 1991.

[17] J. Elson. Motivations for Process Migration. http://www.circlemud.org/~jelson/writings/process-migration/node3.html. Accessed Jan 2007.

[18] J. Hansen and E. Jul. Self-migration of Operating Systems. In Proceedings of the 11th ACM SIGOPS European Workshop (EW2004), pages 126-130, 2004.

[19] M. Kozuch and M. Satyanarayanan. Internet Suspend/Resume. In Proceedings of the 4th IEEE Workshop on Mobile Computing Systems and Applications, Callicoon, NY, June 2002.

[20] Y. Li and Z. Lan. A Novel Workload Migration Scheme for Heterogeneous Distributed Computing. In Proceedings of the 5th IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 2, pages 1055-1062, 2005.

[21] S. Osman, D. Subhraveti et al. The Design and Implementation of Zap: A System for Migrating Computing Environments. In Proceedings of the 5th OSDI, 2002, Boston, USA.

[22] C. Sapuntzakis, R. Chandra et al. Optimizing the Migration of Virtual Computers. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI 2002), ACM Operating Systems Review, Winter 2002 Special Issue, pages 377-390, Boston, MA, USA, Dec. 2002.

[23] Sun Microsystems. Solaris 10 Datasheets. http://www.sun.com/software/solaris/ds/utilization.jsp. Accessed Jan 2007.

[24] S. S. Vadhiyar and J. J. Dongarra. A Performance Oriented Migration Framework for the Grid. In Proceedings of the 3rd International Symposium on Cluster Computing and the Grid, page 130, 2003.

[25] G. Vallée, C. Morin, R. Lottiaux, J.-Y. Berthou and I. Dutka Malen. Process Migration Based on Gobelins Distributed Shared Memory. In Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02), page 325, 2002.

[26] Wikipedia, the free encyclopedia. Solaris Containers. http://en.wikipedia.org/wiki/Solaris_Containers. Accessed Jan 2007.

7.3 Additional Resources for Virtual Machines

[27] A. Menon, A. Cox et al. Optimizing Network Virtualization in Xen. In Proceedings of the USENIX Annual Technical Conference, 2006, Boston, USA.

[28] J. Sugerman, G. Venkitachalam et al. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the USENIX Annual Technical Conference, 2001.

[29] C. Waldspurger. Memory Resource Management in VMware ESX Server. In Proceedings of OSDI, Boston, 2002.

7.4 Scheduling and Load Balancing

[30] J. H. Abawajy, S. P. Dandamudi, "Parallel Job Scheduling on Multicluster Computing Systems," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 11, 2003.

[31] K. Amiri, D. Petrou et al. Dynamic Function Placement for Data-intensive Cluster Computing. In Proceedings of the General Track, 2000 USENIX Annual Technical Conference, 2000, San Diego, USA.

[32] J. Barbosa, C. Morais et al. Static Scheduling of dependent parallel tasks on heterogeneous clusters. In Proceedings of Cluster 2005, 2005, Boston, Massachusetts.

[33] M. Beltran, J. L. Bosque, "Information policies for load balancing on heterogeneous systems," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 2, pp. 970-976, 2005.

[34] A. Chandra, M. Adler et al. Surplus Fair Scheduling: A Proportional-Share CPU Scheduling Algorithm for Symmetric Multiprocessors. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation, 2000, San Diego, California.

[35] J. Chase, D. Irwin et al. Dynamic Virtual Clusters in a Grid Site Manager. In Proceedings of the International Symposium on High-Performance Distributed Computing, 2003, Washington.

[36] H. Chen, W. Chen et al. MPIPP: An automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. In Proceedings of the 20th International Conference on Supercomputing, 2006, Cairns, Australia.

[37] M. Dobber, G. Koole, R. van der Mei, "Dynamic load balancing experiments in a grid," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 2, pp. 1063-1070, 2005.

[38] Ligang He, Stephen A. Jarvis, Daniel P. Spooner, Graham R. Nudd, "Dynamic Scheduling of Parallel Real-Time Jobs by Modelling Spare Capabilities in Heterogeneous Clusters," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 2, 2003.

[39] Dimitrios Katramatos, Marty Humphrey, Cheol-Min Hwang, Steve J. Chapin, "Developing a Cost/Benefit Estimating Service for Dynamic Resource Sharing in Heterogeneous Clusters: Experience with SNL Clusters," in Proceedings of the 1st International Symposium on Cluster Computing and the Grid, p. 355, 2001.

[40] Yang-Suk Kee, D. Logothetis, R. Huang, H. Casanova, A. A. Chien, "Efficient resource description and high quality selection for virtual grids," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 1, pp. 598-606, 2005.

[41] Hui Li, D. Groep, J. Templon, L. Wolters, "Predicting job start times on clusters," in Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGrid'04), pp. 301-308, 2004.

[42] Jiadao Li, Ramin Yahyapour, "Learning-Based Negotiation Strategies for Grid Scheduling," in Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp. 576-583, 2006.

[43] S. McClure and R. Wheeler. MOSIX: How Linux Clusters Solve Real World Problems. In Proceedings of the USENIX Annual Technical Conference, 2000, San Diego, USA.

[44] D. Petrou, J. Milford et al. Implementing Lottery Scheduling: Matching the Specializations in Traditional Schedulers. In Proceedings of the 1999 USENIX Annual Technical Conference, June 1999, Monterey, California.

[45] Jinhui Qin, Michael Bauer, "A Study on Job Co-Allocation in Multiple HPC Clusters," in Proceedings of the 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment (HPCS'06), p. 3, 2006.

[46] Xiao Qin, Hong Jiang, Yifeng Zhu, David R. Swanson, "Towards Load Balancing Support for I/O-Intensive Parallel Jobs in a Cluster of Workstations," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 100, 2003.

[47] D. Reed, I. Pratt. Xenoservers: Accountable Execution of Untrusted Programs. In Proceedings of HotOS, 1999.

[48] Hermes Senger, Liria Matsumoto Sato, "Load Distribution for Heterogeneous and Non-Dedicated Clusters Based on Dynamic Monitoring and Differentiated Services," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 199, 2003.

[49] H. Siegel, A. Maciejewski et al. Robust Processor Allocation for Independent Tasks When Dollar Cost for Processors is a Constraint. In Proceedings of Cluster 2005, 2005, Boston, Massachusetts.


[50] D. Sullivan, R. Haas et al. Tickets and Currencies Revisited: Extensions to Multi-Resource Lottery Scheduling. In Proceedings of the 7th Workshop on Hot Topics in Operating Systems, 1999, Rio Rico, USA.

[51] B. Urgaonkar, P. Shenoy et al. Resource Overbooking and Application Profiling in Shared Hosting Platforms. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation, 2002, Boston, Massachusetts.

[52] Ming Wu, Xian-He Sun, "A General Self-Adaptive Task Scheduling System for Non-Dedicated Heterogeneous Computing," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 354, 2003.

[53] Yang Zhang, Anirban Mandal, Henri Casanova, Andrew A. Chien, Yang-Suk Kee, Ken Kennedy, Charles Koelbel, "Scalable Grid Application Scheduling via Decoupled Resource Selection and Scheduling," in Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp. 568-575, 2006.

7.5 Fault Tolerance

[54] M. Mat Deris, M. Rabiei, A. Noraziah, H. M. Suzuri, "High Service Reliability for Cluster Server Systems," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 281, 2003.

[55] Jang-uk In, Paul Avery, Richard Cavanaugh, Laukik Chitnis, Mandar Kulkarni, Sanjay Ranka, "SPHINX: A Fault-Tolerant System for Scheduling in Dynamic Grid Environments," in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), p. 12b, 2005.

[56] Gerd Lanfermann, Gabrielle Allen, Thomas Radke, Edward Seidel, "Nomadic Migration: Fault Tolerance in a Disruptive Grid Environment," in Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02), p. 280, 2002.

[57] H. Lee, S. Chin et al., "A resource manager for optimal resource selection and fault tolerance service in Grids," in Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid (CCGrid'04), pp. 572-579, 2004.

7.6 Open Source Cluster Application Resources (OSCAR)

[58] Erich Focht, "Heterogeneous Clusters with OSCAR: Infrastructure and Administration," in Proceedings of the 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment (HPCS'06), p. 37, 2006.

[59] Chokchai Leangsuksun, Lixin Shen, Tong Liu, Hertong Song, Stephen L. Scott, "Availability Prediction and Modeling of High Availability OSCAR Cluster," in Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03), p. 380, 2003.

[60] OSCAR - Open Source Cluster Application Resources. http://oscar.openclustergroup.org. Accessed Jan 2007.

[61] Geoffroy Vallée, Stephen L. Scott, "OSCAR Testing with Xen," in Proceedings of the 20th International Symposium on High-Performance Computing in an Advanced Collaborative Environment (HPCS'06), p. 43, 2006.

7.7 Profiling of Applications

[62] P. Crandall and M. Quinn. "Communication Cost Analysis for Parallel Networks". Technical Report 94-80-04, 1994, Oregon State University, OR, USA.

[63] P. Cremonesi, E. Rosti et al. "Performance Evaluation of Parallel Systems". Elsevier Preprint, Jan 1999.

[64] L. Hu and I. Gorton. "Performance Evaluation for Parallel Systems: A Survey". UNSW-CSE-TR-9709, October 1997.

[65] B. Huang, M. Bayer et al. Hpcbench - a Linux-based network benchmark for high performance networks. In Proceedings of the 19th Symposium on High Performance Computing Systems and Applications (HPCS 2005), 2005.


[66] A. Litke, K. Tserpes, T. Varvarigou, "Computational workload prediction for grid oriented industrial applications: the case of 3D-image rendering," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 2, pp. 962-969, 2005.

[67] S. Simon, M. Courson et al. "Automated Techniques for Performance Evaluation of Parallel Systems". EURESCO99, Spain, June 1999.

[68] M. Snir and D. Bader. "A Framework for Measuring Supercomputer Productivity". International Journal for High Performance Computing Applications, 18: 399-416, 2004.

[69] Peter Strazdins - Characterization of Application Performance of Cluster Computers. http://cs.anu.edu.au/~Peter.Strazdins/postgrad/PerfEvalMethod-Apps.html. Accessed Feb 2007.

[70] Andy B. Yoo, Morris A. Jette, "The Characteristics of Workload on ASCI Blue-Pacific at Lawrence Livermore National Laboratory," in Proceedings of the 1st International Symposium on Cluster Computing and the Grid, p. 295, 2001.

[71] Yuanyuan Zhang, Wei Sun, Yasushi Inoguchi, "CPU Load Predictions on the Computational Grid," in Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID'06), pp. 321-326, 2006.

7.8 Economic Approaches

[72] Y.-K. Kwok, S. Song, K. Hwang, "Selfish grid computing: game-theoretic modeling and NAS performance results," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 2, pp. 1143-1150, 2005.

[73] Chuliang Weng, Minglu Li, Xinda Lu, Qianni Deng, "An economic-based resource management framework in the grid context," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 1, pp. 542-549, 2005.

[74] P. Xavier, Wentong Cai, Bu-Sung Lee, "Employing economics to achieve fairness in usage policing of cooperatively shared computing resources," in Proceedings of the Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'05), Volume 1, pp. 326-333, 2005.

[75] N. Stratford and R. Mortier. An Economic Approach to Adaptive Resource Management. In Proceedings of the 7th Workshop on Hot Topics in Operating Systems, 1999, Rio Rico, USA.

[76] Wikipedia, the free Encyclopedia - Economics. http://en.wikipedia.org/wiki/Economics. Accessed Jan 2007.

7.9 Miscellaneous

[77] J. Dongarra and A. Lastovetsky. An Overview of Heterogeneous High Performance and Grid Computing, 2004, Nova Science Publishers Inc. To Appear.

[78] M. Snir. Issues and Directions in Scalable Parallel Computing. In Proceedings of the 12th ACM Symposium on Principles of Distributed Computing (PODC), 1993, NY, USA.

[79] P. Strazdins, R. Alexander et al. "Performance Enhancement of SMP Clusters with Multiple Network Interfaces using Virtualization". XenHPC 06, Sorrento, 2006.

[80] S. Ghemawat, H. Gobioff et al. The Google File System. In Proceedings of the 19th ACM Symposium on Operating System Principles, 2003, New York, USA.

[81] C. E. Perkins and A. Myles. Mobile IP. In Proceedings of the International Telecommunications Symposium, pages 415-419, 1997.

[82] A. Snoeren and H. Balakrishnan. An End-to-End Approach to Host Mobility. In Proceedings of the 6th Annual International Conference on Mobile Computing and Networking, pages 155-166. ACM Press, 2000.

[83] The Jabberwocky Project. Research Projects - Project Summaries. http://jabberwocky.anu.edu.au. Accessed Jan 2007.

[84] Top 500 Supercomputing Sites. http://www.top500.org/stats. Accessed Feb 2007.


7.10 Notable Research Groups

[85] Carnegie Mellon University Parallel Data Lab. Abacus: Dynamic Function Placement for Data-Intensive Cluster Computing. http://www.pdl.cmu.edu/Abacus/index.html. Accessed Jan 2007.

[86] Carnegie Mellon University Parallel Data Lab. Batchactive Scheduling. http://www.pdl.cmu.edu/batchactive. Accessed Jan 2007.

[87] Columbia University Network Computing Lab. Zap: Migration of Legacy and Network Applications. http://www.ncl.cs.columbia.edu/research/migrate. Accessed Jan 2007.

[88] Duke University Internet Systems and Storage Group. Cluster on Demand. http://issg.cs.duke.edu/cod. Accessed Jan 2007.

[89] Harvard University EconCS Group. http://www.eecs.harvard.edu/econcs/. Accessed Jan 2007.

[90] Massachusetts Institute of Technology Supercomputing Technologies Group. Adaptive Scheduling of Parallel Jobs. http://supertech.csail.mit.edu/dynamicProcAlloc.html. Accessed Jan 2007.

[91] University of California - Berkeley Reliable, Adaptive and Distributed Systems Laboratory. Workload characterization and generation. http://radlab.cs.berkeley.edu/wiki/Workload_characterization_and_generation. Accessed Jan 2007.

[92] University of California at San Diego Concurrent Systems Architecture Group. Virtual Grids. http://www-csag.ucsd.edu/projects/VGrADs. Accessed Jan 2007.

[93] University of California at San Diego Concurrent Systems Architecture Group. High Performance Virtual Machines. http://www-csag.ucsd.edu/projects/hpvm.html. Accessed Jan 2007.

[94] University of Maryland - College Park Distributed Systems Software Lab. Active Harmony. http://www.dyninst.org/harmony. Accessed Jan 2007.

[95] University of Southern California Computer Networks Division. The Scalable Computing Infrastructure Project. http://gost.isi.edu/projects/scope. Accessed Jan 2007.

[96] University of Wisconsin-Madison Operating Systems Research Group. Condor: High Throughput Computing. http://www.cs.wisc.edu/condor. Accessed Jan 2007.


Appendix A – Sources of Information

Conferences (in alphabetical order)

• CCGRID 2006, 2005, 2004, 2003, 2002, 2001
• Cluster 2004, 2003, 2002, 2001
• HotOS 2003, 2001, 1999, 1997, 1995, 1993
• HPCA 2005, 2004, 2003, 2002, 2001, 2000
• HPCS 2006, 2005, 2004, 2002
• ICDCS 2006, 2005, 2004
• ICS 2006, 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1998, 1997, 1996, 1995
• Micro 2005, 2004, 2003, 2002, 2001, 2000, 1999, 1997, 1996, 1995
• OSDI 2004, 2002, 2000, 1999, 1996, 1994
• SOSP 2005, 2003, 2001, 1999, 1997, 1995, 1993
• USENIX Technical Conference (including FREENIX Track where applicable) 2004, 2003, 2002, 2001, 2000, 1999, 1996, 1995, Summer 1994, Winter 1994

Journals (in alphabetical order)

• Communications of the ACM Volume 49 (2006), Volume 48 (2005)
• IEEE Computer Volume 39 (2006), Volume 38 (2005)
• IEEE Transactions on Computers Volume 56 (2007), Volume 55 (2006)

University Research Groups (in alphabetical order)

• Brown University
• California Institute of Technology
• Carnegie Mellon University
• Columbia University
• Cornell University
• Duke University
• Harvard University
• Massachusetts Institute of Technology
• Princeton University
• Purdue University - West Lafayette
• Rice University
• Stanford University
• University of California - Berkeley
• University of California - Los Angeles
• University of California - San Diego
• University of Illinois - Urbana-Champaign
• University of Maryland - College Park
• University of Massachusetts - Amherst
• University of Michigan - Ann Arbor
• University of North Carolina - Chapel Hill
• University of Pennsylvania
• University of Southern California
• University of Texas - Austin
• University of Washington
• University of Wisconsin - Madison
• Yale University


Appendix B - Notable Research Groups and Projects

University of California - San Diego: Concurrent Systems Architecture Group, http://www-csag.ucsd.edu
• Virtual Grids (VM Migration, Cluster Scheduling and Intelligent Runtime Environments)
• High Performance Virtual Machines (Virtualization, Scheduling and Intelligent Runtime Environments)

University of Maryland - College Park: Distributed Systems Software Lab, http://www.cs.umd.edu/projects/dssl
• Active Harmony (Scheduling and Intelligent Runtime Environments)

Carnegie Mellon University: Parallel Data Lab, http://www.pdl.cmu.edu
• Abacus (Scheduling, Intelligent Runtime Environments and Load Balancing)
• Batchactive Scheduling (Scheduling and Intelligent Runtime Environments)

University of Wisconsin - Madison: UW Operating Systems, http://www.cs.wisc.edu/areas/os
• Condor (Scheduling, Intelligent Runtime Environments and High Throughput Computing)

Duke University: Internet Systems and Storage Group, http://issg.cs.duke.edu
• Cluster on Demand (VM Migration, Scheduling and Intelligent Runtime Environments)

Columbia University: Network Computing Lab, http://www.ncl.cs.columbia.edu
• Zap: Migration of Legacy and Network Applications (VM Migration)

University of Southern California: Computer Networks Division, http://www.isi.edu/divisions/div7
• The Scalable Computing Infrastructure Project (Scheduling and Intelligent Runtime Environments)

University of California - Berkeley: Reliable, Adaptive and Distributed Systems Laboratory, http://radlab.cs.berkeley.edu/wiki/RAD_Lab
• Workload characterization and generation (Characterization of Applications)

Massachusetts Institute of Technology: Supercomputing Technologies Group, http://supertech.csail.mit.edu
• Adaptive Scheduling of Parallel Jobs (Scheduling and Intelligent Runtime Environments)