
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2010; 22:1990–2011
Published online 1 September 2008 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/cpe.1370

Maximizing revenue in Grid markets using an economically enhanced resource manager

M. Macías1,∗,†, O. Rana2, G. Smith3, J. Guitart1,4 and J. Torres1,4

1Barcelona Supercomputing Center, Jordi Girona 29, Barcelona, Spain
2Cardiff University, Wales, U.K.
3University of Reading, Barcelona, Spain
4Technical University of Catalonia, Barcelona, Spain

SUMMARY

Traditional resource management has had as its main objective the optimization of throughput, based on parameters such as CPU, memory, and network bandwidth. With the appearance of Grid markets, new variables that determine economic expenditure, benefit and opportunity must be taken into account. The Self-organizing ICT Resource Management (SORMA) project aims at allowing resource owners and consumers to exploit market mechanisms to sell and buy resources across the Grid. SORMA’s motivation is to achieve efficient resource utilization by maximizing revenue for resource providers and minimizing the cost of resource consumption within a market environment. An overriding factor in Grid markets is the need to ensure that the desired quality of service levels meet the expectations of market participants. This paper explains the proposed use of an economically enhanced resource manager (EERM) for resource provisioning based on economic models. In particular, this paper describes techniques used by the EERM to support revenue maximization across multiple service level agreements and provides an application scenario to demonstrate its usefulness and effectiveness. Copyright © 2008 John Wiley & Sons, Ltd.

Received 28 September 2007; Revised 30 May 2008; Accepted 6 June 2008

KEY WORDS: service level agreements; Grid economy; resource management

∗Correspondence to: M. Macías, Barcelona Supercomputing Center, Jordi Girona 29, Barcelona, Spain.
†E-mail: [email protected]

Contract/grant sponsor: Ministry of Science and Technology of Spain
Contract/grant sponsor: European Union; contract/grant number: TIN2007-60625
Contract/grant sponsor: Commission of the European Communities; contract/grant number: 034286



1. INTRODUCTION

The Self-organizing ICT‡ Resource Management (SORMA) [1] is a European project developing methods and tools for efficient market-based allocation of resources. It uses a self-organizing resource management system and market-driven models, which are supported by extensions to existing Grid computing infrastructure.

Unlike many existing Grid environments, tasks submitted to SORMA are matched with available resources according to the economic preferences of both resource providers and consumers, and the current market conditions. This means that the classic Grid job scheduler, which is based on performance rules, is replaced by a set of self-organizing, market-aware agents that negotiate service level agreements (SLAs) to determine the resource allocation that best fulfills both performance and business goals. In SORMA, an economically enhanced resource manager (EERM) exists at each resource provider’s site and acts as a centralized resource allocator to support business goals and resource requirements.

The advantages of using Grid economics were introduced previously by Kenyon and Cheliotis [2]:

Cost: Some applications need a large number of resources only at a given moment. Maintaining all of these resources can be expensive in terms of space, energy, hardware costs, and maintenance staff. With economic Grid systems the local system managers can dynamically acquire resources. This means that Grid users pay only for the resources that they have consumed.

Efficiency: In many existing Grid environments, clients tend to overestimate their requirements to avoid running out of resources; pricing can focus clients’ attention on trying to make reservations that are not wasteful.

Flexibility: Resources are obtained by clients when they need them. This is common to other non-economics-based Grid systems, but economic approaches push resource consumers to obtain the resources only when they really need them.

Scalability: New budget entities and users can be added easily while preserving flexibility and efficiency.

Feedback: The prices and valuation of resource requirements over time can be used to guide management decisions.

Although a number of different economic models may be used to support resource management, this paper focuses on adaptation mechanisms to support revenue maximization across multiple SLAs. In other words, when an EERM receives task reservations and associated SLAs, the EERM must allocate, monitor, and enforce resource constraints in order to maximize the number of tasks whose SLAs can be satisfied. However,

• The EERM does not have the ability to decide which of the SLAs must be accepted or rejected. It is used only for consultative purposes. Even if the EERM advises that it cannot fulfill an incoming SLA, economic agents could decide to send it to the EERM because they consider that it is strategically necessary.

• The EERM uses a predictive model to calculate the impact of a task execution. Any prediction system has a margin of error, and the system could accept a task that cannot be fulfilled, resulting in system overload.

‡ICT stands for Information and Communications Technology.


• An adverse situation could reduce the number of available resources, for example, some nodes of an available cluster could crash. The EERM must manage the situation to minimize the economic losses that result from having too many jobs for the available resources.

In the cases described above, the service provider would have a reduced number of resources and the system would become overloaded. In consequence, some SLAs of accepted jobs might be violated and the service provider would pay penalties to the clients whose SLAs are not fulfilled. The approach adopted in this work aims at minimizing the economic impact of SLA violations, while at the same time attempting to enable as many tasks as possible to execute to completion.

The remainder of the paper is structured as follows: Section 2 presents related work. Section 3 introduces the concepts related to quality of service (QoS) and SLAs, and identifies how these terms relate to each other. Section 4 defines a resource allocation scenario and describes revenue maximization and issues associated with managing SLAs. Section 5 describes the EERM’s architecture and highlights the important features required to make use of an EERM within a Grid system. Section 6 contains a simple example to demonstrate how the EERM may be used, which is subsequently tested in Section 7. Finally, Section 8 concludes the paper and describes our future work.

2. RELATED WORK

A number of industrial vendors are developing techniques for exposing resources to clients and charging them for usage. The most representative examples are Sun Microsystems and its Network.com [3] initiative, and Amazon with the Elastic Compute Cloud (EC2) [4] and S3 services. Both approaches have evolved from the respective companies’ data centers as a way of managing storage and computational resources, while at the same time allowing third parties to ‘hire’ resources during periods when the data centers are under-utilized. The main difference between these approaches and the SORMA project is that they do not utilize economic intelligence; the prices at which resources are offered are fixed (in the context of Amazon, three different levels of access are offered for both storage and computational capability). Furthermore, resources within a data center are often homogeneous and the mechanisms to allow access may be localized to a single vendor. SORMA shifts the focus from individual data centers to an open market environment, where diverse resource providers compete for consumers.

Economic resource management and some of its related elements, such as client classification in Grids by applying price discrimination based on customer characteristics, have been mentioned in other papers, e.g. [5,6]. Chicco et al. [7] describe data mining algorithms and tools for client classification in electricity grids, but concentrate on methods for finding groups of customers with similar behavior. Poggi et al. [8] propose an architecture for admission control in e-commerce Web sites, which prioritizes user sessions based on predictions about the user’s intentions to buy a product. Making a distinction between different types of users in this way has not been undertaken in the context of Grid computing.

The introduction of risk management to the Grid as proposed by Djemame and Kao [9] permits a more dynamic approach to the usage of SLAs. It involves modeling the risk that the SLA cannot be fulfilled. A provider can then offer SLAs with different risk profiles. SLAs with lower risk can be achieved by introducing different levels of redundancy. This is achieved by introducing buffers of capacity, e.g. reserving more resources than actually needed. Voss [10] proposes precautionary migrations of tasks for preventing SLA violations, with some similarities to the approach being presented in this paper, but emphasizing risk management strategies. She also presents a measurement to estimate the effects of migrating tasks to an alternative resource. The last two references are within the framework of the European AssessGrid project [11], which aims at developing ‘generic, customizable, trustworthy, and interoperable open-source software for bringing to service providers open-source software for risk assessment, risk management, and decision-support in Grids’.

Chunlin and Layuan [12] discuss a QoS-based scheduling approach, which considers utility and pricing in Grids using the Lagrangian method. Their intention is to model the pricing of Grid resources as a centralized optimization problem. A key objective of the EERM discussed in this paper is to improve the monetary benefit; however, there are other efforts in computational mechanism design (CMD) such as that outlined by Dash et al. [13,14]. Such an approach focuses on the types of interaction protocols that would be needed between non-cooperative agents, utilizing different economic strategies. The approach advocated in CMD is the ability to maximize utility (defined in different ways depending on the context) across the agents within a system. A centralized view of the system is therefore adopted, with each agent acting in a rational manner. The CMD approach is certainly applicable to Grid computing, whereby the resources a provider makes available to the market depend on what others are also doing. Hence, providers would use market signals from others to determine how much to offer on the market.

3. QOS: BACKGROUND AND TERMINOLOGY

QoS has been explored in various contexts [15,16]. Two types of QoS attributes can be distinguished: those based on the quantitative and those on the qualitative characteristics of the Grid infrastructure. Qualitative characteristics refer to aspects such as service reliability and user satisfaction. Quantitative characteristics refer to aspects such as network latency, CPU performance, or storage capacity. For example, the following are quantitative parameters for network QoS: delay (the time it takes a packet to travel from sender to receiver), delay jitter (the variation in the delay of packets taking the same route), throughput (the rate at which packets go through the network), and packet-loss rate (the rate at which packets are dropped, lost, or corrupted). Although qualitative characteristics are important, it is difficult to measure these objectively. Systems that are centered on the use of such measures utilize user feedback [17] to compare and relate them to particular system components. Ultimately, each qualitative characteristic should be expressible in terms of measurable, quantitative characteristics. For instance, user satisfaction should somehow map into parameters such as CPU performance and network latency. However, a key difference between these two is the different viewpoints on qualitative characteristics that would be held by different users or applications. Some may view an access time of 2 ms (a quantitative characteristic), for instance, to constitute a slow service (a qualitative characteristic), whereas others may view a service of 4 ms to be slow. Hence, qualitative characteristics represent a comparative viewpoint held by a user/application and may be difficult to generalize across different application domains and users. Our focus is primarily on quantitative characteristics.



Similarly, compute QoS can be specified based on how the computational (CPU) resource is being used, i.e. as a shared or an exclusive-access resource [18]. When more than one user-level application shares a CPU, the application can specify that it requires a certain percentage of access to the CPU over a particular time period. In exclusive-access systems, in which usually one user-level application has exclusive access to one or more CPUs, the application can specify the number of CPUs as a QoS parameter. In exclusive-access systems only one application will be allowed to use the CPU for 100% of the time, over a particular time period.

Storage QoS is related to access to devices such as primary and secondary disks or other devices such as tapes. In this context, QoS is characterized by bandwidth and storage capacity. Bandwidth is the rate of data transfer between the storage devices and the application program reading/writing data. Bandwidth is dependent on the speed of the bus connecting the application to the storage resource, and the number of such buses that can be used concurrently. The number and types of parallel I/O channels available between the processor and the storage media are significant parameters in specifying storage QoS. Capacity is the amount of storage space that the application can use for writing data.

It is necessary for applications to specify their QoS requirements as the characteristics of a single resource that is necessary to run their application (compute, storage and network), and the period over which the resource is required. Such a resource may, in practice, involve the aggregation of a number of different network, compute, and data resources to achieve the desired outcome. Resource reservation provides one mechanism to satisfy the QoS requirements posed by an application user, and involves giving the application user an assurance that the resource allocation will provide the desired level of QoS. The reservation process can be immediate or undertaken in advance, and the duration of the reservation can be definite (for a defined period of time) or indefinite (from a specified start time and till the completion of the application).
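The reservation types described above (immediate or in advance, definite or indefinite) could be represented as follows. This is only an illustrative sketch: the `Reservation` class and its field names are hypothetical and are not part of SORMA or the EERM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Reservation:
    """Hypothetical record of a resource reservation request."""
    requested_at: datetime           # when the reservation was made
    start: datetime                  # when the resource is needed
    end: Optional[datetime] = None   # None models an indefinite reservation

    @property
    def is_advance(self) -> bool:
        # An advance reservation starts later than the request time;
        # otherwise the reservation is immediate.
        return self.start > self.requested_at

    @property
    def is_definite(self) -> bool:
        # A definite reservation has a fixed end; an indefinite one lasts
        # until the application completes.
        return self.end is not None


now = datetime(2008, 6, 6, 12, 0)
immediate_indefinite = Reservation(requested_at=now, start=now)
advance_definite = Reservation(
    requested_at=now,
    start=now + timedelta(hours=2),
    end=now + timedelta(hours=5),
)
```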

3.1. QoS in Grid computing

The Grid approach can be seen as a global-scale distributed-computing infrastructure with coordinated resource sharing [19,20]. A key Grid problem that many researchers have been investigating is resource management, specifying how Grid middleware can provide resource coordination for client applications transparently. One of the most successful middleware projects that provides such coordination is the Globus Alliance [21]. Recently, there has been a push to make greater use of Grid middleware in business applications, as the traditional focus has been towards computational science. This change in emphasis has also led to greater emphasis being placed on commercial technologies, such as Web Services, and currently service-oriented concepts play a key role in emerging Grid standards.

Generally, Grid applications submit their requirements to Grid resource management services that schedule jobs as resources become available. Each resource provider must support a resource manager or scheduler that can receive requests from external applications (i.e. applications that are being managed by individuals who do not own the resources). However, there are several applications that need to obtain results for their tasks within strict deadlines; hence, they cannot wait for resources to become available. For these applications, it is often necessary to reserve Grid resources and services at a particular time (in advance or on-demand). In addition, other features are highly desirable, indeed required, if the Grid resource management service is to be able to handle complex scientific and business applications. We review these requirements in the following subsection and then briefly discuss how well current QoS systems meet these requirements.

In Grid computing, QoS management aims at providing assurance for accessing resources, while maintaining the security level between domains. Grid QoS requires a central information service [22] for up-to-date information on resources available for use by others. Such an information service can be interrogated by an application user to determine which resources can be used to execute an application. As Grid QoS simultaneously deals with a number of resources per service session, SLAs become essential to specify the service level that the client must receive and the provider must supply. Such SLAs must also make use of parameters that are provided by resource owners when they publish the properties of their resources. It is important that each parameter within the SLA is capable of being monitored. SLAs often encode requirements that an application user wishes to achieve, and capabilities that a resource owner can provide to others. Such contracts between users and providers may be expressed using first-order logic, algebraic operators, or be encoded within a scripting language as a policy. Often there is a tradeoff between the expressiveness offered by a particular encoding style, and the ease of use, evaluation, and modification of a particular requirement. Therefore, an SLA contains service level objectives (SLOs), each of which is a constraint on a particular QoS-related metric. The SLA as a whole is therefore attempting to request particular QoS requirements from a provider, and it is the provider who must determine whether these QoS criteria (or SLOs) can be met given the resource commitments it has already made.

Sahai et al. [23] propose an SLA management entity to support QoS in the context of commercial Grids. They envision the SLA management entity existing within the OGSA architecture, with its own set of protocols for manageability and assurance; they also describe a language for SLA specification. Although an interesting approach, this work is still at a very preliminary stage, and its general applicability is still not obvious.
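The view of an SLA taken in this subsection, a set of SLOs where each SLO is a constraint on a monitorable QoS metric, can be sketched in a few lines of Python. The metric names, bounds, and the `SLO` and `violated_slos` names are assumptions chosen for illustration; they do not correspond to any SLA specification language cited here.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO: a comparison between a monitored metric and a bound."""
    metric: str
    compare: Callable[[float, float], bool]  # e.g. operator.ge for "at least"
    bound: float

    def is_met(self, measured: Dict[str, float]) -> bool:
        # A metric that is not monitored cannot be shown to be met.
        if self.metric not in measured:
            return False
        return self.compare(measured[self.metric], self.bound)


def violated_slos(slos: List[SLO], measured: Dict[str, float]) -> List[str]:
    """Return the metric names of all SLOs that the monitored values violate."""
    return [slo.metric for slo in slos if not slo.is_met(measured)]


# Illustrative SLA: at least 2 CPUs, more than 100 MBytes of disk, latency <= 4 ms.
sla = [
    SLO("cpus", operator.ge, 2),
    SLO("disk_mbytes", operator.gt, 100),
    SLO("latency_ms", operator.le, 4),
]
monitored = {"cpus": 2, "disk_mbytes": 80, "latency_ms": 3.5}
print(violated_slos(sla, monitored))
```

Treating an unmonitored metric as violated reflects the requirement that every SLA parameter be capable of being monitored; a real system might instead reject such an SLA at negotiation time.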

4. MODEL DESCRIPTION

Multiple economic enhancements exist that could be applied to resource management. In this paper we focus only on those related to revenue maximization across multiple SLAs. However, the aim of our work is to provide a framework that will allow Grid economists to define their own rules to achieve their particular goals. Therefore, the content of this paper should be considered as a particular view of how the system behaves.

The SLA satisfaction function determines if, for a set of $n$ resources $R = \{R_1, R_2, \ldots, R_n\}$, an SLA $S$ can be fulfilled or will be violated:

$$\mathrm{SSF}(S, R) = \begin{cases} 1 & \text{(SLA fulfilled)} \\ 0 & \text{(SLA violated)} \end{cases}$$

The multiple SLA satisfaction function determines if, for a set of $n$ resources $R = \{R_1, R_2, \ldots, R_n\}$, a set of $m$ SLAs $\{S_1, S_2, \ldots, S_m\}$ can all be fulfilled or whether at least one SLA will be violated:

$$\mathrm{MSSF}(\{S_1, S_2, \ldots, S_m\}, R) = \prod_{i=1}^{m} \mathrm{SSF}(S_i, R) = \begin{cases} 1 & \text{(all SLAs fulfilled)} \\ 0 & \text{(at least one SLA violated)} \end{cases}$$
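The two satisfaction functions can be illustrated with a minimal Python sketch. The paper does not specify how fulfillment is decided, so the example assumes (hypothetically) that each SLA requires a number of CPUs and that the resource set is summarized as the CPU share currently allocated to each SLA.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List


@dataclass(frozen=True)
class SLA:
    """Hypothetical SLA with a single CPU requirement."""
    name: str
    cpus_required: int


def ssf(sla: SLA, allocation: Dict[str, int]) -> int:
    """SLA satisfaction function: 1 if the SLA is fulfilled, 0 if violated.

    Assumption: an SLA is fulfilled when its allocated CPUs cover its demand.
    """
    return 1 if allocation.get(sla.name, 0) >= sla.cpus_required else 0


def mssf(slas: List[SLA], allocation: Dict[str, int]) -> int:
    """Multiple SLA satisfaction function: the product of the individual SSFs."""
    return prod(ssf(sla, allocation) for sla in slas)


slas = [SLA("a", 2), SLA("b", 4)]
enough = {"a": 2, "b": 4}      # every SLA receives what it needs
overloaded = {"a": 2, "b": 3}  # SLA "b" is one CPU short
```

Because the result is a product of 0/1 values, a single violated SLA is enough to drive the multiple SLA satisfaction function to 0.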



Consider the following scenario: a set of running tasks, each with its own SLA, is assigned to a resource. Each time a new task/SLA pair arrives, the EERM must assign a portion of the resource bundle. There are two possible scenarios:

• There are enough free resources; hence, the multiple SLA satisfaction function is 1. In this case, it is trivial to allocate the incoming tasks to a suitable resource. This scenario is not studied in this paper.

• There are not enough resources (multiple SLA satisfaction function is 0), implying that an intelligent resource re-allocation mechanism is required for maximizing revenue and minimizing SLA violation penalties. This is the scenario that has been considered in this paper.

4.1. Revenue maximization in resource-limited providers

The price $\mathit{Prc}_i$ is the amount of monetary units (money) that a client will pay if a provider fulfills the SLA $S_i$. The price is specified in the same SLA and usually has a fixed value. On the other hand, we define the penalty $\mathit{Pen}_i$ as the amount of money that the provider must pay if the SLA $S_i$ is violated. The penalty is also specified in the same SLA and can be a function with parameters specified as in Section 4.2. In addition, the cost of execution $\mathit{Cex}_i$ must also be considered for the service provider.

The gain $G(S_i)$ is the economic benefit that the provider obtains with the execution of a task whose SLA is $S_i$. It is defined as $G(S_i) = \mathit{Prc}_i - \mathit{Cex}_i - \mathit{Pen}_i$ and it can be positive (provider earns money) or negative (SLA violation with high penalty costs). In a pool of resources $R$, executing a set $S$ of $m$ SLAs at a particular time $t$, we define the punctual gain as

$$\Delta G(t, R) = \sum_{i=1}^{m} G(S_i) = \sum_{i=1}^{m} \mathit{Prc}_i - \sum_{i=1}^{m} \mathit{Cex}_i - \sum_{i=1}^{m} \mathit{Pen}_i$$

which is the gain (or loss) obtained if the current tasks all execute and finish on the resources that were assigned at instant $t$.

When a new SLA $S_i$ arrives and there are not enough resources, system overload will cause the provider to start violating SLAs. To avoid (or minimize) violation penalties and maximize revenue, we suggest two complementary solutions:

• Dynamic adaptation in terms of resource provisioning. Previous work [24] has demonstrated that we can increase both the throughput and the number of tasks completed, by dynamically adapting the share of available resources between the applications as a function of demand. This is feasible when several applications share a single multi-processor platform (by assigning priorities and processors) or in virtualized environments [25], by dynamically assigning resources and priorities for each virtual machine.

• Task reallocation involves finding a new resource assignment $R'$ for each task $i$ associated with the SLA $S_i$. The new gain is defined as

$$\Delta G'(t, R) = \sum_{i=1}^{m} G'(S_i) - M(S, R)$$

where $M(S, R)$ is the economic cost of migrating the current running tasks $S$ within the resource bundle $R$. When reallocating tasks, the main challenge for the EERM will be to find the highest $\Delta G'(t, R)$, by predicting the new gain for each possible assignment of resources, and trying to minimize the cost of resource reallocation $M(S, R)$.
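To make the gain definitions and the reallocation objective concrete, the following Python sketch computes the punctual gain of an assignment and searches exhaustively for the reallocation with the highest gain net of migration cost. It is a toy model under stated assumptions, not the EERM's algorithm: tasks are placed on nodes with a fixed CPU capacity, an SLA counts as violated when its node is overloaded, and migration has a flat cost per moved task. All class, function, and parameter names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    price: float      # Prc_i, paid by the client
    exec_cost: float  # Cex_i, cost of execution for the provider
    penalty: float    # Pen_i, charged only if the SLA is violated
    cpus: int         # assumed resource demand


def gain(task: Task, violated: bool) -> float:
    """G(S_i) = Prc_i - Cex_i - Pen_i, with Pen_i = 0 when the SLA is fulfilled."""
    return task.price - task.exec_cost - (task.penalty if violated else 0.0)


def punctual_gain(tasks: Sequence[Task], assignment: Sequence[int],
                  capacities: Sequence[int]) -> float:
    """Sum of the gains of all tasks for a given task-to-node assignment."""
    load = [0] * len(capacities)
    for task, node in zip(tasks, assignment):
        load[node] += task.cpus
    # Assumption: every task on an overloaded node has its SLA violated.
    return sum(gain(t, load[n] > capacities[n]) for t, n in zip(tasks, assignment))


def migration_cost(current: Sequence[int], candidate: Sequence[int],
                   cost_per_move: float) -> float:
    """M(S, R), assumed here to be a flat cost for every task that changes node."""
    return cost_per_move * sum(a != b for a, b in zip(current, candidate))


def best_reallocation(tasks: Sequence[Task], current: Sequence[int],
                      capacities: Sequence[int],
                      cost_per_move: float) -> Tuple[int, ...]:
    """Brute-force search for the assignment maximizing gain minus migration cost."""
    candidates = product(range(len(capacities)), repeat=len(tasks))
    return max(
        candidates,
        key=lambda c: punctual_gain(tasks, c, capacities)
        - migration_cost(current, c, cost_per_move),
    )


tasks: List[Task] = [
    Task("A", price=10.0, exec_cost=2.0, penalty=5.0, cpus=2),
    Task("B", price=8.0, exec_cost=1.0, penalty=4.0, cpus=2),
]
capacities = [2, 2]
current = (0, 0)  # both tasks on node 0, which is therefore overloaded
best = best_reallocation(tasks, current, capacities, cost_per_move=1.0)
```

The exhaustive search grows exponentially with the number of tasks, so it only serves to show the objective; a practical resource manager would need a heuristic or a predictive model to explore the assignment space.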

4.2. SLA violation

Monitoring SLA violation begins once an SLA has been defined. A copy of the SLA must be maintained by both the client and the provider. It is necessary to distinguish between an ‘agreement date’ (agreeing on an SLA) and an ‘effective date’ (subsequently providing a service based on the SLOs that have been agreed). A request to invoke a service based on the SLOs (which are the SLA terms identifying QoS attributes requested from a provider), for instance, may be undertaken at a time much later than when the SLOs were agreed. During provision it is necessary to determine whether the terms agreed in the SLA have been met. In this context, a monitoring infrastructure is used to identify the difference between the agreed upon SLO and the value that was actually delivered during provisioning. It is also necessary to define what constitutes a violation. Depending on the importance of the violated SLO and/or the consequences of the violation, the provider in breach may avoid dispatch or obtain a diminished monetary sanction from the client.

An SLA may be terminated in three situations: (i) when the service defined in the SLA has been completed; (ii) when the time period over which the SLA has been agreed upon has expired; and (iii) when the provider is no longer available after an SLA has been agreed (for instance, the provider’s business has gone into liquidation). In all three cases, it is necessary for the SLA to be removed from both the client and the provider. Where an SLA was actually used to provision a service, it is necessary to determine whether any violations had occurred during provisioning. As indicated above, penalty clauses are also part of the SLA and need to be agreed upon between the client and the provider.

One of the main issues that the provider and the consumer have to agree upon during the SLA negotiation is the penalty scheme or the sanctioning policies. As both the service provider and the client are ultimately businesses (rather than consumers), they are free to decide what kind of sanctions they will associate with the various types of SLA breaches, in accordance with the weight of the parameter that was not fulfilled. We define the following broad categories of provisioning and violation associated with each category:

• ‘All-or-nothing’ provisioning: provisioning of a service meets all the SLOs, i.e. all of the SLO constraints must be satisfied for the successful delivery of a service.

• ‘Partial’ provisioning: provisioning of a service meets some of the SLOs, i.e. some of the SLO constraints must be satisfied for the successful delivery of a service.

• ‘Weighted partial’ provisioning: provisioning of a service meets SLOs that have a weighting greater than a threshold (identified by the client).
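The three provisioning categories above can be read as three different acceptance predicates over the same set of SLO outcomes. The sketch below is one possible interpretation: the `SLOResult` type is hypothetical, and the `min_met` parameter is an assumption, since the text does not say how many SLOs ‘some’ means in partial provisioning.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SLOResult:
    """Hypothetical outcome of monitoring one SLO during provisioning."""
    name: str
    weight: float  # importance agreed during SLA negotiation
    met: bool


def all_or_nothing(results: Sequence[SLOResult]) -> bool:
    # Conjunction of SLO terms: every SLO must be satisfied.
    return all(r.met for r in results)


def partial(results: Sequence[SLOResult], min_met: int = 1) -> bool:
    # Assumption: "some" SLOs means at least `min_met` of them.
    return sum(r.met for r in results) >= min_met


def weighted_partial(results: Sequence[SLOResult], threshold: float) -> bool:
    # Only SLOs whose weight exceeds the client's threshold must be satisfied.
    return all(r.met for r in results if r.weight > threshold)


results = [
    SLOResult("cpus", weight=0.7, met=True),
    SLOResult("disk", weight=0.2, met=False),
    SLOResult("network", weight=0.1, met=True),
]
```

With these outcomes, the delivery fails under all-or-nothing provisioning but succeeds under weighted partial provisioning with a threshold of 0.5, because the only unmet SLO has a weight below the threshold.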

Resource monitoring (RM) can be used to detect whether an SLA has been violated. Typically such violations result in a complete failure, making SLA violations an ‘all-or-nothing’ process. In such an event a completely new SLA needs to be negotiated, possibly with another service provider, which requires additional effort on both the client and the service provider. Based on this all-or-nothing approach, it is necessary for the provider to satisfy all of the SLOs. This equates to a conjunction of SLO terms. An SLA may contain several SLOs, where some SLOs (e.g. at least two CPUs) may be more important than others (e.g. more than 100 MBytes of hard disk space). During the SLA negotiation phase, the importance of the different SLOs may be established. Clients (and service providers) can then react differently according to the importance of the violated SLO. In the WS-Agreement specification [26], the importance of particular terms is captured through the use of a ‘business value’.

Weighted metrics can also be used to provide a flexible and fair sanction mechanism, in case an SLA violation occurs. Thus, instead of terminating the SLA altogether it might be possible to re-negotiate, i.e. with the same service provider, the part of the SLA that has been violated. Again, the more important the violated SLO, the more difficult (if not impossible) it will be to re-negotiate (part of) the SLA.

5. ECONOMICALLY ENHANCED RESOURCE MANAGER

The overall aim of the EERM is to isolate SORMA economic layers from the technical ones and orchestrate both economic and technical goals to achieve maximum economic profit and resource utilization. The main goals of the EERM are

• to combine technical and economic aspects of resource management;

• to perform resource price calculations, taking into account current market supply and demand, performance estimations, and business policies;

• to strengthen the economic feasibility of the Grid.

To provide a general solution that supports different scenarios and business policies, the EERM should provide flexibility in defining user (administrator) configurable rule-based policies, to support:

Individual rationality: An important requirement for a system is that it is individually rational on both sides, i.e. both providers and clients must benefit from using the system. This is a requirement for the whole system, including features such as client classification or dynamic pricing.

Revenue maximization: A key characteristic for SORMA providers is revenue (utility) maximization. The introduced mechanisms can indeed improve the utility of both the provider and the client.

Incentive compatibility: Strategic behavior of clients and providers can be prevented if a mechanism is incentive compatible. Incentive compatibility means that no other strategy results in a higher utility than reporting the true valuation.

Efficiency: There are different types of efficiency. The first one considered here is Pareto efficiency: no participant can improve its utility without reducing the utility of another participant. The second efficiency criterion is allocative efficiency, i.e. the EERM must maximize the sum of individual utilities.

5.1. Architecture

The EERM’s architecture is shown in Figure 1. To place the EERM in the context of the SORMA framework, we have also shown the SORMA Grid market middleware (GMM) [27], which provides the mechanisms to interact with the SORMA market. Once resource usage has been agreed in the SORMA market, a contract is sent to the EERM over the GMM. The contract provides the EERM with input for resource allocation, task execution, and SLA enforcement (SLAE) activities. The EERM is composed of the following components (see Figure 1):

Figure 1. EERM components.

Economy agent (EA): The EA receives requests from SORMA market agents over the GMM. For each request, the EA checks whether the task is technically and economically feasible and calculates a price for the task based on the category of client (e.g. a preferred customer), resource status, economic policies, and predictions of future resource availability (provided by the estimator component (EC)). The EA interacts with the upper SORMA economic layers in the SLA negotiation process.

EC: The EC calculates the expected impact of a task on the utilization of the Grid, following Kounev et al. [28]. In short, the EC’s task is to avoid performance loss due to resource overload [24].

System performance guard (SPG): The SPG monitors resource performance and SLA violations. If there is a danger that one or more SLAs cannot be fulfilled, the SPG can decide to suspend, migrate, or cancel tasks to ensure the fulfillment of other, perhaps more important, SLAs with the aim of maximizing overall revenue. Tasks can also be canceled when additional capacity is required to fulfill commitments to preferred clients. The policies that dictate when to take action and which types of tasks should be killed, migrated, or suspended are updated via the policy manager (PM).
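One way to picture the SPG's decision is as a selection problem: when capacity falls short, choose which tasks to cancel so that the economic loss is small. The sketch below uses a simple greedy rule (cancel the tasks with the lowest loss per freed CPU first). This rule, the `RunningTask` type, and the loss estimate (lost price plus penalty) are assumptions made for illustration; in the EERM the actual behavior is dictated by policies obtained from the PM.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RunningTask:
    name: str
    cpus: int       # CPUs freed if the task is canceled (assumed > 0)
    price: float    # revenue lost if the task is canceled
    penalty: float  # SLA penalty paid if the task is canceled


def tasks_to_cancel(tasks: Sequence[RunningTask], capacity: int) -> List[str]:
    """Greedily pick tasks to cancel until the remaining load fits the capacity."""
    load = sum(t.cpus for t in tasks)
    canceled: List[str] = []
    # Cheapest economic loss per freed CPU first (illustrative policy only).
    by_loss = sorted(tasks, key=lambda t: (t.price + t.penalty) / t.cpus)
    for task in by_loss:
        if load <= capacity:
            break
        canceled.append(task.name)
        load -= task.cpus
    return canceled


running = [
    RunningTask("gold", cpus=4, price=100.0, penalty=50.0),
    RunningTask("silver", cpus=2, price=30.0, penalty=10.0),
    RunningTask("bronze", cpus=2, price=10.0, penalty=2.0),
]
# Two nodes crashed: only 6 of the 8 CPUs in use remain available.
print(tasks_to_cancel(running, capacity=6))
```

A greedy choice is not guaranteed to minimize the total loss, and it ignores suspension and migration; it is meant only to show how penalties and prices can drive the choice of which SLAs to sacrifice.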



PM: The PM stores and manages policies concerning client classification, task cancellation, or suspension. Policies are formulated using the Semantic Web Rule Language [29]. The PM is an important part of the EERM in that it allows behavior to be adapted at runtime. With the exception of the EC, all other EERM components use the PM to obtain policies that affect their decision-making process.

Economic resource manager (ERM): The ERM interacts with local resource managers and is responsible for ensuring an efficient use of local resources. The ERM is described in further detail in Section 5.3.

RM: The RM provides resource information for system and per-process monitoring. Resource information is used by the EC, SPG, ERM, and SLAE components. The RM is explained in further detail in Section 5.4.

SLAE: The SLAE is tasked with monitoring SLA fulfillment. The SLAE uses monitoring data from the EERM and RM. When an SLA violation is detected, the SLAE takes reactive measures such as SLA re-negotiation or compensation retrieval based on SLA penalty clauses. This component is explained in further detail in Section 5.5.

5.2. Key features

Task cancellation: This feature is needed to ensure QoS in situations where problems arise, i.e. when parts of the Grid fail or the estimations of the utilization were too optimistic. This feature is part of the SPG.

QoS: It is introduced with the help of a number of components. First of all, the estimator calculates the expected impact of a task on the utilization. If there is not enough capacity for the task, or the task would lead to capacity problems for other tasks, this information is given to the EA. The EA then usually rejects the task. However, it can also instruct the SPG to free capacity. As the latter requires suspending or canceling tasks, it will only be done on a few occasions.

The SPG also has another key role in ensuring QoS. When it detects that one or more SLAs cannot be fulfilled, it suspends or cancels tasks until the remaining SLAs can all be kept. This is done considering the penalties resulting from the cancellation or suspension of tasks and the policies in force (e.g. regarding client classification).

Dynamic pricing: Another enhancement is dynamic pricing based on various factors. Yeo and Buyya [30] presented an approach for a pricing function depending on a base pricing rate and a utilization pricing rate. However, the price can depend not only on current utilization but also on projected utilization, client classification, projected demand, etc. One such option is to include the impact a task has on the utilization of the Grid in the price calculation. For example, when an incoming task leads to a utilization above certain thresholds, a higher price is charged.

The components of the EERM involved in dynamic pricing are the EA, the PM, and the EC. The functionality required in the EA is to calculate the dynamic prices based on utilization, resource usage, projected demand, etc. The PM needs to store and manage the pricing policies. The estimator component needs to deliver the data on which the price calculation in the EA is based, i.e. the estimated performance impact of the task, the expected resource usage of the task, and a projected utilization for the time frame in which the task is executed.

Client classification in EERM: The main differentiation factors in the EERM are a priority on task acceptance and QoS. Price discrimination is also featured and different policies for pricing can be introduced. However, in systems that feature components dedicated to trading, it might be more suitable to move price discrimination into these components in order to coordinate better with other trading strategies.
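As an illustration of the threshold-based dynamic pricing mentioned in this section, the following minimal sketch charges a higher price when accepting a task would push the projected utilization above given thresholds. The function names, the threshold levels, and the multipliers are hypothetical and chosen only for illustration; they are not taken from the EERM or from the pricing function of Yeo and Buyya [30].

```python
def projected_utilization(busy_cpus, total_cpus, task_cpus):
    """Utilization of the pool if the incoming task were accepted (capped at 1.0)."""
    return min(1.0, (busy_cpus + task_cpus) / total_cpus)


def dynamic_price(base_rate, cpus, duration, utilization,
                  thresholds=((0.9, 2.0), (0.7, 1.5))):
    """Price a task from a base rate and the projected utilization.

    thresholds holds (utilization level, price multiplier) pairs ordered from
    the highest level to the lowest. All values are illustrative assumptions.
    """
    multiplier = 1.0
    for level, factor in thresholds:
        if utilization >= level:
            multiplier = factor
            break
    return base_rate * cpus * duration * multiplier


# A 4-CPU, 2-slot task offered to a 32-CPU pool at different load levels.
low = dynamic_price(1.0, 4, 2, projected_utilization(10, 32, 4))
medium = dynamic_price(1.0, 4, 2, projected_utilization(20, 32, 4))
high = dynamic_price(1.0, 4, 2, projected_utilization(28, 32, 4))
```

In this sketch the EC would supply the projected utilization, the PM would supply the threshold table, and the EA would perform the calculation; other inputs named in the text, such as client classification or projected demand, could be added as further multipliers.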

5.3. Economic resource manager (ERM)

The ERM is designed to interact with a range of execution platforms (e.g. Condor, Sun Grid Engine, Globus GRAM, or UNIX fork) and achieves this using Tycho [31] connectors that communicate over the network to resource agents (RAs).

The RA translates XML messages from the ERM into messages understood by the underlying platform (e.g. Condor). In addition, RAs provide a consistent interface to the different underlying resource fabrics. This means that another platform can be adapted to SORMA by implementing an appropriate RA plug-in that performs translations to and from the underlying resource manager’s native protocol. It is intended that access to the existing middleware be constrained by firewall rules, so that all interactions must go through the ERM. As a single point of access, the ERM can provide additional functionality that the underlying middleware may lack, for example, by providing support for advanced reservations.

In the current prototype, RAs include a plug-in for launching JSDL [32] jobs using GridSAM [33]. The approach used to implement the ERM is complemented by a similar approach used for RM.

5.4. Resource monitoring

In order to enable SLAE, an understanding of the current and recent state of the underlying resources is required. Resource availability and utilization can be sampled periodically in a coarse-grained manner in order to provide a high-level understanding of general QoS indicators. At other times it may be appropriate to target particular and detailed attributes that reflect the given resources’ ability to fulfill a particular action, e.g. the execution of a task. In addition, notifications received from resources when a particular threshold has been exceeded can help to identify SLA violations. The EERM employs the GridRM [34] wide-area distributed monitoring system to gather data required for SLAE.

The GridRM design employs gateways for gathering data from a number of different types of resources that make up the Grid. Resources of interest can include all types of networked devices, from a remote sensor or satellite feed through to a computational node or a communications link. The gateway is used internally, to a Grid-enabled site (the local layer), to configure, manage, and monitor internal resources, while providing controlled external access to resource information. The EERM is bound to its local GridRM gateway using the Tycho distributed registry and messaging system (see Figure 2). The EERM queries the gateway for real-time and historical resource data, and registers interest to receive different types of events that reflect changes in resource state (e.g. completion of a submitted task, system load greater than a specified threshold).

Figure 2. The GridRM gateway and relationship to the EERM.

Resources may already provide legacy agents, e.g. SNMP, Ganglia, /proc, Condor. As long as the gateway is installed with a driver that supports the agent’s native protocol, all resource data provided by the native agent can be retrieved. In cases where an existing agent is not installed, a proprietary agent can be used for information gathering. Using a native agent means that existing resources can be monitored with little or no modification. Alternatively, installation of the proprietary GridRM agent implies some administrative overhead for each resource, but can result in improved performance and lower intrusiveness when gathering data.

Resource heterogeneity (agent and platform type) is hidden from GridRM clients and hence the EERM; the Structured Query Language [35] is used to formulate monitoring requests, and a SORMA-specific schema based on the GLUE Schema [36] is used to group data and format the results into a consistent form (semantically and in terms of the values returned from different agents). Currently, the SORMA consortium has identified a number of core attributes that are used for monitoring resources, enforcing SLAs, advertising resources on the market, and match-making purposes. The core attributes include the following:

• CPU (architecture, number of, speed);
• operating system (type, kernel version, shared libraries);
• memory (total/free physical/virtual);
• disk (total/free, network/local);
• per-process execution statistics (start/stop times, CPU time, memory footprint, exit status).



The current set of core attributes is a starting point and will evolve over time, as the requirements for more complex SLAE are understood.

As well as real-time information, a need exists to capture historical data so that the SLAE component can determine the likelihood of an SLA violation, based on past resource provision at a given site. The gateway can be instructed to query particular core attributes at a given frequency and store the results in its internal database. The consistent view of resource data provided by the GridRM gateway means that the SLAE component is not exposed to resource heterogeneity and hence can focus on performing its core duties of SLA monitoring and enforcement.
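To make the idea of SQL-formulated monitoring requests over real-time and historical data more concrete, the sketch below uses an in-memory SQLite database as a stand-in for the gateway's internal store. The table `host_samples`, its columns, and the helper functions are hypothetical; the actual SORMA schema derived from GLUE and the GridRM query interface are not described in the text and may differ substantially.

```python
import sqlite3

# Hypothetical table standing in for the gateway's store of periodic samples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE host_samples ("
    " host TEXT, sampled_at INTEGER, cpu_count INTEGER,"
    " load_avg REAL, mem_free_mb INTEGER)"
)
conn.executemany(
    "INSERT INTO host_samples VALUES (?, ?, ?, ?, ?)",
    [
        ("node1", 100, 8, 2.5, 4096),
        ("node1", 160, 8, 7.5, 1024),
        ("node2", 100, 8, 0.5, 8192),
        ("node2", 160, 8, 1.0, 8000),
    ],
)


def latest_state(conn):
    """Real-time view: the most recent sample stored for each host."""
    return conn.execute(
        "SELECT s.host, s.load_avg, s.mem_free_mb FROM host_samples s"
        " JOIN (SELECT host, MAX(sampled_at) AS t"
        "       FROM host_samples GROUP BY host) m"
        " ON s.host = m.host AND s.sampled_at = m.t"
        " ORDER BY s.host"
    ).fetchall()


def overload_ratio(conn, host, load_threshold):
    """Historical view: fraction of a host's samples above a load threshold."""
    total, over = conn.execute(
        "SELECT COUNT(*), SUM(load_avg > ?) FROM host_samples WHERE host = ?",
        (load_threshold, host),
    ).fetchone()
    return (over or 0) / total if total else 0.0
```

A ratio such as `overload_ratio(conn, "node1", 4.0)` is one simple way an enforcement component could estimate, from past samples, how likely a site is to miss a commitment; the text does not specify which estimator the SLAE actually uses.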

5.5. SLA enforcement

The aim of the SLAE component is to take reactive measures, such as SLA re-negotiation or compensation retrieval, based on SLA penalty clauses. It is important to note that SLA enforcement does not only concern providers meeting their commitments; consumers must also be monitored to validate that they have met the SLA. One of the most important aspects to monitor on the consumer side is the possible use of more resources than agreed in the SLA.

The interaction between the EERM and the SLAE component is described in Figure 3. The process begins when the SLAE component receives a contract from SORMA contract management (the element that creates the contracts once a negotiation is agreed between providers and customers). After this, the SLA is created and sent to the EERM, which watches for its fulfillment. The EERM takes the economic data from SLA Enforcement and the performance data from the RM components to detect whether an SLA is being violated, and performs a selective violation of SLAs to maximize the revenue.

When a violation occurs, the SLAE component detects it and generates a notification for the SORMA economic layers, in order to negotiate a new contract or give clients the possibility of searching for another provider.

Figure 3. SLA enforcement and EERM components interaction.



Figure 4. Enforcement of SLA fulfillment example scenario.

6. CONCEPT EXAMPLE

To explain the operations of SLA fulfillment using the EERM, we have designed a simple conceptual scenario (see Figure 4): A resource provider wishes to sell the CPU time of four multi-processor machines. There are some free resources, and some running tasks whose prices are specified in their SLAs. In order to simplify, there are three fixed economic parameters:

• penalty for SLA violation: four currency units per violation, specified in each SLA;
• task migration: one currency unit per migration; an indirect cost, calculated by the resource provider;
• cost of execution: one currency unit per utilized CPU.

In the example scenario described in the upper schema of Figure 4, a new task arrives, and its SLA specifies a requirement for 4 CPUs and a price of 7. The incoming task does not fit in any resource and, therefore, risks breaking the SLA. In response, the EERM could take three different actions:

1. Deny resource allocation for the incoming task. This is a non-economic response and means that the EERM has fallen back to the same behavior as traditional resource management systems. Because the SLA has been agreed upon previously, if this response is taken the incoming task SLA will be broken and the provider will have to pay a penalty of 4. Using the formulas proposed in Section 4.1, the provider obtains a punctual gain of

$$\Delta G(t, R) = \sum_{i=1}^{m} Prc_i - \sum_{i=1}^{m} Cex_i - \sum_{i=1}^{m} Pen_i = 41 - 18 - 4 = 19$$

2. Perform a selective SLA violation. In the middle schema of Figure 4, the EERM determines that the first task in R1 can be terminated due to the low price of that task. As a result the incoming task is now able to fit into R1. The punctual gain for the provider is

$$\Delta G'(t, R) = 48 - 19 - 4 = 25$$

3. Reallocate resources. In this particular case, there are 4 free CPUs, but they are scattered across the resource bundle. Reallocating tasks to provide a single machine with 4 CPUs may be cheaper than breaking the SLA. For example, the lower schema of Figure 4 shows task migration, which results in a new punctual gain of

$$\Delta G'(t, R) = \sum_{i=1}^{m} G'(S_i) - M(S, R) = 52 - 22 - 2 = 28$$

By applying economic enhancements into resource management, a provider can dramatically increase its revenue (47% in the example) by choosing the correct policy for SLA brokering or task reallocation. Determining the optimal solution for a given scenario will depend on penalty and reallocation costs as well as on the current resource availability.
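The arithmetic of the three options in this section can be checked with a few lines of code. The sketch below takes the aggregate prices, execution costs, penalties, and migration costs quoted in the example as inputs; the function name and the dictionary layout are illustrative only.

```python
def punctual_gain(prices, exec_costs, penalties=0, migrations=0):
    """Total price minus execution cost, penalties, and migration cost."""
    return prices - exec_costs - penalties - migrations


# Aggregate figures quoted for the three options of the concept example.
options = {
    "deny": punctual_gain(41, 18, penalties=4),
    "selective violation": punctual_gain(48, 19, penalties=4),
    "reallocation": punctual_gain(52, 22, migrations=2),
}

best = max(options, key=options.get)
# Relative improvement of the best option over denying the task.
improvement = (options[best] - options["deny"]) / options["deny"]
```

With these figures reallocation is the best option, and its gain of 28 against 19 for denial corresponds to the 47% increase quoted in the text (9/19 ≈ 0.47).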

7. EVALUATION

After showing a simple example of the theoretical concepts of this paper, this section evaluates a simulated EERM, which uses different policies with several resources and a continuous stream of incoming SLAs. In this case, 1000 random SLAs arrive at a resource pool. The revenue is calculated using the following policies: no policy, selective violation, task reallocation, and a combination of the last two. These policies are explained in Section 7.1.2.

7.1. Experiment details

This section describes the details of the simulation that was performed for the evaluation of the concepts introduced in this paper.

7.1.1. SLA creation

SLAs are represented as tuples of the form $(t', \Delta t, C, Prc, Pen)$, where $t'$ is the arrival time, $\Delta t$ is the duration of the task, $C$ is the size (in number of CPUs) of the resources to allocate, and $Prc$ and $Pen$ are the price and the penalty, respectively.

The values of each SLA are derived from a random component $\rho$, whose values are uniformly distributed in the range $[0, 1)$. The formulas used to define the SLA terms are as follows:

Arrival time: It is a series, where $t'_0 = 1$ and $t'_{n+1} = t'_n + \rho\ (\mathrm{mod}\ \tau)$, $\tau$ being the maximum time between an SLA and its previous one. In this manner, a uniform task arrival rate is avoided, and more irregular and realistic task arrival times are obtained.

Duration: The chosen distribution for duration is not uniform. This paper considers a duration distribution in which short-duration tasks (near 1 time slot) are more frequent than long-duration tasks (near the maximum duration $1 + M_t$). For this reason, $\cos((\pi/2)\rho)$ multiplies $M_t$ instead of using $\rho$ directly:

$$\Delta t = \left\lfloor 1 + \cos\left(\frac{\pi}{2}\rho\right) M_t \right\rfloor$$

Size: As with the duration function, a non-uniform distribution has been chosen. Tasks that use only one CPU are more frequent than big tasks (near the maximum number of CPUs $M_C$). For this reason, $\cos((\pi/2)\rho)$ multiplies $M_C$ instead of using $\rho$ directly:

$$C = \left\lfloor 1 + \cos\left(\frac{\pi}{2}\rho\right) M_C \right\rfloor$$

Cost of execution: It is the cost of executing a concrete task during a time unit on a set of $C$ CPUs, where the unit cost $C_u$ is the cost of using a single CPU:

$$Cex = C_u C$$

In the experiments, the value of $C_u$ is the same for all the resources, and its value is not important because it is annulled by the addition of $Cex$ in the price formula.

Price: It adds to the cost of execution $Cex$ the desired benefit, calculated as a function of the number of resources used, the time for which the resources are used, and the maximum price per CPU used ($MPrc_C$). The random variable $\rho$ is added to provide a variable range of prices:

$$Prc = C \, \Delta t \, \rho \, MPrc_C + Cex$$

Penalty:

$$Pen = \rho \, MPen$$

where $MPen$ is the maximum penalty that the resource provider can pay for an SLA violation. The random variable $\rho$ is added to provide a variable range of penalties.
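The SLA terms of this section can be generated with a short routine such as the one below. It is a sketch under stated assumptions rather than the simulator used in the paper: each term uses an independent draw of the uniform random component, the gap between consecutive arrivals is drawn uniformly between zero and the maximum inter-arrival time (one possible reading of the arrival-time series), and all class, function, and parameter names are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class SLA:
    arrival: float   # arrival time
    duration: int    # duration in time slots
    cpus: int        # size in number of CPUs
    price: float     # Prc
    penalty: float   # Pen


def generate_slas(n, max_gap, max_duration, max_cpus, max_price_per_cpu,
                  max_penalty, unit_cost=1.0, seed=None):
    """Draw n SLA tuples following the duration, size, price, and penalty formulas."""
    rng = random.Random(seed)
    slas, arrival = [], 1.0  # the first arrival time is 1
    for _ in range(n):
        # floor(1 + cos((pi/2) * rho) * M) for duration and size
        duration = math.floor(1 + math.cos(math.pi / 2 * rng.random()) * max_duration)
        cpus = math.floor(1 + math.cos(math.pi / 2 * rng.random()) * max_cpus)
        exec_cost = unit_cost * cpus  # Cex = Cu * C
        price = cpus * duration * rng.random() * max_price_per_cpu + exec_cost
        penalty = rng.random() * max_penalty
        slas.append(SLA(arrival, duration, cpus, price, penalty))
        # Assumption: the next gap is uniform in [0, max_gap).
        arrival += rng.random() * max_gap
    return slas


workload = generate_slas(1000, max_gap=5.0, max_duration=10, max_cpus=8,
                         max_price_per_cpu=2.0, max_penalty=4.0, seed=42)
```

Passing a fixed `seed` makes a workload reproducible, which is convenient when the same 1000 SLAs have to be replayed against several policies. The parameter values in the usage line are placeholders and are not the experiment parameters of Figure 5.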

7.1.2. Policies creation

This section describes the policies that will be used in the experiment and their details:

1. No policy: when a task arrives, it is allocated to the first available resource. When there is no suitable resource for the task, it is rejected and its SLA is violated.

2. Selective violation: the same as the previous policy, but when there are not enough resources, the EERM selectively violates the least interesting SLA. The least interesting SLA is determined by assigning each SLA an interest index based on its properties. In the experiment, the revenue and the penalty define how interesting it is to keep the task running, and the size and the remaining time act as divisors, since a task with a moderate price but low resource utilization is more profitable than a task with a high price that uses many resources over a long time period. In short, the formula used to define the SLA interest index is

$$I_{SLA} = \frac{Prc - Cex + Pen}{C \, t_{left}}$$

If the least interesting SLA is the incoming one, it will be rejected and no running tasks will be cancelled.

3. Task reallocation: when a task cannot be allocated, the EERM tries to redistribute resources in order to decrease fragmentation and allow the incoming task to be placed. This experiment uses a Branch and Bound algorithm [37] to enumerate the possible task migration sequences and to find the one that yields the optimal revenue. In order to prevent a combinatorial explosion, the following limitations have been applied to the tree:

• If the incoming SLA size is larger than the total amount of all the free spaces in the resource pool, the system will not explore this branch.

• If the price of the incoming SLA is smaller than the cost of the migrations proposed in the branch, the system will stop exploring this branch. This will also automatically limit the depth of the search tree.

4. Combination of policies: first, the system tries to perform task reallocation; if this is not possible, the least interesting SLA is violated.
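The selective violation policy can be sketched as follows. The interest index is the revenue plus the penalty divided by the size and the remaining time, as described for policy 2. The sketch makes a simplifying assumption that is not in the paper: it treats the pool as a single resource in which cancelling one task is enough to make room for the incoming one. The `Task` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    price: float      # Prc
    exec_cost: float  # Cex
    penalty: float    # Pen
    cpus: int         # C
    time_left: int    # remaining time in slots, assumed > 0


def interest_index(task):
    """(Prc - Cex + Pen) / (C * t_left): lower values are less worth keeping."""
    return (task.price - task.exec_cost + task.penalty) / (task.cpus * task.time_left)


def selective_violation(running, incoming):
    """Return (tasks kept running, violated SLA) when the incoming task does not fit."""
    victim = min(running + [incoming], key=interest_index)
    if victim is incoming:
        # The incoming SLA is the least interesting: reject it, cancel nothing.
        return list(running), victim
    # Otherwise cancel the victim and admit the incoming task in its place.
    return [t for t in running if t is not victim] + [incoming], victim
```

For example, with a running task of price 6, cost 4, penalty 1, 4 CPUs, and 5 remaining slots (index 0.15) and an incoming task of price 7, cost 4, penalty 4, 4 CPUs, and 2 slots (index 0.875), the running task is the one whose SLA is violated. The combined policy would call a routine like this only after task reallocation has failed.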

7.2. Experiment results

In order to evaluate how the allocation policies behave in several scenarios, the experiment is repeated in a pool of resources with 8 CPUs per node, ranging from one to 75 machines in the pool. Figure 5 shows the values used for the formulas explained in Section 7.1.1. A fixed migration cost is set to 1 currency unit.

Figure 5. Experiment parameters.

Figure 6 shows the revenue comparison between several policies and the no-policy scenario. It can be seen that when resources in the pool become scarce and the pool becomes overloaded, it is always better to use any economic policy.

Figure 6. Revenue comparison between policy usage (continuous line) and no policy (dotted line). Each graph describes (a) selective violation, (b) task migration, and (c) combination.

In this paper, the quantitative results have a secondary role, as the revenue is expected to increase with a better reallocation algorithm. Finding an optimal SLA management policy is out of the scope of this paper. Even though optimal solutions have not been applied, the gain can increase dramatically in a scenario with very scarce resources. In Figure 7 it can be observed that task reallocation yields an increase of 10% in the gain, and that selective SLA violation and the combined policy provide much greater benefits compared with using no policy (up to more than 60% in the worst scenarios).

In a system such as SORMA this kind of system overload is not usual, as the rate of SLAs that the provider cannot fulfill is small. However, some special situations should be considered, e.g. a crash in a large portion of the resource pool, where using a good economic policy can save the provider a significant amount of money.

8. CONCLUSIONS AND FUTURE WORK

The work reported in this paper is motivated by the need to extend traditional resource management with economic parameters to support emerging Grid markets. Within such a market, resource providers must consider issues relating to current market conditions, QoS, revenue maximization, economic sustainability, and reputation, if they are to operate effectively.

In particular, we focus on revenue maximization using SLAs and describe how a strategic approach to managing SLAs can be used to secure optimal profit in situations where resources are scarce. Using our methods for selective SLA fulfillment and violation, the resource provider can determine which tasks should be pre-empted in favor of freeing up resources for more lucrative SLAs. For example, it may be more profitable to violate an existing SLA, and pay the associated penalty, than to checkpoint and redistribute existing tasks, so that all SLAs can be fulfilled.

Figure 7. Percentage of gain when using different policies in scarce resources scenario.

A prototype EERM is introduced and its architecture described. The EERM is a first attempt at providing strategic SLAE within a Grid market and forms part of the market mechanisms currently being implemented by the SORMA project. The simulation demonstrated that using economic enhancements in a resource manager can be more cost effective for service providers, especially when such providers try to fulfill the SLAs in scenarios with insufficient resources.

Future work will include the identification of policies and parameters suitable for enforcing revenue maximization given a number of different resource scenarios. The aim is to understand how to determine an optimal solution (or sub-optimal if the computation cost is too high) across a resource pool when complex policies and multiple economic parameters are at play. Another line of work will address how EERMs can be used to provide input to the market so that the negotiation process between customers and providers results in the generation of more accurate SLAs.

ACKNOWLEDGEMENTS

We would like to thank Mark Baker, SSE, University of Reading, for his comments during the writing of this paper. This work is supported by the Ministry of Science and Technology of Spain and the European Union (FEDER funds) under contract TIN2007-60625 and Commission of the European Communities under IST contract 034286 (SORMA). Thanks are also due to Martijn Warnier and Thomas Quillinan of Vrije University, Amsterdam, The Netherlands, for discussions about types of violations that could arise in SLAs.

REFERENCES

1. Self-organizing ICT Resource Management (SORMA). http://www.sorma-project.eu (online) [25 July 2008].
2. Kenyon C, Cheliotis G. Grid resource commercialization: Economic engineering and delivery scenarios. Grid Resource Management: State of the Art and Future Trends. Kluwer Academic Publisher: Norwell, MA, U.S.A., 2004; 465–478.
3. Network.com. http://www.network.com (online) [25 July 2008].
4. Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2 (online) [25 July 2008].
5. Pueschel T, Borissov N, Neumann D, Macias M, Guitart J, Torres J. Extended resource management using client classification and economic enhancements. e-Challenges. Expanding the Knowledge Economy: Issues, Applications, Case Studies. Part 1, ICT for Networked Enterprise, Applications (Information and Communication Technologies and the Knowledge Economy, vol. 4), Cunningham P, Cunningham M (eds.). IOS Press: The Hague, The Netherlands, 2007; 65–72. ISBN 978-1-58603-801-4, ISSN 1574-1230.
6. Pueschel T, Borissov N, Macias M, Neumann D, Guitart J, Torres J. Economically enhanced resource management for internet service utilities. Eighth International Conference on Web Information Systems Engineering (WISE’07), Nancy, France, December 2007; 335–348.
7. Chicco G, Napoli R, Piglione F. Comparisons among clustering techniques for electricity customer classification. IEEE Transactions on Power Systems 2006; 21(2):933–940.
8. Poggi N, Moreno T, Berral JL, Gavaldà R, Torres J. Web customer modeling for automated session prioritization on high traffic sites. User Modeling Conference, Greece, June 2007; 450–454.
9. Djemame K, Kao O. Risk management in Grid computing. University of Swansea Report Series CSR 7-2006, 2006.
10. Voss K. Risk-aware migrations for prepossessing SLAs. International Conference on Networking and Services (ICNS’06), Santa Clara, U.S.A., July 2006; 68–68.
11. AssessGrid. http://www.assessgrid.eu (online) [25 July 2008].
12. Chunlin L, Layuan L. An optimization approach for decentralized QoS-based scheduling based on utility and pricing in Grid computing. Concurrency and Computation: Practice and Experience 2007; 19(1):107–128.
13. Dash RK, Jennings NR, Parkes DC. Computational mechanism design: A call to arms. IEEE Intelligent Systems 2003; 18(6):40–47.
14. Dash RK, Vytelingum P, Rogers A, David E, Jennings NR. Market-based task allocation mechanisms for limited capacity suppliers. IEEE Transactions on Systems Man and Cybernetics (Part A) 2007; 37(3):391–405.
15. Oguz A, Campbell AT, Kounavis ME, Liao RF. The Mobiware toolkit: Programmable support for adaptive mobile networking. IEEE Personal Communications Magazine (Special Issue on Adapting to Network and Client Variability) 1998; 5:32–43.
16. Bochmann G, Hafid A. Some principles for quality of service management. Technical Report 1, Université de Montréal, 1996.
17. Deora V, Shao J, Gray WA, Fiddian NJ. A quality of service management framework based on user expectations. International Conference on Service Oriented Computing (ICSOC), Trento, Italy, December 2003; 104–114.
18. Foster IT, Fidler M, Roy A, Sander V, Winkler L. End-to-end quality of service for high-end applications. Computer Communications 2004; 27(14):1375–1388.
19. von Laszewski G, Wagstrom P. Gestalt of the Grid. Tools and Environments for Parallel and Distributed Computing (Series on Parallel and Distributed Computing). Wiley: New York, 2004; 149–187.
20. Foster I, Kesselman C, Tuecke S. The anatomy of the Grid: Enabling scalable virtual organizations. International Journal on High Performance Computing Applications 2001; 15(3):200–222.
21. The globus alliance. http://www.globus.org/ (online) [25 July 2008].
22. Czajkowski K, Fitzgerald S, Foster I, Kesselman C. Grid information services for distributed resource sharing. Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, Redondo Beach, CA, U.S.A., 2001; 181–194.
23. Sahai A, Graupner S, Machiraju V, Moorsel A. Specifying and monitoring guarantees in commercial Grids through SLA. Proceedings of the 3rd IEEE/ACM CCGrid Conference, Tokyo, Japan, May 2003.
24. Nou R, Julià F, Guitart J, Torres J. Dynamic resource provisioning for self-adaptive heterogeneous workloads in SMP hosting platforms. ICE-B 2007, International Conference on E-Business, Barcelona, Spain, July 2007.
25. Gupta D, Cherkasova L, Gardner R, Vahdat A. Enforcing performance isolation across virtual machines in Xen. Middleware (Lecture Notes in Computer Science, vol. 4290), van Steen M, Henning M (eds.). Springer: Berlin, 2006; 342–362.
26. Web Services Agreement specification. http://www.ogf.org/documents/GFD.107.pdf (online) [25 July 2008].
27. Joita L, Rana OF, Chacin P, Chao I, Freitag F, Navarro L, Ardaiz O. Application deployment using catallactic Grid middleware. MGC’05: Proceedings of the 3rd International Workshop on Middleware for Grid Computing. ACM Press: New York, NY, U.S.A., 2005; 1–6.
28. Kounev S, Nou R, Torres J. Using QPN to add QoS to Grid middleware. Technical Report UPC-DAC-RR-CAP-2007-4, Universitat Politècnica de Catalunya, 2007.
29. Horrocks I, Patel-Schneider PF, Boley H, Tabet S, Grosof B, Dean M. SWRL: A semantic Web rule language combining OWL and RuleML. W3C Member submission 21 May 2004, Technical Report, 2004. http://www.w3.org/Submission/SWRL/ (online) [25 July 2008].
30. Yeo CS, Buyya R. Pricing for utility-driven resource management and allocation in clusters. International Journal of High Performance Computing Applications 2007; 21(4):405–418.
31. Baker M, Grove M. A virtual registry for wide-area messaging. IEEE International Conference on Cluster Computing. IEEE: Singapore, 2006; 1–10.
32. Job Submission Description Language (JSDL) Work Group. http://forge.gridforum.org/projects/jsdl-wg (online) [25 July 2008].
33. GridSAM, Grid Job Submission and Monitoring Web Service. http://gridsam.sourceforge.net/ (online) [25 July 2008].
34. Baker M, Smith G. GridRM: An extensible resource monitoring system. IEEE International Conference on Cluster Computing, Tokyo, Japan, 2003; 207.
35. Eisenberg A, Melton J, Kulkarni K, Michels J, Zemke F. SQL: 2003 has been published. SIGMOD Record, vol. 33(1), 2004.
36. Andreozzi S. GLUE schema implementation for the LDAP data model. Technical Report, Istituto Nazionale di Fisica Nucleare, September 2004.
37. Lawler EL, Wood DE. Branch-and-bound methods: A survey. Operations Research 1966; 14(4):699–719.

