Grid load balancing using intelligent agentscaoj/pub/doc/jcao_j_gridagent.pdf136 J. Cao et al. /...

Future Generation Computer Systems 21 (2005) 135–149

Grid load balancing using intelligent agents

Junwei Caoa,∗,1, Daniel P. Spoonerb, Stephen A. Jarvisb, Graham R. Nuddb

a Center for Space Research, Massachusetts Institute of Technology, Cambridge, MA, USAb Department of Computer Science, University of Warwick, Coventry, UK

Available online 28 October 2004

Abstract

Scalable management and scheduling of dynamic grid resources requires new technologies to build the next generationintelligent grid environments. This work demonstrates that AI techniques can be utilised to achieve effective workload andresource management. A combination of intelligent agents and multi-agent approaches is applied to both local grid resourcescheduling and global grid load balancing. Each agent is a representative of a local grid resource and utilises predictive applicationperformance data with iterative heuristic algorithms to engineer local load balancing across multiple hosts. At a higher level,agents cooperate with each other to balance workload using a peer-to-peer service advertisement and discovery mechanism.© 2004 Elsevier B.V. All rights reserved.

Keywords:Load balancing; Grid computing; Intelligent agents; Genetic algorithm; Service discovery

1

itfigb

tg

s

ility.re-

ime.andge-mustking

pow-m-ndingmicle-ment

g

0

. Introduction

Grid computing originated from a new comput-ng infrastructure for scientific research and coopera-ion [36,20]and is becoming a mainstream technologyor large-scale resource sharing and distributed systemntegration[21]. Current efforts towards making thelobal infrastructure a reality provide technologies onoth grid services and application enabling[6].

Workload and resource management are essen-ial functions provided at the service level of therid software infrastructure. Two main challenges that

∗ Corresponding author.E-mail address:[email protected] (J. Cao).

1 This work was carried out when the author was with the Univer-ity of Warwick.

must be addressed are scalability and adaptabGrid resources are geographically distributed andsource performance can change quickly over tGrid users submit tasks with different resourcequality of service (QoS) requirements. For manament and scheduling to be effective, such systemsdevelop intelligent and autonomous decision-matechniques.

Software agents have been accepted to be aerful high-level abstraction for modelling of coplex software systems[26]. In our previous work, aagent-based methodology is developed for buillarge-scale distributed systems with highly dynabehaviours[9,10]. This has been used in the impmentation of an agent-based resource managesystem for metacomputing[11] and grid computin[12–14].

167-739X/$ – see front matter © 2004 Elsevier B.V. All rights reserved.doi:10.1016/j.future.2004.09.032

136 J. Cao et al. / Future Generation Computer Systems 21 (2005) 135–149

This work focuses on grid load-balancing issuesusing a combination of both intelligent agents andmulti-agent approaches. Each agent is responsible forresource scheduling and load balancing across multi-ple hosts/processors in a local grid. The agent couplesapplication performance data with iterative heuristic al-gorithms to dynamically minimise task makespan andhost idle time, while meeting the deadline requirementsfor each task. The algorithm is based on an evolutionaryprocess and is therefore able to absorb system changessuch as the addition or deletion of tasks, or changesin the number of hosts/processors available in a localgrid.

At the global grid level, each agent is a representa-tive of a grid resource and acts as a service providerof high performance computing power. Agents are or-ganised into a hierarchy and cooperate with each otherto discover available grid resources for tasks using apeer-to-peer mechanism for service advertisement anddiscovery.

Agents are equipped with existing PACE applica-tion performance prediction capabilities[8,32]. Thekey features of the PACE toolkit include a good level ofpredictive accuracy, rapid evaluation time and a methodfor cross-platform comparison. These features enablePACE performance data to be utilized in real-time foragents to perform grid resource scheduling[15,16].

Several metrics are considered to measure the load-balancing performance of grid agents. A case study isincluded and corresponding results conclude that in-t ncep iced rallr ppli-c urceu

2

lti-a archya

2

ageh uling

incoming tasks to achieve local load balancing. Eachagent provides a high-level representation of a grid re-source and therefore characterises these resources ashigh performance computing service providers in awider grid environment. The layered structure of eachagent is explained below:

• Communication layer. Agents in the system must beable to communicate with each other or with usersusing common data models and communication pro-tocols. The communication layer provides an agentwith an interface to heterogeneous networks and op-erating systems.

• Coordination layer. The request an agent receivesfrom the communication layer should be explainedand submitted to the coordination layer, which de-cides how the agent should act on the request accord-ing to its own knowledge. For example, if an agentreceives a service discovery request, it must decidewhether it has related service information. This isdescribed in detail in Section4.

• Local management layer. This layer performs func-tions of an agent for local grid load balancing.Detailed scheduling algorithms are described in Sec-tion 3. This layer is also responsible for submittinglocal service information to the coordination layerfor agent decision making.

2.2. Agent hierarchy

velg ri ina-t odesa

elligent agents, supported by application performarediction, iterative heuristic algorithms and serviscovery capabilities, are effective to achieve oveesource scheduling and load balancing, improve aation execution performance and maximise resotilisation.

. Grid agents

This work combines intelligent agents and mugent approaches. The agent structure and hierre described below.

.1. Agent structure

Each agent is implemented so it can manosts/processors for a local grid resource, sched

Agents are organised hierarchically in a higher lelobal grid environment, as shown inFig. 1. The broke

s an agent that heads the whole hierarchy. A coordor is an agent that heads a sub-hierarchy. Leaf-nre simply termedagentsin this model.

Fig. 1. Agent hierarchy.

J. Cao et al. / Future Generation Computer Systems 21 (2005) 135–149 137

Brokers, coordinators and agents are differentiatedby their position in the hierarchy but not by their func-tionality. Each node has the same capabilities and prior-ity for service and could perform services on behalf ofany other node. This homogeneous hierarchy providesa high-level abstraction of a grid environment.

The agent hierarchy can represent an open and dy-namic system. New agents can join the hierarchy orexisting agents can leave the hierarchy as appropriate.The hierarchy exists only logically and each agent cancontact others as long as it has their identities.

The hierarchical model partly addresses the issuesof scalability. When the number of agents increases,the hierarchy may lead to many system activities beingprocessed in a local domain. In this way the system mayscale well and does not need to rely on one or a fewcentral agents, which may otherwise become systembottlenecks.

Service is another important concept. An agent pro-vides a simple and uniform abstraction of the functions(clients, services and go-betweens) in the grid manage-ment system. The service information provided at eachlocal grid resource can be advertised throughout thehierarchy and agents can cooperate with each other todiscover available resources. These are introduced indetail in Section4.

2.3. Performance prediction

Performance prediction for parallel programs playsa linga tingP ies.

theP theP tedi ro-g ud-i llelp top ecu-t lua-t nced CEt e ofP ndg ec-t

3. Local grid load balancing

In this section, a local grid resource is consid-ered to be a cluster of workstations or a multipro-cessor, which is abstracted uniformly as peer-to-peernetworked hosts. Two algorithms are considered in thelocal management layer of each agent to perform localgrid load balancing.

3.1. First-come-first-served algorithm

Consider a grid resource withn hosts where eachhostHi has its own type tyi . A PACE resource modelcan be used to describe the performance informationof this host.

Let m be the number of considered tasksT. Thearrival time of each taskTj is tj . A PACE applica-tion model tmj can be used to describe the applicationlevel performance information of each task. The userrequirement of deadline for the task execution is rep-resented as trj . Each taskTj also has two scheduledattributes – a start time tsj and an end time tej .MTj is the set of hosts that are allocated to taskTj .

M then is a 2D array, which describes the mappingrelationships between hosts and tasks using Booleanvalues.

M = {Mij|i = 1, 2, . . . , n; j = 1, 2, . . . , m} (1)

M

{1, if Hi ∈ MTj

rfor-m tionm ub-sc sed asf

∀

s tofi lete,a

t

key role for agents to perform resource schedund load balancing. Agents are integrated with exisACE application performance prediction capabilit

The PACE evaluation engine is the kernel ofACE toolkit. The evaluation engine combinesACE resource model (including performance rela

nformation of the hardware on which the parallel pram will be executed) and application model (incl

ng all performance related information of the pararogram, e.g. MPI or PVM programs) at run timeroduce evaluation results, e.g. estimation of ex

ion time. Agents are equipped with the PACE evaion engine and use predictive application performaata for scheduling. Detailed introduction to the PA

oolkit is out of the scope of this paper but the usACE performance prediction for both local grid alobal grid load balancing is described below in S

ions3 and 4, respectively.

ij =0, if Hi /∈ MTj

(2)

The PACE evaluation engine can produce peance prediction information based on the applicaodel tmj and resource models ty. An appropriate s

et of hostsH (note thatH cannot be an empty setΦ)an be selected, and this is evaluated and expresollows:

H ⊆ H, H = Φ, ty ⊆ ty, ty = Φ,

texej = eval(ty, tmj) (3)

The function of the agent local management ind the earliest possible time for each task to compdhering to the sequence of the task arrivals.

ej = min∀H⊆H,H =Φ

(tej) (4)


A task has the possibility of being allocated to anyselection of hosts. The agent should consider all thesepossibilities and choose the earliest task end time. Inany of these situations, the end time is equal to the ear-liest possible start time plus the execution time, whichis described as follows:

tej = tsj + texej (5)

The earliest possible start time for the taskTj ona selection of hosts is the latest free time of all theselected hosts if there are still tasks running on the se-lected hosts. If there is no task running on the selectedhosts when the taskTj arrives at timetj , Tj can be exe-cuted on these hosts immediately. These are expressedas follows:

tsj = max

(tj, max

∀i,Hi ∈ H

(tdij)

)(6)

where tdij is the latest free time of hostHi at the timetj . This equals the maximum end times of tasks that areallocated to the hostHi before the taskTj arrives:

tdij = max∀p<j,Mip=1

(tep) (7)

In summary, tej can be calculated as follows:

tej = min∀H⊆H,H =Φ( ( ( )) )

ostst oneh er ifo otherh omel

er-m s. Iti re-s in-c me-fi taska ringt ther,

but will increase the algorithm complexity. This is ad-dressed using an iterative heuristic algorithm describedbelow.

3.2. Genetic algorithm

When tasks can be reordered, the scheduling objec-tive is also changed. Rather than looking for an earliestcompletion time for each task individually, the schedul-ing algorithm described in this section focuses on themakespanω, which represents the latest completiontime when all the tasks are considered together and issubsequently defined as:

ω = max1≤j≤m

{tej} (9)

The goal is to minimise function(9), at the same time∀j, tej ≤ trj should also be satisfied as far as possible.In order to obtain near optimal solutions to this com-binatorial optimisation problem, the approach taken inthis work is to find schedules that meet the above cri-teria through the use of an iterative heuristic method– in this case a genetic algorithm (GA). The processinvolves building a set of schedules and identifying so-lutions that have desirable characteristics. These arethen carried into the next generation.

The technique requires a coding scheme that canrepresent all legitimate solutions to the optimisationproblem. Any possible solution is uniquely representedb d inv earo ro-c inga e-q ledt

rob-ls uted,a o-c esi -s

ersm ardts et

max tj, max∀i,Hi ∈ H

max∀p<j,Mip=1

(tep) + texej

(8)

It is not necessarily the case that scheduling all ho a task will achieve higher performance. On theand, the start time of task execution may be earlinly a number of processors are selected; on theand, with some tasks, execution time may bec

onger if too many hosts are allocated.The complexity of the above algorithm is det

ined by the number of possible host selections clear that if the number of hosts of a gridource increases, the scheduling complexity willrease exponentially. This is based on a first-corst-served policy that means the sequence of therrivals determines that of task executions. Reorde

he task set may optimise the task execution fur

y a particular string, and strings are manipulatearious ways until the algorithm converges on a nptimal solution. In order for this manipulation to peed in the correct direction, a method of prescribquality value (orfitness) to each solution string is ruired. The algorithm for providing this value is cal

he fitness functionfv.The coding scheme we have developed for this p

em consists of two parts: an ordering partSk, whichpecifies the order in which the tasks are to be execnd a mapping partMijk , which specifies the host allation to each task. Letk be the number of scheduln the scheduling set. The ordering ofMijk is commenurate with the task order.

A combined cost function is used which considakespan, idle time and deadline. It is straightforw

o calculate the makespan,ωk, of the schedulek repre-ented bySk andMijk . LetTjk be the reordered task s


according to the ordering part of the coding scheme,Sk.

tsjk = max∀i,Mijk=1

(max

∀p<j,Mipk=1(tepk)

)(10)

tejk = tsjk + texejk (11)

ωk = max1≤j≤m

{tejk} (12)

The timing calculation described above is similarto that given in the function(8). One difference is thatsince all of the tasks are considered together, the or-der is defined according toSk instead of the task arrivaltimetj . So the consideration oftj is not necessary in(10)as opposed to the function(6). Another aspect is thatthe host selection is defined usingMijk and the PACEevaluation result texejk is calculated directly using cor-responding resource models, while in the function(8),different possible host selectionsH have all to be con-sidered and compared.

The nature of the idle time should also be taken intoaccount. This is represented using the average idle timeof all hostsϕk.

ϕk = max1≤j≤m

{tejk} − min1≤j≤m

{tsjk}

−∑m

j=1∑n

i=1Mijk(tejk − tsjk)

n(13)

The contract penaltyθk is derived from the expecteddeadline times tr and task completion time te.

θ

a actp

f

ions lisedc r usew nc-t oml egit-i overb ewt ary)

crossover. The reordering is necessary to preserve thenode mapping associated with a particular task fromone generation to the next. The mutation stage is alsotwo-part, with a switching operator randomly appliedto the ordering parts, and a random bit-flip applied tothe mapping parts.

In the actual agent implementation using the abovealgorithm, the system dynamism must be considered.One advantage of the iterative algorithm described inthis section is that it is an evolutionary process and istherefore able to absorb system changes such as theaddition or deletion of tasks, or changes in the numberof hosts available in the grid resource.

The two scheduling algorithms are both imple-mented and can be switched from one to another inthe agent. The algorithms provide a fine-grained so-lution to dynamic task scheduling and load balanc-ing across multiple hosts of a local grid resource.However, the same methodology cannot be applieddirectly to a large-scale grid environment, since thealgorithms do not scale to thousands of hosts andtasks. An additional mechanism is required for multi-ple agents to work together and achieve global grid loadbalancing.

4. Global grid load balancing

In this work, a grid is a collection of multiple localgrid resources that are distributed geographically in aw ctioni videt rid-s indi-r tipleg

4

f itsc vicea is in-f wl-e sedi anc rd-i esei

k =∑m

j=1(tejk − trj)

m(14)

The cost value for the schedulek, represented bySkndMijk , is derived from these metrics and their impredetermined by:

kc = Wmωk + Wiϕk + Wcθk

Wm + Wi + Wc(15)

The genetic algorithm utilises a fixed populatize and stochastic remainder selection. Speciarossover and mutation functions are developed foith the two-part coding scheme. The crossover fu

ion first splices the two ordering strings at a randocation, and then reorders the pairs to produce lmate solutions. The mapping parts are crossedy first reordering them to be consistent with the n

ask order, and then performing a single-point (bin

ide area. The problem that is addressed in this ses the discovery of available grid resources that prohe optimum execution performance for globally gubmitted tasks. The service discovery processectly results in a load-balancing effect across mulrid resources.

.1. Service advertisement and discovery

An agent takes its local grid resource as one oapabilities. An agent can also receive many serdvertisements from nearby agents and store th

ormation in its coordination layer as its own knodge. All of the service information are organi

nto Agent Capability Tables (ACTs). An agent choose to maintain different kinds of ACTs accong to different sources of service information. Thnclude:


• T ACT: In the coordination layer of each agent,T ACT is used to record service information of thelocal grid resource. The local management layer isresponsible for collecting this information and re-porting it to the coordination layer.

• L ACT: Each agent can have one LACT to recordthe service information received from its loweragents in the hierarchy. The services recorded inL ACT are provided by grid resources in its localscope.

• G ACT: The GACT in an agent is actually a recordof the service information received from its up-per agent in the hierarchy. The service informationrecorded in GACT is provided by the agents, whichhave the same upper agent as the agent itself.

There are basically two ways to maintain the con-tents of ACTs in an agent: data-pull and data-push,each of which has two approaches: periodic and event-driven.

• Data-pull: An agent asks other agents for their ser-vice information either periodically or when a re-quest arrives.

• Data-push: An agent submits its service informationto other agents in the hierarchy periodically or whenthe service information is changed.

Apart from service advertisement, another impor-tant process among agents is service discovery. Discov-ering available services is also a cooperative activity.W calg canb her-wi gent,w urcem e re-q ad oft stilln y.

dis-c adb inedl d toa et thea cessd uest,b ided

by a neighbouring agent. While this may decrease theload-balancing effect, the trade-off is reasonable as gridusers prefers to find a satisfactory resource as fast andas local as possible.

The advertisement and discovery mechanism alsoallows possible system scalability. Most requests areprocessed in a local domain and need not to besubmitted to a wider area. Both advertisement anddiscovery requests are processed between neighbour-ing agents and the system has no central structure,which otherwise might act as a potential bottle-neck.

4.2. System implementation

Agents are implemented using Java and data are rep-resented in an XML format. An agent is responsiblefor collecting service information of the local grid re-source. An example of this service information can befound below.<agentgrid type=‘‘service’’>

<address>gem.dcs.warwick.ac.uk</address>

<port>1000</port>

<type>SunUltra10</type>

<nproc>16</nproc>

<environment>mpi</environment>

<environment>pvm</environment >

<environment>test</environment>

<freetime>Nov 1504:43:102001</freetime>

<

d heh alsop thisc sim-p rcea n ex-e rrentam ibedi llye mesa latests x-i omea bal-a ithin

ithin each agent, its own service provided by the lorid resource is evaluated first. If the requiremente met locally, the discovery ends successfully. Otise service information in both LACT and GACT

s evaluated and the request dispatched to the ahich is able to provide the best requirement/resoatch. If no service can meet the requirement, thuest is submitted to the upper agent. When the he

he hierarchy is reached and the available service isot found, the discovery terminates unsuccessfull

While the process of service advertisement andovery is not motivated by grid scheduling and loalancing, it can result in an indirect coarse-gra

oad-balancing effect. A task tends to be dispatchegrid resource that has less workload and can mepplication execution deadline. The discovery prooes not aim to find the best service for each requt endeavours to find an available service prov

/agentgrid>

The agent identity is provided by a tuple of thead-ressand port used to initiate communication. Tardware model and the number of processors arerovided. The example specifies a single cluster, inase a cluster of 16 SunUltra10 workstations. Tolify the problem, the hosts within each grid resoure configured to be homogeneous. The applicatiocution environments that are supported by the cugent implementation include MPI, PVM, and atestode that is designed for the experiments descr

n this work. Undertestmode, tasks are not actuaxecuted and predictive application execution tire scheduled and assumed to be accurate. Thecheduling makespanω indicates the earliest (appromate) time that corresponding grid resource becvailable for more tasks. Due to the effect of loadncing, it is reasonable to assume that all of hosts w


a grid resource have approximately the samefreetime.The agents use this item to estimate the workload ofeach grid resource and make decisions on where todispatch incoming tasks. This item changes over timeand must be frequently updated. Service advertisementis therefore important among the agents.

A portal has been developed which allows users tosubmit requests destined for the grid resources. An ex-ample request is given below.

<agentgrid type=‘‘request’’ >

<application>

<name>sweep3d</name>

<binary>

<file>binary/sweep3d</file>

<inputfile>input/input.50</inputfile>

</binary >

<performance>

<datatype>pacemodel</datatype>

<modelname>model/sweep3d</modelname>

</performance>

</application>

<requirement >

<environment>test</environment>

<deadline>Nov 1504:43:17 2001</deadline>

</requirement >

<email>junwei</email>

</agentgrid>

ap-p fore arye ap-p -m filesa ms.I en-vs ed ast

thea rnelo vicea ardw tione om-p n be

estimated using:

ter = ω + min∀H⊆H,H =Φ,ty⊆ty,ty =Φ

{eval(ty, tmr)} (16)

For a grid resource with homogeneous hosts, thePACE evaluation function is calledn times. If ter ≤ trr ,the resource is considered to be able to meet the re-quired deadline. Otherwise, the resource is not consid-ered available for the incoming task. This performanceestimation of local grid resources at the global levelis simple but efficient. However, when the task is dis-patched to the corresponding agent, the actual situa-tion may differ from the scenario considered in(16).The agent may change the task order and advance orpostpone a specific task execution in order to balancethe workload on different hosts, and in so doing max-imise resource utilisation while maintaining the dead-line contracts of each user.

Service discovery for a request within an agent in-volves multiple matchmaking processes. An agent al-ways gives priority to the local grid resource. Onlywhen the local resource is unavailable is the ser-vice information of other grid resources evaluatedand the request dispatched to another agent. In or-der to measure the effect of this mechanism for gridscheduling and load balancing, several performancemetrics are defined and many experiments are carriedout.

5

val-u ntsd s arep thiss

5

canb hedul-i anced therea atedq t ofs

A user is required to specify the details of thelication, the requirements and contact informationach request. Application information includes binxecutable files and also the corresponding PACElication performance model tmr . In the current impleentation we assume that both binary and modelre pre-compiled and available in all local file syste

n the requirements, both the application executionironment and the required deadline time trr should bepecified. Currently the user’s email address is ushe contact information.

Service discovery processes are triggered byrrival of a request at an agent, where the kef this process is the matchmaking between sernd request information. The match is straightforwhether an agent can provide the required applicaxecution environment. The expected execution cletion time for a given task on a given resource ca

. Performance evaluation

Experiments are carried out for performance eation of grid load balancing using intelligent ageescribed in above sections. Performance metricredefined and experimental results are included inection.

.1. Performance metrics

There are a number of performance criteria thate used to describe resource management and sc

ng systems. What is considered as high performepends on the system requirements. In this work,re several common statistics that can be investiguantitatively and used to characterise the effeccheduling and load balancing.


5.1.1. Total application execution timeThis defines the period of timet, when a set ofm

parallel tasksT are scheduled onto resourcesH with nhosts. Note that the host setH here is slightly differentfrom that defined in Section3.1because it may includethose either at multiple grid resources or within a singlegrid resource.

t = max1≤j≤m

{tej} − min1≤j≤m

{tsj} (17)

5.1.2. Average advance time of applicationexecution completion

This can be calculated directly using:

ε =∑m

j=1(trj − tej)

m(18)

which is negative when most deadlines fail.

5.1.3. Average resource utilisation rateThe resource utilisation rateυi of each hostHi is

calculated as follows:

υi =∑

∀j,Mij=1(tej − tsj)

t× 100% (19)

The average resource utilisation rateυ of all hostsH is:

υ =∑n

i=1υi

n(20)

whereυ is in the range 0–1.

5

d

a el

β

hend oft a gridr ulti-p alsoi ced

across all the considered hosts, the resource utilisationrate is usually high and the tasks finish quickly. An-other metrics that can only applied for measurement ofgrid agents is the number of network packets used forservice advertisement and discovery.

5.2. Experimental design

The experimental system is configured with 12agents, illustrated by the hierarchy shown inFig. 2.

These agents are namedS1, . . ., S12 (for the sakeof brevity) and represent heterogeneous hardware re-sources containing 16 hosts/processors per resource.As shown inFig. 2, the resources range in their com-putational capabilities. The SGI multiprocessor is themost powerful, followed by the SunUltra10, 5, 1, andSPARCStation 2 in turn.

In the experimental system, each agent maintains aset of service information for the other agents in thesystem. Each agent pulls service information from itslower and upper agents in every 10 s. All of the agentsemploy identical strategies with the exception of theagent at the head of the hierarchy (S1) that does nothave an upper agent.

The applications used in the experiments includetypical scientific computing programs. Each applica-tion has been modelled and evaluated using PACE.Different applications have different performance sce-narios. During each experiment, requests for oneo valst ecu-t se-l estp ringw o thea

con-fi ove,d al-g d ass

5

i nd toa

.1.4. Load-balancing levelThe mean square deviation ofυi is defined as:

=√∑n

i=1(υ − υi)2

n(21)

nd the relative deviation ofdoverυ that describes thoad-balancing level of the system is:

=(

1 − d

υ

)× 100% (22)

The most effective load balancing is achieved wequals zero andβ equals 100%. The four aspects

he system described above can be applied both toesource or a grid environment that consists of mle grid resources. These performance metrics are

nterrelated. For example, if the workload is balan

f the test applications are sent at 1-s intero randomly selected agents. The required exion time deadline for the application is alsoected randomly from a given domain. The requhase of each experiment lasts for 10 min duhich 600 task execution requests are sent out tgents.

While the experiments use the same resourcegurations and application workloads described abifferent combinations of local grid schedulingorithms and global grid mechanisms are appliehown inTable 1.

.3. Experimental results

The experimental results are given inTable 2; thisncludes the three metrics applied to each agent all the grid resources in the system.


Fig. 2. Case study: agents and resources.

Table 1Case study: experimental design

Experiment number

1 2 3

First-come-first-served algorithm√

Iterative heuristic algorithm√ √

Service advertisement and discovery√

Table 2Case study: experimental results

Experiment number

1 2 3

ε (s) υ (%) β (%) ε (s) υ (%) β (%) ε (s) υ (%) β (%)

S1 42 7 71 52 9 89 29 81 96S2 11 9 78 34 9 89 23 81 95S3 −135 13 62 23 13 92 24 77 87S4 −328 22 45 −30 28 96 44 82 94S5 −607 32 56 −492 58 95 38 82 94S6 −321 25 56 −123 29 90 42 78 92S7 −261 23 57 10 25 92 38 84 93S8 −695 33 52 −513 52 90 42 82 91S9 −806 45 58 −724 63 90 30 80 84S10 −405 28 61 −129 34 94 25 81 94S11 −1095 44 50 −816 73 92 35 75 89S12 −859 41 46 −550 67 91 26 78 90

Total −475 26 31 −295 38 42 32 80 90

5.3.1. Experiment 1In the first experiment, each agent is configured with

the first-come-first-served algorithm for local grid re-source scheduling. Agents are not organised for co-operation. The experimental scenario is visualised inFig. 3.

The algorithm does not consider makespan, idletime or deadline. Each agent receives approximately


Fig. 3. Experimental scenario I.

50 task requests on average, which results in only thepowerful platforms (SGI multiprocessorsS1 andS2)meeting the requirements. The slower machines in-cluding the Sun SPARCStations clustersS11 andS12impose serious delays in task execution with long taskqueues (seeFig. 3). The total task execution time isabout 46 min. The overall average delay for task execu-tion is approximately 8 min. It is apparent that the highperformance platforms are not utilised effectively, andthe lack of proper scheduling overloads clusters likeS11 that is only 44% utilised. The average utilisation ofgrid resources is only 26%. The workload for each hostin each grid resource is also unbalanced. For example,the load-balancing level ofS12 is as low as 46%. Theoverall grid workload is also unbalanced at 31%.

5.3.2. Experiment 2In experiment 2, the iterative heuristic algorithm is

used in place of the first-come-first-serve algorithm al-though no higher level agent cooperation mechanismis applied. The experimental scenario is visualised inFig. 4.

The algorithm aims to minimise makespan and idletime, while meeting deadlines. Compared to those ofexperiment 1, almost all metrics are improved. Taskexecutions are completed earlier. The total task exe-cution time is improved from 46 to 36 min and theaverage task execution delay is reduced to approxi-mately 5 min. However, resources such asS11 andS12remain overloaded and the GA scheduling is not ablet ner-a bet-t sf lsoi on

Fig. 4. Experimental scenario II.

each grid resources is significantly improved, the lackof any higher level load-balancing mechanism resultsin a slightly improved overall grid load balancing to42% (as opposed to 31% in experiment 1).

5.3.3. Experiment 3In experiment 3, the service advertisement and dis-

covery mechanism is enabled for high-level load bal-ancing. The experimental scenario is visualised inFig. 5.

Service discovery results in a new distribution of re-quests to the agents, where the more powerful platformreceives more requests. As shown inFig. 5, powerfulplatform likeS1 receives 16% of tasks, which is fourtimes of tasks received by relatively slow platformS11.The total task execution time is also dramatically de-creased to 11 min. As a result, the majority of task ex-ecution requirements can be met and all grid resourcesare well utilised (80% on average). The load balanc-ing of the overall grid is significantly improved from42% (in experiment 2) to 90%. The load balancing on

o find solutions that satisfy all the deadlines. Gelly, resources are better utilised as a result of the

er scheduling, such as the use ofS11 that increaserom 44 to 73%. The overall average utilisation amproves from 26 to 38%. While load balancing
Fig. 5. Experimental scenario III.


Fig. 6. Case study: trends I for experimental results on advance timesof application execution completionε.

resources such asS1 andS2 are only marginally im-proved by the GA scheduling when the workload ishigher. None of other agents show an improvement inlocal grid load balancing.

Experimental results inTable 2are also illustrated inFigs. 6–8, showing the effect on the performance met-rics given in Section5.1. The curves indicate that differ-ent platforms exhibit different trends when agents areconfigured with more scheduling and load-balancingmechanisms. Among these the curves forS1,S2 (whichare the most powerful) andS11,S12 (which are the leastpowerful) are representative and are therefore empha-sised, while others are indicated using grey lines. Thecurve for the overall grid is illustrated using a boldline.

F urceu

Fig. 8. Case study: trends III for experimental results on load-balancing levelβ.

5.3.4. Application executionIn Fig. 6, it is apparent that both the GA schedul-

ing and the service discovery mechanism contribute toimproving the application execution completion.

The curve implies that the more a resource is loadedthe more significant the effect is. For example,S1 andS2 are not overloaded during the three experiments, andtherefore the value ofε only changes slightly.S11 andS12 are heavily overloaded during the experiments 1and 2, and therefore the improvement ofε in the ex-periments 2 and 3 is more significant. The situationsof S3, . . ., S10 are distributed between these two ex-tremes. The curve for the overall grid provides an av-erage estimation for all situations, which indicates thatthe service discovery mechanism contributes more to-wards the improvement in application executions thanGA scheduling.

5.3.5. Resource utilisationThe curves inFig. 7illustrate similar trends to those

of Fig. 6. S1, S2 andS11, S12 still represent the twoextreme situations between which the other platformsare distributed.

The curve for the overall grid indicates that the ser-vice discovery mechanism contributes more to max-imising resource utilisation. However, overloaded plat-forms like S11 andS12 benefit mainly from the GAscheduling, which is more effective at load balancingwhen the workload is high; lightly loaded platformsl v-e hem.

ig. 7. Case study: trends II for experimental results on resotilisation rateυ.

ike S1 andS2 chiefly benefit from the service discory mechanism, which can dispatch more tasks to t


5.3.6. Load balancingCurves inFig. 8 demonstrate that local and global

grid load balancing are achieved in different ways.WhileS1,S2 andS11,S12 are two representative sit-

uations, the global situation is not simply an averageof local trends as those illustrated inFigs. 6 and 7. Inthe second experiment, when the GA scheduling is en-abled, the load balancing of hosts or processors withina local grid resource are significantly improved. In thethird experiment, when the service discovery mech-anism is enabled, the overall grid load balancing isimproved dramatically. It is clear that the GA schedul-ing contributes more to local grid load balancing andthe service discovery mechanism contributes more toglobal grid load balancing. The coupling of both asdescribed in this work is therefore a good choice toachieve load balancing at both local and global gridlevels.

5.4. Agent performance

Additional experiments are carried out to comparethe performance of grid agents when different ser-vice advertisement and discovery strategies are applied.These are introduced briefly in this section.

A centralised controlling mechanism is designed forthe agents. Each agent is assumed to have the pre-knowledge of any other agents. Each time an agentreceives a task execution request, it will contact all ofthe other agents for quoting of completion time. Theb o thea ac-t anst cov-e

lasts a-t nts.T pro-c eanst cov-e elowi enta gentp

t in-c cedi ri-

Fig. 9. Comparison of total application execution time between thecentralised and distributed strategies.

ments, the number of grid agents is changed to enablethe system scalability to be investigated.

5.4.1. Total application execution timeFig. 9 provides a comparison of total application

execution time for the two strategies.The total task execution time decreases when the

number of agents and grid resources increases. It isclear that the centralised strategy leads to a bit bet-ter load-balancing results, since tasks finish in a lesstime under the centralised control. This is more obvi-ous when the number of the agents increases.

It is reasonable that a centralised strategy canachieve a better scheduling, because full service adver-tisement leads to full knowledge on the performance ofall grid resources. However, under a distributed mecha-nism, each agent has only up-to-date information on itsneighbouring agents, which limit the scheduling effect.

5.4.2. Average advance time of applicationexecution completion

Similar comparison for the two strategies is includedin Fig. 10in terms of the average advance time of ap-plication execution time.

F xecu-t ies.

est bid is chosen and the request is dispatched tvailable grid resource directly in one step. This isually an event-driven data-pull strategy, which mehat full advertisement results in no necessary disry.

The service advertisement strategy used in theection is periodic data-pull, where service informion is only transferred among neighbouring agehis results that service discovery has also to beessed step by step. This distributed strategy mhat not full advertisement results in necessary disry steps. The experimental results introduced b

ndicate that balancing the overhead for advertisemnd discovery in this way can lead to a better aerformance.

The details of the experimental design are noluded, though actually very similar to that introdun Section5.2. One difference is that in these expe

ig. 10. Comparison of average advance time of application eion completion between the centralised and distributed strateg


Fig. 11. Comparison of network packets between the centralised anddistributed strategies.

Tasks are executed quicker when the number ofagents increases. It is clear that the centralised strat-egy leads to a bit better result again. The reason issimilar to that described in the last section. The re-sult values are negative since the workload of theseexperiments is quite heavy and grid resources can-not meet the deadline requirements of task executionaveragely.

5.4.3. Network packetsA different result is included inFig. 11that provides

a comparison of the network packets involved duringthe experiments of the two strategies.

The number of network messages used for serviceadvertisement and discovery increases linearly with thenumber of agents. It is clear that the distributed strategysignificantly decreases the amount of network traffic.The strategy of only passing messages among neigh-bouring agents improves the system scalability as theagent number increases.

6. Related work

In this work, local grid load balancing is performedin each agent using AI scheduling algorithms. The on-the-fly use of predictive performance data for schedul-ing described in this work is similar to that of Ap-pLeS[5], Ninf [30] and Nimrod[2]. While AppLeSa asedo theN s an ra-m -i ems,

such as Condor[28], EASY [27], Maui [25], LSF[40]and PBS[24]. Most of these support batch queuing us-ing the FCFS algorithm. The main advantage of GAscheduling used in this work for job scheduling is thequality of service and multiple performance metricssupport.

This work also focuses on the cooperation of localgrid and global grid levels of management and schedul-ing. OGSA and its implementation, the Globus toolkit[19], is the standard for grid service and applicationdevelopment, which is based on web services proto-cols and standards[31]. Some existing systems use theGlobus toolkit to integrate with the grid computing en-vironment, including Condor-G[22], Nimrod/G [3],though a centralised control structure is applied in bothimplementations. Another grid computing infrastruc-ture, Legion[23], is developed using an object-orientedmethodology that provides similar functionalities to theGlobus. In this work, a multi-agent approach is consid-ered. Agents are used to control the query process andto make resource discovery decisions based on inter-nal logic rather than relying on a fixed-function queryengine.

Agent-based grid management is also used inJAMM [7,38] and NetSolve[17,18], where a cen-tralised broker/agents architecture is developed. In thiswork, agents perform peer-to-peer service advertise-ment and discovery to achieve global grid load bal-ancing. Compared with another “Agent Grid” work

ofbal-ge-renteralent-f

er-in

her

id-grid

van-ide ahatre

nd Ninf management and scheduling are also bn performance evaluation techniques, they utiliseWS [39] resource monitoring service. Nimrod haumber of similarities to this work, including a paetric engine and heuristic algorithms[1] for schedul

ng jobs. There are also many job scheduling syst

described in[33], rather than using a collectionmany predefined specialised agents, grid loadancing in this work uses a hierarchy of homoneous agents that can be reconfigured with differoles at running time. While there are also sevother related projects that have a focus on agbased grid computing[29,34,37], the emphases othese works are quite different. In this work, pformance for grid load balancing is investigateda quantitative way that cannot found in any otwork.

There are many other enterprise computing and mdleware technologies that are being adopted formanagement, such as CORBA[35] and Jini[4]. Com-pared with these methods, the most important adtage of an agent-based approach is that it can provclear high-level abstraction of the grid environment tis extensible and compatible for integration of futugrid services and toolkits.


7. Conclusions

This work addresses grid load-balancing issues us-ing a combination of intelligent agents and multi-agentapproaches. For local grid load balancing, the iter-ative heuristic algorithm is more efficient than thefirst-come-first-served algorithm. For global grid loadbalancing, a peer-to-peer service advertisement anddiscovery technique is shown to be effective. The useof a distributed agent strategy can reduce the networkoverhead significantly and allow the system to scalewell rather than using a centralised control, as well asachieving a reasonable good resource utilisation andmeeting application execution deadlines.

Further experiments will be carried out using thegrid testbed being built at Warwick. Since large-scaledeployments of developing systems are problematic,a grid modelling and simulation environment is underdevelopment to enable performance and scalability ofthe agent system to be investigated when thousands ofgrid resources and agents are involved.

The next generation grid computing environmentmust be intelligent and autonomous to meet require-ments of self management. Related research topics in-clude semantic grids[41] and knowledge grids[42].The agent-based approach described in this workis an initial attempt towards a distributed frame-work for building such an intelligent grid environ-ment. Future work includes the extension of the agent

oSin-gy-

ul-D-

orrk-

VA,

tric,

,

[5] F. Berman, R. Wolski, S. Figueira, J. Schopf, G. Shao,Application-level scheduling on distributed heterogeneous net-works, in: Proceedings of the Supercomputing‘96, Pittsburgh,PA, USA, 1996.

[6] F. Berman, A.J.G. Hey, G. Fox, Grid Computing: Making TheGlobal Infrastructure a Reality, John Wiley & Sons, 2003.

[7] C. Brooks, B. Tierney, W. Johnston, JAVA agents for distributedsystem management, LBNL Report, 1997.

[8] J. Cao, D.J. Kerbyson, E. Papaefstathiou, G.R. Nudd, Perfor-mance modelling of parallel and distributed computing usingPACE, in: Proceedings of the IPCCC‘00, Phoenix, AZ, USA,2000, pp. 485–492.

[9] J. Cao, D.J. Kerbyson, G.R. Nudd, Dynamic application inte-gration using agent-based operational administration, in: Pro-ceedings of the PAAM‘00, Manchester, UK, 2000, pp. 393–396.

[10] J. Cao, D.J. Kerbyson, G.R. Nudd, High performance servicediscovery in large-scale multi-agent and mobile-agent systems,Int. J. Software Eng. Knowl. Eng. 11 (5) (2001) 621–641.

[11] J. Cao, D.J. Kerbyson, G.R. Nudd, Use of agent-based servicediscovery for resource management in metacomputing envi-ronment, in: Proceedings of the Euro-Par‘01, Lecture Noteson Computer Science, vol. 2150, Springer, Berlin, 2001, pp.882–886.

[12] J. Cao, D.J. Kerbyson, G.R. Nudd, Performance evaluation of anagent-based resource management infrastructure for grid com-puting, in: Proceedings of the CCGrid‘01, Brisbane, Australia,2001, pp. 311–318.

[13] J. Cao, D.P. Spooner, J.D. Turner, S.A. Jarvis, D.J. Kerbyson,S. Saini, G.R. Nudd, Agent-based resource management forgrid computing, in: Proceedings of the AgentGrid‘02, Berlin,Germany, 2002, pp. 350–351.

[14] J. Cao, S.A. Jarvis, S. Saini, D.J. Kerbyson, G.R. Nudd, ARMS:an agent-based resource management system for grid comput-ing, Scientific Programming (Special Issue on Grid Computing)

[ son,asedgs of

[ udd,task

nce,

[ r sci-t. 24

[ ork-7.

[ truc-7)

[ om-

[ ices02)

10 (2) (2002) 135–148.15] J. Cao, S.A. Jarvis, D.P. Spooner, J.D. Turner, D.J. Kerby

G.R. Nudd, Performance prediction technology for agent-bresource management in grid environments, in: Proceedinthe HCW‘02, Fort Lauderdale, FL, USA, 2002.

16] J. Cao, D.P. Spooner, S.A. Jarvis, S. Saini, G.R. NAgent-based grid load balancing using performance-drivenscheduling, in: Proceedings of the IPDPS‘03, Nice, Fra2003.

17] H. Casanova, J. Dongarra, Using agent-based software foentific computing in the NetSolve system, Parallel Compu(12–13) (1998) 1777–1790.

18] H. Casanova, J. Dongarra, Applying NetSolve’s netwenabled server, IEEE Comput. Sci. Eng. 5 (3) (1998) 57–6

19] I. Foster, C. Kesselman, Globus: a metacomputing infrasture toolkit, Int. J. High Perform. Comput. Appl. 2 (199115–128.

20] I. Foster, C. Kesselman, The GRID: Blueprint for a New Cputing Infrastructure, Morgan-Kaufmann, 1998.

21] I. Foster, C. Kesselman, J.M. Nick, S. Tuecke, Grid servfor distributed system integration, IEEE Comput. 35 (6) (2037–46.

framework with new features, e.g. automatic Qnegotiation, self-organising coordination, semantictegration, knowledge-based reasoning and ontolobased service brokering.

References

[1] A. Abraham, R. Buyya, B. Nath, Nature’s heuristics for scheding jobs on computational grids, in: Proceedings of the ACOM‘00, Cochin, India, 2000.

[2] D. Abramson, R. Sosic, J. Giddy, B. Hall, Nimrod: a tool fperforming parameterized simulations using distributed wostations, in: Proceedings of the HPDC‘95, Pentagon City,USA, 1995.

[3] D. Abramson, J. Giddy, L. Kotler, High performance paramemodelling with Nimrod/G: killer application for the global gridin: Proceedings of the IPDPS‘00, Cancun, Mexico, 2000.

[4] K. Amold, B. O’Sullivan, R. Scheifer, J. Waldo, A. WoolrathThe JiniTM Specification, Addison Wesley, 1999.


[22] J. Frey, T. Tannenbaum, M. Livny, I. Foster, S. Tuecke, Condor-G: a computation management agent for multi-institutionalgrids, Cluster Comput. 5 (3) (2002) 237–246.

[23] A. Grimshaw, W.A. Wulf, Legion team, The Legion vision ofa worldwide virtual computer, Commun. ACM 40 (1) (1997)39–45.

[24] R.L. Henderson, Job scheduling under the Portable Batch Sys-tem, in: Proceedings of the JSSPP‘95, Lecture Notes in Com-puter Science, vol. 949, Springer, Berlin, 1995, pp. 279–294.

[25] D. Jackson, Q. Snell, M. Clement, Core algorithms of theMaui scheduler, in: Proceedings of the JSSPP‘01, LectureNotes Computer Science, vol. 2221, Springer, Berlin, 2001, pp.87–102.

[26] N.R. Jennings, M.J. Wooldridge (Eds.), Agent Technology: Fo-undations, Applications, and Markets, Springer, Berlin, 1998.

[27] D. Lifka, The ANL/IBM SP scheduling system, in: Proceedingsof the JSSPP‘01, Lecture Notes Computer Science, vol. 2221,Springer, Berlin, 2001, pp. 187–191.

[28] M. Litzkow, M. Livny, M. Mutka, Condor – a hunter of idleworkstations, in: Proceedings of the ICDCS‘88, San Jose, CA,USA, 1988, pp. 104–111.

[29] L. Moreau, Agents for the grid: a comparison for web services(part 1: the transport layer), in: Proceedings of the CCGrid‘02,Berlin, Germany, 2002, pp. 220–228.

[30] H. Nakada, M. Sato, S. Sekiguchi, Design and implementationsof Ninf: towards a global computing infrastructure, Future Gen.Comput. Syst. 5–6 (1999) 649–658.

[31] E. Newcomer, Understanding Web Services: XML, WSDL,SOAP, and UDDI, Addison Wesley, 2002.

[32] G.R. Nudd, D.J. Kerbyson, E. Papaefstathiou, S.C. Perry, J.S.Harper, D.V. Wilcox, PACE – a toolset for the performance pre-diction of parallel and distributed systems, Int. J. High Perform.Comput. Appl. 14 (3) (2000) 228–251.

[33] O.F. Rana, D.W. Walker, The agent grid: agent-based resourceintegration in PSEs, in: Proceedings of the IMACS‘00, Lau-

[ tionentc-

[ Hall,

[ e I-1)

[ hni-998,

[ siveture

[ ser-e for999)

[ s dis-ting,

[41] H. Zhuge, Semantics, resource and grid, Future Gen. Comput.Syst. 20 (1) (2004) 1–5.

[42] H. Zhuge, China’s e-science knowledge grid environment, IEEEIntell. Syst. 19 (1) (2004) 13–17.

Junwei Cao is currently a Research Scien-tist at the Center for Space Research, Mas-sachusetts Institute of Technology, USA.He has been working on grid infrastructureimplementation and grid-enabled applica-tion development since 1999. Before join-ing MIT in 2004, Dr Cao was a researchstaff member of C&C Research Laborato-ries, NEC Europe Ltd., Germany. He re-ceived his PhD in Computer Science in 2001from the University of Warwick, UK and

MSc in 1998 from Tsinghua University, China, respectively. He is amember of the IEEE Computer Society and the ACM.

Daniel P. Spooner is a newly appointedLecturer and a member of the High Perfor-mance System Group at the University ofWarwick. He has 15 referred publicationson the generation and application of analyt-ical performance models to Grid comput-ing systems. He has worked at the Perfor-mance and Architectures Laboratory at theLos Alamos National Laboratory on per-formance modelling tools, and is currentlyinvolved in an e-Science programme to de-

velop performance-aware middleware services.

StephenA. Jarvisis a Senior Lecturer in theHigh Performance System Group at the Uni-versity of Warwick. He has authored over50 referred publications (including three

r-e,-

s.-

-ntrena-au-

ageres.

ir-t at-andr-

ries

sanne, Switzerland, 2000.34] W. Shen, Y. Li, H. Ghenniwa, C. Wang, Adaptive negotia

for agent-based grid computing, in: Proceedings of the Agities/AAMAS‘02, Bologna, Italy, 2002, pp. 32–36.

35] D. Slama, J. Garbis, P. Russell, Enterprise Corba, Prentice1999.

36] R. Stevens, P. Woodward, T. DeFanti, C. Catlett, From thWAY to the national technology grid, Commun. ACM 40 (1(1997) 50–60.

37] C. Thompson, Characterizing the agent grid, Teccal Report, Object Services and Consulting Inc., 1http://www.objs.com/agility/tech-reports/9812-grid.html.

38] B. Tierney, W. Johnston, J. Lee, M. Thompson, A data intendistributed computing architecture for grid applications, FuGen. Comput. Syst. 16 (5) (2000) 473–481.

39] R. Wolski, N.T. Spring, J. Hayes, The network weathervice: a distributed resource performance forecasting servicmetacomputing, Future Gen. Comput. Syst. 15 (5–6) (1757–768.

40] S. Zhou, LSF: load sharing in large-scale heterogeneoutributed systems, in: Proceedings of the Cluster Compu1992.

books) in the area of software and perfomance evaluation. While previously at thOxford University Computing Laboratoryhe worked on performance tools for a number of different programming paradigmHe has close research links with IBM, including current projects with IBM’s TJ Wat

son Research Center in New York and with their development ceat Hursley Park in the UK. Dr Jarvis sits on a number of intertional programme committees for high-performance computing,tonomic computing and active middleware; he is also the Manof the Midlands e-Science Technical Forum on Grid Technologi

Graham R. Nudd is Head of the HighPerformance Computing Group and Chaman of the Computer Science Departmenthe University of Warwick. His primary research interests are in the managementapplication of distributed computing. Prioto joining Warwick in 1984, he was employed at the Hughes Research Laboratoin Malibu, California.

http://www.objs.com/agility/tech-reports/9812-grid.html

Grid load balancing using intelligent agentscaoj/pub/doc/jcao_j_gridagent.pdf136 J. Cao et al. /...

Documents

Transcript of Grid load balancing using intelligent agentscaoj/pub/doc/jcao_j_gridagent.pdf136 J. Cao et al. /...