

CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2010; 22:1365–1385. Published online 21 September 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe.1505

A reference model for grid architectures and its validation

Wil van der Aalst, Carmen Bratosin∗,†, Natalia Sidorova and Nikola Trčka

Department of Mathematics and Computer Science, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

SUMMARY

Computing and data-intensive applications in physics, medicine, biology, graphics, and business intelligence require large and distributed infrastructures to address the challenges of the present and the future. For example, process mining applications are faced with terabytes of event data and computationally expensive algorithms. Computer grids are increasingly being used to deal with such challenges. However, grid computing is often approached in an ad hoc and engineering-like manner. Despite the availability of many software packages for grid applications, a good conceptual model of the grid is missing. This paper provides a formal description of the grid in terms of a colored Petri net (CPN). This CPN can be seen as a reference model for grids as it clarifies the basic concepts at the conceptual level. Moreover, the CPN allows for various kinds of analyses ranging from verification to performance analysis. We validate our model based on real-life experiments using a testbed grid architecture available in our group and we show how the model can be used for the estimation of throughput times for scientific workflows. Copyright © 2009 John Wiley & Sons, Ltd.

Received 4 April 2009; Revised 14 July 2009; Accepted 20 July 2009

KEY WORDS: computational grids; grid architecture; colored Petri nets

1. INTRODUCTION

Developments in information technology offer many solutions to realize the idea of collaboration and distribution of work in order to solve a given problem. Technologies such as distributed computing or service-oriented architectures have been widely embraced by the scientific and industrial communities. Grid computing uses these technologies to treat distributed computing resources linked via networks as a single computer.

Despite the availability of a wide variety of grid products, there is little consensus on the definition of a grid and its architecture. In the last decade many researchers have tried to define what a grid is.

∗Correspondence to: Carmen Bratosin, Department of Mathematics and Computer Science, Eindhoven University of Technology, P.O. Box 513, 5600 MB Eindhoven, The Netherlands.

†E-mail: [email protected]


Some argue that grid computing is just another name for distributed computing, while others claim that it is a completely new way of computing. Recently, Stockinger presented the outcome of a survey conducted among grid researchers all over the globe [1]. The main conclusion, on which most of the respondents agree, is that grid computing is about sharing resources in a distributed environment. This definition, however, only offers an idea of what a grid is and not of how it actually works.

In order to classify all the functionalities that a grid system should provide, Foster describes a grid architecture as composed of five layers [2]: (1) the fabric layer, providing resources such as computational units and network resources; (2) the connectivity layer, composed of communication and authentication protocols; (3) the resource layer, implementing negotiation, monitoring, accounting, and payment for individual resources; (4) the collective layer, focusing on global resource management; and finally, (5) the layer composed of user applications. A similar classification is given in [3], where the architecture is composed of four layers: (1) resources, composed of the actual grid resources such as computers and storage facilities; (2) network, connecting the resources; (3) the middleware layer, equivalent to the collective layer of [2], but also including some of the functionality of the resource layer (e.g. monitoring); and (4) the application layer. In both [2] and [3], as well as in most of the other similar works done by practitioners, the grid architecture is described only at a very high level. The separation between the main parts of the grid is not well defined. Moreover, there is a huge gap between the architectural models, usually given in terms of informal diagrams, and the actual grid implementations, which use an engineering-like approach. A good conceptual reference model for grids is missing.

This paper tries to fill the gap between high-level architectural diagrams and concrete implementations by providing a colored Petri net (CPN) [4] describing a reference grid architecture. Petri nets [5] are a well-established graphical formalism, able to model concurrency, parallelism, communication, and synchronization. CPNs extend Petri nets with data, time, and hierarchy, and combine their strength with the strength of programming languages. For these reasons, we consider CPNs to be a suitable language for modeling grid architectures. The CPN reference model, being formal, resolves ambiguities and provides semantics. Its graphical nature and hierarchical composition contribute to a better understanding of the whole grid mechanism. Moreover, as CPN models are executable and supported by CPN Tools [6] (a powerful modeling and simulation framework), the model can be used for rapid prototyping, and for various kinds of analyses ranging from model checking to performance analysis.

Note that we do not intend to propose a new architecture but to provide a simple, easy-to-understand and easy-to-use executable model that unambiguously depicts the main modules in a (typical) grid. In addition, we want to provide a framework that offers a simple and efficient way of, for example, estimating the throughput time of user workflows.

The literature refers to different types of grids, based on the main applications supported. For example, a data grid is used for managing large sets of data distributed over several locations, and a computational grid focuses on offering computing power for large and distributed applications. Each type of grid has its particular characteristics, making it a non-trivial task to unify them. This paper focuses on computational grids only. However, we also take into account some data aspects, such as input and output data of computational tasks and the duration of data transfer, as these are important aspects for the analysis.

(Computational) grids are used in different domains ranging from biology and physics to weather forecasting and business intelligence. Although the results presented in this paper are highly generic, we focus on process mining [7,8] as an application domain. The basic idea of process mining is to discover, monitor, and improve real processes (i.e. not assumed processes) by extracting knowledge from event logs. It is characterized by an abundance of data (i.e. event logs containing millions of events) and potentially computing-intensive analysis routines (e.g. genetic algorithms for process discovery). However, as many of its algorithms can be distributed by partitioning the log or the model, process mining is an interesting application domain for grid computing.

Here, at the Eindhoven University of Technology, we are often involved in challenging applications of process mining that can benefit from the grid (a recent example is our analysis of huge logs coming from the 'CUSTOMerCARE Remote Services Network' of Philips Medical Systems). We need a good experimental framework that allows us to experiment with different scheduling techniques and different grid-application designs. This is an added reason for taking process mining as our application focus.

The level of detail present in the CPN model allows us to implement, almost straightforwardly, a real grid architecture based on the model. In this paper we present the first prototype, called YAGA (Yet Another Grid Architecture), a service-based framework for supporting process mining experiments. YAGA is a simple, but extendable, grid architecture meant to be used in the classroom by students doing a project in our process mining course.

Having a real testbed grid and a software implementation (i.e. YAGA) at our disposal, we can also validate the CPN model. We are not interested in a generic validation of the model (this would require an experiment of a very large scale which, in our case, would be too hard to conduct), but rather focus on seeing how the simulated behavior matches the real one when applied to our small grid containing four homogeneous computers. During the process mining course we intend to use the CPN model for planning and to generate recommendations, which explains our decision to validate the model on the architecture that will actually be used in the classroom. Note that our local grid is dedicated to running process mining experiments.

For the validation we consider one usage scenario typically seen in the process mining course. In this course students often have to apply different mining algorithms to specific logs. Based on the received output, students sometimes need to repeat their experiments, but with changed parameters. We conduct several experiments that mimic this behavior, on both the testbed and the model, and compare the obtained results. For the visualization and analysis of the results we use the SPSS software [9].

The remainder of the paper is organized as follows. Section 2 discusses related work. In Section 3 we present a grid architecture and its CPN model. The model validation is presented in Section 4. Section 5 concludes the paper.

2. RELATED WORK

Grid architectures are often described in an informal way. Foster [2] describes the main levels of a grid architecture and their characteristics. In [10], the authors present the necessary services that a grid architecture should support. A similar architecture is proposed in [11], where the authors focus on the technologies that can be used in order to implement a web-service-based grid.

While formal techniques are widely used to describe grid workflows [12–14], only a few attempts have been made to specify the semantics of a complete grid. In [15] a semantic model for grid systems is given using Abstract State Machines [16] as the underlying formalism. The model is at a very high level (a refinement method is only informally proposed) and no analysis can be performed. Zhou and Zeng [17] give a formal semantic specification (in terms of Pi-calculus) for dynamic binding and interactive concurrency of grid services. The focus is on grid service composition. Buyya et al. [18] propose an economic model to decide the job allocation such that economic targets (e.g. minimal cost) are met. Du et al. [19] present a formal model for a grid architecture based on Petri nets. They propose a three-layer architecture: (1) the user level, which models the execution of jobs, (2) the virtual level, which models the allocation of a job, and (3) the physical level, which models the resource layer. The model captures only the allocation and the execution of jobs, without capturing important characteristics of a grid (e.g. the scheduling policy, data management). Moreover, the model allows the verification of only a few properties, based on state-space exploration.

Montes et al. [20] build a behavioral model of a grid using knowledge-based discovery techniques. The authors argue that such a model provides more insight than the current techniques that treat every grid architecture component on its own. However, their model loses the details of the underlying grid architecture that can be essential in understanding the state changes.

In order to analyze the grid behavior, several researchers have developed grid simulators. (The most notable examples are SimGrid [21] and GridSim [22].) These simulators are typically Java or C implementations, meant to be used for the analysis of scheduling algorithms. They do not provide a clear reference model as their functionality is hidden in code. This makes it difficult to check the alignment between the real grid and the simulated grid.

In [14] we proposed to model grid workflows using CPNs, and we also used process mining as a running example. In that paper, however, we fully covered only the application layer of the grid; for the other layers only the basic functionality was modeled, just to close the model and make the analysis possible. We also completely abstracted from data aspects.

In a brief paper [23] we also presented a (simpler) CPN model of the grid. This model was used in a simulation case study, to evaluate the proposed data-storage optimization strategy for grids.

3. MODELING A GRID WITH CPNs

In this section, we present a grid architecture and give its semantics using CPNs‡. The proposed architecture is based on an extensive review of the literature. As mentioned in the introduction, CPNs are graphical and hierarchical, so we use the CPN model itself to explain the architecture.

The main page of the model is given in Figure 1. It shows the high-level view of the grid architecture. As seen in the figure, the grid consists of three layers: (1) the resource layer, (2) the application layer, and (3) the middleware. The application layer is where the users describe the applications to be submitted to the grid. The resource layer is a widely distributed infrastructure, composed of different resources linked via the Internet. The main purpose of the resources is to host data and execute jobs. The middleware is in charge of allocating resources to jobs, and of other management issues. The three layers communicate by exchanging messages, modeled as (interface) places in CPNs.

‡We assume that the reader is familiar with the formalism; if otherwise, [4] provides a good introduction.


Figure 1. Grid architecture.

User applications consist of jobs, atomic units of work. The application layer sends job descriptions to the middleware, together with the locations of the required input data. It then waits for a message saying whether the job was finished or canceled. When it receives a job description, the middleware tries to find a resource to execute this job. If a suitable resource is found, it is first claimed and then the job is sent to it. The middleware monitors the status of the job and reacts to state changes. The resource layer sends to the middleware the acknowledgments, and the information about the current state of resources, new data elements, and finished jobs. When instructed by the application layer, the middleware removes the data that are no longer needed from the resources.

We now zoom into the submodules in Figure 1, presenting each layer and its dynamics in more detail.
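To make the interface between the layers concrete, the sketch below renders the messages described above as plain Java types. This is only an illustration of the message flow in the reference model, not part of the CPN model or of YAGA; all type and field names are ours.

```java
// Illustrative message types for the layer interfaces (names are hypothetical).
import java.util.List;
import java.util.Map;

sealed interface GridMessage permits SubmitJob, JobFinished, JobCanceled,
        ClaimRequest, ClaimReply, ResourceState, RemoveData {}

// Application layer -> middleware: a job description plus the locations of its input data.
record SubmitJob(String jobId, String application,
                 Map<String, String> inputDataLocations,
                 List<String> outputData) implements GridMessage {}

// Middleware -> application layer: the outcome of a job.
record JobFinished(String jobId, Map<String, String> outputDataLocations) implements GridMessage {}
record JobCanceled(String jobId, String reason) implements GridMessage {}

// Middleware -> resource layer: claim a resource before shipping the job.
record ClaimRequest(String jobId, String resourceId, long storageNeeded) implements GridMessage {}
record ClaimReply(String jobId, String resourceId, boolean granted) implements GridMessage {}

// Resource layer -> middleware: periodic state updates and new-data announcements.
record ResourceState(String resourceId, int freeCpus, long freeStorage,
                     List<String> hostedData) implements GridMessage {}

// Application layer -> middleware -> resources: remove data that are no longer needed.
record RemoveData(List<String> dataIds) implements GridMessage {}
```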

3.1. Application layer

The upper level of the grid architecture is composed of user applications. These applications define the jobs to be executed on the resources. As these jobs may causally depend on one another, the application layer needs to specify the 'flow of work'. Therefore, we use the term grid workflow to refer to the processes specified at the application layer. There may be different grid workflows using the same infrastructure, and there may be multiple instances of the same grid workflow, referred to as cases.

The purpose of a grid architecture is to offer users an infrastructure to execute complex applications and at the same time hide the complexity of the resources. This means that the user should not be concerned with, for example, resource discovery or data movement. In order to achieve this goal, the application layer provides a grid workflow description language that is independent of resources and their properties. Defined workflows can therefore be (re)used on different grid platforms.

In our case, CPNs are themselves used to model grid workflows. However, they are expected to follow a certain pattern. The user describes each job as a tuple (ID, AP, OD), where ID is the set of input data, AP is the application to be executed, and OD is the set of output data. Every data element is represented by its logical name, leaving the user the possibility to parameterize the workflow at run-time with the actual data. All data are case related, that is, if the user wants to run multiple experiments on the same data set, the data are replicated and the dependency information is lost. It is assumed that the set ID contains only existing data, that is, data previously created by other jobs. It is also assumed that AP is an existing software application at the level of the resources.
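As an illustration, such a job description could be captured by a small data type like the one below. This is a sketch in Java (the language of the YAGA prototype), not the actual CPN color set or YAGA code; all identifiers are ours.

```java
import java.util.Set;

// A job as described at the application layer: input data, application, output data.
// Data elements are referred to by logical names; the middleware resolves them to
// physical locations at run-time.
public record JobDescription(
        Set<String> inputData,   // ID: logical names of existing data elements
        String application,      // AP: application available at the resource level
        Set<String> outputData   // OD: logical names of the data the job will create
) {}
```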

In Figure 2 we present the CPN model of the application layer. In this case, the layer consists of only one workflow, but multiple cases can be generated by the GenCases module. Every workflow is preceded by the substitution transition RegisterData. This transition, as the name implies, registers for every new case the location of the input data required for the first job. The workflow from Figure 2 thus needs a 'Log'. When all the jobs in a case are executed, the application layer sends a message to the middleware instructing it to delete all the data of the case (transition Garbage Removal).

Figure 2. Application layer.

In Figure 3, an example of a workflow is presented, describing a simple, but very typical, process mining experiment. The event log ('Log') is filtered using two types of filters, 'Filter1' and 'Filter2'. The filters are executed sequentially so that the log used for mining incorporates both changes. Next, the obtained log ('FLog2') is mined using two different algorithms ('Mining1' and 'Mining2'). The mining steps are performed in parallel.
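Using the JobDescription sketch from above, the workflow of Figure 3 could be written down roughly as the following list of jobs, where the dependencies are implied by matching logical data names. This is illustrative only; the intermediate log name 'FLog1' and the output names 'Model1'/'Model2' are ours, since the paper only names 'Log' and 'FLog2'.

```java
import java.util.List;
import java.util.Set;

// A rough rendering of the Figure 3 workflow as job tuples (illustrative only).
public final class ExampleWorkflow {
    public static List<JobDescription> jobs() {
        return List.of(
            new JobDescription(Set.of("Log"),   "Filter1", Set.of("FLog1")),  // first filter
            new JobDescription(Set.of("FLog1"), "Filter2", Set.of("FLog2")),  // second filter
            new JobDescription(Set.of("FLog2"), "Mining1", Set.of("Model1")), // mined in parallel
            new JobDescription(Set.of("FLog2"), "Mining2", Set.of("Model2"))  // with Mining2
        );
    }
}
```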


Figure 3. Workflow example.

All the jobs follow the pattern from Figure 4. Each logical data name (from ID and OD) is modeled as a distinct place. In this way, we can easily observe the data dependencies between jobs. These dependencies can be used in an optimization algorithm to decide whether some data are no longer needed, or when some data are suitable for replication. The PluginType place contains the information on which application to execute at the resource level (the parameter AP from above). In our case this place contains the name of a process mining algorithm. Every job starts by receiving a unique id and sending its description to the middleware. It ends with either a proper termination (the job was executed and the required output data were created), or a cancelation (the middleware cannot find a resource to execute the job; in our model this only happens when some input data no longer exist on any grid resource). The application layer cancels the whole case if at least one of its jobs gets canceled.

Figure 4. Job example.

The model can be easily extended to allow for the execution of multiple workflows by replacing, and/or adding, new workflow pages. These new workflows should, of course, respect the above-mentioned job semantics.

The noise generated by other workflows using the same grid infrastructure can be modeled as an additional workflow that generates jobs. The arrival rate for this workflow and the execution time for these jobs can be estimated using work similar to the one presented in [20].

3.2. Middleware

The link between user applications and resources is made via the middleware layer (Figure 5). This layer contains the intelligence needed to discover, allocate, and monitor resources. We consider just one centralized middleware, but our model can easily be extended to a distributed middleware. We also restrict ourselves to a middleware working according to a 'just-in-time' strategy, that is, the search for an available resource is done only at the moment a job becomes available. If there are multiple suitable resources, an allocation policy is applied. Lookahead strategies and advanced planning techniques are not considered in this paper.

Figure 5. Middleware layer.

The place GlobalResInformation models an information database containing the current state of the resources. The middleware uses this information to match jobs with resources, and to monitor the behavior of resources. The database is updated based on the information received from the resource layer and the realized matches.

Data Catalog is a database containing information about the location of data elements and the amount of storage area that they occupy. This information is used in the scheduling process. The transition DataManagement models the registration of data for new cases and the removal of records from the catalog. A message containing a list of 'garbage data' can be sent to the resources at any time.

When a job is received (JobReceiving module), the middleware first extends its description with an estimate of how much storage area is needed. A user-provided knowledge database is used for this task. Next, the job is added to the job pool list, ordered based on the arrival time. If multiple jobs arrive at the same time, the order is non-deterministic. The scheduling process now starts (Figure 6), according to the chosen policy.

Figure 6. Scheduling page.

Scheduling is done in two steps. First, a match between a resource and a job is found. The matching phase succeeds if there is a resource with a free CPU and enough free space to store the input and the output data. Second, the found resource is claimed. This step is necessary because the matching process is based on a (possibly outdated) local copy of the state of the resources. The middleware sends a claim request to the allocated resource in order to check that its resource image is still correct. If the claim is successful, the middleware sends the job description to the resource, extended with the list of data that need to be obtained from other resources (the so-called transfer list). If the claim fails, the middleware puts the job back into the pool.

The place Scheduling Policy contains one token that determines the type of scheduling to be applied. For each scheduling policy, a function that models this policy is implemented. Adding a new policy is simple: a new function needs to be defined and the value of the token in place Scheduling Policy needs to be updated.
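The two-step schedule (match, then claim) and the pluggable policy can be sketched as follows. This is our own Java rendering of the mechanism just described, not code from the CPN model or from YAGA; the Resource record stands for the (possibly outdated) resource image kept in GlobalResInformation, and the claim step is abstracted as a callback.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative two-step scheduler: match on the local resource image, then claim.
public final class Scheduler {

    /** Local (possibly outdated) image of a resource, as kept in GlobalResInformation. */
    public record Resource(String id, int freeCpus, long freeStorage) {}

    /** A scheduling policy picks one resource out of all suitable candidates. */
    public interface SchedulingPolicy {
        Optional<Resource> choose(List<Resource> candidates);
    }

    /** Example policy: pick the candidate with the most free storage. */
    public static final SchedulingPolicy MOST_FREE_STORAGE =
            candidates -> candidates.stream().max(Comparator.comparingLong(Resource::freeStorage));

    /** Step 1: a resource matches if it has a free CPU and room for input plus output data. */
    static boolean matches(Resource r, long storageNeeded) {
        return r.freeCpus() > 0 && r.freeStorage() >= storageNeeded;
    }

    /** Step 2 is the claim: ask the resource itself whether the image is still correct. */
    interface ClaimChannel {
        boolean claim(String resourceId, long storageNeeded); // false if the image was outdated
    }

    /** Try to place one job; if the claim fails the caller puts the job back into the pool. */
    static Optional<Resource> schedule(List<Resource> image, long storageNeeded,
                                       SchedulingPolicy policy, ClaimChannel resources) {
        List<Resource> suitable = image.stream().filter(r -> matches(r, storageNeeded)).toList();
        return policy.choose(suitable)
                     .filter(r -> resources.claim(r.id(), storageNeeded));
    }
}
```

Swapping in a different allocation policy then only means supplying another SchedulingPolicy implementation, mirroring how a new policy function and token value are added in the CPN model.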

After the job has been sent to the resource, the middleware monitors the resource state. Basically, it listens to the signals received from the resource layer. Each time a message is received, the middleware compares the received state with its local copy. As these messages can be outdated (e.g. when a resource sends a message just before a job arrives), the middleware reacts using a timeout mechanism. A resource is considered unavailable if no information is received from it for a given period of time, and a job is considered canceled if no message related to the job is received for several consecutive updates. When the middleware receives the message that a job is finished, it updates the global resource information database and forwards this message to the application layer.

Jobs can fail at the resource layer. Therefore, a fault-handling mechanism is defined (transition Fault-Handling). When a job fails, the middleware tries to re-allocate it. However, if the necessary input data are no longer available at the resource level, the middleware is unable to execute the job and it sends a message to the application layer that the job is canceled.
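A minimal sketch of the two timeout rules just described, again in illustrative Java with thresholds and names of our own choosing:

```java
// Illustrative timeout checks used by the monitor (thresholds are hypothetical parameters).
public final class Monitor {
    private final long resourceTimeoutMillis;   // silence after which a resource is presumed gone
    private final int  missedUpdatesForCancel;  // consecutive updates without job news -> canceled

    public Monitor(long resourceTimeoutMillis, int missedUpdatesForCancel) {
        this.resourceTimeoutMillis = resourceTimeoutMillis;
        this.missedUpdatesForCancel = missedUpdatesForCancel;
    }

    /** A resource is considered unavailable if nothing was heard from it for too long. */
    boolean resourceUnavailable(long now, long lastHeardFrom) {
        return now - lastHeardFrom > resourceTimeoutMillis;
    }

    /** A job is considered canceled if several consecutive updates carried no news about it. */
    boolean jobConsideredCanceled(int consecutiveUpdatesWithoutJobNews) {
        return consecutiveUpdatesWithoutJobNews >= missedUpdatesForCancel;
    }
}
```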

3.3. Resource layer

Every resource is described in terms of the available computing power (expressed in the number of CPUs), the amount of storage area available for hosting data, the list of supported applications, and the set of running and allocated jobs. The resources are unaware of job issuers and of job dependencies. Every job is executed on just one resource; however, resources can work on multiple jobs at the same time. Figure 7 presents the conceptual representation of the functionalities of the resource layer in terms of CPNs.
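A resource, as described here, could be captured by a record like the following. This is an illustration in Java with our own field names, mirroring the information carried by a token in the Resources place; it is not the actual color set of the model.

```java
import java.util.Set;

// Illustrative resource descriptor: CPUs, storage, supported applications, and current jobs.
public record ResourceDescriptor(
        String id,
        int cpus,                    // available computing power, in number of CPUs
        long storageBytes,           // storage area available for hosting data
        Set<String> supportedApps,   // applications this resource can run
        Set<String> runningJobs,     // jobs currently executing here
        Set<String> allocatedJobs    // jobs claimed but not yet running
) {}
```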


Figure 7. Resource layer.

The set of resources is assumed to be fixed, that is, during one simulation resources cannot be added and/or removed, but they are considered unreliable: they can appear/disappear at any moment in time (except when transferring data). The transition Resource Dynamics, governed by a stochastic clock, simulates the possibility that a resource becomes unavailable. When this happens, all the data are lost and all running jobs on this resource are aborted. Note that in order to change the resource pool (to add, e.g., a new resource type), only the value of the token in the Resources place needs to be updated.

After a successful match by the middleware, the transition Claim is used to represent a guarantee that the allocated resource can execute the job. Recall that this phase is necessary because the allocation at the middleware level is based on possibly outdated information about the state of the resources. If the claiming succeeds, one CPU and the estimated necessary storage area are reserved at the resource. The resource is now ready to perform the job, and is waiting for the full job description to arrive. The job description also contains the locations of the input data and the information about which application to execute. The substitution transition Transfer models the gathering of the necessary input data from other resources. If the input data are no longer present on a source node, the job is aborted. If the transfer starts, we assume that it ends successfully. Note that the reserved CPU remains unoccupied during the transfer. When all the input data are present on the allocated resource, the job starts executing.

The resources can always offer their capabilities and, hence, the resource layer constantly updates the middleware on the current state of the resources. There are two types of updates sent: (1) recognition of new data (transferred data, or data generated by job execution) and (2) signals announcing to the middleware that a resource is still available. While the former is sent on every change in the resource status, the latter is sent periodically.

The Remove Data transition models the fact that any data can be deleted from a resource at the request of the middleware. These requests can arrive and be fulfilled at any moment in time.

The reference model presented in this section offers a clear view and a good understanding of our grid architecture. The next section shows how we can use this model to also analyze the behavior of the grid.

4. MODEL VALIDATION

Our CPN reference grid model is executable due to the simulation and monitoring capabilities of the CPN Tools framework. In order to validate our model, we perform a comparative study between the results of experiments on a real grid architecture and the results of simulations on the model. We use the testbed grid architecture YAGA (Yet Another Grid Architecture), developed for conducting process mining experiments. The architecture combines an existing workflow management system (Yet Another Workflow Language—YAWL engine [24]) and a process mining framework (ProM [25]) through a module-based grid middleware. The YAGA implementation uses the reference model in order to define its component modules and message passing.

We validate our model by comparing the average throughput time and the resource load (the percentage of running jobs out of the maximum number of jobs that can run on a resource) for different rates of starting new instances of the user workflow. We set the time parameters of the model using measurement-based estimations of the data transfer time between resources and of the run-time of each type of job in the workflow. As explained, we consider a case study that mimics the behavior of students performing process mining experiments. The user workflow, therefore, follows the steps of the workflow from Figure 3.

In the following subsection, we present an overview of the YAGA architecture. Further on, we show the validation of the reference model presented in Section 3 by comparing the results of the grid experiments with the simulation measurements.

4.1. YAGA: Yet Another Grid Architecture

The testbed architecture (YAGA) combines a powerful process engine (YAWL [24]) and a state-of-the-art analysis tool (ProM [25]) through a Java-based grid middleware. Figure 8 presents the building blocks of the testbed and their interaction. The architecture follows the reference model components.

The YAWL engine is a web-service-based application that is able to interact with external (custom) web services. We use this facility for the message exchange between the YAWL engine and the grid middleware. A YAWL custom service receives the task data (i.e. the grid job description) and forwards this information to the YAWL interface included in the grid middleware. When the job finishes, the middleware sends the necessary information (for example, the id's of the output data) to the engine.


Figure 8. Grid architecture implementation.

We developed the grid middleware in Java, following the building blocks and the message passing presented in Figure 5. The JobArrival block receives the job description via the YAWL interface and transforms it into an internal representation. The internal representation contains references to the data id's and the resources hosting the data. The job is added to the job pool. Each job has a life cycle modeled as a state machine. The middleware listens to the state changes in order to determine job-related actions, for example, scheduling.
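The paper does not spell out the concrete states of this life cycle, so the enum below is only a plausible sketch consistent with the behavior described in Section 3; the state names are ours, not YAGA's.

```java
// Hypothetical job life cycle consistent with the description in Section 3.
public enum JobState {
    RECEIVED,      // description arrived at the middleware and was put in the job pool
    CLAIMED,       // a resource was matched and successfully claimed
    TRANSFERRING,  // input data are being gathered on the allocated resource
    RUNNING,       // the application is executing on the resource
    FINISHED,      // output data were created; the application layer is notified
    CANCELED       // no resource or input data available; the case may be canceled
}
```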

The scheduler module periodically (every 5 seconds) allocates jobs to resources. After a match between a resource and a job is made, the transfer list is computed. The data transfer is peer-to-peer between the resources; the middleware's role is to provide the coordinates of the data location.

The monitor module receives messages from the resources as described in Section 3. Its functionalities are to update the data catalog, to notify the YAWL engine about the job status, and to update the resource information.

The resources run process mining operations using the ProM framework. ProM is a powerful Java-based pluggable analysis tool containing more than 200 plugins. It allows external control via a socket-based interface. Moreover, different ProM instances can exchange Java objects. The grid middleware's ProM interface interacts with the ProM framework by requesting the execution of a specific plugin, as specified in the grid job description. The ProM interface also supports message exchange related to the state of the hosting resource (e.g. the load, allocated memory) and data.

4.2. Time parameters

In order to measure the throughput time of a workflow instance (case) we need to define the necessary time parameters. The throughput time of a case depends on the execution times of its component grid jobs. The execution time of a job, from the user perspective, consists of the following (summarized in the formula sketch after this list):

• the queuing time at the middleware that depends on the resource availability and the schedulingperiod,

• the transfer time of the input data, if relevant, which depends on the data size and the network characteristics (bandwidth, load, etc.), and

• the run time of the job on a resource that depends on the type of application and the size ofinput data.
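Under this decomposition, the execution time of a single job j, as perceived by the user, is simply the sum of the three components above (the symbols are ours and only restate the list):

    T_exec(j) = T_queue(j) + T_transfer(j) + T_run(j)

In our reading, the throughput time of a case is then the time from the arrival of the case until its last job completes.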

We need to determine these parameters in conformance with the real settings. Wilhelm et al. [26] present an overview of methods for estimating run-times. The authors identify two approaches for this estimation: (1) static methods and (2) measurement-based methods. The first type uses the code and a hardware architecture model to determine the worst-case run-time. The measurement-based approach performs multiple runs of the job on a specific architecture and then obtains the execution time by applying statistical methods. Similar methods can be used to estimate the data transfer times between resources (see [27,28]). We choose the measurement-based approach to find the distributions of the run-times per job and the distributions of the data transfer times.

In order to obtain the real-life information, we log the following events:

• job events—job arriving at/leaving the middleware, job run start/complete on a resource, and input data transfer start/complete;

• data events—creation, replication, and deletion;
• resource information—CPU load and number of running parallel jobs.

Figure 9 presents a snapshot of the logs.

4.3. Experimental results and validation

To validate our model, we consider the case study of students conducting a process mining project. The project requires the execution of the process mining workflow from Figure 3, four times, with different parameters for the mining algorithms. The students typically manipulate these parameters in order to obtain more accurate results.

We consider different case-arrival rates, taking into account that not all students will start working at the same time. The arrival rate follows an exponential distribution to model the variation among students. The chosen mean arrival times are 50, 40, 30, and 20 s, equivalent to 80, 30, 25, and 15 min, respectively, allocated for the running and analysis of the workflow. We want to determine the amount of time a student spends waiting for the results. Each experiment is limited to 200 cases, corresponding to a group of 50 students.
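For reproducing such an arrival process outside CPN Tools, exponentially distributed inter-arrival times with a given mean can be sampled as in the short Java sketch below. This is illustrative only; CPN Tools provides its own exponential distribution, and the seed and mean shown here are arbitrary.

```java
import java.util.Random;

// Sample exponentially distributed inter-arrival times via inverse transform sampling.
public final class CaseArrivals {
    public static double nextInterArrival(Random rng, double meanSeconds) {
        // If U ~ Uniform(0,1), then -mean * ln(U) is Exponential with the given mean.
        return -meanSeconds * Math.log(1.0 - rng.nextDouble());
    }

    public static void main(String[] args) {
        Random rng = new Random(42);   // arbitrary seed
        double mean = 30.0;            // e.g. one case per 30 s on average
        for (int i = 0; i < 5; i++) {
            System.out.printf("inter-arrival %d: %.1f s%n", i, nextInterArrival(rng, mean));
        }
    }
}
```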


Figure 9. Excerpts of the grid middleware logs. (a) Job log; (b) data log; and (c) resource log.

The simulation results are obtained using the monitoring capabilities of CPN Tools. We collect the average throughput time of a case and the resource load while cases are arriving, for the different arrival rates.

We run the experiments on a resource pool composed of four identical resources. Each resource has eight Intel(R) Xeon processors with a frequency of 2.66 GHz and 15 GB RAM. Note that the servers were entirely dedicated to our experiments, that is, no other applications were running in parallel and no breakdown occurred during the experiments. We limit the number of parallel jobs on a resource to the number of processors. Every workflow follows the description from Figure 3. Each resource can perform all the process mining jobs (Filter1, Filter2, Mining1, and Mining2).


Figure 10. Comparison of resource load and throughput time distributions for mean arrival rate 1/50. (a) Resource load distribution and (b) throughput time distribution.

Figure 11. Comparison of resource load and throughput time distributions for mean arrival rate 1/40. (a) Resource load distribution and (b) throughput time distribution.

Figures 10–13 present the comparison of the resource load and throughput time data series for the different arrival rates. Tables I and II present the mean and the standard deviation of the results. In Table II, we also present the mean analysis time that the students have for analyzing the workflow results and identifying new parameters. Note that we refer to the results obtained from the YAGA experiments as grid data and to those obtained from CPN Tools monitoring as simulation data.

Each figure graphically compares the data series obtained from the YAGA runs and the CPN simulation in terms of minimal/maximal values, median, and lower/upper quartiles§. The box encapsulates the median value and is bordered by the lower and upper quartiles. The error bars represent the minimal and maximal values. The dots represent the data outliers.

§ Lowest/highest 25% of the data.


Figure 12. Comparison of resource load and throughput time distributions for mean arrival rate 1/30. (a) Resource load distribution and (b) throughput time distribution.

Figure 13. Comparison of resource load and throughput time distributions for mean arrival rate 1/20. (a) Resource load distribution and (b) throughput time distribution.

For all the arrival rates, the resource load data points obtained from the simulation overlap with the ones obtained from the grid data. The most interesting fact is that the maximal values of the resource load on the grid and in the simulation are very close (e.g. for the arrival rate given by exp(1/20) the grid data resource load maximum is 90.63% and the simulation data maximum is 93.75%; similarly, for arrival rate exp(1/50) the maximal values are 50% for the grid data and 53% for the simulation data). The minimal values of the resource load are, in all cases, as expected, equal to zero.


Table I. Resource load statistics.

Mean arrival rate (1/s)   Data type         Mean (%)   Standard deviation (%)
1/50                      Grid data         14.88      9.02
1/50                      Simulation data   14.67      9.99
1/40                      Grid data         19.35      8.95
1/40                      Simulation data   22.81      10.34
1/30                      Grid data         33.80      15.65
1/30                      Simulation data   23.39      12.03
1/20                      Grid data         41.04      20.67
1/20                      Simulation data   48.13      17.88

Table II. Throughput time statistics.

Mean arrival rate (1/s)   Data type         Mean (s)   Standard deviation (s)   Analysis time (min)
1/50                      Grid data         144.46     13.45                    77.6
1/50                      Simulation data   138.30     12.75                    77.69
1/40                      Grid data         145.76     14.25                    27.5
1/40                      Simulation data   141.15     16.63                    27.6
1/30                      Grid data         171.68     24.05                    22.13
1/30                      Simulation data   146.58     14.75                    22.5
1/20                      Grid data         199.37     31.02                    12.68
1/20                      Simulation data   171.50     23.78                    13.1

Table I shows that the means are close. The large standard-deviation-to-mean ratio for the resource load is explained by the fact that there are periods when many resources are idle and periods with much more activity when a new case arrives.

The throughput time estimation given by the simulation data for low arrival rates is very accurate (in terms of mean and minimal/maximal values, as seen in Figures 10 and 11 and Table II). We observe a deterioration of the simulation data quality when the mean arrival rate is increased. The middleware overhead is the reason for the gap between the average values of the grid results and the simulation results. The main middleware overhead is due to the logging of the middleware behavior and the message passing between the different layers. Note that the logging system is switched off when users are using the architecture, which means that the real gap is actually smaller. As expected, when we increase the arrival rate the gap increases, since the middleware receives more jobs per time unit. We model the middleware delay as a constant, which results in a lower deviation value in the simulations than on the grid.

Note that the total run-time of the experiments performed with YAGA was approximately 10 h, whereas the simulation time for the CPN model was around 4 min.

4.4. Adding a new resource

In the previous section, we showed that the model can be used to estimate the throughput times for a homogeneous resource pool. In this section, we compare the simulation and the experimental results when we add a new resource to the pool. The new resource has two Intel(R) Pentium processors with a frequency of 3 GHz and 2 GB of RAM. We estimate the needed parameters (i.e. execution times and transfer times) for this resource and add the information to the model, that is, we add a new token in the place Resources from Figure 7. Note that the 8-core resources run a Linux operating system and the 2-core resource runs a Windows operating system.

Figure 14. Comparison of resource load and throughput time distributions for a heterogeneous environment. (a) Resource load distribution and (b) throughput time distribution.

Figure 14 presents the obtained results for an arrival rate of one case per 30 s. The outliers correspond to the cases that have tasks executed on the newly added resource. The mean values for the throughput times are 155.4 s for the grid data and 144.2 s for the simulation data. The data series cannot be fitted to a normal distribution because of the large difference between the workflows executed on the 8-core resources and the ones executed on the 2-core resource.

The obtained results reveal some oversimplifications in our model that we intend to address in future work. The simulation is not meant to give a precise estimation of time and resource load; however, it does provide tight bounds on the throughput time and the resource load. This information helps the user in designing and re-scheduling his/her workflows. Therefore, we are confident that the simulation model can be successfully integrated in a recommendation framework to support the users of our process mining framework.

5. CONCLUSIONS

In this paper, we presented a reference model for grid architectures, motivated by the absence of a good conceptual definition of the grid. The model is represented in terms of colored Petri nets. It is thus formal, graphical, and executable; it offers a clear and unambiguous view of how the different parts of the grid are structured and how they interact. The model is also fully adaptable and easily extendable.

Based on the model we implemented a grid architecture, called YAGA, which we use for process mining experiments. We validated the CPN model using the results obtained on the YAGA architecture. The validation showed that the simulation results were indeed similar to the experimental ones. Some small discrepancy was expected due to delays of the middleware that we do not model and that were partially caused by the overhead of the logging needed for validation purposes only (and thus disappearing when the grid is used in the normal setting). As a result, our CPN model can safely be used for the purpose of planning in our process mining course.

Future work: In order to improve the quality of the simulation results, we plan to model the middleware delay as a function of the job arrival frequency at the middleware level. Moreover, we intend to perform more extensive tests of our model.

Further on, we plan to combine the CPN model and the YAGA implementation into a powerful grid framework for process mining experiments. The model allows for easy experimentation and extensive debugging. It also ensures an easy and rapid way to choose optimal parameters of real-life workflows. These estimations can help users in planning their experiments and/or re-configuring their workflows. Moreover, performing model simulations on-the-fly could give realistic resource load predictions, which could possibly be used to improve the scheduling process.

The use of YAWL as a workflow language for our grid architecture enables us to directly interpret CPN workflows integrated into our grid model. For this we need to build a suitable (automatic) translation procedure. Moreover, for a more accurate integration of the current state of the resources, we could use a model similar to the one presented in [20].

The CPN model can be downloaded from: http://www.win.tue.nl/~cbratosi/YAGA.

REFERENCES

1. Stockinger H. Defining the grid: A snapshot on the current view. The Journal of Supercomputing 2007; 42(1):3–17.
2. Foster I. The anatomy of the grid: Enabling scalable virtual organizations. First IEEE/ACM International Symposium on Cluster Computing and the Grid, Brisbane, Australia, 2001; 6–7.
3. Grid architecture. Available at: http://gridcafe.web.cern.ch/gridcafe/gridatwork/architecture.html.
4. Jensen K. Coloured Petri Nets—Basic Concepts, Analysis Methods and Practical Use. Springer: Berlin, 1992.
5. Reisig W. System design using Petri nets. Requirements Engineering 1983; 74:29–41.
6. CPN Group, University of Aarhus, Denmark. CPN Tools Home Page. Available at: http://wiki.daimi.au.dk/cpntools/.
7. Aalst W, Weijters A, Maruster L. Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 2004; 16(9):1128–1142.
8. Aalst W, Reijers H, Weijters A, van Dongen B, Medeiros A, Song M, Verbeek H. Business process mining: An industrial application. Information Systems 2007; 32(5):713–732.
9. SPSS software. Available at: http://www.spss.com/.
10. Foster I, Kesselman C, Nick JM, Tuecke S. Grid computing: Making the global infrastructure a reality. The Physiology of the Grid. Wiley: New York, 2003; 217–249. Available at: http://www.globus.org/research/papers/ogsa.pdf.
11. Atkinson M, DeRoure D, Dunlop A, Fox G, Henderson P, Hey T, Paton N, Newhouse S, Parastatidis S, Trefethen A et al. Web Service Grids: An evolutionary approach. Concurrency and Computation: Practice and Experience 2005; 17(2–4):377–389.
12. Feng Z, Yin J, He Z, Liu X, Dong J. A novel architecture for realizing grid workflow using pi-calculus technology. APWeb (Lecture Notes in Computer Science, vol. 3841). Springer: Harbin, China, 2006; 800–805.
13. Alt M, Gorlatch S, Hoheisel A, Pohl HW. A grid workflow language using high-level Petri nets. Parallel Processing and Applied Mathematics (Lecture Notes in Computer Science, vol. 3911), Wyrzykowski R, Dongarra J, Meyer N, Wasniewski J (eds.). Springer: Berlin, 2006; 715–722.
14. Bratosin C, van der Aalst WMP, Sidorova N. Modeling grid workflows with colored Petri nets. Proceedings of the Eighth Workshop on the Practical Use of Coloured Petri Nets and CPN Tools (CPN 2007), DAIMI, Aarhus, Denmark, vol. 584, 2007; 67–86.
15. Nemeth Z, Sunderam V. A formal framework for defining grid systems. Proceedings of the Second IEEE/ACM International Symposium on Cluster Computing and the Grid, CCGRID 2002, Berlin, 2002. Available at: citeseer.ist.psu.edu/541939.html.


16. Borger E, Stark RF. Abstract State Machines: A Method for High-Level System Design and Analysis. Springer: Berlin, 2003.
17. Zhou J, Zeng G. Describing and reasoning on the composition of grid services using pi-calculus. CIT '06: Proceedings of the Sixth IEEE International Conference on Computer and Information Technology (CIT'06). IEEE Computer Society: Washington, DC, U.S.A., 2006; 48–54.
18. Buyya R, Stockinger H, Giddy J, Abramson D. Economic models for resource management and scheduling in grid computing. Concurrency and Computation: Practice and Experience 2002; 14(13–15):1507–1542.
19. Du Y, Jiang C, Guo Y. Towards a formal model for grid architectures via Petri nets. Information Technology Journal 2006; 5(5):833–841.
20. Montes J, Sanchez A, Valdes JJ, Perez MS, Herrero P. The grid as a single entity: Towards a behavior model of the whole grid. OTM '08: Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008, Part I, On the Move to Meaningful Internet Systems. Springer: Monterrey, Mexico, 2008; 886–897.
21. Casanova H. SimGrid: A toolkit for the simulation of application scheduling. CCGRID '01: Proceedings of the 1st International Symposium on Cluster Computing and the Grid. IEEE Computer Society: Brisbane, Australia, 2001; 430.
22. Buyya R, Murshed MM. GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience 2002; 14(13–15):1175–1220.
23. Trcka N, van der Aalst W, Bratosin C, Sidorova N. Evaluating a data removal strategy for grid environments using colored Petri nets. Principles of Distributed Systems, 12th International Conference, OPODIS 2008 (Lecture Notes in Computer Science, vol. 5401), Luxor, Egypt, 15–18 December 2008, Baker TP, Bui A, Tixeuil S (eds.). Springer: Berlin, 2008; 538–541.
24. Aalst W, Hofstede A. YAWL: Yet Another Workflow Language. QUT Technical Report FIT-TR-2002-06, Queensland University of Technology, Brisbane, 2002.
25. ProM. Available at: http://www.processmining.org.
26. Wilhelm R, Engblom J, Ermedahl A, Holsti N, Thesing S, Whalley D, Bernat G, Ferdinand C, Heckmann R, Mitra T et al. The worst-case execution-time problem—Overview of methods and survey of tools. Transactions on Embedded Computing Systems 2008; 7(3):1–53.
27. Prasad RS, Murray M, Dovrolis C, Claffy K. Bandwidth estimation: Metrics, measurement techniques, and tools. IEEE Network 2003; 17:27–35.
28. Vazhkudai S, Schopf JM, Foster I. Predicting the performance of wide area data transfers. International Parallel and Distributed Processing Symposium, vol. 1. IEEE Computer Society: Fort Lauderdale, U.S.A., 2002; 0034.
