Formulating cost effective monitoring strategies for service-based systems




Formulating Cost-Effective Monitoring Strategies for Service-Based Systems

Qiang He, Member, IEEE, Jun Han, Yun Yang, Senior Member, IEEE, Hai Jin, Senior Member, IEEE, Jean-Guy Schneider, and Steve Versteeg

Abstract—When operating in volatile environments, service-based systems (SBSs) that are dynamically composed from component services must be monitored in order to guarantee timely and successful delivery of outcomes in response to user requests. However, monitoring consumes resources and very often impacts on the quality of the SBSs being monitored. Such resource and system costs need to be considered in formulating monitoring strategies for SBSs. The critical path of a composite SBS, i.e., the execution path in the service composition with the maximum execution time, is of particular importance in cost-effective monitoring as it determines the response time of the entire SBS. In volatile operating environments, the critical path of an SBS is probabilistic, as every execution path can be critical with a certain probability, i.e., its criticality. As such, it is important to estimate the criticalities of different execution paths when deciding which parts of the SBS to monitor. Furthermore, cost-effective monitoring also requires management of the trade-off between the benefit and cost of monitoring. In this paper, we propose CriMon, a novel approach to formulating and evaluating monitoring strategies for SBSs. CriMon first calculates the criticalities of the execution paths and the component services of an SBS and then, based on those criticalities, generates the optimal monitoring strategy considering both the benefit and cost of monitoring. CriMon has two monitoring strategy formulation methods, namely local optimisation and global optimisation. In-lab experimental results demonstrate that the response time of an SBS can be managed cost-effectively through CriMon-based monitoring. The effectiveness and efficiency of the two monitoring strategy formulation methods are also evaluated and compared.

Index Terms—Service-based system, web service, QoS, response time, monitoring, criticality, cost of monitoring, value of monitoring


1 INTRODUCTION

The service-oriented computing paradigm offers an effective way to engineer software systems [11], [15], [20] by composing existing services in the form of business processes, e.g., BPEL processes [34], [44]. In such a service composition, or service-based system (SBS), the component services jointly offer the functionality of the SBS and collectively fulfil its quality requirements.

Built from loosely coupled component services offered by independent (and often distributed) providers, SBSs operate in environments where key characteristics of the component services, such as the quality of service (QoS) properties, tend to be volatile. At runtime, various anomalies may occur and impact on the quality of an SBS, e.g., unexpected workload changes, errors in the component services and failures of data transmissions. In this context, how to manage the quality of an SBS by detecting and adapting to runtime anomalies has become an important research direction [11], [15], [20].

Monitoring, as an essential part of Service-Oriented Architecture [11], [15], is required for the quality management of SBSs. By monitoring the execution of the basic components (BCs) of an SBS using monitors provided by service providers (e.g., CloudWatch [6] by Amazon and System Center Global Service Monitor [41] by Microsoft) and third parties (e.g., NetFlow [17] and IPFIX [1]), runtime anomalies can be detected or predicted. Here the BCs of an SBS refer to its component services and the data transmission links (or transmissions in short) between the component services. With the advent of cloud computing, the component services of SBSs are becoming increasingly distributed, which increases the impact of the network on the quality of the SBSs [36]. Thus, the data transmissions between the component services must be considered in monitoring SBSs. A straightforward solution to timely detection and prediction of runtime anomalies is to monitor all the BCs of an SBS constantly. In response to predicted or detected anomalies, adaptation actions can be taken to fix those anomalies and update the SBS to better manage and guarantee its qualities. For example, when a runtime anomaly occurs in a component service, the SBS may, as perceived by the users, respond to user requests more slowly than usual or become unavailable. If the runtime anomaly can be detected and fixed in time, the status of the SBS can return to normal before performance degradation becomes noticeable to the users. The requirements for the average response time and availability of the service, as usually specified in a service level agreement (SLA), can still be met. This highlights the benefit of monitoring.

However, monitoring incurs cost. There are two aspects to monitoring cost: resource cost and system cost. First, monitoring consumes resources, including software, hardware and sometimes human resources. Constantly monitoring all the BCs has a number of shortcomings, including potentially excessive monitoring resource consumption and poor scalability, especially for continuous component services, e.g., multimedia streaming services [14]. Furthermore, in cloud environments where a cloud service provider may maintain up to hundreds of thousands of services for its clients [16], constantly monitoring all the services is very expensive. It is in the provider's interest to keep the resource cost of monitoring at a reasonable and affordable level while still being able to guarantee the quality of its services. In such a context, the resource cost of monitoring is particularly critical. In this research, we assume that there is a limit to the total available monitoring resource, which we refer to as the budget for monitoring. Without such a limit, the monitoring strategy would be straightforward, i.e., monitoring the entire SBS. Second, monitoring also comes with system cost, i.e., negative impacts on the quality of the monitored services and systems [11]. For example, monitoring sometimes involves retrieving logs of the behaviours of services and SBSs and sniffing network traffic. As demonstrated in [59], those operations can result in up to 70 percent performance overhead, slowing down the services and systems. Our previous experiments have also identified that a monitor can cause as much as 40 percent performance overhead on a web service under certain circumstances [28], [29].

• Q. He, J. Han, Y. Yang and J.-G. Schneider are with the School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, Australia 3122. E-mail: {qhe, jhan, yyang, jschneider}@swin.edu.au.

• Hai Jin is with the Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China. E-mail: [email protected].

• S. Versteeg is with CA Technologies, Melbourne, Australia 3004. E-mail: [email protected].

Manuscript received 10 Dec. 2012; revised 9 June 2013; accepted 26 Sept. 2013. Date of publication 20 Oct. 2013; date of current version 14 May 2014. Recommended for acceptance by P. Inverardi. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TSE.2013.48

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 40, NO. 5, MAY 2014 461

0098-5589 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

In general, the benefit of monitoring must be traded off against its cost. On one hand, more monitoring (through more monitors, higher monitoring frequency and finer-grained monitoring) provides more information about the monitored targets and makes timely and accurate detection and prediction of runtime anomalies more likely. On the other hand, more monitoring incurs higher resource and system costs. This trade-off needs to be managed. There are two major aspects to formulating cost-effective monitoring strategies for SBSs: which and how. The former refers to determining the priorities of the BCs for selective monitoring. The latter refers to determining the monitoring parameters for each monitored BC, e.g., the number of monitors, the monitoring frequency and the monitoring granularity.

The key to determining which BCs to monitor is the identification of the critical path of the service composition (i.e., the execution path with the maximum execution time), as any delays on the critical path will directly impact on the response time of the entire SBS. Response time, among various QoS dimensions, is of particular significance in quality management for SBSs. The increase in the number of time-constrained applications in the cloud, e.g., interactive and multimedia SBSs, is also driving the need for response time management for SBSs [30]. The management of response time is the basis for the management of other QoS dimensions. On one hand, effective response time management promises better management of other QoS dimensions because many applications exhibit trade-offs between their response times and other QoS dimensions [42]. A video encoding application, for example, can often produce higher quality video if it is given more time to encode the video frames. On the other hand, the management of other QoS dimensions is tightly coupled with response time management. During execution, an SBS often needs to be adapted to address runtime anomalies that may jeopardise its quality. The adaptation itself takes time and, as a result, contributes to delaying the execution of the SBS. Thus, timely detection and prediction of runtime anomalies, especially those occurring on the critical path, are significant to effective response time management for an SBS. However, the volatility of the operating environments makes the response times of the BCs of an SBS probabilistic [15]. Thus, the critical path of an SBS is probabilistic, i.e., every execution path can be critical with a certain probability, which represents its criticality in the service composition. As such, the problem of identifying the critical path becomes the problem of calculating those probabilities.

When determining how to monitor BCs, the value of monitoring (VOM), which is the benefit of monitoring balanced against the cost of monitoring, needs to be taken into consideration. Only those monitoring strategies that generate more benefit than they cost are worth implementing. Furthermore, if there are multiple worthwhile monitoring strategies, a decision must be made on which one to implement. Usually, a monitoring strategy that can generate higher monitoring benefit incurs higher resource cost. In addition, such a monitoring strategy often leads to higher system cost. To address these issues, an approach is needed to help evaluate and select monitoring strategies.

In this paper, we propose CriMon, a novel approach to monitoring strategy formulation for SBSs. A preliminary version of the work was presented in [47], which focuses on the identification of probabilistic critical paths in service compositions. This paper provides a more comprehensive presentation of the work after substantial extension and improvement. More specifically, we have incorporated the consideration of monitoring benefit and cost in evaluating and selecting monitoring strategies in monitoring strategy formulation. We have also proposed two optimisation methods for monitoring strategy formulation. Finally, we have conducted a thorough evaluation of CriMon through in-lab experiments.

CriMon includes the following aspects as key contributions: 1) a timing model that takes into account the randomness of the timing properties of the BCs [47]; 2) methods for calculating path criticality and BC criticality [47]; 3) a method for VOM evaluation; and 4) two (local and global) optimisation methods for monitoring strategy formulation. As demonstrated by experimental results, CriMon-based monitoring is significantly more cost-effective than random monitoring. Experimental results also demonstrate that the two optimisation methods for monitoring strategy formulation are suitable for different types of applications.

The rest of the paper is structured as follows. The next section analyses the requirements with a motivating example. Section 3 introduces the composition model adopted in this research for representing and analysing SBSs. Section 4 presents the proposed methods for criticality calculation. Section 5 introduces the concept of VOM and the method for evaluating monitoring strategies based on VOM. Section 6 describes two optimisation methods for monitoring strategy formulation. Section 6 presents the results of experimental evaluation. Section 7 reviews related work. Finally, Section 8 summarises the major contributions of this paper and outlines future work.

2 REQUIREMENT ANALYSIS WITH MOTIVATING EXAMPLE

This section presents an example SBS, namely OnlineLive, to motivate this research. As depicted in Fig. 1, this SBS offers an on-demand service to convert, subtitle and transmit live video streams, and it consists of 24 BCs $(N_0, \ldots, N_8, E_A, \ldots, E_O)$. $N_0$ and $N_8$ are virtual nodes that represent the entry and exit points of OnlineLive. $N_1, N_2, \ldots, N_7$ represent the component services and $E_A, E_B, \ldots, E_O$ represent the data transmissions between the component services. In response to a client's request, the execution process of OnlineLive is as follows:

Step 1: $N_1$ splits the live media stream selected by the client into separate video and audio streams.

Step 2: The video and audio streams are processed in parallel. More specifically,

• $N_2$ and $N_3$ convert the video stream and audio stream respectively into formats that are compatible with the client's end device.

• $N_4$ generates the subtitle by performing speech recognition on the audio stream. Then, based on the client's preference or current country/region, the subtitle is sent to either $N_5$ or $N_6$ to be translated into one of the two optional languages.

Step 3: $N_7$ produces a media stream by merging and synchronising the converted video stream, the converted audio stream and the translated subtitle.

Step 4: The media stream is transmitted to the client.

OnlineLive must process the media stream promptly and continuously. Otherwise, the client will receive a jittering media stream. Thus, OnlineLive must be monitored so that runtime anomalies that may cause delays to OnlineLive can be detected or predicted in time, and corresponding adaptation actions can then be taken to fix the anomalies immediately, avoiding or reducing delay.

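OnlineLive has four end-to-end execution paths:

$EP_1 = N_0{-}E_A{-}N_1{-}E_B{-}E_C{-}N_2{-}E_F{-}E_N{-}N_7{-}E_O{-}N_8$;

$EP_2 = N_0{-}E_A{-}N_1{-}E_B{-}E_D{-}N_3{-}E_G{-}E_N{-}N_7{-}E_O{-}N_8$;

$EP_3 = N_0{-}E_A{-}N_1{-}E_B{-}E_E{-}N_4{-}E_H{-}E_I{-}N_5{-}E_K{-}E_M{-}E_N{-}N_7{-}E_O{-}N_8$;

$EP_4 = N_0{-}E_A{-}N_1{-}E_B{-}E_E{-}N_4{-}E_H{-}E_J{-}N_6{-}E_L{-}E_M{-}E_N{-}N_7{-}E_O{-}N_8$.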
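As a minimal illustrative sketch (not part of CriMon itself), the four execution paths listed above can be recovered by a depth-first traversal of the OnlineLive composition. The successor map is an assumption transcribed from the path listing, and the names `SUCCESSORS` and `enumerate_paths` are hypothetical.

```python
# Successor map over the 24 basic components (BCs) of OnlineLive.
# Assumption: transcribed from the EP1..EP4 listing; N* are services
# (N0 and N8 virtual), E* are data transmissions.
SUCCESSORS = {
    "N0": ["EA"], "EA": ["N1"], "N1": ["EB"],
    "EB": ["EC", "ED", "EE"],          # parallel split after N1
    "EC": ["N2"], "N2": ["EF"], "EF": ["EN"],
    "ED": ["N3"], "N3": ["EG"], "EG": ["EN"],
    "EE": ["N4"], "N4": ["EH"],
    "EH": ["EI", "EJ"],                # branch: translation by N5 or N6
    "EI": ["N5"], "N5": ["EK"], "EK": ["EM"],
    "EJ": ["N6"], "N6": ["EL"], "EL": ["EM"],
    "EM": ["EN"], "EN": ["N7"], "N7": ["EO"], "EO": ["N8"],
    "N8": [],
}


def enumerate_paths(successors, entry, exit_):
    """Return every entry-to-exit path in an acyclic successor map."""
    paths = []
    stack = [[entry]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == exit_:
            paths.append(path)
            continue
        # Push successors in reverse so paths come out in declaration order.
        for nxt in reversed(successors[last]):
            stack.append(path + [nxt])
    return paths


PATHS = enumerate_paths(SUCCESSORS, "N0", "N8")
for i, p in enumerate(PATHS, start=1):
    print(f"EP{i} = " + "-".join(p))
```

The traversal treats parallel splits and branches alike, because an end-to-end execution path only records which BCs follow one another, not how they are selected at runtime.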

The critical path determines the response time of OnlineLive. Any delays on it will impact on the response time of OnlineLive directly. Thus, it needs to be identified and monitored with priority in order to implement cost-effective monitoring. However, anomalies may occur at runtime, causing delays to different execution paths. As a result, the critical path of OnlineLive may change at runtime. Each of the four execution paths is critical with a certain probability. Adjusting the monitors dynamically at runtime as the critical path changes is impractical, especially in highly volatile environments where it is difficult for the adjustment of the monitors to keep up with the pace of the critical path change. In addition, frequent adjustment of monitors can be very expensive in terms of software, hardware and sometimes human resources. Therefore, an alternative to constantly adjusting monitors is required. Those execution paths that are more likely to be critical need to be identified and monitored with priority.

When formulating cost-effective monitoring strategies for OnlineLive, both the resource cost and system cost must be considered. The total resource cost must not exceed the total monitoring resources available for OnlineLive. At runtime, the quality of OnlineLive might be jeopardised by monitoring. For example, in order to monitor the quality of the video being transmitted to clients, monitors can be deployed at $N_2$ and $E_F$ to intercept network packets for video stream parsing. Frequent packet interceptions may incur significant network traffic overheads. In addition, while coarse-grained video quality monitoring only inspects network statistics [57], fine-grained (and usually accurate) video quality monitoring typically requires detailed knowledge of video content and features [51], the retrieval of which may also incur excessive network traffic overheads. Such overheads lower the bandwidth available for the actual video stream. Consequently, the quality of the video stream may be jeopardised due to potential video packet losses [39].

Fig. 1. Process of OnlineLive.

In this paper, we propose CriMon to address the general issues in formulating cost-effective monitoring strategies for SBSs, as exemplified by the OnlineLive example.

3 COMPOSITION MODEL

This section introduces the composition model adopted in this research for representing and analysing SBSs. We first discuss the compositional structures for representing SBSs, and then introduce the concept of execution scenario for the probabilistic analysis of SBSs.

3.1 Compositional Structures

Compositional structures describe the order in which the component services are executed in a service composition to realise the functionality of an SBS. There are four types of basic compositional structures, i.e., sequence, branch, loop and parallel [7], [60], [65], which are included in BPMN [45] and addressed by BPEL [44], the de facto standards for specifying service-oriented business processes.

• Sequence. In a sequence structure, the BCs are executed one by one.

• Branch. In a branch structure, only one branch is selected for execution. For a set of branches $\{b_1, \ldots, b_n\}$, the execution probability distribution $\{p(b_1), \ldots, p(b_n)\}$ $(0 \le p(b_i) \le 1, \sum_{i=1}^{n} p(b_i) = 1.0)$ is specified, where $p(b_i), i = 1, \ldots, n$, is the probability that the $i$th branch is selected for execution.

• Loop. In a loop structure, the loop is executed $n$ $(0 \le n \le MNI)$ times. For a loop, the probability distribution $\{p_0, \ldots, p_{MNI}\}$ $(0 \le p_i \le 1, \sum_{i=0}^{MNI} p_i = 1.0)$ is specified, where $p_i, i = 0, \ldots, MNI$, is the probability that the loop iterates $i$ times and $MNI$ is the expected maximum number of iterations of the loop.

• Parallel. In a parallel structure, all the branches are executed at the same time.

$p(b_i)$, $p_i$ and $MNI$ can be evaluated based on the past executions of the SBS or can be specified by the developer [7], [64]. We assume that for a loop, the $MNI$ can be determined or estimated. Otherwise, without an upper bound for the number of iterations, the execution times of the execution paths that contain the loop cannot be calculated since the loop may iterate infinitely.
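One simple way to evaluate these quantities from past executions is by relative frequency. The following is a minimal sketch under that assumption; the source does not prescribe an estimator, and the function names and the sample logs are hypothetical. Both functions assume a non-empty log.

```python
from collections import Counter
from fractions import Fraction


def estimate_branch_probabilities(selected_branches):
    """Relative frequency of each branch label in an execution log."""
    counts = Counter(selected_branches)
    total = len(selected_branches)
    return {b: Fraction(c, total) for b, c in counts.items()}


def estimate_loop_distribution(iteration_counts):
    """Return (MNI, [p_0, ..., p_MNI]) from observed iteration counts.

    Assumption: MNI is taken as the largest iteration count observed.
    """
    mni = max(iteration_counts)
    counts = Counter(iteration_counts)
    total = len(iteration_counts)
    # Counter returns 0 for iteration counts that never occurred.
    return mni, [Fraction(counts[i], total) for i in range(mni + 1)]


# Hypothetical logs: which branch ran, and how often a loop iterated.
branch_log = ["b1", "b2", "b1", "b1"]
loop_log = [0, 1, 1, 2, 1]

branch_p = estimate_branch_probabilities(branch_log)
mni, loop_p = estimate_loop_distribution(loop_log)
print(branch_p, mni, loop_p)
```

Exact fractions are used so that each estimated distribution sums to exactly 1.0, matching the constraint stated for $p(b_i)$ and $p_i$.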

In this research, we represent service compositions using UML activity diagrams, where the nodes represent component services and the edges represent data transmissions. Without loss of generality, we assume that a service composition is characterised by only one entry point and one exit point, and only includes structured loops with only one entry point and one exit point. If a service composition includes loops, we peel the loops by representing loop iterations as a set of branches with corresponding execution probabilities [7]. Fig. 2 gives an example of peeling a loop structure ($MNI = 2$) by transforming it into a branch structure that contains three branches $b_1$, $b_2$ and $b_3$, where $p_0$, $p_1$ and $p_2$ are the probabilities that $b_1$, $b_2$ and $b_3$ are selected for execution respectively. (Note that the first branch $b_1$ is selected if the loop iterates 0 times, i.e., corresponding to $p_0$.)
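The peeling step can be sketched as follows. This is an illustrative reading of the description above rather than the authors' implementation: the loop body is simplified to a flat list of BC names (the exact wiring of transmissions in Fig. 2 is not reproduced), and `peel_loop`, the body `["S1"]` and the probabilities are hypothetical.

```python
def peel_loop(body, iteration_probs):
    """Transform a loop into the branches b_1, ..., b_{MNI+1}.

    body: list of BC names executed in one loop iteration.
    iteration_probs: [p_0, ..., p_MNI], where p_i is the probability
        that the loop iterates i times.
    Branch b_k repeats the body k-1 times and has probability p_{k-1}.
    """
    if abs(sum(iteration_probs) - 1.0) > 1e-9:
        raise ValueError("iteration probabilities must sum to 1.0")
    return [
        {"branch": f"b{k}", "sequence": body * (k - 1), "probability": p}
        for k, p in enumerate(iteration_probs, start=1)
    ]


# Hypothetical loop with MNI = 2, so peeling yields three branches.
branches = peel_loop(["S1"], [0.2, 0.5, 0.3])
for b in branches:
    print(b)
```

Because $MNI$ bounds the number of iterations, the peeled structure is finite and contains no cycle, which is what allows the later analysis to treat it as an ordinary branch structure.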

3.2 Execution Scenarios

In a service composition where branches or loops are involved, different execution paths may be selected for execution. Thus, multiple possible execution scenarios can be identified from the service composition. These execution scenarios do not contain branch or loop structures, and hence can be modelled as Directed Acyclic Graphs (DAGs). As depicted in Fig. 3, two possible execution scenarios can be identified from OnlineLive (see Fig. 1): $es_1 = \{EP_1, EP_2, EP_3\}$ and $es_2 = \{EP_1, EP_2, EP_4\}$. The criticality evaluation of the execution paths must consider all the possible execution scenarios according to their execution probabilities, i.e., the probabilities that the execution scenarios occur in response to user requests. Therefore, we need to calculate the execution probability of each execution scenario identified from the service composition. To do so, we first calculate the execution probabilities of the BCs and the execution paths.
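To make the role of execution scenarios concrete, the sketch below estimates path criticalities for OnlineLive by simulation. It is a Monte Carlo illustration only, not the calculation method that CriMon uses. The scenario probabilities (0.6 and 0.4) and the uniform response-time distribution are hypothetical values chosen for the example, as are all function names.

```python
import random

PATHS = {
    "EP1": "N0-EA-N1-EB-EC-N2-EF-EN-N7-EO-N8".split("-"),
    "EP2": "N0-EA-N1-EB-ED-N3-EG-EN-N7-EO-N8".split("-"),
    "EP3": "N0-EA-N1-EB-EE-N4-EH-EI-N5-EK-EM-EN-N7-EO-N8".split("-"),
    "EP4": "N0-EA-N1-EB-EE-N4-EH-EJ-N6-EL-EM-EN-N7-EO-N8".split("-"),
}

# Execution scenarios from Fig. 3 with HYPOTHETICAL execution probabilities.
SCENARIOS = {
    "es1": (0.6, ["EP1", "EP2", "EP3"]),
    "es2": (0.4, ["EP1", "EP2", "EP4"]),
}

# Sorted so that the random draws are reproducible for a given seed.
ALL_BCS = sorted({bc for path in PATHS.values() for bc in path})


def sample_times(rng):
    """Draw one response time per BC (hypothetical: uniform 5..15 ms).

    The virtual entry and exit nodes N0 and N8 take no time.
    """
    return {
        bc: 0.0 if bc in ("N0", "N8") else rng.uniform(5.0, 15.0)
        for bc in ALL_BCS
    }


def estimate_criticalities(runs, seed=0):
    """Fraction of simulated executions in which each path is the longest."""
    rng = random.Random(seed)
    names = list(SCENARIOS)
    weights = [SCENARIOS[name][0] for name in names]
    hits = dict.fromkeys(PATHS, 0)
    for _ in range(runs):
        scenario = rng.choices(names, weights=weights)[0]
        times = sample_times(rng)
        durations = {
            p: sum(times[bc] for bc in PATHS[p])
            for p in SCENARIOS[scenario][1]
        }
        hits[max(durations, key=durations.get)] += 1
    return {p: hits[p] / runs for p in PATHS}


criticality = estimate_criticalities(10_000)
print(criticality)
```

A path can only be critical in a scenario that contains it, so the estimate for $EP_3$ cannot exceed the execution probability of $es_1$. With identical distributions for all BCs, the longer subtitle paths $EP_3$ and $EP_4$ dominate; other distributions would shift the criticalities towards $EP_1$ or $EP_2$.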

The execution probabilities of the BCs are calculated by performing a forward propagation through the service composition. It starts with assigning 1.0 to the execution probability of the entry node, e.g., $N_0$ in Fig. 1, because it is always executed. During the forward propagation, the execution probabilities of the BCs in different compositional structures are calculated as follows:

• Sequence. In a sequence structure, a BC's execution probability equals its precedent BC's execution probability. Formally, given a BC $S_j$ and its precedent BC $S_i$ in a sequence structure, $S_j$'s execution probability, denoted by $ep(S_j)$, is calculated as:

$$ep(S_j) = ep(S_i). \tag{1}$$

For example, in Fig. 1, we have $ep(N_1) = ep(E_A) = ep(N_0)$ and $ep(E_F) = ep(N_2) = ep(E_C)$.

� Branch. In a branch structure, the execution probabil-ity of a splitting edge (e.g., EI or EJ in Fig. 1) is theproduct of the execution probability of its precedentedge (e.g., EH in Fig. 1) and the execution probabilityof the branch that it belongs to. Formally, given a setof nðn � 2) splitting edges E1; . . . ; En in a branchstructure and their common precedent edge Ep, theexecution probabilities of the splitting edges,denoted by epðE1Þ; . . . ; epðEnÞ, are calculated as:

epðEiÞ ¼ epðEpÞ � pðbiÞ i ¼ 1; . . . ; n; (2)

where bi is the branch that Ei belongs to.

464 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 40, NO. 5, MAY 2014

Page 5: Formulating cost effective monitoring strategies for service-based systems

For example, in Fig. 1, suppose that the branchthat EI belongs to is b1, there is epðEIÞ ¼ epðEHÞ�pðb1Þ.

The execution probability of the succeeding edge(e.g., EM in Fig. 1) that the branches merge into in a

branch structure is the sum of the execution proba-bilities of all the edge’s precedent edges (e.g., EK

and EL in Fig. 1). Formally, for a set of nðn � 2)edges E1; . . . ; En merging into an succeeding edgeEs in a branch structure, the execution probability of

Fig. 2. The loop peeling process.

Fig. 3. Execution scenarios identified from OnlineLive.

HE ET AL.: FORMULATING COST-EFFECTIVE MONITORING STRATEGIES FOR SERVICE-BASED SYSTEMS 465


$E_s$, denoted by $ep(E_s)$, is calculated as:

$$ep(E_s) = \sum_{i=1}^{n} ep(E_i). \tag{3}$$

In Fig. 1, for example, there is $ep(E_M) = ep(E_K) + ep(E_L)$.

The calculation of the execution probabilities of other BCs in the branch structure follows formula (1) because each branch is actually a sequence structure. For example, in Fig. 1, there is $ep(E_I) = ep(N_5) = ep(E_K)$.

• Loop. After the transformation into a branch structure using the loop peeling process introduced in Section 3.1, a loop structure can be processed in the same way as a branch structure. For example, for a branch structure transformed from a loop structure with the probability distribution $\{p_0, \ldots, p_{MNI}\}$, the splitting edges' execution probabilities, denoted by $ep(E_i)$, $i = 1, \ldots, MNI + 1$, are calculated as:

$$ep(E_i) = ep(E_p) \times p_{i-1}, \quad i = 1, \ldots, MNI + 1, \tag{4}$$

where $E_p$ is the common precedent edge of $E_i$ and $p_{i-1}$ is the execution probability of the branch that $E_i$ belongs to.

• Parallel. Since all branches are selected for execution in a parallel structure, the execution probabilities of all the BCs on the branches in a parallel structure equal the execution probability of the edge that precedes the parallel branches. Formally, given a precedent edge $E_p$ splitting into a set of $n$ ($n \geq 2$) edges $E_1, \ldots, E_n$ in a parallel structure, the execution probabilities of $E_1, \ldots, E_n$ are calculated as:

$$ep(E_i) = ep(E_p), \quad i = 1, \ldots, n. \tag{5}$$

In the example in Fig. 1, there is $ep(E_C) = ep(E_D) = ep(E_E) = ep(E_B)$.

The execution probability of the succeeding edge (e.g., $E_N$ in Fig. 1) that the parallel branches merge into in a parallel structure equals the execution probability of any of the edge's precedent edges (e.g., $E_F$, $E_G$ and $E_M$), which have the same execution probability. Formally, for a set of $n$ ($n \geq 2$) edges $E_i$, $i = 1, \ldots, n$, merging into a succeeding edge $E_s$ in a parallel structure, the execution probability of $E_s$ is calculated as:

$$ep(E_s) = ep(E_i), \quad \forall i \in [1, \ldots, n]. \tag{6}$$

In the example in Fig. 1, there is $ep(E_N) = ep(E_F) = ep(E_G) = ep(E_M)$.
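The propagation rules in formulas (1) to (6) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, the topology is read off the Fig. 1 examples quoted in the text, and the branch probabilities $p(b_1) = 0.6$ and $p(b_2) = 0.4$ are assumed values.

```python
import math

def branch_split(ep_prev, branch_probs):
    """Formula (2): each splitting edge gets ep(E_p) * p(b_i)."""
    return [ep_prev * p for p in branch_probs]

def branch_merge(ep_edges):
    """Formula (3): the merged edge gets the sum of its precedent edges."""
    return sum(ep_edges)

def parallel_merge(ep_edges):
    """Formula (6): all merging edges share one probability; return it."""
    first = ep_edges[0]
    if not all(math.isclose(first, e) for e in ep_edges):
        raise ValueError("parallel branches must share one execution probability")
    return first

def propagate(p_b1=0.6, p_b2=0.4):
    """Forward propagation over an OnlineLive-like composition (assumed topology)."""
    ep = {"N0": 1.0}                                # entry node is always executed
    ep["EA"] = ep["N1"] = ep["EB"] = ep["N0"]       # sequence, formula (1)
    for edge in ("EC", "ED", "EE"):                 # parallel split, formula (5)
        ep[edge] = ep["EB"]
    ep["N2"] = ep["EF"] = ep["EC"]                  # each branch is a sequence
    ep["N3"] = ep["EG"] = ep["ED"]
    ep["N4"] = ep["EH"] = ep["EE"]
    ep["EI"], ep["EJ"] = branch_split(ep["EH"], [p_b1, p_b2])
    ep["N5"] = ep["EK"] = ep["EI"]
    ep["N6"] = ep["EL"] = ep["EJ"]
    ep["EM"] = branch_merge([ep["EK"], ep["EL"]])
    ep["EN"] = parallel_merge([ep["EF"], ep["EG"], ep["EM"]])
    ep["N7"] = ep["EO"] = ep["N8"] = ep["EN"]       # sequence up to the exit node
    return ep

ep = propagate()
```

Under these assumptions the merged edge $E_M$ recovers the probability of $E_H$, so all three edges merging into $E_N$ carry the same execution probability, as formula (6) requires.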

Having calculated the execution probabilities of the BCs, we can calculate the execution probabilities of the execution paths in a service composition. The execution probability of an execution path equals the minimum execution probability of all the BCs that belong to the execution path. Formally, for an execution path $EP_i = \{S_1, \ldots, S_n\}$, the execution probability of $EP_i$, denoted by $ep(EP_i)$, is calculated as:

$$ep(EP_i) = \min(ep(S_1), \ldots, ep(S_n)). \tag{7}$$

For example, in Fig. 1, there is $ep(EP_1) = \min(ep(N_0), ep(E_A), ep(N_1), ep(E_B), ep(E_C), ep(N_2), ep(E_F), ep(E_N), ep(N_7), ep(E_O), ep(N_8))$.

An execution scenario occurs only when all its constituent execution paths are selected for execution. Taking Fig. 3 as an example, given that $EP_1$ and $EP_2$ are always selected for execution, $es_1$ and $es_2$ occur when $EP_3$ and $EP_4$ are selected for execution, respectively.

Now, given the execution probabilities of all execution paths, we can calculate the execution probabilities of the execution scenarios identified from the service composition. The execution probability of an execution scenario is the product of the execution probabilities of all the execution paths in it. For an execution scenario $es_i = \{EP_1, \ldots, EP_n\}$, the execution probability of $es_i$ is calculated as:

$$ep(es_i) = \prod_{j=1}^{n} ep(EP_j). \tag{8}$$

For example, in Fig. 3a, there is $ep(es_1) = ep(EP_1) \times ep(EP_2) \times ep(EP_3)$.
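Formulas (7) and (8) amount to a minimum followed by a product. The sketch below illustrates this with assumed inputs: the BC execution probabilities use a hypothetical 0.6/0.4 branch split, and the membership of $EP_2$, $EP_3$ and $EP_4$ is inferred from the examples in the text rather than taken from Fig. 1 directly.

```python
import math

def path_probability(ep, path):
    """Formula (7): minimum execution probability over the BCs on the path."""
    return min(ep[bc] for bc in path)

def scenario_probability(path_probs):
    """Formula (8): product of the execution probabilities of the paths."""
    return math.prod(path_probs)

# Assumed BC execution probabilities (branch b1 = 0.6, branch b2 = 0.4).
always = ["N0", "EA", "N1", "EB", "EC", "N2", "EF", "ED", "N3", "EG",
          "EE", "N4", "EH", "EM", "EN", "N7", "EO", "N8"]
ep = {bc: 1.0 for bc in always}
ep.update({"EI": 0.6, "N5": 0.6, "EK": 0.6, "EJ": 0.4, "N6": 0.4, "EL": 0.4})

head, tail = ["N0", "EA", "N1", "EB"], ["EN", "N7", "EO", "N8"]
paths = {
    "EP1": head + ["EC", "N2", "EF"] + tail,
    "EP2": head + ["ED", "N3", "EG"] + tail,                                # assumed
    "EP3": head + ["EE", "N4", "EH", "EI", "N5", "EK", "EM"] + tail,        # assumed
    "EP4": head + ["EE", "N4", "EH", "EJ", "N6", "EL", "EM"] + tail,        # assumed
}
ep_path = {name: path_probability(ep, bcs) for name, bcs in paths.items()}

ep_es1 = scenario_probability([ep_path[p] for p in ("EP1", "EP2", "EP3")])
ep_es2 = scenario_probability([ep_path[p] for p in ("EP1", "EP2", "EP4")])
```

With these inputs the two scenario probabilities come out as 0.6 and 0.4, which are the values assumed for $ep(es_1)$ and $ep(es_2)$ later in Section 4.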

4 CRITICALITY CALCULATION

The critical path of an SBS can provide important information for formulating monitoring strategies for the SBS. Critical path has long been a powerful analytical tool in many domains, such as project management [32], digital integrated circuits [48], workflow [55], distributed computing [49] and service-oriented computing [64]. However, due to the dynamic and volatile nature of service-oriented environments, the timing properties of the constituent BCs of an SBS are probabilistic, making the existing tools unsuitable for identifying the critical path of the SBS. In this section, we first introduce in Section 4.1 a probabilistic timing model that captures the randomness of the timing properties of the BCs in an SBS. Based on this model, we introduce the method for calculating the BCs' dominance probabilities in Section 4.2. The methods for criticality calculation, which are based on dominance probabilities, are presented in Section 4.3.

4.1 Timing Model

In volatile environments, the execution of the BCs of an SBS is often affected by various runtime anomalies. Thus, the timing properties of the BCs are random variables instead of deterministic values, which makes the response time of the SBS probabilistic [15]. For realistic evaluation of the response time of an SBS, a model is needed that takes into account the probabilistic nature of the timing properties of its BCs. In this research, three types of timing properties are considered:

• Start time ($T_S$): the time elapsed from the moment when the SBS is invoked (referred to as time zero) to the moment when the BC is activated.

• Response time ($T_R$): the time elapsed for the BC to complete since its start time.

• Finish time ($T_F$): the time elapsed for the BC to complete since time zero.


The units for measuring these timing properties are flexible and domain-specific. Usually, seconds, milliseconds and nanoseconds are used.

The response time of a BC $S_i$ is expressed in the standard form as:

$$T_R(S_i) = t_0 + \sum_{i=1}^{n} w_i \times \Delta X_i, \tag{9}$$

where $t_0$ is the mean value of $T_R(S_i)$, i.e., $E(T_R(S_i))$; $\Delta X_i$, $i = 1, 2, \ldots, n$, represent the variation of $n$ sources of anomaly $X_i$, $i = 1, 2, \ldots, n$, from their mean values; $w_i$, $i = 1, 2, \ldots, n$, represent the sensitivities of $T_R(S_i)$ to each of the sources of anomaly.

For the response time of a BC $S_i$, i.e., $T_R(S_i)$, the values of $t_0$ and $w_i$ and the distributions of $\Delta X_i$ can be evaluated by inspecting $S_i$'s past executions, service consumers' feedback, service providers' profiles, etc. If the $t_0$, $w_i$ or $\Delta X_i$ of a BC is unknown, we use the $T_R(S_i)$ specified in the corresponding SLA or a distribution of $T_R(S_i)$ empirically obtained by the SBS developer or administrator. Given a service composition $SC = \{S_1, \ldots, S_n\}$ and $T_R(S_i)$, $i = 1, \ldots, n$, the start times and finish times of the BCs, denoted by $T_S(S_i)$ and $T_F(S_i)$, $i = 1, \ldots, n$, where $T_F(S_i) = T_S(S_i) + T_R(S_i)$, can be calculated using the forward timing property propagation (see Section 4.2).

4.2 Dominance Probability

Given two random timing properties $T_A$ and $T_B$, the dominance probability of $T_A$ over $T_B$, denoted as $D_{T_B}(T_A)$, is the probability that $T_A$ is larger than or equal to $T_B$, where $D_{T_B}(T_A) \in [0, 1]$ and $D_{T_B}(T_A) = 1 - D_{T_A}(T_B)$. Given $n$ timing properties $T_i$, $i = 1, 2, \ldots, n$, the dominance probability of each, denoted as $D(T_i)$, is the probability that it is larger than or equal to all the others. The calculation of $D(T_i)$ depends on the distributions of $\Delta X_i$ (see Section 4.1). In this research, we assume that $\Delta X_i$ are subject to Gaussian distributions to facilitate general evaluation of timing properties. For other probability distributions, e.g., exponential distribution, corresponding techniques can be adopted for the calculation of dominance probability [58].

Let us first consider the case of two timing properties $T_A$ and $T_B$:

$$T_A = t_a + \sum_{i=1}^{n} a_i \times \Delta X_i, \tag{10}$$

$$T_B = t_b + \sum_{i=1}^{n} b_i \times \Delta X_i. \tag{11}$$

Their expected values are $\mu_A = E[T_A] = t_a$ and $\mu_B = E[T_B] = t_b$, and their variances are:

$$\sigma_A^2 = \mathrm{var}(T_A) = E\{[T_A - E(T_A)]^2\} = \sum_{i=1}^{n} a_i^2, \tag{12}$$

$$\sigma_B^2 = \mathrm{var}(T_B) = E\{[T_B - E(T_B)]^2\} = \sum_{i=1}^{n} b_i^2. \tag{13}$$

Since $\Delta X_i$, $i = 1, 2, \ldots, n$, are subject to Gaussian distributions, there is $T_A \sim N(\mu_A, \sigma_A^2)$, $T_B \sim N(\mu_B, \sigma_B^2)$ and $Y \triangleq T_B - T_A \sim N(t_b - t_a, \sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2)$. Thus, $D_{T_B}(T_A)$ can be calculated by:

$$D_{T_B}(T_A) = P(T_A - T_B \geq 0) = P(Y \leq 0) = F_Y(0) = \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi\sigma_Y^2}} \exp\left(-\frac{(x - \mu_Y)^2}{2\sigma_Y^2}\right) dx, \tag{14}$$

$$\mu_Y = t_b - t_a, \tag{15}$$

$$\sigma_Y^2 = \sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2, \tag{16}$$

where $F_Y$ is the cumulative probability function of $Y$.

Now let us consider the case of multiple independent timing properties $T_1, T_2, \ldots, T_n$. Let $Z \triangleq \max(T_1, \ldots, T_{i-1}, T_{i+1}, \ldots, T_n)$, there is:

$$D(T_i) = P((T_i \geq T_1) \cap \cdots \cap (T_i \geq T_{i-1}) \cap (T_i \geq T_{i+1}) \cap \cdots \cap (T_i \geq T_n)) = P(T_i \geq Z) = P(Z - T_i \leq 0) = F_{Z - T_i}(0). \tag{17}$$
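For the two-property case, formula (14) is a Gaussian cumulative distribution function evaluated at zero, so it can be computed with the error function. The sketch below is an illustration under the paper's stated assumptions (Gaussian $\Delta X_i$ and the variance of formula (16)); the function names and the numeric inputs are hypothetical.

```python
import math

def normal_cdf(x, mean=0.0, var=1.0):
    """Cumulative distribution function of N(mean, var) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def dominance(t_a, a, t_b, b):
    """D_{T_B}(T_A) = P(Y <= 0) with Y = T_B - T_A, per formulas (14) to (16).

    t_a, t_b are the mean values; a, b are the sensitivity coefficients.
    """
    mu_y = t_b - t_a                                        # formula (15)
    var_y = sum(x * x for x in a) + sum(x * x for x in b)   # formula (16)
    return normal_cdf(0.0, mu_y, var_y)                     # formula (14)

# Hypothetical timing properties: T_A has the larger mean, so it should dominate.
d_ab = dominance(12.0, [1.0, 0.5], 10.0, [2.0])
d_ba = dominance(10.0, [2.0], 12.0, [1.0, 0.5])
```

The two results are complementary, which matches the property $D_{T_B}(T_A) = 1 - D_{T_A}(T_B)$ stated at the start of this section.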

Now $F_{Z - T_i}(0)$ can be calculated by adopting formula (14). As an example, in the following, we use a case with three timing properties, $T_A \sim N(\mu_A, \sigma_A^2)$, $T_B \sim N(\mu_B, \sigma_B^2)$ and $T_C \sim N(\mu_C, \sigma_C^2)$, to demonstrate the calculation of a timing property's dominance probability over multiple timing properties. Let $Z \triangleq \max(T_A, T_B)$, there is:

$$D(T_C) = P(T_C \geq Z) = P(Z - T_C \leq 0). \tag{18}$$

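Before applying formula (14), we adopt the analytic solution presented in [18] for expressing $Z$. Let $\theta \triangleq (\sigma_A^2 + \sigma_B^2)^{1/2}$, and there are

$$\mu_Z = E[Z] = t_a \times D_{T_B}(T_A) + t_b \times (1 - D_{T_B}(T_A)) + \theta \times \phi\left(\frac{t_a - t_b}{\theta}\right), \tag{19}$$

$$\sigma_Z^2 = \mathrm{var}(Z) = (\sigma_A^2 + \mu_A^2) \times D_{T_B}(T_A) + (\sigma_B^2 + \mu_B^2)(1 - D_{T_B}(T_A)) + (t_a + t_b) \times \theta \times \phi\left(\frac{t_a - t_b}{\theta}\right) - \mu_Z^2, \tag{20}$$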

where

$$\phi(x) \triangleq \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right). \tag{21}$$
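Formulas (18) to (21) can be chained into a short computation: approximate the maximum of two Gaussian timing properties by its first two moments, then apply formula (14) to the difference between that maximum and $T_C$. The following is a sketch under the independence and Gaussian assumptions of this section; it treats $Z$ as Gaussian when applying formula (14), and the function names and the numeric example are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal density, formula (21)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def max_moments(t_a, var_a, t_b, var_b):
    """Mean and variance of Z = max(T_A, T_B), formulas (19) and (20)."""
    theta = math.sqrt(var_a + var_b)
    alpha = (t_a - t_b) / theta
    d_ab = std_normal_cdf(alpha)            # D_{T_B}(T_A), equivalent to formula (14)
    mu_z = t_a * d_ab + t_b * (1.0 - d_ab) + theta * phi(alpha)
    second_moment = ((var_a + t_a ** 2) * d_ab
                     + (var_b + t_b ** 2) * (1.0 - d_ab)
                     + (t_a + t_b) * theta * phi(alpha))
    return mu_z, second_moment - mu_z ** 2

def dominance_over_two(t_c, var_c, t_a, var_a, t_b, var_b):
    """D(T_C) = P(Z - T_C <= 0), formula (18), with Z approximated as Gaussian."""
    mu_z, var_z = max_moments(t_a, var_a, t_b, var_b)
    return std_normal_cdf((t_c - mu_z) / math.sqrt(var_z + var_c))

# Three identically distributed properties: each should dominate about 1/3 of the time.
mu_z, var_z = max_moments(10.0, 1.0, 10.0, 1.0)
d_c = dominance_over_two(10.0, 1.0, 10.0, 1.0, 10.0, 1.0)
```

For three identically distributed properties the exact answer is one third; the moment-matching approximation lands close to it, which gives a quick sanity check on the reconstruction of formulas (19) and (20).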

Having computed $\mu_Z$ and $\sigma_Z^2$, formula (14) can be applied to calculate $D(T_C)$.

As the BCs in a service composition have timing properties with timing dependencies, we introduce a new concept concerning BCs that is similar to the dominance probability of timing properties, i.e., dominance probability of a BC.

Definition 1. Dominance Probability of a BC. For a given BC $S_i$ in an execution scenario $es_i = \{S_1, S_2, \ldots, S_n\}$ of a service composition, the dominance probability of $S_i$, denoted as $D(S_i)$, is the probability that $S_i$'s finish time solely determines the start time of its succeeding BC(s).

The dominance probability computation and timing property propagation for the BCs through an execution scenario of the service composition are interleaved based on their timing properties and the compositional structures that they are involved in. Since an execution scenario does not contain branch or loop structures, we only need to consider sequence and parallel structures.

Fig. 4. Criticality calculation for OnlineLive (D = Dominance Probability and C = Criticality).


• Sequence. A BC $S_j$ has only one precedent BC $S_i$ in a sequence structure. Thus, the start time of $S_j$ is solely determined by $S_i$'s finish time:

$$T_S(S_j) = T_F(S_i). \tag{22}$$

According to Definition 1, there is

$$D(S_i) = 1.0. \tag{23}$$

In Figs. 4a and 4b, for example, the start time of $N_1$ is solely determined by the finish time of $E_A$, i.e., $T_S(N_1) = T_F(E_A)$. Thus, there is $D(E_A) = 1.0$.

• Parallel. In the splitting part of a parallel structure where a precedent edge $E_p$ splits into multiple edges $E_1, \ldots, E_n$, the start times of $E_i$, $i = 1, \ldots, n$, are solely determined by the finish time of $E_p$. Thus, formulas (22) and (23) can be applied to calculate $T_S(E_i)$ and $D(E_p)$. In Fig. 4a, for example, the start times of $E_C$, $E_D$ and $E_E$ are solely determined by the finish time of $E_B$, i.e., $T_S(E_C) = T_S(E_D) = T_S(E_E) = T_F(E_B)$. Hence, there is $D(E_B) = 1.0$. In the merging part of a parallel structure where multiple edges $E_1, \ldots, E_n$ merge into an edge $E_s$ succeeding the branches, the start time of $E_s$ is determined by the edge that finishes last, i.e., the edge that has the maximum finish time. For example, there is $T_S(E_N) = \max(T_F(E_F), T_F(E_G), T_F(E_K))$ in Fig. 4a and $T_S(E_N) = \max(T_F(E_F), T_F(E_G), T_F(E_L))$ in Fig. 4b. Since the finish times of the merging edges are random variables, the start time of $E_N$ is probabilistically determined by each merging edge's finish time according to their respective dominance probabilities:

$$T_S(E_s) = \max(T_F(E_1), T_F(E_2), \ldots, T_F(E_n)) = \begin{cases} T_F(E_1) & \text{with the probability of } D(E_1) \\ T_F(E_2) & \text{with the probability of } D(E_2) \\ \ldots \\ T_F(E_n) & \text{with the probability of } D(E_n), \end{cases} \tag{24}$$

where $\sum_{i=1}^{n} D(E_i) = 1$ and $D(E_i)$, $i = 1, \ldots, n$, can be calculated using formulas (14) to (21). To proceed with the interleaved timing property propagation and dominance probability computation, we use the weighted average value, denoted by $T_S(E_s)$, as the start time of the succeeding edge $E_s$:

$$T_S(E_s) = \sum_{i=1}^{n} D(E_i) \times T_F(E_i). \tag{25}$$

For example, in Fig. 4a, there are $T_S(E_N) = D(E_F) \times T_F(E_F) + D(E_G) \times T_F(E_G) + D(E_K) \times T_F(E_K)$ and $T_F(E_N) = T_S(E_N) + T_R(E_N)$.

For other BCs in a parallel structure, formulas (22) and (23) apply as each parallel branch is actually a sequence structure.
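The merging rule of formulas (24) and (25) can be checked numerically. The sketch below deliberately swaps in a Monte Carlo estimate of the dominance probabilities instead of the analytic formulas (14) to (21), and it applies formula (25) to the mean finish times only; the finish-time means and standard deviations are made-up values, not those behind Fig. 4.

```python
import random

def merge_start_time(finish, samples=20000, seed=0):
    """Estimate D(E_i) for merging edges and the weighted start time of E_s.

    finish maps an edge name to (mean, std) of its Gaussian finish time.
    Returns (dominance probabilities, weighted average of mean finish times).
    """
    rng = random.Random(seed)
    wins = {edge: 0 for edge in finish}
    for _ in range(samples):
        draws = {edge: rng.gauss(mean, std) for edge, (mean, std) in finish.items()}
        wins[max(draws, key=draws.get)] += 1      # edge that finishes last, formula (24)
    dominance = {edge: count / samples for edge, count in wins.items()}
    start = sum(dominance[e] * finish[e][0] for e in finish)   # formula (25) on means
    return dominance, start

# Hypothetical finish times of the three edges merging into E_N.
finish_times = {"EF": (12.0, 1.0), "EG": (11.0, 1.0), "EK": (9.0, 1.0)}
dominance, start_time = merge_start_time(finish_times)
```

Because exactly one edge wins each sample, the estimated dominance probabilities sum to one, mirroring the condition attached to formula (24).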

In a parallel structure, in particular, we treat the response time of the edges that precede and succeed the branches, e.g., $E_B$ and $E_N$ in Figs. 4a and 4b, as a constant of zero in the timing property propagation. In Figs. 4a and 4b, given $T_R(E_B) = 0$, there are $T_F(E_B) = T_S(E_B) = T_F(N_1)$ and $T_S(E_C) = T_S(E_D) = T_S(E_E) = T_F(E_B) = T_F(N_1)$.

By interleaving the timing property propagation and the dominance probability computation, the dominance probabilities of the BCs in an execution scenario can be calculated. Assuming $ep(es_1) = 0.6$ and $ep(es_2) = 0.4$, we present in Figs. 4a and 4b the dominance probabilities (denoted as D) of the BCs of OnlineLive, which are calculated based on arbitrarily chosen timing properties as a demonstrative example. In reality, the dominance probabilities of the BCs are calculated based on their real timing properties.

4.3 Criticality

This section formally defines criticality and presents the methods for criticality calculation.

First of all, we give the formal definitions of path criticality, node criticality and edge criticality.

Definition 2. Path Criticality. For a given execution path $EP_i$ in a service composition $SC = \{EP_1, EP_2, \ldots, EP_n\}$, the criticality of $EP_i$, denoted as $C(EP_i)$, is the probability that $EP_i$ is the critical path, i.e., $EP_i$ has the maximum execution time among $EP_1, \ldots, EP_n$.

Definition 3. Node Criticality. For a given node $N_i$ in a service composition $SC = \{N_1, N_2, \ldots, N_i, E_1, E_2, \ldots, E_j\}$, the criticality of $N_i$, denoted as $C(N_i)$, is the probability that $N_i$ belongs to the critical path.

Definition 4. Edge Criticality. For a given edge $E_i$ in a service composition $SC = \{N_1, N_2, \ldots, N_i, E_1, E_2, \ldots, E_j\}$, the edge criticality of $E_i$, denoted as $C(E_i)$, is the probability that $E_i$ belongs to the critical path.

As introduced in Section 3.2, multiple execution scenarios can be identified from a service composition. To calculate the criticalities of the execution paths and the BCs in a service composition, we first need to calculate their criticalities in each execution scenario. This is done with a backward propagation process through the execution scenario, which starts by assigning 1.0 to the criticality of the exit node, e.g., $N_8$ in Figs. 4a and 4b, as it always belongs to the critical path. The criticality calculation follows the rules below.

Rule 1. The criticality of an execution path in an execution scenario is the product of the dominance probabilities of all the BCs that belong to the execution path.

An execution path is critical only when all its constituent BCs solely determine the start times of their succeeding BC(s). Taking Fig. 4a as an example, for $EP_1$ to be critical, $E_A$, $E_B$, $E_C$, $E_F$, $E_N$ and $E_O$ have to determine the start times of $N_1$, $E_C$, $N_2$, $E_N$, $N_7$ and $N_8$ respectively. In UML activity diagrams, the dominance probabilities of all nodes are 1.0 as they each have only one succeeding edge. Thus, we can omit the nodes when calculating the criticality of an execution path. Taking Fig. 4a as an example, following Rule 1 and based on the dominance probabilities calculated, the criticalities of $EP_1$, $EP_2$ and $EP_3$ in execution scenario $es_1$ can be calculated:

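$$C_{es_1}(EP_1) = \prod_{E_k \in EP_1} D_{es_1}(E_k) = D_{es_1}(E_A) \times D_{es_1}(E_B) \times D_{es_1}(E_C) \times D_{es_1}(E_F) \times D_{es_1}(E_N) \times D_{es_1}(E_O) = 0.5,$$

$$C_{es_1}(EP_2) = \prod_{E_k \in EP_2} D_{es_1}(E_k) = D_{es_1}(E_A) \times D_{es_1}(E_B) \times D_{es_1}(E_D) \times D_{es_1}(E_G) \times D_{es_1}(E_N) \times D_{es_1}(E_O) = 0.3,$$

$$C_{es_1}(EP_3) = \prod_{E_k \in EP_3} D_{es_1}(E_k) = D_{es_1}(E_A) \times D_{es_1}(E_B) \times D_{es_1}(E_E) \times D_{es_1}(E_I) \times D_{es_1}(E_K) \times D_{es_1}(E_N) \times D_{es_1}(E_O) = 0.2.$$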

Similarly, in execution scenario $es_2$, there are:

$$C_{es_2}(EP_1) = \prod_{E_k \in EP_1} D_{es_2}(E_k) = 0.4,$$

$$C_{es_2}(EP_2) = \prod_{E_k \in EP_2} D_{es_2}(E_k) = 0.2,$$

$$C_{es_2}(EP_4) = \prod_{E_k \in EP_4} D_{es_2}(E_k) = 0.4.$$

Rule 2. The criticality of a node in an execution scenario equals the criticality of its succeeding edge. As each node has only one succeeding edge in an execution scenario, the criticality of a node depends on how critical its succeeding edge is. In Fig. 4a, for example, $C(N_1) = C(E_B) = 1.0$ and $C(N_3) = C(E_G) = 0.3$.

Rule 3. The criticality of an edge in an execution scenario is the product of its dominance probability and the sum of the criticalities of its succeeding BCs.

Unlike nodes, an edge may have one or many succeeding BCs. Hence, for an edge to be critical, it has to determine the start times of its succeeding BC(s) and its succeeding BC(s) have to be critical. For example, in Fig. 4a, there is $C(E_B) = D(E_B) \times (C(E_C) + C(E_D) + C(E_E)) = 1.0 \times (0.5 + 0.3 + 0.2) = 1.0$ and $C(E_I) = D(E_I) \times C(N_5) = 1.0 \times 0.2 = 0.2$.
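Rules 2 and 3 describe a backward propagation that can be written as a short recursion over the scenario graph. The sketch below is an illustration only: the successor map for $es_1$ is reconstructed from the products listed under Rule 1, and the dominance probabilities of the three merging edges are inferred from the reported path criticalities (all other BCs have dominance 1.0). Both are assumptions rather than data read from Fig. 4a.

```python
import math
from functools import lru_cache

# Assumed successor map of execution scenario es1 (nodes N*, edges E*).
SUCC = {
    "N0": ["EA"], "EA": ["N1"], "N1": ["EB"], "EB": ["EC", "ED", "EE"],
    "EC": ["N2"], "N2": ["EF"], "EF": ["EN"],
    "ED": ["N3"], "N3": ["EG"], "EG": ["EN"],
    "EE": ["N4"], "N4": ["EI"], "EI": ["N5"], "N5": ["EK"], "EK": ["EN"],
    "EN": ["N7"], "N7": ["EO"], "EO": ["N8"], "N8": [],
}
# Inferred dominance probabilities of the merging edges; every other BC has 1.0.
DOMINANCE = {"EF": 0.5, "EG": 0.3, "EK": 0.2}

@lru_cache(maxsize=None)
def criticality(bc):
    """Backward propagation of criticality from the exit node."""
    successors = SUCC[bc]
    if not successors:
        return 1.0                               # exit node is always critical
    downstream = sum(criticality(s) for s in successors)
    if bc.startswith("N"):
        return downstream                        # Rule 2: node = its succeeding edge
    return DOMINANCE.get(bc, 1.0) * downstream   # Rule 3: edge

crit = {bc: criticality(bc) for bc in SUCC}
```

The recursion reproduces the worked values in the text, e.g. $C(E_B) = 1.0$, $C(E_I) = 0.2$ and $C(N_3) = 0.3$.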

Now, the criticality of an execution path $EP_i$ or a BC $S_i$ in the service composition can be computed by a weighted average over their criticalities obtained in all the execution scenarios using the execution scenarios' execution probabilities as weights:

$$C(EP_i) = \sum_{EP_i \in es_k} ep(es_k) \times C_{es_k}(EP_i), \tag{26}$$

$$C(S_i) = \sum_{S_i \in es_k} ep(es_k) \times C_{es_k}(S_i), \tag{27}$$

where $ep(es_k)$ is the execution probability of $es_k$; $C_{es_k}(EP_i)$ and $C_{es_k}(S_i)$ are the criticalities of $EP_i$ and $S_i$ in $es_k$.

Suppose $ep(es_1) = 0.6$ and $ep(es_2) = 0.4$ in Fig. 4, based on formulas (26) and (27), there are:

$$C(EP_1) = ep(es_1) \times C_{es_1}(EP_1) + ep(es_2) \times C_{es_2}(EP_1) = 0.6 \times 0.5 + 0.4 \times 0.4 = 0.46,$$

$$C(EP_2) = ep(es_1) \times C_{es_1}(EP_2) + ep(es_2) \times C_{es_2}(EP_2) = 0.6 \times 0.3 + 0.4 \times 0.2 = 0.26,$$

$$C(EP_3) = ep(es_1) \times C_{es_1}(EP_3) = 0.6 \times 0.2 = 0.12,$$

$$C(EP_4) = ep(es_2) \times C_{es_2}(EP_4) = 0.4 \times 0.4 = 0.16.$$
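The weighted average of formulas (26) and (27) is straightforward to reproduce. The sketch below uses the scenario probabilities and per-scenario path criticalities quoted above; only the function and variable names are made up for illustration.

```python
import math

def overall_criticality(scenario_prob, crit_by_scenario):
    """Formulas (26)/(27): weight per-scenario criticalities by ep(es_k)."""
    total = {}
    for scenario, prob in scenario_prob.items():
        for item, crit in crit_by_scenario[scenario].items():
            total[item] = total.get(item, 0.0) + prob * crit
    return total

scenario_prob = {"es1": 0.6, "es2": 0.4}
path_crit = {
    "es1": {"EP1": 0.5, "EP2": 0.3, "EP3": 0.2},
    "es2": {"EP1": 0.4, "EP2": 0.2, "EP4": 0.4},
}
C = overall_criticality(scenario_prob, path_crit)
```

The resulting path criticalities sum to one, as expected when exactly one execution path is critical for any given request.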

5 VALUE OF MONITORING

Monitoring provides runtime information about the status of the BCs of an SBS, which is required to facilitate timely detection and prediction of runtime anomalies. By fixing those anomalies in a timely manner, the quality of the SBS can be guaranteed, realising the benefit of monitoring. However, monitoring incurs resource and system costs.

Different monitoring strategies incur different costs. Sometimes, the benefit of monitoring is not even worth the monitoring cost [12]. Monitoring strategy formulation requires a tradeoff between the benefit and cost of monitoring. In monitoring strategy formulation for SBSs, the management of this tradeoff is challenging and complicated as multiple BCs and different monitoring parameters are involved.

In this section, we introduce the concept and calculation of value of monitoring (VOM) for monitoring strategy evaluation. Here, the VOM shares its concepts with the value of perfect information (VPI) [53] and the value of changed information (VOC) [26], which also attempt to decide whether something is necessary and worthwhile to a particular process, as explained in [25].

Monitoring strategy formulation for an SBS requires the determination of the monitoring parameters for each of its BCs. Common monitoring parameters include, but are not limited to:

1. Number of monitors: how many monitors are allocated to the BC. Different monitors might be required to serve different purposes, e.g., monitoring different QoS dimensions. Usually, the more monitors allocated to a BC, the higher the monitoring cost.

2. Monitoring frequency: how often the status of the BC is checked. Higher monitoring frequency facilitates more timely detection of runtime anomalies, but usually incurs higher monitoring cost. For example, as discussed in Section 2, frequent packet interception usually produces excessive network traffic overheads, leading to quality degradation in the video stream.

3. Monitoring granularity: at what granularity the status of the BC is checked. Finer-grained monitoring guarantees higher accuracy of monitoring, but often leads to higher monitoring cost than coarser-grained monitoring. For example, as introduced in Section 2, deep parsing of the video stream jeopardises the video quality more than simple parsing.


Monitoring strategy formulation does not always require all three monitoring parameters to be determined for each BC. The monitoring parameters to be determined for a BC depend on the nature of the BC and the monitor(s) allocated to the BC. In this paper, the discussion is based on the above three parameters. The monitoring parameters introduced elsewhere in the literature can be generalised and incorporated as additional dimensions in our VOM evaluation method. In this research, we assume that the monitors are generic and do not consider the types of monitors.

We introduce the notion of VOM to estimate the value of a monitoring strategy:

$$VOM(S \to S' \mid \pi(\alpha, \beta, \gamma)) = b(S \to S' \mid \pi(\alpha, \beta, \gamma)) - c(S \to S' \mid \pi(\alpha, \beta, \gamma)) = b(S \to S' \mid \pi(\alpha, \beta, \gamma)) - (c_{rc}(\pi(\alpha, \beta, \gamma)) + c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))), \tag{28}$$

where $VOM(S \to S' \mid \pi(\alpha, \beta, \gamma))$ represents the value brought to $S'$ (see below for the explanation of $S'$) by monitoring $S$ with monitoring strategy $\pi(\alpha, \beta, \gamma)$; $\alpha$, $\beta$ and $\gamma$ specify the number of monitors, monitoring frequency and monitoring granularity respectively; $b(S \to S' \mid \pi(\alpha, \beta, \gamma))$ is the monitoring benefit function that computes the benefit brought to $S'$ by monitoring $S$ with $\pi(\alpha, \beta, \gamma)$; $c(S \to S' \mid \pi(\alpha, \beta, \gamma))$ is the monitoring cost function that computes the monitoring cost brought to $S'$ by monitoring $S$ with $\pi(\alpha, \beta, \gamma)$; $c_{rc}(\pi(\alpha, \beta, \gamma))$ is the function that computes the resource cost of $\pi(\alpha, \beta, \gamma)$; and $c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))$ is the function that computes the system cost imposed on $S'$ by $\pi(\alpha, \beta, \gamma)$.

The implementation of $b(S \to S' \mid \pi(\alpha, \beta, \gamma))$ and $c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))$ relies on what $S$ and $S'$ represent. For purposes of obtaining locally and globally optimal monitoring strategies (see below), in particular, we are interested in the cases where $S$ is a BC and $S'$ is one of the following two options:

1. $S'$ being the same BC as $S$, i.e., $S = S'$. Given a BC $S_i$, $b(S_i \to S_i \mid \pi(\alpha, \beta, \gamma))$ and $c_{sc}(S_i \to S_i \mid \pi(\alpha, \beta, \gamma))$ compute the monitoring benefit and system cost brought to $S_i$ by monitoring itself. The VOM function $VOM(S_i \to S_i \mid \pi(\alpha, \beta, \gamma))$ is named the local VOM function.

2. $S'$ being the SBS. Given a BC $S_i$ that belongs to SBS $S$, $b(S_i \to S \mid \pi(\alpha, \beta, \gamma))$ and $c_{sc}(S_i \to S \mid \pi(\alpha, \beta, \gamma))$ compute the monitoring benefit and system cost brought to $S$ by monitoring $S_i$. The VOM function $VOM(S_i \to S \mid \pi(\alpha, \beta, \gamma))$ is named the global VOM function.

Functions $b(S \to S' \mid \pi(\alpha, \beta, \gamma))$, $c_{rc}(\pi(\alpha, \beta, \gamma))$ and $c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))$ are domain-specifically implemented based on the quality requirements and preferences for different quality dimensions, and the selected method for monitoring strategy formulation. Functions $b(S \to S' \mid \pi(\alpha, \beta, \gamma))$ and $c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))$ can be implemented to compute an overall score for monitoring strategy $\pi(\alpha, \beta, \gamma)$ based on multiple quality dimensions. A straightforward way is to apply a multiple criteria decision making (MCDM) technique [37] based on the weights assigned to each quality dimension. Functions $b(S \to S' \mid \pi(\alpha, \beta, \gamma))$ and $c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))$ can also be implemented to compute the benefit and cost of monitoring incurred by implementing $\pi(\alpha, \beta, \gamma)$ based on an individual quality dimension. For example, if the response time is the only quality dimension being considered, functions $b(S \to S' \mid \pi(\alpha, \beta, \gamma))$ and $c_{sc}(S \to S' \mid \pi(\alpha, \beta, \gamma))$ can be implemented to compute the benefit and system cost of monitoring based on the decrease and increase in the response time of $S'$ as a result of monitoring $S$ with $\pi(\alpha, \beta, \gamma)$.

A monitoring strategy is worth implementing only when $VOM(S \to S' \mid \pi(\alpha, \beta, \gamma)) > 0$. By comparing the estimated benefit and cost, monitoring strategies can be evaluated, providing the basis for the monitoring strategy formulation discussed in Section 6.
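To make formula (28) concrete, the sketch below evaluates two candidate strategies for one BC. The benefit and cost functions are entirely made up for illustration (the paper leaves them domain-specific), as are the class name `Strategy` and the numeric coefficients; only the structure, benefit minus resource cost minus system cost, follows formula (28).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    alpha: int     # number of monitors
    beta: float    # monitoring frequency (checks per second, assumed unit)
    gamma: int     # monitoring granularity level (1 = coarse, assumed scale)

def vom(strategy, benefit, resource_cost, system_cost):
    """Formula (28): VOM = b - (c_rc + c_sc) for a given strategy."""
    return benefit(strategy) - (resource_cost(strategy) + system_cost(strategy))

# Hypothetical domain-specific functions, all in the same (arbitrary) unit.
def benefit(s):
    return 30.0 * min(s.beta, 2.0)        # benefit saturates at high frequency

def resource_cost(s):
    return 5.0 * s.alpha                  # each monitor consumes resources

def system_cost(s):
    return 8.0 * s.beta * s.gamma         # overhead imposed on the monitored system

light = Strategy(alpha=1, beta=1.0, gamma=1)
heavy = Strategy(alpha=2, beta=4.0, gamma=2)
vom_light = vom(light, benefit, resource_cost, system_cost)
vom_heavy = vom(heavy, benefit, resource_cost, system_cost)
```

With these invented functions the light strategy has a positive VOM and is worth implementing, whereas the heavy strategy costs more than it returns.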

6 MONITORING STRATEGY FORMULATION

To formulate the global monitoring strategy for an SBS, we need to determine the local monitoring strategies for all the BCs of the SBS. The optimal global monitoring strategy is the one that maximises the total VOM while keeping the resource cost within budget (i.e., the total monitoring resources available). In Sections 6.1 and 6.2, we present two methods for monitoring strategy formulation for SBSs, namely local optimisation and global optimisation. Both methods consider the budget, the criticalities of the execution paths and BCs of the SBS, and the VOM, and yet formulate optimal global monitoring strategies by different means. In Section 6.3, we compare the two methods.

6.1 Local Optimisation

Prior to the monitoring strategy formulation for individual BCs, the budget needs to be allocated to the execution paths of the SBS pro rata according to their criticalities. More budget is allocated to the execution paths with higher criticalities. The sum of the resource costs for all the BCs on an execution path must not exceed the budget allocated to the execution path. Taking Fig. 1 as an example, based on the calculation presented at the end of Section 4.3, 46 percent of the total budget will be allocated to $EP_1$, 26 percent to $EP_2$, 12 percent to $EP_3$ and 16 percent to $EP_4$. When formulating the local monitoring strategy for an individual BC, the resource cost must not exceed the remainder of the budget allocated to the execution path that the BC belongs to. A BC may belong to multiple execution paths, i.e., there might be multiple execution paths going through one BC. To address this issue, we adopt the following approach to determine the budget allowed for a BC.

1. If a BC $S_i$ belongs to only one execution path $EP_j$, the budget allowed for $S_i$ is the remaining budget for $EP_j$.

2. If a BC $S_i$ belongs to multiple execution paths $EP_1, EP_2, \ldots, EP_n$, the budget allowed for $S_i$ is the sum of the remaining budgets for $EP_1, EP_2, \ldots, EP_n$.
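The pro rata allocation and the two budget rules above can be sketched as follows. The total budget of 100 units and the partial path memberships are assumptions chosen for illustration; the path criticalities are those computed at the end of Section 4.3.

```python
import math

def allocate_budget(total_budget, path_criticality):
    """Split the budget across execution paths in proportion to criticality."""
    total_crit = sum(path_criticality.values())
    return {ep: total_budget * c / total_crit for ep, c in path_criticality.items()}

def budget_allowed(bc, remaining, paths):
    """Budget allowed for a BC: sum of remaining budgets of all paths through it."""
    return sum(remaining[ep] for ep, bcs in paths.items() if bc in bcs)

path_criticality = {"EP1": 0.46, "EP2": 0.26, "EP3": 0.12, "EP4": 0.16}
remaining = allocate_budget(100.0, path_criticality)

# Assumed (partial) membership of BCs in execution paths.
paths = {
    "EP1": {"EA", "EC"},
    "EP2": {"EA", "ED"},
    "EP3": {"EA", "EE", "EI"},
    "EP4": {"EA", "EE", "EJ"},
}
allowed_EC = budget_allowed("EC", remaining, paths)   # on EP1 only
allowed_EE = budget_allowed("EE", remaining, paths)   # on EP3 and EP4
allowed_EA = budget_allowed("EA", remaining, paths)   # on every path
```

A BC that lies on every execution path, such as $E_A$ here, may draw on the whole remaining budget, while a BC on a single path is limited to that path's share.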

The monitoring strategies for individual BCs on an execution path are formulated one by one in the descending order of their criticalities. Taking Fig. 4c as an example, the order is:

1. $\{N_1, N_7, E_A, E_B, E_N, E_O\}$;
2. $\{N_2, E_C, E_F\}$;
3. $\{N_4, E_E, E_M, E_H\}$;
4. $\{N_3, E_D, E_G\}$;
5. $\{N_6, E_J, E_L\}$;
6. $\{N_5, E_I, E_K\}$.

The BCs within the same set, e.g., $N_2$, $E_C$ and $E_F$, have equivalent criticalities, and their local monitoring strategies are formulated in a random order. The monitoring strategy formulation by local optimisation completes when the total budget is depleted or the local monitoring strategies are determined for all BCs.

To determine the local monitoring strategy for a specific BC, the VOMs of all possible local monitoring strategies for the BC need to be computed by adopting the method introduced in Section 5. Only those local monitoring strategies that generate positive VOM are valid. If no such strategy exists, no monitors will be allocated to the BC. If there are multiple valid monitoring strategies, the one with the maximum VOM is selected. Constraints can be imposed on the formulation of local monitoring strategies, e.g., the value of monitoring a specific BC must not be lower than a certain threshold. The local monitoring strategies that cannot meet the constraints are considered invalid.
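The local optimisation procedure can be sketched as a greedy loop. This is a simplified illustration, not CriMon itself: it uses a single shared budget instead of per-path budgets, and the candidate strategies with their VOM and resource cost values are invented.

```python
def local_optimisation(bcs, candidates, budget):
    """Greedy local optimisation over BCs in descending order of criticality.

    bcs: list of (name, criticality).
    candidates: {name: [(label, vom, resource_cost), ...]}.
    Returns the chosen strategy label per BC (None = no monitoring) and the
    budget left over.
    """
    plan = {}
    for name, _ in sorted(bcs, key=lambda item: item[1], reverse=True):
        # Valid strategies have positive VOM and fit the remaining budget.
        valid = [c for c in candidates.get(name, []) if c[1] > 0 and c[2] <= budget]
        if not valid:
            plan[name] = None
            continue
        label, _, cost = max(valid, key=lambda c: c[1])   # maximum VOM wins
        plan[name] = label
        budget -= cost
    return plan, budget

# Hypothetical BCs, criticalities and candidate strategies.
bcs = [("N3", 0.26), ("N2", 0.46), ("N5", 0.12)]
candidates = {
    "N2": [("light", 5.0, 3.0), ("heavy", 9.0, 6.0)],
    "N3": [("light", 4.0, 3.0), ("heavy", 7.0, 6.0)],
    "N5": [("light", -1.0, 1.0)],
}
plan, leftover = local_optimisation(bcs, candidates, budget=10.0)
```

In this example the most critical BC takes the expensive strategy first, which leaves the next BC with only enough budget for the cheaper one; the third BC has no strategy with positive VOM and is left unmonitored.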

6.2 Global Optimisation

In the local optimisation approach, the local monitoring strategies are formulated for each BC individually. Although the monitoring strategies are locally optimised, they might be suboptimal for the SBS as a whole. Finding the optimal global monitoring strategy for the SBS is more complex but often necessary.

An intuitive way to formulate the optimal global monitoring strategy is to search all the possible local monitoring strategies for each BC and use those local monitoring strategies to enumerate all the possible global monitoring strategies. The one with the maximum positive total VOM is the optimal global monitoring strategy for the SBS. This exhaustive search method is naive and leads to the obvious problem of high computational complexity. Assume there are $n$ BCs and an average of $m$ possible local monitoring strategies for each BC. The total number of possible global monitoring strategies is $m^n$. That is, the complexity of an exhaustive search method is $O(m^n)$ in general under the assumption that the evaluation of a local monitoring strategy runs in linear time, making the exhaustive search method impractical.

Inherently, the formulation of the optimal global monitoring strategy is equivalent to a multidimensional multiple-choice knapsack problem (MMKP) [3], an NP-hard knapsack problem where the local monitoring strategies are classified in groups. Multiple constraints are imposed on the SBS and the selection of local monitoring strategies. Let there be $n$ groups of local monitoring strategies corresponding to $n$ BCs, Group $i$ containing $l_i$ local monitoring strategies. Each local monitoring strategy has a particular VOM. The objective of the MMKP is to pick exactly one local monitoring strategy from each group to maximise the total VOM of the selected items, subject to the constraints on the SBS and the selection of local monitoring strategies.

CriMon models this MMKP as an integer programming (IP) problem [62]. IP aims at maximising (or minimising) the value of an objective function by adjusting the values of a set of variables while enforcing certain constraints. By solving the IP problem, the maximum (or minimum) value of the objective function and the corresponding values of the variables can be obtained. The problem of finding the optimal global monitoring strategy for an SBS is turned into an IP problem as follows. For a possible local monitoring strategy $\pi_{ij}(\alpha, \beta, \gamma)$ for a BC $S_i$, an integer variable $x_{ij}$ is 1 if this local monitoring strategy is selected, or 0 otherwise. Given an SBS $S$ consisting of $n$ BCs ($S_1, S_2, \ldots, S_n \in S$), let $P_i$ be the set of possible local monitoring strategies for $S_i$, the objective function for the IP problem is formulated as follows:

$$\text{maximise} \sum_{i \in S} \sum_{j \in P_i} x_{ij} \times VOM(S_i \to S \mid \pi_{ij}(\alpha_{ij}, \beta_{ij}, \gamma_{ij})), \tag{29}$$

where $\pi_{ij}(\alpha_{ij}, \beta_{ij}, \gamma_{ij})$ is the $j$th possible local monitoring strategy for $S_i$.

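The IP problem has the following constraints:

$$\sum_{j \in P_i} x_{ij} = 1, \quad \forall i \in S, \tag{30}$$

$$\sum_{i \in S} \sum_{j \in P_i} x_{ij} \times c_{rc}(\pi_{ij}(\alpha_{ij}, \beta_{ij}, \gamma_{ij})) \leq budget_{total}, \tag{31}$$

$$\sum_{S_k \in EP_i} \sum_{j \in P_k} x_{kj} \times c_{rc}(\pi_{kj}(\alpha_{kj}, \beta_{kj}, \gamma_{kj})) \leq budget_{EP_i}, \quad \forall EP_i \in S, \tag{32}$$

$$\sum_{i \in S} \sum_{j \in P_i} x_{ij} \times \alpha_{ij} \leq \alpha_{max}, \tag{33}$$

$$\alpha_i^{min} \leq \alpha_{ij} \leq \alpha_i^{max}, \quad \forall i \in S, j \in P_i, \tag{34}$$

$$\beta_i^{min} \leq \beta_{ij} \leq \beta_i^{max}, \quad \forall i \in S, j \in P_i, \tag{35}$$

$$\gamma_i^{min} \leq \gamma_{ij} \leq \gamma_i^{max}, \quad \forall i \in S, j \in P_i, \tag{36}$$

$$VOM(S_i \to S \mid \pi_{ij}(\alpha_{ij}, \beta_{ij}, \gamma_{ij})) > 0, \quad \forall i \in S, j \in P_i. \tag{37}$$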

Constraints family (30) guarantees that exactly one localmonitoring strategy is selected for each BC (note that “NoMonitoring” is one of the possible local monitoring strate-gies, e.g., when no applicable monitors are available forsome BCs). For example, assume that there are 10 possiblelocal monitoring strategies for the video conversion serviceN2 in Fig. 1. Since only one of these strategies can beselected for N2, there is

P10j¼1 x2j ¼ 1. Constraint (31)

ensures that the total cost of monitoring all the BCs does notexceed the total budget for monitoring the SBS. Functioncrcðpijðaij;bij; gijÞÞ calculates the resource cost needed toimplement the jth local monitoring strategy for BC Si. Con-straints family (32) further states the budget limit for indi-vidual execution paths. Given any execution path EPi 2,the total cost needed to implement the local monitoringstrategies for all the BCs on EPi must not exceed the budgetallocated to EPi. Constraints family (33) ensures that the

472 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 40, NO. 5, MAY 2014


number of all the monitors to be allocated must not exceed the number of available monitors. Constraints families (34)-(36) express the upper and lower bounds for the number of monitors (denoted by $\alpha$), monitoring frequency (denoted by $\beta$) and monitoring granularity (denoted by $\gamma$) of every local monitoring strategy. Constraints on other monitoring parameters (if any) can be included in the IP as additional constraints, in a form similar to constraints (34)-(36). Constraints family (37) ensures that only the local monitoring strategies that generate more benefit than they cost are taken into consideration.

Such an IP problem can be solved using many tools, e.g., CPLEX and AIMMS. By solving the IP problem, the optimal global monitoring strategy that generates the maximum total VOM can be obtained. The values of $\alpha$, $\beta$ and $\gamma$ of the local monitoring strategy for each BC can also be determined. Based on the results, monitors can be allocated and configured for cost-effective monitoring.
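To make the structure of the IP concrete, the following minimal sketch solves a toy instance by exhaustive enumeration rather than with an IP solver such as CPLEX. It covers only the objective (29) and constraints (30), (31) and (33); the per-path budgets (32), the parameter bounds (34)-(36) and the filter (37) are omitted for brevity. All names and numbers (`candidates`, `BUDGET_TOTAL`, `ALPHA_MAX`, the costs and VOM values) are hypothetical and chosen for illustration only.

```python
from itertools import product

# Hypothetical candidate local monitoring strategies per BC, each given as
# (alpha = number of monitors, resource cost, global VOM). Index 0 is "No Monitoring".
candidates = {
    "S1": [(0, 0.0, 0.0), (1, 2.0, 3.0), (2, 4.0, 5.0)],
    "S2": [(0, 0.0, 0.0), (1, 3.0, 4.0), (2, 5.0, 4.5)],
    "S3": [(0, 0.0, 0.0), (1, 1.0, 1.5)],
}
BUDGET_TOTAL = 6.0  # assumed right-hand side of constraint (31)
ALPHA_MAX = 3       # assumed right-hand side of constraint (33)


def solve_by_enumeration(candidates, budget_total, alpha_max):
    """Pick exactly one strategy per BC (30) and maximise the total VOM (29)."""
    names = list(candidates)
    best_value, best_choice = float("-inf"), None
    # Each tuple in the product fixes one strategy index j for every BC i,
    # which corresponds to setting x_ij = 1 for that j and 0 otherwise.
    for choice in product(*(range(len(candidates[n])) for n in names)):
        picked = [candidates[n][j] for n, j in zip(names, choice)]
        monitors = sum(p[0] for p in picked)
        cost = sum(p[1] for p in picked)
        value = sum(p[2] for p in picked)
        feasible = cost <= budget_total and monitors <= alpha_max
        if feasible and value > best_value:
            best_value, best_choice = value, dict(zip(names, choice))
    return best_value, best_choice


best_value, best_choice = solve_by_enumeration(candidates, BUDGET_TOTAL, ALPHA_MAX)
print(best_value, best_choice)
```

Enumeration grows exponentially with the number of BCs, so it is only suitable for illustrating the model; a dedicated IP solver is needed at realistic scales.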

6.3 Comparison between Local and Global Optimisations

In this section, we compare the local and global optimisation methods in terms of their effectiveness and computational overheads.

In the local optimisation method, the VOM is locally optimised for individual BCs. Local monitoring strategies are formulated for individual BCs without taking into account other BCs. The local VOM function $VOM(S_i \to S_i \mid p(\alpha, \beta, \gamma))$ is used to evaluate the local monitoring strategies. In the global optimisation method, the VOM is optimised for the SBS. The VOM of a local monitoring strategy is estimated using the global VOM function $VOM(S_i \to S \mid p(\alpha, \beta, \gamma))$. The global optimisation method is more effective than the local optimisation method in terms of global VOM maximisation. For example, consider two component services $N_i$ and $N_j$ that are executed in parallel and need to synchronise upon completion, and assume that $N_i$ usually takes considerably more time than $N_j$. When formulating the monitoring strategy for $N_j$, the local optimisation method would simply try to minimise the monitoring-induced decrease in $N_j$’s response time. However, this is not worthwhile if $N_i$ still takes more time than $N_j$ to complete. It is preferable to formulate a local monitoring strategy for $N_j$ that allows $N_j$’s response time to increase reasonably while better guaranteeing $N_i$’s other QoS dimensions.

In the local optimisation method, constraints can be imposed on individual BCs. For example, it can be required that the value of monitoring a BC $S_j$ must not be lower than a certain threshold value $v_j$ (i.e., $VOM(S_j \to S_j \mid p_j(\alpha_j, \beta_j, \gamma_j)) \ge v_j$). However, such constraints cannot be imposed on service combinations, e.g., the SBS. The global optimisation method overcomes this shortcoming by allowing constraints to be imposed on service combinations. For example, using the global optimisation method, it is possible to require that the total monitoring cost must not exceed a certain level (i.e., $\sum_{i \in S} \sum_{j \in P_i} x_{ij} \cdot c(p_{ij}(\alpha_{ij}, \beta_{ij}, \gamma_{ij})) \le o_t$).

The benefits of the global optimisation method come with higher computational overhead than the local optimisation method. The global optimisation method turns the global monitoring strategy formulation into an NP-hard IP problem. An experimental comparison between the local optimisation method and the global optimisation method is given in Section 7.4.

In scenarios where the budget allows monitoring the entire SBS, there is no need to allocate the budget to different execution paths and no constraints are imposed on the resource costs of the local monitoring strategies. The objectives of both local optimisation and global optimisation turn into maximising the VOM for every single BC of the SBS. The monitoring strategy formulation completes when the local monitoring strategies are formulated for all BCs. In such cases, the local optimisation method is equivalent to the global optimisation method in terms of VOM maximisation. As a result, the local optimisation is preferable as it outperforms the global optimisation in terms of computational overhead (see also Section 7.4).
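The equivalence in the unconstrained case, and the divergence once a budget binds, can be illustrated with a small sketch. As a simplifying assumption, each candidate strategy carries a single VOM value that both methods use, so the distinction between the local and global VOM functions is ignored here. The candidate data and the function names `local_optimum` and `global_optimum` are hypothetical.

```python
from itertools import product

# Hypothetical (resource cost, VOM) pairs for each BC's candidate strategies.
candidates = {
    "S1": [(0.0, 0.0), (2.0, 3.0), (4.0, 5.0)],
    "S2": [(0.0, 0.0), (3.0, 4.0), (5.0, 4.5)],
}


def local_optimum(candidates):
    """Choose the highest-VOM strategy for each BC independently of the others."""
    return {
        bc: max(range(len(opts)), key=lambda j: opts[j][1])
        for bc, opts in candidates.items()
    }


def global_optimum(candidates, budget):
    """Enumerate all joint choices and keep the best total VOM within the budget."""
    names = list(candidates)
    best = None
    for choice in product(*(range(len(candidates[n])) for n in names)):
        cost = sum(candidates[n][j][0] for n, j in zip(names, choice))
        value = sum(candidates[n][j][1] for n, j in zip(names, choice))
        if cost <= budget and (best is None or value > best[0]):
            best = (value, dict(zip(names, choice)))
    return best[1]


print(local_optimum(candidates))          # per-BC best strategies
print(global_optimum(candidates, 100.0))  # ample budget: same selection
print(global_optimum(candidates, 7.0))    # binding budget: selection changes
```

With an ample budget the joint search returns the per-BC maxima, so the cheaper local method suffices; with a budget of 7.0 the per-BC maxima are no longer jointly affordable and only the global search accounts for that.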

7 EXPERIMENTAL EVALUATION

We have conducted a range of experiments in a simulated volatile environment, aiming at evaluating CriMon-based monitoring for SBSs in volatile environments. This section presents the experimental results. Section 7.1 describes the prototype used in the experiments. Section 7.2 describes the setup of the experiments. Section 7.3 evaluates the effectiveness of CriMon in improving system response time and optimising VOM. Finally, Section 7.4 evaluates the efficiency of CriMon measured by its computational overhead.

7.1 Prototype Implementation

To evaluate CriMon, we have developed a prototype tool that implements the CriMon process presented in Fig. 5 in Java using JDK 1.6.0 and Eclipse Java EE IDE. The inputs of the prototype tool include: the functional specification of an SBS, the timing properties of the BCs of the SBS, the budget for monitoring and the set of monitoring constraints. UML activity diagrams are used to describe the business process of the SBS. All the operations of the CriMon process, i.e., operations A-E in Fig. 5, are performed by the prototype tool automatically. Given the functional specification of an SBS and the timing properties of the BCs of the SBS, the prototype tool first peels the loops (if there are any) in the SBS and calculates the criticalities of the execution paths and BCs of the SBS. Then, given a budget for monitoring and a set of monitoring constraints, it formulates the optimal monitoring strategy as output by the local optimisation or global optimisation method. The monitoring strategy specifies the monitoring configuration for each BC of the SBS, including the number of monitors, monitoring frequency and monitoring granularity. For solving the IP-based optimisation problem introduced in Section 6.2, the prototype tool uses CPLEX v12.2, an integer programming solver.

7.2 Experiment Setup

We have conducted two series of experiments to evaluate CriMon. As discussed in the introduction section, response time management is the basis of managing other quality dimensions of SBSs. Thus, in the first series of experiments, we focus on two aspects: 1) the effectiveness (measured by response time improvement) of CriMon-based monitoring;

HE ET AL.: FORMULATING COST-EFFECTIVE MONITORING STRATEGIES FOR SERVICE-BASED SYSTEMS 473


and 2) the efficiency of CriMon (measured by computational overhead) in calculating dominance and criticalities.

In the first series of experiments, the evaluation process mimicked the example SBS presented in Section 2. The response times of the BCs were generated according to different normal distributions based on a publicly available web service dataset, QWS. QWS comprises measurements of nine QoS attributes (including response time) of over 2,500 real-world web services. The information about the services was collected from public UDDI registries, search engines and service portals. Their QoS values were measured using benchmark tools. More details about QWS can be found in [4].

During the execution of OnlineLive, we generated a number of anomalies based on a fault rate and randomly introduced the anomalies to the BCs to simulate volatile environments. We increased the fault rate from 10 percent to 40 percent in steps of 10 percent to simulate increasing levels of volatility in the environment. When anomalies occurred to unmonitored BCs, delays that were randomly generated based on normal distributions were applied to the corresponding BCs. When a BC is being monitored, runtime anomalies may still cause delays because monitoring does not guarantee immediate diagnosis of the runtime anomalies [61]. There are many uncertain factors that may impact on the actual delay caused to an SBS by an anomaly, e.g., the time taken to diagnose the anomaly and the difficulty of fixing the anomaly. To simplify the application of delays in the experiments, we assumed that if a BC was being monitored, runtime anomalies that occurred to the BC could be detected or predicted and the adaptation actions taken in time to fix the anomaly. Thus, the delay caused by the anomaly would be avoided or compensated. That is, we focus only on the “time saving” effects from early anomaly detection or prediction as enabled by monitoring.
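The anomaly injection described above can be sketched as follows. This is a minimal illustration, not the prototype's code: the function names, the delay distribution parameters, and the assumption of a purely sequential composition (so that the end-to-end time is the sum of the BC times) are all hypothetical simplifications.

```python
import random


def simulate_run(base_times, monitored, fault_rate, delay_mean, delay_sd, rng):
    """Return the per-BC response times of one simulated execution."""
    times = {}
    for bc, base in base_times.items():
        delay = 0.0
        # An anomaly hits a BC with probability fault_rate; following the
        # simplifying assumption in the text, it only delays unmonitored BCs.
        if rng.random() < fault_rate and bc not in monitored:
            delay = max(0.0, rng.gauss(delay_mean, delay_sd))
        times[bc] = base + delay
    return times


def average_response_time(base_times, monitored, fault_rate, runs=1000, seed=0):
    """Average end-to-end time over many runs, assuming a sequential SBS."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        run = simulate_run(base_times, monitored, fault_rate, 5.0, 1.0, rng)
        total += sum(run.values())
    return total / runs


base_times = {"N1": 1.0, "N2": 2.0}  # hypothetical anomaly-free response times
print(average_response_time(base_times, set(), 0.4))           # no monitoring
print(average_response_time(base_times, {"N1", "N2"}, 0.4))    # full monitoring
```

Fixing the seed makes runs repeatable, which helps when comparing monitoring sets under the same fault rate.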

Three sets of experiments were conducted in each volatile environment. In each set of experiments, 1,000 OnlineLive instances were run and the response times of OnlineLive were averaged. In set #1, no monitors were allocated. In set #2, the criticalities of the execution paths and the BCs were not considered and the monitors were randomly allocated to the BCs. In set #3, the monitors were allocated according to the criticalities of the execution paths from high to low, first the execution path with the highest criticality, then the one with the second highest criticality, etc. Whenever the monitoring resources were not enough to cover an entire execution path, the BCs with the highest criticalities on that execution path were monitored first. Given the criticalities of the BCs as weights in a DAG, we adopt the method proposed in [21] for critical path enumeration, which runs in $O(m + n \log n + k)$ to find the $k$ longest execution paths in a service composition that consists of $n$ nodes and $m$ edges. In sets #2 and #3, the monitoring coverage, i.e., the maximum proportion of BCs that were monitored, was increased from 0 to 100 percent in steps of 10 percent to simulate scenarios with different levels of available monitoring resources. In this way, we are able to evaluate the cost-effectiveness of CriMon.
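The allocation rule used in set #3 can be sketched as a simple greedy procedure. The sketch below assumes the execution paths and all criticalities are already known and simply sorts them; it does not reproduce the path enumeration method of [21]. The function name `allocate_monitors` and the example criticality values are hypothetical.

```python
def allocate_monitors(paths, path_criticality, bc_criticality, max_monitored):
    """Greedy allocation: most critical path first, most critical BCs first."""
    monitored = []
    ordered_paths = sorted(paths, key=lambda p: path_criticality[p], reverse=True)
    for path in ordered_paths:
        ordered_bcs = sorted(paths[path], key=lambda b: bc_criticality[b], reverse=True)
        for bc in ordered_bcs:
            if len(monitored) >= max_monitored:
                return monitored  # monitoring resources exhausted
            if bc not in monitored:  # a BC shared by several paths is counted once
                monitored.append(bc)
    return monitored


# Hypothetical composition with two execution paths sharing N1 and N4.
paths = {"EP1": ["N1", "N2", "N4"], "EP2": ["N1", "N3", "N4"]}
path_criticality = {"EP1": 0.7, "EP2": 0.3}
bc_criticality = {"N1": 1.0, "N2": 0.7, "N3": 0.3, "N4": 1.0}

# A coverage of 75 percent of the four BCs allows three monitored BCs.
print(allocate_monitors(paths, path_criticality, bc_criticality, 3))
```

Raising `max_monitored` step by step corresponds to increasing the monitoring coverage in the experiments.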

In the experiments for evaluating the computational overhead of CriMon, we simulated SBSs that comprised different numbers of BCs. Then, we used the prototype tool to perform dominance probability calculation and criticality calculation. By comparing the time taken by the prototype tool to complete the calculation, we were able to evaluate the computational overhead of CriMon.

In the second series of experiments, we focused on the evaluation of the local and global optimisation methods for monitoring strategy formulation provided by CriMon with respect to VOM and computational overheads. In this series of experiments, we utilised randomly composed SBSs with different numbers of BCs and execution paths. We increased the number of BCs in the service composition from 10 to 100, then in steps of 100 to 1,000, and accordingly the number of execution paths from 1 to 10, then in steps of 10 to 100. As discussed in Section 5, the implementations of the monitoring benefit and cost functions, i.e., $b(S \to S' \mid p(\alpha, \beta, \gamma))$ and $c(p(\alpha, \beta, \gamma))$, are domain-specific. Hence, the calculation of VOM is also domain-specific. As discussed before, more monitoring resource makes it more likely to detect and predict runtime anomalies, increasing the benefit of monitoring. However, the increase would slow down at some point as the monitoring resource continues to increase. In addition, increase in monitoring resource leads to increase in resource cost and system cost. Thus, before reaching its highest value, the VOM is positively correlated with the three monitoring parameters, i.e., the number of monitors (denoted by $\alpha$), monitoring frequency

Fig. 5. CriMon process.


(denoted by $\beta$) and monitoring granularity (denoted by $\gamma$). After reaching its highest value, the VOM is negatively correlated with the three monitoring parameters. The balance between proper monitoring and excessive monitoring needs to be reflected in the experiment. Thus, we designed ternary quadratic functions to model the relationships between the VOM and the three monitoring parameters.
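The text does not give the ternary quadratic functions themselves, so the following is only one possible shape with the stated property: the VOM rises with each parameter up to a peak and falls beyond it. The function name `vom` and the `peak`, `weights` and `vmax` values are hypothetical.

```python
def vom(alpha, beta, gamma, peak=(4.0, 2.0, 3.0), weights=(1.0, 0.5, 0.8), vmax=20.0):
    """Concave quadratic in (alpha, beta, gamma) with maximum vmax at `peak`."""
    params = (alpha, beta, gamma)
    # Each parameter is penalised by its squared distance from the peak, so
    # both too little and too much monitoring reduce the value of monitoring.
    penalty = sum(w * (x - p) ** 2 for w, x, p in zip(weights, params, peak))
    return vmax - penalty


for alpha in range(1, 8):
    print(alpha, vom(alpha, 2.0, 3.0))  # rises until alpha = 4, then falls
```

A function of this form also takes negative values far from the peak, which matches the idea that excessive monitoring can cost more than it yields.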

The experiments were conducted on a machine with an AMD Athlon(tm) X4 640 3.000 GHz CPU and 8 GB RAM, running Windows 7 x64 Ultimate and Sun Microsystems JVM v1.6.0_22.

7.3 Effectiveness Evaluation

Fig. 6 shows the average response time of OnlineLive obtained in different volatile environments. As the fault rate increases, the average response time of OnlineLive increases because more anomalies cause a longer total delay to OnlineLive. Random monitoring (Set #2) and CriMon-based monitoring (Set #3) improved the response time of OnlineLive over no monitoring (Set #1) by an average of 17.87 and 27.80 percent respectively across all experimental cases. When the monitoring coverage exceeds 50 percent, the average response time of OnlineLive under CriMon-based monitoring remains approximately 26.5 seconds, which is the response time of OnlineLive when no anomaly occurs. Extra monitoring coverage does not further reduce the response time of OnlineLive. This implies that 50 percent monitoring coverage is what CriMon-based monitoring needs to detect all the runtime anomalies that directly impact on the response time of OnlineLive. The results presented in Fig. 6 also provide guidance on determining the budget for monitoring to meet different levels of requirements for the response time of OnlineLive. Taking Fig. 6d as an example, to ensure that the average response time of OnlineLive is shorter than 30 seconds, random monitoring requires at least 80 percent monitoring coverage while CriMon-based monitoring requires a monitoring coverage of only 40 percent. Similarly, given specific requirements for the response times of any SBS, guidance can be obtained by implementing monitoring strategies formulated by CriMon in simulated volatile operation environments. Based on the results, the required monitoring coverage can be estimated.

Fig. 7 compares the improvement in the response time of OnlineLive obtained by random monitoring and CriMon-based monitoring over no monitoring. The improvement obtained by CriMon-based monitoring is larger than random monitoring by 55.67 percent on average across all experimental cases. That means, given the same monitoring

Fig. 6. Average response time of OnlineLive.


coverage (or available monitoring resources), CriMon-based monitoring is on average 55.67 percent more cost-effective than random monitoring. As presented in Fig. 7, when the monitoring coverage is equal to or larger than 50 percent, the improvement obtained by CriMon-based monitoring is very similar to that obtained by random monitoring with a monitoring coverage of 100 percent. This observation confirms that CriMon-based monitoring requires only a monitoring coverage of 50 percent for the detection and prediction of all the runtime anomalies that directly impact on the response time of OnlineLive.

Furthermore, as indicated by the improvement margins of CriMon-based monitoring over random monitoring in Fig. 7, CriMon-based monitoring outperforms random monitoring by relatively large margins when the monitoring coverage is between 10 and 50 percent. Specifically, the average margin in this range is 105.20 percent, compared to 55.67 percent across all experimental cases. When the monitoring coverage is close to 100 percent, there is no difference between random monitoring and CriMon-based monitoring, as expected. This observation shows that CriMon-based monitoring achieves a particularly significant cost-effectiveness advantage over random monitoring when the monitoring coverage is relatively low. We conclude that the smaller the budget for monitoring, the more crucial and necessary it is to focus monitoring on the execution path(s) and BCs with high criticalities.

To demonstrate the change in effects of individual monitors as the monitoring coverage increases, we present in Fig. 8 the response time improvement per monitor obtained by random monitoring and CriMon-based monitoring. As shown, CriMon-based monitoring demonstrates a significant advantage over random monitoring in per-monitor response time improvement. This observation indicates much higher cost-effectiveness of CriMon-based monitoring than random monitoring. Moreover, the cost-effectiveness of CriMon, measured by its additional per-monitor response time improvement in percentage over random monitoring, decreases from 114.8 to 0 percent as the monitoring coverage increases from 10 to 100 percent. This observation confirms that CriMon is particularly cost-effective when the monitoring resources are relatively limited.

Figs. 9 and 10 compare the total VOM and VOM per BC obtained by the local and global optimisation methods respectively. Considering that VOM is domain-specific and there is no specific unit of VOM, the VOM obtained by the

Fig. 7. Response time improvement.


local optimisation method from the service composition with one execution path consisting of 10 BCs is used as the unit of measurement for VOM in Fig. 9. As presented in Fig. 9, the VOM obtained by both optimisation methods increases with the number of BCs. Specifically, as the number of BCs increases from 10 to 1,000, the VOM obtained by local optimisation increases from 1.00 to 22.49 and the VOM obtained by global optimisation from 4.11 to 38.49. Apparently, the global optimisation method achieves significantly higher VOM than the local optimisation method

Fig. 9. VOM comparison between local and global optimisation.

Fig. 10. VOM per BC comparison between local and global optimisation.

Fig. 8. Response time improvement per monitor.


irrespective of the size of the SBS. On average, the global optimisation method outperforms the local optimisation method by a margin of 123 percent. Fig. 10 compares the VOM per BC (total VOM divided by the number of BCs) obtained by local optimisation and global optimisation. The results show that the VOM per BC decreases for both local optimisation and global optimisation as the size of the SBS increases. However, the VOM per BC obtained by global optimisation decreases with the number of BCs much faster than local optimisation. In the case of 10 BCs, the advantage of global optimisation over local optimisation is approximately 311 percent. In the case of 1,000 BCs, it shrinks to 71 percent, which is still a large margin, but much lower compared to the case of 10 BCs. This observation raises a question: given the relatively small difference in VOM per BC between the local optimisation and the global optimisation, is it still worthwhile to run global optimisation since it is more complicated and takes more time than local optimisation to complete? This will be discussed in the next section with the experimental results on the efficiency of the local optimisation and global optimisation.
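The quoted margins can be reproduced from the total VOM values reported for Fig. 9. Because both methods are compared on the same number of BCs, the relative advantage in VOM per BC equals the relative advantage in total VOM. The helper name `relative_advantage` is introduced here for illustration only.

```python
def relative_advantage(global_vom, local_vom):
    """Advantage of global over local optimisation, in percent of the local VOM."""
    return (global_vom - local_vom) / local_vom * 100.0


print(round(relative_advantage(4.11, 1.00)))    # 10 BCs: about 311 percent
print(round(relative_advantage(38.49, 22.49)))  # 1,000 BCs: about 71 percent
```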

7.4 Efficiency Evaluation

The effectiveness of CriMon comes at a price, i.e., it has computational overhead. In the service-oriented and cloud environments, the compositions of SBSs can be very dynamic, and thus require fast formulation of corresponding monitoring strategies. In this section, we evaluate the computational overhead of CriMon measured by the time taken to perform the criticality evaluation and to formulate the monitoring strategies.

The computational overhead of criticality evaluation has two major components: the time consumption for dominance probability calculation and the time consumption for criticality calculation. As presented in Fig. 11, CriMon demonstrates a slow growth in computational overhead against the number of BCs to be processed, which shows high scalability. When there are 1,000 BCs involved in an SBS, it took CriMon roughly 1.0 second to finish the dominance calculation and the criticality calculation, which indicates high efficiency of CriMon, even in very large-scale scenarios.

Fig. 12 compares the computational overheads (in milliseconds) of the local optimisation and global optimisation methods in scenarios on different scales. Generally, the computational overheads of both methods increase with the number of BCs. In particular, the computational overhead of the global optimisation method increases from 18 milliseconds to 1,299 milliseconds as the number of BCs increases from 10 to 1,000, significantly faster than the local optimisation method (from 0.84 to 102 milliseconds on average). However, it is still acceptable in most, if not all, real-world scenarios. In the largest scenario with 1,000 BCs forming 100 different execution paths in the SBS, the global optimisation takes an average of only 1,299 milliseconds to complete. Fig. 12 also demonstrates the scalability of the two optimisation methods. The global optimisation method does not scale as well as the local optimisation method. However, it is still very scalable as its per-BC computational overhead remains around 1.2 milliseconds as the scenario scales up.
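As a quick arithmetic check on scalability, the per-BC overhead at the two endpoint scales can be derived from the totals quoted above. Only the endpoint figures are given in the text, so intermediate scales are not covered; the helper name `per_bc_overhead_ms` is hypothetical.

```python
def per_bc_overhead_ms(total_ms, num_bcs):
    """Average computational overhead per BC, in milliseconds."""
    return total_ms / num_bcs


# Totals quoted in the text for 10 and 1,000 BCs.
for method, (small, large) in {"global": (18.0, 1299.0), "local": (0.84, 102.0)}.items():
    print(method, per_bc_overhead_ms(small, 10), per_bc_overhead_ms(large, 1000))
```

At both endpoints the per-BC overhead of either method stays within the same order of magnitude, which is the sense in which the methods scale roughly linearly with the number of BCs.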

Furthermore, as discussed in Section 7.3, in large-scale scenarios, the advantage of the global optimisation method over the local optimisation method in VOM is much less significant than in small-scale scenarios. In addition, as presented in Fig. 12, the global optimisation method requires considerably more time than the local optimisation method to complete in large-scale scenarios. There is a tradeoff between VOM and computational overhead. The selection between the local optimisation and the global optimisation methods is dependent on the SBS developer or administrator’s domain-specific need and preference for the effectiveness and the efficiency of the monitoring strategy formulation. However, our experimental results can provide valuable guidance for the selection between the local and global optimisation methods. The lifetimes of different SBSs vary significantly, from seconds to days, months, and even years. Usually, the short-lived SBSs are executed only once while the long-lived ones are executed hundreds or thousands of times [35]. Thus, short-lived SBSs require fast formulation of monitoring strategies. For these SBSs, the local optimisation method is more suitable because the monitoring strategy can be formulated much faster using the local optimisation method (an average of only

Fig. 11. Computational overhead of CriMon in criticality evaluation.

Fig. 12. Computational overhead comparison between local and global optimisation.


44.41 milliseconds for OnlineLive across all experimental cases versus global optimisation’s 562.38 milliseconds). On the other hand, for long-lived SBSs, globally optimised VOM is considered more important than fast monitoring strategy formulation. Thus, the global optimisation method is preferable as it benefits the SBSs more in the long term compared to the local optimisation method.

8 RELATED WORK

During the past years, the problem of QoS-aware service composition has attracted much attention. Many efforts have been devoted to selecting appropriate component services at design time to fulfil the quality requirements for SBSs. Representative pieces of such work include [5], [7], [36], [64]. In [64], a middleware platform, AgFlow, is presented that uses Integer Programming to compute the optimal plan for the execution of composite SBSs from several execution paths represented by DAGs. Following the work in [64], the authors of [7] use mixed integer linear programming (MILP) to solve the QoS-aware service selection problems that involve decimal variables. The authors of [5] adopt the Skyline technique to deal with the computational complexity of QoS-aware service composition. In [36], the authors take a further step by taking the network latency into account when selecting services for SBSs.

After service selection, there are different approaches for guaranteeing the quality of SBSs at runtime. Service level agreements are often used to provide contractual QoS guarantees for SBSs [23]. SLA negotiation approaches for service composition have also been proposed [19], [63]. However, services need to be monitored at runtime to substantially guarantee that their qualities conform to the SLAs [50] because SLAs only provide traditional recourse, rather than timely alerts of impending SLA violations [52]. At runtime, anomalies, e.g., QoS violations and data transmission errors, may occur due to the volatility in service-oriented environments, resulting in QoS and SLA violations. In order to actually guarantee the quality of the SBSs, runtime adaptation must be performed to fix those anomalies. Some adaptation approaches have been proposed [11], [15], [21], [27], [38], and they are based on the premise that the service composition must be monitored for the detection and prediction of runtime anomalies. Adaptation approaches work effectively only when runtime anomalies can be detected or predicted in time.

Many efforts have been made to enable and facilitate web service monitoring. In [22], the authors propose an approach to support dynamic configuration of SLA monitoring responsibilities for different monitoring components. Given an SLA, they first decompose it into manageable monitoring configurations and then allocate monitoring resources for different parts of the SLA. In the WSLA Framework [31], an SLA compliance monitor is proposed to realise automatic configuration of SLA monitoring services. In this SLA compliance monitor, three web service monitoring services, namely Measurement Service, Condition Evaluation Service and Deployment Service, are implemented for the purposes of metric definitions, metric update receipts and WSLA document decomposition respectively. The authors of [52] propose a two-level monitoring system named ReqMon. In ReqMon, each web service site contains at least one monitor server and there is a global integrative monitor that controls individual monitors. The individual monitor services are responsible for detecting web service failures while the global monitor alerts the clients to these failures. In [2], web service management network (WSMN) agents are proposed to distribute and automate the SLA monitoring process. Those agents intercept the SOAP messages exchanged in web service interactions, evaluate SLAs based on the collected data and report SLA violations if there are any. In [50], timed automata are used to support monitoring of timeliness, reliability and throughput constraints expressed in SLAs. By adopting this technique, SLA violations can be detected by analysing the types of SOAP messages exchanged between service consumers and providers.

The issue of monitoring service compositions is more complex than monitoring individual services as it involves multiple component services. In [40], a framework is presented for runtime verification of requirements for service composition. The framework supports monitoring the component services’ behaviours at runtime. Monitoring is realised by intercepting the events exchanged between the composition process and the component services. In [12], an approach is proposed to implement runtime monitoring of WS-BPEL processes. External monitoring rules, which provide parameters to govern the degree of runtime checking, can be woven into the service composition. These parameters include type of monitor, priority and validity of monitoring rules, and certified service providers. In [9], the authors propose an assertion language named ALBERT to specify both functional and non-functional properties of web service compositions. At runtime, the assertions are checked by Dynamo [10], a proxy-based monitoring infrastructure. Astro is a monitoring solution proposed in [8], aiming at separating the business logic of a web service from its monitoring functionality. Astro can monitor both single SBSs and multiple SBSs in a class. Combining Astro and Dynamo, the authors propose a general and comprehensive solution for monitoring service compositions. In the new monitoring solution, monitoring constraints can be defined on single and multiple instances, on punctual properties and on complete behaviours. In [13], SECMOL, a general monitoring language grounded on three existing monitoring languages, namely EC-Assertion [56], SLANG [43] and WSCoL [12], is proposed. In SECMOL, the Data Collector captures and extracts the data needed to perform monitoring, the Data Analyzer analyses the data collected by the Data Collector, and the Monitoring Manager integrates and oversees the whole monitoring process. WSCoL is also adopted in [11] as a means to enrich service compositions with self-supervision capabilities. In the S-Cube project [54], the researchers have proposed several techniques to address monitoring issues. In [24], the authors propose a framework that integrates monitoring across the software and infrastructure layers. A variation of the MAPE control loop is introduced into the framework that acknowledges the multi-faceted nature of SBSs. In [33], the authors integrate SALMon [46] in IMA4SSP, their monitoring approach to seamless service provisioning. SALMon is a monitoring system that can collect dynamic


reliability information of SBSs expressed in pre-defined quality metrics.

While producing benefits by detecting and predicting runtime anomalies, monitoring consumes resources and hence incurs resource cost. However, none of the existing work has properly considered the resource cost, which should not be neglected, especially when cost-effective monitoring is required. Furthermore, monitoring also incurs system cost [28], [59]. Monitoring strategy formulation must consider both the benefit and cost of monitoring. Similar issues exist in service selection [25], [26] and service adaptation [27] for SBSs. The authors of [25], [26] adopt the concept of value of changed information to determine which services to select for composing an SBS. In [27], we adopt the same concept for the evaluation of service adaptation strategies for SBSs. In response to runtime anomalies, candidate adaptation strategies are evaluated to determine whether they are worth implementing. The tradeoff between the benefit and cost of monitoring also needs to be managed when formulating monitoring strategies for SBSs. Unfortunately, this issue has not been addressed properly in the existing work on monitoring SBSs.

In this paper, we have proposed CriMon, a novel monitoring strategy formulation approach for SBSs, aiming at addressing the abovementioned issues. CriMon calculates the criticalities of the execution paths and BCs of the SBS to determine which parts of the SBS should be prioritised for monitoring. Then, CriMon calculates the VOM of possible local monitoring strategies for BCs. Finally, CriMon formulates the optimal global monitoring strategy for the SBS using one of the two optional optimisation methods, i.e., local optimisation and global optimisation.

9 CONCLUSIONS AND FUTURE WORK

In this paper, we have presented CriMon, a novel approach to formulating monitoring strategies for service-based systems that is based on the criticality of system components and aims at maximising the value of monitoring. A probabilistic timing model is proposed to take into account the randomness of the timing properties of the basic components of an SBS in volatile operating environments. Based on the timing model, the criticalities of the execution paths and BCs of the SBS can be calculated. Then, two methods, namely local optimisation and global optimisation, are introduced to formulate monitoring strategies for SBSs, which take into account the criticalities of the execution paths and the BCs, and the VOM. Experimental results have shown that CriMon can facilitate cost-effective response time management for SBSs in volatile operating environments. The experimental results also give guidance on determining the monitoring resources needed to meet different levels of response time requirements. We have also evaluated the local and global optimisation methods for monitoring strategy formulation. The results show that the global optimisation can obtain significantly better VOM than the local optimisation with a higher yet acceptable computational overhead. However, the local optimisation method is much more efficient and thus is more suitable for monitoring short-lived SBSs.

At present, CriMon does not differentiate the types of monitors. This issue will be investigated as future work. In the future, we will also investigate the formulation of monitoring strategies for SBSs across multiple levels, including SaaS, IaaS and PaaS, to accommodate the unique characteristics of SBSs in the cloud.

ACKNOWLEDGMENTS

This work was partly supported by the Australian Research Council (projects LP0775188 and DP110101340) and CA Labs. This paper is a significant revision and extension of [47].

REFERENCES

[1] IP Flow Information Export (IPFIX), URL: http://datatracker.ietf.org/wg/ipfix/, 2013.

[2] A. Sahai, V. Machiraju, M. Sayal, A. van Moorsel, and F. Casati, “Automated SLA Monitoring for Web Services,” Proc. 13th IFIP/IEEE Int’l Workshop Distributed Systems: Operations and Management (DSOM ’02), pp. 28-41, 2002.

[3] M.M. Akbar, E.G. Manning, G.C. Shoja, and S. Khan, “Heuristic Solutions for the Multiple-Choice Multi-Dimension Knapsack Problem,” Proc. Int’l Conf. Computational Science (ICCS ’01), pp. 659-668, 2001.

[4] E. Al-Masri and Q.H. Mahmoud, “Investigating Web Services on the World Wide Web,” Proc. 17th Int’l Conf. World Wide Web (WWW ’08), pp. 795-804, 2008.

[5] M. Alrifai, D. Skoutas, and T. Risse, “Selecting Skyline Services for QoS-based Web Service Composition,” Proc. 19th Int’l Conf. World Wide Web (WWW ’10), pp. 11-20, 2010.

[6] Amazon CloudWatch, URL: http://aws.amazon.com/cloudwatch/, 2013.

[7] D. Ardagna and B. Pernici, “Adaptive Service Composition in Flexible Processes,” IEEE Trans. Software Eng., vol. 33, no. 6, pp. 369-384, June 2007.

[8] F. Barbon, P. Traverso, M. Pistore, and M. Trainotti, “Run-Time Monitoring of Instances and Classes of Web Service Compositions,” Proc. IEEE Int’l Conf. Web Services (ICWS ’06), pp. 63-71, 2006.

[9] L. Baresi, D. Bianculli, C. Ghezzi, S. Guinea, and P. Spoletini, “Validation of Web Service Compositions,” IET Software, vol. 1, no. 6, pp. 219-232, Dec. 2007.

[10] L. Baresi and S. Guinea, “Dynamo: Dynamic Monitoring of WS-BPEL Processes,” Proc. Third Int’l Conf. Service-Oriented Computing (ICSOC ’05), pp. 478-483, 2005.

[11] L. Baresi and S. Guinea, “Self-Supervising BPEL Processes,” IEEE Trans. Software Eng., vol. 37, no. 2, pp. 247-263, Mar./Apr. 2011.

[12] L. Baresi and S. Guinea, “Towards Dynamic Monitoring of WS-BPEL Processes,” Proc. Third Int’l Conf. Service-Oriented Computing (ICSOC ’05), pp. 269-282, 2005.

[13] L. Baresi, S. Guinea, O. Nano, and G. Spanoudakis, “Comprehensive Monitoring of BPEL Processes,” IEEE Internet Computing, vol. 14, no. 3, pp. 50-57, May/June 2010.

[14] C. Bettini, D. Maggiorini, and D. Riboni, “Distributed Context Monitoring for the Adaptation of Continuous Services,” World Wide Web, vol. 10, no. 1, pp. 503-528, 2007.

[15] R. Calinescu, L. Grunske, M. Kwiatkowska, R. Mirandola, and G. Tamburrelli, “Dynamic QoS Management and Optimisation in Service-Based Systems,” IEEE Trans. Software Eng., vol. 37, no. 3, pp. 387-409, May/June 2011.

[16] K.S. Candan, W.-S. Li, T. Phan, and M. Zhou, “Frontiers in Information and Software as Services,” Proc. 25th Int’l Conf. Data Eng. (ICDE ’09), pp. 1761-1768, 2009.

[17] CISCO, Cisco IOS NetFlow, URL: http://www.cisco.com/en/US/products/ps6601/products_ios_protocol_group_home.html, 2013.

[18] E.C. Clark, “The Greatest of a Finite Set of Random Variables,” Operations Research, vol. 9, no. 2, pp. 145-162, 1961.

[19] E. Di Nitto, M. Di Penta, A. Gambi, G. Ripa, and M.L. Villani, “Negotiation of Service Level Agreements: An Architecture and a Search-Based Approach,” Proc. Fifth Int’l Conf. Service-Oriented Computing (ICSOC ’07), pp. 295-306, 2007.

480 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 40, NO. 5, MAY 2014


[20] E. Di Nitto, C. Ghezzi, A. Metzger, M.P. Papazoglou, and K. Pohl, “A Journey to Highly Dynamic, Self-Adaptive Service-Based Applications,” Automated Software Eng., vol. 15, no. 3/4, pp. 313-341, 2008.

[21] D. Eppstein, “Finding the k Shortest Paths,” SIAM J. Computing, vol. 28, no. 2, pp. 652-673, 1998.

[22] H. Foster and G. Spanoudakis, “Advanced Service Monitoring Configurations with SLA Decomposition and Selection,” Proc. ACM Symp. Applied Computing (SAC ’11), pp. 1582-1589, 2011.

[23] X. Gu, K. Nahrstedt, R.N. Chang, and C. Ward, “QoS-Assured Service Composition in Managed Service Overlay Networks,” Proc. 23rd Int’l Conf. Distributed Computing Systems (ICDCS ’03), pp. 194-203, 2003.

[24] S. Guinea, G. Kecskemeti, A. Marconi, and B. Wetzstein, “Multi-layered Monitoring and Adaptation,” Proc. Ninth Int’l Conf. Service-Oriented Computing (ICSOC ’11), pp. 359-373, 2011.

[25] J. Harney and P. Doshi, “Adaptive Web Processes Using Value of Changed Information,” Proc. Fourth Int’l Conf. Service-Oriented Computing (ICSOC ’06), pp. 179-190, 2006.

[26] J. Harney and P. Doshi, “Speeding Up Adaptation of Web Service Compositions Using Expiration Times,” Proc. 16th Int’l Conf. World Wide Web (WWW ’07), pp. 1023-1032, 2007.

[27] Q. He, J. Yan, H. Jin, and Y. Yang, “Adaptation of Web Service Composition Based on Workflow Patterns,” Proc. Sixth Int’l Conf. Service-Oriented Computing (ICSOC ’08), pp. 22-37, 2008.

[28] G. Heward, J. Han, I. Müller, J.-G. Schneider, and S. Versteeg, “Optimizing the Configuration of Web Service Monitors,” Proc. Eighth Int’l Conf. Service-Oriented Computing (ICSOC ’10), pp. 587-595, 2010.

[29] G. Heward, I. Müller, J. Han, J.-G. Schneider, and S. Versteeg, “Assessing the Performance Impact of Service Monitoring,” Proc. 21st Australian Software Eng. Conf. (ASWEC ’10), pp. 192-201, 2010.

[30] G. Katsaros, G. Kousiouris, S.V. Gogouvitis, D. Kyriazis, A. Menychtas, and T. Varvarigou, “A Self-Adaptive Hierarchical Monitoring Mechanism for Clouds,” J. Systems and Software, vol. 85, no. 5, pp. 1029-1041, 2012.

[31] A. Keller and H. Ludwig, “The WSLA Framework: Specifying and Monitoring Service Level Agreements for Web Services,” J. Network and Systems Management, vol. 11, no. 1, pp. 57-81, Mar. 2003.

[32] J. Kelley and W. Morgan, “Critical-Path Planning and Scheduling,” Proc. Eastern Joint IRE-AIEE-ACM Computer Conf., pp. 160-173, 1959.

[33] A. Kertész, G. Kecskeméti, A. Marosi, M. Oriol, X. Franch, and J. Marco, “Integrated Monitoring Approach for Seamless Service Provisioning in Federated Clouds,” Proc. 20th Euromicro Int’l Conf. Parallel, Distributed, and Network-Based Processing (PDP ’12), pp. 567-574, 2012.

[34] R. Khalaf, N. Mukhi, and S. Weerawarana, “Service-Oriented Composition in BPEL4WS,” Proc. 12th Int’l Conf. World Wide Web (WWW ’03), 2003.

[35] A. Klein, F. Ishikawa, and S. Honiden, “Efficient QoS-Aware Service Composition with a Probabilistic Service Selection Policy,” Proc. Eighth Int’l Conf. Service-Oriented Computing (ICSOC ’10), pp. 182-196, 2010.

[36] A. Klein, F. Ishikawa, and S. Honiden, “Towards Network-Aware Service Composition in the Cloud,” Proc. 21st World Wide Web Conf. (WWW ’12), pp. 959-968, 2012.

[37] M. Köksalan and S. Zionts, Multiple Criteria Decision Making in the New Millennium. Springer, 2001.

[38] P. Leitner, A. Michlmayr, F. Rosenberg, and S. Dustdar, “Monitoring, Prediction and Prevention of SLA Violations in Composite Services,” Proc. IEEE Int’l Conf. Web Services (ICWS ’10), pp. 369-376, 2010.

[39] X. Lu, R.O. Morando, and M.E. Zarki, “Understanding Video Quality and Its Use in Feedback Control,” Proc. 12th Int’l Packet Video Workshop (PV ’02), 2002.

[40] K. Mahbub and G. Spanoudakis, “Run-Time Monitoring of Requirements for Systems Composed of Web-Services: Initial Implementation and Evaluation Experience,” Proc. IEEE Int’l Conf. Web Services (ICWS ’05), pp. 257-265, 2005.

[41] Microsoft System Center Global Service Monitor, URL: http://www.microsoft.com/en-us/server-cloud/system-center/global-service-monitor.aspx, 2013.

[42] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M.C. Rinard, “Quality of Service Profiling,” Proc. 32nd ACM/IEEE Int’l Conf. Software Eng. (ICSE ’10), pp. 25-34, 2010.

[43] O. Nano and M. Tilly, “Filling the Gap between SLA and Monitoring,” Proc. eChallenges e-2006 Conf., 2006.

[44] OASIS, Web Services Business Process Execution Language Version 2.0, 2007, URL: http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.pdf, 2013.

[45] Object Management Group, Business Process Model and Notation (BPMN) Version 2.0, 2011, URL: http://www.omg.org/spec/BPMN/2.0/PDF/, 2013.

[46] M. Oriol, J. Marco, X. Franch, and D. Ameller, “Monitoring Adaptable SOA-Systems Using SALMon,” Proc. Workshop Service Monitoring, Adaptation and Beyond (Mona+) (ServiceWave ’08), pp. 19-28, 2008.

[47] H. Qiang, J. Han, Y. Yang, J.-G. Schneider, H. Jin, and S. Versteeg, “Probabilistic Critical Path Identification for Cost-Effective Monitoring of Cloud-Based Software Applications,” Proc. Ninth Int’l Conf. Service Computing (SCC ’12), pp. 178-185, 2012.

[48] J.M. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice Hall, 1995.

[49] M. Rahman, S. Venugopal, and R. Buyya, “A Dynamic Critical Path Algorithm for Scheduling Scientific Workflow Applications on Global Grids,” Proc. Third Int’l Conf. e-Science and Grid Computing, pp. 35-42, 2007.

[50] F. Raimondi, J. Skene, and W. Emmerich, “Efficient Online Monitoring of Web-Service SLAs,” Proc. 16th ACM SIGSOFT Int’l Symp. Foundations of Software Eng. (SIGSOFT FSE ’08), pp. 170-180, 2008.

[51] A.R. Reibman, V.A. Vaishampayan, and Y. Sermadevi, “Quality Monitoring of Video over a Packet Network,” IEEE Trans. Multimedia, vol. 6, no. 2, pp. 327-334, Apr. 2004.

[52] W.N. Robinson, “Monitoring Web Service Requirements,” Proc. 11th Int’l Conf. Requirements Eng. (ICRE ’03), pp. 65-74, 2003.

[53] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, second ed. Prentice Hall, 2003.

[54] S-Cube Software Services and Systems Network, URL: http://www.s-cube-network.eu/, 2013.

[55] J.H. Son and M.H. Kim, “Improving the Performance of Time-Constrained Workflow Processing,” J. Systems and Software, vol. 58, no. 3, pp. 211-219, 2001.

[56] G. Spanoudakis and K. Mahbub, “Non-Intrusive Monitoring of Service-Based Systems,” Int’l J. Cooperative Information Systems, vol. 15, no. 3, pp. 325-358, 2006.

[57] S. Tao, J.G. Apostolopoulos, and R. Guérin, “Real-Time Monitoring of Video Quality in IP Networks,” IEEE/ACM Trans. Networking, vol. 16, no. 5, pp. 1052-1065, Oct. 2008.

[58] K.S. Trivedi, Probability and Statistics with Reliability, Queueing, and Computer Science Applications. Wiley-Interscience, 2001.

[59] C. Verbowski, E. Kiciman, A. Kumar, B. Daniels, S. Lu, J. Lee, Y.-M. Wang, and R. Roussev, “Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management,” Proc. Seventh Symp. Operating Systems Design and Implementation (OSDI ’06), pp. 117-130, 2006.

[60] W.-L. Wang, D. Pan, and M.-H. Chen, “Architecture-Based Software Reliability Modeling,” J. Systems and Software, vol. 79, no. 1, pp. 132-146, 2006.

[61] B. Wassermann and W. Emmerich, “Monere: Monitoring of Service Compositions for Failure Diagnosis,” Proc. Ninth Int’l Conf. Service-Oriented Computing (ICSOC ’11), pp. 344-358, 2011.

[62] L. Wolsey, Integer Programming. Wiley-Interscience, 1998.

[63] J. Yan, R. Kowalczyk, J. Lin, C.M.B., S. Goh, and J.Y. Zhang, “Autonomous Service Level Agreement Negotiation for Service Composition Provision,” Future Generation Computer Systems, vol. 23, no. 6, pp. 748-759, 2007.

[64] L. Zeng, B. Benatallah, A.H.H. Ngu, M. Dumas, J. Kalagnanam, and H. Chang, “QoS-Aware Middleware for Web Services Composition,” IEEE Trans. Software Eng., vol. 30, no. 5, pp. 311-327, May 2004.

[65] Z. Zheng and M.R. Lyu, “Collaborative Reliability Prediction of Service-Oriented Systems,” Proc. 32nd ACM/IEEE Int’l Conf. Software Eng. (ICSE ’10), pp. 35-44, 2010.



Qiang He received the first PhD degree in information and communication technology from Swinburne University of Technology (SUT), Australia, in 2009 and the second PhD degree in computer science and engineering from Huazhong University of Science and Technology (HUST), China, in 2010. He is currently a research fellow at SUT. His research interests include services computing, cloud computing, P2P system, workflow management and agent technologies. He is a member of the IEEE.

Jun Han received the BEng and MEng degrees in computer science and engineering from Beijing University of Science and Technology in 1982 and 1986, respectively, and the PhD degree in computer science from the University of Queensland in 1992. He has been a professor of software engineering at Swinburne University of Technology, Melbourne, Australia since 2003. He has also been a research leader with Australia’s Cooperative Research Centre in Smart Services (Smart Services CRC) and Cooperative Research Centre in Advanced Automotive Technology (AutoCRC). From 1992 to 2003, he was with the University of Queensland and Monash University. His research interests include adaptive and context-aware software systems, services engineering and management, software and system architectures, software security and performance, and system integration, evolution and interoperability. He has published more than 200 peer-reviewed papers in international journals and conference proceedings.

Yun Yang received the BSci degree from Anhui University, Hefei, China, in 1984, the MEng degree from the University of Science and Technology of China, Hefei, China, in 1987, and the PhD degree from the University of Queensland, Brisbane, Australia, in 1992, all in computer science. He is currently a full professor at Swinburne University of Technology, Melbourne, Australia. Prior to joining Swinburne as an associate professor, he was a lecturer and a senior lecturer at Deakin University during 1996-1999. Before that, he was a (senior) research scientist at DSTC Cooperative Research Centre for Distributed Systems Technology during 1993-1996. He was also at Beihang University during 1987-1988. He has coauthored four books and published more than 200 papers in journals and refereed conferences. His current research interests include software technologies, cloud computing, p2p/grid/cloud workflow systems, and service-oriented computing. He is a senior member of the IEEE.

Hai Jin received the PhD degree in computer engineering from HUST in 1994. He is a Cheung Kung Scholars chair professor of computer science and engineering at the Huazhong University of Science and Technology (HUST) in China. He is currently the dean of the School of Computer Science and Technology at HUST. In 1996, he received a German Academic Exchange Service fellowship to visit the Technical University of Chemnitz in Germany. He was at the University of Hong Kong between 1998 and 2000, and as a visiting scholar at the University of Southern California between 1999 and 2000. He received Excellent Youth Award from the National Science Foundation of China in 2001. He is the chief scientist of ChinaGrid, the largest grid computing project in China, and the chief scientist of National 973 Basic Research Program Project of Virtualization Technology of Computing System. He has co-authored 15 books and published more than 400 research papers. His research interests include computer architecture, virtualization technology, cluster computing and grid computing, peer-to-peer computing, network storage, and network security. He is the steering committee chair of International Conference on Grid and Pervasive Computing (GPC), Asia-Pacific Services Computing Conference (APSCC), International Conference on Frontier of Computer Science and Technology (FCST), and Annual ChinaGrid Conference. He is a member of the steering committee of the IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), the IFIP International Conference on Network and Parallel Computing (NPC), and the International Conference on Grid and Cooperative Computing (GCC), International Conference on Autonomic and Trusted Computing (ATC), International Conference on Ubiquitous Intelligence and Computing (UIC). He is a member of the Grid Forum Steering Group (GFSG). He is a senior member of the IEEE and a member of the ACM.

Jean-Guy Schneider received the MSc and PhD degrees in computer science and applied mathematics from the University of Berne, Switzerland, in 1992 and 1999, respectively. Since 2000, he has been a lecturer, a senior lecturer, and an associate professor of software engineering at Swinburne University of Technology. His main research interests lie in the general area of reliable software systems and are positioned in the intersection of software engineering and computer science. More specifically, his research interests are in object-oriented and concurrent/distributed/service-oriented programming, scripting and composition languages, and the definition of formal approaches for component-based software engineering. Furthermore, he is interested in methodologies and tools in the context of the evolution of object and component-based software systems, agile software development processes, mobile computing, as well as the influence and applicability of software development processes in tertiary education. Since April 2013, he has been the head of computer science and software engineering within the Faculty of ICT.

Steve Versteeg received the PhD degree in computer science from the University of Melbourne. His PhD research was in the area of neural simulation. He is a research staff member with CA Labs, based in Melbourne, Australia. His role is to coordinate collaborative research between universities and CA Technologies. His current projects are in the areas of cloud computing, software engineering, large scale endpoint emulation, role engineering and insider threat prediction. A well-studied neural circuit was used as a case study for recreating robust behaviour in computer systems. From 2004 until early 2008, he was at Elysium Capital as a senior developer and researcher on an experimental automated futures trading system.

