
An Adaptive Online Scheme for Scheduling and Resource Enforcement in Storm

Shengchao Liu 1,2, Jianping Weng 1,2, Jessie Hui Wang 1,2, Changqing An 1,2, Yipeng Zhou 3, Jilong Wang 1,2

1 Institute for Network Sciences and Cyberspace, Tsinghua University, China

2 Beijing National Research Center for Information Science and Technology, China
3 Department of Computing, Macquarie University, Australia

Abstract—As more and more applications need to analyze unbounded data streams in a real-time manner, data stream processing platforms such as Storm have drawn the attention of many researchers, especially regarding the scheduling problem. However, there are still many challenges unnoticed or unsolved. In this paper, we propose and implement an adaptive online scheme to solve three important challenges of scheduling. First, how to make scaling decisions in a real-time manner to handle fluctuating load without congestion? Second, how to minimize the number of affected workers during rescheduling while satisfying the resource demand of each instance? We also point out that stateful instances should not be placed on the same worker as stateless instances. Third, currently the application performance cannot be guaranteed because of resource contention, even if the computation platform implements an optimal scheduling algorithm. In this paper, we realize resource isolation using Cgroup, and the performance interference caused by resource contention is thereby mitigated. We implement our scheduling scheme and plug it into Storm, and our experiments demonstrate that in some respects our scheme achieves better performance than state-of-the-art solutions.

Index Terms—resource allocation, scheduling, stream processing, Storm.

I. INTRODUCTION

Nowadays, various sources such as hardware sensors or software applications are producing continuous data streams of large volume. People from industry and academia would like to be able to collect and analyze these data streams to retrieve valuable knowledge or detect anomalies in a real-time manner [1] [2]. Therefore, in recent years many stream processing programming models and computation platforms have been proposed, such as STREAM [3], Borealis [4], System S [5], StreamBase [6], InfoSphere Streams [7], S4 (Simple Scalable Streaming System) [8], D-Stream [9], Storm [10], Flink [11], Heron [12], and Samza [13]. In most current streaming systems, in order to run an application, a user needs to represent the logic of the application as a directed acyclic graph (DAG) and specify its requirements. An application DAG is a set of interconnected operators, with each operator encapsulating the semantics of a specific operation, e.g. receiving incoming tuples, conducting a computation, or generating new outgoing tuples. The directed edges in the DAG indicate how the tuple streams are routed to be processed. If the user code of an operator computes a result over multiple tuples it receives, the operator needs to maintain some state while running, and is therefore defined as a stateful operator. Otherwise, the operator is defined as a stateless operator.

Corresponding author: J.H. Wang ([email protected]).

The DAG models an application logically. In the physical layer, for parallel execution, each operator needs to be appropriately replicated in multiple instances that split and process the arriving tuples simultaneously. The number of instances is referred to as the parallelism degree of the operator. These instances are executed in one or multiple worker processes on multiple physical machines, i.e. worker nodes.

The capacity of one streaming application, i.e. the maximum data rate it can handle without congestion, can be affected by many factors, such as the parallelism degree of each operator, the capacity of each instance, and the assignment of each operator instance to workers and worker nodes. Since the data stream may arrive at the application at a fluctuating rate, we would naturally like the capacity of the application to adapt dynamically to the real-time rate of the data stream. In other words, it is valuable for a stream processing platform to incorporate an automatic and dynamically adaptive scheduling and enforcement algorithm that determines all the factors mentioned above, guarantees the performance of applications, and exploits the resources of the physical infrastructure efficiently.

The scheduling problem has drawn attention from many researchers. However, it has not been well solved yet. For example, currently Storm is not able to adjust the capacity of its applications according to their data arrival rates automatically and adaptively. Although Storm users can set parallelism degrees for operators in their topologies using APIs at runtime, there are several problems. First, it is not transparent to Storm users, i.e., users need to invoke the API frequently by themselves to change the configuration of their applications. Second, Storm users may not be aware of the optimal parallelism degrees, and the assumption frequently made here by researchers, i.e. that capacity is linear with parallelism degree, may not hold. Third, the assignment of instances to workers and worker nodes is done by Storm's default EvenScheduler periodically every 10 seconds, but that scheduler just assigns executors to the configured number of workers in a round-robin manner with the aim of producing an even allocation, which may not be efficient. Fourth, during migrations caused by re-scheduling, the application performance degrades, so the migration cost must be considered. Furthermore, applications of different Storm users may have resource contention, which means their application performance cannot be guaranteed.

In this paper, we propose an adaptive online scheme for scheduling and resource enforcement on stream processing frameworks. In particular, we make the following contributions.

• Our scheme can make scaling decisions, i.e. determine the amount of resources needed by each instance, in a more timely manner to handle unexpected spikes of load without congestion or resource waste.

• We point out that instances of stateful operators and stateless operators should not be colocated on the same worker in current implementations of pause-resume migration approaches, and we propose a resource- and cost-aware placement algorithm that minimizes the number of affected workers.

• Our scheme can enforce the resource allocation decisions made by our scheduler and mitigate performance interference caused by resource contention by isolating instances using Cgroup [14].

We design our scheduler and implement and integrate it into Storm. Experiments based on a popular topology, WordCount, and a realistic application demonstrate that our algorithm can adapt the capacity of one topology to its stream data arrival rate and produce scheduling and deployment strategies that achieve better performance, e.g. shorter complete time, larger throughput and fewer lost tuples, compared to state-of-the-art solutions.

The remainder of the paper is organized as follows. Section II presents an overview of related work. Section III describes the scheduling problem formally and introduces the framework and algorithms of our scheme. In Section IV, we describe how we implement our scheme on Storm, especially how we monitor the running status of topologies and how we guarantee isolation using Cgroup. In Section V, we conduct experiments using the WordCount topology and a realistic application topology and present various performance statistics. We also demonstrate the necessity of resource isolation with an experiment. Section VI concludes our work.

II. RELATED WORK

In recent years, the scheduling problem has drawn the attention of many researchers, and there have been research efforts to revise the scaling and deployment of topologies periodically at runtime, with the goal of optimizing performance for stream processing applications with fluctuating loads. We summarize these research efforts from the following three aspects of the scheduling problem, i.e., scaling decisions, placement strategies and migration issues.

Scaling Decisions. As the arriving load fluctuates, the amount of resources allocated to the stream processing application should adapt to the load to ensure that the processing performance is acceptable while the resources are utilized efficiently. Current works differ from each other in terms of scaling dimensions and the method used to determine the proper scaling decision.

Many works try to scale in or out by adjusting the number of replications, i.e. the parallelism degree [15][16][17][18]. The authors of [15] focus on IBM's System S and make its operators scalable. In [16], the authors calculate the required parallelism degree for each component based on an estimation of its current capacity and a prediction of the data arrival rate in the next time window. In [17], once an operator is found to be a potential bottleneck, the system tries to increase the number of tasks for the operator by 1. In [18], if the estimated CPU usage per replication of one operator is beyond a predefined threshold region, its parallelism degree is reconfigured.

There are also other scaling dimensions. In [19], the authors propose a topology-based scaling mechanism. When a topology is overloaded, it scales by adjusting the number of cloned topologies, or it is replaced by a new topology with more tasks. In [20], the authors regulate the number of used cores and the CPU frequency through the DVFS (Dynamic Voltage and Frequency Scaling) function offered by modern multicore CPUs.

The reconfiguration of parallelism degrees causes restarts of the involved workers and degrades the application performance. The topology substitution method also results in a long period of unavailability. Furthermore, because of resource contention, it cannot be guaranteed that the application performance improves as expected when the parallelism degrees of operators or topologies are adjusted. In fact, it is regarded as hard to formally model the resource contention among operators or topologies [17].

The scaling of the CPU frequency proposed in [20] is only suitable for the case in which the underlying architecture is dedicated to the execution of one elastic parallel operator. It cannot work if more than one operator or multiple applications are sharing the resources.

In this article, our scaling dimension is the amount of resources allocated to each replication. Storm version 1.1.1 has integrated a resource-aware scheduler, R-Storm, which enables Storm users to specify the resource demand of each task [21]. However, the resource demand must be specified before the topology is started, and it cannot be tuned at runtime. Furthermore, R-Storm cannot guarantee that each component really gets the specified demand while running. In our work, we exploit Cgroup [14] to isolate the instances of components. Our solution can mitigate interference caused by resource contention among workers and executors, and it works well for streaming applications running on shared infrastructure such as public clouds.

The method to determine the proper scaling decision can be based on profiling approaches [17][18] or analytical models [20][22]. In [17] and [18], the authors exploit the profiling approach to build the relationship between resource provisioning and the performance metrics of the application. The profiling results are then used to evaluate different scaling configurations, and the best configuration is selected. In [20] and [22], the authors predict the performance metrics, e.g. mean service time and mean waiting time, under different resource provisioning using queueing theory. Profiling is argued in [17] to provide more reliable results via real experiments than abstract analytical models. But the profiling phase takes time.


Similar to [16], we predict the future load of one operator using two models, i.e. a regression over its load in previous sliding windows and the load of its parent operators. However, [16] just assumes that the capacity of one component increases linearly with its parallelism degree, and it does not consider that the amount of resources each replication really acquires may also vary as the parallelism degree changes, while we calculate the resource demand of each replication directly and enforce the allocated resources of each replication using Cgroup. Compared to the profiling approaches and the queueing-theoretic models, our method can adapt to strongly fluctuating load in a more timely manner. The profiling or modelling results can be integrated into our scheme if desired, and the integrated schemes would also enjoy the benefits brought by scaling resources per replication and isolating resource allocations using Cgroup.

Besides calculating resource demands, there are some other issues in making scaling decisions. For example, in [23], the authors want to find the right point in time to scale in/out. In [24], the authors focus on the application of spatial preference queries and point out that the auto-scaling system should adopt a two-level architecture to cope with fluctuations at different time-scales.

Placement Strategies. The replications of operators should be placed on available physical servers to be executed, and how to place them also impacts the application performance. In summary, placement strategies can be resource-aware, traffic-aware, and cost-aware.

Resource-aware means that only physical servers with sufficient resources are considered when we place a replication. In addition, some placement strategies would like to see the load spread evenly among servers, while others would like to minimize the number of occupied servers. Traffic-aware means that we always want to minimize the latency caused by transferring tuples from one operator to its child operators. Basically, inter-node communication is slower than inter-process communication, which is slower than intra-process communication. Cost-aware means we should consider the cost caused by moving replications or operators from one place to another.

Many placement strategies are both resource-aware and traffic-aware [25][26][21][27][28]. In [25], Aniello et al. notice that in topologies where the computation latency is dominated by tuple transfer time, limiting the number of tuples that have to be sent and received through the network can help improve performance. Therefore, they propose to place executors that communicate with each other at high frequency in the same worker. The authors of [26] sort executors by their loads and solve a minimization problem for each executor one by one to find its server, minimizing inter-node and inter-process traffic while ensuring no worker node is overloaded. They argue there should be a single worker per worker node for one topology. That may minimize communication traffic, but it places stateful executors and stateless executors on the same worker, which brings the negative effect of unnecessary restarts of stateful executors. In [21], the authors select the node with the smallest Euclidean distance, which is defined based on the resource demand of the replication, the resource availability of the target node, and the bandwidth between the target node and a predefined Ref node. In [27], the authors formulate an Integer Linear Programming problem which explicitly takes into account the heterogeneity of computing and networking resources. In [28], the authors formulate a Mixed-Integer Linear Program to find the placement that minimizes the load imbalance among nodes, and they also exploit heuristics to colocate two replications with the most frequent communications.

If the placement is re-evaluated and revised periodically, the cost of re-assignment should be considered. When an instance is migrated from its current worker to a new worker, both involved workers must be restarted in Storm, and we need to pause the streams directed to the migrating instance and save the tuples in transit, so as to replay them as soon as the migration is completed. Of course, during migration the application performance degrades. The migration cost is more severe if the migrating instance belongs to a stateful operator, because we have to extract the current state from the old instance and restore it within the new instance. Furthermore, the sequence of migrations must be designed carefully when a placement transition involves migrations of multiple instances or operators [29].

The migration of executors has drawn the attention of researchers, but they mainly focus on how to guarantee that the migration is completed in a non-destructive way and all states of the bolt are preserved [30][31][32][33][34]. The work [30], published in 2004, studies the migration problem of a database stream query engine. Roughly speaking, migration methods can be classified into two categories. The first category exploits a pause-and-resume approach that blocks execution during migration [31][32]. In [26], the authors propose to introduce a certain delay to ensure old workers are not stopped until new workers are ready, which helps smooth the migration procedure. The second category tries to avoid any pause by maintaining more state, which requires more overhead. In either category, the migration degrades application performance. In [33], the authors focus on SDF (synchronous data flow) based stream programs and propose a suite of compiler and runtime techniques for live reconfiguration without periods of zero throughput. In [34], the authors propose two different migration approaches and conduct experiments to study their performance. However, in [35], the authors study five well-known frameworks and find that none of them has addressed all of the issues in state management. Therefore we should avoid unnecessary migrations, especially migrations of the instances of stateful operators.

Some research efforts on scaling and placement are cost-aware, i.e. they take migration cost into consideration. In [28], the authors assume that the migration cost of one executor is linear with the size of its state. By adding a constraint to the formulated optimization problem, they ensure that the total migration cost is less than a pre-defined threshold. In [36], the authors combine the different costs, such as reconfiguration cost, performance penalty and resource cost, into a single cost function, and then the estimated cost and prospective benefit of the reconfiguration are used to update the Q function in their reinforcement learning approach. Cardellini et al. also investigate the problem of how to minimize migration cost while satisfying the application requirements [37]. They formulate an optimization problem which aims to optimize a weighted sum of the normalized QoS attributes, including the downtime.

In this work, we just try to minimize the number of affected workers, especially the workers hosting stateful instances. We design two algorithms to determine how to deal with overloaded nodes and how to scale in.

There are some works focusing on the placement problem in slightly different scenarios. The authors of [38] focus on QoS-aware operator placement in geographically distributed Storm operating in a dynamic environment, and they propose a hierarchical distributed architecture. In [39], Xu et al. propose a scheme named Stela to solve the deployment optimization problem when a Storm user wants to add or remove servers (worker nodes). The operator placement issue also appears in Mobile Distributed Complex Event Processing (DCEP) systems. In such systems, there are more challenges caused by mobility, such as limited energy and moving data producers or consumers [40].

III. SCHEDULING PROBLEM AND SOLUTION

In the logical layer, users need to specify each of their applications as a directed acyclic graph (DAG) and submit the DAG to the processing framework. Each DAG can be denoted as G(N, E, T), wherein N represents the set of all operators in this application, E is the set of all directed edges in G, and T is a vector which includes the number of tasks for each operator. Let us assume N_s and N_t are two operators, i.e. N_s, N_t ∈ N. The edge from N_s to N_t is denoted by E_{s,t}, and E_{s,t} ∈ E indicates that tuples outputted from N_s would be routed to N_t for further processing. The numbers of tasks for N_s and N_t are specified in T_s and T_t. All this information should be specified by users before the application is started, and it cannot be changed at runtime.
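To make the notation concrete, the following is a minimal sketch in Java (with hypothetical names; this is not code from the paper) of the DAG structure G(N, E, T) that a user submits:

import java.util.*;

// A DAG G(N, E, T): operators N, directed edges E, and the task-count
// vector T, all fixed by the user before the application starts.
final class ApplicationDag {
    final Set<String> operators = new HashSet<>();                 // N
    final Set<Map.Entry<String, String>> edges = new HashSet<>();  // E
    final Map<String, Integer> tasks = new HashMap<>();            // T

    void addOperator(String name, int numTasks) {                 // N_s with T_s tasks
        operators.add(name);
        tasks.put(name, numTasks);
    }

    void addEdge(String src, String dst) {                        // E_{s,t}
        edges.add(Map.entry(src, dst));
    }
}

For instance, the WordCount topology of Section V would be expressed with the operators Spout, Split and Count and the edges Spout to Split and Split to Count.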

After the application is submitted, in the physical layer, it is the responsibility of the stream processing framework to determine how to schedule and execute all the tasks on its available servers, where each server is a worker node with limited resources. In this work, we focus on computing-intensive applications, therefore we mainly consider CPU resources. The problem of finding a solution to schedule and execute all the tasks on the available servers is referred to as the scheduling problem. Each stream processing framework has one or multiple schedulers that take the responsibility of solving this problem.

Let us take Storm as an example to illustrate the physical layer. As shown in Figure 1, in Storm, each worker node is configured with a limited number of slots, which are basically ports used by workers to receive messages. The number of slots on each server is pre-configured by Storm operators and cannot be changed after Storm is launched. Typically, it can be set to the number of cores on the server. Each worker must be mapped to a slot to receive messages, therefore the number of slots is in fact the maximum number of workers that can be run on this worker node. Please note that there can be multiple topologies running on Storm, and the limited number of slots is partitioned among the workers of all these topologies. One topology G can spawn an arbitrary number of workers on a worker node, as long as the topology can get free slots. Each worker spawns a number of executors to run the tasks. Each task corresponds to only one executor, but one executor can contain one (by default) or multiple tasks of one operator. We can see that the number of executors has to be smaller than or equal to the number of tasks, otherwise there would be executors without tasks to execute. As a result, Storm users tend to overestimate the number of tasks when they specify topologies, in order to make sure they can increase the number of executors without modifying their topologies when necessary.

Fig. 1. One worker node in Storm.

Since both the arriving load and the running environment can be dynamic, a stream processing framework may need to change its scheduling decision periodically to optimize the application performance.

After rescheduling, some instances might be assigned to workers/nodes different from the previous assignment. This is called migration. The application may suffer performance degradation during migrations, i.e., some instances become unavailable due to restarts. Therefore we should reduce the number and impact of migrations as much as possible.

In order to develop algorithms to solve the scheduling problem, we should understand the factors that can affect the running performance of one streaming application. Two key factors are listed as follows.

1) The number of instances for one operator and the amount of resources allocated to each instance. This is referred to as the scaling decision in this paper. There are some constraints when we make the scaling decision. The number of instances for one operator has to be smaller than or equal to the number of tasks specified by the DAG. Each instance can use at most one CPU core. Roughly speaking, in this work we propose to spawn the maximum number of instances to enable the largest amount of resources for future uncertainty, and to adjust the resources per instance dynamically to accommodate fluctuating loads. We also point out that the resource allocation per instance should be enforced to guarantee the system runs as expected in a resource-aware scheduling scheme.


2) The assignment/placement of each instance to workers and worker nodes. This is referred to as the placement decision in this paper. There are at least three considerations during placement, i.e., each instance must be able to acquire sufficient resources, the migrations caused by rescheduling should be minimized, and the cost caused by communication among instances should be minimized if possible. The consideration about communication cost leads us to propose using the fewest worker nodes and workers, as long as the worker nodes can provide sufficient resources.

Based on the above analysis, we design our algorithms to make the scaling decision and the placement decision.

A. Scaling Decision

We compute our scaling decision periodically. At t_i, we collect the data of running status during the (i−1)th timeslot, i.e. [t_{i−1}, t_i]. Now we need to predict the number of tuples each operator needs to process during the ith timeslot [t_i, t_{i+1}] and then derive the amount of resources each operator demands during this timeslot.

1) Load Prediction: A usual way to predict the future load of an operator C_j is from its loads in the previous φ timeslots with a linear regression model. Let us assume the arrival rate of tuples to C_j in timeslot k is L^j_k. Mathematically, at t_i, we compute the linear coefficients a^j_i and b^j_i by applying the linear least squares method as follows:

(a^j_i, b^j_i) = argmin_{a,b} Σ_{k=i−φ}^{i−1} (L^j_k − (a × t_k + b))².

Let us define the predicted value of L^j_i as L̄^j_i. We compute L̄^j_i as follows:

L̄^j_i ≈ a^j_i × t_i + b^j_i.

Basically, here we assume that the variation of the load during the time window [t_{i−φ}, t_{i−1}] is similar to the variation of the load during the time window [t_{i−φ+1}, t_i], which is a kind of self-similarity. Many traffic time series in the Internet exhibit characteristics of self-similarity [41] [42] [43] [44].
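As an illustration, the following is a sketch of this sliding-window least-squares predictor (assuming the loads L^j_k and the timeslot times t_k of the previous φ timeslots have already been collected; this is not the paper's code):

// Fits L ~ a*t + b over the last phi samples and extrapolates to t_i.
final class RegressionPredictor {
    static double predict(double[] t, double[] L, double ti) {
        int n = t.length; // n = phi previous timeslots
        double sumT = 0, sumL = 0, sumTT = 0, sumTL = 0;
        for (int k = 0; k < n; k++) {
            sumT += t[k];
            sumL += L[k];
            sumTT += t[k] * t[k];
            sumTL += t[k] * L[k];
        }
        // Closed-form least-squares coefficients a^j_i and b^j_i.
        double a = (n * sumTL - sumT * sumL) / (n * sumTT - sumT * sumT);
        double b = (sumL - a * sumT) / n;
        return a * ti + b; // the extrapolated load at t_i
    }
}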

The above method provides an acceptable way to predict future values when we have no further information about the future. For operators other than spouts (spouts being the operators that receive tuples from data sources), it is possible to predict more accurately based on more available information. We say that C_p is a parent operator of an operator C_j if tuples outputted from C_p are routed to C_j for further processing. Basically, it can be noticed that the incoming tuples to one operator are in fact outputted by all of its parent operators. We know that by default a tuple is considered failed if it is not completed by the whole topology within 30 seconds. From this, we can see that in most cases the time period one tuple stays in one operator is not long. Therefore, we have the following approximation for each non-spout operator C_j:

L̃^j_i = Σ_{C_p ∈ P_j} (ψ^p_i × ρ^p_i), where ψ^p_i = L^p_i + δ^p_i.    (1)

Here, P_j is the set of all parent components of C_j. ψ^p_i is the number of tuples C_p needs to process in timeslot i, which can be calculated as the prediction of C_p's incoming load plus the length change of its pending queue, i.e., L^p_i + δ^p_i. The equation above means that the number of incoming tuples to C_j is the sum of the tuples produced by all of C_j's parent operators. ρ^p_i is the average number of tuples C_p emits to the next operators for each tuple it processes during [t_i, t_{i+1}]. Therefore, the number of tuples C_p sends to its child operator C_j can be calculated as ψ^p_i × ρ^p_i.

There are two important variables in the equation, δ and ρ, and we estimate their values using the average of historical measurements as follows:

ρ^p_i ≈ (Σ_{k=i−φ}^{i−1} ρ^p_k) / φ

δ^p_i ≈ (Σ_{k=i−φ}^{i−1} δ^p_k) / φ.

In order to avoid congestion as much as possible, we propose to use the maximum of L̄^j_i and L̃^j_i as the final prediction of the load, i.e., L^j_i = max(L̄^j_i, L̃^j_i).
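A sketch of the combined predictor, under the assumption that the per-parent averages ρ^p_i and δ^p_i of Eq. (1) have already been estimated (names are illustrative, not the paper's code), could look as follows:

// Combines the regression estimate with the parent-based estimate of Eq. (1).
final class LoadPredictor {
    static double predict(double regressionEstimate, // the regression prediction
                          double[] parentLoad,       // predicted L^p_i per parent
                          double[] delta,            // avg queue change per parent
                          double[] rho) {            // avg emit ratio per parent
        double fromParents = 0;
        for (int p = 0; p < parentLoad.length; p++) {
            double psi = parentLoad[p] + delta[p];   // tuples parent p must process
            fromParents += psi * rho[p];             // tuples parent p emits to C_j
        }
        // Use the larger of the two predictions to avoid congestion.
        return Math.max(regressionEstimate, fromParents);
    }
}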

2) Resource Demand: Statistically, for an operator of a computation-intensive application, its capacity, i.e., the number of tuples it processes, is linear with the amount of CPU resources it utilizes. Let R^j_k denote the amount of CPU resources C_j utilized during the kth timeslot, which can be monitored using Cgroup as described later. We also know from monitoring that the number of tuples C_j has processed in this timeslot is ψ^j_k. Therefore, we can estimate the linear coefficient, i.e., the average CPU resources that C_j utilizes to process one tuple during the recent φ timeslots, as follows:

κ^j_i ≈ (Σ_{k=i−φ}^{i−1} (R^j_k / ψ^j_k)) / φ    (2)

Therefore, the CPU demand of C_j in timeslot i can be calculated as

R^j_i = ψ^j_i × κ^j_i.    (3)

The operator C_j has T_j instances, therefore r^j_i, the amount of resources that each instance requests, is as follows:

r^j_i = R^j_i / T_j.    (4)

Allocating exactly the resource share r^j_i to instances may result in two potential issues. As a queueing system with a huge number of tuples, the system tends to be less stable if the utilization rate stays at 100 percent. Furthermore, in terms of queue length, small deviations from the theoretical ideal values are normal in a practical environment, which would result in unnecessary minor changes of resource demand. Therefore, we adjust r^j_i in the following ways before enforcement.

First, we increase the resource share aggressively but decrease it conservatively. If r^j_i is larger than r^j_{i−1}, we immediately implement r^j_i. Otherwise, we stay at r^j_{i−1} as long as the accumulated decrease is less than θ/2. Second, we always assign a little more resources to each executor to improve the stability of the system. In detail,

r̂^j_i = ⌈(r^j_i + θ/2) / θ⌉ × θ.    (5)

This can be viewed as using θ percent of a single CPU core as the granularity of resource allocation. Roughly speaking, decreasing the granularity of resource allocation should be beneficial for reducing resource waste, but it increases the risk of instability in extreme cases. θ can be set according to the maximum acceptable (affordable) resource waste per instance.
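Putting Eqs. (2) to (5) together, a sketch of the per-instance share computation (hypothetical names; the hysteresis of the previous paragraph is omitted for brevity) might look like this:

// Estimates the CPU cost per tuple, derives the per-instance demand,
// and quantizes it in steps of theta (percent of a CPU core).
final class ResourceDemand {
    static double perInstanceShare(double[] R,           // CPU used in last phi slots
                                   double[] psi,         // tuples processed per slot
                                   double predictedPsi,  // predicted tuples to process
                                   int numInstances,     // T_j
                                   double theta) {
        double kappa = 0;                                // Eq. (2): CPU per tuple
        for (int k = 0; k < R.length; k++) kappa += R[k] / psi[k];
        kappa /= R.length;
        double demand = predictedPsi * kappa;            // Eq. (3): total CPU demand
        double r = demand / numInstances;                // Eq. (4): per-instance share
        // Eq. (5): round up with theta/2 headroom, in steps of theta.
        return Math.ceil((r + theta / 2.0) / theta) * theta;
    }
}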

3) The Enforcement of Resource Allocation: It remains a problem whether an instance can actually get its allocated share while running, because there are many instances from one topology or multiple topologies on one worker node and these instances compete for resources. Therefore, we have to enforce our allocation and avoid the uncertainty caused by resource contention. We know that Cgroup can control and monitor the resource usage of all threads. We propose to use Cgroup to make sure that no instance can use more resources than its allocated share. Since all instances only use their own shares, the performance interference caused by resource contention is removed.
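As a rough illustration of such enforcement, the following sketch uses the cgroup v1 cpu controller (CFS bandwidth control); the mount point and group naming are assumptions and may differ across distributions, and this is not the paper's implementation:

import java.io.IOException;
import java.nio.file.*;

// Caps one executor at sharePercent of a CPU core and attaches the
// executor's OS thread ID to the cgroup so the cap applies to it.
final class CgroupEnforcer {
    static void enforce(String group, long osThreadId, double sharePercent)
            throws IOException {
        Path dir = Paths.get("/sys/fs/cgroup/cpu", group); // assumed mount point
        Files.createDirectories(dir);
        long periodUs = 100_000;                           // scheduling period: 100 ms
        long quotaUs = (long) (periodUs * sharePercent / 100.0); // CPU time per period
        Files.write(dir.resolve("cpu.cfs_period_us"),
                    Long.toString(periodUs).getBytes());
        Files.write(dir.resolve("cpu.cfs_quota_us"),
                    Long.toString(quotaUs).getBytes());
        // Writing the OS thread ID into "tasks" places the thread in the group.
        Files.write(dir.resolve("tasks"), Long.toString(osThreadId).getBytes());
    }
}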

B. Placement Considering Resource and Migration

Now we need to decide the placement of the instances of different operators, i.e., the mapping of each instance to a worker node and a worker.

There are two cases in which the placement should be revised. First, some nodes would be overloaded at t_i due to the increasing demand of their own instances. Second, the number of nodes can be reduced due to the decreasing total demand of instances. Roughly speaking, the first case happens when the arriving load is becoming heavier, while the second case happens when the arriving load is becoming lighter.

The revision of the placement results in the migration of some instances. When an instance is migrated, its source worker and destination worker are restarted, and the application performance degrades. Such migration cost must be considered, therefore we should determine the placement at t_i based on the placement at t_{i−1} and try to minimize the number of affected workers.

Furthermore, the migration of stateful instances takes longer than the migration of stateless instances [32], because their states have to be saved and restored before and after migrations [45] [46]. Therefore, we should try to avoid migrations of stateful instances.

Besides the above considerations, some research efforts have pointed out that we should reduce unnecessary inter-node and inter-process communications to improve the performance of one topology [26]. Therefore the total number of nodes/workers should be minimized. As a result, we propose that there should be two workers per worker node for one application: one for stateful instances and one for stateless instances. In [26], in order to minimize communication cost, the authors suggest that there should be only a single worker on one worker node. Consider the following scenario. One stateless instance e1 needs to be migrated to another node, and this migration ought to be a stateless migration. However, stateful instances are also placed on the same worker as e1, which turns the migration into a costly stateful migration. Our proposal avoids such unnecessary stateful migrations.

In summary, we propose the following placement guidelines.

1) an instance can be placed on a node only if the node can provide sufficient resources;

2) there should be two workers on one node, for stateful and stateless instances separately, in order to avoid unnecessary stateful migrations;

3) the number of affected workers should be minimized when revising the placement;

4) the number of used nodes should be minimized, to minimize monetary cost and also communication cost.

Based on these guidelines, we propose heuristic algorithms to make the placement decision. Algorithm 1 deals with instances that cannot get sufficient resources without migration. Algorithm 2 presents three important functions invoked by Algorithm 1 to find suitable target workers for migrations. Algorithm 3 finds out whether the number of used nodes can be reduced. We explain the ideas in these algorithms in detail as follows.

1) Migration due to overload: For an overloaded node, at least one worker would be affected to move out some instances. We try the possible plans in sequence, i.e., disrupting its stateless worker, disrupting its stateful worker, or disrupting both workers. We can see that we prefer disrupting stateless workers when necessary, and we prefer disrupting only one worker rather than two workers on one node.

After all overloaded nodes are checked, we directly get a list of disrupted workers caused by overload. The resources of these disrupted workers are assumed to be released, and we will determine their placement locations and allocate resources for them later. The next problem is where to place these disrupted instances to solve the resource shortage.

We first try to avoid affecting any other workers. If a node currently has no stateful (stateless) worker, or its stateful (stateless) worker is in the list of disrupted workers, it would be a good target for placing stateful (stateless) instances since no extra worker is disrupted. This is shown as the function FindZeroAffected() in our algorithms.

Then we try the case in which only the target worker is disrupted, which is shown as the function FindOneAffected(). The target worker should also be added to the list. Note that for one node, we do not move a worker out to make room for the other worker to accommodate more instances, since in this case two workers are disrupted. As the last step, if we still need more resources, we have to scale out and use more nodes, which is shown as the function FindFreeNodes().

Now we have found sufficient resources, and we can place all instances of the disrupted workers. In this step, we sort the migrated instances by breadth-first search of the DAG and place the sorted instances in sequence. In this way, neighboring instances are likely to be placed close to each other. In this article, we focus on computation-intensive applications, in which communication cost minimization has low priority.

2) Try scale in: In each period, we check whether the number of used nodes can be reduced. We iteratively check the least loaded node and see if we can move its workers and merge them with workers on other nodes. Note that in order to limit the migration cost, we select only one target worker for each migrated worker.

Algorithm 1 Migrating Instances on Overloaded Nodes

n.sl and n.sf denote the stateless worker and stateful worker of node n. D(w) denotes the resource demand of worker w.

// step 1: find disrupted workers due to overload.
for n ∈ usedNodes do
    if D(n.sl) + D(n.sf) > Rn then          // n is overloaded
        if D(n.sf) < Rn then
            migrateSL.add(n.sl)
        else if D(n.sl) < Rn then
            migrateSF.add(n.sf)
        else
            migrateSL.add(n.sl)
            migrateSF.add(n.sf)
RTotal_sl ← Σ_{∀w∈migrateSL} D(w)
RTotal_sf ← Σ_{∀w∈migrateSF} D(w)
// step 2: find target workers.
FindZeroAffected(RTotal_sf, RTotal_sl)
if RTotal_sf > 0 or RTotal_sl > 0 then
    FindOneAffected(RTotal_sf, RTotal_sl)
if RTotal_sf > 0 or RTotal_sl > 0 then
    FindFreeNodes(RTotal_sf, RTotal_sl)
// step 3: assign disrupted instances to usable nodes.
orderedInstances ← sort instances of migrateSF and migrateSL by breadth-first search on the DAG
orderedNodes ← sort nodes by nodeID
for e ∈ orderedInstances do
    place e on the first node with usable and sufficient resources

IV. IMPLEMENTATION IN STORM

In this section, we introduce how we implement our scheduler in Storm. Figure 2 presents the architecture of our implementation. It is mainly composed of four modules, i.e., topology monitoring, scaling decision, placement decision, and resource monitoring and enforcement.

• topology monitoring: it monitors the real-time demand and performance of the topology (application) and collects data to learn its running status, such as the incoming data rate, the outgoing rate, the execution time of tuples, the internal structure, etc.

• scaling decision: based on the collected data, it computes the amount of resources demanded by instances.

• placement decision: it is responsible for assigning instances to slots on servers.

• resource monitor and enforcement: this module is responsible for enforcing the resources allocated to each instance by the scaling decision module.

Algorithm 2 Find target workers to acquire resources to deal with overloaded servers.

function FindZeroAffected(RTotal_sf, RTotal_sl)
    for n ∈ usedNodes do
        // stateful demand first
        if RTotal_sf > 0 and (n.sf = NULL or n.sf ∈ migrateSF) then
            if n.sf = NULL then
                CreateWorker(n.sf)
                migrateSF.add(n.sf)
            CpuAcquire = min(RTotal_sf, n.remainingRes)
            update RTotal_sf and n.remainingRes
        // now stateless demand
        if RTotal_sl > 0 and (n.sl = NULL or n.sl ∈ migrateSL) then
            do the similar thing for the stateless demand

function FindOneAffected(RTotal_sf, RTotal_sl)
    for n ∈ usedNodes do
        if RTotal_sf > 0 and n.remainingRes > 0 then
            migrateSF.add(n.sf)    // n.sf would be affected
            update RTotal_sf and n.remainingRes accordingly
        do the similar thing for the stateless demand

function FindFreeNodes(RTotal_sf, RTotal_sl)
    for n ∈ notUsedNodes do
        if RTotal_sf > 0 and n.remainingRes > 0 then
            create n.sf and migrateSF.add(n.sf)
            update RTotal_sf and n.remainingRes accordingly
        do the similar thing for the stateless demand

Algorithm 3 Try scale in to minimize the number of used servers.

The scheduler executes this algorithm to update the placement according to the predicted resource demands of executors.

while True do
    n ← the least loaded node    // n is the node with the highest possibility to be removed
    orderedUsedNodes ← sort nodes by their remaining resources in ascending order
    for n′ ∈ orderedUsedNodes do
        if n′ ≠ n and n′.remainingRes > D(n.sl) then
            targetSL = n′.sl
            update n′.remainingRes accordingly
        if n′ ≠ n and n′.remainingRes > D(n.sf) then
            targetSF = n′.sf
            update n′.remainingRes accordingly
    if targetSL ≠ 0 and targetSF ≠ 0 then    // remove n
        place n.sf on targetSF
        place n.sl on targetSL
    else
        return    // exit because no node can be removed


Fig. 2. Architecture of Our Implementation.

We plug our scheduler into Storm by implementing the IScheduler interface and modifying the storm.scheduler configuration in storm.yaml to replace the default scheduler with ours. Storm has two types of processing nodes: Nimbus and supervisor. The first three modules run on the Nimbus node and complete their responsibilities using Nimbus APIs. The last module, i.e. resource monitor and enforcement, is deployed on each supervisor node.

The design of our scaling decision and placement decision algorithms has been described in Section III. This section mainly focuses on how we monitor the running status and enforce our scheduling decisions using Storm APIs.
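A skeleton of such a pluggable scheduler for Storm 1.x might look as follows (the class name is hypothetical, and the body is only a sketch of where our modules hook in; storm.yaml would then point storm.scheduler at this class):

import java.util.Map;
import org.apache.storm.scheduler.Cluster;
import org.apache.storm.scheduler.IScheduler;
import org.apache.storm.scheduler.Topologies;
import org.apache.storm.scheduler.TopologyDetails;

public class AdaptiveScheduler implements IScheduler {
    @Override
    public void prepare(Map conf) {
        // Read scheduler settings (e.g., phi and theta) from the Storm config.
    }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        // Invoked periodically by Nimbus.
        for (TopologyDetails topology : topologies.getTopologies()) {
            // 1) collect the running status via the monitoring module,
            // 2) make the scaling decision (Section III-A),
            // 3) compute the placement (Section III-B),
            // 4) bind executors to slots, e.g.
            //    cluster.assign(slot, topology.getId(), executors);
        }
    }
}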

A. Topology Monitoring

As we have stated, we need to understand the internal structure of the topology under study, monitor its running, and collect various performance statistics. The information collected by our topology monitoring module is used to predict future load to find the corresponding optimal scheduling solution (Section III) and to evaluate the performance of a scheduling algorithm (Section V).

TABLE I
NIMBUS APIS USED IN THIS WORK

Nimbus API                          Function
Cj.get_inputs()                     all operators sending tuples to Cj
Cj.get_common()                     all neighboring operators of Cj
Cj.get_emitted()                    the number of tuples emitted by Cj
Cj.get_executed() (bolt)            the number of tuples processed by Cj
Cj.get_complete_ms_avg() (spout)    the average time a tuple stays in Storm
Cj.get_acked() (spout)              the number of tuples successfully completed
Cj.get_failed() (spout)             the number of tuples that time out

Table I lists all Nimbus APIs used in our work. Please note that the Nimbus API provides statistics from the start of the topology to the time point when the API is invoked. Therefore, in order to know the statistics during [t_i, t_{i+1}], we subtract the data collected at t_i from the data collected at t_{i+1}. For ease of reading, we skip this step in the following paragraphs.

1) The internal structure of the topology: We say that C_p is a parent operator of an operator C_j if tuples outputted from C_p are routed to C_j for further processing. Let P_j be the set of all parent operators of C_j, which can be obtained using the Nimbus API as follows:

P_j = C_j.get_inputs().

In the other direction, let CHILD(C_j) be the set of all child operators of C_j. We can get it as follows:

CHILD(C_j) = C_j.get_common() − C_j.get_inputs().

2) Statistics of Each Operator: We monitor the number of tuples processed and the number of tuples emitted by each operator C_j, using the Nimbus APIs C_j.get_executed() and C_j.get_emitted().

The tuples emitted by C_p are sent to its child operator C_j, so we can calculate the number of tuples C_j receives as follows:

L_j = Σ_{C_p∈P_j} C_p.get_emitted().

Please note that L_j has to be monitored by injecting measurement logic into the code of the instances if a parent has multiple child operators. If tuples arrive faster than C_j's capacity, some tuples are put in C_j's queue. We monitor the change of the queue length as follows:

δ_j = C_j.get_executed() − L_j.

For each incoming tuple, a different number of tuples might be emitted after it is processed by C_j, determined by the content of the incoming tuple and the code logic of the operator. We monitor the ratio as follows:

ρ_j = C_j.get_emitted() / C_j.get_executed().
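For illustration, a sketch that derives these three per-timeslot statistics from the (already differenced) cumulative counters could be (illustrative names, not the paper's code):

// Per-timeslot operator statistics computed from differenced Nimbus counters.
final class OperatorStats {
    final double load;   // L_j: tuples received from all parents
    final double delta;  // change of the pending-queue length
    final double rho;    // tuples emitted per processed tuple

    OperatorStats(long executed, long emitted, long receivedFromParents) {
        load = receivedFromParents;
        delta = executed - load;
        rho = emitted / (double) executed;
    }
}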

3) Statistics of the Topology: In order to evaluate the performance of different scheduling algorithms, we need to know the running statistics of the whole topology. In this work, we monitor its average latency (complete time), its throughput (the number of acked tuples) and the number of timeouts (failed tuples). The corresponding Nimbus APIs are listed in Table I.

B. The Enforcement of Resource Allocation

We depend on Cgroup, a function provided by the operating system (OS) of the servers, to enforce the desired resource allocation that we compute. Each executor of Storm is in fact a thread running on a JVM of one worker node. It has three identifiers, i.e., the executor ID assigned by Storm, the thread ID assigned by the JVM, and the thread ID assigned by the OS of the worker node. Cgroup runs in the Linux kernel, and it controls the resource usage of threads based on the thread IDs assigned by the OS. However, at runtime, by directly calling the Java library API, Storm can only get the thread ID assigned by the JVM. We therefore have to develop a function to help Storm find out the thread ID assigned by the OS for each executor.

So we choose to ask each executor to report its own thread ID assigned by the OS. After reading the code of the Storm project, we noticed that when a thread is spawned, the prepare function of the operator is invoked [47]. We rewrite the prepare function to include a code segment in which the thread ID is reported to our database.

The remaining challenge is how one executor correctly obtains its own thread ID assigned by the OS. The prepare function is written in Java and runs on the JVM (Java Virtual Machine). The API to get a thread ID in the Java library returns the thread ID on the JVM, which is not the thread ID assigned by the operating system. We get the OS-assigned thread ID as follows. In the prepare function, we use JNI (Java Native Interface) to call a function written in the C programming language, wherein we invoke the system call __NR_gettid to obtain the thread ID assigned by the OS. This task only needs to run once, when the thread is spawned, so it does not affect the performance of the topology.
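A sketch of this mechanism (the class name, library name and reporting call are hypothetical; the C side is shown in a comment) is:

// Reports the OS-assigned thread ID from an operator's prepare() method.
// The native side is one C function compiled into libgettid.so (assumed name):
//   JNIEXPORT jlong JNICALL Java_GettidHelper_gettid(JNIEnv *env, jclass cls) {
//       return (jlong) syscall(__NR_gettid);
//   }
public final class GettidHelper {
    static { System.loadLibrary("gettid"); } // loads libgettid.so

    // Returns the thread ID assigned by the Linux kernel, not by the JVM.
    public static native long gettid();
}

// Inside the rewritten prepare() of an operator, executed once per thread:
//   long osTid = GettidHelper.gettid();
//   reportThreadId(executorId, osTid); // hypothetical call storing it in the DB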

After the executor reports its thread ID, the information is stored in the database for Cgroup to use. During running, our scheme obtains a list of executor IDs on each worker node using the Nimbus API, finds the thread ID for each executor ID from the database, and then controls and monitors the resource usage of the executor using its thread ID.

Figure 3 illustrates the method described above to obtain and use the thread IDs of executors for resource allocation enforcement. The two solid lines (with a single arrow) represent the procedure of one executor obtaining its own thread ID and reporting it to the database. It is invoked only once, when the executor is spawned. The two dotted lines (with double arrows) represent the procedure of our enforcement module (Cgroup agent) controlling and monitoring the resource usage of executors on the worker node. This procedure is invoked periodically.

V. EXPERIMENTS AND PERFORMANCE EVALUATION

We implement our proposed scheduling framework in Storm release 1.1.1, and we deploy the system on a cluster with three worker nodes connected by a 1000 Mbps switch. Each worker node has two Intel Xeon E5-2620 v4 CPUs, i.e. in total each node has 16 physical CPU cores. We assign six cores for running Storm. The memory size of each node is 128 GB.

In the following experiments, our topology monitor keeps running and collects data every 10 seconds, which is also the time interval at which the scheduling algorithm runs. φ is set to 5, which means we predict the load of the next timeslot from the previous five timeslots, i.e., 50 seconds. The scaling granularity θ is set to 20.

We conduct experiments to evaluate our scheduling algorithm using a well-known data processing application (topology), namely the WordCount topology. As shown in Figure 4, the topology has a linear structure consisting of one Spout and two Bolts [48]. The Spout component is the source of the data stream; it receives data from Kafka by subscribing to a Kafka topic [49] and sends lines of a text file to Split. The Split bolt splits the data it receives into words and sends these words as tuples to the Count bolt using fields grouping. The Count bolt is responsible for counting the occurrences of each word. Here, Spout and Split are stateless components and Count is a stateful component.

We also conduct experiments with a more realistic application and a real data stream. We monitor all HTTP requests going through the border router of one dormitory region of a campus network. The monitor agent keeps sending tuples to our topology, and each tuple includes the information of one HTTP request. The topology of our application is shown in Figure 5. The Spout receives arriving tuples from the agent on the router. The Retrieve component retrieves a vector of (timestamp, destination IP) from each tuple. All vectors are sent to each of the Country, ISP, and Prefix components, which find the geographical location (country), the service provider (ISP) and the /16 prefix of the destination IP respectively. The Country component emits (timestamp, destination country) to the Stat.Ctry component. Stat.Ctry counts the number of visits to each country in the recent timeslot and predicts a "normal" value based on historical data using a sliding window algorithm. If the number of visits to one country within the recent timeslot is much smaller or much larger than the normal value, it sends the information to the Alarm component. The components Stat.ISP and Stat.Pfx work similarly to Stat.Ctry but focus on the visit numbers to each ISP and each prefix respectively.

As the campus network provides critical network service for tens of thousands of students and staff, we are not allowed to connect our system to the router directly. We obtained a dataset of 33 hours collected in December 2017 and developed a simulator to replay it for our experiments. Furthermore, in order to compare different schedulers under the same load, we also have to replay the HTTP traffic.

A. Evaluation Framework and Metrics

In this work, we compare the scheduling performance of our scheme with R-Storm [21]. The main reason is that R-Storm is the only resource-aware scheduler that has been built into Storm. We do not compare with the default scheduler because it has clear disadvantages. With the default scheduler, when a user submits a topology to Storm, he should specify the number of workers used by the topology and the number of executors for each component. The default scheduler puts these workers and executors evenly on all available worker nodes, which means load balancing is the most important consideration in its scheduling. In [26], the authors have pointed out that the default scheduling does not consider the negative effect of communication traffic across nodes and processes and always uses all available nodes regardless of workload, which makes it impossible to save operational cost and electricity cost by consolidating worker nodes. In [21], the authors also pointed out that the default scheduling disregards resource demands and availability and therefore can be inefficient at times.

In the following evaluation, the parallelism degree of components is set to 4 for both our scheduler and R-Storm. For R-Storm, we simulate two cases. In the first case, we set the resources of each executor to 30% of one CPU core, while in the second case the resources of each executor are set to 70%.
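For reference, these two cases can be expressed with the resource-aware scheduling API that R-Storm introduced into Storm, as in the snippet below, where setCPULoad() takes a percentage of one CPU core (100.0 means one full core). The component classes are the hypothetical ones from the earlier wiring sketch, and exact method availability depends on the Storm version.

```java
// Resource settings for the R-Storm(30) case; use 70.0 for R-Storm(70).
// Continues the hypothetical WordCount wiring sketch shown earlier.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new SentenceSpout(), 4).setCPULoad(30.0);
builder.setBolt("split", new SplitBolt(), 4)
       .shuffleGrouping("spout").setCPULoad(30.0);
builder.setBolt("count", new CountBolt(), 4)
       .fieldsGrouping("split", new Fields("word")).setCPULoad(30.0);
```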

We simulate fluctuations of the data stream arrival rate by controlling the speed of writing data to the Kafka topic. We use the following metrics to evaluate the performance of a scheduler.

• complete time: in the past 10 seconds, the average time each tuple takes from its arrival at the spout to the time point when the tuple is processed successfully by the last component.

• throughput: the number of tuples that are processed successfully in the past 10 seconds.

• number of failed tuples: the number of tuples that are not completed within 30 seconds (the default value of Storm) in the past 10 seconds.

• the number of workers and worker nodes the topology used in the past 10 seconds.

Fig. 3. Obtain Thread ID of Executors for Resource Enforcement.

Fig. 4. WordCount topology.

Fig. 5. The topology of HTTP traffic monitor and analyzer.

We depend on Nimbus APIs to collect statistics for the above metrics every 10 seconds. The Nimbus APIs are get_complete_ms_avg(), get_acked(), and get_failed().
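A sketch of this collection step through Storm's Thrift client is shown below. The 600-second window key and the printing are simplifications, and the exact layout of the generated classes can vary across Storm versions, so treat this as an approximation of our monitor rather than its actual code.

```java
import java.util.Map;
import org.apache.storm.generated.ExecutorSummary;
import org.apache.storm.generated.Nimbus;
import org.apache.storm.generated.SpoutStats;
import org.apache.storm.generated.TopologyInfo;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

// Sketch of pulling the three spout statistics from Nimbus every 10 s.
// The "600" key selects the 600-second statistics window (an assumption
// about the configured window; other keys include ":all-time").
public class MetricCollector {
    public static void collect(String topologyId) throws Exception {
        Nimbus.Client client =
            NimbusClient.getConfiguredClient(Utils.readStormConfig()).getClient();
        TopologyInfo info = client.getTopologyInfo(topologyId);
        for (ExecutorSummary es : info.get_executors()) {
            if (es.get_stats() != null && es.get_stats().get_specific().is_set_spout()) {
                SpoutStats ss = es.get_stats().get_specific().get_spout();
                Map<String, Double> completeMs = ss.get_complete_ms_avg().get("600");
                Map<String, Long> acked = ss.get_acked().get("600");
                Map<String, Long> failed = ss.get_failed().get("600");
                System.out.println(es.get_component_id() + " complete_ms=" + completeMs
                                   + " acked=" + acked + " failed=" + failed);
            }
        }
    }
}
```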

B. WordCount, Regular Fluctuations

Fig. 6. Input stream with regular fluctuations (y-axis: number of Kafka producers; x-axis: time in seconds, 0–1200).

In this experiment, the streaming rate changes as shown in Figure 6. During the first 5 minutes, we increase the data rate linearly every 10 seconds; during the second 5 minutes, we decrease the data rate every 10 seconds. The input stream repeats these changes in the latter 10 minutes.

Figure 7 presents the experiment results on scheduling performance. R-Storm(equal) means we configure R-Storm's parameters so that each of its operators uses the same amount of resources as our scheme. Please note that in reality it is not easy for users to specify demand properly. For clarity, we further calculate the average of all metrics and show them in Table II.

From Figure 7 and Table II, we can see that our scheduler clearly outperforms R-Storm(30) in terms of all three metrics. The average complete time decreases from 13115.4 ms to 7.6 ms, an improvement of almost 1725 times. The throughput increases from 93897 to 140995, an improvement of 1.5 times. The number of failed tuples decreases from 4629 out of 93897 to 699 out of 140995, an improvement of about 10 times in failure rate.

Allocating more resources to each executor obviously can improve the performance of the topology. Comparing our scheduler with R-Storm(70) and R-Storm(equal), we can see that they have fewer failed tuples, but our scheduler achieves a smaller complete time and a similar throughput. The failed tuples of our scheme are caused by migrations, and incorporating a loss-free migration mechanism would solve the issue. On the other hand, R-Storm(70) consumes much more resources than our scheme. R-Storm(70) always allocates 70% of a CPU core to each executor, therefore it needs 12 × 0.7 = 8.4 CPU cores in total during running (3 components, each with 4 executors), which means two worker nodes are always demanded.

In terms of the number of workers and worker nodes, we can see that with our scheduler the topology only needs one worker node and two workers during most of the running time. On average, it uses 2.6 workers and 1.6 worker nodes. Using fewer workers and worker nodes reduces inter-node and inter-process communication cost.

Figure 8 presents the CPU utilizations of the executors of the three components during the running of our scheduler. For the executors of each component, we plot our prediction (Equation 4), our allocation (Equation 5), and the real utilization rate (collected by our monitor). We can see that they adapt to the fluctuating incoming data rates.


Fig. 7. Performance statistics of the schedulers (Regular). (Panels vs. time (s): complete time (ms), number of tuples processed, and number of tuples failed; curves: Our, R-Storm(30), R-Storm(70), R-Storm(equal).)

Fig. 8. Resource allocation and real usage of three components (Regular). (Panels vs. time (s): CPU utilization (%) of the spout, split, and count components, each showing allocation, prediction, and monitored usage, plus the numbers of worker nodes and workers.)

TABLE II
AVERAGE PERFORMANCE METRICS (REGULAR)

                 complete time   throughput   fail number
our method             7.6 ms        140995           699
R-Storm(30)        13115.4 ms         93897          4629
R-Storm(70)           23.9 ms        142710             0
R-Storm(equal)        5079 ms        143897             0

In summary, under a regularly fluctuating incoming data stream, our method can adapt resource allocation to the arrival data rate, and thus it achieves good performance with fewer resources.

C. WordCount, Random Fluctuations

Fig. 9. Input stream with random fluctuations (y-axis: number of Kafka producers; x-axis: time in seconds, 0–1200).

In this experiment, the streaming rate changes as shown in Figure 9. Every 10 seconds, we generate a random integer in [1, 25], which is the number of Kafka producers. In this way, we get an input stream with random fluctuations. Figure 10 presents the experiment results on scheduling performance. For clarity, we further calculate the average of all metrics and show them in Table III.

From Figure 10 and Table III, we can see that our scheduler achieves a significant improvement compared with R-Storm(30) in terms of all performance metrics. Our scheduler experiences a longer complete time than R-Storm(70), but it is almost 30 times better than R-Storm(equal). The throughput is similar. Again, the failed tuples of our scheme are caused by migrations, and incorporating a loss-free migration mechanism would solve the issue. In terms of the number of workers and worker nodes, with our scheduler the topology only needs one worker node and two workers during most of the running time. On average, our method uses 2.3 workers and 1.3 worker nodes, which means it causes less inter-node and inter-process communication cost and has a positive influence on the overall performance, such as complete time and throughput.

TABLE III
PERFORMANCE METRICS (RANDOM)

                 complete time   throughput   fail number
our method            66.5 ms         97474          1225
R-Storm(30)          14263 ms         85097          2830
R-Storm(70)            4.8 ms        107120             0
R-Storm(equal)      1857.1 ms       106940             0

D. HTTP Monitor and Analyzer

The left plot of Figure 11 shows the number of HTTP requests within 33 consecutive hours, which is the arriving load of our HTTP monitor. The performance statistics of the experiments are presented in Figure 11 and Table IV.

In order to make a comparison, we run R-Storm with three different parameter settings. We first run R-Storm with all available resources (6 cores × 3 servers) and monitor the CPU usage during running. In this way, we can get roughly perfect knowledge of the resource demand of each operator at any time point, which can be taken as a good reference for determining the settings of R-Storm.


Fig. 10. Performance statistics of the schedulers (Random). (Panels vs. time (s): complete time (ms), number of tuples processed, and number of tuples failed; curves: Our, R-Storm(30), R-Storm(70), R-Storm(equal).)

Fig. 11. Performance statistics of the HTTP monitor and analyzer. (Panels over the 33-hour trace: input records, complete time (ms), number of tuples processed, and number of tuples failed; curves: Our, R-Storm(max), R-Storm(avg), R-Storm(equal).)

TABLE IV
AVERAGE PERFORMANCE METRICS (HTTP)

                 complete time   throughput   fail number
our method             499 ms        545530          1881
R-Storm(avg)          8385 ms        497897         17391
R-Storm(max)         154.5 ms        530859             0
R-Storm(equal)         638 ms        530330             0

For each operator, we calculate its average and maximum resource demand during running, and we set them as the parameters of R-Storm(avg) and R-Storm(max), respectively. Then we run our scheduler and record its resource allocation at every time point. We calculate the total resources allocated to each operator and set parameters so that R-Storm consumes the same amount of resources as our scheduler. This experiment is named R-Storm(equal).

Please note that here we are assuming R-Storm has perfect knowledge of resource demand, which is impossible in the real world. To some extent this means our evaluation is biased in favor of R-Storm. As we have stated before, although R-Storm allows users to specify their resource demands explicitly when they submit their topologies, determining the optimal values to set remains an important challenge.

Our method does not require any knowledge of the traffic load in advance. Compared with R-Storm(equal), our method improves the complete time by about 1.2 times. Its throughput is even slightly better than that of R-Storm(max). Due to migrations without any special handling, our scheme has more failed tuples, but incorporating a loss-free migration mechanism would solve the issue.

E. Necessity of Resource Enforcement

There can be multiple topologies running on one Storm platform at the same time. Currently, Storm has no mechanism to guarantee that one topology really acquires the resources it requests, especially when there is resource contention among multiple topologies. That is why we propose to use Cgroup to enforce our resource allocation scheme. With Cgroup, we in fact implement resource isolation among topologies.
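A minimal sketch of this enforcement with cgroups v1 is shown below: it creates a cgroup per topology, caps it through the CFS quota, and attaches the executor thread IDs (obtained as in Figure 3) to the group. The paths, the 100 ms period, and the helper names are assumptions for illustration; our actual implementation may organize the cgroup hierarchy differently, and writing these files requires root privileges.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch of CPU enforcement with cgroups v1: one cgroup per topology,
// capped via the CFS quota, with executor thread IDs attached to it.
public class CgroupEnforcer {
    private static final Path BASE = Paths.get("/sys/fs/cgroup/cpu");

    // Allocate cpuCores (e.g. 1.4 cores) to the given topology's cgroup.
    public static void enforce(String topologyId, double cpuCores, long[] threadIds)
            throws IOException {
        Path group = BASE.resolve("storm-" + topologyId);
        Files.createDirectories(group);
        long periodUs = 100_000;                       // conventional CFS period: 100 ms
        long quotaUs = (long) (cpuCores * periodUs);   // quota proportional to cores
        Files.write(group.resolve("cpu.cfs_period_us"),
                    Long.toString(periodUs).getBytes(StandardCharsets.UTF_8));
        Files.write(group.resolve("cpu.cfs_quota_us"),
                    Long.toString(quotaUs).getBytes(StandardCharsets.UTF_8));
        for (long tid : threadIds) {                   // thread IDs of the executors
            Files.write(group.resolve("tasks"),
                        (tid + "\n").getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.APPEND);
        }
    }
}
```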

We conduct an experiment as follows. In Storm, we run two topologies, A and B. The input stream of A arrives at a constant rate, and we specify a sufficient amount of resources for its executors using R-Storm. B serves as a background topology in this experiment, and its input streaming rate changes dynamically. B also specifies its resource demand using R-Storm before it is started. Our goal is to study the influence of B's changing load on A's performance, i.e., the performance interference caused by resource contention.

The resulting performance statistics of A are presented in Figure 12. Obviously, the experiment with Cgroup achieves much better performance than the experiment without Cgroup. With Cgroup, A achieves shorter complete time, larger throughput, and fewer failed tuples. Furthermore, these performance metrics are more stable than in the experiment without Cgroup.

When there is no Cgroup for resource allocation, A's performance varies a lot, which demonstrates that its performance is affected by B's dynamic load. Furthermore, although A requests a sufficient amount of resources, it still experiences serious latency and loss rate. This is because the resources which are expected to be used by A are in fact taken over by B. Therefore, resource enforcement and isolation are necessary to guarantee the performance of a topology.

Fig. 12. Performance statistics of A with and without resource isolation/enforcement. (Panels vs. time (s): complete time (ms), number of tuples processed, and number of tuples failed, with and without Cgroup.)

VI. CONCLUSION AND FUTURE WORK

In this work, we propose an adaptive online scheme to schedule and enforce resource allocation in stream processing systems.


Our goal is to make sure that the system can achieve good performance under fluctuant load, e.g., less congestion under heavy load and less resource waste under light load. In particular, we calculate the desired resource amount of each executor in a more accurate and timely manner by taking both historical data rates and the internal structure of the topology into consideration. As the load fluctuates, our heuristic placement algorithm can deal with overloaded nodes and attempts scale-in properly, which minimizes the number of affected workers and used nodes. Moreover, as far as we know, we are the first to notice the negative effect of colocating stateful executors and stateless executors. Therefore, we point out that the placement algorithm should spawn two workers, instead of a single worker, for each topology, so that the migration of stateless executors does not result in the restart of stateful executors. We also implement a way to enforce that executors can really acquire the resource shares allocated by our scheduler, so that the performance uncertainty caused by resource contention is mitigated.

We implement our scheduling scheme and plug it into Storm, and our experiments demonstrate that in some respects our scheme achieves better performance than state-of-the-art solutions. We hope our research can provide some valuable insights into the scheduling problem of stream processing systems. We recognize that a mechanism to support loss-free migrations is an important research direction, since it can significantly improve the performance of adaptive schedulers.

ACKNOWLEDGMENT

The authors thank the editors and anonymous reviewers for taking the time to review this paper and for their suggestions that helped improve it. This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0801302 and the National Natural Science Foundation of China under Grant No. 61202356.

REFERENCES

[1] N. Hidalgo, D. Wladdimiro, and E. Rosas, "Self-adaptive processing graph with operator fission for elastic stream processing," Journal of Systems & Software, vol. 127, pp. 205–216, 2017.

[2] V. Cardellini, V. Grassi, F. Lo Presti, and M. Nardelli, "Optimal operator replication and placement for distributed stream processing systems," ACM SIGMETRICS Performance Evaluation Review, vol. 44, no. 4, pp. 11–22, 2017.

[3] A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom, STREAM: The Stanford Stream Data Manager. Springer, Berlin, Heidelberg, July 2016, pp. 317–336.

[4] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina et al., "The design of the Borealis stream processing engine," in CIDR, vol. 5, no. 2005, 2005, pp. 277–289.

[5] N. Jain, L. Amini, H. Andrade, R. King, Y. Park, P. Selo, and C. Venkatramani, "Design, implementation, and evaluation of the linear road benchmark on the stream processing core," in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. ACM, 2006, pp. 431–442.

[6] "TIBCO StreamBase and the TIBCO accelerator for Apache Spark," TIBCO Software Inc., Tech. Rep., 2017.

[7] A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran, "IBM InfoSphere Streams for scalable, real-time, intelligent transportation services," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010, pp. 1093–1104.

[8] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, "S4: Distributed stream computing platform," in Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2010, pp. 170–177.

[9] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica, "Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters," in Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, HotCloud'12. Berkeley, CA, USA: USENIX Association, 2012, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=2342763.2342773

[10] (2013) Storm project. [Online]. Available: http://www.storm-project.net/

[11] "Apache Flink," 2018. [Online]. Available: http://flink.apache.org/

[12] "Apache Heron," 2018. [Online]. Available: http://incubator.apache.org/projects/heron.html

[13] "Apache Samza," 2018. [Online]. Available: http://samza.apache.org/

[14] Cgroups. [Online]. Available: https://en.wikipedia.org/wiki/Cgroups

[15] S. Schneider, H. Andrade, B. Gedik, A. Biem, and K.-L. Wu, "Elastic scaling of data parallel operators in stream processing," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 2009, pp. 1–12.

[16] R. K. Kombi, N. Lumineau, and P. Lamarre, "A preventive auto-parallelization approach for elastic stream processing," in Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2017, pp. 1532–1542.

[17] X. Liu, A. V. Dastjerdi, R. N. Calheiros, C. Qu, and R. Buyya, "A stepwise auto-profiling method for performance optimization of streaming applications," ACM Transactions on Autonomous and Adaptive Systems, vol. 12, no. 4, pp. 24:1–24:33, 2018. [Online]. Available: https://doi.org/10.1145/3132618

[18] F. Lombardi, L. Aniello, S. Bonomi, and L. Querzoni, "Elastic symbiotic scaling of operators and resources in stream processing systems," IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 3, pp. 572–585, March 2018.

[19] C.-K. Shieh, S.-W. Huang, L.-D. Sun, M.-F. Tsai, and N. Chilamkurti, "A topology-based scaling mechanism for Apache Storm," International Journal of Network Management, vol. 27, no. 3, p. e1933, 2017.

[20] T. D. Matteis and G. Mencagli, "Proactive elasticity and energy awareness in data stream processing," Journal of Systems and Software, vol. 127, pp. 302–319, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0164121216301467

[21] B. Peng, M. Hosseini, Z. Hong, R. Farivar, and R. Campbell, "R-Storm: Resource-aware scheduling in Storm," in Proceedings of the 16th Annual Middleware Conference. ACM, 2015, pp. 149–161.

[22] T. Z. J. Fu, J. Ding, R. T. B. Ma, M. Winslett, Y. Yang, and Z. Zhang, "DRS: Auto-scaling for real-time stream analytics," IEEE/ACM Transactions on Networking, vol. 25, no. 6, pp. 3338–3352, Dec 2017.

[23] T. Heinze, V. Pappalardo, Z. Jerzak, and C. Fetzer, "Auto-scaling techniques for elastic data stream processing," in Proceedings of the IEEE 30th International Conference on Data Engineering Workshops (ICDEW). IEEE, 2014, pp. 296–302.

[24] G. Mencagli, M. Torquati, and M. Danelutto, "Elastic-PPQ: A two-level autonomic system for spatial preference query processing over dynamic data streams," Future Generation Computer Systems, vol. 79, pp. 862–877, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X1730938X

[25] L. Aniello, R. Baldoni, and L. Querzoni, "Adaptive online scheduling in Storm," in Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems. ACM, 2013, pp. 207–218.

[26] J. Xu, Z. Chen, J. Tang, and S. Su, "T-Storm: Traffic-aware online scheduling in Storm," in Proceedings of the 2014 IEEE 34th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2014, pp. 535–544.

[27] V. Cardellini, V. Grassi, F. Lo Presti, and M. Nardelli, "Optimal operator placement for distributed stream processing applications," in Proceedings of the 10th ACM International Conference on Distributed and Event-Based Systems, ser. DEBS '16. New York, NY, USA: ACM, 2016, pp. 69–80. [Online]. Available: http://doi.acm.org/10.1145/2933267.2933312

[28] K. G. S. Madsen, Y. Zhou, and J. Cao, "Integrative dynamic reconfiguration in a parallel stream processing engine," in 2017 IEEE 33rd International Conference on Data Engineering (ICDE), April 2017, pp. 227–230.

[29] M. Luthra, B. Koldehofe, P. Weisenburger, G. Salvaneschi, and R. Arif, "TCEP: Adapting to dynamic user environments by enabling transitions between operator placement mechanisms," in Proceedings of the 12th ACM International Conference on Distributed and Event-Based Systems, ser. DEBS '18. New York, NY, USA: ACM, 2018, pp. 136–147. [Online]. Available: http://doi.acm.org/10.1145/3210284.3210292

[30] Y. Zhu, E. A. Rundensteiner, and G. T. Heineman, "Dynamic plan migration for continuous queries over data streams," in Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '04. New York, NY, USA: ACM, 2004, pp. 431–442. [Online]. Available: http://doi.acm.org/10.1145/1007568.1007617

[31] B. Gedik, S. Schneider, M. Hirzel, and K.-L. Wu, "Elastic scaling for data stream processing," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1447–1463, 2014.

[32] V. Cardellini, M. Nardelli, and D. Luzi, "Elastic stateful stream processing in Storm," in Proceedings of the 2016 International Conference on High Performance Computing & Simulation (HPCS). IEEE, 2016, pp. 583–590.

[33] S. Rajadurai, J. Bosboom, W.-F. Wong, and S. Amarasinghe, "Gloss: Seamless live reconfiguration and reoptimization of stream programs," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '18. New York, NY, USA: ACM, 2018, pp. 98–112. [Online]. Available: http://doi.acm.org/10.1145/3173162.3173170

[34] A. Shukla and Y. Simmhan, "Toward reliable and rapid elasticity for streaming dataflows on clouds," in Proceedings of the 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), July 2018, pp. 1096–1106.

[35] Q.-C. To, J. Soto, and V. Markl, "A survey of state management in big data processing systems," The VLDB Journal, vol. 27, no. 6, pp. 847–872, Dec 2018.

[36] V. Cardellini, F. L. Presti, M. Nardelli, and G. R. Russo, "Decentralized self-adaptation for elastic data stream processing," Future Generation Computer Systems, vol. 87, pp. 171–185, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X17326821

[37] V. Cardellini, F. Lo Presti, M. Nardelli, and G. Russo Russo, "Optimal operator deployment and replication for elastic distributed data stream processing," Concurrency and Computation: Practice and Experience, vol. 30, no. 9, p. e4334, 2018.

[38] V. Cardellini, V. Grassi, F. L. Presti, and M. Nardelli, "Distributed QoS-aware scheduling in Storm," in Proceedings of the 2015 ACM International Conference on Distributed Event-Based Systems, 2015, pp. 344–347.

[39] L. Xu, B. Peng, and I. Gupta, "Stela: Enabling stream processing systems to scale-in and scale-out on-demand," in Proceedings of the 2016 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 2016, pp. 22–31.

[40] F. Starks, V. Goebel, S. Kristiansen, and T. Plagemann, Mobile Distributed Complex Event Processing—Ubi Sumus? Quo Vadimus? Cham: Springer International Publishing, 2018, pp. 147–180.

[41] J. Aracil, R. Edell, and P. Varaiya, "A phenomenological approach to Internet traffic self-similarity," in Proceedings of the 35th Annual Allerton Conference on Communication, Control and Computing, 1996, pp. 1–24.

[42] R. D. Smith, "The dynamics of Internet traffic: Self-similarity, self-organization, and complex phenomena," Advances in Complex Systems, vol. 14, no. 6, pp. 905–949, 2011.

[43] R. Donthi, R. Renikunta, R. Dasari, and M. Perati, "Study of delay and loss behavior of Internet switch-Markovian modelling using circulant Markov modulated Poisson process (CMMPP)," Applied Mathematics, vol. 5, no. 3, pp. 512–519, 2014.

[44] C. Dabrowski, "Catastrophic event phenomena in communication networks: A survey," Computer Science Review, vol. 18, pp. 10–45, 2015.

[45] X. Liu, A. Harwood, S. Karunasekera, B. Rubinstein, and R. Buyya, "E-Storm: Replication-based state management in distributed stream processing systems," in Proceedings of the 2017 International Conference on Parallel Processing, 2017, pp. 571–580.

[46] M. Yang and R. T. Ma, "Smooth task migration in Apache Storm," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015, pp. 2067–2068.

[47] P. T. Goetz and B. O'Neill, Storm Blueprints: Patterns for Distributed Realtime Computation. Packt Publishing Ltd, 2014.

[48] "WordCountTopology." [Online]. Available: http://t.cn/RajHPwZ

[49] Kafka. [Online]. Available: http://kafka.apache.org/