WHIZ: A Fast and Flexible Data Analytics System
Robert Grandl, Arjun Singhvi, Raajay Viswanathan, Aditya Akella

University of Wisconsin - Madison

Abstract— Today's data analytics frameworks are compute-centric, with analytics execution almost entirely dependent on the pre-determined physical structure of the high-level computation. Relegating intermediate data to a second-class entity in this manner hurts flexibility, performance, and efficiency. We present WHIZ, a new analytics framework that cleanly separates computation from intermediate data. It enables runtime visibility into data via programmable monitoring, and data-driven computation (where intermediate data values drive when/what computation runs) via an event abstraction. Experiments with a WHIZ prototype on a large cluster using batch, streaming, and graph analytics workloads show that its performance is 1.3-2× better than state-of-the-art.

1 Introduction

Many important applications in diverse settings rely on analyzing large datasets, including relational tables, event streams, and graph-structured data. To analyze such data, several systems have been introduced [4, 7, 14, 24, 33, 38-41, 45, 46]. These enable data parallel computation, where a job's analysis logic is run in parallel on data shards spread across multiple machines in large clusters.

Almost all these systems, be they for batch [14, 20, 45], stream [4, 46] or graph processing [24, 38, 39], have their intellectual roots in MapReduce [20], a time-tested data parallel execution framework^1. While they have many differences, the systems share a key attribute with MapReduce, in that they are compute-centric (§2). Their focus, like MapReduce, is on splitting a job's computational logic, and distributing it across tasks to be run in parallel. Like MapReduce, all aspects of the subsequent execution of the job are rooted in the job's computational logic, and its task-level distribution, i.e., the job computation structure. These include the fact that the compute logic running inside tasks is static and/or predetermined; intermediate data is partitioned and routed to where it is consumed based on the compute structure; and dependent tasks are launched when a fraction of upstream tasks they depend on finish. These attributes of job execution are not related to, or driven by, the properties of intermediate data, i.e., how much and what data is generated. Thus, intermediate data is a second-class citizen.

Compute-centricity was a natural choice for MapReduce. Knowing job structure beforehand simplifies carving computational units to execute tasks. Compute centricity provided clean mechanisms to recover from failures – only tasks on a failed machine needed to be re-executed. Job schedulers became simple because of having to deal with static inputs, i.e., fixed tasks/dependency structures. While originally designed for batch analytics, frameworks for streaming and graph analytics have shown that compute-centricity can be applied broadly.

^1 We are not referring to the MapReduce programming model here.

However, given the benefit of hindsight, is compute-centricity the right choice? Our experience with building cluster schedulers, query optimizers, and execution engines has shown that at least four impediments arise from compute-centricity (§2): (1) Intermediate data being an opaque entity means there is no way to adapt job execution based on runtime data properties. (2) Static parallelism and intermediate data partitioning inherent to compute-centric frameworks constrain adaptation to data skew and resource flux. (3) Execution schedules being tied to compute structure can lead to resource waste while tasks wait for input to become available. (4) Compute-based organization of intermediate data can result in storage hotspots and poor cross-job I/O isolation, and it curtails data locality. Thus, compute-centricity begets inflexibility, poor performance, and inefficiency, which hurts production applications and cluster deployments.

Instead of adopting point fixes to today's systems and working within compute-centricity's limitations, this paper seeks a ground-up redesign. We observe that the above limitations arise from (1) tight coupling between intermediate data and compute, and (2) intermediate data agnosticity. Our framework, WHIZ, cleanly separates computation from all intermediate data. Intermediate data is written to/read from a separate distributed key-value datastore. The store offers programmable visibility – applications can provide custom routines for monitoring runtime data properties. An event abstraction allows the store to signal to an execution layer when an application's runtime data values satisfy certain properties. Decoupling, monitoring, and events enable data-driven computation: based on data properties, WHIZ decides what logic to launch in order to further process data generated, how many parallel tasks to launch, when/where to launch them, and what resources to allocate to tasks. We show that such data-driven computation helps improve performance, efficiency, isolation, and flexibility, across a range of different application domains.

We make the following contributions in designing WHIZ: (1) We present scalable APIs for programmable intermediate data monitoring, and for applications leveraging events for data-driven actions. Our APIs balance expressiveness against overhead. (2) We show how to organize intermediate data from multiple jobs in the datastore so as to achieve data locality, fault tolerance, and cross-job isolation. Since obtaining an optimal data organization is NP-hard, we develop novel heuristics that carefully trade off among these objectives. (3) We develop novel iterative heuristics for the execution layer for data-driven task parallelism and placement. This minimizes runtime skew in data processed and lowers data shuffle cost. (4) We launch each task in a container whose size is late-bound to the actual data allocated to the task, ensuring low data processing skew and optimal efficiency under resource dynamics.

We have built a WHIZ prototype by modifying Tez [43] and YARN [47] (15K LOC). We currently support batch, graph, and stream processing. We deploy and experiment with our prototype on a 50-machine cluster in CloudLab [6]. We compare against several state-of-the-art compute-centric (CC) approaches. WHIZ improves median (95th percentile) job completion time (JCT) by 1.3-1.6× (1.5-2.2×) across batch, streaming, and graph analytics. WHIZ reduces idling by launching (the right number of appropriately-sized) tasks only when custom predicates on input data are met, and avoids expensive data shuffles even for consumer tasks. Under high cluster load, WHIZ offers 1.8× better JCT than CC due to better cross-job data management and isolation. WHIZ's data-driven actions enable computation to start much sooner than CC, leading to 1.3-1.6× better stream and graph analytics median JCTs.

2 Compute-Centric vs. Data-Driven

We begin with a brief overview of today's batch, stream and graph processing systems (§2.1). We then discuss the three key design principles of WHIZ (§2.2). Finally, we list the performance issues arising from compute-centricity and show how the data-driven design adopted by WHIZ overcomes them (§2.3).

2.1 Today: Compute-Centric Frameworks

Production batch [5, 20, 52], stream [4, 10, 52] or graph [23, 24, 38] analytics frameworks support the execution of multiple interdependent stages of computation. Each stage is executed simultaneously within different tasks, each processing different data shards, or partitions, to generate input for a later stage.

Fig. 1a is an example of a simple batch analytics job. Here, two tables need to be filtered based on provided predicates and joined to produce a new table. There are 3 stages: two maps for filtering and one reduce to perform the join. Execution proceeds as follows: (1) Map tasks from both stages execute first, with each task processing a partition of the corresponding input table. (2) Map intermediate results are written to local disk by each task, split into files, one per consumer reduce task. (3) Reduce tasks are launched when the map stages are nearing completion; each reduce task shuffles relevant intermediate data from all map tasks' locations, and generates output.

Figure 1: Simplified examples of existing analytics systems. (a) A batch analytics job. Intermediate data is partitioned into two key ranges, one per reduce task, and stored in local files at map tasks. (b) Data flow in a streaming job. Tasks in all stages are always running. Output of a stage is immediately passed to a task in a downstream stage. However, CPU is idle until the task in Stage 2 receives 100 records, after which computation is triggered.

A stream analytics job (e.g., Fig. 1b) has a similar model [10, 34, 53]; the main difference is that tasks in all stages are always running. A graph analytics job, in a framework that relies on the popular message passing abstraction [38], has a similar but simplified model: the different stages are iterations in a graph algorithm, and thus all stages execute the same processing logic (with the input being the output of the previous iteration).

As the above shows, today's frameworks are designed primarily with the goal of splitting up and distributing computation across multiple machines, making them compute-centric (§1). Intermediate data is a second-class citizen, strewn across many files at the locations where producer tasks are run, and routed along predetermined edges between producer and consumer tasks.

2.2 WHIZ: A Data-Driven Framework

WHIZ makes intermediate data a first-class citizen (Fig. 3). It achieves this by adopting three design principles:

Decoupling compute and data: WHIZ decouples compute from intermediate data (§4). Data from all stages across all jobs is written to/read from a separate key-value (KV) datastore, and managed by a distinct data management layer called the data service (DS). An execution service (ES) manages compute tasks.

Programmable data visibility: The above separation also enables low-overhead approaches to gain visibility into all runtime data (§5.1). A programmer can gather custom runtime data properties via a simple API.

Data-driven computation: Building on data visibility, WHIZ provides an API for applications to act on events. Events form the basis for an intermediate data publish-subscribe substrate. Programmers can define custom predicates on intermediate data properties for each stage, and events notify the application when intermediate data satisfies the predicates. Events help achieve data-driven computation: properties of intermediate data drive all aspects of further computation (§5.2).

2.3 Overcoming Issues with Compute-centricity

We contrast WHIZ with compute centricity along flexibility, performance, efficiency, placement, and isolation.

Data opacity, and compute rigidity: In compute-centric frameworks, there is no visibility into intermediate data generated by different stages of a job, and the tasks' computational logic is decided a priori. This prevents adapting the tasks' logic based on their input data. Consider the job in Fig. 1a. Existing frameworks determine the type of join for the entire reduce stage based on coarse statistics [3]; unless one of the tables is small, a sort-merge join is employed to avoid out-of-memory (OOM) errors. On the other hand, having per-key histograms of intermediate data would enable dynamically determining the type of join to use for different reduce tasks. A task can use a hash join if the total size of its key range is less than the available memory, and a merge join otherwise. WHIZ offers visibility into all intermediate data to support computation of such rich statistics, which can be used to decide the logic to apply.

Static parallelism, partitioning: Today, jobs' per-stage parallelism, inter-task edges and intermediate data partitioning strategy are decided independent of runtime data and resource dynamics. In Spark [52] the number of tasks in a stage is determined a priori by the user application or by SparkSQL [9]. A hash partitioner is used to place an intermediate (k,v) pair into one of |tasks| buckets. Pregel [38] vertex-partitions the input graph; partitions do not change during the execution of the algorithm.

This limits adaptation to resource flux and data skew. A running stage cannot utilize newly available compute resources [26, 27, 37] and dynamically increase its parallelism. If some key (or some vertex program) in a partition has an abnormally large number of records (or messages) to process, then the corresponding task is significantly slowed down [13], affecting both stage and overall job completion times. Data skew is hard to predict.

With WHIZ, because runtime data is managed independently (§4), compute and parallelism for downstream stages can be late-bound (§6.1). Based on the actual volume of runtime data, and the current resources, we determine how many tasks to launch and how to provision them. This controls data skew, and provisions task resources proportional to the data to be processed.

Idling due to compute-driven scheduling: Modern schedulers [22, 47] decide when to launch tasks for a stage based on the static computation structure. When a stage's computation is commutative+associative, schedulers launch its tasks once 90% of all tasks in upstream stages complete [5]. But the remaining 10% of producers can take long to complete [13], resulting in tasks idling.

Idling is worse in streaming, where consumer tasks are continuously waiting for data from upstream tasks. E.g., consider the streaming job in Fig. 1b. Stage 2 computes and outputs the median for every 100 records received. Between computations, S2's tasks stay idle. As a result, the tasks in the downstream stage S3 also lie idle. To avoid idling, tasks should be scheduled only when, and only as long as, relevant input is available. Above, computation should be launched only after ≥ 100 records have been generated by an S1 task.

Likewise, in batch analytics, if computation is commutative+associative, it is beneficial to "eagerly" launch tasks to process intermediate data whenever enough data has been generated to process in one batch, and exit soon after they are done. This is challenging to achieve today due to compute-driven scheduling and lack of data visibility.

Idling is easily avoided with WHIZ: programmable monitoring aids runtime data statistics collection; when statistics indicate that relevant data has been generated, events are triggered, aiding the launch of relevant tasks.

Placement, and storage isolation: Because intermediate data is spread across producer tasks' locations, it is impossible to place consumer tasks in a data-local fashion. Such tasks are placed at random [20] and forced to engage in expensive shuffles that consume a significant portion of job run times (∼30% [19]).

Also, when tasks from multiple jobs are collocated, it becomes difficult to isolate their hard-to-predict runtime intermediate data I/O. Tasks from jobs generating large intermediate data may occupy much more local storage and I/O bandwidth than those generating less.

Since the WHIZ store manages data from all jobs, it can enforce policies to organize data to meet per-job objectives, e.g., data locality for any stage (not just input-reading stages), and to meet cluster objectives, such as I/O hotspot avoidance and cross-job isolation.

2.4 Related Work

Data opacity: Almost all database and big data SQL systems [9, 11, 49] use statistics computed ahead of time to optimize execution. Adaptive query optimizers (QOs) [21] use dynamically collected statistics and re-invoke the QO to re-plan queries top-down. In contrast, WHIZ alters the query plans on-the-fly at the execution layer based on runtime data properties, thereby circumventing additional expensive calls to the QO. Optimus [30] allows changing application logic based on approximate statistics operators that are deployed alongside tasks. However, the system targets simple computation logic rewriting, and cannot enable other data-driven benefits, e.g., adapting parallelism, and rightsizing tasks.

Skew handling and static parallelism: Some parallel databases [25, 31, 50] and big data systems [35] dynamically adapt to data skew for single large joins. In contrast, WHIZ holistically solves data skew for all joins across multiple jobs and further strives to achieve data locality. [25, 35] deal with skew in MapReduce by dynamically splitting data for slow tasks into smaller partitions and processing them in parallel. But they can cause additional data movement from already slow machines, leading to poor performance. [15] mitigates skew by cloning slow tasks and adaptively partitioning work. However, it has no data visibility and its data organization does not consider data locality and fault tolerance.

Decoupling: Naiad [40] and StreamScope [48] also decouple intermediate data. They tag intermediate data with vector clocks which are used to trigger compute in the correct order. Thus, both support ordering-driven computation, orthogonal to data-driven computation in WHIZ. Also, Naiad assumes the entire data fits in memory. StreamScope is not applicable to batch/graph analytics.

Storage inefficiencies: For batch analytics, [29, 42] address storage inefficiencies by pushing intermediate data to appropriate external data services (like Amazon S3, Redis) while remaining cost efficient and running on serverless platforms. Similarly, [32] is an elastic data store used to store intermediate data of serverless applications. However, since this data is still opaque, and compute and storage are managed in isolation, these systems cannot support data-driven computation or achieve data locality and load balancing simultaneously.

1  job = Whiz.createJob("SamplePrg", Whiz.BATCH)
2  stage1 = Whiz.createStage(job, "s1", InputImpl, TriggerImpl)
3  stage2 = Whiz.createStage(job, "s2", CountImpl)
4  stage3 = Whiz.createStage(job, "s3", SortImpl)
5  Whiz.addDependency(job, stage1, stage2)
6  Whiz.addDependency(job, stage2, stage3)
7  Whiz.addModifyLogic(job, stage3, ModifyImpl)
8  Whiz.addDataMonitor(job, stage2, DS.monitor.num_entries)
9  def ModifyImpl(job, statEvent):
10     for key, count in statEvent.stats:
11         if count < 500:
12             return
13     replaceStage(job, statEvent.stageID, EmptyImpl)
14 def TriggerImpl():
15     for k in keys:
16         if (DS.monitor.num_entries(k) >= 100):
17             return true
18     return false

Figure 2: An example three-stage job in WHIZ.

3 WHIZ Programming Model

WHIZ extends the DAG-based programming model of existing frameworks [10, 20, 38, 43] in a few key ways to support data-driven computation. Contrary to existing frameworks, WHIZ does not require users to provide low-level details such as stage parallelism and data partitioning strategy. The minimal embellishment to familiar DAG-based programming, and freedom from low-level details, ensure that WHIZ is simple to program atop.

Figure 3: WHIZ control flow among the WHIZ client, the execution service (ES), and the data service (DS). The client receives the submitted program; the ES manages computation and pushes intermediate data to the DS, which manages data, signals when data is ready for processing (data_ready events), and drives runtime DAG changes.

In WHIZ, using the addDataMonitor API (Table 1, row 6), a user can add a built-in or custom module that computes statistics over data generated by a stage in a data-parallel program. Using the createStage API (Table 1, row 2), a user can provide predicates on the collected data properties (StageDataReadyTriggerImpl) to determine when a downstream stage can consume the current stage's data. Using the addModifyAction API (Table 1, row 4), computation logic can be changed at runtime based on data properties.

As an example, consider a 3-stage job that processes words and, for words with < 500 occurrences, sorts them by frequency. The program structure is S1 → S2 → S3, where S1 processes input words, S2 computes word occurrences, and S3 sorts the ones with < 500 occurrences. In WHIZ it can be realized as shown in Fig. 2. Job composition details (lines 1-6) are similar to existing frameworks. We specify the implementation of TriggerImpl for S1 and ModifyImpl for S3, which help realize data-driven computation. TriggerImpl specifies that as soon as 100 occurrences of a word are seen, S2 can start running (lines 14-18), facilitating pipelined execution of S1 and S2. With the ModifyImpl for S3 (lines 9-13), if the data consists only of words with ≥ 500 occurrences, then S3 does not need to execute (its compute is replaced by a null operation).

In Appendix A, we show several other example applications written in WHIZ. Our prototype supports batch SQL analytics, graph algorithms, and streaming.

The overall control flow in WHIZ is shown in Fig. 3. The user program is provided to the WHIZ client. The execution service (ES) runs the root vertex and writes its output to the datastore (step 3). The data service (DS) stores the received data and notifies the ES when data is ready for further processing via an event (step 4). The DS sends data statistics (e.g., the per-key counts above) to the WHIZ client via events as well (step 4). On receiving a data-ready event, the ES (step 5) queries the client for compute changes. The above process repeats. Interactions between the ES, DS, and the client are hidden from the user.


1. createJob(name:Str, type:Type): Creates a new job, which can be of type BATCH, STREAM or GRAPH.
2. createStage(j:Job, name:Str, impl:StageImpl, trigger:StageDataReadyTriggerImpl): Adds a stage of computation to a job with a custom implementation. Optionally, users can specify custom predicates that determine when downstream stages can consume the current stage's data through StageDataReadyTriggerImpl. Otherwise, default triggers are applied depending on job type.
3. addDependency(j:Job, s1:Stage, s2:Stage): Adds a starts-before relationship between stages s1 and s2.
4. addModifyAction(j:Job, s:Stage, impl:ModifyImpl): Changes stage computation logic as specified by ModifyImpl. ModifyImpl is called when a statEvent arrives to rewrite the job description.
5. replaceStage(j:Job, s:Stage, impl:AlterStageImpl): Replaces a stage implementation with an alternative implementation logic.
6. addDataMonitor(j:Job, s:Stage, impl:DataMonitorImpl): Adds a data monitor to compute statistics over data generated by stage s. DataMonitorImpl can either be a built-in module or a customized one.

Table 1: WHIZ Programmer APIs.

4 Data Store

In WHIZ, all jobs' intermediate data is written to/read from a separate data store, where it is structured as <key, value> pairs. In batch/stream analytics, the keys are generated by the stage computation logic itself; in graph analytics, keys are identifiers of the vertices to which messages (values) are destined for processing in the next iteration. We now address how this data is organized in the store by the data service (DS). The DS does this organization via a cluster-wide master (DS-M).

An ideal data organization should achieve three cross- and per-job goals: (1) It should load balance and spread all jobs' data; specifically, it should avoid hotspots, improve cross-job isolation, and minimize within-job skew in tasks' data processing. (2) It should maximize job data locality by co-locating as much data having the same key as possible. (3) It should be fault tolerant: when a storage node fails, recovery should have minimal impact on job runtime. Next, we describe our data storage granularity, which forms the basis for meeting our goals.

4.1 Capsule: A Unit of Data in WHIZ

WHIZ groups intermediate data based on <key>s into groups called capsules. A stage's intermediate data is organized into some large number N of capsules; crucially, N is late-bound as described below, which helps meet our goals above. The intermediate data key range is split N ways, and each capsule stores all <k, v> data from a given range. WHIZ strives to materialize all of a capsule's data on one machine; rarely, a capsule may be spread across a small number of machines. This materialization property of capsules forms the basis for consumer task data locality. In contrast, in today's systems, data from the same key range may be written at different producers' locations.
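To make the capsule abstraction concrete, the sketch below shows one way a producer could map a <k, v> pair to one of a stage's N capsules by hashing the key into N ranges; this is an illustration under that assumption, not WHIZ's actual partitioning code, and N itself is late-bound by the DS (§4.2).

import hashlib

def capsule_index(key, num_capsules):
    # Hash the key to a stable integer, then map it into one of
    # num_capsules key ranges; all <k, v> pairs in the same range
    # accumulate in the same capsule.
    digest = hashlib.sha1(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_capsules

# Example: route a pair to its capsule for a stage with N = 48 capsules.
capsule = capsule_index("user_42", 48)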

Also, today, key ranges of the intermediate data partitions are tied to pre-determined task parallelism, whereas WHIZ capsule key ranges are unrelated to compute structure and late-bound. Specifically, we first determine the set of machines on which capsules from a stage are to be stored; machines are chosen to maximize the ability to simultaneously support isolation, load balancing, locality, and fault tolerance; the choice of machines then determines the number N for a stage's capsules (§4.2).

Given these machines and N, as the stage produces data at run time, N capsules are materialized, and dynamically allocated to right-sized tasks; this is done in a way that preserves data-local processing, lowers skew, and optimally uses compute resources (§6).

Thus, late binding realizes an objectives-driven partitioning and consumption of intermediate data.

4.2 Fast Capsule Allocation

We consider how to place multiple jobs' capsules on machines to avoid hotspots, ensure data locality and minimize the impact of data loss on job runtime. We formulate a binary integer linear program (see Fig. 11 in Appendix B) to this end. Solving this ILP at scale can take several tens of seconds, delaying capsule placement.

WHIZ instead uses a simpler, practical approach to the capsule placement problem. First, instead of jointly optimizing global placement decisions for all the capsules, WHIZ solves a "local" problem of placing capsules for each stage independently (while still considering inter-stage dependencies); when new stages arrive, or when existing capsules may exceed the job quota on a machine, new locations for some of these capsules are determined. Second, instead of solving a multi-objective optimization, WHIZ uses a linear-time rule-based heuristic to place capsules; the heuristic prioritizes load and locality (in that order) in case machines satisfying all objectives cannot be found. Isolation (quota) is always enforced.

Capsule location for new stages: When a job j is ready to run, DS-M invokes an admin-provided heuristic h1 (Table 2) that assigns j a quota Q_j per machine. When a stage v of job j starts to generate intermediate data, DS-M invokes h2 to determine the number of machines M_v over which to organize v's data. h2 picks M_v between 2 and a fraction of the total machines on which j is using ≤ 75% of its quota Q_j. M_v ≥ 2 ensures opportunities for data-parallel processing (§6.1); a bounded M_v (Table 2) controls the ES task launch overhead (§6.1).
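Read as code, h2 (Table 2) amounts to the following small routine; this is a sketch of the heuristic as described, with quota_used taken to be a hypothetical helper returning j's current storage usage on a machine.

def pick_num_machines(job, machines, quota, quota_used):
    # M_j75: machines where job j currently uses < 75% of its quota Q_j.
    m_j75 = sum(1 for m in machines if quota_used(job, m) < 0.75 * quota)
    m = len(machines)
    # h2: M_v = max(2, M_j75 * (M - M_j75) / M); at least 2 machines are
    # chosen so the data can be processed in parallel, and the count stays
    # bounded to limit ES task-launch overhead.
    return max(2, int(m_j75 * (m - m_j75) / m))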

Given M_v, DS-M invokes h3 to generate a list of machines, \vec{M}_v, on which to materialize data. It starts by creating three sub-lists: (1) For load balancing, machines are sorted lightest-load-first, and only ones at ≤ 75% quota usage for the corresponding job are considered. (2) For data locality, we prefer machines which already materialize other capsules for this stage v, or capsules from other stages whose output will be consumed by the same downstream stage as v (e.g., the two map stages in Fig. 1a). (3) For fault tolerance, we pick machines where there are no capsules from any of v's k upstream stages in the job, sorted in descending order of k. Thus, for the largest value of k, we have all machines that do not store data from any of v's ancestors; for k = 1 we have nodes that store data from the immediate parent of v.

h1: Q_j: max storage quota per job j and machine m, based on fairness considerations across all runnable jobs J.
h2: M_v: number of machines (out of M) over which to organize the data that will be generated by stage v of job j.
    a. Count the number of machines M_j75 where j is using < 75% of Q_j;
    b. M_v = max(2, M_j75 × (M − M_j75) / M).
h3: Given M_v, compute the list of machines \vec{M}_v; consider only machines where j is using < 75% of Q_j.
    a. Pick machines that provide LB, DL and the maximum possible FT;
    b. If |\vec{M}_v| < M_v, relax the FT guarantees and pick machines that provide LB and DL;
    c. If |\vec{M}_v| is still < M_v, pick machines that just provide LB.
h4: Given M_v, compute the total number of capsules: N = G × M_v, where G = capsules per machine.
h5: Which machines are at risk of violating Q_j? \vec{M}_j: machines which store data of j and on which j is using ≥ 75% of Q_j.
h6: Which capsules are hot on \vec{M}_j? Those significantly larger in size or with a higher rate of increase than others.

Table 2: Heuristics employed in data organization (LB = load balancing, DL = data locality, FT = fault tolerance).

We pick machines from the sub-lists to maximally meet our objectives in 3 steps: (1) Pick the least loaded machines that are data local and offer as high fault tolerance as possible (machines present in all three sub-lists). Note that as we go down the fault tolerance list in search of a total of M_v machines, we trade off fault tolerance. (2) If, despite reaching the minimum possible fault tolerance, i.e., the bottom of the fault tolerance sub-list, the number of machines picked falls below M_v, we completely trade off fault tolerance and pick the least loaded machines that are data local (machines present in the load balancing and data locality sub-lists). (3) If the number of machines picked still falls below M_v, we trade off data locality and simply pick the least-loaded machines.
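The three-step selection can be sketched as below; it is an illustrative reading of h3, not DS-M's code, and it assumes the three sub-lists have been built as just described (load-balanced list sorted lightest-load-first, fault-tolerance list sorted by decreasing k).

def pick_machines(lb_list, dl_set, ft_list, m_v):
    picked = []

    def take(candidates, ok):
        # Append qualifying machines until we have M_v of them.
        for m in candidates:
            if len(picked) == m_v:
                break
            if m not in picked and ok(m):
                picked.append(m)

    # Step 1: least-loaded, data-local machines with the best fault tolerance.
    take(ft_list, lambda m: m in lb_list and m in dl_set)
    # Step 2: trade off fault tolerance; keep load balance and data locality.
    take(lb_list, lambda m: m in dl_set)
    # Step 3: trade off data locality; just pick the least-loaded machines.
    take(lb_list, lambda m: True)
    return picked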

Finally, given \vec{M}_v, DS-M invokes h4 and instantiates a fixed number (G) of capsules per machine, leading to a total of N = G × M_v capsules per stage. While a large G would aid us in better handling skew and computation, since the capsules can be processed in parallel, it comes at the cost of significant scheduling and storage overheads. We empirically study the sensitivity to G (in §9.4); based on this, our prototype uses G = 24.

New locations for existing capsules: Data generation patterns can vary significantly across different stages and jobs, due to heterogeneous compute logic and data skew. Thus, a job j may run out of its quota Q_j on a machine m, leaving no room to grow the already-materialized capsules of j on m. Thus, DS-M periodically reacts to runtime changes by determining, ∀ j: (1) which machines are at risk of being overloaded; (2) which capsules on these machines to spread to other locations; and (3) which machines to spread them to.

Given a job j, DS-M invokes h5 to determine machines where j is using ≥ 75% of its quota Q_j. DS-M then starts closing some capsules of j on these machines; future intermediate data for these is materialized on another machine, thereby mitigating potential hotspots. Specifically, DS-M invokes h6 to pick capsules that are either significantly larger in size or have a higher size increase rate than others for j on m. These capsules are more likely to dominate the load and potentially violate Q_j. Focusing on them bounds the number of capsules that will spread out. DS-M groups the selected capsules based on the stage which generated them, and invokes heuristic h3 as before to compute the set of machines to spread to. Grouping helps to maximize data locality, and h3 provides load balance and fault tolerance.
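A rough sketch of this periodic check, combining h5 and h6 from Table 2; the 2x thresholds and the dictionary-shaped inputs are assumptions made for illustration, not values from the paper.

def find_capsules_to_spread(job_machines, used, quota, sizes, growth):
    # used[m]: job j's storage use on machine m; sizes[m] / growth[m]:
    # per-capsule size and growth rate of j's capsules on m.
    to_spread = {}
    for m in job_machines:
        if used[m] < 0.75 * quota:          # h5: only machines near Q_j
            continue
        caps = sizes[m]
        avg_size = sum(caps.values()) / len(caps)
        avg_rate = sum(growth[m].values()) / len(growth[m])
        # h6: capsules much larger, or growing much faster, than their peers.
        to_spread[m] = [c for c, s in caps.items()
                        if s > 2 * avg_size or growth[m][c] > 2 * avg_rate]
    return to_spread

DS-M would then close the returned capsules, group them by producing stage, and re-invoke h3 to pick their new locations, as described above.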

5 Data Visibility

We now describe how WHIZ offers runtime programmable data monitoring (§5.1), and data-driven computation using events (§5.2).

5.1 Data Monitoring

WHIZ consolidates a capsule at one or a few locations (§4.1). Thus, capsules can be analyzed in isolation, simplifying data visibility. We achieve scalable monitoring via per-job masters (DS-JMs) which track lightweight stats at the capsule level.

WHIZ supports both built-in and customizable modules that periodically gather statistics per capsule, spanning properties of keys and values. These statistics are carried to the DS-JM, where they are aggregated before being used by the ES to take further data-driven actions (§5.2).

Built-in modules constantly collect lightweight statistics such as current capsule size, number of (k,v) pairs, and rate of growth; in addition to supporting user programs, these are used by the store in runtime data organization (§4.2). Custom modules are UDFs (user-defined functions). Since supporting arbitrary UDFs can impose high overhead, we restrict UDFs to those that can execute in linear time with O(1) state. We provide a library of common UDFs, such as computing the number of entries for which values are <, =, or > a threshold.
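As an illustration of the kind of custom module this admits, the class below counts entries whose values exceed a threshold in linear time with O(1) state; the method names are hypothetical and only meant to suggest the shape of a DataMonitorImpl, not WHIZ's actual interface.

class ThresholdCountMonitor:
    """Counts (k, v) pairs whose value exceeds a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = 0          # O(1) state, as WHIZ requires of UDFs

    def on_spill(self, pairs):
        # One linear pass over each batch of (key, value) pairs appended
        # to the capsule.
        for _, value in pairs:
            if value > self.threshold:
                self.count += 1

    def stats(self):
        # Periodically shipped to the DS-JM, where stats are aggregated.
        return {"values_above_threshold": self.count}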

5.2 Acting on Monitored Data Properties

WHIZ supports data-driven computation via two key abstractions: events and ready triggers. The decoupled data and execution services interact with each other via events, which trigger, and track the progress of, data-driven computation. Ready triggers enable the DS-JM to decide when capsules can be deemed ready for the corresponding computation to be run on them.

Events: WHIZ introduces 3 types of events: (1) A data_ready event is sent by the DS-JM to the ES whenever a capsule becomes ready (as per the ready trigger definition) to trigger the corresponding computation. (2) A data_generated event is sent by the ES to the DS-JM when a stage in the user program finishes generating all its intermediate output. This event is required because in WHIZ the DS-JM is unaware of the number of tasks that the ES launches corresponding to a stage, and thus cannot determine when a stage is completed. (3) A data_ready_all event is sent by the DS-JM to the ES when a stage has generated all its intermediate data which is ready to be consumed by an immediate downstream stage. This event is required as the ES is unaware of the status of capsules (the total number of capsules, and whether they are done being materialized).

Figure 4: Data-driven computation facilitated by WHIZ events. (1) Intermediate data (v1) batches are sent from the ES to the DS. (2) The DS detects that 2 capsules are ready and sends data_ready events to the ES, leading to downstream computation (v2). (3) The ES sends a data_generated event to the DS when the entire output of v1 has been pushed to the DS. (4) The DS sends a data_ready_all event to the ES indicating that all data_ready events have been sent.

The use of these events is exemplified in Fig. 4. Here: (1) When a stage v1 generates a batch of intermediate data, a data_spill containing the data is sent to the data store, which accumulates it into capsules (t0 through tm). (2) Whenever the DS-JM determines that a collection of v1's capsules (2 capsules in Fig. 4 at tn) are ready for further processing, it sends a data_ready event per capsule to the ES; the ES could launch tasks of a consumer stage v2 to process such capsules. This event carries per-capsule information such as: a list of machine(s) on which each capsule is spread, and a list of (aggregated) statistics (counters) collected by the built-in data modules. (3) Finally, a data_generated event – from the ES, generated upon v1's computation completing – notifies the DS-JM that v1 finished generating data_spills. (4) Subsequently, the DS-JM notifies the ES via the data_ready_all event that all capsules corresponding to v1 have sent their data_ready events and have completely materialized (at tm). This enables the ES to determine when the immediate downstream stage v2, which reads the data generated by v1, has received all of its input data.

Triggers: The interaction between the ES and DS via events is enabled by triggers whose logic is based on statistics collected by the data modules. A WHIZ program can provide custom ready triggers for each job stage, which the WHIZ client transfers to the DS-JM. WHIZ also implements a default ready trigger which is otherwise applied.

Default ready trigger: Here, the DS-JM deems capsules ready when the computation generating them is done; this is akin to a barrier in batch systems today and bulk synchronous execution in graph analytics. For a streaming job, the DS-JM deems a capsule ready when it has ≥ X records (X is a system parameter that a user can configure) from producer tasks, and sends a data_ready event to the ES. On receiving this, the ES executes a consumer stage task on this capsule. This is akin to micro-batching in existing streaming systems [53], with the crucial difference that the micro-batch is not based on wall clock time, but on the more natural intermediate data count.

Custom ready trigger: If available, the DS-JM deems capsules ready using custom triggers. Programmers define these triggers based on knowledge about the semantics of the computation performed, and the type of data properties sought.

Consider the partial execution of a batch (or graph) analytics job, consisting of the first two logical stages (likewise, the first two iterations) v1 → v2. If the processing logic in v2 contains commutative+associative operations (e.g., sum, min, count, etc.), it can start processing its input before all of it is in place. For this, the user can define a pipelining ready trigger, and instruct the DS-JM to consider a capsule generated by v1 ready whenever the number of records in it reaches a threshold X. This enables the ES to overlap v2's computation with v1's data generation as follows: (1) Upon receiving a data_ready event from the DS-JM for capsules which have > X records, the ES launches tasks of v2. (2) Tasks read the current data, compute the commutative+associative function on the (k,v) data read, and push the result back to the data store (into the same capsules advertised through the received data_ready event). (3) The DS-JM waits for each capsule to grow back beyond threshold X before generating subsequent data_ready events. (4) Finally, when a data_generated event is received from v1, the DS-JM triggers a final data_ready event for all the capsules generated by v1, and a subsequent data_ready_all event, to enable v2's final output to be written in capsules and fully consumed by a downstream stage, say v3 (similar to Fig. 4). Such pipelining speeds up jobs in batch and graph analytics, as we show in §9.1.2.
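A minimal sketch of such a pipelining ready trigger, assuming the DS-JM hands the trigger the per-capsule record counts gathered since each capsule was last consumed; the function and argument names are illustrative, not part of WHIZ's API.

def pipelining_ready_capsules(entries_since_last_read, threshold_x):
    # A capsule produced by v1 is deemed ready for the commutative+associative
    # consumer v2 as soon as it has accumulated at least X new records.
    return [capsule for capsule, count in entries_since_last_read.items()
            if count >= threshold_x]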


The above pipelining ready trigger can be extended across capsules. E.g., the DS-JM could deem all capsules ready when the number of entries generated across all capsules crosses a threshold. In streaming, such triggers help improve efficiency and performance (§9.1.3). In a two-stage streaming job v1 → v2, v2 may (re)compute a weighted moving average whenever 100 data points with distinct keys are generated by v1. A pipelining trigger can be used to signal all v1 capsules as ready once said threshold is met. This brings data-drivenness to stream analytics: computation is performed only when the required records have streamed into the system.

In a similar manner, we can trigger processing based on special records. Stream processing systems often rely on "low watermark" records to ensure event-time processing [18, 36], and to support temporal joins [36]. Custom triggers can be used to launch, on demand, temporal operators whenever a low watermark record is observed at any of a stage's output capsules.

6 Execution Service

The ES launches a per-job master (ES-JM) which late-binds the job's computation, i.e., given intermediate data ready for processing, and available resources^5, (a) it determines optimal parallelism and deploys tasks to minimize skew and shuffle, and (b) it maps capsules to tasks in a resource-aware fashion (§6.1). The ES-JM design naturally mitigates stragglers (§6.2), and facilitates data-driven compute logic changes (§6.3).

6.1 Task Parallelism, Placement, and Sizing

Given a set of ready capsules (C) for a stage, the ES-JM maps subsets of capsules to tasks, and determines the location (across machines M) and the size of the corresponding tasks based on available resources. This multi-decision problem, which evens out the data volume processed by tasks in a stage, and minimizes shuffle subject to resource constraints on each machine, can be cast as a binary ILP (omitted for brevity). However, the formulation is non-linear; even a linear version is slow to solve at scale. For tractability, we propose an iterative procedure that applies a set of heuristics (Table 3) repeatedly until tasks for all ready capsules are allocated, and their locations and sizes (resources) determined.

The iterative procedure consists of 3 steps: (a) generate optimal subsets of capsules to minimize cross-subset skew (using h7), (b) decide on which machine a subset should be processed to minimize shuffle (h8), and (c) determine the resources required to process each subset (h9).

First, we group the capsules C into a collection of subsets, \vec{C}, using h7. We then try to assign each group to a task. Our grouping into subsets attempts to ensure that data in a subset is spread on just one or a few machines (lines (b.i)-(b.iii)), which minimizes shuffle, and that the total data is spread roughly evenly across subsets (line (b.iv)), making cross-task performance uniform. We place a bound CaMax, equaling twice the size of the largest capsule, on the total size of a subset (see line (a)). This bound ensures that multiple (at least 2) capsules are present in each subset, which helps in mitigating stragglers (§6.2).

^5 Similar to existing frameworks, a cluster-wide Resource Manager decides available resources as per cross-job fairness.

h7: \vec{C}: subsets of unprocessed capsules.
    a. CaMax = 2 × |c|, where c is the largest capsule in C;
    b. Group all capsules in C into subsets in strict order:
       i. data-local capsules together;
       ii. each spread capsule, together with data-local capsules;
       iii. any remaining capsules together;
       subject to:
       iv. each subset's size ≤ CaMax;
       v. conflicting capsules are not grouped together;
       vi. troublesome capsules are always grouped together.
h8: \vec{M}: preferred machines to process each subset in \vec{C}.
    c. No machine preference for troublesome subsets in \vec{C};
    d. for every other subset in \vec{C}, pick a machine m such that:
       i. all capsules in the subset are materialized only at m;
       ii. otherwise, m contains the largest materialization of the subset.
h9: Compute \vec{R}: the resources needed to execute each subset in \vec{C}.
    e. \vec{A} = resources available to j on machines \vec{M};
    f. F = min over m ∈ \vec{M} of (\vec{A}[m] / total size of capsules allocated to m);
    g. for each subset i ∈ \vec{C}: \vec{R}[i] = F × total size of capsules allocated to \vec{C}[i].

Table 3: Heuristics to group capsules and assign them to tasks.
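Stripped of the conflicting/troublesome cases, the grouping step of h7 can be read as the greedy packing below; this is a simplification for illustration, assuming each ready capsule has a single home machine.

def group_capsules(capsules, size, home):
    # size[c]: bytes in capsule c; home[c]: machine where c is materialized.
    ca_max = 2 * max(size[c] for c in capsules)          # line (a)
    subsets, current, current_size, current_home = [], [], 0, None
    # Keep capsules materialized on the same machine together (line b.i),
    # and cap each subset at CaMax (line b.iv).
    for c in sorted(capsules, key=lambda c: home[c]):
        if current and (home[c] != current_home
                        or current_size + size[c] > ca_max):
            subsets.append(current)
            current, current_size = [], 0
        current.append(c)
        current_size += size[c]
        current_home = home[c]
    if current:
        subsets.append(current)
    return subsets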

Second, we determine a preferred machine to process each subset using h8; this is a machine where most, if not all, capsules in the subset are materialized (line (d)). Choosing a machine in this manner minimizes shuffle.

Finally, given available resources across the preferred machines (from the cluster-wide resource manager [47]), we need to allocate tasks to process subsets. But some machines may not have resource availability. For the rest of this iteration, we ignore such machines and the subsets of capsules that prefer such machines.

Given machines with resources \vec{A}, we assign a task to each subset of capsules which can be processed, and allocate task resources altruistically using h9. That is, we first compute the minimum resource available to process a unit of data (F; line (f)). Then, for each task, the resource allocated (line (g)) is F times the total data in the subset of capsules allocated to the task (|\vec{C}[i]|).
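The altruistic sizing rule h9 then becomes a one-pass computation over the chosen machines; available and the other inputs are assumed to come from the resource manager and the grouping step above.

def size_tasks(subsets, preferred_machine, subset_size, available):
    # Total data each preferred machine must serve for this round.
    data_on = {}
    for i in range(len(subsets)):
        m = preferred_machine[i]
        data_on[m] = data_on.get(m, 0) + subset_size[i]
    # F: minimum resources available per unit of data (line (f)).
    f = min(available[m] / data_on[m] for m in data_on)
    # Each subset's task gets F times its total data (line (g)).
    return [f * subset_size[i] for i in range(len(subsets))]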

Allocating resources proportional to input size, coupled with roughly equal subset sizes, ensures that tasks have roughly equal finish times in processing their subsets of capsules. Furthermore, by allocating resources corresponding to the minimum available, our approach realizes altruism: if a job gets more resources than what is available for the most constrained subset, that does not help the job's completion time (because completion time depends on how fast the most constrained subset is processed). Altruistically "giving back" such resources helps speed up other jobs or other stages in the same job.

The above 3 steps repeat whenever new capsules are ready, or existing ones can't be scheduled. Similar to delay scheduling [51], we attempt several tries to execute a group which couldn't be scheduled on its preferred machine due to resource unavailability, before marking its capsules conflicting. These are re-grouped in the next iteration (line (b.v)). Finally, capsules that cannot be executed under any grouping are marked troublesome (line (b.vi)) and processed on any machine (line (d.ii)).

6.2 Handling Stragglers

Data organization into capsules and late-binding computation to data enable a natural, simple, and effective straggler mitigation technique. If a task struck by resource contention makes slower progress than others in a stage, then, given that there are multiple capsules assigned to a task, the ES-JM simply splits the task's capsule group into two, in proportion to the task's processing speed relative to the average speed of other tasks in the stage. It then assigns the larger group of capsules to a new task, and places the task using the approach above. This addresses stragglers via work reallocation, as opposed to using clone tasks [11-13, 35, 54] as in compute-centric frameworks. Cloning wastes resources and duplicates work.
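The split itself can be sketched as follows; the names and the exact proportionality rule are a plausible reading of the description above, not code from WHIZ.

def split_for_straggler(remaining_capsules, task_rate, avg_peer_rate):
    # Split the slow task's capsule group in proportion to its processing
    # speed relative to its peers; the slower the task, the less it keeps.
    n = len(remaining_capsules)
    keep = max(1, int(round(n * task_rate / (task_rate + avg_peer_rate))))
    keep_group = remaining_capsules[:keep]
    # The larger remainder goes to a fresh task placed via the approach in §6.1.
    reassigned_group = remaining_capsules[keep:]
    return keep_group, reassigned_group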

6.3 Runtime DAG Changes

Visibility into data enables run-time changes to tasks' logic. In WHIZ, we introduce statistics and status events, defined next, to support this. Whenever a capsule is deemed ready, the DS-JM sends a statistics event to the WHIZ client, which carries the capsule statistics. Status events help the ES-JM query the WHIZ client to check if the user program requires alternate logic to be launched based on the observed statistics.

Upon receiving data_ready events from the DS-JM, the ES-JM sends status events to the WHIZ client to determine how to process a given capsule. Given the user-provided compute logic and capsule statistics obtained through statistics events, the WHIZ client notifies the ES to take one of the following actions: (1) no new action – assign computation as planned; (2) ignore – don't perform any computation, as the user program deemed the capsule to not have any useful data to compute on; (3) replace computation with new logic supplied by the user.
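As a concrete use of these actions, the per-key-range join choice from §2.3 could be expressed as a small decision callback on capsule statistics; HashJoinImpl and SortMergeJoinImpl are assumed stage implementations, and the interface is simplified relative to Fig. 2.

def choose_join_logic(capsule_size_bytes, task_memory_bytes,
                      HashJoinImpl, SortMergeJoinImpl):
    # capsule_size_bytes: total size of the key range (capsule) the task
    # will process, from the built-in size monitor.
    if capsule_size_bytes <= task_memory_bytes:
        return HashJoinImpl        # fits in memory: use the faster hash join
    return SortMergeJoinImpl       # otherwise avoid OOM with a sort-merge join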

7 Fault Tolerance

Task Failure. When a WHIZ task fails due to a machine failure, only the failed tasks need to be re-executed if their input capsules are not lost. However, re-execution can result in duplicate data in all capsules for the stage, leading to intermediate data inconsistencies. To address this, we use checksums at the consumer-task-side WHIZ library to suppress duplicate spills.

However, if the failed machine also contains the input capsules of the failed task, then the ES-JM triggers the execution of the upstream stage(s) to regenerate the input capsules of the failed task. Recall that WHIZ's fault-tolerance-aware capsule storage (§4) helps control the number of upstream (ancestor) stages that need to be re-executed in case of data loss.

DS-M/DS-JM/ES-JM. WHIZ maintains replicas of the DS-M/DS-JM daemons using Apache ZooKeeper [28], and fails over to a standby. Given that WHIZ decides the task composition of a job at runtime in a data-driven manner, upon ES-JM failure we simply need to restart it so that it can resume handling events from the DS. During this time, already launched tasks continue to run.

8 Implementation

We prototyped WHIZ by modifying Tez [5] and leveraging YARN [47]. WHIZ's core components are application agnostic and support diverse analytics, as shown in §9.

The DS was implemented from scratch and consists of three kinds of daemons: a cluster-wide master (DS-M) and per-job masters (DS-JM), which we discussed in earlier sections, and workers (DS-W). The DS-W runs on cluster machines and conducts node-level data management. It handles storing the data received from the ES or from other DS-Ws in a local in-memory file system (tmpfs [44]) and transfers data to other DS-Ws per DS-M directives. It collects statistics and reports to the DS-M/DS-JMs via heartbeats. Finally, it provides ACKs to ES tasks for the data they write. We use YARN to launch/terminate daemons.

The ES was implemented by modifying components of Tez to enable data-driven execution. ES tasks are modified Tez tasks that have an interface to the local DS-W as opposed to local disk or cluster-wide storage. The WHIZ client is a standalone per-job process.

All communication (asynchronous) between the DS, ES and WHIZ client is implemented through RPCs in YARN using Google Protobuf [8]. We also use RPCs for communication between the YARN Resource Manager (RM) and the ES-JM to propagate job resource allocations (§6).

9 Evaluation

We evaluated WHIZ on a 50-machine cluster deployed on CloudLab [6] using publicly available benchmarks: batch TPC-DS jobs, PageRank for graph analytics, and synthetic streaming jobs. Unless otherwise specified, we set WHIZ to use default ready triggers, equal storage quota (Q_j = 2.5 GB) and 24 capsules per machine.

Figure 5: (a) CDF of JCT; (b) CDF of factors of improvement of individual jobs using WHIZ w.r.t. baselines [2, 52]; (c) Running tasks over time; (d) Per-machine average, min and max storage load.

9.1 Testbed Experiments, Different Applications

Workloads: We consider a mix of jobs, all from TPC-DS (batch), or all from PageRank (graph). In each experiment, jobs are randomly chosen and follow a Poisson arrival distribution with an average inter-arrival time of 20s. Each job lasts up to tens of minutes, and takes as input tens of GBs of data. For streaming, we created a synthetic workload from a job which periodically replays GBs of text data from HDFS and returns the top 5 most common words among the first 100 distinct words found. We run each experiment thrice and present the median.

Cluster, baseline, metrics: Our machines have 8 cores, 64GB of memory, 256GB storage, and a 10Gbps NIC. The cluster network has a congestion-free core. We compare WHIZ as follows: (1) Batch: vs. Tez [5] running atop YARN [47], for which we use the shorthand "Hadoop" or "CC", and vs. SparkSQL [14]; (2) Graph: vs. Giraph (i.e., open-source Pregel [38]), and vs. GraphX [24]; (3) Streaming: vs. SparkStreaming [53]. We study the relative improvement in the average job completion time (JCT), i.e., Duration_CC / Duration_WHIZ. We measure efficiency using makespan. For a fair comparison with Hadoop and Giraph, we ensure that they use tmpfs.

9.1.1 Batch Analytics
Performance and efficiency: Fig. 5a shows the JCT distributions of WHIZ, Hadoop, and Spark for the TPC-DS workload. Only the jobs in the highest 0.4 (1.2) percentile are worse off, by ≤ 1.06× (≤ 1.03×), relative to Hadoop (Spark). WHIZ speeds up jobs by 1.4× (1.27×) on average, and 2.02× (1.75×) at the 95th percentile, w.r.t. Hadoop (Spark). Also, WHIZ improves makespan by 1.32× (1.2×).

Fig. 5b presents the improvement for individual jobs. For more than 88% of jobs, WHIZ outperforms Hadoop and Spark. Only 12% of jobs slow down, to ≤ 0.81× (0.63×), when using WHIZ. Gains are > 1.5× for > 35% of jobs.
Sources of improvements: We observe that more rapid processing and better data management contribute most of the benefits.

First, we snapshot the number of running tasks across all the jobs in one of our experiments when running WHIZ and Hadoop (Fig. 5c). WHIZ has 1.45× more tasks scheduled over time, which translates to jobs finishing 1.37× faster. It has 1.38× better cluster efficiency than Hadoop. Similar observations hold for Spark (omitted).

The main reasons for rapid processing and high efficiency are: (1) the DS (§4.2) ensures that most tasks are data-local (76% in our experiments), which improves average consumer task completion time by 1.59×; resources thus freed can be used by other jobs' tasks. (2) Our ES provides similar input sizes for tasks in a stage (§6.1) – within 14.4% of the mean (more in §9.2).

Second, Fig. 5d shows the cross-job total intermediate data per machine. We see that Hadoop spreads load across machines in a heavily imbalanced manner. This creates many storage hotspots and slows down tasks competing on those machines. Spark is similar. WHIZ mitigates hotspots (§4), improving overall performance.
WHIZ slowdown: We observe that jobs generating less intermediate data are more prone to performance losses, especially under ample resource availability. One reason is that WHIZ strives for capsule-local task execution (§6.1): if resources are unavailable, WHIZ will either assign the task to a data-remote node or pay a penalty waiting for data-local placement. Also, WHIZ's gains are lower w.r.t. Spark. This is an artifact of our Hadoop-based implementation and of using a non-optimized in-memory store.

Only 18% of capsules across all jobs are spread across machines. Also, > 25% of the jobs whose performance improves processed "spread-out" capsules, and ≤ 14% of the slowed jobs processed spread capsules.

9.1.2 Graph Processing
We run multiple PageRank (40 iterations) jobs on the Twitter graph [16, 17]. WHIZ groups data (messages exchanged over algorithm iterations) into capsules based on vertex ID. We use a custom ready trigger (§5.2) so that a capsule is processed only when ≥ 1000 entries are present.

Fig. 6a shows the JCT distribution of WHIZ, GraphX, and Giraph. WHIZ speeds up jobs by 1.33× (1.57×) on average and 1.57× (2.24×) at the 95th percentile w.r.t. GraphX (Giraph) (Fig. 6b). Gains are lower w.r.t. GraphX due to its efficient implementation atop Spark. However, < 10% of jobs are slowed down, by ≤ 1.13×.

Figure 6: (a) CDF of JCT using WHIZ, GraphX [24] and Giraph [1]; (b) CDF of factors of improvement of individual jobs using WHIZ w.r.t. GraphX and Giraph; (c) CDF of JCT using WHIZ and SparkStreaming [53]; (d) CDF of factors of improvement of individual jobs using WHIZ w.r.t. SparkStreaming.

Improvements arise for two reasons. First, WHIZ is able to deploy the appropriate number of tasks only when needed: custom ready triggers immediately indicate data availability, and runtime parallelism (§6.1) allows messages to high-degree vertices [24] to be processed by more than one task. Also, WHIZ has 1.53× more tasks (each running multiple vertex programs) scheduled over time; rapid processing and runtime adaptation to data directly lead to jobs finishing faster. Second, because of triggered compute, WHIZ does not hold resources for a task when they are not needed, resulting in 1.25× better cluster efficiency.

9.1.3 Stream Processing
We configure the SparkStreaming micro-batch interval to 1 minute. With WHIZ, we implemented a custom ready trigger to enable computation whenever ≥ 100 distinct entries are present in the intermediate data. Figures 6c and 6d show our results. WHIZ speeds up jobs by 1.33× on average and 1.55× at the 95th percentile. Also, 15% of the jobs are slowed down, to around 0.8×.

The reason for the gains is data-driven computation via custom ready triggers; WHIZ does not have to delay execution until the next micro-batch if data can be processed now. A SparkStreaming task has to wait, as it has no data visibility. In our experiments, more than 73% of executions happen at intervals of less than 40s with WHIZ.

WHIZ suffers due to its implementation atop a non-optimized streaming stack. Launching tasks in YARN is significantly slower than acquiring tasks in Spark. This can be exacerbated by ES stickiness to data locality. However, this is amortized over long job run times.

9.1.4 WHIZ Overheads
CPU, memory overhead: We find that DS-W (§8) processes inflate memory and CPU usage by a negligible amount even when managing data close to storage capacity. DS-M and DS-JM have similar resource profiles.
Latency: We compute the average time to process heartbeats from the various ES/DS daemons and the WHIZ client. For 5000 heartbeats, the time to process each is 2-5ms. We implemented the WHIZ client and ES-JM logic atop the Tez AM. Our changes inflate AM decision logic by ≤ 14ms per request, with a negligible increase in AM memory/CPU. Network overhead from events/heartbeats is negligible.

Figure 7: Fraction of tasks allocated by CC w.r.t. WHIZ vs. the fraction of skew among tasks due to CC w.r.t. WHIZ, in a given stage: (a) for a job with 12 stages where WHIZ improves JCT by 1.6×; (b) for a job with 6 stages where WHIZ improves JCT by 1.2×. CCSkew/WhizSkew > 1 means WHIZ has less skew; CCTasks/WhizTasks < 1 means CC under-parallelizes a given stage.

9.2 Benefits of Data-driven Computation
The overall benefits of WHIZ above also include the effects of dynamic parallelism/placement/sizing (§6.1) and straggler mitigation (§6.2). We now delve deeper into them to shed more light on the benefits of data-driven computation during one of our TPC-DS runs.
Skew and parallelism: Fig. 7 shows skew and parallelism under CC relative to WHIZ for two TPC-DS jobs from one of our runs. WHIZ's ability to dynamically change parallelism at runtime, driven by the number of capsules for each vertex, leads to significantly less data skew than CC. When CC under-parallelizes, its skew is significantly higher than WHIZ's (up to 1.43×). Over-parallelizing does not help either; CC incurs up to 1.15× larger skew due to its rigid data partitioning and task allocation schemes. Even when WHIZ incurs more skew (up to 1.26×), the corresponding tasks get allocated more resources to alleviate this overhead (§6.1).
Straggler mitigation: We evaluated the benefits of WHIZ's straggler mitigation strategy w.r.t. (1) the default speculative execution strategy from CC and (2) no mitigation strategy. We slow down the same set of 10% of the machines in the cluster for each run. Table 4b shows our results. WHIZ's straggler mitigation strategy improves JCT by 1.47× on average w.r.t. no strategy; stragglers can significantly impact overall performance. Even when CC's speculative execution strategy is enabled, WHIZ performs 1.19× better. This is due to its ability to reassign only the unprocessed input data to the speculative tasks instead of launching clones of the straggler tasks.
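To make this difference concrete, the following is a minimal sketch (our own illustration; the capsule names and the set of already-started capsules are hypothetical, not WHIZ APIs) contrasting CC-style speculative cloning with the capsule-reassignment strategy described above.

def cc_speculate(straggler_inputs):
    # CC: launch a clone over *all* of the straggler's input; both copies run,
    # and the job takes whichever finishes first (work is duplicated).
    clone_inputs = list(straggler_inputs)
    return {"straggler": list(straggler_inputs), "clone": clone_inputs}

def whiz_reassign(straggler_inputs, already_started):
    # WHIZ: the straggler is told to finish only the capsule it has begun;
    # a helper task gets the capsules the straggler has not touched yet.
    keep = [c for c in straggler_inputs if c in already_started]
    helper = [c for c in straggler_inputs if c not in already_started]
    return {"straggler": keep, "helper": helper}

caps = ["capsule-0", "capsule-1"]
print(cc_speculate(caps))                  # clone redoes both capsules
print(whiz_reassign(caps, {"capsule-0"}))  # helper gets only capsule-1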

Additionally, we run microbenchmarks to delve further into WHIZ's data-driven benefits (results in Appendix C).


Figure 8: (a) Average, min, and max fraction of capsules which are data-local (DL) and fault-tolerant (FT), respectively, across all jobs for different cluster loads; (b) Max, min, and ideal storage load balance (LB) on every machine for different cluster loads.

(a)
% Machines Failed | JCT [Seconds] (Avg / Min / Max)
None              | 725  / 215 / 2100
10%               | 740  / 250 / 2320
25%               | 820  / 310 / 2360
50%               | 1025 / 350 / 2710
75%               | 1600 / 410 / 3300

(b)
Straggler Strategy | JCT [Seconds] (Avg / Min / Max)
None               | 1200 / 410 / 2750
Speculative        | 970  / 280 / 2300
WHIZ               | 815  / 330 / 1920

Table 4: WHIZ's variation in JCT in the presence of: (a) random machine failures; (b) different straggler mitigation strategies. Speculative is the default approach used by CC, and WHIZ is our approach as described in §6.2.

9.3 Load Balancing, Locality, Fault Tolerance
To evaluate DS load balancing (LB), data locality (DL), and fault tolerance (FT), we stressed the data organization under different cluster loads. We used job arrivals and all stages' capsule sizes from one of our TPC-DS runs.

The main takeaways are (Fig. 8): (1) WHIZ prioritizes DL and LB over FT across cluster loads (§4.2). (2) When the available resources are scarce (5× higher load than the initial load), all three metrics suffer. However, the maximum load imbalance per machine is < 1.5× the ideal, and for any job ≥ 47% of the capsules are DL. Also, on average 16% of the capsules per job are FT. (3) Lower cluster load (0.6× the initial load) gives the DS more opportunities to maximize all of the objectives: ≥ 84% of the per-job capsules are DL, 71% are FT, with at most 1.17× load imbalance per machine relative to the ideal.
Failures: Using the same workload, we also evaluated the performance impact in the presence of machine failures (Table 4a). We observe that WHIZ does not degrade job performance by more than 1.13× even when 50% of the machines fail. This is mainly due to the DS's ability to organize capsules to be FT across ancestor stages and avoid data recomputation. Even when 75% of the machines fail, the maximum JCT does not degrade by more than 1.57×, mainly because capsules belonging to some ancestor stages are still available, which leads to fast recomputation for the corresponding downstream vertices.

9.4 Sensitivity Analysis
Impact of contention: We vary storage load, and hence resource contention, by changing the number of machines while keeping the workload constant; half as many servers lead to twice as much load. We see that at 1× cluster load, WHIZ improves over CC by 1.39× (1.32×) on average in terms of JCT (makespan). Even at high contention (up to 4×), WHIZ's gains keep increasing, to 1.83× (1.42×). This is because data is load-balanced, leading to few storage hotspots and better isolation, and because WHIZ minimizes resource wastage and the time spent shuffling.

Multiple of       # Capsules per machine (G)
Original Load     8     16    20    24    28    32    36    40
1                 1.07  1.33  1.46  1.52  1.57  1.63  1.54  1.46
2                 1.10  1.16  1.53  1.58  1.56  1.61  1.47  1.31
4                 0.85  1.12  1.34  1.39  1.32  1.16  0.95  0.74

Table 5: Factors of improvement w.r.t. CC for different numbers of capsules per machine and cluster loads.

Impact of G (number of capsules per machine): We now provide the rationale for picking G = 24. Table 5 shows the factors of improvement w.r.t. CC for different values of G and levels of contention.

The main takeaways are as follows. For G = 8, the performance gap between WHIZ and CC is low (< 1.1×). This is expected because a small number of capsules results in less data locality (each capsule is more likely to be spread). Further, the gap decreases at high resource contention; in fact, at 4× the cluster load, CC performs better (0.85×). At larger values of G the performance gap increases. For example, at G = 24, the gains are largest (between 1.39× and 1.58×). This is because a larger G implies (1) more flexibility for WHIZ to balance load across machines, (2) that fewer capsules are likely to be spread out, and (3) less data skew and more predictable per-task performance. However, a very large G does not necessarily improve performance, as it can lead to massive task parallelism; the resulting scheduling overhead degrades performance, especially at high load.
Altruism: Can altruistically deciding task sizing (§6) impact job performance? We compare WHIZ's approach with a greedy task-sizing approach, where each task gets a 1/#jobs resource share per machine and uses all of it. For the same workload as §9.3, WHIZ actually speeds up jobs by 1.48× on average, and 3.12× (4.8×) at the 75th (95th) percentile, w.r.t. the greedy approach. Only 16% of jobs are slowed down, by ≤ 0.6×. Altruism, by late-binding resources with the slowest potential task, helps all jobs benefit.

10 Summary
The compute-centric nature of existing frameworks hurts flexibility, efficiency, isolation, and performance. WHIZ reenvisions analytics frameworks by cleanly separating computation from intermediate data. Via programmable monitoring of data properties and a rich event abstraction, WHIZ enables data-driven decisions for what computation to launch, where to launch it, and how many parallel instances to use, while ensuring isolation. Our evaluation using batch, stream, and graph workloads shows that WHIZ outperforms state-of-the-art frameworks.


References

[1] Apache Giraph. http://giraph.apache.org.

[2] Apache Hadoop. http://hadoop.apache.org.

[3] Apache Hive. http://hive.apache.org.

[4] Apache Samza. http://samza.apache.org.

[5] Apache Tez. http://tez.apache.org.

[6] Cloudlab. https://cloudlab.us.

[7] Presto | Distributed SQL Query Engine for Big Data. prestodb.io.

[8] Protocol Buffers. https://bit.ly/1mISy49.

[9] Spark SQL. https://spark.apache.org/sql.

[10] Storm: Distributed and fault-tolerant realtime computation. http://storm-project.net.

[11] S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data parallel computing. In NSDI, 2012.

[12] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective Straggler Mitigation: Attack of the Clones. In NSDI, 2013.

[13] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in MapReduce clusters using Mantri. In OSDI, 2010.

[14] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: Relational data processing in Spark. In SIGMOD, 2015.

[15] L. Bindschaedler, J. Malicevic, N. Schiper, A. Goel, and W. Zwaenepoel. Rock you like a hurricane: Taming skew in large scale analytics. In EuroSys, 2018.

[16] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In WWW, 2011.

[17] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, 2004.

[18] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.

[19] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with Orchestra. In SIGCOMM, 2011.

[20] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[21] A. Deshpande, Z. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1–140, Jan. 2007.

[22] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant Resource Fairness: Fair allocation of multiple resource types. In NSDI, 2011.

[23] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012.

[24] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, 2014.

[25] K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, 1991.

[26] B. Huang, N. W. Jarrett, S. Babu, S. Mukherjee, and J. Yang. Cümülön: Matrix-based data analytics in the cloud with spot instances. PVLDB, 9(3):156–167, 2015.

[27] B. Huang and J. Yang. Cümülön-D: Data analytics in a dynamic spot market. PVLDB, 10(8):865–876, 2017.

[28] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In ATC, 2010.

[29] E. Jonas, Q. Pu, S. Venkataraman, I. Stoica, and B. Recht. Occupy the cloud: Distributed computing for the 99%. In SoCC, 2017.

[30] Q. Ke, M. Isard, and Y. Yu. Optimus: A dynamic rewriting framework for data-parallel execution plans. In EuroSys, 2013.

[31] M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: A new, robust, parallel hash join method for data skew in the super database computer (SDC). In VLDB, 1990.


[32] A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and C. Kozyrakis. Pocket: Elastic ephemeral storage for serverless analytics. In OSDI, 2018.

[33] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A modern, open-source SQL engine for Hadoop. In CIDR, 2015.

[34] S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In SIGMOD, 2015.

[35] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. SkewTune: Mitigating skew in MapReduce applications. In SIGMOD, 2012.

[36] W. Lin, Z. Qian, J. Xu, S. Yang, J. Zhou, and L. Zhou. StreamScope: Continuous reliable distributed processing of big data streams. In NSDI, 2016.

[37] K. Mahajan, M. Chowdhury, A. Akella, and S. Chawla. Dynamic query re-planning using QOOP. In OSDI, 2018.

[38] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010.

[39] X. Meng, J. K. Bradley, B. Yavuz, E. R. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. CoRR, abs/1505.06807, 2015.

[40] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In SOSP, 2013.

[41] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: A universal execution engine for distributed data-flow computing. In NSDI, 2011.

[42] Q. Pu, S. Venkataraman, and I. Stoica. Shuffling, fast and slow: Scalable analytics on serverless infrastructure. In NSDI, 2019.

[43] B. Saha, H. Shah, S. Seth, G. Vijayaraghavan, A. Murthy, and C. Curino. Apache Tez: A unifying framework for modeling and building data processing applications. In SIGMOD, 2015.

[44] P. Snyder. tmpfs: A virtual memory file system. In European UNIX Users' Group Conference, 1990.

[45] A. Thusoo, R. Murthy, J. S. Sarma, Z. Shao, N. Jain, P. Chakka, S. Anthony, H. Liu, and N. Zhang. Hive – a petabyte scale data warehouse using Hadoop. In ICDE, 2010.

[46] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@twitter. In SIGMOD, 2014.

[47] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. Apache Hadoop YARN: Yet another resource negotiator. In SoCC, 2013.

[48] W. Lin, Z. Qian, J. Xu, S. Yang, J. Zhou, and L. Zhou. StreamScope: Continuous reliable distributed processing of big data streams. In NSDI, 2016.

[49] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD, 2013.

[50] Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD, 2008.

[51] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In EuroSys, 2010.

[52] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

[53] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant stream computation at scale. In SOSP, 2013.

[54] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, 2008.


1  job = Whiz.createJob("Top5Words", Whiz.STREAM)
2  stage1 = Whiz.createStage(job, "s1", wordSplit, TriggerImpl)
3  stage2 = Whiz.createStage(job, "s2", Top5Impl)
4  Whiz.addDependency(job, stage1, stage2)
5  Whiz.addDataMonitor(job, stage1, DS.monitor.unique_key_count)

5  def wordSplit(input):
6      words = input.map(input => (input, 1))
7      return words

8  def Top5Impl(words):
9      top5 = words.top(5)
10     return top5    // top(k) returns the top k words

11 def TriggerImpl():
12     for key in keys:
13         if (DS.monitor.unique_key_count >= 100):
14             return true
15     return false

Figure 10: An example streaming job in WHIZ

Objectives (to be minimized):

O1:  $\max_i \sum_k \big( b^k_i + x^k_i\, e^k \big)$

O2:  $\sum_k \Big( P^k - \sum_{i \in I^k_-} x^k_i\, b^k_{i(k)} - \sum_{i \in I^k_+} x^k_i \big( b^k_i + e^k \big) \Big)$

O3:  $\sum_k (1 - f^k) \sum_{i \in I^\circ} x^k_i$

Constraints:

C1:  $\sum_{k : J(g^k) = j} \big( b^k_i + x^k_i\, e^k \big) \le Q_j \quad \forall j, i$

Variables:
$x^k_i$ — binary indicator denoting that capsule $g^k$ is placed on machine $i$

Parameters:
$b^k_i$ — existing number of bytes of capsule $g^k$ on machine $i$
$e^k$ — expected number of remaining bytes for capsule $g^k$
$P^k$ — $e^k + \sum_i b^k_i$
$J(g^k)$ — the job ID for capsule $g^k$
$i(k)$ — $\arg\max_i b^k_i$
$I^k_-$, $I^k_+$ — $\{i : b^k_i \le b^k_{i(k)} - e^k\}$ and $\{i : b^k_i > b^k_{i(k)} - e^k\}$, respectively
$f^k$ — binary parameter indicating that capsules for the same stage as $g^k$ share locations with capsules for preceding stages
$I^\circ$ — set of machines where capsules of preceding stages are stored
$Q_j$ — administrative storage quota for job $j$

Figure 11: Binary ILP formulation for capsule placement.

1  job = Whiz.createJob("PageRank", Whiz.GRAPH)
2  graph = Whiz.loadGraph(job, "graph.txt", PageRankImpl, TriggerImpl)    // PageRankImpl implements the PageRank vertex logic
3  Whiz.addDataMonitor(job, graph, DS.monitor.num_entries)

4  def TriggerImpl():
5      for key in keys:
6          if (DS.monitor.num_entries(key) >= 1000):
7              return true
8      return false

Figure 9: An example graph job in WHIZ

A WHIZ Sample Programs
A key advantage of WHIZ is its programming model, wherein users no longer have to provide low-level details and still gain the benefits of data-driven computation using our APIs. While we showed a sample batch analytics application earlier (§3), we now show how to write more diverse applications atop WHIZ; specifically, a graph analytics job and a stream processing job.
Graph analytics (Fig. 9). We consider the graph analytics application that we used in our testbed experiments (§9), i.e., a job running the PageRank algorithm. Job composition details are similar to existing frameworks (lines 1-2). Similar to the createStage API used for batch/stream jobs, we have a loadGraph API which constructs the graph from the given input. Additionally, we specify the vertex program logic for PageRank via the PageRankImpl function.
Crucially, WHIZ allows a user to run the graph algorithm in a data-driven manner by specifying the implementation of TriggerImpl. TriggerImpl specifies that as soon as 1000 entries corresponding to a vertex are seen, the execution of the vertex program is triggered, leading to pipelined execution (lines 4-8). The user instructs the DS to use the built-in module to collect per-key counts (line 3). This simple yet powerful trigger naturally deals with high-degree vertices and leads to better performance and efficiency (as seen in §9.1.2).
Stream analytics (Fig. 10). We consider the stream analytics job that we used in our testbed experiments (§9), i.e., a job that returns the top 5 words once 100 distinct words are seen. As before, job composition details are similar to existing frameworks (lines 1-3). This job can be realized as a 2-stage job wherein S1 processes the input words (lines 5-7) and S2 returns the top 5 most common words (lines 8-10). In this example, the user can leverage the flexibility of specifying a custom ready trigger: using built-in data modules, the user specifies that S2 is triggered only when 100 distinct keys are encountered (lines 11-15). As we have seen in §9.1.3, this trigger leads to better performance.

B Allocating Capsules to Machines ILP
We consider how to place multiple jobs' capsules to avoid hotspots, reduce per-capsule spread (for data locality), and minimize the impact of data loss on job runtime. We formulate a binary integer linear program (see Fig. 11) to this end. The indicator decision variables $x^k_i$ denote that all future data for capsule $g^k$ is materialized at machine $M_i$. The ILP finds the $x^k_i$'s that minimize a multi-part weighted objective function, with one part for each of the three objectives mentioned above.


Figure 12: (a) Controlling task parallelism significantly improves WHIZ's performance over CC. (b) Straggler mitigation with WHIZ and CC.

The first part (O1) represents the maximum, across all machines, of the data stored from all capsules. Minimizing it ensures load balance and avoids hotspots. The second part (O2) represents the sum of data-spread penalties across all capsules. Here, for each capsule, we define the primary location as the machine with the largest volume of data for that capsule. The total volume of data in non-primary locations is the data-spread penalty, incurred from shuffling the data prior to processing it. The third part (O3) is the sum of fault-tolerance penalties across capsules. Say a machine m storing intermediate data for the current stage s fails; then we have to re-execute s to regenerate the data. If the machine also holds data for ancestor stages of s, then multiple stages have to be re-executed. If we ensure that data from parent and child stages are stored on different machines, then upon losing the child's data only the child stage has to be re-executed. We model this by imposing a penalty whenever a capsule in the current stage is materialized on the same machine as the parent stage. Penalties O2 and O3 need to be minimized.

Finally, we impose an isolation constraint (C1) requiring the total data for a job on any machine to not exceed an administrator-set quota Q_j. Quotas help ensure isolation across jobs.

However, solving this ILP at scale can take several tens of seconds, delaying capsule placement. Thus, WHIZ uses a linear-time rule-based heuristic to place capsules (as described in §4).
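For concreteness, the following is a minimal sketch of the Fig. 11 placement ILP expressed with the PuLP solver library. It is our own illustration, not WHIZ's implementation: the objective weights, the shape of the input dictionaries, and the explicit one-machine-per-capsule constraint (which Fig. 11 leaves implicit) are assumptions.

import pulp

def place_capsules(b, e, job_of, preceding, shares_with_parent, quota,
                   w1=1.0, w2=1.0, w3=1.0):
    # b[k][i]: existing bytes of capsule k on machine i; e[k]: expected bytes.
    capsules = list(b.keys())
    machines = list(next(iter(b.values())).keys())
    prob = pulp.LpProblem("capsule_placement", pulp.LpMinimize)
    x = pulp.LpVariable.dicts(
        "x", [(k, i) for k in capsules for i in machines], cat="Binary")

    # Each capsule's future data goes to exactly one machine (assumed).
    for k in capsules:
        prob += pulp.lpSum(x[(k, i)] for i in machines) == 1

    # O1: maximum per-machine load, linearized via an auxiliary variable.
    max_load = pulp.LpVariable("max_load", lowBound=0)
    for i in machines:
        prob += pulp.lpSum(b[k][i] + x[(k, i)] * e[k] for k in capsules) <= max_load

    # O2: data-spread penalty = total bytes minus bytes at the primary location.
    spread = []
    for k in capsules:
        total = e[k] + sum(b[k].values())                  # P^k
        prim = max(machines, key=lambda i: b[k][i])        # i(k)
        primary_bytes = pulp.lpSum(
            x[(k, i)] * (b[k][prim] if b[k][i] <= b[k][prim] - e[k]
                         else b[k][i] + e[k]) for i in machines)
        spread.append(total - primary_bytes)

    # O3: fault-tolerance penalty for co-locating with preceding-stage capsules.
    ft = pulp.lpSum((1 - shares_with_parent[k]) * x[(k, i)]
                    for k in capsules for i in preceding[k])

    prob += w1 * max_load + w2 * pulp.lpSum(spread) + w3 * ft

    # C1: per-job, per-machine storage quota.
    for i in machines:
        for j in set(job_of.values()):
            prob += pulp.lpSum(b[k][i] + x[(k, i)] * e[k]
                               for k in capsules if job_of[k] == j) <= quota[j]

    prob.solve()
    return {k: next(i for i in machines if x[(k, i)].value() > 0.5)
            for k in capsules}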

C WHIZ Microbenchmarks
Apart from the experiments on the 50-machine cluster (§9), we also ran several microbenchmarks to delve deeper into WHIZ's data-driven benefits. The microbenchmarks were run on a 5-machine cluster, and the workload consists of the following jobs: J1 (v1 → v2) and J2 (v1 → v2 → v3). These patterns typically occur in TPC-DS queries.

Skew and parallelism: Fig. 12a shows the execution of one of the J2 queries from our workload when running WHIZ and CC. WHIZ improves JCT by 2.67× over CC. CC decides stage parallelism tied to the number of data partitions: stage v1 generates 2 intermediate partitions, as configured by the user, and 2 tasks of v2 process them. However, the execution of v1 leads to data skew between the 2 partitions (1GB and 4GB). On the other hand, WHIZ generates capsules that are approximately equal in size and decides at runtime a maximum input size per task of 1GB (twice the largest capsule). This leads to running 5 tasks of v2 with equal input sizes and a 2.1× faster completion time for v2 than CC.
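A minimal sketch of this sizing rule (our own illustration of the behavior described above, not WHIZ's code): cap the per-task input at twice the largest capsule and derive the task count from the total intermediate data. The example input assumes ten roughly equal 0.5GB capsules (5GB total), consistent with the numbers above.

import math

def runtime_parallelism(capsule_sizes_gb, cap_factor=2.0):
    # Per-task input is capped at cap_factor x the largest capsule;
    # the number of tasks follows from the total intermediate data.
    max_task_input = cap_factor * max(capsule_sizes_gb)
    num_tasks = math.ceil(sum(capsule_sizes_gb) / max_task_input)
    return max_task_input, num_tasks

print(runtime_parallelism([0.5] * 10))   # (1.0, 5): 1GB per task, 5 v2 tasks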

Over-parallelizing execution does not help either. With CC, v2 generates 12 partitions processed by 12 v3 tasks. Under a resource crunch, tasks get scheduled in multiple waves (at 570s in Fig. 12a) and the completion time of v3 suffers (85s). In contrast, WHIZ assigns at runtime only 5 tasks of v3, which can run in a single wave; v3 finishes 1.23× faster.
Straggler mitigation: We run an instance of J1 with 1 task of v1 and 1 task of v2, with an input size of 1GB. A slowdown happens at the v2 task, which was assigned 2 capsules by WHIZ.

In CC (Fig. 12b), once a straggler is detected (the v2 task, at 203s), it is allowed to continue, and a speculative task v2' is launched that duplicates v2's work. The work completes when either v2 or v2' finishes (at 326s). In WHIZ, upon straggler detection, the straggler (v2) is notified to finish processing only its current capsule; a task v2' is launched and assigned the data from v2's unprocessed capsule. v2 finishes processing the first capsule at 202s; v2' processes the other capsule and finishes 1.7× faster than v2' in CC.
Runtime logic changes: We consider a job which processes words and, for words with < 100 occurrences, sorts them by frequency. The program structure is v1 → v2 → v3, where v1 processes input words, v2 computes word occurrences, and v3 sorts the words with < 100 occurrences. In CC, v1 generates 17GB of data organized in 17 partitions; v2 generates 8GB organized in 8 partitions. Given this, 17 v2 tasks and 8 v3 tasks execute, leading to a CC JCT of 220s; the entire data generated by v2 has to be analyzed by v3. In contrast, WHIZ registers a custom DS module to monitor the number of occurrences of all the words in the capsules generated by v2. We implement actions to ignore v2 capsules that do not satisfy v3's processing criteria (§6.3). At runtime, 2 statistics events are triggered by the DS module, and 6 tasks of v3 (instead of 8) are executed; JCT is 165s (1.4× better).
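The gist of that custom action can be illustrated with the following sketch (our own, self-contained Python; the function name and the word-count input are hypothetical, not WHIZ's monitoring API): a v2 capsule is kept only if it contains at least one word that v3 would actually sort.

def capsule_relevant_to_v3(word_counts, threshold=100):
    # Keep a v2 capsule only if some word appears fewer than `threshold`
    # times, since v3 only sorts such words; otherwise it can be ignored.
    return any(count < threshold for count in word_counts.values())

print(capsule_relevant_to_v3({"the": 4120, "data": 530}))   # False: skip capsule
print(capsule_relevant_to_v3({"whiz": 3, "tez": 45}))       # True: process it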
