
OverFlow: Multi-Site Aware Big Data Management for

Scientific Workflows on Clouds

Radu Tudoran, Alexandru Costan, Gabriel Antoniu

To cite this version:

Radu Tudoran, Alexandru Costan, Gabriel Antoniu. OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds. IEEE Transactions on Cloud Computing, 2016, <10.1109/TCC.2015.2440254>. <hal-01239128>

HAL Id: hal-01239128

https://hal.inria.fr/hal-01239128

Submitted on 7 Dec 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Public Domain


OverFlow: Multi-Site Aware Big Data Management for Scientific Workflows on Clouds

Radu Tudoran, Member, IEEE, Alexandru Costan, Member, IEEE, and Gabriel Antoniu, Member, IEEE

Abstract—The global deployment of cloud datacenters is enabling large-scale scientific workflows to improve performance and deliver fast responses. This unprecedented geographical distribution of the computation is coupled with an increase in the scale of the data handled by such applications, bringing new challenges related to efficient data management across sites. High throughput, low latencies or cost-related trade-offs are just a few concerns for both cloud providers and users when it comes to handling data across datacenters. Existing solutions are limited to cloud-provided storage, which offers low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes, achieving performance at the cost of complex system configurations, maintenance overheads, reduced reliability and reusability. In this paper, we introduce OverFlow, a uniform data management system for scientific workflows running across geographically distributed sites, aiming to reap economic benefits from this geo-diversity. Our solution is environment-aware: it monitors and models the global cloud infrastructure, offering high and predictable data handling performance for transfer cost and time, within and across sites. OverFlow proposes a set of pluggable services, grouped in a data scientist cloud kit. They provide applications with the possibility to monitor the underlying infrastructure, to exploit smart data compression, deduplication and geo-replication, to evaluate data management costs, to set a tradeoff between money and time, and to optimize the transfer strategy accordingly. The system was validated on the Microsoft Azure cloud across its 6 EU and US datacenters. The experiments were conducted on hundreds of nodes using synthetic benchmarks and real-life bio-informatics applications (A-Brain, BLAST). The results show that our system is able to model the cloud performance accurately and to leverage this for efficient data dissemination, reducing the monetary costs and transfer time by up to 3 times.

Index Terms—Big Data, scientific workflows, cloud computing, geographically distributed, data management


1 INTRODUCTION

With their globally distributed datacenters, cloud infrastructures enable the rapid development of large-scale applications. Examples of such applications running as cloud services across sites range from office collaborative tools (Microsoft Office 365, Google Drive), search engines (Bing, Google) and global stock market analysis tools to entertainment services (e.g., sport events broadcasting, massively parallel games, news mining) and scientific workflows [1]. Most of these applications are deployed on multiple sites to leverage proximity to users through content delivery networks. Besides serving the local client requests, these services need to maintain a global coherence for mining queries, maintenance or monitoring operations, which require large data movements. To enable this Big Data processing, cloud providers have set up multiple datacenters at different geographical locations. In this context, sharing, disseminating and analyzing the data sets results in frequent large-scale data movements across widely distributed sites.

• R. Tudoran is with Inria / ENS Rennes, France. E-mail: [email protected]

• A. Costan is with IRISA / INSA Rennes, France. E-mail: [email protected]

• G. Antoniu is with Inria Bretagne Atlantique, Rennes, France. E-mail: [email protected]

The targeted applications are either compute-intensive, for which moving the processing close to data is rather expensive (e.g., genome mapping, physics simulations), or simply require large-scale end-to-end data movements (e.g., organizations operating several datacenters and running regular backups and replication between sites, applications collecting data from remote sensors, etc.). In all cases, the cost savings (mainly computation-related) should offset the significant inter-site distance (network costs). Studies show that the inter-datacenter traffic is expected to triple in the following years [2], [3]. Yet, the existing cloud data management services typically lack mechanisms for dynamically coordinating transfers among different datacenters in order to achieve reasonable QoS levels and optimize the cost-performance ratio. Being able to effectively use the underlying storage and network resources has thus become critical for wide-area data movements as well as for federated cloud settings.

This geographical distribution of computation becomes increasingly important for scientific discovery. In fact, many Big Data scientific workloads nowadays enable the partitioning of their input data. This allows most of the processing to be performed independently on the data partitions across different sites and then to aggregate the results in a final phase. In some of the largest scenarios, the data sets are already partitioned for storage across multiple sites, which simplifies the task of preparing and launching a geographically distributed processing. Among the notorious examples we recall the 40 PB/year of data being generated by the CERN LHC. The volume exceeds the capacity of a single site or a single institution to store or process it, requiring an infrastructure that spans multiple sites. This was the case for the Higgs boson discovery, for which the processing was extended to the Google cloud infrastructure [4]. Accelerating the process of understanding data by partitioning the computation across sites has also proven effective in other areas, such as solving bio-informatics problems [5]. Such workloads typically involve a huge number of statistical tests for asserting potentially significant regions of interest (e.g., links between brain regions and genes). This processing has proven to benefit greatly from a distribution across sites. Besides the need for additional compute resources, applications have to comply with several cloud providers' requirements, which force them to be deployed on geographically distributed sites. For instance, in the Azure cloud, there is a limit of 300 cores allocated to a user within a datacenter, for load balancing purposes; any application requiring more compute power will eventually be distributed across several sites.

More generally, a large class of such scientific applications can be expressed as workflows, by describing the relationship between individual computational tasks and their input and output data in a declarative way. Unlike tightly-coupled applications (e.g., MPI-based) communicating directly via the network, workflow tasks exchange data through files. Currently, workflow data handling in clouds is achieved either using application-specific overlays that map the output of one task to the input of another in a pipeline fashion, or, more recently, by leveraging the MapReduce programming model, which clearly does not fit every scientific application. When deploying a large-scale workflow across multiple datacenters, the geographically distributed computation faces a bottleneck from the data transfers, which incur high costs and significant latencies. Without appropriate design and management, these geo-diverse networks can raise the cost of executing scientific applications.

In this paper, we tackle these problems by trying to understand to what extent the intra- and inter-datacenter transfers can impact the total makespan of cloud workflows. We first examine the challenges of single-site data management by proposing an approach for efficient sharing and transfer of input/output files between nodes. We advocate storing data on the compute nodes and transferring files between them directly, in order to exploit data locality and to avoid the overhead of interacting with a shared file system. Under these circumstances, we propose a file management service that enables high throughput through self-adaptive selection among multiple transfer strategies (e.g., FTP-based, BitTorrent-based, etc.). Next, we focus on the more general case of large-scale data dissemination across geographically distributed sites. The key idea is to accurately and robustly predict I/O and transfer performance in a dynamic cloud environment in order to judiciously decide how to perform transfer optimizations over federated datacenters. The proposed monitoring service dynamically updates the performance models to reflect changing workloads and varying network-device conditions resulting from multi-tenancy. This knowledge is further leveraged to predict the best combination of protocol and transfer parameters (e.g., multi-routes, flow count, multicast enhancement, replication degree) to maximize throughput or minimize costs, according to user policies. To validate our approach, we have implemented the above system in OverFlow, as part of the Azure cloud, so that applications can use it following a Software-as-a-Service approach.

Our contribution, which extends previous work introduced in [6], [7], can be summarised as follows:

• We implement OverFlow, a fully-automated single- and multi-site software system for scientific workflow data management (Section 3.2);

• We introduce an approach that optimizes workflow data transfers on clouds by means of adaptive switching between several intra-site file transfer protocols using context information (Section 3.3);

• We build a multi-route transfer approach across intermediate nodes of multiple datacenters, which aggregates bandwidth for efficient inter-site transfers (Section 3.4);

• We show how OverFlow can be used to support large-scale workflows through an extensive set of pluggable services that estimate and optimize costs, provide insights on the environment performance and enable smart data compression, deduplication and geo-replication (Section 4);

• We experimentally evaluate the benefits of OverFlow on hundreds of cores of the Microsoft Azure cloud in different contexts: synthetic benchmarks and real-life scientific scenarios (Section 5).

2 CONTEXT AND BACKGROUND

One important phase that contributes to the total makespan (and the consequent cost) of a scientific workflow executed on clouds is the transfer of its tasks and files to the execution nodes [8], [9]. Several file transfers are required to start the task execution and to retrieve the intermediate and output data. With file sizes increasing to PB levels, data handling becomes a bottleneck. The elastic model of the clouds has the potential to alleviate this through improved resource utilization. In this section, we show that this feature has to be complemented by a wider insight about the environment and the application, on the one hand, and a two-way communication between the workflow engine and the data management system, on the other hand.

2.1 Terminology

We define the terms used throughout this paper.

The cloud site / datacenter: is the largest building block of the cloud. It contains a broad number of compute nodes, which provide the computation power available for renting. A cloud, particularly a public one, has several sites, geographically distributed around the globe. The datacenter is organized in a hierarchy of switches, which interconnect the compute nodes and connect the datacenter itself with the outside world through the Tier 1 ISPs. Multiple output endpoints (i.e., through Tier 2 switches) are used to connect the datacenter to the Internet [10].

The cloud deployment: refers to the VMs that host a user application, forming a virtual space. The VMs are placed on different compute nodes in separate network partitions in order to ensure fault tolerance and availability for the service. Typically, a load balancer distributes all external requests among the VMs. The topology of the VMs in the datacenter is unknown to the user. Several other aspects are invisible (or transparent) to users due to virtualization, e.g., the mapping of the IPs to physical nodes, the vicinity of VMs, the load introduced by the neighbouring VMs, if any, on the physical node, etc. A deployment is limited to a single site and, depending on commercial constraints and priorities, it can be limited in size. Therefore, a multi-site application will rely on multiple deployments.

The multi-site cloud: is made up of several geographically distributed datacenters. An application that has multiple running instances in several deployments across multiple cloud datacenters is referred to as a multi-site cloud application. Our focus in this paper is on such applications. Although applications could be deployed across sites belonging to different cloud vendors (i.e., cloud federations), these are out of the scope of this work. We plan to address such issues in the context of hybrid clouds in future work.

2.2 Workflow data management

Requirements for cloud-based workflows. In order to support data-intensive workflows, a cloud-based solution needs to: 1) adapt the workflows to the cloud environment and exploit its specificities (e.g., running in the user's virtualized space, commodity compute nodes, ephemeral local storage); 2) optimize data transfers to provide a reasonable time to solution; 3) manage data so that it can be efficiently placed and accessed during the execution.

Data management challenges. Data transfers are affected by the instability and heterogeneity of the cloud network. There are numerous options, some providing additional security guarantees (e.g., TLS), others designed for high collaborative throughput (e.g., BitTorrent). Data locality aims to minimize data movements and to improve end-application performance and scalability. Addressing data and computational problems separately typically achieves the contrary, leading to scalability problems for tomorrow's exascale datasets and millions of nodes, and yielding significant underutilization of the resources. Metadata management plays an important role as, typically, the data relevant for a certain task can be stored in multiple locations. Logical catalogs are one approach to provide information about the location of data items.

Programming challenges. So far, MapReduce has been the "de facto" cloud computing model, complemented by a number of variations of languages for task specification [11]. They provide some data flow support, but all require a shift of the application logic into the MapReduce model. Workflow semantics go beyond the map-shuffle-reduce-merge operations and deal with data placement, sharing, inter-site data transfers, etc. Independently of the programming model of choice, they need to address the following issues: support large-scale parallelism to maximize throughput under high concurrency, enable data partitioning to reduce latencies, and handle the mapping from task I/O data to cloud logical structures to efficiently exploit its resources.

Targeted workflow semantics. In order to address these challenges we studied 27 real-life applications [12] from several domains (bio-informatics, business, simulations, etc.) and identified a set of core characteristics:

• The common data patterns are: broadcast, pipeline, gather, reduce, scatter.

• Workflows are composed of batch jobs with well-defined data passing schemes. The workflow engines execute the jobs (e.g., take the executable to a VM, bring the needed data files and libraries, run the job and retrieve the final result) and perform the required data passing between jobs.

• Inputs and outputs are files, usually written once.

• Tasks and inputs/outputs are uniquely identified.

Our focus is on how to efficiently handle and transfer workflow data between the VMs in a cloud deployment. We argue that keeping data in the local disks of the VMs is a good option, considering that for such workflows most of the data files are usually temporary: they must exist only to be passed from the job that produced them to the one that will further process them. With our approach, the files from the virtual disk of each compute node are made available to all other nodes within the deployment. Caching the data where it is produced and transferring it directly where it is needed reduces the time for data manipulation and minimizes the workflow makespan.

2.3 The need for site-aware file management for cloud-based workflows

As we move to the world of Big Data, single-site processing becomes insufficient: large-scale scientific workflows can no longer be accommodated within a single datacenter. Moreover, these workflows typically collect data from several sources, and even the data acquired from a single source is distributed on multiple sites for availability and reliability reasons. Clouds let users control remote resources. In conjunction with a reliable networking environment, we can now use geographically distributed resources: dynamically provisioned distributed domains are built in this way over multiple sites. Several advantages arise from computations running on such multi-site configurations: resilience to failures, distribution across partitions (e.g., moving computation close to data or vice versa), elastic scaling to support usage bursts, etc.

However, in order to exploit the advantages of multi-site executions, users currently have to set up their own tools to move data between deployments, through direct endpoint-to-endpoint communication (e.g., GridFTP, scp, etc.). This baseline option is relatively simple to set in place, using the public endpoint provided for each deployment. The major drawback in this scenario is the low bandwidth between sites, which drastically limits the achievable throughput. Clearly, this paradigm shift towards multi-site workflow deployments calls for appropriate data management tools that build on a consistent, global view of the entire distributed datacenter environment. This is precisely the goal targeted by OverFlow.

3 ADAPTIVE FILE MANAGEMENT ACROSS CLOUD SITES WITH OVERFLOW

In this section we show how the main design ideas behind the architecture of OverFlow are leveraged to support fast data movements both within a single site and across multiple datacenters.

3.1 Core design principles

Our proposal relies on the following ideas:

• Exploiting the network parallelism. Building on the observations that a workflow typically runs on multiple VMs and that communication between datacenters follows different physical routes, we rely on multi-route transfer strategies. Such schemes exploit the low-latency intra-site bandwidth to copy data to intermediate nodes within the source deployment (site). Next, this data is forwarded towards the destination across multiple routes, aggregating additional bandwidth between sites.

• Modeling the cloud performance. The complexity of the datacenter architecture, topology and network infrastructure makes simplistic approaches for dealing with transfer performance (e.g., exploiting system parallelism) less appealing. In a virtualized environment such techniques are at odds with the goal of reducing costs through efficient resource utilization. Accurate performance models are then needed, leveraging online observations of the cloud behavior. Our goal is to monitor the virtualized environment and to predict performance metrics (e.g., transfer time, costs). As such, we argue for a model that provides enough accuracy for automating the distributed data management tasks.

• Exploiting the data locality. The cumulative storage capacity of the VMs leased in one's deployment easily reaches the order of TBs. Although tasks store their input and output files on the local disks, most of this storage remains unused. Meanwhile, workflows typically use remote cloud storage (e.g., Amazon S3, Azure Blobs) for sharing data [8]. This is costly and highly inefficient, due to high latencies, especially for temporary files that do not require persistent storage. Instead, we propose aggregating parts of the virtual disks in a shared common pool, managed in a distributed fashion, in order to optimize data sharing.

• Cost effectiveness. As expected, the cost closely follows performance. Different transfer plans for the same data may result in significantly different costs. In this paper we ask the question: given the clouds' interconnect offerings, how can an application use them in a way that strikes the right balance between cost and performance?

• No modification of the cloud middleware. Data processing in public clouds is done at user level, which restricts the application permissions to the virtualized space. Our solution is suitable for both public and private clouds, as no additional privileges are required.

3.2 Architecture

The conceptual scheme of the layered architecture of OverFlow is presented in Figure 1. The system is built to support, at any level, a seamless integration of new, user-defined modules, transfer methods and services. To achieve this extensibility, we opted for the Managed Extensibility Framework 1, which allows the creation of lightweight extensible applications by discovering and loading new specialized services at runtime, with no prior configuration.

We designed the layered architecture of OverFlow starting from the observation that Big Data applications require more functionality than the existing put/get primitives.

1. http://msdn.microsoft.com/en-us/library/dd460648.aspx


Fig. 1. The extendible, server-based architecture of the OverFlow system. The Communication Layer provides intra-site adaptive protocol switching (In-Memory, FTP, Torrent) and inter-site multi-route transfers across intermediate nodes; the Management Layer hosts the Metadata Registry, the Transfer Manager, the Replication Agent, the Monitor Service (throughput, bandwidth, VM compute performance, etc.) and the performance modeling; the Server Layer exposes the services: transfer, transfer time estimation, transfer cost estimation, geographical replication, compression, etc.

Therefore, each layer is designed to offer a simple API, on top of which the layer above builds new functionality. The bottom layer provides the default "cloudified" API for communication. The middle (management) layer builds on it a pattern-aware, high-performance transfer service (see Sections 3.3, 3.4). The top (server) layer exposes a set of functionalities as services (see Section 4). The services leverage information such as data placement, performance estimation for specific operations or the cost of data management, which are made available by the middle layer. This information is delivered to users/applications, in order to plan and to optimize costs and performance while gaining awareness of the cloud environment.

The interaction of the OverFlow system with workflow management systems is done through its public API. For example, we have integrated our solution with the Microsoft Generic Worker [12] by replacing its default Azure Blobs data management backend with OverFlow. We did this by simply mapping the I/O calls of the workflow to our API, with OverFlow leveraging the data access pattern awareness, as further detailed in Sections 5.6.1 and 5.6.2. The next step is to leverage OverFlow for multiple (and ideally generic) workflow engines (e.g., Chiron [13]), across multiple sites. We are currently working jointly with Microsoft, in the context of the Z-CloudFlow [14] project, on using OverFlow and its numerous data management services to process highly distributed workflows orchestrated by multiple engines.

3.3 Intra-site transfers via protocol switching

In a first step, we focus on intra-site communication, leveraging the observation that workflow tasks usually produce temporary files that exist only to be passed from the job that produced them to the one further processing them. The simplified schema of our approach to manage such files is depicted in Figure 2. File sharing between tasks is achieved by advertising file locations and transferring the files towards the destination, without intermediately storing them in a shared repository.

Fig. 2. Architecture of the adaptive protocol-switching file management system [6]. Operations for transferring files between VMs: upload (1), download (2, 3, 4).

We introduce three components that enable file sharing across compute instances:

The Metadata Registry holds the locations of the files in the VMs. It uses an in-memory distributed hash table to hold key-value pairs: file ids (e.g., name, user, sharing group, etc.) and locations (the information required by the transfer module to retrieve the file). Several implementation alternatives are available: in-memory databases, Azure Tables, Azure Caching. We chose the latter, as our preliminary evaluations showed that Azure Caching delivers better performance than Azure Tables (10 times faster for small items) and has a low CPU consumption footprint (unlike a database).
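For illustration, below is a minimal Python sketch of the kind of key-value record such a registry holds and of the constant-time advertise/lookup operations. The class and field names are hypothetical, not the actual OverFlow schema, and a plain dictionary stands in for Azure Caching.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FileRecord:
    # Hypothetical metadata entry: a logical file id mapped to the VMs holding a replica.
    file_id: str                                          # e.g. "user/sharing-group/name"
    locations: List[str] = field(default_factory=list)    # VM endpoints that store the file
    size_bytes: int = 0

class MetadataRegistry:
    """In-memory hash table standing in for the Azure Caching based registry."""
    def __init__(self) -> None:
        self._table: Dict[str, FileRecord] = {}

    def advertise(self, file_id: str, vm_endpoint: str, size_bytes: int) -> None:
        # An OverFlow-style "upload" only publishes the location, so it is O(1) in data size.
        record = self._table.setdefault(file_id, FileRecord(file_id, [], size_bytes))
        if vm_endpoint not in record.locations:
            record.locations.append(vm_endpoint)

    def lookup(self, file_id: str) -> FileRecord:
        # A "download" first resolves the location, then fetches directly from that VM.
        return self._table[file_id]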

The Transfer Manager performs the transfers between the nodes by uploading and downloading files. The upload operation (arrow 1 in Figure 2) consists simply in advertising the file location, which is done by creating a record in the Metadata Registry. This implies that the execution time does not depend on the data size. The download operation first retrieves the location information about the data from the registry (arrow 2 in Figure 2), then contacts the VM holding the targeted file (arrow 3), transfers it and finally updates the metadata (arrow 4). With these operations, the number of reads and writes needed to move a file between tasks (i.e., nodes) is reduced. Multiple options are available for performing the actual transfer. Our proposal is to integrate several of them and dynamically switch the protocol based on the context. Essentially, the system is composed of user-deployed and default-provided transfer modules and their service counterparts, deployed on each compute instance. Building on the extensibility properties, users can deploy custom transfer modules. The only requirement is to provide an evaluation function for scoring the context. The score is computed by weighting a set of parameters, e.g., number or size of files, replica count, resource load, data format, etc. These weights reflect the relevance of the transfer solution for each specific parameter. Currently, we provide 3 protocols among which the system adaptively switches (a sketch of this scoring-based selection is given after the protocol list below):

• In-Memory: targets small file transfers or deployments which have large spare memory. Transferring data directly in memory boosts the performance, especially for scatter and gather/reduce access patterns. We used Azure Caching to implement this module, by building a shared memory across the VMs.

Fig. 3. The multi-route transfer schema based on intermediate cloud nodes for inter-site transfers (data is replicated from the sender over the low-latency intra-site network and forwarded towards the receiver through intermediate datacenters over the high-latency inter-site network).

• FTP: is used for large files that need to be transferred directly between machines. This solution is mostly suited for pipeline and gather/reduce data access patterns. To implement it we started from an open-source implementation [15], adjusted to fit the cloud specificities. For example, the authentication is removed, and data is transferred in chunks of configurable size (1 MB by default) for higher throughput.

• BitTorrent: is adapted for broadcast/multicast access patterns, to leverage the extra replicas in collaborative transfers. Hence, for scenarios involving a replication degree above a user-configurable threshold, the system uses BitTorrent. We rely on the MonoTorrent [16] library, again customised for our needs: we increased the 16 KB default packet size to 1 MB, which translates into a throughput increase of up to 5 times.
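To illustrate the scoring-based switching described above, the sketch below registers one evaluation function per transfer module, scores a transfer context and picks the highest score. The thresholds, weights and field names are illustrative assumptions, not the values used by OverFlow.

from typing import Callable, Dict

# Context parameters describing a pending transfer (illustrative fields only).
Context = Dict[str, float]   # e.g. {"file_size_mb": 80, "replica_count": 1, "free_mem_mb": 2048}

def score_in_memory(ctx: Context) -> float:
    # Small files and plenty of spare memory favor the in-memory module.
    return 1.0 if ctx["file_size_mb"] <= 64 and ctx["free_mem_mb"] > 4 * ctx["file_size_mb"] else 0.2

def score_ftp(ctx: Context) -> float:
    # Large files moved point-to-point (pipeline, gather/reduce) favor FTP.
    return 0.9 if ctx["file_size_mb"] > 64 and ctx["replica_count"] <= 2 else 0.3

def score_torrent(ctx: Context) -> float:
    # Broadcast/multicast with a replication degree above a threshold favors BitTorrent.
    return 0.9 if ctx["replica_count"] > 2 else 0.1

MODULES: Dict[str, Callable[[Context], float]] = {
    "in-memory": score_in_memory,
    "ftp": score_ftp,
    "torrent": score_torrent,
}

def select_protocol(ctx: Context) -> str:
    # Pick the module whose evaluation function scores this context highest.
    return max(MODULES, key=lambda name: MODULES[name](ctx))

# A 100 MB file to be replicated to 5 nodes would go through BitTorrent.
print(select_protocol({"file_size_mb": 100, "replica_count": 5, "free_mem_mb": 1024}))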

The Replication Agent is an auxiliary component to the sharing functionality, ensuring fault tolerance across the nodes. The service runs as a background process within each VM, providing in-site replication functionality alongside the geographical replication service described in Section 4.2. In order to decrease its intrusiveness, transfers are performed during idle bandwidth periods. The agents communicate and coordinate via a message-passing mechanism that we built on top of the Azure Queuing system. As a future extension, we plan to schedule the replica placement in agreement with the workflow engine (i.e., the workflow semantics).

3.4 Inter-site data transfers via multi-routes

In a second step, we move to the more complicated case of inter-site data transfers. Sending large amounts of data between two datacenters can rapidly saturate the small interconnecting bandwidth. Moreover, due to the high latency between sites, switching the transfer protocol as for intra-site communication is not enough. To make things worse, our empirical observations showed that the direct connections between the datacenters are not always the fastest ones. This is due to the different ISP grids that connect the datacenters (the interconnecting network is not the property of the cloud provider).

Considering that many Big Data applications are executed on multiple nodes across several sites, an interesting option is to use these nodes and sites as intermediate hops between source and destination.

The proposed multi-route approach that enables such transfers is shown in Figure 3. Data to be transferred is first replicated within the same site (green links in Figure 3) and is then forwarded to the destination using nodes from other, intermediate datacenters (red arrows in the figure). Instances of the inter-site transfer module are deployed on the nodes across the datacenters and used to route data packets from one node to another, forwarding data towards the destination. In this way, the transfers leverage multiple paths, aggregating extra bandwidth between sites. This approach builds on the fact that virtual routes of different nodes are mapped to different physical paths, which cross distinct links and switches (e.g., datacenters are interconnected with the ISP grids by multiple layer 2 switches [10]). Therefore, by replicating data within the site, which is up to 10 times faster than inter-site transfers, we are able to increase the performance when moving data across sites. OverFlow also supports other optimizations: data fragmentation and re-composition using chunks of variable sizes, hashing, and acknowledgements for avoiding data loss or packet duplication. One might consider the acknowledgement-based mechanism redundant at application level, as similar functionality is provided by the underlying TCP protocol. We argue that it can be used to efficiently handle and recover from possible cloud node failures.

Our key idea for selecting the paths is to consider the cloud as a 2-layer graph. At the top layer, a vertex represents a datacenter. Considering the small number of datacenters in a public cloud (i.e., less than 20), any computation on this graph (e.g., determining the shortest path or the second shortest path) is done very fast. On the second layer, each vertex corresponds to a VM in a datacenter. The number of such nodes depends on the application deployments and can be scaled dynamically, in line with the elasticity principle of the clouds. These nodes are used for the fast local replication, with the purpose of transferring data in parallel streams between the sites. OverFlow selects the best path, direct or across several sites, and maximizes its throughput by adding nodes to it. Nodes are added on the bottom layer of the graph to increase the inter-site throughput, yielding the edges based on which the paths are selected on the top layer of the graph. When the nodes allocated on a path fail to bring performance gains, OverFlow switches to new paths, converging towards the optimal topology.

Algorithm 1 The multi-path selection across sites
 1: procedure MULTIDATACENTERHOPSEND
 2:   Nodes2Use = Model.GetNodes(budget)
 3:   while SendData < TotalData do
 4:     MonitorAgent.GetLinksEstimation()
 5:     Path = ShortestPath(infrastructure)
 6:     while UsedNodes < Nodes2Use do
 7:       deployments.RemovePath(Path)
 8:       NextPath = ShortestPath(deployments)
 9:       UsedNodes += Path.NrOfNodes()
          // Get the datacenter with minimal throughput
10:       Node2Add = Path.GetMinThr()
11:       while UsedNodes < Nodes2Use &
12:           Node2Add.Thr >= NextPath.NormalizedThr do
13:         Path.UpdateLink(Node2Add)
14:         Node2Add = Path.GetMinThr()
15:       end while
16:       TransferSchema.AddPath(Path)
17:       Path = NextPath
18:     end while
19:   end while
20: end procedure

Algorithm 1 implements this approach. The first step is to select the shortest path (i.e., the one with the highest throughput) between the source and the destination datacenters.

Then, building on the elasticity principle of the cloud, we try to add nodes to this path, within any of the datacenters that form this shortest path. More nodes add more bandwidth, translating into an increased throughput along the path. However, as more nodes are added, the additional throughput brought by them becomes smaller (e.g., due to network interference and bottlenecks). To address this issue, we also consider the next best path (computed at lines 7-8). Having these two paths, we can compare at all times the gain of adding a node to the current shortest path versus adding a new path (line 12). The tradeoff between cost and performance can be controlled by users through the budget parameter. This specifies how much users are willing to pay in order to achieve higher performance. Our solution then increases the number of intermediate nodes in order to reduce the transfer time as long as the budget allows it.
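The Python sketch below mirrors the structure of Algorithm 1 under simplifying assumptions: inter-site throughput is a dictionary refreshed by the monitoring service, the "shortest" path is the one with the highest bottleneck throughput, and the marginal gain of one extra node on a path is approximated by the path throughput divided by its current node count. The helper names are ours, not the OverFlow API.

from itertools import permutations

def bottleneck(thr, path):
    # Throughput of a datacenter path = throughput of its slowest link.
    return min(thr.get((a, b), 0.0) for a, b in zip(path, path[1:]))

def widest_path(thr, sites, src, dst):
    # Highest-throughput path between two sites; brute force is fine for < 20 datacenters.
    candidates = [[src, *mid, dst]
                  for k in range(len(sites) - 1)
                  for mid in permutations([s for s in sites if s not in (src, dst)], k)]
    return max(candidates, key=lambda p: bottleneck(thr, p))

def plan_paths(thr, sites, src, dst, nodes_budget):
    """Grow the current best path with extra nodes while one more node is still estimated
    to beat the next-best path (cf. lines 11-15 of Algorithm 1), then switch paths."""
    schema, used, remaining = [], 0, dict(thr)
    path = widest_path(remaining, sites, src, dst)
    while used < nodes_budget:
        for link in zip(path, path[1:]):                      # remove the chosen path ...
            remaining.pop(link, None)
        next_path = widest_path(remaining, sites, src, dst)   # ... and find the next-best one
        nodes = len(path)
        while used + nodes < nodes_budget and \
                bottleneck(thr, path) / nodes >= bottleneck(remaining, next_path) / len(next_path):
            nodes += 1                                        # add a node to the current path
        used += nodes
        schema.append((path, nodes))
        path = next_path
    return schema

# Example with made-up throughputs (MB/s) between three sites.
thr = {("EU-N", "US-N"): 40.0, ("EU-N", "EU-W"): 400.0, ("EU-W", "US-N"): 60.0}
print(plan_paths(thr, ["EU-N", "EU-W", "US-N"], "EU-N", "US-N", nodes_budget=8))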

4 LEVERAGING OVERFLOW AS A KEY TO SUPPORT GEOGRAPHICALLY DISTRIBUTED APPLICATIONS

We introduce a set of pluggable services that we designed on top of OverFlow, exploiting the system's knowledge about the global distributed environment to deliver a set of scientific functionalities needed by many large-scale workflows. The modular architecture makes it easy to deploy new, custom plugins.

4.1 Cost estimation engine

One of the key challenges that users face today is estimating the cost of operations in the cloud. Typical pricing estimations are done simply by multiplying the timespan or the size of the data by the advertised costs. However, such a naive cost estimation fails to capture any data management optimizations or access pattern semantics, nor can it be used to give price hints about the data flows of applications. To address this issue, we designed a cost estimation service, based on a cost model for the management of workflow data across sites.

We start by defining the transfer time ($Tt$) as the ratio between the data size and the throughput, i.e., $Tt = \frac{Size}{Thr}$. In order to integrate the multi-route transfer scheme, we express the throughput according to the contribution of each of the $N$ parallel routes. Additionally, we need to account for the overhead of replicating the data within the site compared with the inter-site communication, denoted $LRO$ (local replication overhead). The idea is that the multi-route approach is able to aggregate throughput from additional routes as long as the total time to locally replicate the data is less than the time to remotely transfer it. Empirically, we observed that local replication is about 10 times faster than inter-site communication, which can be used as a baseline value for $LRO$. Hence, we define the total throughput from the parallel streams as:

$$Thr = \sum_{i=1}^{N,\ N<LRO} thr_{link} \times \frac{LRO - (i-1)}{LRO} \qquad (1)$$

With these formulas we can now express the cost of transferring data according to the flows of data within a workflow. The cost of a geographical transfer is split into the cost charged by the cloud provider for outbound data ($outboundCost$), as inbound data is usually free, and the cost derived from leasing the VMs during the transfer. This model can also be applied to local, in-site replication by ignoring the outbound cost component in the formula. The service automatically determines whether the source and the destination belong to the same datacenter based on their IPs. According to the workflow data dissemination pattern, we have the following costs:

• Pipeline – defines the transfers between a source and a destination as:

$$Cost_{Pipeline} = N \times Tt \times VMCost + outboundCost \times Size \qquad (2)$$

• Multicast/Broadcast – defines the transfer of data towards several destination nodes ($D$):

$$Cost_{Multicast} = D \times (N \times Tt \times VMCost + outboundCost \times Size) \qquad (3)$$

• Reduce/Gather – defines the cost of assembling into one node files or pieces of data from several other nodes ($S$):

$$Cost_{Gather} = \sum_{i=1}^{S} (N \times Tt_i \times VMCost + outboundCost \times Size_i) \qquad (4)$$

One can observe that pipeline is a sub-case of multicast with one destination, and both these patterns are sub-cases of gather with only one source. Therefore, the service can identify the data flow pattern through the number of sources and destinations given as parameters. This observation allows us to gather the estimation of the cost in a single API function: EstimateCost(Sizes[], S, D, N). Hence, applications and workflow engines can obtain the cost estimation for any data access pattern, simply by providing the size of the data, the data flow type and the degree of parallelism. The other parameters, such as the link throughput and the LRO factor, are obtained directly from the other services of OverFlow.
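As a concrete reading of Equations (1)-(4), the following Python sketch mirrors the EstimateCost(Sizes[], S, D, N) call described above. The link throughput, LRO baseline and prices are placeholder values rather than actual Azure rates, and the unified loop treats pipeline and multicast as the S = 1 special cases of gather.

LRO = 10.0               # local replication overhead: intra-site copy ~10x faster than inter-site
VM_COST = 0.05 / 3600    # placeholder VM leasing price, $ per VM-second
OUTBOUND_COST = 0.08     # placeholder outbound traffic price, $ per GB

def aggregate_throughput(thr_link, n_routes):
    # Equation (1): every extra route contributes a decreasing share of the link throughput.
    return sum(thr_link * (LRO - (i - 1)) / LRO for i in range(1, min(n_routes, int(LRO)) + 1))

def estimate_cost(sizes_gb, S, D, N, thr_link=0.05):
    """Equations (2)-(4) in one function: S sources, D destinations, N parallel routes;
    thr_link is the per-route inter-site throughput in GB/s (from the monitoring service)."""
    thr = aggregate_throughput(thr_link, N)
    time_s = cost_usd = 0.0
    for size in sizes_gb[:S]:
        tt = size / thr                                   # Tt = Size / Thr
        time_s += tt
        cost_usd += (N * tt * VM_COST + OUTBOUND_COST * size) * D
    return time_s, cost_usd

# Broadcasting one 5 GB file to 3 destination sites over 4 parallel routes.
print(estimate_cost([5.0], S=1, D=3, N=4))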

This model captures the correlation between performance (time or throughput) and cost (money) and can be used to adjust the tradeoff between them, from the point of view of the number of resources or the data size. While urgent transfers can be sped up using additional routes at the expense of some extra money, less urgent ones can use a minimum set of resources at a reduced outbound cost. Moreover, by leveraging this cost model in the geographical replication/transfer service, one can set a maximum cost for a transfer, based on which our system is able to infer the amount of resources to use. Although the network or end-system performance can drop, OverFlow rapidly detects and adapts to the new conditions to satisfy the budget constraint.

4.2 Geographical replication

Turning geo-diversity into geo-redundancy requires the data or the state of applications to be distributed across sites. Data movements are time and resource consuming, and it is inefficient for applications to suspend their main computation in order to perform such operations. Therefore, we introduce a plugin that relieves applications and users of this task by providing Geo-Replication-as-a-Service on top of OverFlow. Applications simply indicate the data to be moved and the destination via an API function call, i.e., Replicate(Data, Destination). Then, the service performs the geographical replication via multi-path transfers, while the application continues uninterrupted.

Replicating data opens the possibility of different optimization strategies. By leveraging the previously introduced service for estimating the cost, the geo-replication service is able to optimize the operation for cost or for execution time. To this purpose, applications are provided with an optional parameter when calling the function (i.e., Policy). By varying the value of this parameter between 0 and 1, applications indicate a higher weight for cost (i.e., a value of 0) or for time (i.e., a value of 1), which in turn determines the amount of resources to use for replicating the data. This is done by querying the cost estimation service for the minimum and maximum times and the respective cost predictions, and then using the Policy parameter as a slider to select between them.
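One possible interpretation of the Policy slider, in Python: the linear interpolation between the cheapest and the fastest plan is our assumption of how the parameter could be mapped to resources, and the estimates would normally come from the cost estimation service of Section 4.1.

def pick_route_count(estimates, policy):
    """Policy in [0, 1]: 0 favors the cheapest plan, 1 the fastest.
    `estimates` is a list of (n_routes, time_s, cost_usd) tuples."""
    assert 0.0 <= policy <= 1.0
    fastest = min(estimates, key=lambda e: e[1])      # plan with the minimum transfer time
    cheapest = min(estimates, key=lambda e: e[2])     # plan with the minimum monetary cost
    # Slide linearly between the cheapest and the fastest resource allocation.
    return round(cheapest[0] + policy * (fastest[0] - cheapest[0]))

# Three candidate plans (routes, time in s, cost in $): Policy = 0.8 leans towards time.
plans = [(1, 600, 0.40), (4, 220, 0.55), (8, 150, 0.90)]
print(pick_route_count(plans, policy=0.8))            # -> 7 intermediate routes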

4.3 Smart compression

Data compression can have many flavors and variations. One can use it to denote the process of encoding the data based on a dictionary containing common sequences (e.g., zip) or based on previous sequences (e.g., MPEG), searching for already existing blocks of data (i.e., deduplication), or even replacing the data with the program which generated it. In the context of inter-site transfers, reducing the size of data is interesting as it can potentially reduce the cost (i.e., the price paid for outbound traffic) or the time to perform the transfer. Nevertheless, as compression techniques are typically time consuming, a smart tradeoff is needed to balance the time invested in reducing the data and the actual gains obtained when performing the transfer.

Our goal is to provide applications with a service that helps them evaluate these potential gains. First, the service leverages deduplication. Applications call the CheckDeduplication(Data, DestinationSite) function to verify in the Metadata Registry of the destination site whether (similar) data already exists. The verification is done based on the unique ID or the hash of the data. If the data exists, the transfer is replaced by the address of the data at the destination. This brings the maximum gains, both time and money wise, among all "compression" techniques. However, if the data is not already present at the destination site, its size can still potentially be reduced by applying compression algorithms. Whether to spend time and resources to apply such an algorithm, and the selection of the algorithm itself, are decisions that we leave to users, who know the application semantics.

Second, the service supports user-informed compression-related decisions, that is, compression-time or compression-cost gain estimation. To this purpose we extend the previous estimation model to include the compression operation. The updated execution-time model is given by Equation 5, while the cost is depicted in Equation 6.

$$Tt_C = \frac{Size}{Speed_{Compression}} + \frac{Size}{Factor_{Compression} \times Thr} \qquad (5)$$

$$Cost_C = \sum_{i=1}^{S} \left( N \times Tt_{C_i} \times VMCost + outboundCost \times \frac{Size_i}{Factor_{Compression}} \right) \times D \qquad (6)$$

Users can access these values by overloading the functions provided by the Cost Estimation Service with the SpeedCompression and FactorCompression parameters. Obtaining the speed with which a compression algorithm encodes data is trivial, as it only requires measuring the time taken to encode a piece of data. On the other hand, the compression rate obtained on the data is specific to each scenario, but it can be approximated by users based on their knowledge about the application semantics, by querying the workflow engine, or from previous executions. Hence, applications can query the service in real time to determine whether compressing the data reduces the transfer time or the cost, and orchestrate tasks and data accordingly.
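As an illustration, the check an application could run before compressing, transcribed from Equation (5) (the cost variant follows Equation (6) in the same way); the throughput, compression speed and compression factor below are placeholder numbers.

def compressed_transfer_time(size_gb, thr, speed_compression, factor_compression):
    # Equation (5): time to compress plus time to ship the reduced data (thr, speed in GB/s).
    return size_gb / speed_compression + size_gb / (factor_compression * thr)

def worth_compressing(size_gb, thr, speed_compression, factor_compression):
    # Compress only if the total time goes down compared to sending the raw data.
    return compressed_transfer_time(size_gb, thr, speed_compression, factor_compression) < size_gb / thr

# 20 GB at 0.05 GB/s inter-site throughput, compressor at 0.2 GB/s with a 3x reduction.
print(worth_compressing(20.0, thr=0.05, speed_compression=0.2, factor_compression=3.0))   # True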

4.4 Monitoring as a Service

Monitoring the environment is particularly important in large-scale systems, be it for assessing the resource performance or simply for supervising the applications. We provide a Monitor as a Service module, which on the one hand tracks the performance of various metrics and on the other hand offers performance estimates based on the collected samples. To obtain such estimates, we designed a generic performance model, which aggregates all information on the delivered performance (i.e., what efficiency one should expect) and the instability of the environment (i.e., how likely it is to change). Our goal is to make accurate estimations but at the same time to remain generic with our model, regardless of the tracked metrics or the environment variability. To this end, we propose to dynamically weight the trust given to each sample based on the environment behaviour deduced so far. The resulting inferred trust level is then used to update the view on the environment performance. This allows to transparently integrate samples for any metric while reasoning with all available knowledge.

The monitoring samples about the environment are collected at configurable time intervals, in order to keep the service non-intrusive. They are then integrated with the parameter trace about the environment (h – the history – gives the fixed number of previous samples that the model considers representative). Based on this, the average performance (µ – Equation 7) and the corresponding variability (σ – Equation 8) are estimated at each moment i. These estimations are updated based on the weights (w) given to each new measured sample.

$$\mu_i = \frac{(h-1) \cdot \mu_{i-1} + (1-w) \cdot \mu_{i-1} + w \cdot S}{h} \qquad (7)$$

$$\sigma_i = \sqrt{\gamma_i - \mu_i^2} \qquad (8)$$

$$\gamma_i = \frac{(h-1) \cdot \gamma_{i-1} + w \cdot \gamma_{i-1} + (1-w) \cdot S^2}{h} \qquad (9)$$

where S is the value of the new sample and γ (Equation 9) is an internal parameter. Equation 9 is obtained by rewriting the standard variability formula (i.e., $\sigma_i = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(x_j - \mu_i)^2}$) in terms of the previous value at moment i−1 and the value of the current sample. This rewriting saves the memory that would otherwise be needed to store the last h samples. The corresponding weights are selected based on the following principles:

• A high standard deviation will favor accepting new samples (even the ones farther from the average), indicating an environment likely to change;
• Samples far from the average are potential outliers and are weighted less (they may be temporary glitches);
• Less frequent samples are weighted higher, as they are more valuable.

These observations are captured in Equation 10. The formula combines the Gaussian distribution with a time-reference component, which we define based on the frequency of the sample, $t_f$, within the time reference interval $T$. The values range from 0 (no trust) to 1 (full trust).

$$w = \frac{e^{-\frac{(\mu - S)^2}{2\sigma^2}} + \left(1 - \frac{t_f}{T}\right)}{2} \qquad (10)$$
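A compact Python transcription of the update rules (7)-(10) follows; the only state kept per metric is (µ, γ) and the sampling frequency, which is precisely what avoids storing the last h samples. The class and parameter names are ours, and the sample values are arbitrary.

import math

class MetricModel:
    """Streaming estimate of one monitored metric, following Equations (7)-(10)."""
    def __init__(self, history, time_ref, initial):
        self.h = history           # number of past samples considered representative
        self.T = time_ref          # time reference interval for the sampling frequency
        self.mu = initial          # running average (Eq. 7)
        self.gamma = initial ** 2  # internal second-moment parameter (Eq. 9)

    @property
    def sigma(self):
        # Eq. 8 (clamped at zero for numerical safety).
        return math.sqrt(max(self.gamma - self.mu ** 2, 0.0))

    def update(self, sample, sample_freq):
        # Eq. 10: trust a sample more when it is close to the average (Gaussian term)
        # and when samples are infrequent (time-reference term).
        gauss = math.exp(-((self.mu - sample) ** 2) / (2 * self.sigma ** 2)) if self.sigma > 0 else 1.0
        w = (gauss + (1 - sample_freq / self.T)) / 2
        # Eq. 7 and Eq. 9, written exactly as in the paper.
        self.mu = ((self.h - 1) * self.mu + (1 - w) * self.mu + w * sample) / self.h
        self.gamma = ((self.h - 1) * self.gamma + w * self.gamma + (1 - w) * sample ** 2) / self.h

# Feed a few inter-site throughput samples (MB/s) with h = 10 and a 60 s reference window.
model = MetricModel(history=10, time_ref=60.0, initial=100.0)
for s in (98.0, 103.0, 250.0, 101.0):
    model.update(s, sample_freq=6.0)
print(round(model.mu, 1), round(model.sigma, 1))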

The metrics considered by the service are: available bandwidth, I/O throughput, CPU load, memory status, inter-site links and VM routes. The available bandwidth between the nodes and the datacenters is measured using the Iperf software [17]; the throughput and the interconnecting VM routes are computed based on time measurements of random data transfers; the CPU and memory performance is evaluated using a benchmark that we implemented based on the Kabsch algorithm [18]. New metrics can be easily defined and integrated as pluggable monitoring modules. The very simplicity of the model allows it to be general, but at the expense of becoming less accurate for some precise, application-dependent settings. It was our design choice to trade accuracy for generality.

5 EVALUATION

In this section we analyze the impact of the overhead of the data services on the transfer performance, in an effort to understand to what extent they are able to customize the management of data. We also provide an analysis that helps users understand the costs involved by certain performance levels. The metric used as a reference is the data transfer time, complemented for each experiment with the relevant "cost" metric for each service (e.g., size, money, nodes). We focus on both synthetic benchmarks (Sections 5.2, 5.3, 5.4) and real-life workflows (Sections 5.5, 5.6).

5.1 Experimental setup

The experiments are performed on the Microsoft Azure cloud, at the PaaS level, using the EU (North, West) and US (North-Central, West, South and East) datacenters. The system is run on Small (1 CPU, 1.75 GB memory and 100 Mbps) and Medium (2 CPU, 3.5 GB memory and 200 Mbps) VM instances, deploying tens of VMs per site, reaching a total of 120 nodes and 220 cores in the global system.

The experimental methodology considers 3 metrics: execution time, I/O throughput and cost. The execution time is measured using the timer functions provided by the .NET Diagnostics library. The throughput is determined at the destination as the ratio between the amount of data sent and the operation time. Finally, the cost is computed based on Azure prices and policies [19]. Each value reported in the charts is the average of tens of measurements performed at different moments of the day.


Fig. 4. The I/O time per node (in seconds, as a function of data size in MB), when 2 input files are downloaded and 1 is uploaded, for FTP transfers, Torrent transfers, OverFlow and Azure Blobs.

Fig. 5. Transfer time (in seconds, as a function of data size in MB) for different data sizes, comparing Azure Blobs, OverFlow, EndPoint2EndPoint and GEO-DMS.

5.2 Evaluating the intra-site transfers

The first scenario measures the improvements brought to I/O-intensive applications by adaptively switching between transfer protocols. Multiple nodes read and produce files, concurrently accessing the data management system to replicate (multicast) and transfer (pipeline) them within the deployment. We compare OverFlow with 2 other solutions: using the remote cloud storage (i.e., Azure Blobs [20]) as an intermediate, and several static transfer mechanisms (e.g., FTP or Torrent). Figure 4 presents the average I/O time per node. We notice that the adaptive behaviour of our solution, which alternates the transfer strategies, leads to a 2x speedup compared to static file handling and a 5x speedup compared to Azure Blobs. Local file management options (OverFlow, FTP, Torrent) are better than the remote cloud storage because they leverage deployment locality by managing files within the compute nodes. Finally, our approach outperforms static FTP or Torrent because it dynamically switches between them, i.e., it uses Torrent for multicast operations and FTP for direct transfers.

5.3 Evaluating the inter-site transfers

Clouds lack proper support for inter-site application data transfers. We compared our transfer/replication service with several state-of-the-art solutions adapted as placeholders: the Globus Online tool (which uses GridFTP as a server backend), the Azure Blobs cloud storage (used as an intermediate storage for passing data) and direct transfers between endpoints, denoted EndPoint2EndPoint. The results of this evaluation are shown in Figure 5. Azure Blobs is the slowest option, with the transfer performed in 2 steps: a writing phase from the source node to the storage, followed by a read phase in which data is downloaded at the destination. Despite the high latencies of the HTTP-based access, the cloud storage is the only option currently available by default. Globus Online, the state-of-the-art alternative, lacks cloud awareness, being more adequate for managing the data collections of warehouses than application data exchanges.

1 2 3 4 5 6 7 8 9 10

Time (minutes)

0

2000

4000

6000

8000

10000

12000

14000

16000

Thro

ughp

ut (

MB

/s)

OverFlowShortestPath dynamicDirectLinkShortestPath static

a) The aggregation of throughput in time

5 10 15 20 25

Number of Nodes Used

0

2000

4000

6000

8000

10000

12000

14000

16000

Thro

ughp

ut (

MB

/s)

OverFlowShortestPath dynamicDirectLinkShortestPath static

b) The throughput based on number of nodes

Fig. 6. The throughput between North EU and NorthUS sites for different multi-route strategies, evaluated:a) in time, with a fixed number of node (25); b) basedon multiple nodes, with a fix time frame (10 minutes)

being more adequate for managing data collectionsof warehouses rather than application data exchanges.Finally, the EndPoint2EndPoint does not aggregate ex-tra bandwidth between datacenters implementing thetransfers over the basic, but widely used, TCP client-server socket communication. Our approach reducesthe transfer times with a factor of 5 over the cloudoffering and with up to 50% over the other options.Budget-wise, these approaches do not incur the samecosts: Azure Blobs charges extra cost for storage; the1 node EndPoint2EndPoint cost/performance tradeoffis analyzed in Section 5.4; Globus Online incurs highercosts than OverFlow as it delivers lower performancewith the same resource set.

Next, we consider a scenario in which additional sites are used as intermediate routing hops to transfer the data. For this experiment we considered that the service is available on nodes deployed across all 6 US and EU sites. Figure 6 presents the evaluation of the proposed approach, described in Algorithm 1. We compare our solution with 3 other transfer strategies which schedule the transfer across multiple nodes, all using an equal number of nodes. The first option, denoted DirectLink, considers only direct transfers between the source and destination datacenters, meaning that all nodes are allocated in these two sites. As this direct path might in fact not be the shortest one, the other strategies consider the shortest path computed using Dijkstra's algorithm, which can span multiple sites. The selection of the shortest path for routing the transfer can be done: 1) once, at the beginning of the transfer (this option is denoted ShortestPath static); or 2) regularly, based on a fresh view of the environment obtained from the Monitor Service (denoted ShortestPath dynamic).

In Figure 6 a) we present the cumulative throughput obtained with each strategy when scheduling 25 nodes. The shortest path strategy and our approach have similar performance during the first part of the transfer. This happens because our algorithm extends the shortest path strategy. The difference lies in the mechanism for selecting alternative paths when the gain brought by a node along the initial path becomes smaller than the gain of switching to a new path (due to congestion). As a result, the improvement brought by our approach increases with time, reaching 20% in the 10-minute window considered. Conversely, selecting the routes with static strategies degrades the performance over time, thus being inefficient for Big Data. In Figure 6 b) we analyze the throughput when increasing the number of nodes. The goal is to understand how much data can be sent given a resource set and a timeframe (here 10 minutes). For small setups, all strategies perform roughly similarly. However, for larger setups, our algorithm is able to better schedule resource placement, which translates into higher throughput, in line with Big Data application needs.

Fig. 7. The tradeoff between transfer time and cost for multiple VM usage (1 GB transferred from EU to US).

5.4 Efficient resource usage with the Cost Estimator Service

Our next experiment focuses on the relation between price and transfer efficiency. The Cost Estimator Service maps the costs to the performance gains (i.e., speedup) obtained by the parallel transfer across intermediate nodes. As the degree of parallelism is scaled, we observe in Figure 7 a non-linear reduction of the transfer time. This demonstrates that each allocated node brings a lower performance gain, as modeled in Equation 1. At the same time, the cost of the nodes remains the same, regardless of how efficiently they are used. Up to a certain point, the time reduction obtained by using more nodes prevents the transfer cost from growing significantly. This is the case in Figure 7 when using 3 to 5 VMs: despite paying for more resources, the transfer time reduction balances the cost. When scaling beyond this point, the performance gain per new node starts to decrease, which results in a price increase with the number of nodes. Looking at the cost/time ratio, an optimal point is found around 6 VMs for this case (the maximum time reduction for a minimum cost). However, depending on how urgent the transfer is, different transfer times can be considered acceptable. Having a service able to offer such a mapping between cost and transfer performance enables applications to adjust the cost and resource requirements according to their needs and deadlines, while keeping costs under control.
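The kind of mapping the Cost Estimator Service exposes can be illustrated with the short sketch below; the diminishing-returns (Amdahl-like) speedup form and the prices are assumptions made for the example, standing in for Equation 1 and the actual provider tariffs.

    # Illustrative cost/time mapping; the speedup form and prices are assumptions,
    # standing in for Equation 1 and the provider's real pricing.
    def transfer_time(base_time_s: float, n_vms: int, parallel_fraction: float = 0.9) -> float:
        """Each extra VM only speeds up the parallelizable share of the transfer."""
        return base_time_s * ((1 - parallel_fraction) + parallel_fraction / n_vms)

    def transfer_cost(base_time_s: float, n_vms: int,
                      vm_price_per_hour: float = 0.09, egress_cost: float = 0.05) -> float:
        hours = transfer_time(base_time_s, n_vms) / 3600.0
        return n_vms * vm_price_per_hour * hours + egress_cost

    # Sweep the parallelism degree to expose the cost/time tradeoff to the application.
    for n in range(1, 11):
        print(f"{n:2d} VMs: {transfer_time(600, n):6.1f} s, {transfer_cost(600, n):.4f} $")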

5.5 Evaluating the gains brought by the Smart Compression Service

Sending data across sites is an expensive operation, both money- and time-wise.

Fig. 8. The overhead of checking for deduplication on the remote site before sending the data.

One of the functionalities of this service is deduplication: it verifies on the remote sites whether (parts of) the data are already present and thus avoids duplicate transfers. Performing this check adds extra time, presented in Figure 8. The measurements are done for a scenario with 100 MB of data transferred from West EU. The overhead introduced by our service for the deduplication check is small (∼5%), with only one call to the remote service. This observation, and the possibility of fully eliminating the transfer, make this service a viable option for inter-site data management.
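A minimal sketch of such a deduplication check is shown below, assuming a chunk-level content hash exchanged with the destination site in a single call; the chunk size, helper names and hashing scheme are illustrative and not OverFlow's actual protocol.

    import hashlib

    CHUNK = 4 * 1024 * 1024  # assumed 4 MB deduplication granularity

    def chunk_digests(data: bytes):
        """Content hash of each fixed-size chunk of the data set."""
        return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
                for i in range(0, len(data), CHUNK)]

    def missing_chunks(data: bytes, remote_digests: set):
        """Offsets of the chunks the destination does not already hold; only
        these need to be transferred."""
        return [i * CHUNK for i, d in enumerate(chunk_digests(data))
                if d not in remote_digests]

    # If the remote site already stores every digest, the transfer is fully
    # eliminated and only the single metadata call is paid.
    data = b"x" * (10 * CHUNK)
    print(missing_chunks(data, set(chunk_digests(data))))  # -> []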

When avoiding transfers is not possible, the alternative for reducing costs is to compress the data. However, such an action increases the overall time to execute the operation, as captured by Equation 5. On the other hand, it reduces the costs (Equation 6) involved in moving the data, considering that outbound traffic is charged more than leasing VMs [19]. In Figure 9 we present the relation between the extra time needed to compress the data and the cost savings. The results are computed as the time and cost differences when compression is enabled/disabled. In this experiment, we consider that a protein data set of 800 MB is compressed at a speed of 5 MB/s, achieving a compression ratio of 50% (i.e., the data represents a protein set of the BLAST scientific workflow [21]). Not only is the transfer cost reduced via compression, but the multi-route approach can also compensate through a higher parallelism degree for the compression. Using more resources decreases the transfer time, making the compression time dominant for the equivalent setup with replication enabled. Consequently, the "compute cost" component charged for resource usage decreases without compression, compensating for the decrease in outbound traffic cost brought by compression. Hence, the service enables applications to identify and adopt one strategy or the other, according to their needs.
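The decision sketched below mirrors this tradeoff, comparing the outbound-traffic savings with the extra VM-time spent compressing; the prices, compression speed and ratio are example values, not the actual terms of Equations 5 and 6.

    # Compress-or-not decision mirroring the Equation 5/6 tradeoff; prices,
    # speed and ratio are example values, not measured ones.
    def compression_tradeoff(size_gb: float, compress_speed_mbs: float = 5.0,
                             ratio: float = 0.5, egress_per_gb: float = 0.08,
                             vm_price_per_hour: float = 0.09, n_vms: int = 1):
        extra_time_s = size_gb * 1024 / compress_speed_mbs       # added latency
        traffic_saving = size_gb * ratio * egress_per_gb         # cheaper egress
        extra_compute = n_vms * vm_price_per_hour * extra_time_s / 3600.0
        return extra_time_s, traffic_saving - extra_compute      # (time, net saving)

    extra_s, net_saving = compression_tradeoff(0.8)  # e.g. the 800 MB protein set
    print(f"compression adds {extra_s:.0f} s and saves {net_saving:.4f} $ net")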

5.6 Impact of OverFlow on real-life workflows

Next, we evaluate the benefits of our approach for single- and multiple-datacenter workflow executions.

5.6.1 Single-site: MapReduce benchmarking

MapReduce [22] is one of the most widespread types of computation performed nowadays on the cloud, making it an interesting candidate to evaluate the impact of OverFlow.

Fig. 9. The cost savings and the extra time required if data is compressed before sending (x-axis: number of VMs; left y-axis: extra compression time in seconds; right y-axis: cost savings in euro cents).

We propose the following experiment: we take an actual MapReduce engine (i.e., the Hadoop on Azure service [23], denoted HoA) and a general-purpose workflow engine (i.e., the open-source Microsoft Generic Worker [12], denoted GW) and run a typical WordCount MapReduce benchmark. We used an input of 650 MB of data, processed in 2 setups: with 20 mappers (32 MB per job) and 40 mappers (16 MB per job); the number of reducers is 3 in both cases. The results, presented in Figure 10, show, as expected, that the specialized MapReduce tool is more efficient. The question is: can OverFlow improve the performance of a workflow engine compared to a MapReduce engine? To answer it, we replaced the Azure Blob based storage of the Generic Worker with our approach (denoted GW++) and re-ran the benchmark. We notice that GW++ achieves a 25% speedup compared to Hadoop, building on its protocol-switching approach, which enables it to use the best local transfer option according to the data access pattern.

5.6.2 Single-site: BLAST - protein sequencing

The next set of experiments focuses on the benefits that our approach can bring to a real-life scientific workflow, BLAST [21], which compares biological sequences to identify those that resemble each other. The workflow is composed of 3 types of jobs: a splitter for input partitioning (an 800 MB protein set), the core algorithm (i.e., the BLAST jobs) matching the partitions against reference values stored in 3 database files (the same for all jobs), and an assembler to aggregate the results. We show in Figure 11 the makespan of executing the BLAST analysis with the Generic Worker engine, using its default data management backend and our adaptive solution. We test both small and large file transfers, as increasing the number of jobs translates into smaller intermediate files. We also perform a broadcast, as the input database files (with a constant size of ∼1.6 GB) are disseminated to all jobs. We notice that protocol switching among these patterns pays off, reducing the data management time by more than half.

5.6.3 Multi-site: A-Brain – searching for neuroimaging-genetic correlations

Finally, we present an evaluation of the time reductions obtained with the multi-route approach for wide-area transfers. We perform the evaluation on a real-life application, A-Brain, which performs a joint genetic and neuro-imaging data analysis [24].

Fig. 10. Map and Reduce times for WordCount (32 MB vs. 16 MB / map job).

Fig. 11. The BLAST workflow makespan for 25, 50 and 100 BLAST jobs, with the Generic Worker using Azure Blobs and OverFlow as backends: the compute time is the same (marked by the horizontal line), the remaining time is the data handling.

A-Brain has high resource requirements, which leads to its execution across 3 cloud datacenters. In this scenario, we focus on the transfer times of 1000 files of partial data, which are aggregated in one site to compute the final result. We compare OverFlow with transfers done via Azure Blobs. The results are shown in Figure 12 for multiple file sizes, resulting from different input data sets and configurations. For small data sets (108 MB, resulting from 3x1000x36 KB files), the overhead introduced by our solution due to the extra acknowledgements makes the transfer less efficient. However, as the data size grows (120 GB), the total transfer time is reduced by a factor of 3.

6 RELATED WORK

The handiest option for handling data distributed across several datacenters is to rely on the existing cloud storage services (e.g., Amazon S3, Azure Blobs). This approach allows data to be transferred between arbitrary endpoints via the cloud storage and is adopted by several systems in order to manage data movements over wide-area networks [2]. Typically, these systems are not concerned with achieving high throughput, nor with potential optimizations, let alone with offering the ability to support different data services (e.g., geographically distributed transfers). Our work aims to specifically address these issues.

Besides storage, there are a few cloud-provided services that focus on data handling. Some of them use the geographical distribution of data to reduce the latencies of data transfers. Amazon's CloudFront [25], for instance, uses a network of edge locations around the world to cache copies of static content close to users. The goal here is different from ours: this approach is meaningful when delivering large popular objects to many end users. It lowers the latency and allows high, sustained transfer rates.

Fig. 12. A-Brain execution times across 3 sites, using Azure Blobs and OverFlow as backends, to transfer data towards the NUS datacenter (1000 x 36 KB, 1000 x 1 MB and 1000 x 40 MB files).

Similarly, [26] and [27] considered the problem of scheduling data-intensive workflows in clouds, assuming that files are replicated on multiple execution sites. These approaches can reduce the makespan of the workflows, but come at the cost and overhead of replication. In contrast, we extend this approach to also exploit the data access patterns and to leverage a cost/performance tradeoff that allows per-file optimization of transfers.

The alternative to the cloud offerings consists of the transfer systems that users can choose and deploy on their own, which we generically call user-managed solutions. A number of such systems emerged in the context of the GridFTP [28] transfer tool, initially developed for grids. In these private infrastructures, information about the network bandwidth between nodes, as well as the topology and the routing strategies, is publicly available. Using this knowledge, transfer strategies can be designed to maximize certain heuristics [29]; or the entire network of nodes across all sites can be viewed as a flow graph and the transfer scheduling can be solved using flow-based graph algorithms [30]. However, in the case of public clouds, information about the network topology is not available to users. One option is to profile the performance. Even with this approach, in order to apply a flow algorithm the links between all nodes need to be continuously monitored. Such monitoring would incur a huge overhead and impact the transfer itself. Among these systems, the work most comparable to ours is Globus Online [31], which provides high-performance file transfers through intuitive web 2.0 interfaces, with support for automatic fault recovery. However, Globus Online only performs file transfers between GridFTP instances, remains unaware of the environment and therefore its transfer optimizations are mostly done statically. Several extensions brought to GridFTP allow users to enhance transfer performance by tuning some key parameters: threading in [32] or overlays in [29]. Still, these works only focus on optimizing some specific constraints and ignore others (e.g., TCP buffer size, number of outbound requests). This leaves the burden of applying the most appropriate settings to users. In contrast, we propose a self-adaptive approach through a simple and transparent interface that does not require additional user management.

Other approaches aim at improving the throughput by exploiting the network parallelism, the end-system parallelism, or a hybrid of the two. Building on network parallelism, the transfer performance can be enhanced by routing data via intermediate nodes chosen to increase the aggregate bandwidth. Multi-hop path splitting solutions [29] replace a direct TCP connection between the source and destination by a multi-hop chain through intermediate nodes. Multi-pathing [33] employs multiple independent routes to simultaneously transfer disjoint chunks of a file to its destination. These solutions come at some cost: under heavy load, per-packet latency may increase due to timeouts, while more memory is needed for the receive buffers. On the other hand, end-system parallelism can be exploited to improve the utilization of a single path by means of parallel streams [31] or concurrent transfers [34]. However, one should also consider the system configuration, since specific local constraints (e.g., low disk I/O speeds or over-tasked CPUs) may introduce bottlenecks. One issue with all these techniques is that they cannot be ported to clouds, since they strongly rely on the underlying network topology, which is unknown at the user level.

Traditional techniques commonly found in scientific computing, e.g., relying on parallel file systems, are not always adequate for processing Big Data on clouds. Such architectures usually assume high-performance communication between computation nodes and storage nodes (e.g., PVFS [35], Sector [36]). This assumption does not hold in current cloud architectures, which exhibit much higher latencies between compute and storage resources within a site, and even higher ones between datacenters.

7 CONCLUSION

This paper introduces OverFlow, a data management system for scientific workflows running in large, geographically distributed and highly dynamic environments. Our system is able to effectively use the high-speed networks connecting the cloud datacenters through optimized protocol tuning and bottleneck avoidance, while remaining non-intrusive and easy to deploy. Currently, OverFlow is used in production on the Azure cloud, as a data management backend for the Microsoft Generic Worker workflow engine.

Encouraged by these results, we plan to further investigate the impact of metadata access on the overall workflow execution. For scientific workflows handling many small files, this can become a bottleneck, so we plan to replace the per-site metadata registries with a global, hierarchical one. Furthermore, an interesting direction to explore is a closer integration between OverFlow and an ongoing work [37] on handling streams of data in the cloud, as well as other data processing engines. To this end, an extension of the semantics of the API is needed.

REFERENCES

[1] "Azure Successful Stories," http://www.windowsazure.com/en-us/case-studies/archive/.
[2] T. Kosar, E. Arslan, B. Ross, and B. Zhang, "StorkCloud: Data transfer scheduling and optimization as a service," in Proceedings of the 4th ACM Science Cloud '13, 2013, pp. 29–36.
[3] N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, "Inter-datacenter bulk transfers with NetStitcher," in Proceedings of the ACM SIGCOMM 2011 Conference, 2011, pp. 74–85.
[4] "Cloud Computing and High-Energy Particle Physics: How ATLAS Experiment at CERN Uses Google Compute Engine in the Search for New Physics at LHC," https://developers.google.com/events/io/sessions/333315382.
[5] A. Costan, R. Tudoran, G. Antoniu, and G. Brasche, "TomusBlobs: scalable data-intensive processing on Azure clouds," Concurrency and Computation: Practice and Experience, 2013.
[6] R. Tudoran, A. Costan, R. R. Rad, G. Brasche, and G. Antoniu, "Adaptive file management for scientific workflows on the Azure cloud," in BigData Conference, 2013, pp. 273–281.
[7] R. Tudoran, A. Costan, R. Wang, L. Bougé, and G. Antoniu, "Bridging data in the clouds: An environment-aware system for geographically distributed data transfers," in Proceedings of the 14th IEEE/ACM CCGrid 2014, 2014. [Online]. Available: http://hal.inria.fr/hal-00978153
[8] H. Hiden, S. Woodman, P. Watson, and J. Cała, "Developing cloud applications using the e-Science Central platform," in Proceedings of Royal Society A, 2012.
[9] K. R. Jackson, L. Ramakrishnan, K. J. Runge, and R. C. Thomas, "Seeking supernovae in the clouds: a performance study," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 421–429.
[10] A. Greenberg, J. Hamilton, D. A. Maltz, and P. Patel, "The cost of a cloud: research problems in data center networks," SIGCOMM Comput. Commun. Rev., vol. 39, no. 1, pp. 68–73, Dec. 2008.
[11] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," in Proc. of the 2nd ACM SIGOPS/EuroSys 2007, ser. EuroSys '07, 2007, pp. 59–72.
[12] Y. Simmhan, C. van Ingen, G. Subramanian, and J. Li, "Bridging the gap between desktop and the cloud for eScience applications," in Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing, 2010, pp. 474–481.
[13] E. S. Ogasawara, J. Dias, V. Silva, F. S. Chirigati, D. de Oliveira, F. Porto, P. Valduriez, and M. Mattoso, "Chiron: a parallel engine for algebraic scientific workflows," Concurrency and Computation: Practice and Experience, vol. 25, no. 16, pp. 2327–2341, 2013. [Online]. Available: http://dx.doi.org/10.1002/cpe.3032
[14] "Z-CloudFlow," http://www.msr-inria.fr/projects/z-cloudflow-data-workflows-in-the-cloud.
[15] "FTP library," http://www.codeproject.com/Articles/380769.
[16] "BitTorrent," http://www.mono-project.com/MonoTorrent.
[17] "Iperf," http://iperf.fr/.
[18] W. Kabsch, "A solution for the best rotation to relate two sets of vectors," Acta Crystallographica A, 32:922–923, 1976.
[19] "Azure Pricing," http://azure.microsoft.com/en-us/pricing/calculator/.
[20] B. Calder et al., "Windows Azure Storage: a highly available cloud storage service with strong consistency," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, ser. SOSP '11, 2011, pp. 143–157.
[21] "BLAST," http://blast.ncbi.nlm.nih.gov/Blast.cgi.
[22] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[23] "Hadoop on Azure," https://www.hadooponazure.com/.
[24] B. Da Mota, R. Tudoran, A. Costan, G. Varoquaux, G. Brasche, P. Conrod, H. Lemaitre, T. Paus, M. Rietschel, V. Frouin, J.-B. Poline, G. Antoniu, and B. Thirion, "Generic machine learning pattern for neuroimaging-genetic studies in the cloud," Frontiers in Neuroinformatics, vol. 8, no. 31, 2014.
[25] "CloudFront," http://aws.amazon.com/cloudfront/.
[26] S. Pandey and R. Buyya, "Scheduling workflow applications based on multi-source parallel data retrieval," Comput. J., vol. 55, no. 11, pp. 1288–1308, Nov. 2012.
[27] L. Ramakrishnan, C. Guok, K. Jackson, E. Kissel, D. M. Swany, and D. Agarwal, "On-demand overlay networks for large scientific data transfers," in Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, ser. CCGRID '10, 2010, pp. 359–367.
[28] W. Allcock, "GridFTP: Protocol Extensions to FTP for the Grid," Global Grid Forum GFD-RP, 20, 2003.
[29] G. Khanna, U. Catalyurek, T. Kurc, R. Kettimuthu, P. Sadayappan, I. Foster, and J. Saltz, "Using overlays for efficient data transfer over shared wide-area networks," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, pp. 47:1–47:12.
[30] G. Khanna, U. Catalyurek, T. Kurc, R. Kettimuthu, P. Sadayappan, and J. Saltz, "A dynamic scheduling approach for coordinated wide-area data transfers using GridFTP," in Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–12.
[31] T. J. Hacker, B. D. Noble, and B. D. Athey, "Adaptive data block scheduling for parallel TCP streams," in Proc. of the 14th IEEE High Performance Distributed Computing, ser. HPDC '05, 2005, pp. 265–275.
[32] W. Liu, B. Tieman, R. Kettimuthu, and I. Foster, "A data transfer framework for large-scale science experiments," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 717–724.
[33] C. Raiciu, C. Pluntke, S. Barre, A. Greenhalgh, D. Wischik, and M. Handley, "Data center networking with multipath TCP," in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, ser. Hotnets-IX, 2010, pp. 10:1–10:6.
[34] W. Liu, B. Tieman, R. Kettimuthu, and I. Foster, "A data transfer framework for large-scale science experiments," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010, pp. 717–724.
[35] P. Carns, W. B. Ligon, R. B. Ross, and R. Thakur, "PVFS: a parallel file system for Linux clusters," in Proc. of the 4th Annual Linux Showcase & Conference, 2000.
[36] R. L. Grossman, Y. Gu, M. Sabala, and W. Zhang, "Compute and storage clouds using wide area high performance networks," Future Gener. Comput. Syst., vol. 25, pp. 179–183, 2009.
[37] R. Tudoran, O. Nano, I. Santos, A. Costan, H. Soncu, L. Bougé, and G. Antoniu, "JetStream: Enabling high performance event streaming across cloud data-centers," in Proc. of DEBS '14, ACM, 2014, pp. 23–34.

Radu Tudoran is a PhD student at ENS / IRISA Rennes working on cloud data management at large scale. He graduated from the Technical University of Cluj-Napoca, Romania in 2010 and then received a Master Degree in 2011 from University of Rennes 1 / ENS Rennes. He is the main architect and technical contributor to the A-Brain project on high-performance processing for bio-informatics applications on Azure clouds, in collaboration with Microsoft Research.

Alexandru Costan is an Associate Professor at INSA Rennes and a member of Inria's KerData team. In 2011 he obtained a Ph.D. in Computer Science from Politehnica University of Bucharest for a thesis focused on self-adaptive behavior of large-scale systems, as one of the main contributors to the MonALISA monitoring framework. He is currently involved in several research projects, working on efficient Big Data management on clouds and hybrid infrastructures.

Gabriel Antoniu is a Senior Research Scientist at Inria Rennes, where he founded and is leading the KerData team focusing on scalable distributed storage. His current research activities include data management for large-scale distributed infrastructures including clouds and post-petascale HPC machines. In this area, he leads the MapReduce ANR project, and the A-Brain and Z-CloudFlow projects on scalable storage for Azure clouds in partnership with Microsoft Research.