Transcript of “Geo-Distributed Big Data Processing” (49 slides) by Patrick Eugster, stg-tud.github.io/.../CTBD_13_geo_distributed_big_data.pdf

Page 1

Geo-Distributed Big Data Processing

Patrick Eugster

Page 2

Outline

• Big data background

• Geo-distribution motivation

• Geo-distributed tasks

• Geo-distributed workflows

• Conclusions and outlook


Page 4

Big Data

• Large datasets ranging from hundreds of GBs to hundreds of TBs (for most users) and even PBs for large corporations [Wikipedia]

• Often GB range [Schwarzkopf et al.;HotCloud’12]

• Too large for traditional relational database tools and single nodes to handle

• Processed using data-parallel software running on tens, hundreds, or even thousands of computers

Page 5

Big Data - Why?

• We need it

• More users connected to the Internet: “Everyone on earth will be connected to the Internet by 2020” [E. Schmidt’13]

• We want it

• Applications use large datasets, e.g., for operation, monitoring, auditing, knowledge extraction

• Because we can

• Large amounts of cheap “cloud” storage available: “Amazon S3 contains over 449 billion objects and during peak time, processes more than 290K requests per second” [AWS blog’11]

Page 6

Processing Big Data

• MapReduce (MR) popularized by [Dean and Ghemawat;OSDI’04]

• Inspired by functional programming

• Consists of two phases

• map - takes input records and outputs sets of <key, value> pairs

• reduce - handles the set of values for a given key and emits sets of values (both phases are sketched below)

• Open source Apache Hadoop

• HDFS distributed file system inspired by Google’s GFS [Ghemawat et al.;SOSP’03]
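To make the two phases concrete, here is a minimal word-count sketch against the standard Hadoop MapReduce API; this is a generic illustration, not code taken from the slides: map emits a <word, 1> pair per token, and reduce sums the counts per word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: takes input records (line offset, line text) and outputs <word, 1> pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        ctx.write(word, ONE);
      }
    }
  }
}

// reduce: handles the set of values for a given key and emits <word, total count>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) sum += c.get();
    ctx.write(word, new IntWritable(sum));
  }
}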

Page 7

Workflow Programming

• Many “high-level languages” proposed, e.g.,

• Pig Latin [Olston et al.;SIGMOD’08]

• (Mostly) declarative untyped scripting language

• Open source Apache Pig

• Flume Java [Chambers et al.;PLDI‘10]

• Java library

• Open source Apache Crunch

• Many compile to MR

[Figure: a Pig Latin-style script as a dataflow of logical operations (Load, Filter, Validate, Join, Group, For Each, Store), annotated with the MapReduce job boundaries it compiles to.]

Page 8

Pig Latin Example

input_lines = LOAD 'input_file' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
word_groups = GROUP words BY word;
word_count = FOREACH word_groups GENERATE group, COUNT(words);
STORE word_count INTO 'output_file';

“Word count”

“Yahoo estimates that between 40% and 60% of its Hadoop workloads are generated from Pig [...] scripts. With 100,000 CPUs at Yahoo and roughly 50% running Hadoop, that’s a lot [...]” [IBM DeveloperWorks’12]


Page 10

Outline

• Big data background

• Geo-distribution motivation

• Geo-distributed tasks

• Geo-distributed workflows

• Conclusions and outlook

Page 11

Geo-Distributed Big Data

• Many large datasets geo-distributed, i.e., split across sites

• Stored near their respective sources or near frequently accessing entities

• Gathered and stored by different (sub-)organizations yet shared towards a common goal

• E.g., US census, Google “buckets”

• Replicated across datacenters for availability, but only incompletely, to limit the overhead of updates

Page 12

Geo-Distributed Big Data

• Many analysis tasks involve several datasets, which may be distributed

• Legal constraints may confine certain datasets to specific locations

• The “cloud” is not a single datacenter

• Inter-DC latency ≠ intra-DC latency

Page 13

Concrete Scenario

• Global web-based service provider

• Serve customers from close-by datacenters

• “Regional” customer bases

• Run analyses across all regions

• E.g., average age of customers buying product x

Page 14

GD in current Toolchain

• Hadoop

• HDFS

• Pig Latin, Flume Java et al.

• Assumes uniform latencies

• Reducer placement based on resource availability

• Data must be in one HDFS instance or S3 bucket

• Single point of management (namenode)

• Performs poorly with high and/or inconsistent latencies

• Inherit weaknesses of underlying systems

• No support for expressing distribution

Page 15

Potential for Improvement

• Conjecture: poor execution choices result in high costs/delays

• E.g., US Census 2000 data (121 GB), 2 Amazon EC2 datacenters, MapReduce cluster of 10 nodes each

• Two tasks (MR jobs): (1) filter records, (2) group records

• Associative: can execute on subsets of data and then aggregate

[Charts: monetary cost ($) and execution time (hours) for two execution plans over the CENSUS data: copying the full 121 GB across datacenters before filtering and grouping, versus filtering locally first and copying only the resulting 61 GB before grouping.]
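A rough back-of-envelope reading of why the plans differ in cost (assuming the 121 GB are split roughly evenly between the two datacenters and the $0.1/GB inter-datacenter transfer price quoted later in the evaluation setup; the exact figures on the slide's charts are not reproduced here):

• Copy first, then filter/group: roughly 121/2 ≈ 60.5 GB of cross-datacenter transfer, i.e., about $6 in transfer fees.

• Filter first, then copy and group: roughly 61/2 ≈ 30.5 GB transferred, i.e., about $3 in transfer fees.

Pushing the size-reducing filter to where the data sits thus roughly halves the transfer bill, before even counting the compute-hour differences.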

Page 16

State of the Art

• GD storage: Many systems, e.g., [Lloyd et al.;SOSP’11], [Sovran et al.;SOSP’11], [Cho&Aguilera;ATC’12], [Sciascia&Pedone;DSN’13], [Zhang et al.;SOSP’13], consider GD data reads & writes.

• GD data location: Volley [Agarwal et al.;NSDI’10] or [Tran et al.;ATC’11] migrate GD big data based on application needs.

• GD computation: HOG [Weitzel et al.;MTAGS’12] modifies Hadoop for the Open Science Grid; focus on site failures, not performance. G-Hadoop [Wang et al.;Future Gen. Comp. Systems’13] is similar.

• (G)D programming: Flink [Ewen et al.;PVLDB’12], Presto [Venkataraman et al.;HotCloud’12], and Spark [Zaharia et al.;NSDI’12] support distributed data structures, but still within a single datacenter.

Page 17

Outline

• Big data background

• Geo-distribution motivation

• Geo-distributed tasks

• Geo-distributed workflows

• Conclusions and outlook

Page 18

GD Tasks [Jayalath&Eugster;IEEE TC’14]

• Dataset I distributed across n datacenters (DC1 to DCn), each with an execution cluster

• Sequence of m tasks T1 to Tm (cf. transducers)

[Figure: datacenters DC1 ... DCn, each with an execution cluster Xi holding a sub-dataset Ii, and a sequence of tasks T1, T2, ..., Tm executed in order. Legend: DCi - datacenter, Xi - execution cluster, Ii - sub-dataset, T - task.]

Page 19

Problem Statement

• How to efficiently perform a task sequence on a GD dataset?

• Several solutions varying by consolidation point, e.g., MR:

• Copy all data to 1 datacenter, perform job

• Perform mapping in respective datacenters, allocate all reducers in 1 datacenter

• Perform mapping and reducing in respective datacenters, aggregate subsequently (assuming “associativity”)

• Combinations, e.g., consolidate input from 2 of 3 datacenters, perform mapping individually, then reducing in 1 datacenter

Page 20

Data Transformation Graphs (DTGs)

• A node - distribution of data and the task execution progress

• Weight of an edge - cost (monetary) or time for performing a task or a copy operation

• Each path from a starting node to an end node is a possible execution path

• A shortest path calculation algorithm is used to determine the optimal path (sketched below)

• Optimal with respect to a given partition distribution and other parameter values
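A minimal sketch of the shortest-path step (plain Dijkstra over a small weighted graph; this illustrates the idea only and is not the G-MR code, and the construction of nodes and edges from actual data distributions is elided):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// A DTG as a weighted directed graph: nodes are (data distribution, #tasks done)
// states, edge weights are the estimated time or monetary cost of one copy or
// task step. The cheapest execution path is a shortest path from the start
// node to any end node (a node in the last stage).
final class DataTransformationGraph {
  private static final class Edge {
    final int to; final double weight;
    Edge(int to, double weight) { this.to = to; this.weight = weight; }
  }

  private final List<List<Edge>> adj = new ArrayList<>();

  int addNode() { adj.add(new ArrayList<>()); return adj.size() - 1; }
  void addEdge(int from, int to, double weight) { adj.get(from).add(new Edge(to, weight)); }

  // Dijkstra; all weights (cost or time estimates) are non-negative.
  double[] cheapestFrom(int start) {
    double[] dist = new double[adj.size()];
    Arrays.fill(dist, Double.POSITIVE_INFINITY);
    dist[start] = 0.0;
    PriorityQueue<double[]> pq = new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
    pq.add(new double[] {0.0, start});
    while (!pq.isEmpty()) {
      double[] head = pq.poll();
      int u = (int) head[1];
      if (head[0] > dist[u]) continue;                  // stale queue entry
      for (Edge e : adj.get(u)) {
        if (dist[u] + e.weight < dist[e.to]) {
          dist[e.to] = dist[u] + e.weight;
          pq.add(new double[] {dist[e.to], e.to});
        }
      }
    }
    return dist;                                        // minimize over end nodes to pick the plan
  }
}

In G-MR terms, the start node would encode the initial partition distribution, the end nodes are the nodes of the last stage, and the reconstructed shortest path is the execution plan (path-reconstruction bookkeeping is omitted above).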

Page 21

DTGs by Example

• 3 datacenters, 1 input partition in each, 1 MR job (2 tasks - map and reduce)

• 3 stages - stage i contains all nodes with exactly i tasks executed

• “Direct” vs “indirect” MR

• Intermediate data stored locally

[Figure: example DTG with stages 0-2; a node N at stage s is labeled with a data distribution, where A=<1,2,3>, B=<2,2,3>, C=<3,2,3>, D=<1,1,3>, E=<1,3,3>, F=<1,2,1>, G=<1,2,2>, H=<3,3,3>, I=<1,1,1>, J=<2,2,2> indicate which datacenter holds each of the three partitions.]

Page 22

Sequences

• DTG for each job

• Each node in stage 2 of the DTG of MR job i is merged with the corresponding node in stage 0 of the DTG of MR job i+1

[Figure: two chained DTGs for a two-job sequence; the stage-2 nodes of the first job's DTG coincide with the stage-0 nodes of the second job's DTG, shown for distributions A=<1,2,3>, B=<2,2,3>, C=<3,2,3>.]

Page 23

Sampling and Extrapolation

• Determining edge weights

• Execute each task on data samples in all execution clusters (in parallel), develop functions to determine execution time and output size

• (Not sampling all paths)

• Extrapolation used to predict execution time and output size for large amounts of data

• Users can manually specify functions
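A minimal sketch of the extrapolation idea (assuming a simple linear model, which is an assumption of this illustration and not necessarily what G-MR fits): run each task on small samples, fit time(size) ≈ a + b·size by least squares, then predict for the full input.

// Least-squares linear fit of observed task times against sample sizes.
final class LinearFit {
  final double a, b;   // intercept and slope

  LinearFit(double[] sizes, double[] times) {
    int n = sizes.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += sizes[i]; sy += times[i];
      sxx += sizes[i] * sizes[i]; sxy += sizes[i] * times[i];
    }
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    a = (sy - b * sx) / n;
  }

  double predict(double size) { return a + b * size; }

  public static void main(String[] args) {
    // Hypothetical samples: (GB processed, seconds observed) on one execution cluster.
    double[] gb  = {1, 2, 4, 8};
    double[] sec = {30, 55, 110, 215};
    LinearFit f = new LinearFit(gb, sec);
    System.out.printf("predicted time for 121 GB: %.0f s%n", f.predict(121));
  }
}

The same shape of model can be fitted for output size as a function of input size; the later "Future Work" slide mentions higher-degree polynomials as a refinement.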

Page 24

Determining Edge Weights

• Two datacenters, 1 partition in each datacenter, 1 MapReduce job (2 tasks)

[Figure: example DTG and edge-weight functions for two datacenters, one partition each, and one MapReduce job (stages 0-2, nodes for distributions <1,2>, <2,2>, <1,1>). W1 denotes time, W2 monetary cost. Copy edges: W1 = Ψ/B_i,j, W2 = Ψ·C_i,j (similarly with Ψ1, Ψ2 for intermediate and final data). Map edges: W1 = Mt(Ψ), W2 = 2·Mt(Ψ)·X·K; after consolidation, Mt(2Ψ) and Mt(2Ψ)·X·K. Reduce edges: W1 = Rt(Ψ1), W2 = 2·Rt(Ψ1)·X·K. Combined map+reduce edges: W1 = MRt(Ψ), W2 = 2·MRt(Ψ)·X·K. Final copy plus aggregation: W1 = Ψ2/B_i,j + At(2Ψ2), W2 = Ψ2·C_i,j + At(2Ψ2)·X·K. Intermediate sizes: Ψ1 = Md(Ψ), Ψ2 = MRd(Ψ); X1 = X2 = X.]
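Reading off the recovered functions, the edge weights are simple arithmetic over the sampled quantities; a small sketch under the assumption that Ψ is the size of the data moved, B_i,j and C_i,j the bandwidth and per-GB price between datacenters i and j, X the cluster size, and K the per-node-hour price (the slide does not spell these out, so the units below are guesses):

// Sketch of the recovered edge-weight functions (assumed units: psi in GB,
// bandwidth in GB/hour, price in $/GB, k in $ per node-hour; task times come
// from the sampling-and-extrapolation step on the previous slide).
final class EdgeWeights {
  // copy edge from datacenter i to j: W1 = psi / B_ij (time), W2 = psi * C_ij (cost)
  static double copyTime(double psiGb, double bandwidthGbPerHour) { return psiGb / bandwidthGbPerHour; }
  static double copyCost(double psiGb, double pricePerGb)         { return psiGb * pricePerGb; }

  // task edge: W1 is the extrapolated task time, W2 charges that time on the
  // nodes running it; the slide's factors of 2 presumably reflect running
  // clusters in both datacenters, or doubled input after consolidation.
  static double taskCost(double taskTimeHours, int clusters, int nodesPerCluster, double pricePerNodeHour) {
    return taskTimeHours * clusters * nodesPerCluster * pricePerNodeHour;
  }
}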

Page 25

G-MR

• Java framework implementing DTGs and corresponding algorithms

• Extends Apache Hadoop

• Java annotations for associativity, functions

• Tested in Amazon EC2 with up to 1 TB of data distributed across 4 datacenters

[Figure: G-MR architecture - a GroupManager running the DTG algorithm, configured via DCConfig and JobConfig, together with JobManager, CopyManager, and AggregationManager components on top of Hadoop.]
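The slide mentions Java annotations for associativity and for the time/size functions, but does not show them; the following syntax is purely hypothetical and invented for illustration (the real G-MR annotation names and signatures may differ entirely):

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotations (names invented for illustration, not the real G-MR API).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Associative {}            // partial results may be computed per datacenter and aggregated

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface OutputSizeFunction {      // user-specified size estimate, cf. "users can manually specify functions"
  double factor();                   // e.g., output size ~= factor * input size
}

@Associative
@OutputSizeFunction(factor = 0.5)
class FilterCensusRecordsJob { /* mapper/reducer definitions elided */ }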

Page 26

Evaluation Setup

• Up to 4 EC2 datacenters, located on the US East Coast and US West Coast and in Europe and Asia

• 10 large EC2 nodes (7.5 GB of memory, 4 EC2 compute units) in each datacenter

• Nodes leased at $0.34 per hour, data transfer $0.1 per GB

Datasets:

• CENSUSData - 121 GB - Year 2000 US Census

• EDUData - 5 GB - University website crawl

• WEATHERData - 20 GB - Weather measurements

• PLANTData - 10 GB - Properties of Iris plants

• HADOOPData - 100 GB - Logs of a Yahoo! Hadoop cluster

• NGRAMData - 300 GB - Google Books Ngrams

Task sequences:

• CENSUSPROCESSOR - Filters and groups CENSUSData

• WORDCOUNT - Counts the number of occurrences of words in EDUData

• MEDIANWEATHER - Computes the median of a record in WEATHERData

• KNN - Determines the type of each plant record in PLANTData

• ETL - Extracts and performs a cross product on HADOOPData

• NGRAM - All combinations of the last two words of 4-grams

Page 27

Evaluation

• Two datacenters (DC1 and DC2)

• Different execution paths

• CopyAndExecute - copy all data to a single datacenter prior to execution

• ExecuteAndCopy - execute all tasks prior to copying

• PartialCopy - balance the partitions in the middle

Page 28

Monetary Cost

[Charts: monetary cost ($) of WORDCOUNT as the % of input in DC1 varies (CopyAndExecute vs. Optimal, where Optimal coincides with ExecuteAndCopy), and of NGRAM across execution paths A-D (CopyAndExecute, ExecuteAndCopy, PartialCopy, Optimal; the Optimal path copies data after the first MR job).]

Page 29

Execution Time

[Charts: execution time of WORDCOUNT in seconds as the % of input in DC1 varies (CopyAndExecute vs. Optimal, where Optimal coincides with ExecuteAndCopy), and of NGRAM in hours across execution paths A-D (CopyAndExecute, ExecuteAndCopy, PartialCopy, Optimal; the Optimal path copies data after the first MR job).]

Page 30

Prediction Accuracy

[Figure 3 panels: (a) actual execution time and (b) actual monetary cost of CopyAndExecute vs. Optimal as the % of input in DC1 varies; (c) time prediction and (d) cost prediction, comparing predicted and actual values for the Copy and Multi execution paths.]

Figure 3: Optimizing a sequence of two MapReduce jobs on a geo-distributed dataset for optimal execution time (a) and cost (b). The distribution of data between two datacenters DC1 and DC2 is varied. The Multi... runs of (c) and (d) correspond to the Optimal execution paths of (a) and (b) respectively.

execution time optimization, while Figure 3(b) shows the results when the objective was to optimize for monetary cost. Two Amazon EC2 datacenters are used here, North Virginia and North California, with 10 large instances (7.5 GB of memory with 4 EC2 compute units) leased in each of them. The cost of each instance was $0.34 per hour while the cost for transferring data between two EC2 datacenters was $0.1 per GB. The MapReduce job counts the number of occurrences of each word of a text-based dataset (“word count” [11]). Figures 3(d) and 3(c) demonstrate the high accuracy of G-MR's predictions achieved despite the performance variability in Amazon EC2. More benchmarks can be found in [16].

Big Data-flow Processing

Rout. G-MR operates at the level of MapReduce tasks, which represents the execution substrate for a number of systems for big data processing. However, programming directly with MapReduce is tedious for complex jobs. For that reason, several “data-flow”

languages have been proposed, which use specific data structures to represent intermediate computation states, and model computational steps as operations applied to such data structures. Programs in many of these languages are directly compiled to MapReduce tasks.

Another limitation of G-MR is that it does not support joining of datasets, which is a requirement for many big data analysis scenarios.

Next we thus describe a more generic approach for geo-distributed big data processing, which consists in a moderate extension we made to Pig Latin [18] (implemented in Apache Pig [18, 7]), named Rout. Our system exposes only little geo-distribution to programmers and lets the distributed runtime environment deal with the tedious aspects of geo-distribution such as orchestration of operations.

Figure 4 shows a high-level overview of the execution of a Rout program. The Rout runtime environment, named Rrun, first applies the same algorithms as the Pig Latin interpreter for generating a MapReduce data-flow graph. We refer to this data-flow graph as the level one data-flow graph. In this step, Rrun assumes all datasets to be located in a single datacenter. After the above step, Rrun analyzes the generated data-flow graph to determine the places where data has to be transferred over the network across datacenters. For example, all data of a geo-distributed dataset may have to be transferred to a single datacenter before performing non-associative functions. The data-flow graph will be annotated accordingly. We refer to the mechanism used to determine the scheduling of these data transfer operations as the data transfer heuristic. The resulting updated data-flow graph is referred to as the level two data-flow graph.

[Figure 4: Overview of the execution of a Rout program - the Pig interpreter produces the level 1 dataflow graph, the copy heuristic yields the level 2 dataflow graph, and program execution then applies the remaining runtime decisions.]

In this latter data-flow graph, associative MapReduce jobs are executed without transferring data across datacenters. That is, such MapReduce jobs are executed individually on sub-datasets in the respective datacenters, and the results are stored in the same respective datacenters. Data transfer operations consolidate all corresponding data in a single datacenter. A simple data transfer heuristic follows a lazy execution strategy in that sub-datasets of a given
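The excerpted paper text breaks off at this point. To make the lazy data-transfer heuristic concrete, here is a minimal sketch of the idea described above (an illustration only, not the actual Rrun code): walk the level-one dataflow graph in order, keep associative or mergeable operations geo-distributed, and insert a consolidating copy only right before the first operation that needs all data in one place.

import java.util.ArrayList;
import java.util.List;

// Lazy copy-insertion over a linear operation sequence (level 1 -> level 2).
final class LazyCopyHeuristic {
  record Op(String name, boolean associativeOrMergeable) {}

  static List<String> toLevelTwo(List<Op> levelOne) {
    List<String> levelTwo = new ArrayList<>();
    boolean distributed = true;                          // input starts geo-distributed
    for (Op op : levelOne) {
      if (distributed && !op.associativeOrMergeable()) {
        levelTwo.add("Copy(consolidate sub-datasets)");  // lazy: only when forced to
        distributed = false;
      }
      levelTwo.add(op.name());
    }
    return levelTwo;
  }

  public static void main(String[] args) {
    List<Op> script = List.of(
        new Op("Load", true), new Op("Filter", true),
        new Op("GroupByWord+Count", true),               // mergeable via SUM
        new Op("Join(other dataset)", false),            // needs co-located data
        new Op("Store", true));
    System.out.println(toLevelTwo(script));
  }
}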


Page 31

Outline

• Big data background

• Geo-distribution motivation

• Geo-distributed tasks

• Geo-distributed workflows

• Conclusions and outlook

Page 32

GD Workflows [Jayalath&Eugster;ICDCS’13]

• n datacenters DC1 to DCn and d input datasets DS1 to DSd - dataset DSi consists of si sub-datasets (1 ≤ si ≤ n)

• GD workflow W, each task taking in one or more (possibly GD) datasets as input and generating one or more datasets as output

[Figure: example deployment - workflow W over datacenters DC1, DC2, DC3 with execution clusters X1, X2, X3 and sub-datasets Ii,j spread across them (Xi - execution cluster).]

Page 33

Problem Statement

• How to efficiently perform workflows with GD datasets?

• Two aspects

1. Executing efficiently: runtime extensions

2. Expressing computation and constraints: language

• Model: geo-distributed datastructures and operations

• Why not transparent?

• Pig Latin/Pig and Flume Java/Crunch

Page 34

“Levels of Associativity”

A function f can be

1. Associative (in the mathematical sense): f(X1 · X2) = f(f(X1) · f(X2)), where · denotes concatenation of sub-datasets, e.g., max, min, top-k

2. Such that there exists a (known) function g with f(X1 · X2) = g(f(X1) · f(X2)), e.g., avg.

A. g is well-known (mostly for built-in simple functions, e.g., avg, word count)

B. Can be synthesized (cf. 3rd Homomorphism Thm. [Morihata et al.;POPL’09])

C. Can not be synthesized

3. Doesn’t exist / is unknown, e.g., join, top-k word count

Why not just let the programmer code the distribution explicitly?
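For instance (a small sketch, not from the slides), avg illustrates case 2: averaging per-datacenter averages is wrong in general, but carrying (sum, count) pairs per sub-dataset gives a merge function g that recovers the global average.

// avg is not associative: avg(avg(X1), avg(X2)) is wrong when |X1| != |X2|.
// Carrying (sum, count) per sub-dataset and merging by component-wise addition
// recovers the exact global average.
final class DistributedAvg {
  record Partial(double sum, long count) {
    static Partial of(double[] xs) {
      double s = 0;
      for (double x : xs) s += x;
      return new Partial(s, xs.length);
    }
    Partial merge(Partial o) { return new Partial(sum + o.sum, count + o.count); }
    double avg() { return sum / count; }
  }

  public static void main(String[] args) {
    double[] dc1 = {1, 2, 3};          // sub-dataset in datacenter 1
    double[] dc2 = {10, 20, 30, 40};   // sub-dataset in datacenter 2
    Partial merged = Partial.of(dc1).merge(Partial.of(dc2));
    System.out.println(merged.avg());  // 15.14..., the true global average
    // naive "average of averages" would give (2 + 25) / 2 = 13.5 -- wrong
  }
}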

Page 35

Manual Distribution Example

input_lines = LOAD 'input_file' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
word_groups = GROUP words BY word;
word_count = FOREACH word_groups GENERATE group, COUNT(words);
STORE word_count INTO 'output_file';

• More lines of code

• One hard-wired path - may not be optimal

• Has to be associative or, if not strictly associative (e.g., AVG), “aggregatable” (e.g., COUNT + SUM as in this script)

-- Part 1: Executed in both datacenters
input_lines = LOAD 'input_file' AS (line:chararray);
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
word_groups = GROUP words BY word;
word_count = FOREACH word_groups GENERATE group, COUNT(words);
STORE word_count INTO 'file1';

-- Part 2: Executed in datacenter DC2 only
-- Copied data is stored in file2.
-- -> Copy file1 of DC2 to file2 in DC1.

-- Part 3: Executed in datacenter DC1 only
records1 = LOAD 'file1' AS (word:chararray, count:int);
records2 = LOAD 'file2' AS (word:chararray, count:int);
all_records = UNION records1, records2;
grouped = GROUP all_records BY word;
word_count = FOREACH grouped GENERATE group, SUM(all_records.count);
STORE word_count INTO 'output_file';

Page 36

Pig Latin Background: Types

• Simple: int, long, float, double, chararray, bytearray, boolean

• Complex

• tuple - an ordered set of fields

• bag - a collection of tuples

• map - a set of key-value pairs

• relation - an outer bag with no complex types

Page 37

Operations and Functions

• Operations

• UNION creates a union of two or more relations

• CROSS creates a cross product of two or more relations

• JOIN joins two or more relations

• GROUP groups the elements of a relation based on a given criterion

• Functions

• Eval functions, e.g., AVG, COUNT, CONCAT, SUM, TOKENIZE

• Math functions, e.g., ABS, COS

• String functions, e.g., SUBSTRING, TRIM

• User defined functions (UDFs)

Page 38

Rout

• Define two new complex data structures

• gdbag - a collection of tuples which may come from multiple datacenters

• gdmap - a collection of key-value pairs which may come from multiple datacenters

• bag and map are retained but pinned to single datacenters

• Operations and functions

• String and math functions are always applied tuple-wise

• Applied individually to sub-datasets in respective datacenters

Page 39

Eval Functions

• Eval functions are usually applied to groups of tuples

• Users can provide an optional merge function “g” (the original eval function “f” is called the work function)

• With a merge function: the eval function is applied to the individual sub-datasets, followed by aggregation via the merge function

• Otherwise: all data represented by the corresponding data structure is copied to a single datacenter

• Destination is decided by Rout runtime (Rrun)

Page 40

Operators and Example

• Other operations

• Load and store operations - GDLOAD, GDSTORE

• Operations for converting between bags/maps and gdbags/gdmaps - COLLAPSE, GDCONVERT

• GD relational operations - GDJOIN, GDGROUP

-- Input represents data in both datacenters
gd_input_lines = GDLOAD 'input' AS (line:chararray);
gd_words = FOREACH gd_input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
gd_word_groups = GDGROUP gd_words BY word;
gd_word_count = FOREACH gd_word_groups GENERATE group, COUNT(gd_words);
word_count = COLLAPSE gd_word_count;
STORE word_count INTO 'output_file';

Page 41

Rout Runtime Infrastructure (Rrun)

Execution steps of Rrun

• Lazy heuristic copying data across datacenters when needed

• E.g., operation is non-associative and no merge function is provided

• Only decides at which points data should be copied, not where to

[Figure: execution steps of Rrun - Rout programs go through the Pig interpreter (level 1 dataflow graph), then the copy heuristic (level 2 dataflow graph), and finally program execution with runtime decisions.]

Page 42

Evaluation

• 2 EC2 datacenters

• 10 nodes from each datacenter with 1.7 GB of memory and 1 EC2 virtual core running Ubuntu Linux

• Uses HADOOPData dataset, search for exceptions

Log Search

Rout - copy data after individual searches

[Figure 8. Total execution time for (a) Log debugger, (b) Log search, (c) Weather explorer - Rout vs. Pig on Dataset1 and Dataset2.]

[Figure 9. Breakdown of execution time for Dataset1 - (a) Log debugger: CountMR, Copy, MergeMR; (b) Log search: SearchMR, Copy; (c) Weather explorer: WeatherMR, Copy.]

[Figure 10. Breakdown of execution time for Dataset2 - same phases as Figure 9.]

execution, representing their individual execution times from left to right in the order in which they occur in the respective schedules. More precisely, Pig Latin executes the script as a single MapReduce job. Due to execution techniques detailed in Section IV, Rrun executes the script as two MapReduce jobs instead. The first MapReduce job (CountMR) determines the results for the input in each individual datacenter. The results are then copied to a single datacenter (Copy) and merged using another MapReduce job (MergeMR).

As the figures illustrate, Rout performs the task more efficiently than Pig Latin: in only 86% of the time for the smaller Dataset1 compared to Pig Latin, and in only 64% of the time for the full dataset, Dataset2. This is mainly due to two reasons: (1) since Rrun executes a part of the script keeping the input data GD, the amount of data that is copied across datacenters is significantly smaller; (2) for executing the count MapReduce job, Rrun employs Hadoop clusters in both datacenters, significantly reducing total execution time.

C. Log Search

The second script performs an advanced search operation, again on HADOOPData (thus Dataset1 and Dataset2 are the same as for the previous benchmark). The search can be for specific records in the log files. In our example we searched for and listed the execution times of each MapReduce job logged.

Both the Rrun and Pig Latin dataflow graphs consisted of a single MapReduce job (SearchMR), but in the former case the MapReduce job was executed while keeping the input data GD, resulting in a MapReduce job being executed in each of the Hadoop clusters. The results from the datacenters can be considered together with no need for merging. The corresponding execution times for Dataset1 and Dataset2

Page 43

Log Debugger

[Figures 8-10 and the accompanying excerpt repeat here verbatim; see Page 42.]

Weather Explorer

Page 44

Programmer Effort

Table 3: Number of lines of code (LoC) and number of keywords (K).

Experiment             | FlumeJava Naïve | FlumeJava Explicit | DuctWork | Pig Naïve | Pig Explicit | Rout
                       | LoC   K         | LoC   K            | LoC   K  | LoC   K   | LoC   K      | LoC   K
Log debugger           | 5     7         | 11    14           | 6     8  | 6     15  | 13    26     | 7     18
Log search             | 3     4         | 8     9            | 4     5  | 3     7   | 8     13     | 4     8
Weather explorer       | 7     10        | 12    15           | 8     11 | 7     17  | 12    24     | 7     18
Weather top k count    | 5     8         | 11    14           | 6     10 | 6     17  | 11    23     | 7     20
Weather top k average  | 5     7         | 11    13           | 6     9  | 6     17  | 12    25     | 7     20

Table 4: Percentage difference of LoC and K of DuctWork and Rout compared to naïve and explicit handling of distribution in the respective base languages.

Experiment             | DuctWork/FlumeJava vs. Naïve | vs. Explicit | Rout/Pig vs. Naïve | vs. Explicit
                       | LoC    K                     | LoC    K     | LoC    K           | LoC    K
Log debugger           | +20%   +28%                  | -45%   -43%  | +17%   +20%        | -46%   -31%
Log search             | +33%   +25%                  | -50%   -44%  | +33%   +14%        | -50%   -38%
Weather explorer       | +14%   +10%                  | -33%   -27%  | +0%    +6%         | -42%   -25%
Weather top k count    | +20%   +25%                  | -45%   -28%  | +17%   +18%        | -36%   -15%
Weather top k average  | +20%   +28%                  | -45%   -31%  | +17%   +18%        | -42%   -20%

DuctWork increases when the input size is increased (DSet1 to DSet2), thus the relative cost benefits of using DuctWork become even more significant on larger inputs. Also, our execution strategy attempts to minimize execution time; an alternative strategy geared at costs could achieve further improvements as the two are not always aligned [22].

4.4 Complexity

In Table 3 we gauge program complexity for our programs through two measures: (1) the number of lines of code and (2) the number of involved “keywords”. More precisely, for the latter we consider the number of functions used in the case of FlumeJava/DuctWork (as it is implemented as a library), and the number of actual keywords for Pig/Rout (which includes built-in operators). We omit any declarations of data types. We compare DuctWork/Rout programs with FlumeJava/Pig programs written to explicitly optimize execution according to the given input data distribution (Explicit) and a naïve FlumeJava/Pig program that assumes all data to be in a single datacenter (Naïve).

Table 4 shows the percentage differences of the above two measures when programs of DuctWork/Rout are compared to naïve and explicitly optimized versions of the respective FlumeJava/Pig programs. Results clearly show that the code improvements of our languages over explicit versions are much more substantial than the costs of our languages over naïve versions.

The table shows that programs written using DuctWork/Rout use fewer lines of code and keywords than the corresponding Explicit versions of the FlumeJava/Pig programs, being only slightly larger than the corresponding Naïve versions of the FlumeJava/Pig programs written assuming all data to be located in a single datacenter. Note that for the latter versions we did not include in the programs the actual eager copying of data to a single datacenter.

4.5 Discussion

We believe that the significant performance improvements achieved with DuctWork over FlumeJava with naïve copying outweigh the modest increases in program complexity (at most one additional LoC) with less distribution-agnostic collections across both considered languages. The main argument on the side of additional transparency is that in all the investigated programs the same performance could be achieved explicitly in DuctWork. However, this might not always be the case in general. In addition, the best execution plan might differ across input datasets/dataset versions (cf. B. in Section 2.3); in some cases copying upfront might indeed represent the most efficient execution plan [22]. Identifying the most efficient one might go through implementing and comparing several program versions. Last but not least, achieving an efficient execution plan explicitly in a transparent model increases program complexity substantially. When consolidation does not happen at program start the resulting code is more complex in nature. This complexity increases with the number of datasets and datacenters involved. Inversely, the full backward compatibility of our solution(s) allows novice programmers to settle for less performance without any added complexity.

In summary, we feel that the benefits of making (geo-)distribution more explicit in big data analysis in terms of performance widely outweigh the downsides in terms of increased program complexity. While our empirical analysis has focused on geo-distribution only, we believe that our results are also encouraging for other scenarios where distribution matters, such as main memory big data processing. In fact, in parallel efforts we have investigated the effect of the characteristics of merge functions on the performance of data aggregation: while existing systems such as Presto [] or the work of Yu et al. [] use trees for aggregating data by applying merge functions in several stages, they all use constant fanouts for such trees. As we show in [35], function characteristics can be used to select an optimal fanout, leading to up to 10-fold speedups over a poor choice. By avoiding the typical MapReduce disk I/O in such main memory computations, bottlenecks are namely shifted to network I/O.

4.6 Threats to Validity

We have considered several threats to validity, discussed in the following. The benchmark programs have been inspired by other publications (e.g., [25] uses word count, top k, grep) and by the datasets themselves, which have been made publicly available by represen-
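The discussion above mentions multi-stage tree aggregation of partial results with a configurable fanout; a minimal generic sketch of that idea (an illustration only, not the Presto or [35] implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Tree aggregation: repeatedly merge groups of `fanout` partial results until
// one remains. With n partials and fanout f this takes ceil(log_f(n)) stages;
// per-stage work and network transfer depend on f, which is why a fixed fanout
// can be a poor choice for some merge functions.
final class TreeAggregate {
  static <T> T aggregate(List<T> partials, int fanout, BinaryOperator<T> merge) {
    List<T> level = partials;
    while (level.size() > 1) {
      List<T> next = new ArrayList<>();
      for (int i = 0; i < level.size(); i += fanout) {
        T acc = level.get(i);
        for (int j = i + 1; j < Math.min(i + fanout, level.size()); j++) {
          acc = merge.apply(acc, level.get(j));   // merge one group of size <= fanout
        }
        next.add(acc);
      }
      level = next;
    }
    return level.get(0);
  }

  public static void main(String[] args) {
    List<Integer> partialCounts = List.of(3, 1, 4, 1, 5, 9, 2, 6);
    System.out.println(aggregate(partialCounts, 2, Integer::sum));  // 31
  }
}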


Page 45

Outline

• Big data background

• Geo-distribution motivation

• Geo-distributed tasks

• Geo-distributed workflows

• Conclusions and outlook

Page 46

Conclusions

• Unlikely all data in the world will ever be in 1 datacenter

• Communication latency related to distance

• Geographical constraints matter

• Operating closer to data pays off in most cases

• Beyond the atmosphere - fog computing

Page 47

Future Work

• Optimization

• DTGs/G-MR

• Heuristics to further reduce complexity

• Higher-degree polynomials for approximation

• Rout reconciliation

• Fine granularity of DTG offline optimization

• Simple Rout heuristic considering online resource usage

Page 48

Yet More Future Work

• Model and expressiveness

• Flume Java/DuctWork

• Iterative and incremental jobs, main-memory datastructs, cf. Flink, Spark

• Optimal aggregation [Culhane et al.;HotCloud’14], [Culhane et al.;INFOCOM’15]

• UDFs

• Security

• Integrity, availability, and isolation [Stephen&Eugster;Middleware’13]

• Confidentiality [Stephen et al.;HotCloud’14], [Stephen et al.;ASE’14]

Page 49

Next Week

• Resource management

• Security