Upper and Lower Bound on the Cost of a MapReduce Computation

Upper and Lower Bounds on the Cost of a Map-Reduce Computation. 38th International Conference on Very Large Data Bases (VLDB 2012). Tzu-Li Tai, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory. Foto N. Afrati, National Technical University of Athens. Anish Das Sarma, Google Research. Semih Salihoglu and Jeffrey D. Ullman, Stanford University.


[Paper Study] VLDB 2012 Author: Foto N. Afrati, National Technical University of Athens

Transcript of Upper and Lower Bound on the Cost of a MapReduce Computation

Page 1: Upper and Lower Bound on the Cost of a MapReduce Computation

Upper and Lower Bounds on the Cost of a Map-Reduce Computation

38th International Conference on Very Large Data Bases (VLDB 2012)

Tzu-Li Tai, National Cheng Kung University, Dept. of Electrical Engineering, HPDS Laboratory

Foto N. Afrati, National Technical University of Athens

Anish Das Sarma, Google Research

Semih Salihoglu, Jeffrey D. Ullman, Stanford University

Page 2: Upper and Lower Bound on the Cost of a MapReduce Computation

Agenda

A. Background

B. A Motivating Example

C. Tradeoff: Parallelism & Communication

D. Problem Model and Assumptions

E. The Hamming-Distance-1 Problem

F. Conclusion

Page 3: Upper and Lower Bound on the Cost of a MapReduce Computation

Background

The MapReduce Paradigm

[Diagram: four Map tasks feeding three Reduce tasks. Key-value pair labels along the data flow: (k1, v1) → (k2, v2) → (k2, [v2]) → (k3, v3)]

Page 4: Upper and Lower Bound on the Cost of a MapReduce Computation

Background

Distributed/Parallel Computing in Clusters

• Often uses MapReduce to express applications (Hadoop)
- This paper focuses on single-round MR applications

• Limited bandwidth

• Limited resources (memory, processing units, etc.)

• For public clouds, you “pay as you go” for these resources
- Amazon EC2 charges for both bandwidth usage & processing units

Page 5: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

The Drug Interaction Problem

• 3000 sets of drug data (patients taking, dates, diagnoses)

• About 1 MB of data per drug

• Problem: find pairs of drugs that, when taken together, increase the risk of heart attack

• Requires cross-referencing every pair of drugs across the whole set of drugs

Page 6: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

Map outputs:

Drug 1 → ({1,2}, data 1), ({1,3}, data 1), ({1,4}, data 1)

Drug 2 → ({1,2}, data 2), ({2,3}, data 2), ({2,4}, data 2)

Drug 3 → ({1,3}, data 3), ({2,3}, data 3), ({3,4}, data 3)

Drug 4 → ({1,4}, data 4), ({2,4}, data 4), ({3,4}, data 4)

Reduce inputs:

Reduce for {1,2}: ({1,2}, data 1+2)

Reduce for {1,3}: ({1,3}, data 1+3)

Reduce for {1,4}: ({1,4}, data 1+4)

Reduce for {2,3}: ({2,3}, data 2+3)

Reduce for {2,4}: ({2,4}, data 2+4)

Reduce for {3,4}: ({3,4}, data 3+4)

Page 7: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

What Went Wrong?

• For 3000 drugs, each set of drug data is replicated 2999 times

• Each set of data is 1 MB, so the total is 3000 × 2999 × 1 MB ≈ 9 terabytes of communication, or about 90,000 sec on a 1 Gigabit network

• Communication cost is too high!
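The figures on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name is made up, 1 MB is taken as 10^6 bytes, and the 100 MB/s effective throughput is an assumption chosen because it reproduces the quoted 90,000 sec (the slide does not say how the time was derived; the raw 1 Gbit/s line rate of 125 MB/s would give a somewhat smaller number).

```python
def naive_communication(n_drugs: int, bytes_per_drug: int) -> int:
    """Bytes shipped to reducers when every pair of drugs has its own reducer."""
    replicas_per_drug = n_drugs - 1      # one copy for each partner drug
    return n_drugs * replicas_per_drug * bytes_per_drug


total_bytes = naive_communication(3000, 10**6)   # 1 MB taken as 10**6 bytes
seconds_at_100mb = total_bytes / 100e6           # assumed effective throughput
seconds_at_line_rate = total_bytes / 125e6       # 1 Gbit/s = 125 MB/s, no overhead

print(total_bytes)                  # 8997000000000, i.e. about 9 TB
print(round(seconds_at_100mb))      # 89970, i.e. about 90,000 sec
print(round(seconds_at_line_rate))  # 71976
```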

Page 8: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

Different Approach: Grouping Drugs

• G1: Drugs 1-2
• G2: Drugs 3-4
• G3: Drugs 5-6

Key: Own Group + Other Groups

Map outputs:

Drug 1 → ({G1, G2}, data 1), ({G1, G3}, data 1)

Drug 2 → ({G1, G2}, data 2), ({G1, G3}, data 2)

Drug 3 → ({G1, G2}, data 3), ({G2, G3}, data 3)

Drug 4 → ({G1, G2}, data 4), ({G2, G3}, data 4)

Drug 5 → ({G1, G3}, data 5), ({G2, G3}, data 5)

Drug 6 → ({G1, G3}, data 6), ({G2, G3}, data 6)

Page 9: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

Map outputs (grouping: G1 = Drugs 1-2, G2 = Drugs 3-4, G3 = Drugs 5-6):

Drug 1 → ({G1, G2}, data 1), ({G1, G3}, data 1)

Drug 2 → ({G1, G2}, data 2), ({G1, G3}, data 2)

Drug 3 → ({G1, G2}, data 3), ({G2, G3}, data 3)

Drug 4 → ({G1, G2}, data 4), ({G2, G3}, data 4)

Drug 5 → ({G1, G3}, data 5), ({G2, G3}, data 5)

Drug 6 → ({G1, G3}, data 6), ({G2, G3}, data 6)

Reduce inputs:

Reduce for {G1, G2}: ({G1, G2}, data 1+2+3+4)

Reduce for {G1, G3}: ({G1, G3}, data 1+2+5+6)

Reduce for {G2, G3}: ({G2, G3}, data 3+4+5+6)
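The grouped map and shuffle steps can be simulated in plain Python. This is a minimal sketch rather than Hadoop code: `group_of`, `map_drug`, and `shuffle` are hypothetical helper names, drugs are numbered from 1, and the group size is assumed to divide the number of drugs evenly.

```python
from collections import defaultdict


def group_of(drug: int, group_size: int) -> int:
    """1-based group index of a 1-based drug number."""
    return (drug - 1) // group_size + 1


def map_drug(drug: int, n_groups: int, group_size: int):
    """Emit one (key, drug) pair per other group; key = {own group, other group}."""
    own = group_of(drug, group_size)
    for other in range(1, n_groups + 1):
        if other != own:
            yield frozenset((own, other)), drug


def shuffle(n_drugs: int, group_size: int):
    """Collect the map outputs by key, as the MapReduce shuffle would."""
    n_groups = n_drugs // group_size
    reducers = defaultdict(list)
    for drug in range(1, n_drugs + 1):
        for key, value in map_drug(drug, n_groups, group_size):
            reducers[key].append(value)
    return reducers


reducers = shuffle(n_drugs=6, group_size=2)
for key in sorted(reducers, key=sorted):
    print(sorted(key), reducers[key])
# [1, 2] [1, 2, 3, 4]
# [1, 3] [1, 2, 5, 6]
# [2, 3] [3, 4, 5, 6]
```

With `group_size=1` the same code reproduces the one-reducer-per-pair scheme from the earlier slide.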

Page 10: Upper and Lower Bound on the Cost of a MapReduce Computation

A Motivating Example

• Therefore, if we group the 3000 drugs into 30 groups
- G1: 1-100, G2: 101-200, ……, G30: 2901-3000

• Each set of drug data is replicated only 29 times: 3000 × 29 × 1 MB = 87 GB vs. 9 TB of communication

• But lower parallelism, higher processing cost!
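To see how the numbers move as the group count changes, the sketch below tabulates communication against reducer load. The function name and the choice of group counts are illustrative assumptions; it also assumes one reducer per pair of groups (as in the three-group diagram) and a group count that divides the number of drugs.

```python
def grouped_costs(n_drugs: int, n_groups: int, bytes_per_drug: int):
    """Return (replication, communication bytes, drugs per reducer, reducer count)."""
    group_size = n_drugs // n_groups              # assumes n_groups divides n_drugs
    replication = n_groups - 1                    # one copy per other group
    comm_bytes = n_drugs * replication * bytes_per_drug
    drugs_per_reducer = 2 * group_size            # a reducer holds two whole groups
    n_reducers = n_groups * (n_groups - 1) // 2   # one reducer per pair of groups
    return replication, comm_bytes, drugs_per_reducer, n_reducers


for g in (3000, 300, 30, 3):
    print(g, grouped_costs(3000, g, 10**6))
```

For 30 groups this gives 29 copies per drug, 87 GB of communication, 200 drugs per reducer, and 435 reducers: fewer groups mean less communication but fewer, larger reducers.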

Page 11: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

Parallelism vs. Communication

• To evaluate communication cost, define the replication rate $r$: the average number of key-value pairs created from a single map input

• To evaluate processing cost, define the reducer size $q$: the maximum number of values for a single key
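Both quantities can be read directly off a description of which inputs each reducer receives. The following is a small sketch under that assumption; the `schema` dictionary is a hand-written stand-in for the six-drug, three-group example, and the function names are made up for illustration.

```python
def replication_rate(assignment: dict, n_inputs: int) -> float:
    """Average number of key-value pairs per map input: total reducer inputs / inputs."""
    return sum(len(values) for values in assignment.values()) / n_inputs


def reducer_size(assignment: dict) -> int:
    """Largest number of values that any single key (reducer) receives."""
    return max(len(values) for values in assignment.values())


# Hypothetical schema: the six-drug, three-group example from the slides.
schema = {
    ("G1", "G2"): [1, 2, 3, 4],
    ("G1", "G3"): [1, 2, 5, 6],
    ("G2", "G3"): [3, 4, 5, 6],
}
print(replication_rate(schema, n_inputs=6), reducer_size(schema))  # 2.0 4
```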

Page 12: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

[Diagram: the grouped map and reduce layout from Page 9, repeated]

$r = 2$, $q = 4$

Page 13: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

How the Tradeoff can be Used

$r = f(q)$

• Communication cost: $ar$, where $a$ is a constant

• Processing cost: some function of $q$
- Take for example the previous drug interaction problem
- The work for each reducer is $O(q^2)$, so $\text{Cost}_{each} = bq^2$, where $b$ is a constant
- The number of reducers is proportional to $\frac{1}{q}$
- $\text{Cost}_{total} = bq^2 \times \frac{1}{q} = bq$

Page 14: Upper and Lower Bound on the Cost of a MapReduce Computation

Tradeoff: Parallelism & Communication

How the Tradeoff can be Used

$\text{Combined Cost} = ar + bq = af(q) + bq$

• Solve for the $q$ that minimizes the combined cost

• Determine $r$ with $r = f(q)$

• Decide on an appropriate algorithm implementation
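As a worked instance of this recipe, the sketch below minimizes $ar + bq$ for the drug grouping scheme. Several things here are assumptions rather than slide content: the form $f(q) = 2n/q - 1$ is derived from the grouping scheme ($g$ groups give $q = 2n/g$ and $r = g - 1$), the constants `a` and `b` are invented for illustration, and $q$ is scanned over all integers even though only some values correspond to an even grouping.

```python
import math


def replication_for_drugs(q: float, n_drugs: int) -> float:
    """r = f(q) for the grouping scheme: g = 2n/q groups, each drug copied g - 1 times."""
    return 2 * n_drugs / q - 1


def combined_cost(q: float, n_drugs: int, a: float, b: float) -> float:
    """a*r + b*q, the combined cost from the slide, with r = f(q)."""
    return a * replication_for_drugs(q, n_drugs) + b * q


def best_reducer_size(n_drugs: int, a: float, b: float) -> int:
    """Scan the reducer sizes 2..n and return the cheapest one."""
    return min(range(2, n_drugs + 1),
               key=lambda q: combined_cost(q, n_drugs, a, b))


n, a, b = 3000, 1.0, 0.15        # a and b are made-up constants for illustration
q_star = best_reducer_size(n, a, b)
q_closed_form = math.sqrt(2 * a * n / b)   # from d/dq [a(2n/q - 1) + bq] = 0
print(q_star, q_closed_form, replication_for_drugs(q_star, n))
```

With these particular constants the minimum lands at $q = 200$ and $r = 29$, which is the 30-group configuration discussed earlier; other constants would shift the optimum.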

Page 15: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

[Diagram labels: Finite domain N → Hypothetical set of all inputs constructed from domain N → Mapping Schema ($r$, $q$) → All possible outputs corresponding to the inputs]

Page 16: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Example: Hamming Distance 1

10110 vs. 10011, Distance: 2

10110 vs. 10010, Distance: 1
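The distances on this slide can be checked with a short function. This sketch reads each ten-digit run on the slide as a pair of 5-bit strings (10110 against 10011, and 10110 against 10010), which is an interpretation of the extracted text; the function name is illustrative.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(cx != cy for cx, cy in zip(x, y))


print(hamming_distance("10110", "10011"))  # 2
print(hamming_distance("10110", "10010"))  # 1
```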

Page 17: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Example: Hamming Distance 1

Domain: bit strings of length $b$

$2^b$ hypothetical inputs: 000……00, 000……01, 000……10, …, 111……00, 111……01, 111……10, 111……11

Mapping Schema ($r$, $q$)

No. of outputs = $\frac{2^b \times b}{2}$
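The output count follows because each of the $2^b$ strings has exactly $b$ neighbours at distance 1 and every pair is counted from both ends. A brute-force check for small $b$, written as an illustrative sketch with a made-up function name:

```python
def count_distance_one_pairs(b: int) -> int:
    """Brute-force count of unordered pairs of b-bit strings at Hamming distance 1."""
    count = 0
    for x in range(2 ** b):
        for bit in range(b):
            y = x ^ (1 << bit)   # flip one bit to get a neighbour
            if x < y:            # count each unordered pair once
                count += 1
    return count


for b in (1, 2, 3, 4, 10):
    assert count_distance_one_pairs(b) == b * 2 ** b // 2

print(count_distance_one_pairs(4))  # 32
```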

Page 18: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

The Mapping Schema Tradeoff Derivation

Given the maximum reducer size $q$, and assuming there are $p$ reducers,

$r = \sum_{i=1}^{p} \frac{q_i}{I}$

$q_i$: reducer size of reducer $i$ ($q_i \le q$)

$I$: total input size

Page 19: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

[Diagram: the grouped map and reduce layout from Page 9, repeated, with each reducer annotated by its size]

$q_1 = 4$, $q_2 = 4$, $q_3 = 4$, $I = 6$

$\Rightarrow r = \sum_{i=1}^{p} \frac{q_i}{I} = \frac{4 + 4 + 4}{6} = 2$

Page 20: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Finding the lower bound of $r$ with given $q$

1. Deriving $g(q)$: upper bound of outputs a reducer with size $q$ covers

Page 21: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

[Diagram: the grouped map and reduce layout from Page 9, repeated]

$q = 4 \Rightarrow$ covers $\binom{4}{2}$ outputs

$\Longrightarrow g(q) = \binom{q}{2} = \frac{q(q-1)}{2} \approx \frac{q^2}{2}$

Page 22: Upper and Lower Bound on the Cost of a MapReduce Computation

Problem Model & Assumptions

Finding the lower bound of $r$ with given $q$

1. Deriving $g(q)$: upper bound of outputs a reducer with size $q$ covers

2. Determine number of inputs $I$ and outputs $O$

3. Establish inequality:

$\sum_{i=1}^{p} g(q_i) \ge O$

4. Manipulate inequality (using $q_i \le q$ and assuming $g(q)/q$ does not decrease as $q$ grows):

$\sum_{i=1}^{p} q_i \frac{g(q_i)}{q_i} \ge O \Rightarrow \sum_{i=1}^{p} q_i \frac{g(q)}{q} \ge O$

$\Rightarrow r = \sum_{i=1}^{p} \frac{q_i}{I} \ge \frac{q \times O}{g(q) \times I}$
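The bound $r \ge qO / (g(q) I)$ can be evaluated for the running six-drug example. The sketch below is illustrative: the function names are made up, it uses the exact $g(q) = \binom{q}{2}$ rather than the $q^2/2$ approximation, and it assumes one output per pair of drugs, so $O = \binom{6}{2} = 15$.

```python
from fractions import Fraction


def replication_lower_bound(q: int, n_inputs: int, n_outputs: int, g) -> Fraction:
    """r >= q * O / (g(q) * I), the bound obtained in step 4."""
    return Fraction(q * n_outputs, g(q) * n_inputs)


def pairs_covered(q: int) -> int:
    """g(q) for all-pairs problems: a reducer with q inputs covers at most C(q, 2) outputs."""
    return q * (q - 1) // 2


n = 6                          # six drugs, as in the running example
outputs = n * (n - 1) // 2     # assumed: one output per pair of drugs
bound = replication_lower_bound(4, n, outputs, pairs_covered)
print(bound, float(bound))     # 5/3, about 1.67
```

The three-group schema has $r = 2$ at $q = 4$, which sits just above this bound of $5/3$.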

Page 23: Upper and Lower Bound on the Cost of a MapReduce Computation

The Hamming-Distance-1 Problem

1. $g(q) = \frac{q}{2} \log_2 q$ (by mathematical induction)

2. $I = 2^b$, $O = \frac{b}{2} 2^b$

3. Inequality:

$\sum_{i=1}^{p} g(q_i) = \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q_i \ge \frac{b}{2} 2^b$

$\Rightarrow \sum_{i=1}^{p} \frac{q_i}{2} \log_2 q \ge \frac{b}{2} 2^b$

$\Rightarrow r = \sum_{i=1}^{p} \frac{q_i}{2^b} \ge \frac{b}{\log_2 q}$
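To get a feel for the bound $r \ge b / \log_2 q$, it helps to evaluate it at a few reducer sizes. The sketch below is only a numeric illustration; the function name and the choice $b = 20$ are assumptions, not values from the slides.

```python
import math


def hd1_replication_lower_bound(b: int, q: int) -> float:
    """Lower bound r >= b / log2(q) for Hamming distance 1 on b-bit strings."""
    if not 2 <= q <= 2 ** b:
        raise ValueError("q must lie between 2 and 2**b")
    return b / math.log2(q)


b = 20
for q in (2, 2 ** 5, 2 ** 10, 2 ** 20):
    print(q, hd1_replication_lower_bound(b, q))
# 2 20.0
# 32 4.0
# 1024 2.0
# 1048576 1.0
```

The extremes behave as expected: reducers that hold a single pair ($q = 2$) force every string to be sent $b$ times, while a single reducer holding all $2^b$ strings needs no replication.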

Page 24: Upper and Lower Bound on the Cost of a MapReduce Computation

Conclusion

• Presents a new approach to studying optimal Map-Reduce algorithms

• Establishes a unified model with two parameters, replication rate and reducer size, for studying performance over a spectrum of possible computing clusters

• For several problems, the two parameters are shown to be related by a tradeoff formula