Satisfying Strong Application Requirements in Data-Intensive Clouds


Transcript of Satisfying Strong Application Requirements in Data-Intensive Clouds

Page 1: Satisfying Strong Application Requirements in Data-Intensive Clouds

Satisfying Strong Application Requirements in Data-Intensive Clouds

Ph.D. Final Exam
Brian Cho

Page 2: Satisfying Strong Application Requirements in Data-Intensive Clouds

Motivating scenario: Using the data-intensive cloud

• Researchers contract with a defense agency to investigate ongoing suspicious activity
  – e.g., botnet attack, worm, etc.
  – Other applications: processing click logs, news items, etc.

1. Transfer large logs (TBs-PBs) from possible victim sites
2. Run computations on the logs to find vulnerabilities and the source of the attack
3. Store the data

Page 3: Satisfying Strong Application Requirements in Data-Intensive Clouds

Can today's data-intensive cloud meet these demands?

The researchers require:
1. Control over the time and $ cost of the transfer, to stay within the contracted budget and time
2. Prioritization of this time-sensitive job over other jobs in its cluster
3. Consistent updates and reads at the data store

• Current limitation: Systems are built to optimize key metrics at large scales, but not to meet these strong user requirements

Page 4: Satisfying Strong Application Requirements in Data-Intensive Clouds

Strong user requirements

• Many real-world requirements are too important to relax
  – Time
  – $$$
  – Priority
  – Data consistency

• It is essential to treat these strong requirements as problem constraints
  – … not just as side effects of resource limitations in the cloud

Page 5: Satisfying Strong Application Requirements in Data-Intensive Clouds

Thesis statement

• It is feasible to satisfy strong application requirements for data-intensive cloud computing environments, in spite of resource limitations, while simultaneously optimizing run-time metrics.
  – Strong application requirements: real-time deadlines, dollar budgets, data consistency, etc.
  – Resource limitations: finite compute nodes, limited bandwidth, high latency, frequent failures, etc.
  – Run-time metrics: throughput, latency, $ cost, etc.

Page 6: Satisfying Strong Application Requirements in Data-Intensive Clouds

Contributions: Practical solutions

Area               | Solution                 | Strong user requirement    | Key optimized metric
-------------------|--------------------------|----------------------------|---------------------
Bulk Data Transfer | Pandora-A [ICDCS 2010]   | Deadline                   | Low $ cost
Bulk Data Transfer | Pandora-B [ICAC 2011]    | $ Budget                   | Short transfer time
Key-value Storage  | Vivace [USENIX ATC 2012] | Consistency                | Low latency
Computation        | Natjam                   | Prioritize production jobs | Job completion time

Page 7: Satisfying Strong Application Requirements in Data-Intensive Clouds

Pandora-A: Bulk Data Transfer via Internet and Shipping Networks

• Minimize $ cost subject to a time deadline
• Transfer options
  – Internet links with proportional costs but limited bandwidth
  – Shipping links with fixed costs and shipping times depending on method (e.g., ground, air)
• Solution
  – Transform into a time-expanded network
  – Solve min-cost flow on the network
• Trace-driven experiments
  – Pandora-A solutions are better than direct Internet transfer or shipping alone
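The two solution steps above (build a time-expanded network, then solve min-cost flow) can be sketched in miniature. Everything below is invented for illustration: the sites, capacities, latencies, and per-unit costs are made up, and the fixed shipping cost is flattened into a per-unit cost, which the real Pandora-A formulation does not do.

```python
from collections import defaultdict

# Toy time-expanded network. Nodes are (site, time step); edges are
# (u, v, capacity, cost per unit of data). All numbers are illustrative.
edges = []
for t in range(3):
    edges.append((("src", t), ("src", t + 1), 10, 0))  # data waits at source
    edges.append((("dst", t), ("dst", t + 1), 10, 0))  # data waits at sink
    edges.append((("src", t), ("dst", t + 1), 3, 5))   # Internet: limited bandwidth,
                                                       # cost proportional to volume
edges.append((("src", 0), ("dst", 2), 10, 2))          # shipping: 2-step latency; fixed
                                                       # cost flattened to per-unit here

def min_cost_flow(edges, source, sink, amount):
    """Successive shortest augmenting paths, using Bellman-Ford so the
    residual graph's negative-cost reverse edges are handled."""
    cap = defaultdict(int)
    cost = {}
    adj = defaultdict(set)
    for u, v, c, w in edges:
        cap[(u, v)] += c
        cost[(u, v)] = w
        cost[(v, u)] = -w
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    while amount > 0:
        dist = defaultdict(lambda: float("inf"))
        dist[source] = 0
        prev = {}
        for _ in range(len(adj)):          # Bellman-Ford over residual edges
            for u in list(adj):
                for v in adj[u]:
                    if cap[(u, v)] > 0 and dist[u] + cost[(u, v)] < dist[v]:
                        dist[v] = dist[u] + cost[(u, v)]
                        prev[v] = u
        if sink not in prev:
            raise ValueError("deadline infeasible for this amount of data")
        path = [sink]                      # walk predecessors back to the source
        while path[-1] != source:
            path.append(prev[path[-1]])
        path.reverse()
        push = min([amount] + [cap[(path[i], path[i + 1])]
                               for i in range(len(path) - 1)])
        for i in range(len(path) - 1):     # augment along the cheapest path
            u, v = path[i], path[i + 1]
            cap[(u, v)] -= push
            cap[(v, u)] += push
            total += push * cost[(u, v)]
        amount -= push
    return total

# 10 units due by t=3: shipping everything (2/unit) beats the Internet (5/unit)
cost_to_deadline = min_cost_flow(edges, ("src", 0), ("dst", 3), 10)
```

Tightening the deadline (or enlarging the data) eventually makes the solver raise `deadline infeasible`, which mirrors the "minimize $ cost subject to a time deadline" framing: the deadline is a hard constraint, not a soft penalty.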

Page 8: Satisfying Strong Application Requirements in Data-Intensive Clouds

Pandora-B: Bulk Data Transfer via Internet and Shipping Networks

• Minimize transfer time subject to a $ budget
  – Bounded binary search on Pandora-A solutions
  – Bounds created by transforming time-expanded networks

[Plot: Dollar Cost ($) vs. Transfer Time T (hrs), showing the budget B and the upper (UB) and lower (LB) bounds that bracket the search]
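The outer loop above can be sketched as a binary search over deadlines, calling a Pandora-A-style solver at each step. Here `min_cost_for_deadline` is a hypothetical stand-in for that solver (it must be non-increasing in T: more time never costs more), and the naive bounds [1, t_max] stand in for the tighter bounds the slide derives from transformed time-expanded networks.

```python
def pandora_b(min_cost_for_deadline, budget, t_max):
    """Smallest integer deadline T whose minimum transfer cost fits the budget,
    or None if even the slowest plan busts the budget."""
    if min_cost_for_deadline(t_max) > budget:
        return None
    lo, hi = 1, t_max
    while lo < hi:
        mid = (lo + hi) // 2
        if min_cost_for_deadline(mid) <= budget:
            hi = mid            # feasible: try a shorter deadline
        else:
            lo = mid + 1        # infeasible: allow more time
    return lo

# Toy cost curve (invented): each extra hour saves $10 off a $100 base.
fastest = pandora_b(lambda T: max(0, 100 - 10 * T), budget=40, t_max=24)
# fastest == 6: T=6 is the shortest deadline whose min cost ($40) fits the budget
```

Monotonicity is what makes the binary search sound: once a deadline is affordable, every longer deadline is too, so feasibility is a step function of T.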

Page 9: Satisfying Strong Application Requirements in Data-Intensive Clouds

Vivace: Consistent data for congested geo-distributed systems

• Strongly consistent key-value store
  – Low latency across geo-distributed data centers
  – Under congestion
• New algorithms
  – Prioritize a small amount of critical information
  – To avoid delay due to congestion
• Evaluated using a practical prioritization infrastructure
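The prioritization idea (a small critical message jumps ahead of bulk data on a congested link) can be caricatured with a priority queue. This is only a caricature of that one idea; it is not Vivace's replication algorithms or its actual prioritization infrastructure, and the message contents are invented.

```python
import heapq

class PriorityLink:
    """Toy congested link: queued messages drain in priority order, so a
    small critical message overtakes bulk data queued ahead of it."""
    def __init__(self):
        self._q = []
        self._seq = 0                       # tie-breaker preserves FIFO order

    def send(self, msg, critical):
        prio = 0 if critical else 1         # critical messages jump the queue
        heapq.heappush(self._q, (prio, self._seq, msg))
        self._seq += 1

    def drain(self):
        out = []
        while self._q:
            out.append(heapq.heappop(self._q)[2])
        return out

link = PriorityLink()
link.send("bulk: 1MB value for key k", critical=False)
link.send("critical: timestamp+ack for key k", critical=True)
# Draining delivers the critical message first, despite arriving second.
```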

Page 10: Satisfying Strong Application Requirements in Data-Intensive Clouds

Natjam: Prioritizing production jobs in MapReduce/Hadoop

• Mixed workloads
  – Production jobs
    • Time sensitive
    • Directly affect revenue
  – Research jobs
    • e.g., long-term analysis

• Example: Ad provider
  – Production: count clicks in the ad click-through logs and update ads. Slow counts → show old ads → don't get paid $$$
  – Research: is there a better way to place ads? Run machine learning analysis over lots of historical logs; needs a large cluster.

⇒ Prioritize production jobs

Page 11: Satisfying Strong Application Requirements in Data-Intensive Clouds

Contributions

• Natjam prioritizes production jobs
  • While giving research jobs spare capacity
• Suspend/Resume tasks in research jobs
  – Production jobs can gain resources immediately
  – Research jobs can use many resources at a time, without wasting work
• Develop eviction policies that choose which tasks to suspend

Page 12: Satisfying Strong Application Requirements in Data-Intensive Clouds

Natjam Outline

• Motivation
• Contributions
• Background: MapReduce/Hadoop
• State-of-the-art
• Solution: Suspend/Resume
• Design
• Evaluation

Page 13: Satisfying Strong Application Requirements in Data-Intensive Clouds

Background: MapReduce/Hadoop

• Distributed computation on a large cluster
• Each job consists of Map and Reduce tasks
• Job stages
  1. Map tasks run computations in parallel
  2. Shuffle combines intermediate Map outputs
  3. Reduce tasks run computations in parallel

[Diagram: Map tasks (M) feeding Reduce tasks (R) through the shuffle]
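The three stages above can be illustrated with a single-process word count (illustration only; a real Hadoop job distributes each stage across the cluster):

```python
from collections import defaultdict
from itertools import chain

def map_task(line):                       # 1. Map: emit (key, value) pairs
    return [(word, 1) for word in line.split()]

def shuffle(mapped):                      # 2. Shuffle: group values by key
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

def reduce_task(key, values):             # 3. Reduce: combine each key's group
    return key, sum(values)

lines = ["a b a", "b c"]
mapped = [map_task(line) for line in lines]      # Map tasks run in parallel
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
# counts == {"a": 2, "b": 2, "c": 1}
```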

Page 14: Satisfying Strong Application Requirements in Data-Intensive Clouds

Background: MapReduce/Hadoop (continued)

• Map input/Reduce output stored in a distributed file system (e.g., HDFS)
• Scheduling: which task to run on empty resources (slots)

[Diagram: Map (M) and Reduce (R) tasks from Job 1, Job 2, and Job 3 waiting to be scheduled onto cluster slots]

Page 15: Satisfying Strong Application Requirements in Data-Intensive Clouds

State-of-the-art: Separate clusters

• Submit production jobs to a production cluster
• Submit research jobs to a research cluster

Page 16: Satisfying Strong Application Requirements in Data-Intensive Clouds

State-of-the-art: Separate clusters (continued)

• Trace of job submissions to a Yahoo production cluster
• Periods of under-utilization, where research jobs could potentially fill in

[Plot: # Reduce slots (0 to 10,000) vs. time (hours:mins, 0:20 to 1:00), showing the Reduce slot capacity line and periods of under-utilization. Plot used with permission from Yahoo.]

Page 17: Satisfying Strong Application Requirements in Data-Intensive Clouds

State-of-the-art: Single cluster, Hadoop scheduling

• Ideally,
  – Enough capacity for production jobs
  – Run research tasks on all idle production slots
• But,
  – Killing tasks (e.g., Fair Scheduler) can lead to wasted work

[Plot: # Reduce slots (0 to 10,000) vs. time (hours:mins, 0:20 to 1:00), with the killed tasks' wasted work highlighted. Plot used with permission from Yahoo.]

Page 18: Satisfying Strong Application Requirements in Data-Intensive Clouds

State-of-the-art: Single cluster, Hadoop scheduling (continued)

• But,
  – Killing tasks (e.g., Fair Scheduler) can lead to wasted work
  – No preemption (e.g., Capacity Scheduler) can lead to production jobs waiting for resources

[Plot: # Reduce slots (0 to 10,000) vs. time (hours:mins, 0:20 to 1:00), highlighting periods where production jobs aren't assigned resources. Plot used with permission from Yahoo.]

Page 19: Satisfying Strong Application Requirements in Data-Intensive Clouds

Approach: Suspend/Resume

• Suspend/Resume tasks within and across research jobs
  – Production jobs can gain resources immediately
  – Research jobs can use many resources at a time, without wasting work
• Focus on Reduce tasks
  – Reduce tasks take longer, so there is more work to lose (median Map task 19 seconds vs. Reduce task 231 seconds [Facebook])

[Plot: # Reduce slots (0 to 10,000) vs. time (hours:mins, 0:20 to 1:00). Plot used with permission from Yahoo.]

Page 20: Satisfying Strong Application Requirements in Data-Intensive Clouds

Goals: Prioritize production jobs

• Requirement: Production jobs should have the same completion time as if they were executed in an exclusive production cluster
  – Possibly with a small overhead
• Optimization: Research jobs should have the shortest completion time possible
• Constraint: Finite cluster resources

Page 21: Satisfying Strong Application Requirements in Data-Intensive Clouds

Challenges

• Avoid Suspend overhead
  – Overhead here would require production jobs to wait for resources
• Avoid Resume overhead
  – Overhead here would delay research jobs from making progress
• Optimize task evictions
  – Job completion time is the metric that users care about
  – Develop eviction policies that have the least impact on job completion times

Page 22: Satisfying Strong Application Requirements in Data-Intensive Clouds

Natjam Design

• Scheduler: Hadoop → Natjam
• Architecture: Hadoop → Natjam
• Suspend/Resume tasks
• Eviction Policies
  – Task
  – Job

Page 23: Satisfying Strong Application Requirements in Data-Intensive Clouds

Background: Capacity Scheduler

• Limitation: research jobs cannot scale down
• Hadoop capacity is shared using queues
  – Guaranteed capacity (G)
  – Maximum capacity (M)

Page 24: Satisfying Strong Application Requirements in Data-Intensive Clouds

Background: Capacity Scheduler (continued)

• Example
  – Production (P) queue: G 80% / M 80%
  – Research (R) queue: G 20% / M 40%

1. Production job submitted first: P takes 80% (under-utilization) → R grows to 40%
2. Research job submitted first: R takes 40% (under-utilization) → P cannot grow beyond 60%
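The example can be replayed with a tiny allocation model. This is a sketch of the queue-cap behavior described above, not Hadoop's implementation: an arriving queue takes up to its maximum capacity from whatever is free, and a running queue never scales down. (Being static, it does not capture queues later growing as other jobs finish.)

```python
def allocate(queues, arrival_order):
    """Fraction of the cluster each queue ends up holding, in arrival order."""
    used = {q: 0.0 for q in queues}
    for q in arrival_order:
        free = 1.0 - sum(used.values())
        used[q] = min(queues[q]["max"], free)   # capped by M and by what's free
    return used

queues = {"P": {"guaranteed": 0.80, "max": 0.80},
          "R": {"guaranteed": 0.20, "max": 0.40}}

first_p = allocate(queues, ["P", "R"])  # P takes its 80% cap; R squeezed to 20%
first_r = allocate(queues, ["R", "P"])  # R takes 40%; P capped at the remaining
                                        # 60%, below its 80% guarantee
```

The second case is the limitation the slides point out: because R cannot scale down, P never reaches its guaranteed 80%.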

Page 25: Satisfying Strong Application Requirements in Data-Intensive Clouds

Natjam Scheduler

• Does not require Maximum capacity
• Scales down research jobs

Page 26: Satisfying Strong Application Requirements in Data-Intensive Clouds

Natjam Scheduler (continued)

• Example timelines (research job submitted first, production job later):
  1. P/R Guaranteed 80%/20%: R takes 100% → P takes 80%
  2. P/R Guaranteed 100%/0%: R takes 100% → P takes 100%

⇒ Prioritize production jobs

Page 27: Satisfying Strong Application Requirements in Data-Intensive Clouds

Background: Hadoop YARN architecture

• Resource Manager, which runs the Capacity Scheduler
• Application Master per application
• Tasks are launched on containers of memory
  – Formerly, slots in Hadoop

[Diagram: the Resource Manager's Capacity Scheduler assigns containers on Nodes A and B; each Node Manager hosts an Application Master, running tasks, and possibly empty containers; Application Masters ask the Resource Manager for containers]

Page 28: Satisfying Strong Application Requirements in Data-Intensive Clouds

Suspend/Resume architecture

• Preemptor (in the Resource Manager)
  – Decides when resources should be reclaimed from queues
  – Chooses the victim job
• Releaser (in the Application Master)
  – Chooses the task to evict
• Local Suspender (at the task)
  – Saves state
  – Promptly exits
• Messaging overheads

[Diagram: the Preemptor sends preempt() with the number of containers to release; the victim job's Releaser sends release() to a task's Local Suspender, which suspends the task and saves its state; the freed container is handed to the asking job, and resume() later restarts the suspended task]

Page 29: Satisfying Strong Application Requirements in Data-Intensive Clouds

Suspending and Resuming Tasks

• When suspending, we must save enough state to be used when resuming the task.
• By using existing intermediate data, we save only a small amount of state
  – Simple
  – Low overhead

Page 30: Satisfying Strong Application Requirements in Data-Intensive Clouds

Suspending and Resuming Tasks (continued)

• Existing intermediate data used
  – Reduce inputs, stored at the local host
  – Reduce outputs, stored on HDFS
• Suspend state saved
  – Key counter
  – Reduce input path
  – Hostname
  – List of suspended task attempt IDs

[Diagram: suspended Task Attempt 1 frees its container and saves its suspend state; its partial output remains in HDFS at tmp/task_att_1. Resumed Task Attempt 2 re-reads the inputs, uses the key counter to skip already-processed keys, and writes to tmp/task_att_2 before committing to outdir/]
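The saved suspend state can be pictured as a small record. The field names below are hypothetical: the talk lists the state's contents but not Natjam's actual layout.

```python
from dataclasses import dataclass

@dataclass
class SuspendState:
    """Hypothetical record for the suspend state listed above (illustrative
    names, not Natjam's actual format)."""
    key_counter: int             # keys already reduced; resume skips past these
    reduce_input_path: str       # Reduce inputs spilled on the local host
    hostname: str                # host holding those inputs
    suspended_attempt_ids: list  # e.g., ["task_att_1"]

state = SuspendState(key_counter=1042,
                     reduce_input_path="/local/spill/task_att_1",
                     hostname="nodeA",
                     suspended_attempt_ids=["task_att_1"])

def keys_to_skip(s: SuspendState) -> int:
    """A resumed attempt fast-forwards its input past already-reduced keys."""
    return s.key_counter
```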

Page 31: Satisfying Strong Application Requirements in Data-Intensive Clouds

Two-level Eviction Policies

• Job-level eviction
  – Chooses the victim job
• Task-level eviction
  – Chooses the task to evict

[Diagram: the Resource Manager's Preemptor picks the victim job and sends preempt() with the number of containers to release; that job's Application Master's Releaser picks the task and calls release()]

Page 32: Satisfying Strong Application Requirements in Data-Intensive Clouds

Task eviction policies

• Based on time remaining
  – The last task to finish decides the job completion time
  – A task that finishes earlier releases its container earlier
• The Application Master keeps track of time remaining

• Shortest Remaining Time (SRT)
  + Shortens the tail
  – Holds on to containers that would be released soon
• Longest Remaining Time (LRT)
  – May lengthen the tail
  + Releases containers as soon as possible
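The two policies reduce to a min or max over the Application Master's remaining-time estimates. A sketch, not Natjam's code; the task IDs and times are invented:

```python
def pick_victim_task(tasks, policy):
    """tasks: (task_id, estimated seconds remaining) pairs for one research
    job, as tracked by the Application Master."""
    if policy == "SRT":     # Shortest Remaining Time: evict the task
        choose = min        # closest to finishing
    elif policy == "LRT":   # Longest Remaining Time: evict the task
        choose = max        # with the most work left
    else:
        raise ValueError(policy)
    return choose(tasks, key=lambda t: t[1])[0]

tasks = [("r1", 200), ("r2", 30), ("r3", 120)]
# SRT evicts "r2" (30s left); LRT evicts "r1" (200s left)
```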

Page 33: Satisfying Strong Application Requirements in Data-Intensive Clouds

Job eviction policies

• Based on the amount of resources (e.g., memory) held by each job
• The Resource Manager holds this resource information

• Least Resources (LR)
  + Large jobs benefit
  – Starvation, even with small production jobs
• Most Resources (MR)
  + Small jobs benefit
  – Large jobs may be delayed for a long time
• Probabilistically-weighted on Resources (PR)
  + Avoids biasing tasks: the chance of eviction for a task is the same across all jobs, assuming a random task eviction policy
  – Many jobs may be delayed
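The three job-level policies can be sketched the same way. Again a sketch, not Natjam's code; the job IDs and resource figures are invented:

```python
import random

def pick_victim_job(jobs, policy, rng=random):
    """jobs: job_id -> resources held (e.g., GB of container memory), as
    known to the Resource Manager."""
    if policy == "LR":      # Least Resources held
        return min(jobs, key=jobs.get)
    if policy == "MR":      # Most Resources held
        return max(jobs, key=jobs.get)
    if policy == "PR":      # Probabilistically weighted on resources held:
        ids = list(jobs)    # a job holding twice the memory is twice as likely
                            # to be the victim, so each *task* has the same
                            # eviction chance under random task eviction
        return rng.choices(ids, weights=[jobs[j] for j in ids])[0]
    raise ValueError(policy)

jobs = {"j1": 8, "j2": 2, "j3": 6}
# LR evicts "j2"; MR evicts "j1"; PR evicts a job with probability
# proportional to its share (8:2:6)
```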

Page 34: Satisfying Strong Application Requirements in Data-Intensive Clouds

Evaluation

• Microbenchmarks
• Trace-driven experiments

• Natjam was implemented on Hadoop 0.23 (YARN)
• 7-node cluster in CCT

Page 35: Satisfying Strong Application Requirements in Data-Intensive Clouds

Microbenchmarks: Setup

• Average completion times on an empty cluster
  – Research job: ~200s
  – Production job: ~70s
• Job sizes: XL (100% of cluster), L (75%), M (50%), S (25%)
• Task workloads within a job chosen uniformly from the range (1/2 of the largest task, largest task]

Page 36: Satisfying Strong Application Requirements in Data-Intensive Clouds

Microbenchmark: Comparing Natjam to other techniques

• Workload: Research-XL job submitted at t=0s, Production-S job at t=50s

[Bar chart: average execution time (seconds, 0 to 350) of the Research-XL and Production-S jobs under Ideal, Capacity scheduler hard cap, Capacity scheduler soft cap, Killing, and Natjam. Callouts: 50% more than ideal; 90% more than ideal; 20% more than ideal; 2% more than ideal, 15% less than Killing; 7% more than ideal, 40% less than Soft cap]

Page 37: Satisfying Strong Application Requirements in Data-Intensive Clouds

Microbenchmark: Suspend overhead

• 1.25s (50%) increase due to messaging delays
• Task assignments happen in parallel; the 4.7s increase in job completion time consists of
  i. Assign Application Master
  ii. Assign Map tasks
  iii. Assign Reduce tasks

[Bar chart: average time (seconds, 0 to 4), with callout: 1.25s (50%) increase]

Page 38: Satisfying Strong Application Requirements in Data-Intensive Clouds

Microbenchmark: Task eviction policies

• Workload: Research-XL job submitted at t=0s, Production-S job at t=50s
• Shortest remaining time: 17% less than Random

[Bar chart: average execution time (seconds, 0 to 300) of the Research-XL job under Random, Longest remaining time, and Shortest remaining time eviction]

Theorem 1: When production tasks are the same length, SRT results in the shortest job completion time.

Page 39: Satisfying Strong Application Requirements in Data-Intensive Clouds

Microbenchmark: Job eviction policies

• Workload: Research-L and Research-S jobs submitted at t=0s, Production-S job at t=50s
• Most Resources + SRT = good fit

[Bar chart: average execution time (seconds, 0 to 300) of the Research-L and Research-S jobs under Probabilistic, Most Resources, and Least Resources job eviction]

Theorem 2: When tasks within each job are the same length, evicting from the minimum number of jobs results in the shortest average job completion time.

Page 40: Satisfying Strong Application Requirements in Data-Intensive Clouds

Trace-driven evaluation

• Yahoo trace: scaled production cluster workload + scaled research cluster
• Job completion times

[Scatter plot: completion time (seconds, 0 to 5000) vs. submission time (mins:seconds, 00:00 to 1:00:00)]

Page 41: Satisfying Strong Application Requirements in Data-Intensive Clouds

Trace-driven evaluation: Research jobs only

[Scatter plot: completion time (seconds, 0 to 3000) vs. submission time (mins:seconds, 00:00 to 1:00:00), comparing Natjam, Soft Cap, and Killing; callout: 115 seconds]

Page 42: Satisfying Strong Application Requirements in Data-Intensive Clouds

Trace-driven evaluation: CDF of differences (negative is good)

[Four CDF plots of per-job completion-time differences (x-axis in seconds, y-axis 0 to 1):
 – Production Jobs: Natjam - Soft Cap (-250 to 250)
 – Production Jobs: Natjam - Killing (-250 to 250)
 – Research Jobs: Natjam - Soft Cap (-1250 to 1250)
 – Research Jobs: Natjam - Killing (-250 to 250)]

Page 43: Satisfying Strong Application Requirements in Data-Intensive Clouds

Related Work

• Single-cluster job scheduling has focused on:
  – Locality of Map tasks [Quincy, Delay Scheduling]
  – Speculative execution [LATE Scheduler]
  – Average fairness between queues [Capacity Scheduler, Fair Scheduler]
  – Recent work: elastic queues [Amoeba]
• We solve the requirement of prioritizing production jobs

Page 44: Satisfying Strong Application Requirements in Data-Intensive Clouds

Natjam summary

• Natjam prioritizes production jobs
• Suspend/Resume tasks in research jobs
• Eviction policies that choose which tasks to suspend
• Evaluation
  – Microbenchmarks
  – Trace-driven experiments

Page 45: Satisfying Strong Application Requirements in Data-Intensive Clouds

Conclusion

Solution                 | Strong user requirement    | Key optimized metric
-------------------------|----------------------------|---------------------
Pandora-A [ICDCS 2010]   | Deadline                   | Low $ cost
Pandora-B [ICAC 2011]    | $ Budget                   | Short transfer time
Natjam                   | Prioritize production jobs | Job completion time
Vivace [USENIX ATC 2012] | Consistency                | Low latency

• Thesis: It is feasible to satisfy strong application requirements for data-intensive cloud computing environments, in spite of resource limitations, while simultaneously optimizing run-time metrics.
• Contributions: Solutions that reinforce this statement in diverse data-intensive cloud settings.