Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Description: The presentation of my Ph.D. thesis on size-based scheduling policies for Data-Intensive Scalable Computing systems, in particular Hadoop.

Transcript of Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Mario Pastorelli

Jury:
- Prof. Ernst BIERSACK
- Prof. Guillaume URVOY-KELLER
- Prof. Giovanni CHIOLA
- Dr. Patrick BROWN

Supervisor: Prof. Pietro MICHIARDI

Mario Pastorelli (EURECOM), Ph.D. Thesis Defense, 18 July 2014

Context 1/3

- In 2004, Google presented MapReduce, a system used to process large quantities of data. The key ideas are:
  - Client-server architecture
  - Move the computation, not the data
  - A programming model inspired by Lisp list functions:

      map    : (k1, v1)   → [(k2, v2)]
      reduce : (k2, [v2]) → [(k3, v3)]

- Hadoop, the main open-source implementation of MapReduce, was released one year later. It is widely adopted and used by many important companies (Facebook, Twitter, Yahoo, IBM, Microsoft, ...). A toy illustration of the programming model follows below.
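
The map/reduce signatures above can be illustrated with the canonical word-count example. This is a toy sketch in plain Python (not Hadoop code); the tiny local "runtime" that shuffles pairs by key is a stand-in for the framework.

    from itertools import groupby
    from operator import itemgetter

    def map_fn(_, line):              # (k1, v1) → [(k2, v2)]
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):      # (k2, [v2]) → [(k3, v3)]
        return [(word, sum(counts))]

    # Toy runtime: apply map, shuffle (group by key), then reduce.
    lines = enumerate(["a rose is a rose", "is a rose"])
    pairs = sorted(p for k, v in lines for p in map_fn(k, v))
    result = [r for key, grp in groupby(pairs, key=itemgetter(0))
              for r in reduce_fn(key, [v for _, v in grp])]
    print(result)   # [('a', 3), ('is', 2), ('rose', 3)]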

Context 2/3

In MapReduce, the scheduling policy is fundamental

- Complexity of the system:
  - Distributed resources
  - Multiple jobs running in parallel
  - Jobs are composed of two sequential phases, the map phase and the reduce phase
  - Each phase is composed of multiple tasks, and each task runs on a slot of a client
- Heterogeneous workloads:
  - Big differences in job sizes
  - Interactive jobs (e.g. data exploration, algorithm tuning, orchestration jobs, ...) must run as soon as possible...
  - ...without impacting batch jobs too much

Context 3/3

- Schedulers (strive to) optimize one or more metrics. For example:
  - Fairness: how a job is treated compared to the others
  - Mean response time of jobs, i.e. the responsiveness of the system
  - ...
- Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness rather than on other metrics
- Short response times are very important! Usually one or more system administrators make a manual, ad-hoc configuration:
  - Fine-tuning of the scheduler parameters
  - Configuration of pools of jobs with priorities
  - Complex, error-prone and difficult to adapt to workload/cluster changes

Motivations

- Size-based schedulers are more efficient than other schedulers (in theory)...
  - Job priority is based on the job size
  - Resources are focused on a few jobs instead of being split among many jobs
- ...but (in practice) they are not adopted in real systems
  - Job size is unknown
  - No studies on their applicability to distributed systems
- MapReduce is suitable for size-based scheduling
  - We don't have the job size, but we have the time to estimate it
  - No perfect estimation is required...
  - ...as long as very different jobs are sorted correctly

Size-Based Schedulers: Example

  Job    Arrival Time   Size
  job1   0s             30s
  job2   10s            10s
  job3   15s            10s

  Scheduler                                   AVG sojourn time
  Processor Sharing                           35s
  Shortest Remaining Processing Time (SRPT)   25s

(The slide shows the two resulting schedules side by side.)
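
The two averages can be reproduced with a small event-driven simulation. This is a minimal sketch written for this transcript, not code from the thesis; the job list comes from the table above, and both functions assume the system is never empty between the first arrival and the last completion (which holds here).

    def srpt_mean_sojourn(jobs):
        # jobs: list of (arrival, size); run the job with the smallest
        # remaining size, preempting on every arrival.
        order = sorted(enumerate(jobs), key=lambda x: x[1][0])
        t, i, remaining, finish = 0.0, 0, {}, {}
        while len(finish) < len(jobs):
            while i < len(order) and order[i][1][0] <= t:
                jid, (_, size) = order[i]
                remaining[jid] = size
                i += 1
            jid = min(remaining, key=remaining.get)
            next_arr = order[i][1][0] if i < len(order) else float("inf")
            t_next = min(next_arr, t + remaining[jid])
            remaining[jid] -= t_next - t
            if remaining[jid] <= 1e-9:
                finish[jid] = t_next
                del remaining[jid]
            t = t_next
        return sum(finish[j] - jobs[j][0] for j in finish) / len(jobs)

    def ps_mean_sojourn(jobs):
        # Processor sharing: all queued jobs progress at rate 1/n.
        order = sorted(enumerate(jobs), key=lambda x: x[1][0])
        t, i, remaining, finish = 0.0, 0, {}, {}
        while len(finish) < len(jobs):
            while i < len(order) and order[i][1][0] <= t:
                jid, (_, size) = order[i]
                remaining[jid] = size
                i += 1
            rate = 1.0 / len(remaining)
            next_arr = order[i][1][0] if i < len(order) else float("inf")
            t_next = min(next_arr, t + min(remaining.values()) / rate)
            for jid in list(remaining):
                remaining[jid] -= (t_next - t) * rate
                if remaining[jid] <= 1e-9:
                    finish[jid] = t_next
                    del remaining[jid]
            t = t_next
        return sum(finish[j] - jobs[j][0] for j in finish) / len(jobs)

    jobs = [(0, 30), (10, 10), (15, 10)]   # the three jobs of the table
    print(ps_mean_sojourn(jobs))           # 35.0
    print(srpt_mean_sojourn(jobs))         # 25.0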

Challenges

- Job sizes are unknown: how do you obtain an approximation of a job size while the job is running?
- Estimation errors: how do you cope with an approximated size?
- Scheduler for real and distributed systems: can we design a size-based scheduler that works for existing systems?
- Job preemption: preemption is fundamental for scheduling, but current systems support it only partially. Can we improve that?

The Hadoop Fair Sojourn Protocol

Hadoop Fair Sojourn Protocol [BIGDATA 2013]

A size-based scheduler for Hadoop that is fair and achieves small response times

- The map and the reduce phases are treated independently, and thus a job has two sizes
- Size estimation is done in two steps by the Estimation Module
- Estimated sizes are then given as input to the Aging Module, which converts them into virtual sizes to avoid starvation
- Jobs with the smallest virtual sizes are scheduled

Estimation Module

- Two ways to estimate a job size:
  - Offline: based on the information available a priori (number of tasks, block size, past history, ...): available from job submission, but not very precise
  - Online: based on the performance of a subset of t tasks: needs time for training, but more precise
- We need both (see the sketch after this list):
  - Offline estimation for the initial size, because jobs need a size from the moment they are submitted
  - Online estimation because it is more precise: when the training completes, the job size is updated to the final size
- Tiny jobs: jobs with fewer than t tasks are considered tiny and get the highest possible priority
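
A minimal sketch of the two estimation steps for one phase of a job. This is an illustration of the idea, not the HFSP implementation; avg_task_time is a hypothetical calibration constant for the offline step, and t = 5 training tasks matches the error plots later in the deck.

    T = 5  # number of training tasks

    def offline_estimate(num_tasks, avg_task_time=1.0):
        # A priori estimate, available at submission time:
        # proportional to the number of tasks of the phase.
        return num_tasks * avg_task_time

    def online_estimate(num_tasks, training_times):
        # Once the T training tasks are done, scale their mean
        # duration up to the whole phase.
        return num_tasks * sum(training_times) / len(training_times)

    # Usage: start from the offline estimate, then refine.
    size = offline_estimate(num_tasks=120)
    size = online_estimate(120, training_times=[3.1, 2.9, 3.4, 3.0, 3.2])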

Aging Module 1/2

- Aging: the longer a job stays in the queue, the higher its priority becomes
- A technique used in the literature to age jobs is the virtual size:
  - Each job is run in a simulation using processor sharing
  - The output of the simulation is the job's virtual size, i.e. the job size aged by the amount of time the job has spent in the simulation
  - Jobs are sorted by remaining virtual size, and resources are assigned to the job with the smallest virtual size

(The slide shows two plots of three jobs over 10 seconds: "Virtual Size (Simulation)", with job virtual time on the y-axis, and "Real Size (Real Scheduling)", with job size on the y-axis.)

Aging Module 2/2

- In HFSP, the estimated sizes are converted into virtual sizes by the Aging Module
  - The simulation is run in a virtual cluster that has the same resources as the real one
  - Processor sharing is simulated with max-min fair sharing (see the sketch after this list)
- The number of tasks of a job determines how fast it can age
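
A minimal sketch of one tick of the aging simulation, under one reading of the slide (this is not the HFSP code, and the field names are hypothetical): the virtual cluster's slots are split among the queued jobs with max-min fairness, each job's share capped by its number of tasks, which is why the number of tasks bounds how fast a job can age.

    def maxmin_shares(demands, capacity):
        # Classic water-filling: satisfy the smallest demands first,
        # then split what is left equally among the others.
        shares = [0.0] * len(demands)
        order = sorted(range(len(demands)), key=lambda i: demands[i])
        left = float(capacity)
        for k, i in enumerate(order):
            shares[i] = min(demands[i], left / (len(demands) - k))
            left -= shares[i]
        return shares

    def age_one_tick(queue, slots, dt=1.0):
        # queue: list of dicts with 'virtual_size' and 'num_tasks'.
        active = [j for j in queue if j['virtual_size'] > 0]
        if not active:
            return
        shares = maxmin_shares([j['num_tasks'] for j in active], slots)
        for job, share in zip(active, shares):
            # A job ages at the rate of the slots it would receive
            # under processor sharing in the virtual cluster.
            job['virtual_size'] = max(0.0, job['virtual_size'] - share * dt)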

Task Scheduling Policy

- When a job is submitted:
  - If it is tiny, assign it a final size of 0
  - Else:
    - assign it an initial size based on its number of tasks
    - mark the job as being in the training stage and select t training tasks
- When a resource becomes available:
  - If there are jobs in the training stage, assign a task from the job with the smallest initial virtual size
  - Else, assign a task from the job with the smallest final virtual size

A sketch of this dispatch rule follows below.
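
This is the rule above in code form, over a hypothetical job object: the field names (num_tasks, training, initial_virtual_size, final_virtual_size) are assumptions for illustration, and offline_estimate reuses the earlier estimation sketch.

    T = 5  # training tasks per job

    def on_submit(job):
        if job.num_tasks < T:            # tiny job
            job.final_size = 0           # highest possible priority
            job.training = False
        else:
            job.initial_size = offline_estimate(job.num_tasks)
            job.training = True          # T training tasks are selected

    def on_free_slot(queue):
        in_training = [j for j in queue if j.training]
        if in_training:
            # Training jobs first, by smallest initial virtual size.
            job = min(in_training, key=lambda j: j.initial_virtual_size)
        else:
            job = min(queue, key=lambda j: j.final_virtual_size)
        return job.next_task()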

Experimental Evaluation

Experimental Setup

- 20 TaskTrackers (MapReduce clients), for a total of 40 map and 20 reduce slots
- Three kinds of workloads, inspired by existing traces:

  Bin   Dataset Size   Avg. num. Map Tasks   DEV   TEST   PROD
  1     1 GB           < 5                   65%   30%    0%
  2     10 GB          10-50                 20%   40%    10%
  3     100 GB         50-150                10%   10%    60%
  4     1 TB           > 150                 5%    20%    30%

- Each experiment is composed of 100 jobs taken from PigMix and has been executed 5 times
- HFSP is compared to the Fair Scheduler

Performance Metrics

- Mean response time:
  - A job's response time is the time that passes between the job's submission and its completion
  - The mean of the response times of all jobs indicates the responsiveness of the system under that scheduling policy
- Fairness:
  - A common approach is to use the job slowdown, i.e. the ratio between a job's response time and its size, to indicate how fair the scheduler has been with that job (see the snippet after this list)
  - In the literature, a scheduler with the same or smaller slowdowns than Processor Sharing is considered fair
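
Both metrics in code form, as a minimal sketch over hypothetical per-job records with 'submit', 'complete' and 'size' fields:

    def mean_response_time(jobs):
        return sum(j['complete'] - j['submit'] for j in jobs) / len(jobs)

    def slowdowns(jobs):
        # slowdown = response time / size; a value of 1.0 means the job
        # ran as if it were alone in the system.
        return [(j['complete'] - j['submit']) / j['size'] for j in jobs]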

Results: Mean Response Time

  Workload   HFSP   Fair   Change
  DEV        25s    38s    -34%
  TEST       28s    38s    -26%
  PROD       109s   163s   -33%

- Overall, HFSP decreases the mean response time by roughly 30%
- Tiny jobs (bin 1) are treated in the same way by the two schedulers: they run as soon as possible
- Medium, large and huge jobs (bins 2, 3 and 4) are consistently treated better by HFSP, thanks to its size-based sequential nature

Results: Fairness

(The slide shows, for each of the DEV, TEST and PROD workloads, the ECDF of response time / isolation runtime under HFSP and under the Fair Scheduler.)

- HFSP is globally more fair to jobs than the Fair Scheduler
- The "heavier" the workload is, the better HFSP treats jobs compared to the Fair Scheduler
- For the PROD workload, the gap between the median slowdown under HFSP and the one under Fair is one order of magnitude

Impact of the errors

Task Times and Estimation Errors

- Tasks of a single job are stable
- Even a small number of training tasks is enough for estimating the phase size

(The slide shows two ECDFs, each with separate curves for map and reduce tasks: the distribution of task time / mean task time, and the distribution of the estimation error using 5 samples.)

- error = estimated size / real size
  - error > 1 ⇒ the estimated size is bigger than the real one (over-estimation)
  - error < 1 ⇒ the estimated size is smaller than the real one (under-estimation)
- The biggest errors are over-estimations of map phases

Estimation Errors: Job Sizes and Phases

(The slide shows boxplots of the estimation error per bin (2, 3 and 4), for the map phase and for the reduce phase.)

- The majority of estimated sizes are close to the correct one
- There is a tendency to over-estimate in all the bins
- Errors are smaller on medium jobs (bin 2) than on large and huge ones (bins 3 and 4)
- Errors large enough to switch the order of two jobs are highly unlikely

FSP with Estimation Errors

- Our experiments show that, in Hadoop, the estimation errors don't hurt the performance of our size-based scheduler
- Can we abstract from Hadoop and extract a general rule on the applicability of size-based scheduling policies?
- Simulative approach: simulations are fast, making it possible to try different workloads, job arrival times and errors
- Our results show that size-based schedulers, like FSP and SRPT, are tolerant to errors in many cases
- We created FSP+PS, which tolerates even more "extreme" conditions [MASCOTS 2014]

Task Preemption

Task Preemption in HFSP

- In theory:
  - Preemption consists in removing resources from a running job and granting them to another one
  - Without knowledge of the workload, preemptive schedulers outmatch their non-preemptive counterparts
- In practice:
  - Preemption is difficult to implement
  - In Hadoop:
    - Task preemption is supported through the kill primitive: it removes resources from a task by killing it ⇒ all work is lost!
    - The disadvantages of kill are well known, and usually it is disabled or used very carefully
  - HFSP is a preemptive scheduler and supports the task kill primitive

Results: Kill Preemption

(The slide shows two ECDFs comparing the kill and wait strategies: one of the job slowdown and one of the job sojourn time.)

- Kill improves fairness and the response times of small and medium jobs...
- ...but it heavily impacts the response times of large jobs

OS-Assisted Preemption

- Kill preemption is non-optimal: it preempts running tasks, but at a high cost
- Can we build a mechanism that is closer to ideal preemption?
- Idea...
  - Instead of killing a task, we can suspend it where it is running
  - When the task should run again, we can resume it where it was running
- ...but how can it be implemented?
  - Operating systems know very well how to suspend and resume processes
  - At low level, tasks are processes
  - Exploit OS capabilities to get a new preemption primitive: Task Suspension [DCPERF 2014]

Conclusions

Conclusion

- Size-based schedulers with estimated (imprecise) sizes can outperform non-size-based schedulers in real systems
- We showed this by designing the Hadoop Fair Sojourn Protocol, a size-based scheduler for a real, distributed system such as Hadoop
- HFSP is fair and achieves small mean response times
- It can also use Hadoop's preemption mechanism to improve fairness and the response times of small jobs, but this affects the performance of large and huge jobs

Future Work

- HFSP + Suspension: adding the suspension mechanism to HFSP raises many challenges, such as the eviction policy and reduce locality
- Recurring jobs: exploit the past runs of recurring jobs to obtain an almost perfect estimation from the moment they are submitted
- Complex jobs: high-level languages and libraries push the scheduling problem from simple jobs to complex jobs, i.e. chains of simple jobs. Can we adapt HFSP to such jobs?

Backup Slides: Size-Based Scheduling with Estimated Sizes

Impact of Over-estimation and Under-estimation

(The slide shows two pairs of remaining-size-over-time diagrams: one where a job J1 is over-estimated, and one where a job J4 is under-estimated, together with the queued jobs around them.)

- Over-estimating a job affects only that job; other jobs in the queue are not affected
- Under-estimating a job can affect other jobs in the queue

FSP+PS

- In FSP, an under-estimated job can complete in the virtual system before it completes in the real system. We call such jobs late jobs
- When a job is late, it should not prevent other jobs from executing
- FSP+PS solves the problem by scheduling late jobs using processor sharing (see the sketch below)
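
A minimal sketch of the rule, under one reading of the slide (the field names are hypothetical, and this is not the simulator's code): when some jobs are late, they share the capacity in processor-sharing fashion instead of letting a single under-estimated job hold it.

    def fsp_ps_allocation(queue, capacity):
        # queue: list of dicts with 'virtual_remaining' (in the simulated
        # system) and 'real_remaining' (in the real one).
        late = [j for j in queue
                if j['virtual_remaining'] <= 0 and j['real_remaining'] > 0]
        if late:
            # Late jobs were under-estimated: fall back to processor
            # sharing among them so none of them blocks the queue.
            return {id(j): capacity / len(late) for j in late}
        # FSP default: the job with the smallest remaining virtual size
        # gets all the resources.
        best = min(queue, key=lambda j: j['virtual_remaining'])
        return {id(best): capacity}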

OS-Assisted Task Preemption

OS-Assisted Task Preemption

The kill preemption primitive has many drawbacks; can we do better?

- At low level, tasks are processes, and processes can be suspended and resumed by the operating system
- We exploit this mechanism by enabling task suspension and resumption (see the sketch after this list)
- No need to change existing jobs! Everything is done at low level and is transparent to the user
- Bonus: the operating system manages the memory of processes
  - The memory of suspended tasks can be granted to other (running) tasks by the OS...
  - ...and because the OS knows how much memory each process needs, only the memory required will be taken from the suspended task
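
On POSIX systems the primitive can be sketched with the standard job-control signals. This exact use of SIGSTOP/SIGCONT is an illustration of the idea, not necessarily the paper's implementation, and pid is assumed to be the process id of a running task:

    import os
    import signal

    def suspend_task(pid):
        # SIGSTOP cannot be caught or ignored: the task's process is
        # frozen in place, and the OS may page its memory out if other
        # tasks need it.
        os.kill(pid, signal.SIGSTOP)

    def resume_task(pid):
        # SIGCONT wakes the process exactly where it stopped, so no
        # work is lost (unlike the kill primitive).
        os.kill(pid, signal.SIGCONT)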

OS-Assisted Task Preemption: Thrashing

- Thrashing: when data is continuously read from and written to swap space, the machine's performance degrades to the point that the machine doesn't work properly anymore
- Thrashing is caused by a working set (memory) that is larger than the system memory
- In Hadoop this doesn't happen, because:
  - The number of running tasks per machine is limited
  - The heap space per task is limited

OS-Assisted Task Preemption: Experiments

- Test the worst case for suspension, i.e. when the jobs allocate all the memory
- Two jobs, t_h and t_l, each allocating 2 GB of memory

(The slide shows two plots as a function of t_l's progress when t_h is launched: the sojourn time of t_h and the makespan, for the wait, kill and susp strategies.)

- Our primitive outperforms kill and wait
- The swapping overhead doesn't affect the jobs too much

OS-Assisted Task Preemption: Conclusions

- Task suspension/resumption outperforms the current preemption implementations...
- ...but it raises new challenges, e.g. state locality for suspended tasks
- With a good scheduling policy (and eviction policy), OS-assisted preemption can replace the current preemption mechanism
