Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Description: The presentation of my Ph.D. thesis on size-based scheduling policies for Data-Intensive Scalable Computing systems, in particular Hadoop.

Transcript of Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Size-Based Disciplines for Job Scheduling in Data-Intensive Scalable Computing Systems

Mario Pastorelli

Jury:
- Prof. Ernst BIERSACK
- Prof. Guillaume URVOY-KELLER
- Prof. Giovanni CHIOLA
- Dr. Patrick BROWN

Supervisor: Prof. Pietro MICHIARDI

Mario Pastorelli (EURECOM), Ph.D. Thesis Defense, 18 July 2014

Context 1/3

- In 2004, Google presented MapReduce, a system used to process large quantities of data. The key ideas are:
  - Client-server architecture
  - Move the computation, not the data
  - A programming model inspired by Lisp list functions:

      map    : (k1, v1)   → [(k2, v2)]
      reduce : (k2, [v2]) → [(k3, v3)]

- Hadoop, the main open-source implementation of MapReduce, was released one year later. It is widely adopted and used by many important companies (Facebook, Twitter, Yahoo, IBM, Microsoft, ...). A toy illustration of the programming model follows below.
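
The map/reduce signatures above can be illustrated with the canonical word-count example. This is a toy sketch in plain Python (not Hadoop code); the tiny local "runtime" that shuffles pairs by key is a stand-in for the framework.

    from itertools import groupby
    from operator import itemgetter

    def map_fn(_, line):              # (k1, v1) → [(k2, v2)]
        return [(word, 1) for word in line.split()]

    def reduce_fn(word, counts):      # (k2, [v2]) → [(k3, v3)]
        return [(word, sum(counts))]

    # Toy runtime: apply map, shuffle (group by key), then reduce.
    lines = enumerate(["a rose is a rose", "is a rose"])
    pairs = sorted(p for k, v in lines for p in map_fn(k, v))
    result = [r for key, grp in groupby(pairs, key=itemgetter(0))
              for r in reduce_fn(key, [v for _, v in grp])]
    print(result)   # [('a', 3), ('is', 2), ('rose', 3)]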

Context 2/3

In MapReduce, the scheduling policy is fundamental

- Complexity of the system:
  - Distributed resources
  - Multiple jobs running in parallel
  - Jobs are composed of two sequential phases, the map phase and the reduce phase
  - Each phase is composed of multiple tasks, and each task runs on a slot of a client
- Heterogeneous workloads:
  - Big differences in job sizes
  - Interactive jobs (e.g. data exploration, algorithm tuning, orchestration jobs, ...) must run as soon as possible...
  - ...without impacting batch jobs too much

Context 3/3

- Schedulers (strive to) optimize one or more metrics. For example:
  - Fairness: how a job is treated compared to the others
  - Mean response time of jobs, i.e. the responsiveness of the system
  - ...
- Schedulers for Hadoop, e.g. the Fair Scheduler, focus on fairness rather than on other metrics
- Short response times are very important! Usually one or more system administrators make a manual, ad-hoc configuration:
  - Fine-tuning of the scheduler parameters
  - Configuration of pools of jobs with priorities
  - Complex, error-prone and difficult to adapt to workload/cluster changes

Motivations

- Size-based schedulers are more efficient than other schedulers (in theory)...
  - Job priority is based on the job size
  - Resources are focused on a few jobs instead of being split among many jobs
- ...but (in practice) they are not adopted in real systems
  - Job size is unknown
  - No studies on their applicability to distributed systems
- MapReduce is suitable for size-based scheduling
  - We don't have the job size, but we have the time to estimate it
  - No perfect estimation is required...
  - ...as long as very different jobs are sorted correctly

Size-Based Schedulers: Example

  Job    Arrival Time   Size
  job1   0s             30s
  job2   10s            10s
  job3   15s            10s

  Scheduler                                   AVG sojourn time
  Processor Sharing                           35s
  Shortest Remaining Processing Time (SRPT)   25s

(The slide shows the two resulting schedules side by side.)
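
The two averages can be reproduced with a small event-driven simulation. This is a minimal sketch written for this transcript, not code from the thesis; the job list comes from the table above, and both functions assume the system is never empty between the first arrival and the last completion (which holds here).

    def srpt_mean_sojourn(jobs):
        # jobs: list of (arrival, size); run the job with the smallest
        # remaining size, preempting on every arrival.
        order = sorted(enumerate(jobs), key=lambda x: x[1][0])
        t, i, remaining, finish = 0.0, 0, {}, {}
        while len(finish) < len(jobs):
            while i < len(order) and order[i][1][0] <= t:
                jid, (_, size) = order[i]
                remaining[jid] = size
                i += 1
            jid = min(remaining, key=remaining.get)
            next_arr = order[i][1][0] if i < len(order) else float("inf")
            t_next = min(next_arr, t + remaining[jid])
            remaining[jid] -= t_next - t
            if remaining[jid] <= 1e-9:
                finish[jid] = t_next
                del remaining[jid]
            t = t_next
        return sum(finish[j] - jobs[j][0] for j in finish) / len(jobs)

    def ps_mean_sojourn(jobs):
        # Processor sharing: all queued jobs progress at rate 1/n.
        order = sorted(enumerate(jobs), key=lambda x: x[1][0])
        t, i, remaining, finish = 0.0, 0, {}, {}
        while len(finish) < len(jobs):
            while i < len(order) and order[i][1][0] <= t:
                jid, (_, size) = order[i]
                remaining[jid] = size
                i += 1
            rate = 1.0 / len(remaining)
            next_arr = order[i][1][0] if i < len(order) else float("inf")
            t_next = min(next_arr, t + min(remaining.values()) / rate)
            for jid in list(remaining):
                remaining[jid] -= (t_next - t) * rate
                if remaining[jid] <= 1e-9:
                    finish[jid] = t_next
                    del remaining[jid]
            t = t_next
        return sum(finish[j] - jobs[j][0] for j in finish) / len(jobs)

    jobs = [(0, 30), (10, 10), (15, 10)]   # the three jobs of the table
    print(ps_mean_sojourn(jobs))           # 35.0
    print(srpt_mean_sojourn(jobs))         # 25.0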

Challenges

- Job sizes are unknown: how do you obtain an approximation of a job size while the job is running?
- Estimation errors: how do you cope with an approximated size?
- Scheduler for real and distributed systems: can we design a size-based scheduler that works for existing systems?
- Job preemption: preemption is fundamental for scheduling, but current systems support it only partially. Can we improve that?

The Hadoop Fair Sojourn Protocol

Hadoop Fair Sojourn Protocol [BIGDATA 2013]

A size-based scheduler for Hadoop that is fair and achieves small response times

- The map and the reduce phases are treated independently, and thus a job has two sizes
- Size estimation is done in two steps by the Estimation Module
- Estimated sizes are then given as input to the Aging Module, which converts them into virtual sizes to avoid starvation
- Jobs with the smallest virtual sizes are scheduled

Estimation Module

- Two ways to estimate a job size:
  - Offline: based on the information available a priori (number of tasks, block size, past history, ...): available from job submission, but not very precise
  - Online: based on the performance of a subset of t tasks: needs time for training, but more precise
- We need both (see the sketch after this list):
  - Offline estimation for the initial size, because jobs need a size from the moment they are submitted
  - Online estimation because it is more precise: when the training completes, the job size is updated to the final size
- Tiny jobs: jobs with fewer than t tasks are considered tiny and get the highest possible priority
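
A minimal sketch of the two estimation steps for one phase of a job. This is an illustration of the idea, not the HFSP implementation; avg_task_time is a hypothetical calibration constant for the offline step, and t = 5 training tasks matches the error plots later in the deck.

    T = 5  # number of training tasks

    def offline_estimate(num_tasks, avg_task_time=1.0):
        # A priori estimate, available at submission time:
        # proportional to the number of tasks of the phase.
        return num_tasks * avg_task_time

    def online_estimate(num_tasks, training_times):
        # Once the T training tasks are done, scale their mean
        # duration up to the whole phase.
        return num_tasks * sum(training_times) / len(training_times)

    # Usage: start from the offline estimate, then refine.
    size = offline_estimate(num_tasks=120)
    size = online_estimate(120, training_times=[3.1, 2.9, 3.4, 3.0, 3.2])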

Aging Module 1/2

- Aging: the longer a job stays in the queue, the higher its priority becomes
- A technique used in the literature to age jobs is the virtual size:
  - Each job is run in a simulation using processor sharing
  - The output of the simulation is the job's virtual size, i.e. the job size aged by the amount of time the job has spent in the simulation
  - Jobs are sorted by remaining virtual size, and resources are assigned to the job with the smallest virtual size

(The slide shows two plots of three jobs over 10 seconds: "Virtual Size (Simulation)", with job virtual time on the y-axis, and "Real Size (Real Scheduling)", with job size on the y-axis.)

Aging Module 2/2

- In HFSP, the estimated sizes are converted into virtual sizes by the Aging Module
  - The simulation is run in a virtual cluster that has the same resources as the real one
  - Processor sharing is simulated with max-min fair sharing (see the sketch after this list)
- The number of tasks of a job determines how fast it can age
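
A minimal sketch of one tick of the aging simulation, under one reading of the slide (this is not the HFSP code, and the field names are hypothetical): the virtual cluster's slots are split among the queued jobs with max-min fairness, each job's share capped by its number of tasks, which is why the number of tasks bounds how fast a job can age.

    def maxmin_shares(demands, capacity):
        # Classic water-filling: satisfy the smallest demands first,
        # then split what is left equally among the others.
        shares = [0.0] * len(demands)
        order = sorted(range(len(demands)), key=lambda i: demands[i])
        left = float(capacity)
        for k, i in enumerate(order):
            shares[i] = min(demands[i], left / (len(demands) - k))
            left -= shares[i]
        return shares

    def age_one_tick(queue, slots, dt=1.0):
        # queue: list of dicts with 'virtual_size' and 'num_tasks'.
        active = [j for j in queue if j['virtual_size'] > 0]
        if not active:
            return
        shares = maxmin_shares([j['num_tasks'] for j in active], slots)
        for job, share in zip(active, shares):
            # A job ages at the rate of the slots it would receive
            # under processor sharing in the virtual cluster.
            job['virtual_size'] = max(0.0, job['virtual_size'] - share * dt)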

Task Scheduling Policy

- When a job is submitted:
  - If it is tiny, assign it a final size of 0
  - Else:
    - assign it an initial size based on its number of tasks
    - mark the job as being in the training stage and select t training tasks
- When a resource becomes available:
  - If there are jobs in the training stage, assign a task from the job with the smallest initial virtual size
  - Else, assign a task from the job with the smallest final virtual size

A sketch of this dispatch rule follows below.
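
This is the rule above in code form, over a hypothetical job object: the field names (num_tasks, training, initial_virtual_size, final_virtual_size) are assumptions for illustration, and offline_estimate reuses the earlier estimation sketch.

    T = 5  # training tasks per job

    def on_submit(job):
        if job.num_tasks < T:            # tiny job
            job.final_size = 0           # highest possible priority
            job.training = False
        else:
            job.initial_size = offline_estimate(job.num_tasks)
            job.training = True          # T training tasks are selected

    def on_free_slot(queue):
        in_training = [j for j in queue if j.training]
        if in_training:
            # Training jobs first, by smallest initial virtual size.
            job = min(in_training, key=lambda j: j.initial_virtual_size)
        else:
            job = min(queue, key=lambda j: j.final_virtual_size)
        return job.next_task()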

Experimental Evaluation

Experimental Setup

- 20 TaskTrackers (MapReduce clients), for a total of 40 map and 20 reduce slots
- Three kinds of workloads, inspired by existing traces:

  Bin   Dataset Size   Avg. num. Map Tasks   DEV   TEST   PROD
  1     1 GB           < 5                   65%   30%    0%
  2     10 GB          10-50                 20%   40%    10%
  3     100 GB         50-150                10%   10%    60%
  4     1 TB           > 150                 5%    20%    30%

- Each experiment is composed of 100 jobs taken from PigMix and has been executed 5 times
- HFSP is compared to the Fair Scheduler

Performance Metrics

- Mean response time:
  - A job's response time is the time that passes between the job's submission and its completion
  - The mean of the response times of all jobs indicates the responsiveness of the system under that scheduling policy
- Fairness:
  - A common approach is to use the job slowdown, i.e. the ratio between a job's response time and its size, to indicate how fair the scheduler has been with that job (see the snippet after this list)
  - In the literature, a scheduler with the same or smaller slowdowns than Processor Sharing is considered fair
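
Both metrics in code form, as a minimal sketch over hypothetical per-job records with 'submit', 'complete' and 'size' fields:

    def mean_response_time(jobs):
        return sum(j['complete'] - j['submit'] for j in jobs) / len(jobs)

    def slowdowns(jobs):
        # slowdown = response time / size; a value of 1.0 means the job
        # ran as if it were alone in the system.
        return [(j['complete'] - j['submit']) / j['size'] for j in jobs]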

Results: Mean Response Time

  Workload   HFSP   Fair   Change
  DEV        25s    38s    -34%
  TEST       28s    38s    -26%
  PROD       109s   163s   -33%

- Overall, HFSP decreases the mean response time by roughly 30%
- Tiny jobs (bin 1) are treated in the same way by the two schedulers: they run as soon as possible
- Medium, large and huge jobs (bins 2, 3 and 4) are consistently treated better by HFSP, thanks to its size-based sequential nature

Results: Fairness

(The slide shows, for each of the DEV, TEST and PROD workloads, the ECDF of response time / isolation runtime under HFSP and under the Fair Scheduler.)

- HFSP is globally more fair to jobs than the Fair Scheduler
- The "heavier" the workload is, the better HFSP treats jobs compared to the Fair Scheduler
- For the PROD workload, the gap between the median slowdown under HFSP and the one under Fair is one order of magnitude

Impact of the errors

Task Times and Estimation Errors

- Tasks of a single job are stable
- Even a small number of training tasks is enough for estimating the phase size

(The slide shows two ECDFs, each with separate curves for map and reduce tasks: the distribution of task time / mean task time, and the distribution of the estimation error using 5 samples.)

- error = estimated size / real size
  - error > 1 ⇒ the estimated size is bigger than the real one (over-estimation)
  - error < 1 ⇒ the estimated size is smaller than the real one (under-estimation)
- The biggest errors are over-estimations of map phases

Estimation Errors: Job Sizes and Phases

(The slide shows boxplots of the estimation error per bin (2, 3 and 4), for the map phase and for the reduce phase.)

- The majority of estimated sizes are close to the correct one
- There is a tendency to over-estimate in all the bins
- Errors are smaller on medium jobs (bin 2) than on large and huge ones (bins 3 and 4)
- Errors large enough to switch the order of two jobs are highly unlikely

FSP with Estimation Errors

- Our experiments show that, in Hadoop, the estimation errors don't hurt the performance of our size-based scheduler
- Can we abstract from Hadoop and extract a general rule on the applicability of size-based scheduling policies?
- Simulative approach: simulations are fast, making it possible to try different workloads, job arrival times and errors
- Our results show that size-based schedulers, like FSP and SRPT, are tolerant to errors in many cases
- We created FSP+PS, which tolerates even more "extreme" conditions [MASCOTS 2014]

Task Preemption

Task Preemption in HFSP

- In theory:
  - Preemption consists in removing resources from a running job and granting them to another one
  - Without knowledge of the workload, preemptive schedulers outmatch their non-preemptive counterparts
- In practice:
  - Preemption is difficult to implement
  - In Hadoop:
    - Task preemption is supported through the kill primitive: it removes resources from a task by killing it ⇒ all work is lost!
    - The disadvantages of kill are well known, and usually it is disabled or used very carefully
  - HFSP is a preemptive scheduler and supports the task kill primitive

Results: Kill Preemption

(The slide shows two ECDFs comparing the kill and wait strategies: one of the job slowdown and one of the job sojourn time.)

- Kill improves fairness and the response times of small and medium jobs...
- ...but it heavily impacts the response times of large jobs

OS-Assisted Preemption

- Kill preemption is non-optimal: it preempts running tasks, but at a high cost
- Can we build a mechanism that is closer to ideal preemption?
- Idea...
  - Instead of killing a task, we can suspend it where it is running
  - When the task should run again, we can resume it where it was running
- ...but how can it be implemented?
  - Operating systems know very well how to suspend and resume processes
  - At low level, tasks are processes
  - Exploit OS capabilities to get a new preemption primitive: Task Suspension [DCPERF 2014]

Conclusions

Conclusion

- Size-based schedulers with estimated (imprecise) sizes can outperform non-size-based schedulers in real systems
- We showed this by designing the Hadoop Fair Sojourn Protocol, a size-based scheduler for a real, distributed system such as Hadoop
- HFSP is fair and achieves small mean response times
- It can also use Hadoop's preemption mechanism to improve fairness and the response times of small jobs, but this affects the performance of large and huge jobs

Future Work

- HFSP + Suspension: adding the suspension mechanism to HFSP raises many challenges, such as the eviction policy and reduce locality
- Recurring jobs: exploit the past runs of recurring jobs to obtain an almost perfect estimation from the moment they are submitted
- Complex jobs: high-level languages and libraries push the scheduling problem from simple jobs to complex jobs, i.e. chains of simple jobs. Can we adapt HFSP to such jobs?

Backup Slides: Size-Based Scheduling with Estimated Sizes

Impact of Over-estimation and Under-estimation

(The slide shows two pairs of remaining-size-over-time diagrams: one where a job J1 is over-estimated, and one where a job J4 is under-estimated, together with the queued jobs around them.)

- Over-estimating a job affects only that job; other jobs in the queue are not affected
- Under-estimating a job can affect other jobs in the queue

FSP+PS

- In FSP, an under-estimated job can complete in the virtual system before it completes in the real system. We call such jobs late jobs
- When a job is late, it should not prevent other jobs from executing
- FSP+PS solves the problem by scheduling late jobs using processor sharing (see the sketch below)
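
A minimal sketch of the rule, under one reading of the slide (the field names are hypothetical, and this is not the simulator's code): when some jobs are late, they share the capacity in processor-sharing fashion instead of letting a single under-estimated job hold it.

    def fsp_ps_allocation(queue, capacity):
        # queue: list of dicts with 'virtual_remaining' (in the simulated
        # system) and 'real_remaining' (in the real one).
        late = [j for j in queue
                if j['virtual_remaining'] <= 0 and j['real_remaining'] > 0]
        if late:
            # Late jobs were under-estimated: fall back to processor
            # sharing among them so none of them blocks the queue.
            return {id(j): capacity / len(late) for j in late}
        # FSP default: the job with the smallest remaining virtual size
        # gets all the resources.
        best = min(queue, key=lambda j: j['virtual_remaining'])
        return {id(best): capacity}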

OS-Assisted Task Preemption

OS-Assisted Task Preemption

The kill preemption primitive has many drawbacks; can we do better?

- At low level, tasks are processes, and processes can be suspended and resumed by the operating system
- We exploit this mechanism by enabling task suspension and resumption (see the sketch after this list)
- No need to change existing jobs! Everything is done at low level and is transparent to the user
- Bonus: the operating system manages the memory of processes
  - The memory of suspended tasks can be granted to other (running) tasks by the OS...
  - ...and because the OS knows how much memory each process needs, only the memory required will be taken from the suspended task
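
On POSIX systems the primitive can be sketched with the standard job-control signals. This exact use of SIGSTOP/SIGCONT is an illustration of the idea, not necessarily the paper's implementation, and pid is assumed to be the process id of a running task:

    import os
    import signal

    def suspend_task(pid):
        # SIGSTOP cannot be caught or ignored: the task's process is
        # frozen in place, and the OS may page its memory out if other
        # tasks need it.
        os.kill(pid, signal.SIGSTOP)

    def resume_task(pid):
        # SIGCONT wakes the process exactly where it stopped, so no
        # work is lost (unlike the kill primitive).
        os.kill(pid, signal.SIGCONT)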

OS-Assisted Task Preemption: Thrashing

- Thrashing: when data is continuously read from and written to swap space, the machine's performance degrades to the point that the machine doesn't work properly anymore
- Thrashing is caused by a working set (memory) that is larger than the system memory
- In Hadoop this doesn't happen, because:
  - The number of running tasks per machine is limited
  - The heap space per task is limited

OS-Assisted Task Preemption: Experiments

- Test the worst case for suspension, i.e. when the jobs allocate all the memory
- Two jobs, t_h and t_l, each allocating 2 GB of memory

(The slide shows two plots as a function of t_l's progress when t_h is launched: the sojourn time of t_h and the makespan, for the wait, kill and susp strategies.)

- Our primitive outperforms kill and wait
- The swapping overhead doesn't affect the jobs too much

OS-Assisted Task Preemption: Conclusions

- Task suspension/resumption outperforms the current preemption implementations...
- ...but it raises new challenges, e.g. state locality for suspended tasks
- With a good scheduling policy (and eviction policy), OS-assisted preemption can replace the current preemption mechanism
