Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an...

57
Themis: Fair and Efficient GPU Cluster Scheduling Kshiteej Mahajan 1 , Arjun Balasubramanian 1 , Arjun Singhvi 1 , Shivaram Venkataraman 1 , Aditya Akella 1 , Amar Phanishayee 2 , Shuchi Chawla 1 1 2 1

Transcript of Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an...

Page 1: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Fair and Efficient GPU Cluster Scheduling

Kshiteej Mahajan1, Arjun Balasubramanian1, Arjun Singhvi1, Shivaram Venkataraman1, Aditya Akella1, Amar Phanishayee2, Shuchi Chawla1

1 2

1

Page 2: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Deep Learning at a Large Enterprise

Speech, Image, Ads, NLP, Web Search…

2

Page 3: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Deep Learning at a Large Enterprise

Speech, Image, Ads, NLP, Web Search…

Innovate and Train newer DL models on GPUs

3

Page 4: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Deep Learning at a Large Enterprise

• Hyperparameter Optimization is typical –train same DL model (M)with different hyperparameters (H)

• DL App H1 (lr = 1e-3) H2 (lr = 1e-4)

H3 (lr = 5e-4) H4 (lr = 2e-3)

Job 1 Job 2

Job 3 Job 4

DL App, Model M 4

Page 5: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

DL Apps at a Large Enterprise

• Hyperparameter Optimization is typical –train same DL model (M)with different hyperparameters (H)

• DL App H1 (lr = 1e-3) H2 (lr = 1e-4)

H3 (lr = 5e-4) H4 (lr = 2e-3)

Job 1 Job 2

Job 3 Job 4

DL App, Model M 5

Page 6: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Deep Learning at a Large Enterprise

Independent GPU Instances

6

Page 7: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Goal

GPU Cluster Scheduler

Multiplex access to shared GPU cluster for contending DL apps

Shared GPU cluster

7

Page 8: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Goal

GPU Cluster Scheduler

Multiplex access to shared GPU cluster for contending DL apps

Shared GPU cluster

Desired Goal –Incentivize sharing of GPU resources

8

Page 9: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Goal

GPU Cluster Scheduler

Multiplex access to shared GPU cluster for contending DL apps

Shared GPU cluster

Primary Goal – Sharing Incentive (SI)

If N DL apps are sharing a cluster then no DL app should run slower than on a private cluster with 1/N resources.

9

Page 10: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Overview• Existing GPU Cluster Schedulers

• Do not give Sharing Incentive • DL App Properties• Drawbacks• Requirements

• Themis • Design• Implementation• Evaluation

10

Page 11: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Existing GPU Cluster Schedulers

GPU Cluster Scheduler

Shared GPU cluster

• DRF• Tiresias (LAS)• SLAQ• Optimus• Gandiva

11

Page 12: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Drawback 1

Metric = App Resource

Share

Mechanism =Allocate on Task Completion to

Min Metric Shared GPU cluster

• DRF

Interface = Task Resource Demand

12

Page 13: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Drawback 1

Metric = Resource Share

Mechanism =on task completion, schedule task from

app with min Resource Share

Shared GPU cluster• DRF

Interface = Task Resource Demand

• Assume Short Tasks for Sharing Incentive

• Short Tasks allow for frequent multiplexing

13

Page 14: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Interface = Task Resource Demand

GPU Cluster Scheduler: Drawback 1

Metric = Resource Share

Mechanism =on task completion, schedule task from

app with min Resource Share

Shared GPU cluster

• Assume Short Tasks for Sharing Incentive

• Short Tasks allow for frequent multiplexing

• DRF

• ML median task duration –3.75 hours

• Lot of apps with 5X shorter and 5X longer tasks

14

Page 15: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Interface = Task Resource Demand

GPU Cluster Scheduler: Drawback 1

Metric = Resource Share

Mechanism =on task completion, schedule task from

app with min Resource Share

Shared GPU cluster

• Assume Short Tasks for Sharing Incentive

• Short Tasks allow for frequent multiplexing

• DRF

• ML median task duration –3.75 hours

• Lot of apps with 5X shorter and 5X longer tasks

• Long waiting time for Short Apps

• No SI for Short Apps

SHORT RESOURCE INTENSIVE LONGTASK

long waiting time

15

Page 16: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Interface = Task Resource Demand

GPU Cluster Scheduler: Requirement 1

Metric = Resource Share

Mechanism =on task completion, schedule task from

app with min Resource Share

Shared GPU cluster

• Assume Short Tasks for Sharing Incentive

• Short Tasks allow for frequent multiplexing

• DRF

• ML median task duration –3.75 hours

• Lot of apps with 5X shorter and 5X longer tasks

• No SI for Short Apps

SHORT BREAK UP LONG TASK

• SI for ML Apps => Preemption is necessary

• Allocate any task for at most lease duration

schedule short task on preemption of

long task

16

Page 17: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Drawback 2

Metric = App Attained

Service (= #GPUs * time)

Mechanism =Allocate for lease duration to Min

MetricShared GPU cluster

• Tiresias (LAS)

Interface = Task Resource Demand

17

Page 18: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Drawback 2

Metric = App Attained

Service (= GPU time)

Mechanism =Allocate for lease duration to Min

MetricShared GPU cluster• Tiresias (LAS)

Interface = Task Resource Demand

• DL apps have a placement preference

• E.g.: VGG model family prefers dense placement

18

Page 19: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Drawback 2

Metric = App Attained

Service (= GPU time)

Mechanism =Allocate for lease duration to Min

MetricShared GPU cluster• Tiresias (LAS)

Interface = Task Resource Demand

• DL apps have a placement preference

• E.g.: VGG model family prefers dense placement

Server 1 Server 2• Attained Service is equal in both placements(= 4 GPUs * time)

• Both placements are equivalent

• Poor placement => slower execution time

• VGG app would rather prefer its own server

• No SI

Server 1 Server 2

VGG App

VGG App

Placement 1

Placement 2

19

Page 20: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Drawback 2

Metric = App Attained

Service (= GPU time)

Mechanism =Allocate for lease duration to Min

MetricShared GPU cluster• Tiresias (LAS)

Interface = Task Resource Demand

• DL apps have a placement preference

• E.g.: VGG model family prefers dense placement

• Attained Service is equal in both placements(= 4 GPUs * time)

• Both placements are equivalent

• SI violation with placement 1

Server 1 Server 2

Server 1 Server 2

VGG App

VGG App

Placement 1

Placement 2

• Binary Placement Enforcement –Strict Consolidation (wait for Placement 2)

• Partial Progress can be made with Placement 1

• Long wait time without progress

• SI is violated

20

Page 21: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

GPU Cluster Scheduler: Requirement 2

Metric = App Attained

Service (= GPU time)

Mechanism =Allocate for lease duration to Min

MetricShared GPU cluster• Tiresias (LAS)

Interface = Task Resource Demand

• DL apps have a placement preference

• E.g.: VGG model family prefers dense placement

• Attained Service is equal in both placements(= 4 GPUs * time)

• Both placements are equivalent

• SI violation with placement 1

Server 1 Server 2

Server 1 Server 2

VGG App

VGG App

Placement 1

Placement 2

• Binary Placement Enforcement –Strict Consolidation (only allow Placement 2)

• High wait time• SI is violated• SI for ML Apps =>

Fine-grained Placement Preference

21

Page 22: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Overview• Existing GPU Cluster Schedulers

• Do not give Sharing Incentive • DL App Properties• Drawbacks• Requirements

• Themis • Design• Implementation• Evaluation

22

Page 23: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Towards a new GPU Cluster Scheduler

Metric = ?

Mechanism =?

Shared GPU clusterThemis

Interface = ?

23

Page 24: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Metric

Metric = ?

Mechanism =?

Shared GPU clusterThemis

Interface = ?

24

Page 25: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Metric

GPU Cluster Scheduler

Shared GPU cluster

Green App’sShared Finish

Time Tsh

25

Page 26: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Metric

GPU Cluster Scheduler

Shared GPU cluster

Independent GPU instances

Green App’sIndependent Finish Time

Tid

26

Page 27: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Metric

GPU Cluster Scheduler

Shared GPU cluster

Independent GPU instances

Green App’sIndependent Running Time

Tid

Green App’sShared Running

Time Tsh

Primary Goal – Sharing Incentive (SI)

Tsh <= Tid

>=

27

Page 28: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Finish-Time Fairness Metric

• ⍴ = Tsh / Tid• Tsh: finish-time of app in shared cluster • Tid: finish-time of app in exclusive 1/N share of cluster• N: Average contention during app lifetime

28

Page 29: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Finish-Time Fairness Metric

• ⍴ = Tsh / Tid• Tsh: finish-time of app in shared cluster • Tid: finish-time of app in exclusive 1/N share of cluster• N: Average contention during app lifetime

• SI: for all apps, ⍴ <= 1

29

Page 30: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Finish-Time Fairness Metric

• ⍴ = Tsh / Tid• Tsh: finish-time of app in shared cluster • Tid: finish-time of app in exclusive 1/N share of cluster• N: Average contention during app lifetime

• SI: for all apps, ⍴ <= 1• Fine-Grained Placement Preferences –

• Excessive queueing or bad placements worsens Tsh and hence ⍴

30

Page 31: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Metric

Metric = finish-time fairness

(⍴)

Mechanism =?

Shared GPU clusterThemis

Interface = ?

31

Page 32: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

Metric = finish-time fairness

(⍴)

Mechanism =?

Shared GPU clusterThemis

Interface = ?

32

Page 33: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

• Key Purpose: Enable book-keeping of ⍴

• Who calculates ⍴ – the app or the scheduler?

33

Page 34: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

• DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier

34

Page 35: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

• DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier

• Launch several DL jobs with different Hyperparameters Hi

H1 H2 HN…

35

Page 36: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

• DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier

• Launch several DL jobs with different Hyperparameters Hi

• GPU allocation, Gi, within jobs decided by Hyperparam-Opt

H1 H2 HN…G1 G2 GNGapp

Hyperparam-OptAllocation Logic

36

Page 37: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

• DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier

• Launch several DL jobs with different Hyperparameters Hi

• GPU allocation, Gi, within jobs decided by Hyperparam-Opt

• Track training accuracy of each job and classify jobs as poor, ok and good

• Estimate finish time of each job

H1 H2 HN…G1 G2 GNGvector

Hyperparam-OptAllocation Logic

Hyperparam-OptTermination Logic

terminate at time1

terminate at timeN

trained model at time2

poor okgood

37

Page 38: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Interface

• DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier

• Launch several DL jobs with different Hyperparameters Hi

• GPU allocation, Gi, within jobs decided by Hyperparam-Opt

• Terminate poor model training instances until one best trained model

• Wleft = ∑ Gi ∗ Hi�( is best calculated by

the Hyperparam-Opt

H1 H2 HN…G1 G2 GNGvector

Hyperparam-OptAllocation Logic

Hyperparam-OptTermination Logic

terminate at time1

terminate at timeN

trained model at time2

• App’s Hyperparam-Opt tracks per-job progress

• App does calculation of ⍴

• Scheduler pulls updated values of ⍴ from the Agent co-located with App’s Hyperparam-Opt

• Details in the paper38

Page 39: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Towards a new GPU Cluster Scheduler

Metric = finish-time fairness

(⍴)

Mechanism =?

Shared GPU clusterThemis

Interface = Get ⍴estimates via Agent

39

Page 40: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Towards a new GPU Cluster Scheduler

Metric = finish-time fairness

(⍴)

Mechanism =?

Shared GPU clusterThemis

Interface = Get ⍴estimates via Agent

40

Page 41: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Mechanism

• Key Goal: Sharing Incentive

• SI: for all apps, ⍴ <= 1

• Difficult to guarantee with online arrivals

• Our focus: min (max ⍴ ): empirically keeps ⍴’s ≈ 1 without admission control

41

Page 42: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Server 1 Server 2Strawman MechanismSI Objective – min (max ⍴)

Red GPUs become available

42

Page 43: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Strawman MechanismServer 1 Server 2

SI Objective – min (max ⍴)

Interface: Get ⍴ estimates from all apps Red GPUs become available

43

Page 44: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Server 1 Server 2Strawman Mechanism

SI Objective – min (max ⍴)

⍴1 ⍴2> > ⍴3 > ⍴N

Interface: Get ⍴ estimates from all apps Red GPUs become available

Sort in decreasing order of ⍴Allocate to app with highest ⍴ (green app) for lease duration

44

Page 45: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Server 1 Server 2Strawman Mechanism: Issues

SI Objective – min (max ⍴)

⍴1 ⍴2> > ⍴3 > ⍴N

Interface: Get ⍴ estimates from all apps Red GPUs become available

Sort in decreasing order of ⍴Allocate to app with highest ⍴ (green app) for lease duration

1. Inefficient Allocation – Red GPUs are not co-located with Green Apps GPUs

2. Lying Apps – Apps can lie with high ⍴ values to hoard GPU resources

45

Page 46: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Server 1 Server 2Themis: Mechanism

SI Objective – min (max ⍴)

⍴1 ⍴2> > ⍴3 > ⍴N

Interface: Get ⍴ estimates from all apps Red GPUs become available

1. Filter 1 – f apps with max ⍴ values2. Allocate to one or more of 1 – f apps for lease duration

using Partial Allocation Auctions

1 – f f

46

Page 47: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Server 1 Server 2Themis: Mechanism

SI Objective – min (max ⍴)

⍴1 ⍴2> > ⍴3 > ⍴N

Interface: Get ⍴ estimates from all apps Red GPUs become available

1. Filter 1 – f apps with max ⍴ values2. Allocate to one or more of 1 – f apps for lease duration

(Red GPUs can potentially go to Blue App) using Auctions

1 – f f

1. Tradeoff SI for Efficiency – f → 0 => More apps to allocate resources => Better opportunity to match placement preference of apps to resources. Our sensitivity analysis suggests f = 0.8 gives a good tradeoff.

2. Partial Allocation Auction within 1– f apps – Incentivizes truth telling of ⍴

47

Page 48: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Mechanism: Partial Allocation Auction

⍴1 ⍴2>

1 – fOffer Bids

Resource Alloc 𝜌new

+ 0 𝜌old

+1 Red GPU+2 Red GPUs 𝜌old-2𝜹

𝜌old-𝜹

Server 1 Server 2

Red GPUs become available

48

Page 49: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Mechanism: Partial Allocation Auction

⍴1 ⍴2>

1 – fOffer Bids

Resource Alloc 𝜌new

+ 0 𝜌old

+ 2 Red GPUs+1 Red GPU 𝜌old-𝜹

𝜌old-2𝜹

• Widen Interface to make resource offer and get bids

• Bid from each app is a valuation table

Server 1 Server 2

Red GPUs become available

• Input: Valuation Tables from filtered apps

• Pareto efficiency (PE) – max ∏ 1/⍴𝑖, 𝑛𝑒𝑤 �( – proportional fair allocations

• Strategy Proofness (SP) – Allocate a fraction of this per app for lease duration – rest is “hidden payment”

• More lying => higher hidden payments => incentivizes truth-telling

• Leftover Allocation – Allocate hidden payments to unfiltered apps at random to avoid unallocated resources and enable work conservation

49

Page 50: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Overall Design

Hyperparam-opt at the top

Agent: Shim layer added to Apps to enable interaction (estimate 𝜌, make bids) with Scheduler

Scheduler: Finish-Time Fair Metric (𝜌);Mechanism: SI + PE + SP THEMIS SCHEDULER

Ask all apps for 𝜌estimates

Offersto subset of apps

Make Bids

ReceiveAllocation

Give 𝜌Estimates

AGENT

App1 App2 AppN. . . . .

Appm

1

2

3

4

5

Res. Alloc. 𝜌new

0 𝜌old

M1:2GM1:1G,M2:1G 𝜌old - 𝜹

𝜌old - 2𝜹

Apps make bids Allocate Winning

Bids

50

Page 51: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Implementation

Hyperparam-opt at the top

Agent: Shim layer added to Apps to enable interaction with Scheduler

Scheduler: Finish-Time Fair Metric;

Mechanism: SI + PE + SP

THEMIS SCHEDULER

Ask all apps for 𝜌estimates

Offersto subset of apps

Make Bids

ReceiveAllocation

Give 𝜌Estimates

AGENT

App1 App2 AppN. . . . .

Appm

1

2

3

4

5

Res. Alloc. 𝜌new

0 𝜌old

M1:2GM1:1G,M2:1G 𝜌old - 𝜹

𝜌old - 2𝜹

Apps make bids Allocate Winning

BidsSubmarine AM

Hadoop 3.2.0 YARN RM

51

Page 52: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Themis: Evaluation

• 20 machine, 64 GPU cluster• 8 instances each with 2 Tesla K80 GPUs and • 12 instances each with 4 Tesla K80 GPUs

• A publicly available trace of DL apps from Microsoft • Baselines:

• Tiresias – Least Attained Service Job First • Optimus – Best Throughput Scaling First• Gandiva – Best Packing Job First• SLAQ – Best Loss Gradient Job First

52

Page 53: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Macrobenchmark: Sharing Incentive

• CDF of ⍴ for all apps in the workload

• max ⍴ = 1.2 (~1) with Themis

• ⍴distribution has long tail without Themis

No Admission Control max 𝜌 = 1.2

Placement Insensitivity hurts!

53

Page 54: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Macrobenchmark: Efficiency

• GPU Time to execute workload

• Themis better than Gandiva

• Auctions enable globally optimal packing

Greedy-local vs.globally-optimal packing

54

Page 55: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Sensitivity Analysis/Tradeoffs

• max finish-time fair metric (⍴) and GPU time for different values of fairness knob (f)

55

Page 56: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Sensitivity Analysis/Tradeoffs

• max finish-time fair metric (⍴) and GPU time for different values of fairness knob (f)

• f = 0.8 maximizes sharing incentive without degrading efficiency

Sub-optimalplacement

More choice;pack better

56

Page 57: Themis: Fair and Efficient GPU Cluster Scheduling · Themis: Interface • DL App = Managed by an Hyperparameter Optimizer (Hyperparam-Opt) like Google Vizier • Launch several DL

Conclusion

• Consolidation of GPUs => Sharing Incentive is key

• DL App properties => existing schedulers violate SI

• Themis proposes a new metric finish-time fairness that captures SI

• Filtering + Partial Allocation Auctions => Themis performs better than existing schedulers on SI and Efficiency

57