Large-scale cluster management at Google with Borg

Guillaume Urvoy-Keller

January 7, 2018

Source documents

Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes: Large-scale cluster management at Google with Borg. EuroSys 2015. Available at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf

Borg

“Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications”

Figure: source: Borg paper

Borg: the user perspective

Users = developers

Users submit jobs

Job = a number of tasks all executing the same program (binary)

Machine sharing at process level with performance isolation ⇒ containers

Jobs run in cells; a cell = a set of machines managed as a unit

Users interact with jobs mostly from a command-line tool that issues RPCs to Borg
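
The job/task/cell vocabulary above can be pictured as a small data structure. This is only a sketch: the class names, fields, and the binary path are hypothetical and are not Borg's actual job description format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    job: str      # name of the job this task belongs to
    index: int    # position of the task inside the job
    binary: str   # every task of a job runs the same program


@dataclass
class Job:
    name: str
    user: str
    cell: str        # a job runs inside exactly one cell
    binary: str
    task_count: int

    def tasks(self):
        """Expand the job into its identical tasks."""
        return [Task(self.name, i, self.binary) for i in range(self.task_count)]
```

For instance, `Job("foo", "joe", "cc", "/bin/server", 3).tasks()` yields three tasks that differ only by their index.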

Workload

Two types of workloads co-exist:

Jobs handling interactive traffic (e.g., Gmail) ⇒ long-running, delay-sensitive jobs

Batch jobs (e.g., MapReduce): not delay-sensitive, seconds to days of processing

The two workloads co-exist in a cell (ratio varies from one cell to the other)

Resource Allocation

Alloc: a reserved set of resources on one machine (CPU, RAM, I/O)

Alloc set: a group of allocs spread over a set of machines

Equivalent of cgroups at kernel level, but at a larger scale

An alloc can be retained when a job is stopped or restarted, or kept for future tasks (unlike cgroups, which apply to specific processes)

Quotas and priorities decide what runs when more work shows up than the cell can accommodate
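
As a rough illustration of the quota part, the check below admits a job only if every requested resource fits in the user's remaining quota. It is a simplified sketch: the function name and the dictionary layout are assumptions, and real quotas also depend on priority and time period.

```python
def admit(request, quota_remaining):
    """Return True if the request fits in the remaining quota.

    Both arguments map a resource name (e.g. "cpu", "ram_gib") to an amount.
    A resource missing from the quota is treated as a quota of zero.
    """
    return all(quota_remaining.get(res, 0) >= amount
               for res, amount in request.items())
```

A request of `{"cpu": 4, "ram_gib": 16}` against a remaining quota of `{"cpu": 8, "ram_gib": 8}` is rejected because the RAM request exceeds the quota.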

Naming and monitoring

BNS: Borg Name Service

One BNS name per task

Borg writes the task’s hostname and port into a consistent, highly-available file in Chubby with this name, which is used by the RPC system to find the task endpoint.

Chubby’s client interface is similar to a simple file system that performs whole-file reads and writes ⇒ helps developers deal with coarse-grained synchronization within their systems.

Google File System and Bigtable use it as the root of their distributed data structures.

Burrows, Mike. "The Chubby lock service for loosely-coupled distributed systems." Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX Association, 2006.

The BNS name is the basis of the task’s DNS name

Ex: the 1st task of job foo owned by user joe will be reachable via 1.foo.joe.cc.borg.google.com in cell cc

Borg writes job size and task health information into Chubby ⇒ load balancers can see where to route requests
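
The naming scheme in the example above is a plain concatenation of task index, job, user, and cell. A minimal sketch (the function name is hypothetical; only the resulting name format comes from the slide):

```python
def bns_dns_name(task_index, job, user, cell):
    """Build the DNS name of a task from its BNS components."""
    return f"{task_index}.{job}.{user}.{cell}.borg.google.com"
```

Calling `bns_dns_name(1, "foo", "joe", "cc")` gives the name used in the example.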

Almost every task contains a built-in HTTP server that publishes information about task health (e.g., RPC latencies).

Borg monitors the health-check URL and restarts tasks that do not respond promptly or that return an HTTP error code

Web interface for users ⇒ see their tasks with detailed logs
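
The restart decision can be sketched as a filter over the latest probe results. This is an illustration only: the function name and the convention of `None` for a timed-out probe are assumptions, and no real HTTP request is made here.

```python
def tasks_to_restart(probe_results):
    """Pick the tasks whose health check failed.

    probe_results maps a task name to the HTTP status code of its
    health-check URL, or to None when the probe timed out.
    """
    return sorted(task for task, status in probe_results.items()
                  if status is None or status >= 400)
```

A task answering 200 is left alone, while a task answering 500 or not answering at all is selected for restart.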

Borg Internal Architecture

In each cell

Borgmaster: logically centralized controller

Borglet: agent in each machine

Borgmaster

Handles RPCs from clients (e.g., create job, look up job)

Stores state for the objects in the system: machines, tasks, allocs

Communicates with Borglets

Logically a single process, but actually 5 replicas

Each replica holds an in-memory copy of the state of the cell

Use of Paxos (consensus in distributed systems) to maintain consistent state and elect a master

State is also copied to local disk ⇒ checkpoints for recovery and debugging

About 10 s to elect a new master, but it can take up to a minute in a big cell because in-memory state has to be reconstructed

Borgmaster features one scheduler

Submitted jobs are recorded in the Paxos store and their tasks are added to the pending queue

The queue is scanned asynchronously by the scheduler

The scheduler works at task level, not job level.

The scan goes from high to low priority (prod/non-prod, etc.), with round-robin inside a priority level
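
The scan order described above can be sketched as follows. This is an illustrative toy, not Borg's code: the tuple layout and the choice of rotating between users inside a priority level are assumptions made for the example.

```python
from collections import defaultdict, deque


def scan_order(pending):
    """Order pending tasks: highest priority first, round-robin within a level.

    pending is a list of (priority, user, task_name) tuples.
    """
    by_priority = defaultdict(lambda: defaultdict(deque))
    for priority, user, task in pending:
        by_priority[priority][user].append(task)

    order = []
    for priority in sorted(by_priority, reverse=True):
        # One queue per user; take one task from each user in turn.
        queues = deque(by_priority[priority].values())
        while queues:
            queue = queues.popleft()
            order.append(queue.popleft())
            if queue:
                queues.append(queue)
    return order
```

Rotating between users keeps one large job from blocking every other task at the same priority.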

Borgmaster Scheduler

Two phases:

Feasibility checking: find the candidate machines on which the task could run

Scoring: pick one of the machines that passed the feasibility test

The scoring tries to pick a machine that already has a copy of the task’s packages

It minimizes the number of preemptions if the cell is loaded and the task has high priority
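
The two phases can be sketched as a filter followed by a ranking. This is a minimal illustration under strong assumptions: only CPU and RAM are checked, the score counts packages already present on the machine, and preemption is ignored. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_ram_gib: float
    packages: set = field(default_factory=set)


@dataclass
class Task:
    cpu: float
    ram_gib: float
    packages: set = field(default_factory=set)


def feasible(machine, task):
    """Phase 1: does the machine have enough free resources?"""
    return machine.free_cpu >= task.cpu and machine.free_ram_gib >= task.ram_gib


def score(machine, task):
    """Phase 2: prefer machines that already hold the task's packages."""
    return len(task.packages & machine.packages)


def schedule(task, machines):
    candidates = [m for m in machines if feasible(m, task)]
    if not candidates:
        return None  # the task stays pending
    return max(candidates, key=lambda m: score(m, task))
```

With two machines of equal free capacity, the one that already has the package installed wins, which shortens task start-up.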

Task Scheduling

Two extremes:

Spread load across machines (worst fit): increases fragmentation

Fill machines as tightly as possible (best fit):

Placing large tasks is easier

Problem for applications with bursty load

Not good for batch jobs that specify a small initial CPU need but expect to benefit from unused resources

Google's scheduler is (obviously) a hybrid
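
The two extremes correspond to two opposite ways of ranking the feasible machines. The sketch below considers CPU only; the function name and the policy labels "spread" and "pack" are made up for the example.

```python
def pick_machine(free_cpu, demand, policy):
    """Choose a machine for a task needing `demand` cores.

    free_cpu maps a machine name to its free cores.
    "spread" picks the emptiest feasible machine (worst fit),
    "pack" picks the fullest feasible machine (best fit).
    """
    feasible = {m: free for m, free in free_cpu.items() if free >= demand}
    if not feasible:
        return None
    if policy == "spread":
        return max(feasible, key=feasible.get)
    if policy == "pack":
        return min(feasible, key=feasible.get)
    raise ValueError(f"unknown policy: {policy}")
```

With free cores `{"m1": 8, "m2": 3, "m3": 1}` and a 2-core task, "spread" chooses m1 and leaves no machine with a large hole for a big task, while "pack" chooses m2 and keeps m1 available.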

Scheduling: preemption and task start-up

The scheduler might preempt some tasks when starting a new one ⇒ preempted tasks go back to the scheduler’s pending queue

Task start-up latency: median of about 25 s, but highly variable (over all cells)

Package installation takes about 80% of that time!

Packages can be distributed using BitTorrent-like protocols (i.e., in parallel)

Borglet

Local agent running on every machine

Starts and stops tasks

Manages local resources ⇒ manipulates local OS settings

...because it is a bare-metal approach

The Borgmaster polls each Borglet every few seconds

A push approach can be dangerous (storm of messages after an outage)

The polling work is shared between the Borgmaster replicas; each replica reports to the master only the diff from the previous report
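
Reporting only the diff can be sketched as a comparison of two successive full reports. The report format (task name mapped to a status string) and the function name are assumptions made for this illustration.

```python
def state_diff(previous, current):
    """Return only what changed between two full Borglet reports.

    Each report maps a task name to its status.
    """
    changed = {task: status for task, status in current.items()
               if previous.get(task) != status}
    removed = sorted(task for task in previous if task not in current)
    return {"changed": changed, "removed": removed}
```

If nothing happened on the machine between two polls, the diff is empty and the master has nothing to process for that Borglet.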

Borglets

What if a Borglet is unable to communicate with the Borgmaster?

The Borglet continues its work

The Borgmaster reschedules the tasks on other machines... and has the original copies killed if the Borglet reappears (to avoid duplicates)

This model assumes that transient communication issues dominate

Scalability

A single Borgmaster can manage a cell with several thousand machines

Several cells have arrival rates > 10,000 tasks/minute

A busy Borgmaster uses 10–14 CPU cores and up to 50 GiB of RAM

Use of threading (parallelization) and sharing of functions between Borgmaster replicas:

99th-percentile response time of the UI < 1 s

95th-percentile Borglet polling interval < 10 s

Machine scores (used to decide where to schedule) are cached and lazily updated

The scheduler does not examine every machine: it examines machines in random order until it has found enough feasible ones to score

What they claim:

A few hundred seconds to schedule an entire cell's workload from scratch with these optimizations, against more than 3 days without them
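
The random-subset idea can be sketched as follows: shuffle the machines, stop as soon as enough feasible ones have been found, and pick the best of that small set. The function name, the `enough` threshold, and the callback signatures are assumptions made for this example; score caching is left out.

```python
import random


def pick_with_sampling(machines, is_feasible, score, enough=3, rng=None):
    """Score only the first `enough` feasible machines met in random order."""
    rng = rng or random.Random()
    order = list(machines)
    rng.shuffle(order)

    candidates = []
    for machine in order:
        if is_feasible(machine):
            candidates.append(machine)
            if len(candidates) >= enough:
                break  # good enough: do not look at the rest of the cell
    return max(candidates, key=score) if candidates else None
```

The result is not necessarily the best machine of the whole cell, but the cost of one scheduling decision no longer grows with the number of machines once feasible ones are common.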

Availability

Failures are the norm in large systems

Applications must be able to adapt

From the application's point of view, a failure may simply be a rescheduling, especially for non-prod tasks

The scheduler tries to avoid correlated failures ⇒ tasks of a job are placed on different racks and different power domains

The authors report that the Borgmaster achieves 99.99% availability in practice
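
To get a feeling for the 99.99% figure, it can be converted into an allowed downtime per year. This is a back-of-the-envelope sketch; the function name and the 365-day year are assumptions.

```python
def downtime_minutes_per_year(availability, days_per_year=365):
    """Convert an availability ratio into minutes of downtime per year."""
    return (1.0 - availability) * days_per_year * 24 * 60
```

An availability of 0.9999 leaves roughly 52.6 minutes of downtime per year.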

Utilization

First problem: finding a metric to characterize utilization

Average utilization?

Need to account for load spikes

Need to account for batch jobs

Cell compaction

1 Take a workload (real, not synthetic → benefit from the Borgmaster logs)

2 Take a given cell and try to fit the workload into it with the scheduling algorithm. Repeat several times. Why? Because the algorithm is non-deterministic

3 Repeat with a smaller cell until the workload no longer fits

A nice metric, but details are missing in the paper, e.g., they consider only a static case, not a dynamic one (what about batch jobs? only their initial resource claim is taken)
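
The three compaction steps can be sketched with a toy packer. Everything here is a simplifying assumption: machines are identical, tasks need a single resource, the packer is a randomized first-fit rather than Borg's scheduler, and the `trials` parameter stands for the repeated attempts of step 2.

```python
import random


def fits(tasks, n_machines, capacity, rng):
    """One randomized first-fit attempt to place all tasks."""
    free = [capacity] * n_machines
    order = list(tasks)
    rng.shuffle(order)  # source of non-determinism, hence the repeats
    for demand in order:
        for i in range(n_machines):
            if free[i] >= demand:
                free[i] -= demand
                break
        else:
            return False  # no machine could host this task
    return True


def compacted_size(tasks, n_machines, capacity, trials=5, seed=0):
    """Smallest number of machines on which the workload still fits."""
    rng = random.Random(seed)

    def fits_once(size):
        return any(fits(tasks, size, capacity, rng) for _ in range(trials))

    if not fits_once(n_machines):
        return None  # the workload does not even fit in the full cell
    size = n_machines
    while size > 0 and fits_once(size - 1):
        size -= 1  # step 3: shrink the cell and try again
    return size
```

For four tasks of 4 cores on a cell of ten 8-core machines, the compacted cell has 2 machines, so 8 machines are headroom.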

Cell compaction makes it possible to assess how much headroom is left in a cell

Cell sharing

Initial design choice

Mixing batch and latency-sensitive workloads

How efficient is it? Segregating prod and non-prod work would require 20-30% more machines

Intuitive explanation:

Latency-sensitive (prod) jobs reserve additional resources to handle load spikes

Batch jobs benefit from these reserved resources while they are unused

More subtle question: do non-prod tasks steal resources from prod tasks? ⇒ use of the CPI (cycles per instruction) metric

Heavily shared vs. lightly shared cells

Mean CPI of 1.58 (σ = 0.35) in shared cells against 1.53 (σ = 0.32) in dedicated cells ⇒ CPU performance is about 3% worse in shared cells
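
The "about 3%" figure follows from the relative increase in CPI, since more cycles per instruction means slower execution of the same instructions. A small sketch of the arithmetic (the function name is hypothetical):

```python
def relative_slowdown(cpi_shared, cpi_dedicated):
    """Relative increase in cycles per instruction in shared cells."""
    return (cpi_shared - cpi_dedicated) / cpi_dedicated
```

With the values above, (1.58 - 1.53) / 1.53 is a little under 3.3%.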

Kubernetes (Successor of Borg)

The bad (what needs to be changed)

Jobs as the only grouping mechanism are not flexible enough. Kubernetes instead organizes its scheduling units (pods) using labels (key/value pairs)

Labels allow different groupings: service, tier, or release type (e.g., production, staging, test)

One IP address per machine complicates things:

Contention on port numbers

Applications must adapt to whichever ports are available at deployment

From Borg to Kubernetes

The good:

Allocs are good (the Kubernetes equivalent is the pod)

Kubernetes supports naming and load balancing using a service abstraction

service = name + dynamic set of pods defined by a label selector

Kubernetes automatically load-balances connections to the service among the pods that match the label selector

Giving users access to a large amount of information at runtime. Why? Because they will help each other... and not overload the help desk ;-)
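
The service abstraction in the list above boils down to matching labels and spreading connections over the matches. The sketch below is a toy model, not the Kubernetes API: the function names are hypothetical, only equality-based selectors are handled, and round-robin is an assumed balancing policy.

```python
import itertools


def select_pods(pods, selector):
    """Return the names of the pods whose labels match every selector entry.

    pods maps a pod name to its labels (a dict of key/value pairs).
    """
    return sorted(name for name, labels in pods.items()
                  if all(labels.get(key) == value
                         for key, value in selector.items()))


def round_robin(backends):
    """Endless iterator that hands out backends in turn."""
    return itertools.cycle(backends)
```

Because the set of pods is recomputed from labels, a pod joins or leaves a service simply by having its labels changed.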
