Large-scale cluster management at Google with Borg
Guillaume Urvoy-Keller
January 7, 2018
Source documents
Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes: Large-scale cluster management at Google with Borg. EuroSys 2015. Available at https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
Borg
“Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications”
Figure: source: Borg paper
Borg
Users = developers
Users submit jobs
Jobs = a number of tasks, all executing the same program (binary)
Machine sharing at process level with performance isolation ⇒ containers
Jobs executed in cells = set of machines managed as a whole
Interaction between users and Borg is done mostly via command-line tools (RPC)
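For context, a job submission in the Borg paper is written in a declarative configuration language; the sketch below loosely adapts the paper's hello_world example (field names and syntax are approximate, reproduced from memory, not authoritative):

```
job hello_world = {
  runtime = { cell = 'ic' }            // which cell to run in
  binary = '../hello_world_webserver'  // the program every task executes
  args = { port = '%port%' }           // Borg substitutes the assigned port
  requirements = { ram = 100M disk = 100M cpu = 0.1 }
  replicas = 10000                     // number of tasks in the job
}
```

All 10,000 tasks run the same binary; only the task index (and the resources Borg assigns, like the port) differ.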
Workload
Two types of workloads co-exist:
jobs handling interactive traffic (e.g., Gmail) ⇒ long-running and delay-sensitive jobs
batch jobs (e.g., MapReduce)
not delay sensitive
seconds to days of processing
The two workloads co-exist in a cell (ratio varies from one cell to the other)
Resource Allocation
Alloc: a reserved set of resources on a machine (CPU, RAM, I/O)
Alloc set: allocs on a set of machines
Similar in spirit to kernel-level cgroups, but at a higher scale
Allows retaining an alloc when stopping/restarting a job, or reserving it for a future task (unlike cgroups, which operate on a specific process id)
Quotas and priorities arbitrate when more work shows up than can be admitted
Naming and monitoring
BNS: Borg Name Service
One BNS name per task. Borg writes the task’s hostname and port into a consistent, highly-available file in Chubby under this name, which is used by Google’s RPC system to find the task endpoint.
“Chubby’s client interface is similar to a simple file system that performs whole-file reads and writes” ⇒ helps developers deal with coarse-grained synchronization within their systems. Google File System and Bigtable use it as the root of their distributed file systems.
Burrows, Mike. “The Chubby lock service for loosely-coupled distributed systems.” Proceedings of the 7th Symposium on Operating Systems Design and Implementation. USENIX Association, 2006.
Naming and monitoring
The BNS name is the basis of a task’s DNS name. Ex: the 1st task of job foo owned by user joe in cell cc is reachable via 1.foo.joe.cc.borg.google.com
Borg also writes job size and task health information into Chubby ⇒ load balancers can see where to route requests
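As an illustration (not Borg code), composing a BNS-style DNS name from the task index, job, user, and cell could be sketched as:

```python
def bns_dns_name(task_index: int, job: str, user: str, cell: str) -> str:
    """Compose a BNS-style DNS name: <index>.<job>.<user>.<cell>.borg.google.com."""
    return f"{task_index}.{job}.{user}.{cell}.borg.google.com"

# The slide's example: task 1 of job "foo" owned by user "joe" in cell "cc"
print(bns_dns_name(1, "foo", "joe", "cc"))  # 1.foo.joe.cc.borg.google.com
```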
Naming and Monitoring
Every task contains a built-in HTTP server that publishes information about task health and performance metrics (e.g., RPC latencies).
Borg monitors the health-check URL and restarts unresponsive tasks
Web interface for users ⇒ see their tasks with detailed logs
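A minimal sketch of the health-server idea: a task exposing a health URL over HTTP. The endpoint path /healthz and the reported fields are assumptions for illustration, not taken from the paper.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json
import threading

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Publish task health plus an example metric (fields are made up)
            body = json.dumps({"status": "ok", "rpc_latency_ms_p99": 12}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# Port 0 lets the OS pick a free port; serve in a background thread
server = HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

A monitor (Borg, in the paper) would periodically fetch this URL and restart the task if it stops answering.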
Borg Internal Architecture
In each cell
Borgmaster: logically centralized controller
Borglet: agent in each machine
Borgmaster
Handles RPCs from clients: create a job, look up a job’s state, etc.
Stores the state of machines, tasks, allocs, etc.
Communicates with Borglets
Borgmaster
Logically a single process, but actually five replicas
Each replica holds an in-memory copy of the cell’s state. Paxos (distributed consensus) is used to:
maintain a consistent state
elect the master
The state is also copied to local disk ⇒ checkpoints for recovery and debugging
~10 s to elect a new master, but somewhat longer to reconstruct the state
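The choice of five replicas follows standard majority-quorum arithmetic: a Paxos group of n replicas needs a majority to make progress, so it tolerates ⌊(n−1)/2⌋ failures. A tiny sketch:

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority quorum."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many replicas can fail while a majority can still be formed."""
    return (n - 1) // 2

# With 5 Borgmaster replicas: quorum of 3, up to 2 simultaneous failures tolerated
print(majority(5), tolerated_failures(5))  # 3 2
```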
Borgmaster
Borgmaster features one scheduler
Submitted jobs recorded in Paxos store ⇒ pending queue
Scanned asynchronously by Scheduler
Scheduler works at task, not job level.
Relies on priority levels (prod / non-prod, etc.) + round-robin within each level
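A toy sketch of that policy (priority buckets with FIFO rotation inside each bucket, a simplification of the paper's round-robin scheme; not Borg's actual data structures):

```python
from collections import defaultdict, deque

class PendingQueue:
    """Pending tasks grouped by priority level; FIFO within each level."""
    def __init__(self):
        self.levels = defaultdict(deque)  # priority -> FIFO queue of tasks

    def submit(self, task, priority):
        self.levels[priority].append(task)

    def next_task(self):
        """Serve the highest non-empty priority level first."""
        for prio in sorted(self.levels, reverse=True):
            if self.levels[prio]:
                return self.levels[prio].popleft()
        return None  # nothing pending

q = PendingQueue()
q.submit("batch-1", priority=0)
q.submit("prod-1", priority=10)
q.submit("prod-2", priority=10)
print(q.next_task())  # prod-1: highest priority level, first in its queue
```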
Borgmaster Scheduler
Two phases:
Feasibility checking: find candidate machines for the task
Scoring: pick a machine among the feasible ones
tries to pick machines that already hold copies of the task’s packages
minimizes the number of preemptions when the cell is loaded and the task has high priority
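The two phases can be sketched as follows (a toy model; the fields, weights, and scoring formula are illustrative assumptions, not Borg's):

```python
def schedule(task, machines):
    """Two-phase placement sketch: feasibility filter, then scoring."""
    # Phase 1: feasibility -- machines with enough free resources for the task
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_ram"] >= task["ram"]]
    if not feasible:
        return None  # task stays pending

    # Phase 2: scoring -- prefer machines that already cache the task's packages,
    # and penalize placements that would require preempting running tasks
    def score(m):
        return (task["package"] in m["cached_packages"]) * 10 - m["preemptions_needed"]

    return max(feasible, key=score)

machines = [
    {"name": "m1", "free_cpu": 4, "free_ram": 8, "cached_packages": {"webserver"}, "preemptions_needed": 0},
    {"name": "m2", "free_cpu": 4, "free_ram": 8, "cached_packages": set(),         "preemptions_needed": 0},
    {"name": "m3", "free_cpu": 1, "free_ram": 1, "cached_packages": {"webserver"}, "preemptions_needed": 0},
]
task = {"cpu": 2, "ram": 4, "package": "webserver"}
print(schedule(task, machines)["name"])  # m1: feasible AND already has the package
```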
Task Scheduling
Two extremes:
Spread load across machines ⇒ increases fragmentation
Fill machines as tightly as possible ⇒ placing large tasks is easier, but:
a problem for applications with bursty load
not good for batch jobs that specify a small initial CPU but expect to benefit from unused resources
Google’s scheduler is (obviously) a hybrid
Scheduling... preemption and task start-up
The scheduler may preempt some tasks when starting a new one ⇒ preempted tasks go back to the pending queue
Task start-up latency: median of 25 s, but high variability (across cells)
Package installation takes 80% of that time! Packages can be distributed using BitTorrent-like protocols (i.e., in parallel)
Borglet
Local agent running on every machine
Starts and stops tasks
Manages local resources ⇒ manipulates local OS settings
... because it is a bare-metal approach
The Borgmaster pulls state every few seconds. A push approach can be dangerous (storm of messages after an outage)
The pull work is shared among the Borgmaster replicas; each reports to the elected master only the diff from the previous report
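The diff-based reporting can be sketched as follows (a simplification; the real replicas aggregate and compress full Borglet state reports):

```python
def state_diff(previous: dict, current: dict) -> dict:
    """Report only what changed since the last poll: new/changed entries and removals."""
    changed = {k: v for k, v in current.items() if previous.get(k) != v}
    removed = [k for k in previous if k not in current]
    return {"changed": changed, "removed": removed}

prev = {"task1": "RUNNING", "task2": "RUNNING"}
curr = {"task1": "RUNNING", "task3": "STARTING"}
print(state_diff(prev, curr))
# {'changed': {'task3': 'STARTING'}, 'removed': ['task2']}
```

Sending only the diff keeps the master's inbound traffic proportional to change, not to cell size.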
Borglets
What if a Borglet is unable to communicate?
The Borglet continues working
The Borgmaster re-schedules its tasks on different machines... and kills the originals if the Borglet re-appears
This model assumes transient communication issues dominate
Scalability
A single Borgmaster can manage cells with several thousand machines
Several cells have arrival rates above 10,000 tasks per minute
A busy Borgmaster uses 10–14 CPU cores and up to 50 GiB of RAM
Scalability is achieved via threading (parallelization) and by sharing functions across Borgmaster replicas:
99%ile UI response time < 1 s
95%ile Borglet polling interval < 10 s
Scalability
Machine scores (used to decide where to schedule) are cached and lazily updated
The scheduler examines not all machines but a random subset, until it finds a feasible allocation
What they claim:
a few tens of seconds to schedule an entire cell’s workload from scratch
against... 3 days otherwise
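The random-subset idea can be sketched as: examine machines in random order, stop once enough feasible candidates are found, and score only those (all names and thresholds below are illustrative assumptions):

```python
import random

def pick_machine(task_cpu, machines, enough=3, seed=42):
    """Examine machines in random order; stop after `enough` feasible candidates."""
    rng = random.Random(seed)          # fixed seed only for reproducibility here
    order = list(machines)
    rng.shuffle(order)
    candidates = []
    for m in order:
        if m["free_cpu"] >= task_cpu:  # feasibility check
            candidates.append(m)
            if len(candidates) == enough:
                break                  # skip scoring the rest of the cell
    # Score only the sampled candidates (toy score: most free CPU wins)
    return max(candidates, key=lambda m: m["free_cpu"], default=None)

machines = [{"name": f"m{i}", "free_cpu": i % 8} for i in range(1000)]
best = pick_machine(task_cpu=2, machines=machines)
print(best is not None)  # True: found a placement without scoring all 1000 machines
```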
Availability
Failures are the norm in large-scale systems
Applications must be able to adapt
A failure may simply result in a re-scheduling, especially for non-prod tasks
Availability
The scheduler tries to avoid correlated failures ⇒ tasks of a job are placed on different racks and different power domains
They claim to achieve 99.99% availability
Utilization
First problem: finding a metric that characterizes utilization. Average utilization?
Need to account for load spikes
Need to account for batch jobs
Cell compaction:
1 Take a workload (real, not synthetic → benefit from the Borgmaster logs)
2 Take a given cell and try to fit the workload with the scheduling algorithm. Repeat several times (why? because of the non-determinism of the algorithm)
3 Repeat with a smaller cell until the workload no longer fits
A nice metric, but details are missing in the paper, e.g., only a static case is considered, not a dynamic one (what about batch jobs? only their initial resource claims are taken)
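The compaction loop can be sketched with a toy stand-in for the scheduler (here, 1-D first-fit bin packing over CPU only; the real experiment replays the actual Borg scheduler on logged workloads):

```python
import random

def fits(workload, cell_size, machine_capacity=10, seed=0):
    """Try to first-fit the workload into `cell_size` machines (toy scheduler)."""
    rng = random.Random(seed)
    tasks = list(workload)
    rng.shuffle(tasks)                  # model the scheduler's non-determinism
    free = [machine_capacity] * cell_size
    for t in tasks:
        placed = False
        for i in range(cell_size):
            if free[i] >= t:
                free[i] -= t
                placed = True
                break
        if not placed:
            return False                # some task could not be hosted
    return True

def compact(workload, start_size, trials=5):
    """Shrink the cell until the workload no longer fits in every trial."""
    size = start_size
    while size > 1 and all(fits(workload, size - 1, seed=s) for s in range(trials)):
        size -= 1
    return size  # smallest cell size that still fit in all trials

workload = [3, 3, 4, 5, 2, 2, 1]        # task CPU demands
print(compact(workload, start_size=10))
```

Several trials per cell size mirror the paper's repeated runs: one lucky packing should not decide the compacted size.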
Utilization
Cell compaction makes it possible to assess the remaining headroom
Cell sharing
Initial design choice
Mixing batch and latency-sensitive workload
How efficient is it? Segregating prod and non-prod workloads would require 20–30% more machines
Cell sharing
Intuitive explanation:
latency-sensitive (prod) jobs reserve additional resources for load spikes
batch jobs benefit from these reserved-but-unused resources
Cell sharing
A more subtle question: do non-prod tasks steal resources from prod tasks? ⇒ use of the CPI (Cycles Per Instruction) metric
Heavily shared vs. lightly shared cells
Mean CPI of 1.58 (σ = 0.35) in shared cells against 1.53 (σ = 0.32) in dedicated cells ⇒ CPU performance is about 3% worse in shared cells
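The ~3% figure follows directly from the two mean CPI values:

```python
shared_cpi, dedicated_cpi = 1.58, 1.53  # mean CPI values reported in the paper
slowdown = (shared_cpi - dedicated_cpi) / dedicated_cpi
print(f"{slowdown:.1%}")  # 3.3%: CPU performance is about 3% worse in shared cells
```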
Kubernetes (Successor of Borg)
The bad (what needs to be changed):
The job level is not flexible enough ⇒ notion of pods: scheduling units that share labels (key/value pairs)
Labels enable different groupings: service, tier, or release-type (e.g., production, staging, test)
One IP per machine complicates things: contention on port numbers; applications need to adapt to the port available at deployment
From Borg to Kubernetes
The good:
Allocs (now at pod level) are good
Kubernetes supports naming and load balancing using the service abstraction:
a service = a name + a dynamic set of pods defined by a label selector
Kubernetes automatically load-balances connections to the service among the pods that match the label selector
Giving users access to detailed runtime information. Why? Because they will help each other... and not overload the help-desk ;-)
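The label-selector mechanism behind services can be sketched in a few lines (a toy model, not the Kubernetes API; pod names and labels are invented for illustration):

```python
def select(pods, selector):
    """Return the pods whose labels contain every key/value pair of the selector."""
    return [p for p in pods
            if all(p["labels"].get(k) == v for k, v in selector.items())]

pods = [
    {"name": "web-1", "labels": {"service": "web", "tier": "frontend", "release": "production"}},
    {"name": "web-2", "labels": {"service": "web", "tier": "frontend", "release": "staging"}},
    {"name": "db-1",  "labels": {"service": "db",  "tier": "backend",  "release": "production"}},
]

# A "web production" service: a dynamic set of pods defined by a label selector
backends = select(pods, {"service": "web", "release": "production"})
print([p["name"] for p in backends])  # ['web-1']
```

Because the set is recomputed from labels, pods can join or leave a service simply by being created, deleted, or relabeled.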