KubeCon EU 2016: A Practical Guide to Container Scheduling


Transcript of KubeCon EU 2016: A Practical Guide to Container Scheduling


@tekgrrl #kubecon #kubernetes

@tekgrrl

+MandyWaite

[Diagram: Borg cell architecture. Labels: web browsers, borgcfg, Config file, Binary, Cell Storage, BorgMaster (link shard, UI shard; shown several times), persistent store (Paxos), scheduler, Borglets.]

Developer View

job hello_world = {
  runtime = { cell = 'ic' }             // Cell (cluster) to run in
  binary = '.../hello_world_webserver'  // Program to run
  args = { port = '%port%' }            // Command line parameters
  requirements = {                      // Resource requirements
    ram = 100M
    disk = 100M
    cpu = 0.1
  }
  replicas = 5                          // Number of tasks
}

10000

Developer View

[Image: "Hello world!" repeated many times across the slide. Image by Connie Zhou]

Developer View

Hello world!

“Internally, we don't use VMs - we just use containers to pack multiple tasks onto one machine, and stop them treading on one another.” - John Wilkes


Failures

[Figure: task-eviction rates and causes]

Images by Connie Zhou

A 2000-machine service will have >10 task exits per day. This is not a problem: it's normal.
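To put that figure in per-machine terms, a quick back-of-the-envelope calculation can help. This is only a sketch: it assumes exits are spread evenly across machines, which the slide does not state, and the function names are made up for illustration.

```python
# Back-of-the-envelope: what ">10 task exits per day" on a 2000-machine
# service means for a single machine.
# Assumption (not from the slide): exits are spread evenly across machines.

MACHINES = 2000
EXITS_PER_DAY = 10  # lower bound quoted on the slide


def exits_per_machine_per_day(machines: int, exits_per_day: float) -> float:
    """Average daily exit rate seen by one machine."""
    return exits_per_day / machines


def days_between_exits(machines: int, exits_per_day: float) -> float:
    """Average number of days between exits on any one machine."""
    return machines / exits_per_day


rate = exits_per_machine_per_day(MACHINES, EXITS_PER_DAY)
gap = days_between_exits(MACHINES, EXITS_PER_DAY)
print(f"at least {rate:.3f} exits per machine per day")
print(f"roughly one exit per machine every {gap:.0f} days, or sooner")
```

Under that assumption a single machine sees an exit only rarely, while the service as a whole sees them every day, which is why the application has to treat task exits as routine.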


Efficiency

Advanced bin-packing algorithms

[Figure: experimental placement of production VM workload, July 2014. Labels: available resources, one machine, stranded resources]

Efficiency

[Figure labels: Used CPU (in cores), Used Memory, Available Resources, Stranded Resources]

Efficiency

Multiple applications per machine

[Figure: tasks per machine, with the median marked. Source: CPI^2 paper, EuroSys 2013]

Efficiency

Cells run both Prod and Non Prod tasks

[Diagram: Borg cell architecture (web browsers, borgcfg, Config file, Binary, Cell Storage, BorgMaster with link shard and UI shard, persistent store (Paxos), scheduler), with the Cell labelled "batch"]

Efficiency

Sharing Cells between prod/non-prod is Better

[Figure, axis label: Cell. Series: shared cell (original), shared cell (compacted), Prod load (compacted), Non-Prod load (compacted). Annotation: represents the overhead of running prod and non-prod in their own cells]

Efficiency

Resource reclamation

[Figure: resource against time, with the following labels]

limit: amount of resource requested

usage: actual resource consumption

reservation: estimate of future usage

potentially reusable resources
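The relationship between limit, usage, and reservation can be sketched in a few lines. This reads the figure as saying that the potentially reusable resources are the gap between a task's limit and its reservation. The `Task` class, the `safety_margin` parameter, and the margin-based estimator are assumptions made for this example, not Borg's actual estimator.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    limit: float  # amount of resource requested (e.g. cores)
    usage: float  # actual resource consumption right now


def reservation(task: Task, safety_margin: float = 0.2) -> float:
    """Estimate future usage as current usage plus a margin, capped at the limit.

    The margin-based estimate is an assumption made for this sketch.
    """
    return min(task.limit, task.usage * (1.0 + safety_margin))


def reclaimable(task: Task, safety_margin: float = 0.2) -> float:
    """Potentially reusable resources: the gap between limit and reservation."""
    return task.limit - reservation(task, safety_margin)


tasks = [Task("web", limit=4.0, usage=1.0), Task("db", limit=2.0, usage=1.9)]
for t in tasks:
    print(t.name, round(reclaimable(t), 2))
```

A task running well below its limit frees up most of its request for other work, while a task running close to its limit frees up nothing.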

Efficiency

Resource reclamation could be more aggressive

[Figure: data from Nov/Dec 2013]

Kubernetes

[Diagram: Kubernetes architecture. Labels: K8s Master (API Server, Dashboard, scheduler, Controllers, etcd), Kubelets, Container Registry, kubectl, web browsers, Config file, Image.]

Kubernetes without a Scheduler

[Diagram: K8s Master (API Server, Dashboard, scheduler, etcd, Controllers) and four nodes, k8s-minion-xyz, k8s-minion-abc, k8s-minion-fig, and k8s-minion-cat, each running a Kubelet.]

apiVersion: v1
kind: Pod
metadata:
  name: bursty-static
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80

Kubernetes without a Scheduler

[Diagram: K8s Master (API Server, Dashboard, etcd, Controllers) and four nodes, k8s-minion-xyz, k8s-minion-abc, k8s-minion-fig, and k8s-minion-cat, each running a Kubelet. The pod poddy is shown on k8s-minion-xyz.]

apiVersion: v1
kind: Pod
metadata:
  name: poddy
spec:
  nodeName: k8s-minion-xyz
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
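The two slides suggest that a pod which names its node in spec.nodeName needs no scheduler, because a Kubelet only has to pick up the pods bound to its own node. The toy model below illustrates that idea with plain dictionaries. It does not talk to a real API server, and `pods_for_node` is a hypothetical helper written for this sketch.

```python
# Toy model: pod manifests as plain dicts, mirroring the YAML on the slides.
poddy = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "poddy"},
    "spec": {
        "nodeName": "k8s-minion-xyz",  # node chosen by hand, no scheduler
        "containers": [
            {"name": "nginx", "image": "nginx", "ports": [{"containerPort": 80}]}
        ],
    },
}

bursty_static = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "bursty-static"},
    "spec": {  # no nodeName: something still has to pick a node
        "containers": [
            {"name": "nginx", "image": "nginx", "ports": [{"containerPort": 80}]}
        ],
    },
}


def pods_for_node(node_name, pods):
    """Names of the pods already bound to node_name (what its Kubelet would run)."""
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod["spec"].get("nodeName") == node_name
    ]


all_pods = [poddy, bursty_static]
print(pods_for_node("k8s-minion-xyz", all_pods))
print(pods_for_node("k8s-minion-abc", all_pods))
```

In this model bursty-static is never picked up by any node, which is the gap a scheduler fills.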

Resources

Kubernetes Resources

A Resource is something that can be requested by, allocated to, or consumed by a pod or a container

CPU: specified in units of cores; what a core is depends on the provider

Memory: specified in units of bytes

CPU is compressible (i.e. it has a rate and can be throttled)

Memory is incompressible; it can't be throttled

Kubernetes Resources (contd)

Future Plans:

More Resources:

● Network Ops
● Network Bandwidth
● Storage
● IOPS
● Storage Time

Kubernetes Compute Unit (KCU)

Resource based Scheduling

my-controller.yaml

...
spec:
  containers:
  - name: locust
    image: gcr.io/rabbit-skateboard/guestbook:gdg-rtv
    resources:
      requests:
        memory: "300Mi"
        cpu: "100m"
      limits:
        memory: "300Mi"
        cpu: "100m"
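The quantities in the manifest use suffixes: "100m" is 100 millicores (a tenth of a core) and "300Mi" is 300 mebibytes. A minimal sketch of converting them to plain numbers follows. `parse_cpu` and `parse_memory` are hypothetical helpers that handle only a subset of the suffixes Kubernetes accepts.

```python
# Suffix tables: a deliberately small subset, enough for the manifest above.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity to cores; a trailing 'm' means millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity to bytes."""
    # Binary suffixes are checked first so "Mi" is not mistaken for "M".
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already in bytes


print(parse_cpu("100m"))
print(parse_memory("300Mi"))
```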

Resource based Scheduling (Work In Progress)

Provide QoS for Scheduled Pods

Per Container CPU and Memory requirements

Specified as Request and Limit

Future releases will [better] support:

● Best Effort (Request == 0)
● Burstable (Request < Limit)
● Guaranteed (Request == Limit)

Best Effort Scheduling for low priority workloads improves Utilization at Google by 20%
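The three classes above follow directly from comparing Request with Limit. The sketch below applies the slide's rules to a single resource of a single container. `qos_class` is a hypothetical helper, and checking Best Effort first when both values are zero is a choice made for this example.

```python
def qos_class(request: float, limit: float) -> str:
    """Classify one container resource using the rules listed on the slide."""
    if request < 0 or limit < request:
        raise ValueError("expected 0 <= request <= limit")
    if request == 0:
        # Checked first, so request == limit == 0 counts as Best Effort.
        return "Best Effort"
    if request == limit:
        return "Guaranteed"
    return "Burstable"


print(qos_class(0.0, 1.0))
print(qos_class(0.1, 0.5))
print(qos_class(0.1, 0.1))
```

The container in my-controller.yaml, with requests equal to limits for both CPU and memory, would land in the Guaranteed class under these rules.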

Scheduling Pods: Nodes

[Diagram: a K8s Node running a Kubelet, with Resources, Labels (disk = ssd), and Disks]

Nodes may not be homogeneous; they can differ in important ways:

● CPU and Memory Resources
● Attached Disks
● Specific Hardware

Location may also be important

Pod Scheduling: Identifying Potential Nodes

What CPU and Memory Resources does it need?

Can also be used as a measure of priority

[Diagram: K8s Node with Kubelet, Proxy, CPU, and Mem]
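The first question, whether a node has enough CPU and memory for the pod, can be sketched as a simple filter. This is a simplified illustration rather than the scheduler's actual code. The `Node` class and its fields are assumptions, with CPU in millicores and memory in MiB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    cpu_capacity: int       # millicores
    mem_capacity: int       # MiB
    cpu_requested: int = 0  # millicores already requested by pods on the node
    mem_requested: int = 0  # MiB already requested by pods on the node


def fits(node: Node, cpu_request: int, mem_request: int) -> bool:
    """True if the pod's requests fit in what is left on the node."""
    return (
        node.cpu_requested + cpu_request <= node.cpu_capacity
        and node.mem_requested + mem_request <= node.mem_capacity
    )


def potential_nodes(nodes: List[Node], cpu_request: int, mem_request: int) -> List[str]:
    """Names of the nodes that can still hold the pod."""
    return [n.name for n in nodes if fits(n, cpu_request, mem_request)]


nodes = [
    Node("node-1", cpu_capacity=1000, mem_capacity=2048, cpu_requested=950, mem_requested=512),
    Node("node-2", cpu_capacity=2000, mem_capacity=4096, cpu_requested=500, mem_requested=3900),
    Node("node-3", cpu_capacity=2000, mem_capacity=4096, cpu_requested=500, mem_requested=1024),
]
print(potential_nodes(nodes, cpu_request=100, mem_request=300))
```

Here node-1 is short of CPU and node-2 is short of memory, so only node-3 remains a candidate.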

Pod Scheduling: Finding Potential Nodes

What Resources does it need?

What Disk(s) does it need (GCE PD and EBS) and can it/they be mounted without conflict?

Note: 1.1 limits to

[Diagram: K8s Node with Kubelet, Proxy, CPU, and Mem]

Pod Scheduling: Identifying Potential Nodes

What Resources does it need?

What Disk(s) does it need?

What node(s) can it run on (Node Selector)?

[Diagram: K8s Node with Kubelet, Proxy, CPU, Mem, and the label disktype = ssd]

kubectl label nodes node-3 disktype=ssd

(pod) spec:
  nodeSelector:
    disktype: ssd
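A node selector is an exact match on labels: the pod can run only on nodes that carry every key and value it lists. A minimal sketch, using the node-3 label from the kubectl command above; the other node names and `matches_node_selector` are made up for illustration.

```python
from typing import Dict


def matches_node_selector(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True if the node carries every label the pod's nodeSelector asks for."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


# node-3 was labelled with: kubectl label nodes node-3 disktype=ssd
node_labels = {
    "node-1": {},
    "node-2": {"disktype": "hdd"},  # hypothetical label value
    "node-3": {"disktype": "ssd"},
}
pod_node_selector = {"disktype": "ssd"}

eligible = [
    name
    for name, labels in node_labels.items()
    if matches_node_selector(labels, pod_node_selector)
]
print(eligible)
```

An empty selector matches every node, which is the behaviour of a pod that specifies no nodeSelector at all.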

nodeAffinity (Alpha in 1.2)

{
  "nodeAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "beta.kubernetes.io/instance-type",
              "operator": "In",
              "values": ["n1-highmem-2", "n1-highmem-4"]
            }
          ]
        }
      ]
    }
  }
}

http://kubernetes.github.io/docs/user-guide/node-selection/

Implemented through Annotations in 1.2, through fields in 1.3

Can be ‘Required’ or ‘Preferred’ during scheduling

In future it can be ‘Required’ during execution (Node labels can change)

Will eventually replace NodeSelector

If you specify both nodeSelector and nodeAffinity, both must be satisfied
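How the required affinity rule could be evaluated against a node's labels is sketched below. It assumes that the entries in nodeSelectorTerms are alternatives (any one may match) and that all matchExpressions inside a term must hold. Only the In operator from the slide is covered, and the function names are hypothetical.

```python
from typing import Dict


def expression_matches(labels: Dict[str, str], expr: dict) -> bool:
    """Evaluate one matchExpressions entry; only the 'In' operator is handled."""
    key, operator = expr["key"], expr["operator"]
    if operator == "In":
        return key in labels and labels[key] in expr.get("values", [])
    raise ValueError(f"operator {operator!r} is not covered by this sketch")


def term_matches(labels: Dict[str, str], term: dict) -> bool:
    """Assumed semantics: every expression in a term must match."""
    return all(expression_matches(labels, e) for e in term["matchExpressions"])


def node_affinity_matches(labels: Dict[str, str], affinity: dict) -> bool:
    """Assumed semantics: at least one nodeSelectorTerm must match."""
    required = affinity["nodeAffinity"]["requiredDuringSchedulingIgnoredDuringExecution"]
    return any(term_matches(labels, t) for t in required["nodeSelectorTerms"])


affinity = {
    "nodeAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            "key": "beta.kubernetes.io/instance-type",
                            "operator": "In",
                            "values": ["n1-highmem-2", "n1-highmem-4"],
                        }
                    ]
                }
            ]
        }
    }
}

print(node_affinity_matches({"beta.kubernetes.io/instance-type": "n1-highmem-4"}, affinity))
print(node_affinity_matches({"beta.kubernetes.io/instance-type": "n1-standard-1"}, affinity))
```

Compared with nodeSelector, the expression form allows a set of acceptable values for one key instead of a single exact value.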

Pod Scheduling: Ranking Potential Nodes

Prefer the node with the most free resources left after the pod is deployed

Prefer nodes with the specified label

Minimise the number of Pods from the same service on the same node

CPU and Memory are balanced after the Pod is deployed [Default]

[Diagram: Node2, Node3, Node1]
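Two of the ranking rules above, most free resources left and balanced CPU and memory, can be sketched as scoring functions whose results are added up per node. The formulas, the 0 to 10 scale, and the equal weighting are simplifications chosen for this example, not the scheduler's actual implementation. The node figures are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    name: str
    cpu_capacity: int   # millicores
    mem_capacity: int   # MiB
    cpu_requested: int  # millicores already requested on the node
    mem_requested: int  # MiB already requested on the node


def fractions_after(node: Node, cpu: int, mem: int) -> Tuple[float, float]:
    """Requested fraction of CPU and memory once the pod is placed."""
    return (
        (node.cpu_requested + cpu) / node.cpu_capacity,
        (node.mem_requested + mem) / node.mem_capacity,
    )


def least_requested(node: Node, cpu: int, mem: int) -> float:
    """Higher score for nodes with more free resource left after placement."""
    cpu_frac, mem_frac = fractions_after(node, cpu, mem)
    return 10 * ((1 - cpu_frac) + (1 - mem_frac)) / 2


def balanced(node: Node, cpu: int, mem: int) -> float:
    """Higher score when CPU and memory end up equally utilised."""
    cpu_frac, mem_frac = fractions_after(node, cpu, mem)
    return 10 * (1 - abs(cpu_frac - mem_frac))


def rank(nodes: List[Node], cpu: int, mem: int) -> List[str]:
    """Node names ordered from best to worst combined score."""
    def score(n: Node) -> float:
        return least_requested(n, cpu, mem) + balanced(n, cpu, mem)
    return [n.name for n in sorted(nodes, key=score, reverse=True)]


nodes = [
    Node("node1", 4000, 8192, cpu_requested=3000, mem_requested=6144),
    Node("node2", 4000, 8192, cpu_requested=500, mem_requested=1024),
    Node("node3", 4000, 8192, cpu_requested=2000, mem_requested=4096),
]
print(rank(nodes, cpu=500, mem=1024))
```

With these invented figures the emptiest node ranks first; changing the weights between the two functions would change how strongly spreading is preferred over balance.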


Extending the Scheduler

1. Add rules to the scheduler and recompile

2. Run your own scheduler process instead of, or as well as, the Kubernetes scheduler

3. Implement a "scheduler extender" that the Kubernetes scheduler calls out to as a final pass when making scheduling decisions
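Whichever of the three options is used, the extension point has the same shape: extra rules that filter nodes out, and extra rules that score the nodes that remain. The toy scheduler below takes both as plain functions. It is a sketch under stated assumptions, not the Kubernetes scheduler or its extender protocol. The pod and node dictionaries, the `wants_ssd` field, and the rule names are hypothetical.

```python
from typing import Callable, List, Optional

Predicate = Callable[[dict, dict], bool]   # (pod, node) -> can the pod run here?
Priority = Callable[[dict, dict], float]   # (pod, node) -> higher is better


def schedule(pod: dict, nodes: List[dict],
             predicates: List[Predicate],
             priorities: List[Priority]) -> Optional[str]:
    """Filter nodes with every predicate, then pick the highest total score."""
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None  # the pod stays pending
    best = max(feasible, key=lambda n: sum(f(pod, n) for f in priorities))
    return best["name"]


# Custom rules added by the operator (both hypothetical).
def ssd_if_requested(pod: dict, node: dict) -> bool:
    return not pod.get("wants_ssd") or node["labels"].get("disktype") == "ssd"


def prefer_fewer_pods(pod: dict, node: dict) -> float:
    return -float(node["pods"])


nodes = [
    {"name": "node-1", "labels": {}, "pods": 1},
    {"name": "node-2", "labels": {"disktype": "ssd"}, "pods": 5},
    {"name": "node-3", "labels": {"disktype": "ssd"}, "pods": 2},
]

print(schedule({"name": "poddy", "wants_ssd": True}, nodes,
               [ssd_if_requested], [prefer_fewer_pods]))
```

Keeping rules as small independent functions is a design choice of this sketch; it mirrors the split between identifying potential nodes and ranking them described in the earlier slides.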


Admission Control

Admission Control enforces certain conditions before a request is accepted by the API Server

AC functionality is implemented as plugins, which are executed in the sequence in which they are specified

AC is performed after AuthN checks

Enforcement usually results in either

● A Request denial
● Mutation of the Request Resource
● Mutation of related Resources

[Diagram: K8s Master with API Server, scheduler, and Controllers; Admission Control sits in front of the API Server]
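The plugin model described above, an ordered chain in which each plugin can deny the request or mutate it, can be sketched as follows. The plugin names, the request shape, and the `cluster` dictionary are invented for this example and do not correspond to the real admission plugin interface.

```python
import copy
from typing import Callable, List


class Denied(Exception):
    """Raised by a plugin to reject the request."""


Plugin = Callable[[dict, dict], dict]  # (request, cluster state) -> request


def require_namespace(request: dict, cluster: dict) -> dict:
    """Deny requests that target a namespace the cluster does not have."""
    if request["namespace"] not in cluster["namespaces"]:
        raise Denied(f"namespace {request['namespace']!r} does not exist")
    return request


def default_cpu_request(request: dict, cluster: dict) -> dict:
    """Mutate the request: fill in a CPU request when none is given."""
    mutated = copy.deepcopy(request)
    mutated["resources"].setdefault("cpu", cluster["default_cpu"])
    return mutated


def admit(request: dict, plugins: List[Plugin], cluster: dict) -> dict:
    """Run the plugins in the order given; any of them may deny or mutate."""
    for plugin in plugins:
        request = plugin(request, cluster)
    return request


cluster = {"namespaces": {"default", "prod"}, "default_cpu": "100m"}
chain = [require_namespace, default_cpu_request]

accepted = admit({"namespace": "prod", "resources": {}}, chain, cluster)
print(accepted["resources"])

try:
    admit({"namespace": "missing", "resources": {}}, chain, cluster)
except Denied as err:
    print("denied:", err)
```

Because the plugins run in sequence, a denial early in the chain means later plugins never see the request, and a mutation early in the chain is visible to every plugin after it.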

Admission Control Examples

NamespaceLifecycle: Enforces that a Namespace that is undergoing termination cannot have new objects created in it, and ensures that requests in a non-existent Namespace are rejected

LimitRanger: Observes the incoming request and ensures that it does not violate any of the constraints enumerated in the LimitRange object in a Namespace

ServiceAccount: Implements automation for serviceAccounts

ResourceQuota: Observes the incoming request and ensures that it does not violate any of the constraints enumerated in the ResourceQuota object in a Namespace

Default plug-ins in 1.2: --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,PersistentVolumeLabel

Kubernetes is Open Source

We want your help!

http://kubernetes.io

https://github.com/kubernetes/kubernetes

Slack: #kubernetes-users

@kubernetesio

Images by Connie Zhou

cloud.google.com