Benjamin Hindman – @benh
Containerization as the Building Block for Datacenter Applications GOTO Amsterdam
June 19, 2015
an emerging trend
microservices containerization
(micro)services ① do one thing and do it well (UNIX)
② compose!
③ build/commit in isolation, test in isolation, deploy in isolation (with easy rollback)
④ captures organizational structure (many teams working in parallel)
containerization
now then
more moving parts less moving parts
a reinforcing trend
microservices containerization
cluster management
“cluster management” at Twitter, circa 2010
(configuration/package management) (deployment)
MySQL
MySQL memcached
MySQL Rails memcached
MySQL Cassandra Rails memcached
MySQL Cassandra Rails Hadoop memcached
challenges
challenges ① failures
failures
MySQL Cassandra Rails Hadoop memcached
challenges ② maintenance
(aka “planned failures”)
maintenance ① upgrading software (i.e., the kernel)
ops developers
maintenance ① upgrading software (i.e., the kernel)
② replacing machines, switches, PDUs, etc
MySQL Cassandra Rails Hadoop memcached
challenges ③ utilization
Rails
Hadoop
memcached
utilization: pack Rails, Hadoop, and memcached onto shared machines and buy fewer machines
or run more applications!
challenges ① failures
② maintenance
③ utilization
planning for failure
planning for utilization?
planning for utilization
intra-machine resource sharing: share a single machine’s resources between multiple applications (multi-tenancy)
intra-datacenter resource sharing: share multiple machines’ resources between multiple applications
Twitter, circa 2010
I want a cluster manager!
cluster management ① Treat machines as cattle not pets. » Keep the base operating system small and simple,
run “containerized” applications.
② Automate with software not humans. » Let software schedule software, i.e., handle
failures, improve utilization, and manage maintenance.
cluster management
different software
academia: MPI (Message Passing Interface)
industry: Apache (mod_perl, mod_php), web services (Java, Ruby, …)
different scale (at first)
academia: 100s of machines; industry: 10s of machines
cluster management
academia: PBS (Portable Batch System), TORQUE, SGE (Sun Grid Engine)
industry: ssh, Puppet/Chef, Capistrano/Ansible
cluster managers
different scale (converging)
academia: 100s of machines; industry: 10s of machines → both converging on 1,000s of machines
cluster management
academia’s cluster managers (PBS, TORQUE, SGE) were built for batch computation!
Apache Mesos is a modern general-purpose cluster manager (i.e., not just focused on batch computing)
Mesos was designed to run:
stateless services!
Mesos is a cluster manager with a master/slave architecture
masters
slaves
schedulers register with the Mesos master(s) in order to run jobs/tasks
masters
slaves
schedulers
stateless services!
a service scheduler orchestrates services using Mesos
orchestration vs scheduling
the service scheduler orchestrates; Mesos schedules
orchestration w/ Mesosphere’s Marathon
① configuration/package management
② deployment
configuration/package management
(1) bundle services as jar, tar/gzip, or using Docker
(2) upload to HDFS, S3, Docker registry, etc
deployment
(1) describe services using JSON
(2) submit services to Marathon via REST or CLI
example-docker.json

{
  "container": {
    "type": "DOCKER",
    "docker": { "image": "libmesos/ubuntu" },
    "volumes": [
      { "containerPath": "/etc/a", "hostPath": "/var/data/a", "mode": "RO" },
      { "containerPath": "/etc/b", "hostPath": "/var/data/b", "mode": "RW" }
    ]
  },
  "id": "ubuntu",
  "instances": 1,
  "cpus": 0.5,
  "mem": 512,
  "cmd": "while sleep 10; do date -u +%T; done"
}
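An app definition like this can also be built and submitted programmatically. A minimal sketch using only the standard library, assuming a Marathon master reachable on its default port 8080 (`/v2/apps` is Marathon’s REST endpoint for creating apps; the `submit` helper is illustrative):

```python
import json
from urllib import request

# The same app definition as example-docker.json, built as a Python dict.
app = {
    "id": "ubuntu",
    "instances": 1,
    "cpus": 0.5,
    "mem": 512,
    "cmd": "while sleep 10; do date -u +%T; done",
    "container": {
        "type": "DOCKER",
        "docker": {"image": "libmesos/ubuntu"},
        "volumes": [
            {"containerPath": "/etc/a", "hostPath": "/var/data/a", "mode": "RO"},
            {"containerPath": "/etc/b", "hostPath": "/var/data/b", "mode": "RW"},
        ],
    },
}

def submit(app, marathon="http://localhost:8080"):
    """POST the app definition to Marathon's REST API (/v2/apps)."""
    req = request.Request(
        marathon + "/v2/apps",
        data=json.dumps(app).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

The same definition could equally be submitted from the CLI with curl or the Marathon command-line client.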
orchestration w/ Kubernetes on Mesos
① configuration/package management
② deployment
multiple schedulers
going deeper …
two-level scheduling
Mesos influenced by multi-level scheduling in traditional operating systems (user-space scheduling and scheduler activations)
Mesos is designed less like a “cluster manager” and more like an operating system kernel
Apache Mesos: distributed systems kernel
Apache Mesos: datacenter kernel
Mesos (nodes)
Mesos: datacenter kernel
scheduler
Mesos (master)
scheduler
syscall-like API for the datacenter
Mesos: datacenter kernel + enable running multiple distributed systems on the same cluster of machines and dynamically share the resources more efficiently!
Mesos: datacenter kernel + enable building new distributed systems by providing common functionality (primitives) that every new distributed system re-implements
Mesos primitives
• principals, users, roles
• advanced fair-sharing allocation algorithms
• high-availability (even during upgrades)
• resource monitoring
• preemption/revocation
• volume management
• reservations (dynamic/static)
• …
build on top of Mesos
① don’t reinvent the wheel: leverage primitives to implement/automate failures, maintenance, etc.
build on top of Mesos
② make it easier for your users to use your software!
built on Mesos (2009, 2010, 2013, 2014)
ported to Mesos (2011, 2012, 2013, 2014)
Mesos is being run at: (company logos, 2010, 2013, 2014, …)
going even deeper …
Mesos resource requests/offers
masters
scheduler
request 3 CPUs 2 GB RAM
a request is a purposely simplified subset of a specification, mainly including the resources required at that point in time
masters
scheduler
offer hostname 4 CPUs 4 GB RAM
offer hostname 4 CPUs 4 GB RAM
offer hostname 4 CPUs 4 GB RAM
offer hostname 4 CPUs 4 GB RAM
scheduler uses the offers to decide what tasks to run
“two-level scheduling”
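The offer cycle can be sketched as a toy simulation: the first level (Mesos) offers each slave’s resources, and the second level (the framework scheduler) decides which tasks to launch. All names below are illustrative stand-ins, not the real Mesos API:

```python
# Toy sketch of "two-level scheduling": Mesos offers whole-node resources;
# the framework's scheduler greedily places tasks against those offers.

offers = [{"hostname": "node-%d" % i, "cpus": 4, "mem_gb": 4}
          for i in range(4)]

def schedule(offers, tasks):
    """Place each task on the first offer with enough CPU and RAM left."""
    placements = []
    for task in tasks:
        for offer in offers:
            if offer["cpus"] >= task["cpus"] and offer["mem_gb"] >= task["mem_gb"]:
                offer["cpus"] -= task["cpus"]      # consume part of the offer
                offer["mem_gb"] -= task["mem_gb"]
                placements.append((task["name"], offer["hostname"]))
                break
    return placements

tasks = [{"name": "web-%d" % i, "cpus": 3, "mem_gb": 2} for i in range(3)]
placements = schedule(offers, tasks)
print(placements)
```

Crucially, the placement policy lives in the framework, not in Mesos: Mesos only decides *which resources* to offer, the scheduler decides *what to run* on them.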
Mesos task/executor model
masters
scheduler
scheduler uses the offers to decide what tasks to run
task 3 CPUs 2 GB RAM
slave
a task with a command
mesos-slave
task task
slave
a task with an executor
mesos-slave
executor
task task task
slave
task/executor isolation
mesos-slave
executor
task task
containers
slave
a task with a Docker image
mesos-slave
Docker image
masters
master failover
scheduler
after a new master is elected, all schedulers and slaves connect to the new master; all tasks keep running across master failover!
masters
scheduler failover
scheduler
the scheduler re-registers with the master and resumes operation; all tasks keep running across framework failover!
scheduler
slave
slave failover
mesos-slave
task task
the mesos-slave can be restarted or upgraded without killing running tasks
(large in-memory services, expensive to restart)
down the rabbit hole …
the Mesos first-level scheduler allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)
DRF, born of static partitioning
datacenter
static partitioning across teams
promotions trends recommendations team
fairly shared!
goal: fairly share the resources without static partitioning
partition utilizations:
promotions: 45% CPU, 100% RAM
trends: 75% CPU, 100% RAM
recommendations: 100% CPU, 50% RAM
observation: each team is bottlenecked by a dominant resource that prevents it from running any more jobs/tasks
dominant resource bottlenecks:
promotions: 45% CPU, 100% RAM → bottleneck: RAM
trends: 75% CPU, 100% RAM → bottleneck: RAM
recommendations: 100% CPU, 50% RAM → bottleneck: CPU
insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!
… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
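This insight is exactly what DRF implements: repeatedly grant a task to the framework with the lowest dominant share. A runnable sketch using the classic worked example from the DRF paper (9 CPUs and 18 GB total; A’s tasks each need ⟨1 CPU, 4 GB⟩, B’s each need ⟨3 CPUs, 1 GB⟩):

```python
# Minimal Dominant Resource Fairness (DRF) allocator sketch.

total = {"cpus": 9.0, "mem_gb": 18.0}
demands = {"A": {"cpus": 1.0, "mem_gb": 4.0},   # A's tasks are RAM-heavy
           "B": {"cpus": 3.0, "mem_gb": 1.0}}   # B's tasks are CPU-heavy

used = {r: 0.0 for r in total}
allocated = {user: {r: 0.0 for r in total} for user in demands}
tasks = {user: 0 for user in demands}

def dominant_share(user):
    """A user's share of the resource they consume the most of."""
    return max(allocated[user][r] / total[r] for r in total)

while True:
    # Offer next to the framework with the lowest dominant share.
    user = min(demands, key=dominant_share)
    demand = demands[user]
    if any(used[r] + demand[r] > total[r] for r in total):
        break  # cluster is saturated (here neither framework fits another task)
    for r in total:
        used[r] += demand[r]
        allocated[user][r] += demand[r]
    tasks[user] += 1

print(tasks)  # A runs 3 tasks (dominant: RAM), B runs 2 (dominant: CPU)
```

Both frameworks end up with a dominant share of 2/3, i.e., each does at least as well as owning a static partition.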
Step 4: Profit (statistical multiplexing) $
in practice, fair sharing is insufficient
weighted fair sharing
promotions: weight 0.17; trends: weight 0.5; recommendations: weight 0.33
Mesos implements weighted DRF
masters
masters can be configured with weights per role; resource allocation decisions incorporate the weights to determine dominant fair shares
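One way to sketch the weighting: divide each role’s dominant share by its weight and offer next to the role with the lowest result. The weights below are the ones from the slide; the current dominant shares are assumed purely for illustration:

```python
# Weighted DRF sketch: a role's dominant share is divided by its weight,
# so higher-weighted roles reach "fairness" later and thus receive more.

weights = {"promotions": 0.17, "trends": 0.5, "recommendations": 0.33}

# Assumed current dominant shares (fraction of each role's dominant resource).
dominant_shares = {"promotions": 0.10, "trends": 0.30, "recommendations": 0.10}

def weighted_share(role):
    return dominant_shares[role] / weights[role]

# The allocator offers next to the role with the lowest weighted share.
next_role = min(weights, key=weighted_share)
print(next_role)
```

With equal weights this reduces to plain DRF; here recommendations wins the next offer despite being tied with promotions on raw share, because its weight is larger.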
in practice, weighted fair sharing is still insufficient
a non-cooperative framework (i.e., one with long tasks, or a buggy one) can hoard too many resources
resource reservations
resources on individual slaves can be reserved for particular roles; resource offers include the reservation role (if any)
masters
framework (trends)
offer hostname 4 CPUs 4 GB RAM role: trends
reservations
slave: resources on the mesos-slave are partitioned among * (unreserved), role-foo, and role-bar
static reservations
reservations available in Mesos using the --resources flag on each mesos-slave:
$ mesos-slave --resources='cpus(role-foo):2;mem(role-foo):1024;cpus(role-bar):2;mem(role-bar):1024;cpus(*):4;mem(*):4096'
static reservations
+ strong guarantees
− set up by an operator when starting the slave
− immutable (must drain/restart the slave)
dynamic reservations
a framework scheduler reserves resources at runtime when it accepts an offer (allocation)
(1) masters send an offer (hostname, 4 CPUs, 4 GB RAM) to the framework (role-baz)
(2) the framework accepts with Launch/Reserve(4 CPUs, 4 GB RAM)
(3) masters apply Launch/Reserve(4 CPUs, 4 GB RAM) on the slave
reservations: promotions 40%, trends 20%, recommendations 40% (one reservation only 10% used, 30% unused)
reservations provide guarantees, but at the cost of utilization
revocable resources
masters
framework (promotions)
resources that are reserved for another role and thus may be revoked at any time
offer hostname 4 CPUs 4 GB RAM role: trends
preemption via revocation
… my tasks will not be killed unless I’m using revocable resources!
oversubscription via revocable resources
masters
framework (promotions)
oversubscribe resources by allocating unused resources as revocable!
offer hostname 4 CPUs 4 GB RAM role: trends
revocation “guarantee”
… my tasks will not be killed unless I’m using revocable resources!
what about when:
① want/need to defrag (think page replacement algorithm!)
② maintenance
mechanism for deallocation
possible solution: introduce failures to deallocate resources
but why not communicate explicitly!?
inverse offer
a framework scheduler gets deallocation requests in the form of inverse offers
(1) masters send inverse offer-1 (hostname, 4 CPUs, 4 GB RAM) to the framework (role-baz)
(2) the framework scheduler can kill its tasks and acknowledge the deallocation (Kill/ACK)
maintenance
① drain a machine by sending out inverse offers
② while draining, still send out offers with revocable resources
③ remove the machine from allocation once drained (or at some specific time)
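The drain sequence can be sketched as a toy state machine; the names and functions below are illustrative stand-ins, not the Mesos API:

```python
# Toy sketch of maintenance via inverse offers: the master asks the
# framework to give back a machine's resources; once the framework kills
# (or migrates) its tasks and ACKs, the machine leaves the allocator.

machines = {"node-1": ["task-a", "task-b"], "node-2": ["task-c"]}
draining = set()

def send_inverse_offer(machine):
    """Step 1: master asks frameworks to vacate `machine`."""
    draining.add(machine)

def framework_kill_and_ack(machine):
    """Step 2: the framework kills/migrates its tasks and acknowledges."""
    machines[machine] = []

def drained(machine):
    """Step 3: a draining machine with no tasks can be removed."""
    return machine in draining and not machines[machine]

send_inverse_offer("node-1")
framework_kill_and_ack("node-1")
print(drained("node-1"), drained("node-2"))  # True False
```

The point of the explicit protocol is that the framework, not the master, chooses when and how to vacate, instead of learning about maintenance through induced failures.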
but what about my persistent data?
persistent volumes
a framework scheduler creates volumes for disk resources at runtime when it accepts an offer (allocation)
(1) masters send an offer (hostname, 4 CPUs, 4 GB RAM) to the scheduler (role-baz)
(2) the scheduler accepts with Launch/Create(1 TB DISK)
(3) masters apply Launch/Create(1 TB DISK) on the slave
persistent volumes
volumes are created before launching any tasks or executors
volumes are mounted into the container when a task or executor gets launched
volumes persist even after the task or executor terminates!
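That lifecycle can be sketched as a toy model in which volume state outlives the tasks that mount it (illustrative names only, not the Mesos API):

```python
# Toy sketch of persistent volumes: a volume created against reserved
# disk survives the task that used it, so a replacement task can
# reattach and see prior state.

volumes = {}   # volume id -> data that outlives tasks

def create_volume(vol_id):
    """Volume is created before any task or executor is launched."""
    volumes.setdefault(vol_id, [])

def run_task(vol_id, record):
    """A task mounts the volume, appends state, then terminates."""
    volumes[vol_id].append(record)

create_volume("role-baz/db")
run_task("role-baz/db", "write-1")   # first task writes and exits
run_task("role-baz/db", "write-2")   # a replacement task sees prior state
print(volumes["role-baz/db"])        # ['write-1', 'write-2']
```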
the bigger picture
reservations + inverse offers + persistent volumes = long-lived stateful frameworks!
conclusion
our “other” computers are datacenters*
* collection of physical and/or virtual machines
the datacenter is just another form factor
Mesos is the kernel …
desktop → OS, server → OS, datacenter → ?
need a datacenter operating system
Mesosphere’s DCOS
Marathon
Marathon, a scheduler for running stateless services written in any language
init for the datacenter operating system
Chronos
Chronos, a scheduler for running cron jobs with dependencies
cron for the datacenter operating system
DCOS CLI
mesosphere.com
Thanks!