Benjamin Hindman – @benh
Containerization as the Building Block for Datacenter Applications GOTO Amsterdam
June 19, 2015
an emerging trend
microservices containerization
(micro)services ① do one thing and do it well (UNIX)
② compose!
③ build/commit in isolation, test in isolation, deploy in isolation (with easy rollback)
④ captures organizational structure (many teams working in parallel)
containerization
now then
more moving parts less moving parts
a reinforcing trend
microservices containerization
cluster management
“cluster management” at Twitter, circa 2010
(configuration/package management) (deployment)
MySQL
MySQL memcached
MySQL Rails memcached
MySQL Cassandra Rails memcached
MySQL Cassandra Rails Hadoop memcached
challenges
challenges ① failures
failures
MySQL Cassandra Rails Hadoop memcached
challenges ② maintenance
(aka “planned failures”)
maintenance ① upgrading software (i.e., the kernel)
ops developers
maintenance ① upgrading software (i.e., the kernel)
② replacing machines, switches, PDUs, etc
MySQL Cassandra Rails Hadoop memcached
challenges ③ utilization
Rails
Hadoop
memcached
utilization: pack Rails, Hadoop, and memcached onto shared machines and buy fewer machines
or run more applications!
challenges ① failures
② maintenance
③ utilization
planning for failure
planning for utilization?
planning for utilization
intra-machine resource sharing: share a single machine’s resources between multiple applications (multi-tenancy)
intra-datacenter resource sharing: share multiple machines’ resources between multiple applications
Twitter, circa 2010
I want a cluster manager!
cluster management ① Treat machines as cattle not pets. » Keep the base operating system small and simple,
run “containerized” applications.
② Automate with software not humans. » Let software schedule software, i.e., handle
failures, improve utilization, and manage maintenance.
cluster management
different software
academia: MPI (Message Passing Interface)
industry: Apache (mod_perl, mod_php), web services (Java, Ruby, …)
different scale (at first)
academia: 100s of machines; industry: 10s of machines
cluster management
academia: PBS (Portable Batch System), TORQUE, SGE (Sun Grid Engine)
industry: ssh, Puppet/Chef, Capistrano/Ansible
cluster managers
different scale (converging)
academia: 100s of machines; industry: 10s of machines → both converging on 1,000s of machines
cluster management
academia’s cluster managers (PBS, TORQUE, SGE) were built for batch computation!
Apache Mesos is a modern general-purpose cluster manager (i.e., not just focused on batch computing)
Mesos was designed to run:
stateless services!
Mesos is a cluster manager with a master/slave architecture
masters
slaves
schedulers register with the Mesos master(s) in order to run jobs/tasks
masters
slaves
schedulers
stateless services!
a service scheduler orchestrates services using Mesos
orchestration vs scheduling
the service scheduler orchestrates; Mesos schedules
orchestration w/ Mesosphere’s Marathon
① configuration/package management
② deployment
configuration/package management
(1) bundle services as jar, tar/gzip, or using Docker
(2) upload to HDFS, S3, Docker registry, etc
deployment
(1) describe services using JSON
(2) submit services to Marathon via REST or CLI
example-docker.json

{
  "container": {
    "type": "DOCKER",
    "docker": { "image": "libmesos/ubuntu" },
    "volumes": [
      { "containerPath": "/etc/a", "hostPath": "/var/data/a", "mode": "RO" },
      { "containerPath": "/etc/b", "hostPath": "/var/data/b", "mode": "RW" }
    ]
  },
  "id": "ubuntu",
  "instances": 1,
  "cpus": 0.5,
  "mem": 512,
  "cmd": "while sleep 10; do date -u +%T; done"
}
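An app definition like this can also be built and submitted programmatically. A minimal sketch using only the standard library, assuming a Marathon master reachable on its default port 8080 (`/v2/apps` is Marathon’s REST endpoint for creating apps; the `submit` helper is illustrative):

```python
import json
from urllib import request

# The same app definition as example-docker.json, built as a Python dict.
app = {
    "id": "ubuntu",
    "instances": 1,
    "cpus": 0.5,
    "mem": 512,
    "cmd": "while sleep 10; do date -u +%T; done",
    "container": {
        "type": "DOCKER",
        "docker": {"image": "libmesos/ubuntu"},
        "volumes": [
            {"containerPath": "/etc/a", "hostPath": "/var/data/a", "mode": "RO"},
            {"containerPath": "/etc/b", "hostPath": "/var/data/b", "mode": "RW"},
        ],
    },
}

def submit(app, marathon="http://localhost:8080"):
    """POST the app definition to Marathon's REST API (/v2/apps)."""
    req = request.Request(
        marathon + "/v2/apps",
        data=json.dumps(app).encode(),
        headers={"Content-Type": "application/json"},
    )
    return request.urlopen(req)
```

The same definition could equally be submitted from the CLI with curl or the Marathon command-line client.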
orchestration w/ Kubernetes on Mesos
① configuration/package management
② deployment
multiple schedulers
going deeper …
two-level scheduling
Mesos influenced by multi-level scheduling in traditional operating systems (user-space scheduling and scheduler activations)
Mesos is designed less like a “cluster manager” and more like an operating system kernel
Apache Mesos: distributed systems kernel
Apache Mesos: datacenter kernel
Mesos (nodes)
Mesos: datacenter kernel
scheduler
Mesos (master)
scheduler
syscall-like API for the datacenter
Mesos: datacenter kernel + enable running multiple distributed systems on the same cluster of machines and dynamically share the resources more efficiently!
Mesos: datacenter kernel + enable building new distributed systems by providing common functionality (primitives) that every new distributed system re-implements
Mesos primitives
• principals, users, roles
• advanced fair-sharing allocation algorithms
• high-availability (even during upgrades)
• resource monitoring
• preemption/revocation
• volume management
• reservations (dynamic/static)
• …
build on top of Mesos
① don’t reinvent the wheel: leverage primitives to implement/automate failures, maintenance, etc.
build on top of Mesos
② make it easier for your users to use your software!
built on Mesos (2009, 2010, 2013, 2014)
ported to Mesos (2011, 2012, 2013, 2014)
Mesos is being run at: (company logos, 2010, 2013, 2014, …)
going even deeper …
Mesos resource requests/offers
masters
scheduler
request 3 CPUs 2 GB RAM
a request is a purposely simplified subset of a specification, mainly including the resources required at that point in time
masters
scheduler
offer hostname 4 CPUs 4 GB RAM
offer hostname 4 CPUs 4 GB RAM
offer hostname 4 CPUs 4 GB RAM
offer hostname 4 CPUs 4 GB RAM
scheduler uses the offers to decide what tasks to run
“two-level scheduling”
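The offer cycle can be sketched as a toy simulation: the first level (Mesos) offers each slave’s resources, and the second level (the framework scheduler) decides which tasks to launch. All names below are illustrative stand-ins, not the real Mesos API:

```python
# Toy sketch of "two-level scheduling": Mesos offers whole-node resources;
# the framework's scheduler greedily places tasks against those offers.

offers = [{"hostname": "node-%d" % i, "cpus": 4, "mem_gb": 4}
          for i in range(4)]

def schedule(offers, tasks):
    """Place each task on the first offer with enough CPU and RAM left."""
    placements = []
    for task in tasks:
        for offer in offers:
            if offer["cpus"] >= task["cpus"] and offer["mem_gb"] >= task["mem_gb"]:
                offer["cpus"] -= task["cpus"]      # consume part of the offer
                offer["mem_gb"] -= task["mem_gb"]
                placements.append((task["name"], offer["hostname"]))
                break
    return placements

tasks = [{"name": "web-%d" % i, "cpus": 3, "mem_gb": 2} for i in range(3)]
placements = schedule(offers, tasks)
print(placements)
```

Crucially, the placement policy lives in the framework, not in Mesos: Mesos only decides *which resources* to offer, the scheduler decides *what to run* on them.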
Mesos task/executor model
masters
scheduler
scheduler uses the offers to decide what tasks to run
task 3 CPUs 2 GB RAM
slave
a task with a command
mesos-slave
task task
slave
a task with an executor
mesos-slave
executor
task task task
slave
task/executor isolation
mesos-slave
executor
task task
containers
slave
a task with a Docker image
mesos-slave
Docker image
masters
master failover
scheduler
after a new master is elected, all schedulers and slaves connect to the new master; all tasks keep running across master failover!
masters
scheduler failover
scheduler
the scheduler re-registers with the master and resumes operation; all tasks keep running across framework failover!
scheduler
slave
slave failover
mesos-slave
task task
the mesos-slave can be restarted or upgraded without killing running tasks
(large in-memory services, expensive to restart)
down the rabbit hole …
the Mesos first-level scheduler allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)
DRF, born of static partitioning
datacenter
static partitioning across teams
promotions trends recommendations team
fairly shared!
goal: fairly share the resources without static partitioning
partition utilizations:
promotions: 45% CPU, 100% RAM
trends: 75% CPU, 100% RAM
recommendations: 100% CPU, 50% RAM
observation: each team is bottlenecked by a dominant resource that prevents it from running any more jobs/tasks
dominant resource bottlenecks:
promotions: 45% CPU, 100% RAM → bottleneck: RAM
trends: 75% CPU, 100% RAM → bottleneck: RAM
recommendations: 100% CPU, 50% RAM → bottleneck: CPU
insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!
… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!
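This insight is exactly what DRF implements: repeatedly grant a task to the framework with the lowest dominant share. A runnable sketch using the classic worked example from the DRF paper (9 CPUs and 18 GB total; A’s tasks each need ⟨1 CPU, 4 GB⟩, B’s each need ⟨3 CPUs, 1 GB⟩):

```python
# Minimal Dominant Resource Fairness (DRF) allocator sketch.

total = {"cpus": 9.0, "mem_gb": 18.0}
demands = {"A": {"cpus": 1.0, "mem_gb": 4.0},   # A's tasks are RAM-heavy
           "B": {"cpus": 3.0, "mem_gb": 1.0}}   # B's tasks are CPU-heavy

used = {r: 0.0 for r in total}
allocated = {user: {r: 0.0 for r in total} for user in demands}
tasks = {user: 0 for user in demands}

def dominant_share(user):
    """A user's share of the resource they consume the most of."""
    return max(allocated[user][r] / total[r] for r in total)

while True:
    # Offer next to the framework with the lowest dominant share.
    user = min(demands, key=dominant_share)
    demand = demands[user]
    if any(used[r] + demand[r] > total[r] for r in total):
        break  # cluster is saturated (here neither framework fits another task)
    for r in total:
        used[r] += demand[r]
        allocated[user][r] += demand[r]
    tasks[user] += 1

print(tasks)  # A runs 3 tasks (dominant: RAM), B runs 2 (dominant: CPU)
```

Both frameworks end up with a dominant share of 2/3, i.e., each does at least as well as owning a static partition.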
Step 4: Profit (statistical multiplexing) $
in practice, fair sharing is insufficient
weighted fair sharing
promotions: weight 0.17; trends: weight 0.5; recommendations: weight 0.33
Mesos implements weighted DRF
masters
masters can be configured with weights per role; resource allocation decisions incorporate the weights to determine dominant fair shares
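One way to sketch the weighting: divide each role’s dominant share by its weight and offer next to the role with the lowest result. The weights below are the ones from the slide; the current dominant shares are assumed purely for illustration:

```python
# Weighted DRF sketch: a role's dominant share is divided by its weight,
# so higher-weighted roles reach "fairness" later and thus receive more.

weights = {"promotions": 0.17, "trends": 0.5, "recommendations": 0.33}

# Assumed current dominant shares (fraction of each role's dominant resource).
dominant_shares = {"promotions": 0.10, "trends": 0.30, "recommendations": 0.10}

def weighted_share(role):
    return dominant_shares[role] / weights[role]

# The allocator offers next to the role with the lowest weighted share.
next_role = min(weights, key=weighted_share)
print(next_role)
```

With equal weights this reduces to plain DRF; here recommendations wins the next offer despite being tied with promotions on raw share, because its weight is larger.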
in practice, weighted fair sharing is still insufficient
a non-cooperative framework (i.e., one with long tasks, or a buggy one) can hoard too many resources
resource reservations
resources on individual slaves can be reserved for particular roles; resource offers include the reservation role (if any)
masters
framework (trends)
offer hostname 4 CPUs 4 GB RAM role: trends
reservations
slave: resources on the mesos-slave are partitioned among * (unreserved), role-foo, and role-bar
static reservations
reservations available in Mesos using the --resources flag on each mesos-slave:
$ mesos-slave --resources='cpus(role-foo):2;mem(role-foo):1024;cpus(role-bar):2;mem(role-bar):1024;cpus(*):4;mem(*):4096'
static reservations
+ strong guarantees
− set up by an operator when starting the slave
− immutable (must drain/restart the slave)
dynamic reservations
a framework scheduler reserves resources at runtime when it accepts an offer (allocation)
(1) masters send an offer (hostname, 4 CPUs, 4 GB RAM) to the framework (role-baz)
(2) the framework accepts with Launch/Reserve(4 CPUs, 4 GB RAM)
(3) masters apply Launch/Reserve(4 CPUs, 4 GB RAM) on the slave
reservations: promotions 40%, trends 20%, recommendations 40% (one reservation only 10% used, 30% unused)
reservations provide guarantees, but at the cost of utilization
revocable resources
masters
framework (promotions)
resources that are reserved for another role and thus may be revoked at any time
offer hostname 4 CPUs 4 GB RAM role: trends
preemption via revocation
… my tasks will not be killed unless I’m using revocable resources!
oversubscription via revocable resources
masters
framework (promotions)
oversubscribe resources by allocating unused resources as revocable!
offer hostname 4 CPUs 4 GB RAM role: trends
revocation “guarantee”
… my tasks will not be killed unless I’m using revocable resources!
what about when:
① want/need to defrag (think page replacement algorithm!)
② maintenance
mechanism for deallocation
possible solution: introduce failures to deallocate resources
but why not communicate explicitly!?
inverse offer
a framework scheduler gets deallocation requests in the form of inverse offers
(1) masters send inverse offer-1 (hostname, 4 CPUs, 4 GB RAM) to the framework (role-baz)
(2) the framework scheduler can kill its tasks and acknowledge the deallocation (Kill/ACK)
maintenance
① drain a machine by sending out inverse offers
② while draining, still send out offers with revocable resources
③ remove the machine from allocation once drained (or at some specific time)
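The drain sequence can be sketched as a toy state machine; the names and functions below are illustrative stand-ins, not the Mesos API:

```python
# Toy sketch of maintenance via inverse offers: the master asks the
# framework to give back a machine's resources; once the framework kills
# (or migrates) its tasks and ACKs, the machine leaves the allocator.

machines = {"node-1": ["task-a", "task-b"], "node-2": ["task-c"]}
draining = set()

def send_inverse_offer(machine):
    """Step 1: master asks frameworks to vacate `machine`."""
    draining.add(machine)

def framework_kill_and_ack(machine):
    """Step 2: the framework kills/migrates its tasks and acknowledges."""
    machines[machine] = []

def drained(machine):
    """Step 3: a draining machine with no tasks can be removed."""
    return machine in draining and not machines[machine]

send_inverse_offer("node-1")
framework_kill_and_ack("node-1")
print(drained("node-1"), drained("node-2"))  # True False
```

The point of the explicit protocol is that the framework, not the master, chooses when and how to vacate, instead of learning about maintenance through induced failures.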
but what about my persistent data?
persistent volumes
a framework scheduler creates volumes for disk resources at runtime when it accepts an offer (allocation)
(1) masters send an offer (hostname, 4 CPUs, 4 GB RAM) to the scheduler (role-baz)
(2) the scheduler accepts with Launch/Create(1 TB DISK)
(3) masters apply Launch/Create(1 TB DISK) on the slave
persistent volumes
volumes are created before launching any tasks or executors
volumes are mounted into the container when a task or executor gets launched
volumes persist even after the task or executor terminates!
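That lifecycle can be sketched as a toy model in which volume state outlives the tasks that mount it (illustrative names only, not the Mesos API):

```python
# Toy sketch of persistent volumes: a volume created against reserved
# disk survives the task that used it, so a replacement task can
# reattach and see prior state.

volumes = {}   # volume id -> data that outlives tasks

def create_volume(vol_id):
    """Volume is created before any task or executor is launched."""
    volumes.setdefault(vol_id, [])

def run_task(vol_id, record):
    """A task mounts the volume, appends state, then terminates."""
    volumes[vol_id].append(record)

create_volume("role-baz/db")
run_task("role-baz/db", "write-1")   # first task writes and exits
run_task("role-baz/db", "write-2")   # a replacement task sees prior state
print(volumes["role-baz/db"])        # ['write-1', 'write-2']
```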
the bigger picture
reservations + inverse offers + persistent volumes = long-lived stateful frameworks!
conclusion
our “other” computers are datacenters*
* collection of physical and/or virtual machines
the datacenter is just another form factor
Mesos is the kernel …
desktop → OS, server → OS, datacenter → ?
need a datacenter operating system
Mesosphere’s DCOS
Marathon
Marathon, a scheduler for running stateless services written in any language
init for the datacenter operating system
Chronos
Chronos, a scheduler for running cron jobs with dependencies
cron for the datacenter operating system
DCOS CLI
mesosphere.com
Thanks!