Seven problems of Linux Containers
parallels.com || openvz.org || criu.org
Seven Problems of Linux Containers
Kir Kolyshkin <[email protected]>
28 April 2013, LinuxFest Northwest
Seventy-Seven Problems of Linux Containers
Kir Kolyshkin <[email protected]>
28 April 2013, LinuxFest Northwest
(of which I am going to cover six)
Problem 1: Effective virtualization
● Virtualization is partitioning
● Historical way: $M mainframes
● Modern way: virtual machines
● Problem: performance overhead
● Partial solution: hardware support (Intel VT, AMD-V)
Solution: isolation
● Run many isolated userspace instances on top of one single (Linux) kernel
● All processes see each other
  – files, process information, network, shared memory, users, etc.
● Make them unsee it!
One historical way to unsee
chroot()
Namespaces
● Implemented in the Linux kernel
  – PID
  – net
  – IPC
  – UTS
  – mnt
  – user
● clone() with CLONE_NEW* flags
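
To make the CLONE_NEW* flags a little more concrete, here is a minimal Python sketch that builds the flag mask one would pass to clone() for a chosen set of namespaces. The numeric values are the ones defined in the kernel header <linux/sched.h>; the helper name clone_flags_for is hypothetical, and the sketch only computes the mask, it does not create any namespace.

```python
# Flag values as defined in the Linux kernel header <linux/sched.h>.
CLONE_NEW_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def clone_flags_for(namespaces):
    """Hypothetical helper: OR together the CLONE_NEW* flags for the given names."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_NEW_FLAGS[name]  # raises KeyError for an unknown name
    return flags
```

For example, clone_flags_for(["pid", "net", "mnt"]) yields the mask for a process that gets its own PID, network and mount namespaces while still sharing the others with its parent.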
Problem 2: Shared resources
● All containers share the same set of resources (CPU, RAM, disk, various kernel things ...)
● Need fair distribution of goods so everyone gets their share
● Need DoS prevention
● Need prioritization
  – “All animals are equal, but some animals are more equal than others” -- George Orwell
Solution: OpenVZ resource controls
● OpenVZ:
  – user beancounters
    ● controls 20 parameters
  – hierarchical CPU scheduler
  – per-container disk quota
  – per-container I/O priorities
● Dynamic control, can “resize” at runtime
Solution: cgroups
● Cgroups is a mechanism to control resources per hierarchical groups of processes
● Cgroups is nothing without controllers:
  – blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio
● Cgroups are orthogonal to namespaces
● Still a work in progress (kernel memory)
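
The idea of controlling resources per hierarchical group can be sketched with a toy model. This is not a kernel interface: the Group class, its fields and the group names are all hypothetical, and the sketch assumes that a child group can never use more memory than any of its ancestors allows.

```python
class Group:
    """Toy model of one node in a cgroup-like hierarchy (not a kernel interface)."""

    def __init__(self, name, memory_limit=None, parent=None):
        self.name = name
        self.memory_limit = memory_limit  # bytes, or None for "no limit set here"
        self.parent = parent

    def effective_memory_limit(self):
        """Smallest limit on the path from this group up to the root, or None."""
        limits = []
        node = self
        while node is not None:
            if node.memory_limit is not None:
                limits.append(node.memory_limit)
            node = node.parent
        return min(limits) if limits else None


GIB = 1024 ** 3
root = Group("root")
containers = Group("containers", memory_limit=8 * GIB, parent=root)
ct101 = Group("ct101", memory_limit=2 * GIB, parent=containers)
ct102 = Group("ct102", memory_limit=16 * GIB, parent=containers)  # capped by its parent
```

In this model ct102 asks for 16 GiB but is effectively held to the 8 GiB of its parent group, which is the kind of behavior a hierarchy is meant to give.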
Problem 3: easy resources
● User Beancounters are complicated:
  – http://wiki.openvz.org/UBC_consistency_check
  – user has to set all these parameters
  – some of which are interdependent
● We created a collection of valid configs,
● ... wrote a whole book about UBC
● ... and a set of tools to help
Solution: VSwap
● Only two primary parameters: RAM and swap
  – others still exist, but no longer need to be set
● Swap is virtual, no actual I/O is performed
● Slow down to emulate real swap
● Only when actual global RAM shortage occurs, virtual swap goes into the real swap
● Currently only available in OpenVZ kernel
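
The VSwap accounting described above can be approximated by a small model. This is a deliberately simplified sketch, not the OpenVZ kernel logic: the function name, the page-based units and the single host_ram_shortage switch are assumptions made for illustration, and the artificial slowdown is not modeled.

```python
def vswap_placement(used_pages, ram_limit, swap_limit, host_ram_shortage=False):
    """Toy model: where do a container's pages end up under RAM and swap limits?"""
    if used_pages > ram_limit + swap_limit:
        raise MemoryError("container exceeds its RAM + swap limits")
    in_ram = min(used_pages, ram_limit)
    over = used_pages - in_ram  # pages beyond the container's RAM limit
    if host_ram_shortage:
        # Only a real, global shortage pushes the excess out to real swap.
        return {"ram": in_ram, "virtual_swap": 0, "real_swap": over}
    # Otherwise the excess is only accounted as swap; no disk I/O takes place.
    return {"ram": in_ram, "virtual_swap": over, "real_swap": 0}
```

The point of the model is that the container sees the same two limits in both cases; only the host's state decides whether "swap" means real disk I/O.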
Problem 4: fast live migration
● We can migrate an OpenVZ container from one physical server to another without a shutdown
● We want to do it fast even for huge containers
  – huge disk: use shared storage
  – huge RAM: ???
Normal migration process
● (Assuming shared storage)
● 1 Freeze the container
● 2 Dump its complete state to a dump file
● 3 Copy dump file to destination server
● 4 Undump
● 5 Unfreeze
● Problem: huge dump file
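
Why a huge dump file hurts can be seen with back-of-the-envelope arithmetic: the container stays frozen at least as long as step 3 takes. The sketch below estimates only that copy time; the function name and the example figures (a 64 GiB dump, a 10 Gbit/s link) are illustrative assumptions, not measurements.

```python
def dump_copy_seconds(dump_bytes, link_bits_per_second):
    """Time to copy a dump file over the network, ignoring dump and undump time."""
    if link_bits_per_second <= 0:
        raise ValueError("link speed must be positive")
    return dump_bytes * 8 / link_bits_per_second


GIB = 1024 ** 3
seconds = dump_copy_seconds(64 * GIB, 10 * 10**9)  # assumed: 64 GiB dump, 10 Gbit/s link
```

With these assumed numbers the copy alone takes about 55 seconds, all of it spent with the container frozen.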
Solution 1: network swap
● 1 Dump the minimal memory, lock the rest
● 2 Restore the minimal memory, mark the rest as swapped out
● 3 Set up network swap from the source
● 4 Unfreeze. Missing RAM will be “swapped in”
● 5 Migrate the rest of RAM and kill it on source
Solution 1: network swap
● 1 Dump the minimal memory, lock the rest
● 2 Copy, undump what we have, mark the rest as swapped out
● 3 Set up network swap served from the source
● 4 Unfreeze. Missing RAM will be “swapped in”
● 5 Migrate the rest of RAM and kill it on source
● PROBLEM? Reliability, no way to roll back
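
The network swap steps can be mimicked with a toy model in which the destination starts with a minimal set of pages and fetches the rest from the source on first access. The class and method names are hypothetical, and a plain dictionary stands in for the network swap served by the source.

```python
class LazyDestination:
    """Toy model: pages are fetched from the source on first access."""

    def __init__(self, source_pages, minimal_set):
        self._source = source_pages  # stands in for network swap served by the source
        # Steps 1-2: only the minimal set is dumped, copied and restored up front.
        self.local = {n: source_pages[n] for n in minimal_set}
        self.faults = 0

    def read(self, page_no):
        if page_no not in self.local:  # page is "swapped out": fetch it remotely
            self.local[page_no] = self._source[page_no]
            self.faults += 1
        return self.local[page_no]

    def migrate_rest(self):
        """Step 5: pull every page not yet fetched; returns True once all are local."""
        for page_no, data in self._source.items():
            self.local.setdefault(page_no, data)
        return len(self.local) == len(self._source)
```

The model also hints at the reliability problem: until migrate_rest() has finished, the destination cannot run without the source, and the source no longer holds the current state either.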
Solution 2: Iterative RAM migration
● 1 Ask kernel to track modified pages
● 2 Copy all memory to destination system
● 3 Ask kernel for list of modified pages
● 4 Copy those pages
● 5 GOTO 3 until satisfied
● 6 Freeze and do migration as usual
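
The loop in steps 1 to 6 can be simulated in a few lines. This is a sketch under stated assumptions: memory is a dictionary of pages, the dirty_rounds argument is a hypothetical stand-in for the kernel's modified-page tracking (each entry is what the workload writes while one copy round is in flight), and "satisfied" is taken to mean that the dirty set has shrunk to a threshold.

```python
def iterative_migrate(source, dirty_rounds, threshold=1, max_rounds=10):
    """Toy pre-copy loop; returns (destination, pages_copied_while_frozen)."""
    dest = dict(source)                # step 2: copy all memory
    pending = iter(dirty_rounds)
    dirty = set()
    for _ in range(max_rounds):
        writes = next(pending, {})     # the workload keeps running during the copy
        source.update(writes)
        dirty = set(writes)            # step 3: ask for the list of modified pages
        if len(dirty) <= threshold:    # step 5: satisfied, stop iterating
            break
        for page_no in dirty:          # step 4: copy those pages
            dest[page_no] = source[page_no]
        dirty = set()
    for page_no in dirty:              # step 6: freeze, copy what is still dirty
        dest[page_no] = source[page_no]
    return dest, len(dirty)
```

The design point is that only the last, small dirty set is copied while the container is frozen, so the frozen time no longer grows with the total amount of RAM.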
Problem 5: upstreaming
● OpenVZ was developed separately
● Then we wanted to merge it upstream (i.e. to vanilla Linux kernel)
● Problem?
Problem 5: upstreaming
● OpenVZ was developed separately
● Then we wanted to merge it upstream (i.e. to vanilla Linux kernel)
● Problem: upstream devs are not accepting our work
Solution 1: rewrite from scratch
● User Beancounters -> CGroups
● Did 2 rewrites for PID namespace until it finally got accepted
● Network namespace redone
● It works!
● About 1500 patches landed in vanilla
● Parallels made it to the top 10 contributors
Solution 2: CRIU
● We tried hard to merge checkpoint/restore
● Other people tried hard too, no luck
● Can't make it to the kernel, let's go userspace
● With minimal kernel intervention when required
● Kernel exports most of information already, so let's just add missing bits and pieces
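
As one example of information the kernel already exports to userspace, a process's memory mappings are listed in /proc/<pid>/maps. The sketch below parses a single line in that format. It is an illustration of the general idea, not CRIU's code: the Mapping type and the parse_maps_line helper are hypothetical, and the sample line is made up.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str


def parse_maps_line(line):
    """Parse one line in the /proc/<pid>/maps format described in proc(5)."""
    fields = line.split(None, 5)  # address perms offset dev inode [pathname]
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


SAMPLE = "00400000-0040b000 r-xp 00000000 08:01 1234 /bin/cat"  # made-up example line
m = parse_maps_line(SAMPLE)
```

A userspace checkpoint tool can collect a good part of a process's state from interfaces of this kind, which is why only the missing bits need kernel patches.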
CRIU
● Checkpoint / Restore (mostly) In Userspace
● Tools currently at version 0.4
● Will do 1.0 release this year
● Kernel 3.8 has about 120 patches from us
  – 95% of needed features are there
● Memory snapshot recently made it to -mm tree
Problem 6: common file system
● Container is just a directory on host, all CTs reside on the same FS
● File system journal is a bottleneck
● Lots of small-file I/O on CT backup
● No sub-tree disk quota support in upstream
● No per-container snapshots
● Live migration: rsync -- changed inodes
● File system type and properties are fixed
Solution 1: LVM
● Works only on top of a block device
● Hard to manage (e.g. how to migrate a huge volume?)
● No dynamic allocation
● Complicated management
Solution 2: loop device
● VFS operations lead to double page-caching
  – (already fixed in recent kernels)
● No dynamic allocation, max space is used
● Limited feature set
Solution 3: ploop
● Basic idea: same as loop, just better
● Modular design:
  – various image formats (qcow2 in TODO)
  – various I/O backends
● More features:
  – live resize
  – instant live snapshots
  – write tracker to help in live migration
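
How a write tracker helps live migration can be shown with a toy block image that allocates blocks on first write and records which blocks were written while tracking is on. This is not ploop's image format or API; the TrackedImage class and all of its methods are hypothetical.

```python
class TrackedImage:
    """Toy block image: blocks are allocated on first write, writes can be tracked."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # block number -> bytes; absent blocks read as zeros
        self._tracking = False
        self._dirty = set()

    def write(self, block_no, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[block_no] = bytes(data)
        if self._tracking:
            self._dirty.add(block_no)

    def read(self, block_no):
        return self.blocks.get(block_no, bytes(self.block_size))

    def start_tracking(self):
        self._tracking = True
        self._dirty.clear()

    def take_dirty(self):
        """Return the blocks written since the last call and reset the set."""
        dirty, self._dirty = self._dirty, set()
        return sorted(dirty)
```

A migration tool built on such a tracker could copy the whole image once while the container runs, then repeatedly copy only the blocks reported by take_dirty(), in the same spirit as iterative RAM migration.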
Any problems questions?
● [email protected]
● Twitter: @kolyshkin