2. Vagin. Linux containers. June 01, 2013
ru-fedora-moscow-2013 -
2
Different types of Virtualization
● Virtual Machines
– Emulation (qemu)
– Paravirtualization (XEN)
– Hardware Virtualization (KVM, ESX)
● OS Level Virtualization
– Containers (Linux Containers, Solaris Zones, BSD Jails)
3
Virtual Machine (VM)
[Diagram: Hardware at the bottom, a Hypervisor above it, and four guests on top, each with its own Virtual HW, Kernel and Apps]
4
Containers (CT)
[Diagram: Hardware at the bottom, one Host Kernel above it, and four containers on top, each with its own Namespaces and Apps]
- chroot() on steroids
7
Comparison VM-s vs CT-s
● VM: One real HW, many virtual HW, many OS-s
● CT: One real HW, one kernel, many userspace instances
● VM: Full control on the guest OS
● CT: Native performance: [almost] no overhead
● CT: High density
● VM: KSM (Kernel SamePage Merging)
● CT: Use resources on demand
● CT: Dynamic resource allocation
● CT: Naturally share pages
● VM: Depends on hardware (VT-x, VT-d, EPT, etc.)
● CT: Not all functionality is virtualized
● Flexibility
10
Evolution of Operating System
● Multitask: many processes
● Multiuser: many users
● Multicontainer: many containers
11
Containers (CT)
Cgroups – control resources
● cpu, cpuacct, cpuset
● blkio
● memory
● net_cls
Namespaces – isolate environments
● MNT
● PID
● NET
● IPC
● User
● UTS
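To see which namespaces a process belongs to, one option is to read the symlinks under /proc/PID/ns. The snippet below is only a minimal sketch: the helper name `list_namespaces` is made up for illustration, and it assumes a Linux host with /proc mounted (it returns an empty dict otherwise). Two processes whose links for the same namespace type match are in the same namespace.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace_name: link_target} for the given PID.

    Hypothetical helper for illustration. On Linux each entry in
    /proc/<pid>/ns is a symlink whose target identifies the namespace.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        # No /proc, no such PID, or not permitted: report nothing.
        return result
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return result


if __name__ == "__main__":
    for name, target in list_namespaces().items():
        print(name, target)
```

Running it inside and outside a container and comparing the link targets shows which namespaces the container actually unshared.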
12
How to execute CT
All allowed by default
● unshare, nsenter
● Systemd Lightweight Containers
● LXC
● Libvirt LXC
All restricted by default
● OpenVZ (vzctl-core) (FC19)
13
vzctl - perform various operations on a container
# yum install -y vzctl-core
# vzctl create 101 --ostemplate fedora-15
# vzctl start 101
# vzctl exec 101 ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 init
11830 ?        Ss     0:00 syslogd -m 0
11897 ?        Ss     0:00 /usr/sbin/sshd
11943 ?        Ss     0:00 xinetd -stayalive -pidfile ...
12218 ?        Ss     0:00 sendmail: accepting connections
12265 ?        Ss     0:00 sendmail: Queue runner@01:00:00
13362 ?        Ss     0:00 /usr/sbin/httpd
13363 ?        S      0:00  \_ /usr/sbin/httpd
..............................................
 6416 ?        Rs     0:00 ps axf
# vzctl stop 101
# vzctl destroy 101
14
OpenVZ kernel only features
● Ploop (snapshot, backups, different formats)
● Second level quota
● More functional memory accounting
● PFCache (memory deduplication, I/O operation savings)
● Better isolation compared with the FC19 kernel (which lacks userns)
Questions?
http://openvz.org
Andrey Vagin
CRIU - Checkpoint/Restore in User-space
17
What is C/R and how can it be used?
C/R is the ability to save states of processes and to restore them later.
Usage scenarios:
– Failure recovery
– Live migration
– Reboot-less upgrade
– Speed up of slow-boot services
– HPC issues
18
History
● Berkeley Lab Checkpoint/Restart (BLCR) (2003)
– Load a kernel module and link with a library
● DMTCP: Distributed MultiThreaded CheckPointing (2004-2006)
– Preload a library
● OpenVZ (2005)
– OpenVZ kernel
● Linux Checkpoint/Restart by Oren Laadan (2008)
– A non-mainline kernel
● CRIU (2011)
[Timeline: BLCR 2003, OpenVZ 2005, DMTCP 2007, Linux C/R 2008, CRIU 2011]
19
How does this work?
[Diagram: crtools sits between the process tree with its kernel objects (name-spaces, files, sockets, pipes) and the image files]
20
Kernel interfaces
Dump Restore
syscalls
netlink
/proc/
ptrace
21
Dump
● Parasite code
– Receive file descriptors
– Dump memory content
– prctl(), sigaction, pending signals, timers, etc.
● Ptrace
– Freeze processes
– Inject parasite code
● Netlink
– Get information about sockets, netns
● Procfs
– /proc/PID/maps, /proc/PID/map_files/, /proc/PID/status, /proc/PID/mountinfo
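As an illustration of the procfs part of a dump, the sketch below parses text in the /proc/PID/maps format into records. It is not CRIU code: the `Mapping` type, the `parse_maps` helper and the sample text are invented for this example, and the only assumption is the standard line layout `start-end perms offset dev inode [path]`.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int    # first address of the mapping
    end: int      # address just past the end of the mapping
    perms: str    # e.g. "r-xp"; last char is "p" (private) or "s" (shared)
    offset: int   # offset into the backing file
    path: str     # backing file, pseudo-name like "[stack]", or ""


def parse_maps(text: str) -> List[Mapping]:
    """Parse the contents of a /proc/<pid>/maps file (illustrative helper)."""
    mappings = []
    for line in text.splitlines():
        parts = line.split(None, 5)   # the path may contain spaces
        if len(parts) < 5:
            continue                  # skip blank or malformed lines
        addr, perms, offset = parts[0], parts[1], parts[2]
        path = parts[5].strip() if len(parts) > 5 else ""
        start_s, end_s = addr.split("-")
        mappings.append(
            Mapping(int(start_s, 16), int(end_s, 16), perms, int(offset, 16), path)
        )
    return mappings


# Made-up sample in the maps format, so the sketch runs anywhere.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 1234                 /bin/cat
7f0000000000-7f0000001000 rw-s 00000000 00:04 5678         /dev/zero (deleted)
7f0000002000-7f0000003000 rw-p 00000000 00:00 0
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0            [stack]
"""

if __name__ == "__main__":
    for m in parse_maps(SAMPLE):
        print(hex(m.start), hex(m.end), m.perms, m.path or "(anonymous)")
```

On a live system the same function can be fed the contents of /proc/self/maps; shared mappings (perms ending in "s") are the ones whose owners have to be worked out during a dump.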
22
Restore
● Collect shared objects
● Restore name-spaces
● Create a process tree
– Restore SID, PGID
– Restore objects that should be inherited
● Files, sockets, pipes, ...
● Restore per-task properties.
● Restore memory
● Call sigreturn
● Awesome
23
Interesting points
● How to restore shared objects?
– Send file descriptors via unix sockets
– Map files from /proc/self/map_files/ for restoring anon shared mappings
● How to restore memory mappings at the correct places?
– Map a new code block and a stack
– Unmap crtools' mappings
– Remap the task's mappings to the correct places
● How to resume a process?
– Create a signal frame
– Call sigreturn()
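The "send file descriptors via unix sockets" step relies on the standard SCM_RIGHTS ancillary message. Here is a minimal sketch of that mechanism, not taken from crtools: the helper names are invented, and for brevity both ends of the socket pair live in one process, whereas a real restore would pass the descriptor between different tasks.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a unix socket using SCM_RIGHTS."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary data.
    sock.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def demo():
    """Pass the read end of a pipe through a socket pair and read via the copy."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()
    try:
        send_fd(a, r)
        r_copy = recv_fd(b)          # new fd number, same underlying pipe
        try:
            os.write(w, b"hello")
            return os.read(r_copy, 5)
        finally:
            os.close(r_copy)
    finally:
        os.close(r)
        os.close(w)
        a.close()
        b.close()


if __name__ == "__main__":
    print(demo())
```

The received descriptor refers to the same open file as the original, which is what makes it usable for re-creating objects shared between restored processes.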
24
Kernel impact
~140 patches merged; ~10 patches in flight
~11 new features appeared; ~2 new features to come
25
New features in a kernel
● Parasite code injection (by Tejun Heo)
– Read task state that a task can currently retrieve only about itself
● The kcmp() system call
– Helps check which kernel objects are shared between processes
● Proc map_files directory
– Find out exactly which file is mapped
– Mappings sharing info
● A bunch of prctl extensions
– Set various private stuff on task/mm objects (c/r-only feature)
● Last-pid sysctl
– Restore task with desired PID value
26
New features in a kernel
● TCP repair mode
– Read the internal state of a TCP connection and reconstruct it from scratch on a freshly created socket
● Sockets information dumping via netlink (sock_diag)
– Extensible engine for retrieving socket state
● Virtual net devices indexes
– Allows restoring network devices in a namespace
● Socket peeking offset
– Allows peeking at socket queues (reading without removing data from the queue)
● Task memory tracking
– incremental snapshots, online migration
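The idea behind peeking at socket queues can be shown with the long-standing MSG_PEEK flag. This is only a sketch of "reading without removing data from the queue": it does not use the peek-offset feature itself, and the helper name `peek_then_read` is made up for the example.

```python
import socket


def peek_then_read(payload: bytes):
    """Queue payload on a unix socket pair, peek at it, then really read it.

    Returns (peeked, consumed, empty_after): the peek leaves the data in
    the receive queue, the normal read removes it.
    """
    a, b = socket.socketpair()
    try:
        a.sendall(payload)
        peeked = b.recv(len(payload), socket.MSG_PEEK)   # data stays queued
        consumed = b.recv(len(payload))                  # data is removed
        b.setblocking(False)
        try:
            b.recv(1)
            empty_after = False
        except BlockingIOError:
            empty_after = True                           # nothing left to read
        return peeked, consumed, empty_after
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(peek_then_read(b"queued data"))
```

A checkpoint tool needs this non-destructive style of read so that the data is still in the queue when the application resumes.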
27
What is already supported?
– X86_64 architecture
– Process tree linkage
– Multi-threaded apps
– All kinds of memory mappings
– Terminals, groups, sessions
– Open files (shared and unlinked)
– Established TCP connections
– Unix sockets, Packet sockets
– Name-spaces (net, mount, ipc)
– Non-posix files (epoll, inotify)
– Pipes, Fifo-s, IPC, ...
– ARM architecture
– Pending signals
– TCP time-stamps
– Iterative snapshots
– VDSO
– LXC and OpenVZ containers
In flight
– Posix timers
– Convert OpenVZ images
28
How is CRIU tested?
● ZDTM – a set of unit-tests
● Real-life applications
– Apache, Nginx
– MySQL, MongoDB, Oracle
– Make && gcc
– Tar & gzip
– Screen
– Java
– LXC
– VNC server + GUI applications
29
Future plans (Feb, 2013)
● Support all kinds of kernel objects
● Merge all in-flight patches into the mainline kernel
● Integrate CRIU with OpenVZ and LXC utilities
● Iterative migration
– Migrate memory content before freezing applications
● Integration in distributions
– CRIU was accepted into Fedora 19
30
How to use
● ./crtools dump -t pid [<options>]
– checkpoint a process/tree identified by pid
● ./crtools restore -t pid [<options>]
– restore a process/tree identified by pid
● ./crtools show (-D dir)|(-f file) [<options>]
– show dump file(s) contents
● ./crtools check
– checks whether the kernel support is up-to-date
● ./crtools exec -t pid <syscall-string>
– execute a system call in another task
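For scripting, the invocations above can be assembled as argument lists. The sketch below only builds the command lines shown on this slide; the helper name `crtools_cmd` is hypothetical, the binary path `./crtools` is taken from the slide, and no extra options are assumed.

```python
def crtools_cmd(action, pid=None, binary="./crtools"):
    """Build an argv list for the crtools actions listed on the slide.

    Illustrative helper: it covers only "dump", "restore" and "check",
    with no options beyond "-t pid".
    """
    if action in ("dump", "restore"):
        if pid is None:
            raise ValueError(f"{action} needs a pid")
        return [binary, action, "-t", str(pid)]
    if action == "check":
        return [binary, "check"]
    raise ValueError(f"unsupported action: {action}")


if __name__ == "__main__":
    # Print the commands instead of running them; running them would need
    # a crtools binary and, typically, root privileges.
    for argv in (crtools_cmd("check"),
                 crtools_cmd("dump", 1234),
                 crtools_cmd("restore", 1234)):
        print(" ".join(argv))
```

The lists can be handed to subprocess.run() once a crtools binary is actually present.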
31
Checkpoint/restore of a VNC server.
Questions?
http://criu.org