2. Vagin. Linux containers. June 01, 2013

Andrey Vagin
1 June 2013, Moscow
Linux Containers
Fedora Virtualization Day

Page 2: 2. Vagin. Linux containers. June 01, 2013

Different types of Virtualization

● Virtual Machines

– Emulation (qemu)

– Paravirtualization (XEN)

– Hardware Virtualization (KVM, ESX)

● OS Level Virtualization

– Containers (Linux Containers, Solaris Zones, BSD Jails)

Page 3: 2. Vagin. Linux containers. June 01, 2013

Virtual Machine (VM)

[Diagram: Hardware at the bottom, a Hypervisor on top of it, and four virtual machines above, each with its own Virtual HW, Kernel and Apps.]

Page 4: 2. Vagin. Linux containers. June 01, 2013

Containers (CT)

[Diagram: Hardware at the bottom, a single Host Kernel on top of it, and four containers above, each consisting of Apps running inside its own Namespaces.]

- chroot() on steroids

Page 6: 2. Vagin. Linux containers. June 01, 2013

Comparison: VMs vs CTs

● One real HW, many virtual HW, many OSes

● One real HW, one kernel, many userspace instances

● Full control over the guest OS

● Native performance: [almost] no overhead

● High density

● KSM (Kernel SamePage Merging)

● Use resources on demand

● Dynamic resource allocation

● Naturally share pages

● Depends on hardware (VT-x, VT-d, EPT, etc.)

● Not all functionality is virtualized

● Flexibility

Page 9: 2. Vagin. Linux containers. June 01, 2013

Evolution of Operating System

● Multitask: many processes

● Multiuser: many users

● Multicontainer: many containers

Page 10: 2. Vagin. Linux containers. June 01, 2013

Containers (CT)

Cgroups – control resources

● cpu, cpuacct, cpuset

● blkio

● memory

● net_cls
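The controllers listed above show up per process in /proc/<pid>/cgroup, one line per hierarchy in the form hierarchy-ID:controller-list:path. The following is a minimal Python sketch of reading that format; the helper name parse_cgroup and the sample text (a process in a hypothetical cgroup /ct101) are assumptions made for illustration and do not come from the talk.

```python
def parse_cgroup(text):
    """Parse the content of /proc/<pid>/cgroup into {controller: path}.

    Each line looks like "4:cpu,cpuacct:/ct101": a hierarchy ID, a
    comma-separated controller list, and the cgroup path.
    """
    membership = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path itself may contain ':'.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for name in controllers.split(","):
            membership[name] = path
    return membership


# Hypothetical sample text, not captured from a real system.
SAMPLE = "4:cpu,cpuacct:/ct101\n3:memory:/ct101\n2:blkio:/\n"

if __name__ == "__main__":
    for controller, path in sorted(parse_cgroup(SAMPLE).items()):
        print(controller, path)
```

On a live system the same function can be fed the text of /proc/self/cgroup instead of the sample.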

Namespaces – isolate environments

● MNT

● PID

● NET

● IPC

● User

● UTS
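Which namespaces a process lives in can be inspected through the symlinks in /proc/<pid>/ns: two processes share a namespace when the corresponding link targets are equal. A minimal Python sketch of that check follows; the function names are illustrative assumptions, and the set of entries returned depends on the running kernel.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace_name: link_target} for a process.

    Every entry in /proc/<pid>/ns is a symlink whose target looks like
    "pid:[4026531836]". An empty dict is returned when procfs is missing
    or the directory cannot be read.
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = os.listdir(ns_dir)
    except OSError:
        return ids
    for name in names:
        try:
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # skip entries that cannot be read
    return ids


def same_namespace(pid_a, pid_b, name):
    """True if both processes are in the same namespace of the given kind."""
    a = namespace_ids(pid_a).get(name)
    b = namespace_ids(pid_b).get(name)
    return a is not None and a == b


if __name__ == "__main__":
    for name, ident in sorted(namespace_ids().items()):
        print(name, ident)
```

Comparing a process inside a container with one on the host this way would be expected to show different targets for every namespace the container has unshared.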

Page 11: 2. Vagin. Linux containers. June 01, 2013

How to execute CT

All allowed by default

● unshare, nsenter

● Systemd Lightweight Containers

● LXC

● Libvirt LXC

All restricted by default

● OpenVZ (vzctl-core) (FC19)

Page 12: 2. Vagin. Linux containers. June 01, 2013

vzctl - perform various operations on a container

# yum install -y vzctl-core
# vzctl create 101 --ostemplate fedora-15
# vzctl start 101
# vzctl exec 101 ps ax
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 init
11830 ?        Ss     0:00 syslogd -m 0
11897 ?        Ss     0:00 /usr/sbin/sshd
11943 ?        Ss     0:00 xinetd -stayalive -pidfile ...
12218 ?        Ss     0:00 sendmail: accepting connections
12265 ?        Ss     0:00 sendmail: Queue runner@01:00:00
13362 ?        Ss     0:00 /usr/sbin/httpd
13363 ?        S      0:00  \_ /usr/sbin/httpd
..............................................
 6416 ?        Rs     0:00 ps axf
# vzctl stop 101
# vzctl destroy 101

Page 13: 2. Vagin. Linux containers. June 01, 2013

Features available only in the OpenVZ kernel

● Ploop (snapshot, backups, different formats)

● Second-level quota

● More functional memory accounting

● PFCache (memory deduplication, saves I/O operations)

● Better isolation compared with FC19 (which lacks userns)

Page 14: 2. Vagin. Linux containers. June 01, 2013

Questions?

http://openvz.org

Page 15: 2. Vagin. Linux containers. June 01, 2013

Andrey Vagin

CRIU - Checkpoint/Restore in User-space

Page 16: 2. Vagin. Linux containers. June 01, 2013

What is C/R and how can it be used?

C/R is the ability to save states of processes and to restore them later.

Usage scenarios:

– Failure recovery

– Live migration

– Reboot-less upgrade

– Speeding up slow-boot services

– HPC issues

Page 17: 2. Vagin. Linux containers. June 01, 2013

History

● Berkeley Lab Checkpoint/Restart (BLCR) (2003)

– Load a kernel module and link with a library

● DMTCP: Distributed MultiThreaded CheckPointing (2004-2006)

– Preload a library

● OpenVZ (2005)

– OpenVZ kernel

● Linux Checkpoint/Restart by Oren Laadan (2008)

– A non-mainline kernel

● CRIU (2011)

[Timeline: BLCR 2003, OpenVZ 2005, DMTCP 2007, Linux C/R 2008, CRIU 2011]

Page 18: 2. Vagin. Linux containers. June 01, 2013

How does this work?

[Diagram: crtools sits between a process tree with its kernel objects (name-spaces, files, sockets, pipes) and a set of image files.]

Page 19: 2. Vagin. Linux containers. June 01, 2013

Kernel interfaces

[Diagram: kernel interfaces used for Dump and Restore: syscalls, netlink, /proc/, ptrace.]

Page 20: 2. Vagin. Linux containers. June 01, 2013

Dump

● Parasite code

– Receive file descriptors

– Dump memory content

– prctl(), sigaction, pending signals, timers, etc.

● Ptrace

– Freeze processes

– Inject parasite code

● Netlink

– Get information about sockets, netns

● Procfs

– /proc/PID/maps, /proc/PID/map_files/, /proc/PID/status, /proc/PID/mountinfo
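As an illustration of the procfs part of the dump, each line of /proc/PID/maps describes one memory mapping: address range, permissions, file offset, device, inode and an optional path. The sketch below parses such a line in Python; the names Mapping and parse_maps_line and the sample line are assumptions made for this example, and it is not how crtools itself is implemented.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "start end perms offset dev inode path")


def parse_maps_line(line):
    """Parse one line of /proc/PID/maps into a Mapping tuple."""
    # At most six fields; the optional sixth (the path) may contain spaces.
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)


# Hypothetical sample line, not captured from a real process.
SAMPLE = "00400000-0040b000 r-xp 00000000 08:01 1310 /bin/cat"

if __name__ == "__main__":
    m = parse_maps_line(SAMPLE)
    print(hex(m.start), hex(m.end), m.perms, m.path or "[anonymous]")
```

Anonymous mappings have no path field, which is why the path defaults to an empty string.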

Page 21: 2. Vagin. Linux containers. June 01, 2013

Restore

● Collect shared objects

● Restore name-spaces

● Create a process tree

– Restore SID, PGID

– Restore objects that should be inherited (files, sockets, pipes, ...)

● Restore per-task properties

● Restore memory

● Call sigreturn

● Awesome

Page 22: 2. Vagin. Linux containers. June 01, 2013

Interesting moments

● How to restore shared objects?

– Send file descriptors via unix sockets

– Map files from /proc/self/map_files/ for restoring anon shared mappings

● How to restore memory mappings at the correct places?

– Map a new code block and a stack

– Unmap crtools' mappings

– Remap the task's mappings to the correct places

● How to resume a process?

– Create a signal frame

– Call sigreturn()
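The first answer above, sending file descriptors via unix sockets, can be sketched in a few lines of Python. This is only an illustration of the mechanism (SCM_RIGHTS ancillary data), assuming Python 3.9 or later for socket.send_fds and socket.recv_fds; the function name pass_fd_demo is hypothetical, and both socket ends live in one process here, whereas during a restore they would belong to different tasks.

```python
import os
import socket


def pass_fd_demo():
    """Send a pipe's read end over a unix socket and read through the copy."""
    # The pipe stands in for any kernel object referenced by a descriptor.
    read_end, write_end = os.pipe()
    sender, receiver = socket.socketpair()  # AF_UNIX by default
    try:
        # The descriptor travels as SCM_RIGHTS ancillary data next to
        # a small ordinary payload.
        socket.send_fds(sender, [b"fd"], [read_end])
        _msg, fds, _flags, _addr = socket.recv_fds(receiver, 16, 1)
        received = fds[0]  # a new descriptor for the same open pipe
        try:
            os.write(write_end, b"hello")
            return os.read(received, 5)
        finally:
            os.close(received)
    finally:
        sender.close()
        receiver.close()
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    print(pass_fd_demo())
```

The received descriptor has a different number but refers to the same kernel object, which is what makes the object shared after restore.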

Page 23: 2. Vagin. Linux containers. June 01, 2013

Kernel impact

~140 patches merged

~10 patches in flight

~11 new features appeared

~2 new features to come

Page 24: 2. Vagin. Linux containers. June 01, 2013

New features in a kernel

● Parasite code injection (by Tejun Heo)

– Read task state that a task can currently retrieve only about itself

● The kcmp() system call

– Helps check which kernel objects are shared between processes

● Proc map_files directory

– Find out exactly which file is mapped

– Mappings sharing info

● A bunch of prctl extensions

– Set various private stuff on task/mm objects (c/r-only feature)

● Last-pid sysctl

– Restore a task with the desired PID value

Page 25: 2. Vagin. Linux containers. June 01, 2013

New features in a kernel

● TCP repair mode

– Read the internal state of a TCP connection and reconstruct it from scratch on a freshly created socket

● Sockets information dumping via netlink (sock_diag)

– Extensible engine for retrieving socket state

● Virtual net device indexes

– Allows restoring network devices in a namespace

● Socket peeking offset

– Allows peeking at socket queues (reading without removing data from the queue)

● Task memory tracking

– Incremental snapshots, online migration
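Peeking, as mentioned for socket queues above, means reading data while leaving it queued. The Python sketch below shows the plain MSG_PEEK flag on a unix socket pair; it does not set the peek offset itself (the SO_PEEK_OFF socket option), which additionally lets successive peeks advance through the queue instead of always starting at its head. The function name peek_then_read is an assumption for illustration.

```python
import socket


def peek_then_read(payload=b"queued data"):
    """Peek at queued socket data, then consume it with a normal read."""
    a, b = socket.socketpair()
    try:
        a.sendall(payload)
        # MSG_PEEK returns the data but leaves it in the receive queue.
        peeked = b.recv(len(payload), socket.MSG_PEEK)
        # A normal recv sees the same bytes and removes them from the queue.
        consumed = b.recv(len(payload))
        return peeked, consumed
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    peeked, consumed = peek_then_read()
    print(peeked == consumed)
```

For a checkpoint this matters because the queue content has to be saved without disturbing the socket that will keep running or be restored later.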

Page 26: 2. Vagin. Linux containers. June 01, 2013

What is already supported?

– X86_64 architecture

– Process tree linkage

– Multi-threaded apps

– All kinds of memory mappings

– Terminals, groups, sessions

– Open files (shared and unlinked)

– Established TCP connections

– Unix sockets, Packet sockets

– Name-spaces (net, mount, ipc)

– Non-posix files (epoll, inotify)

– Pipes, Fifo-s, IPC, ...

– ARM architecture

– Pending signals

– TCP time-stamps

– Iterative snapshots

– VDSO

– LXC and OpenVZ containers

In flight

– Posix timers

– Convert OpenVZ images

Page 27: 2. Vagin. Linux containers. June 01, 2013

How is CRIU tested?

● ZDTM – a set of unit-tests

● Real-life applications

– Apache, Nginx

– MySQL, MongoDB, Oracle

– Make && gcc

– Tar & gzip

– Screen

– Java

– LXC

– VNC server + GUI applications

Page 28: 2. Vagin. Linux containers. June 01, 2013

Future plans (Feb, 2013)

● Support all kinds of kernel objects

● Merge all in-flight patches in the mainstream kernel

● Integrate CRIU with OpenVZ and LXC utilities

● Iterative migration

– Migrate memory content before freezing applications

● Integration in distributions

– CRIU was accepted into Fedora 19

Page 29: 2. Vagin. Linux containers. June 01, 2013

How to use

● ./crtools dump -t pid [<options>]

– checkpoint a process/tree identified by pid

● ./crtools restore -t pid [<options>]

– restore a process/tree identified by pid

● ./crtools show (-D dir)|(-f file) [<options>]

– show dump file(s) contents

● ./crtools check

– checks whether the kernel support is up-to-date

● ./crtools exec -t pid <syscall-string>

– execute a system call in another task
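When the dump and restore commands above are driven from a script, it helps to build the argument lists in one place. The Python sketch below only constructs and prints the command lines and does not run them. The helper name crtools_cmd, the PID 1234 and the directory /tmp/ckpt are hypothetical; the slide shows -D only for the show command, so passing -D to dump and restore is an assumption to verify against the usage text of the installed crtools.

```python
import shlex


def crtools_cmd(action, pid, images_dir, binary="./crtools"):
    """Build an argv list for 'crtools dump' or 'crtools restore'.

    Assumption: -D selects the image directory for dump/restore as it
    does for show; check this against the installed crtools.
    """
    if action not in ("dump", "restore"):
        raise ValueError(f"unsupported action: {action}")
    return [binary, action, "-t", str(pid), "-D", images_dir]


if __name__ == "__main__":
    # Hypothetical PID and image directory, for illustration only.
    for action in ("dump", "restore"):
        print(shlex.join(crtools_cmd(action, 1234, "/tmp/ckpt")))
```

Keeping the commands as lists rather than strings avoids quoting problems if they are later passed to subprocess.run.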

Page 30: 2. Vagin. Linux containers. June 01, 2013

Checkpoint/restore of a VNC server.

Page 31: 2. Vagin. Linux containers. June 01, 2013

Questions?

http://criu.org