Containers and isolation as implemented in the Linux kernel

Technical Deep Dive Session

Hannes Frederic Sowa <[email protected]>, Senior Software Engineer

13. September 2016



Outline

Learned from history and enhanced and innovated in Free Software.

● Overview of not so recent history from other operating systems
● Representation and control from user space
● Implementation details in the kernel
● What is coming?


History of operating system isolation

• Plan 9 per-process namespaces
  • Distributed computing
  • Architecture specific files mapped via bind/union mounts
  • Directory vnodes had an append operation
  • User space server via 9P protocol
  • Not yet implemented in Linux: RPC via AF_UNIX over NFS


History of operating system isolation

• POSIX chroot
  • Available as a syscall, thus usable in self-written applications
  • Provides a new filesystem view, thus only limited isolation
• FreeBSD’s jails
  • Strongly integrated into the operating system
  • Only a small helper library available
  • No operating system control and tuning
  • Limited network isolation, based only on IP addresses
• Solaris Zones
  • Strongly integrated into the operating system (even the package manager)
  • Tooling is dictated by Solaris tools


Namespace API design in Linux

• Isolation and resource management completely decoupled
• API never tightly coupled to any user space library
  • Paved the way for many user space frameworks (e.g. Docker)
• Syscalls openly documented and reusable by 3rd party software
• Management available with already known kernel primitives
  • With rather primitive tools: nearly no new tools were needed
• Fine-grained control over which primitives to namespace
  • Opt-in model
• Easy to enhance in user space as well as in the kernel


Isolation vs. Resource Management

• cgroups: resource management
• namespaces: isolation
• Not completely orthogonal but still...

[Figure: Process 1 to Process 4, grouped by cgroup1/cgroup2 and by ns1/ns2]


Namespaces in regular use

Even on non-servers, namespaces see regular use nowadays:

$ lsns
        NS TYPE  NPROCS   PID USER  COMMAND
4026531836 pid       63  2028 hsowa /usr/lib/systemd/systemd --user
4026531837 user      63  2028 hsowa /usr/lib/systemd/systemd --user
4026531838 uts       70  2028 hsowa /usr/lib/systemd/systemd --user
4026531839 ipc       70  2028 hsowa /usr/lib/systemd/systemd --user
4026531840 mnt       70  2028 hsowa /usr/lib/systemd/systemd --user
4026531969 net       63  2028 hsowa /usr/lib/systemd/systemd --user
4026532501 pid        2  3485 hsowa /opt/google/chrome/chrome --type=zygote
4026532503 net        6  3485 hsowa /opt/google/chrome/chrome --type=zygote
4026532621 pid        1  3486 hsowa /opt/google/chrome/nacl_helper
4026532623 net        1  3486 hsowa /opt/google/chrome/nacl_helper
4026532724 user       1  3486 hsowa /opt/google/chrome/nacl_helper
4026532725 user       6  3485 hsowa /opt/google/chrome/chrome --type=zygote
...


Namespace API wrap-up

• No dependencies on 3rd party libraries or tools
• No design mandated by operating system or distributions
• Resource management independent from isolation
• Made several management tools possible (some specialized)
  • iproute2, systemd, rkt, Docker, LXC, LXD, lmctfy, runc
• Free choice between a complete distribution and a specialized init
• … or maybe just running the application directly in a namespace
• OpenVZ/Virtuozzo reusing and contributing to namespaces upstream


Representation and control from user space

Each process is associated with exactly one namespace of each type:

# ls -l /proc/self/ns/
total 0
lrwxrwxrwx. 1 root hsowa 0 12. Sep 22:09 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 net -> 'net:[4026531969]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 12. Sep 22:09 uts -> 'uts:[4026531838]'
# unshare -n    # -n :: unshare the network namespace
# ls -l /proc/self/ns/net
lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
#
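The link targets shown above have the form type:[inode], and two processes share a namespace when those inode numbers match. A minimal shell sketch of that comparison follows; the function names ns_inode, ns_of and same_ns are hypothetical helpers for illustration, not part of util-linux.

```shell
# Extract the inode number from a namespace link target such as 'net:[4026531969]'.
ns_inode() {
    target=$1
    inode=${target#*:\[}    # drop everything up to and including ':['
    inode=${inode%]}        # drop the trailing ']'
    printf '%s\n' "$inode"
}

# Print the namespace link target of a process: ns_of <pid> <type>
ns_of() {
    readlink "/proc/$1/ns/$2"
}

# Succeed if two processes are in the same namespace of the given type:
# same_ns <pid1> <pid2> <type>
same_ns() {
    [ "$(ns_of "$1" "$3")" = "$(ns_of "$2" "$3")" ]
}
```

For example, same_ns $$ 1 net would check whether the current shell shares its network namespace with PID 1. Reading the links of another user's process usually requires additional privileges, so an unreadable link should not be taken as proof of a different namespace.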


Making namespaces persistent

Managing namespaces as a mountpoint:

# unshare -n    # -n :: unshare the network namespace
# ls -l /proc/self/ns/net
lrwxrwxrwx. 1 root root 0 12. Sep 22:10 /proc/self/ns/net -> 'net:[4026532727]'
# touch /run/netns/my_namespace1
# mount -o bind /proc/self/ns/net /run/netns/my_namespace1
# ls -i /run/netns/my_namespace1
4026532727 /run/netns/my_namespace1
# exit
# readlink /proc/self/ns/net
net:[4026531969]
# nsenter --net=/run/netns/my_namespace1
# readlink /proc/self/ns/net
net:[4026532727]
#
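The touch and bind-mount steps above can be wrapped into small helpers. This is only a sketch: persist_netns, release_netns, run and the DRY_RUN variable are hypothetical names, and the mkdir of /run/netns is an added assumption for systems where the directory does not exist yet. With DRY_RUN set, the commands are printed instead of executed, so the sequence can be reviewed without root.

```shell
# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Pin the caller's current network namespace under /run/netns/<name>.
persist_netns() {
    name=$1
    run mkdir -p /run/netns &&
    run touch "/run/netns/$name" &&
    run mount -o bind /proc/self/ns/net "/run/netns/$name"
}

# Remove the pin again: unmount the bind mount, then delete the placeholder file.
release_netns() {
    name=$1
    run umount "/run/netns/$name" &&
    run rm -f "/run/netns/$name"
}
```

Called as persist_netns my_namespace1 from inside the unshared shell, this would repeat the manual steps of the session; nsenter --net=/run/netns/my_namespace1 then re-enters the namespace as before.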


User namespaces

• User namespaces have a special role as they directly influence permission control
• They allow a user to become root inside a namespace that the user created
• Permissions are disassociated from the parent namespace
• Example:

$ id -u
1000
$ unshare --user -r bash
# id -u
0
# unshare -n
# nc -l 80    # netcat is allowed to bind to port 80
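Behind the root identity in the example sits a UID mapping, which the kernel exposes in /proc/<pid>/uid_map as lines of the form "first inside UID, first outside UID, range length". The sketch below translates a UID seen inside the namespace back to the parent namespace; map_uid is a hypothetical helper name, and the mapping lines used in the usage note are assumptions rather than output captured from the talk.

```shell
# Translate a UID inside a user namespace to the UID in the parent namespace.
# Reads uid_map lines on stdin: <first inside UID> <first outside UID> <range length>
# Usage: map_uid <inside-uid> < /proc/self/uid_map
map_uid() {
    uid=$1
    while read -r inside outside length; do
        if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + length)) ]; then
            echo $((outside + uid - inside))
            return 0
        fi
    done
    return 1    # the UID is not covered by any mapping line
}
```

With a single mapping line "0 1000 1", as one would expect after unshare --user -r for a user with UID 1000, map_uid 0 prints 1000: root inside the namespace is still the unprivileged user outside of it.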


Easier management: netns

OpenStack already uses a lightweight wrapper around these to manage netns:

# ip netns add foo
# ip netns add bar
# ip link add type veth
# ip link set dev veth0 netns foo
# ip link set dev veth1 netns bar
# ip netns exec foo bash
# ip l l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: ip_vti0@NONE: <NOARP> mtu 1332 qdisc noop state DOWN mode DEFAULT group default qlen 1
    link/ipip 0.0.0.0 brd 0.0.0.0
47: veth0@if48: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:e5:a7:2f:d5:69 brd ff:ff:ff:ff:ff:ff link-netnsid 1
# exit


Representation wrap-up

• Namespaces are internally represented by normal inodes living in their own filesystem, which are globally valid
  • Thus file descriptor passing works as usual
• Persisting a namespace is simply achieved by bind mounting the representative file to a “stable location”
• Simple, atomic utilities map directly to the corresponding syscalls
  • unshare(1) → unshare(2) or clone(2)
  • nsenter(1) → setns(2)
  • mount is really just mounting
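The mapping from utility to syscall extends to the individual namespace types: each entry under /proc/<pid>/ns/ corresponds to one CLONE_NEW* flag for clone(2) and unshare(2), and to one short option of unshare(1). A small lookup sketch follows; the function name ns_flag is hypothetical, while the flag and option names are taken from the respective man pages.

```shell
# Map a namespace type (as named under /proc/<pid>/ns/) to its
# clone(2)/unshare(2) flag and the matching unshare(1) short option.
ns_flag() {
    case $1 in
        mnt)    echo "CLONE_NEWNS -m" ;;
        uts)    echo "CLONE_NEWUTS -u" ;;
        ipc)    echo "CLONE_NEWIPC -i" ;;
        net)    echo "CLONE_NEWNET -n" ;;
        pid)    echo "CLONE_NEWPID -p" ;;
        user)   echo "CLONE_NEWUSER -U" ;;
        cgroup) echo "CLONE_NEWCGROUP -C" ;;
        *)      echo "unknown namespace type: $1" >&2; return 1 ;;
    esac
}
```

Note the historical oddity that the mount namespace flag is CLONE_NEWNS rather than a name containing "MNT", since it was the first namespace type to be added.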


Implementation details in the kernel

• struct user_namespace
  • Establishes its own configurable UID and GID mapping
• struct nsproxy
• struct uts_namespace
  • Isolates hostname and domainname (e.g. for auth purposes)
• struct ipc_namespace
  • Isolates (POSIX/SysV IPC) mqueue, semaphores, shared memory
• struct mnt_namespace
  • Abstraction and isolation over the filesystem views
• struct pid_namespace
  • Isolates process tree and PID numbers
• struct net
  • Controls isolation of network interfaces, routing tables, IP addresses
• struct cgroup_namespace (recent development)
  • Control group namespace, isolates resource management


Mount namespace

• Most important namespace, as it also provides the isolation for /proc and (partially) for sysfs, which should get remounted in a new container
• Mount namespaces basically form trees in the kernel which can partially overlap (mount subtrees)
• A process is attached to one subtree
• Discovered via nsproxy


System configuration (netns)

• Configuration, routing tables, firewall etc. are all separated per network namespace. How?
• System configuration is mostly done via sysctl
• A lot of sysctls are manageable per namespace
• A network namespace has its own sysctls in struct net
  • Incoming packets use the configuration of the network namespace of the incoming interface
  • Outgoing packets can use the socket's namespace (locally generated) or the device context
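From user space, the per-namespace sysctls are reached through the usual /proc/sys files, evaluated in the network namespace of the reading process. A minimal sketch, assuming a namespace created with ip netns add; sysctl_path and netns_sysctl are hypothetical helper names, and the namespace name foo in the usage note is only an example.

```shell
# Convert a sysctl name to its path under /proc/sys.
# Caveat: components that themselves contain dots (e.g. a VLAN device
# such as eth0.100) are not handled by this simple translation.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

# Read a sysctl inside a named network namespace: netns_sysctl <netns> <name>
netns_sysctl() {
    ip netns exec "$1" cat "$(sysctl_path "$2")"
}
```

Running netns_sysctl foo net.ipv4.ip_forward as root would show the forwarding setting of namespace foo, which can differ from the value seen in the initial namespace.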


What is coming?

• The namespace concept is basically architecturally completely implemented
• New features added to the kernel are already designed in an orthogonal way or can correctly deal with namespaces
• Network namespaces are heavyweight, thus
  • Connecting a netns to the outside world requires a virtual router or bridge
  • Alternatives exist but are architecturally a dead end
    • ipvlan: multiplexes IP addresses on one interface
    • macvlan: multiplexes MAC addresses on one interface
  • Provide isolation on the IP layer like FreeBSD jails or Solaris
  • Maybe even extended to act like VRF with sockets
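For reference, the two alternatives named above are both created with ip link on top of an existing parent device and then moved into a namespace. The sketch below assumes a parent device eth0, new device names mv0 and ipvl0, and a namespace foo, all of which are hypothetical; attach_macvlan, attach_ipvlan, run and DRY_RUN are likewise illustrative names. With DRY_RUN set, the commands are only printed.

```shell
# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# attach_macvlan <parent-dev> <new-dev> <netns>
# The new device gets its own MAC address on top of the parent device.
attach_macvlan() {
    run ip link add link "$1" name "$2" type macvlan mode bridge &&
    run ip link set dev "$2" netns "$3"
}

# attach_ipvlan <parent-dev> <new-dev> <netns>
# The new device shares the parent's MAC address; traffic is split by IP address.
attach_ipvlan() {
    run ip link add link "$1" name "$2" type ipvlan mode l2 &&
    run ip link set dev "$2" netns "$3"
}
```

Compared with the veth setup, neither variant needs a bridge or router in the parent namespace, which is the saving the slide refers to.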
