Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution...

76
FOSDEM 2016 Container mechanics in Linux and rkt

Transcript of Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution...

Page 1: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

FOSDEM 2016

Container mechanics in Linux and rkt

Page 2: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Jonathan Boullegithub.com/jonboulle@baronboulle

Alban Crequygithub.com/alban

Page 3: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

a modern, secure, composable container runtime

Page 4: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

an implementation of appc(image format, execution environment)

Page 5: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt

simple CLI toolgolang + Linuxself-contained

Page 6: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

simple CLI toolno (mandatory) daemon:

apps run directly under spawning process

Page 7: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

bash/runit/systemd

rkt

application(s)

Page 8: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt internalsmodular architecture

execution divided into stagesstage0 → stage1 → stage2

Page 9: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

bash/runit/systemd

rkt

application(s)

Page 10: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 11: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 12: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

stage0 (rkt binary)discover, fetch, manage application images

set up pod filesystemscommands to manage pod lifecycle

Page 13: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 14: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 15: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

stage1"the container"

execution environment for podsprocess lifecycle managementresource constraints (isolators)

Page 16: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

stage1 (swappable)

● binary ABI with stage0● rkt's stage0 calls exec(stage1, args...)

● default implementation ○ based on systemd-nspawn + systemd ○ Linux namespaces + cgroups for isolation

● kvm implementation ○ based on lkvm + systemd ○ hardware virtualisation for isolation

Page 17: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

pod (stage1)

bash/runit/systemd/... (invoking process)

app1 (stage2)

app2 (stage2)

Page 18: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

systemd-nspawn (stage1)

bash/runit/systemd/... (invoking process)

cached (stage2)

workerd (stage2)

systemd

Page 19: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt (stage0)

systemd-nspawn (stage1)

bash/runit/systemd/... (invoking process)

cached (stage2)

workerd (stage2)

systemd

container

Page 20: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Containers on Linuxnamespaces

cgroupschroot

Page 21: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Linux namespaces

Page 22: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

hostname:thunderstorm

hostname:rainbow

hostname:sunshinecontainers

host

Page 23: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Containers: no guest kernel

Hardware

Host Linux kernel

rkt rkt

app app app

system calls:example: sethostname()

kernel API

Page 24: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Containers with an example

Getting and setting the hostname:

✤ The system calls for getting and setting the hostname are older than containers

int uname(struct utsname *buf);

int gethostname(char *name, size_t len);

int sethostname(const char *name, size_t len);

Page 25: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Processes in namespaces

1

26

3

9

gethostname() -> “rainbow” gethostname() -> “thunderstorm”

Page 26: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Linux NamespacesSeveral independent namespaces

✤ uts (Unix Timesharing System) namespace✤ mount namespace✤ pid namespace✤ network namespace✤ user namespace

Page 27: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

1

26

“rainbow”

unshare(CLONE_NEWUTS);

Creating new namespaces

Page 28: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

1

26

“rainbow”

6

“rainbow”

Creating new namespaces

Page 29: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2
Page 30: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

PID namespace

Page 31: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

✤ the host sees all processes✤ the container only its own processes

Hiding processes and PID translation

Page 32: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

✤ the host sees all processes

✤ the container only its own processes

Actually pid 30920

Hiding processes and PID translation

Page 33: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Initial PID namespace

1

26

7

Page 34: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

1

26

7

Creating a new namespace

clone(CLONE_NEWPID, ...);

Page 35: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Creating a new namespace

1

26

1 7

Page 36: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt

rkt run ...

✤ uses clone() to start the first process in the container with a new pid namespace

✤ uses unshare() to create a new network namespace

rkt enter ...

✤ uses setns() to enter an existing namespace

Page 37: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Joining an existing namespace

1

26

1 7

setns(...,CLONE_NEWPID);

Page 38: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Joining an existing namespace

1

26

1

4

Page 39: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Mount namespaces

Page 40: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

container

host

/

/home/var/etc

user

/

/my-app

Page 41: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Storing the container data (Copy-on-write)

Container filesystem

Overlay fs “upper” directory/var/lib/rkt/pods/run/<pod-uuid>/overlay/sha512-

.../upper/

Application Container Image/var/lib/rkt/cas/tree/sha512-...

Page 42: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt directories/var/lib/rkt ├─ cas │ └─ tree │ ├─ deps-sha512-19bf... │ └─ deps-sha512-a5c2... └─ pods └─ run └─ e0ccc8d8 └─ overlay/sha512-19bf.../upper └─ stage1/rootfs/

Page 43: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

/

/home/var/etc

user

1 7

3

unshare(..., CLONE_NEWNS);

Page 44: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

/

/home/var/etc

user

1 7

3

/

/home/var/etc

user

7

Page 45: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

/

Changing root with MS_MOVE

/

...

rootfs

my-app

mount($ROOTFS, “/”, MS_MOVE)

/

...

rootfs

my-app

$ROOTFS = /var/lib/rkt/pods/run/e0ccc8d8.../stage1/rootfs

Page 46: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

/

/home/var/etc

user

/

/home/var/etc

user

Relationship between the two mounts:

- shared- master / slave- private

Mount propagation events

Page 47: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Private Shared Master and slave

/home/home

Mount propagation events

/home

/home

/home/home

Page 48: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

✤ / in the container namespaceis recursively set as slave:

mount(NULL, "/", NULL, MS_SLAVE|MS_REC, NULL)

How rkt uses mount propagation events

/

/home/var/etc

user

Page 49: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Network namespace

Page 50: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Network isolationGoal:

✤ each container has their own network interfaces

✤ Cannot see the network traffic outside the container (e.g. tcpdump)

container1

host

container2

eth0

eth0eth0

Page 51: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Network tooling

✤ Linux can create pairs of virtual net interfaces

✤ Can be linked in a bridge

container1 container2

eth0

veth1

eth0

veth2

IP masquerading via iptables

eth0

bridge

Page 52: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

rkt networking

✤ plugin based✤ Container Network Interface (CNI)✣ rkt✣ Kubernetes✣ Calico

Page 53: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Container Runtime (e.g. rkt)

veth macvlan ipvlan OVS

Container Networking Interface (CNI)

Page 54: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

How does rkt do it?✤ rkt uses the network plugins implemented by the Container Network Interface (CNI,

https://github.com/appc/cni)

rktnetwork pluginsexec

systemd-nspawn

exec()

/var/lib/rkt/pods/run/$POD_UUID/netns

network namespace

configurevia setns + netlink

create, join

Page 55: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

User namespaces

Page 56: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

History of Linux namespaces

✓ 1991: Linux

✓ 2002: namespaces in Linux 2.4.19

✓ 2008: LXC✓ 2011: systemd-nspawn✓ 2013: user namespaces in Linux 3.8✓ 2013: Docker✓ 2014: rkt

… development still active

Page 57: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Why user namespaces?

✤ Better isolation✤ Run applications which would need more

capabilities✤ Per user limits✤ Future:

✣ Unprivileged containers: possibility to have container without root

Page 58: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

0

host

65535

4,294,967,295(32-bit range)

0

container 1

655350

container 2

User ID ranges

Page 59: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

unmapped

User ID mapping/proc/$PID/uid_map: “0 1048576 65536”

host

container

1048576

65536

65536

unmappedunmapped

Page 60: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Problems with container images

Container filesystem

Container filesystem

Overlayfs “upper”

directory

Overlayfs “upper”

directory

Application Container Image (ACI)

ApplicationContainer

Image (ACI)

container 1 container 2

downloading

web server

Page 61: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Problems with container images

✤ Files UID / GID✤ rkt currently only supports user namespaces without overlayfs

✣ Performance loss: no COW from overlayfs✣ “chown -R” for every file in each container

Page 62: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Problems with volumes

/

/home/var

user

/

/data /my-app

bind mount(rw / ro)

/data

✤ mounted in several containers

✤ No UID translation

✤ Dynamic UID maps

/data

Page 63: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

User namespace and filesystem problem

✤ Possible solution: add options to mount() to apply a UID mapping

✤ rkt would use it when mounting:✣ the overlay rootfs✣ volumes

Page 64: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Isolators

Page 65: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Isolators in rkt

✤ specifiedin an image manifest

✤ limiting capabilitiesor resources

Page 66: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Isolators in rkt

Currently implemented

✤ capabilities✤ cpu✤ memory

Possible additions

✤ block-bandwidth✤ block-iops✤ network-bandwidth✤ disk-space

Page 67: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

cgroups

Page 68: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

What’s a control group (cgroup)

✤ group processes together✤ organised in trees✤ applying limits to them as a group

Page 69: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

cgroup API

/sys/fs/cgroup/*/

/proc/cgroups

/proc/$PID/cgroup

Page 70: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

cgroups

Page 71: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

List of cgroup controllers/sys/fs/cgroup/ ├─ cpu ├─ devices ├─ freezer ├─ memory ├─ ... └─ systemd

Page 72: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Memory isolator

“limit”: “500M”

ApplicationImage Manifest

[Service]ExecStart=MemoryLimit=500M

systemd service file

write tomemory.limit_in_

bytes

systemd action

Page 73: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

CPU isolator

“limit”: “500m”

ApplicationImage Manifest

write tocpu.share

systemd action

[Service]ExecStart=CPUShares=512

systemd service file

Page 74: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Unified cgroup hierarchy

✤ Multiple hierarchies:✣ one cgroup mount point for each controller (memory, cpu...)✣ flexible but complex✣ cannot remount with a different set of controllers✣ difficult to give to containers in a safe way

✤ Unified hierarchy:✣ cgroup filesystem mounted only one time✣ soon to be stable in Linux (mount option

“__DEVEL__sane_behavior” being removed)✣ initial implementation in systemd-v226 (September 2015)✣ no support in rkt yet

Page 75: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

Questions?

github.com/coreos/rkt

Join us!

Page 76: Container mechanics in Linux and rkt - FOSDEM 2018 · rkt internals modular architecture execution divided into stages stage0 → stage1 → stage2

● Early bird tickets● Sponsorships are still available● Submit a talk before February 29th!

coreos.com/fest

May 9 & 10, 2016 | Berlin, Germany

@coreosfest