An In-Depth Analysis of the Linux Kernel Technologies Behind Docker

孙健波 (Sun Jianbo), SEL/VLIS Lab, Zhejiang University, www.sel.zju.edu.cn

Agenda

• Namespace

• ipc, uts, pid, network, mount, user

• Cgroup

• what are cgroups?

• usage, concepts, implementation……

What is a Namespace?

• Lightweight process virtualization

• Isolation: enables a process (or several processes) to have a different view of the system than other processes.

[Figure: two namespaces side by side, each with its own hostname, IPC, network stack, filesystem, PIDs (PID1, PID2, …) and uid, gid, capabilities.]

namespaces

There are currently 6 namespaces:

• uts (hostname)

• ipc (System V IPC)

• net (network stack)

• mnt (mount points, filesystems)

• pid (processes)

• user (UIDs)

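• The namespaces of each process appear as files under /proc/[pid]/ns

• Bind-mounting one of these files keeps the namespace alive even after all of its processes have exited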
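
A minimal sketch of how the /proc/[pid]/ns entries can be inspected from user space, assuming a Linux system with /proc mounted. The helper names list_namespaces and same_namespace are made up for this illustration. Two processes are in the same namespace when the corresponding symlinks have the same target.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} read from /proc/[pid]/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # not Linux, /proc not mounted, or no such process
    for name in names:
        try:
            # Each entry is a symlink whose target looks like "uts:[4026531838]".
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue  # entry not readable for this process
    return result


def same_namespace(pid_a, pid_b, name):
    """True when both processes are in the same namespace of the given type."""
    a = list_namespaces(pid_a).get(name)
    b = list_namespaces(pid_b).get(name)
    return a is not None and a == b


if __name__ == "__main__":
    for name, ident in list_namespaces().items():
        print(f"{name:20} {ident}")
```

Comparing a container's init process with a host process this way would typically show different identifiers for every namespace type the container runtime has unshared.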

APIs

• Three system calls are used.

• clone()

• unshare()

• setns()

clone()

[Figure: a process in a namespace calls clone(), which produces a new process in a new namespace.]

• creates a new process and a new namespace

unshare()

• creates a new namespace

• attaches the current process to it

[Figure: a process calls unshare() and moves from its original namespace into a new namespace.]

setns()

[Figure: a process in namespace A calls setns() and moves into the existing namespace B.]

• joins an existing namespace.
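
As a rough illustration of the unshare() call, the sketch below forks a child and lets only the child try to create a new UTS namespace, so the calling process is left untouched. It assumes Linux and a libc that exports unshare. The flag value is the one defined in <sched.h>. The function names uts_ns_id and try_unshare_uts are invented for this example. Without CAP_SYS_ADMIN the call is expected to fail, which the sketch reports rather than treats as an error.

```python
import ctypes
import os

CLONE_NEWUTS = 0x04000000  # value of the flag in <sched.h>


def uts_ns_id():
    """Identifier of the caller's UTS namespace, or None if /proc is unavailable."""
    try:
        return os.readlink("/proc/self/ns/uts")
    except OSError:
        return None


def try_unshare_uts():
    """Fork a child, call unshare(CLONE_NEWUTS) in it, and report the outcome."""
    parent_ns = uts_ns_id()
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: must never return into the caller's code
        message = "failed: unexpected error"
        try:
            os.close(read_fd)
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUTS) == 0:
                message = "ok: " + str(uts_ns_id())
            else:
                message = "failed: " + os.strerror(ctypes.get_errno())
        except Exception as exc:
            message = "failed: " + repr(exc)
        finally:
            try:
                os.write(write_fd, message.encode())
            finally:
                os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        message = reader.read().decode() or "failed: child did not report"
    os.waitpid(pid, 0)
    status, _, detail = message.partition(": ")
    return {"status": status, "detail": detail, "parent_ns": parent_ns}


if __name__ == "__main__":
    print(try_unshare_uts())
```

When the call succeeds, the identifier reported by the child differs from the parent's. setns() works the other way round: it takes a file descriptor opened on a /proc/[pid]/ns entry and moves the caller into that existing namespace.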

UTS namespace

struct task_struct
    ……
    *nsproxy

struct nsproxy
    ……
    *uts_ns
    *mnt_ns
    *net_ns
    *pid_ns
    *ipc_ns

struct uts_namespace
    name: nodename, sysname, release, version, machine

static inline struct new_utsname *utsname(void)
{
    return &current->nsproxy->uts_ns->name;
}

SYSCALL_DEFINE2(gethostname, char __user *, name, int, len)
{
    struct new_utsname *u;
    ...
    u = utsname();
    if (copy_to_user(name, u->nodename, i))
        errno = -EFAULT;
    ...
}
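
Seen from user space, the kernel path above means that every way of asking for the host name returns the nodename of the caller's own UTS namespace. A small sketch, assuming a Unix-like system (the function name hostname_views is hypothetical):

```python
import os
import socket


def hostname_views():
    """Return the host name as reported by uname(2) and by gethostname(2)."""
    return {
        "uname_nodename": os.uname().nodename,
        "gethostname": socket.gethostname(),
    }


if __name__ == "__main__":
    print(hostname_views())
```

Run inside a container, both values would be the container's host name, because both calls end up reading the same per-namespace structure.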

IPC namespace

• the principle is the same as for the UTS namespace

• there is simply more code

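Network namespace

• logically another copy of the network stack

• uses a veth pair, which works like a pipe between two network namespaces, to communicate

[Figure: on the host, container namespaces A and B each have an eth0 that is one end of a veth pair; the other veth ends are attached to the bridge docker0, which connects to the physical network device.]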
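
Because each network namespace has its own copy of the stack, the list of interfaces a process sees depends on the namespace it runs in. The sketch below lists the interfaces visible to the current process. It assumes a Unix-like system, and the helper names are made up for this example. The docker0 name is the bridge shown in the figure.

```python
import socket


def visible_interfaces():
    """Names of the network interfaces in this process's network namespace."""
    try:
        return sorted(name for _index, name in socket.if_nameindex())
    except OSError:
        return []


def has_docker_bridge(names):
    """True when the docker0 bridge is among the given interface names."""
    return "docker0" in names


if __name__ == "__main__":
    names = visible_interfaces()
    print(names, "docker0 present:", has_docker_bridge(names))
```

On a Docker host the list would normally include docker0 and the host-side veth ends, while inside a container only lo and the container's eth0 would be expected.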

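Mount namespace

[Figure: a mount namespace containing /bin, /lib, /proc and /root, next to a child namespace and another namespace. Each mount carries a propagation type: share (mount events propagate in both directions), master/slave (events propagate from master to slave only), private (no propagation) or unbindable (private, and cannot be bind-mounted).]

• The first namespace type in Linux history

• By default, a new mount namespace gets its own copy of the mount tree instead of pointing to the root namespace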
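
The propagation type of each mount can be read from the optional fields of /proc/[pid]/mountinfo. The following sketch classifies one line of that file. It assumes the field layout documented in proc(5), and the function names are hypothetical.

```python
def propagation_of(mountinfo_line):
    """Return (mount point, propagation type) for one mountinfo line."""
    fields = mountinfo_line.split()
    separator = fields.index("-", 6)  # optional fields end at a lone "-"
    mount_point = fields[4]
    tags = {field.split(":")[0] for field in fields[6:separator]}
    if "unbindable" in tags:
        kind = "unbindable"
    elif "shared" in tags and "master" in tags:
        kind = "shared+slave"
    elif "shared" in tags:
        kind = "shared"
    elif "master" in tags:
        kind = "slave"
    else:
        kind = "private"  # no optional propagation field at all
    return mount_point, kind


def current_mounts(path="/proc/self/mountinfo"):
    """Classify every mount visible in the caller's mount namespace."""
    try:
        with open(path) as handle:
            return [propagation_of(line) for line in handle if line.strip()]
    except (OSError, ValueError):
        return []


if __name__ == "__main__":
    for mount_point, kind in current_mounts():
        print(f"{kind:12} {mount_point}")
```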

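PID namespace

• The same PID can exist in different namespaces

• can be nested up to 32 levels

• PID 1 = init process of the namespace

• PID 1 reaps orphaned children in its namespace

• PID 1 ignores SIGKILL sent from inside its own namespace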
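
Nesting means that one process has a different PID in each PID namespace it belongs to. On kernels that provide the NStgid field in /proc/[pid]/status (Linux 4.1 and later), those PIDs can be listed as sketched below. The function name pids_by_namespace is made up for this example.

```python
import os


def pids_by_namespace(pid="self"):
    """PIDs of a process in each nested PID namespace, outermost visible first."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("NStgid:"):
                    return [int(value) for value in line.split()[1:]]
    except OSError:
        pass
    return []  # field missing (older kernel) or /proc unavailable


if __name__ == "__main__":
    print("getpid():", os.getpid(), "NStgid:", pids_by_namespace())
```

The last entry is the PID in the innermost namespace, which is the value the process itself gets from getpid().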

User namespace

• Will be supported by Docker in the future.

• Docker in Yarn

• ……

[Figure: a normal user creates a user namespace and becomes a privileged user inside it, where it can create pid, network, uts, ipc and mount namespaces.]
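
The link between the privileged user inside a user namespace and the normal user outside it is the ID mapping in /proc/[pid]/uid_map, where each line holds an inside start, an outside start and a length. A minimal sketch of reading such a mapping follows. The function names and the sample UID 1000 are assumptions for illustration.

```python
def parse_id_map(text):
    """Parse uid_map or gid_map content into (inside, outside, length) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(value) for value in line.split())
            ranges.append((inside, outside, length))
    return ranges


def to_outside_id(ranges, inside_id):
    """Translate an ID seen inside the namespace to the ID seen outside."""
    for inside, outside, length in ranges:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None  # the ID is not mapped


if __name__ == "__main__":
    # Hypothetical mapping: root inside the namespace is UID 1000 outside.
    ranges = parse_id_map("0 1000 1\n")
    print(to_outside_id(ranges, 0))
```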

What are cgroups?

• Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.

Usage of cgroups

• Resource limitation: groups can be set to not exceed a configured limit

• Prioritization: some groups may get a larger share of CPU utilization or disk I/O throughput

• Accounting: measures how many resources certain systems use

• Control: freezing groups of processes, checkpointing and restarting them

Concepts

• cgroup – a group of tasks with shared characteristics

• subsystem – a module that applies parameters to cgroups to control them in particular ways, typically for resource management

• hierarchy – a set of cgroups organized in a hierarchical tree, plus one or more subsystems associated with that tree

• VFS -> API: cgroups are exposed through a virtual filesystem, so user space controls them by reading and writing files

Cgroups: Example

/cgroup
    /cgroup/memlimits (memory subsystem mount point & hierarchy)
        /cgroup/memlimits/student: memory.limit=1G, tasks=1,2,3,4,5
        /cgroup/memlimits/teacher: memory.limit=2G, tasks=6,7,8
    /cgroup/cpulimits (cpuset subsystem mount point & hierarchy)
        /cgroup/cpulimits/student: cpuset.cpus=0-1, tasks=1,2,3,4,5
        /cgroup/cpulimits/teacher: cpuset.cpus=0-3, tasks=6,7,8

Two hierarchies.
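
The example can be written down as plain data to show how a task picks up one set of parameters from each hierarchy. This is only a model of the slide's layout, not an interface to the kernel. The names HIERARCHIES and limits_for are invented here.

```python
# The two hierarchies from the example, keyed by mount point.
HIERARCHIES = {
    "/cgroup/memlimits": {  # memory subsystem
        "student": {"memory.limit": "1G", "tasks": {1, 2, 3, 4, 5}},
        "teacher": {"memory.limit": "2G", "tasks": {6, 7, 8}},
    },
    "/cgroup/cpulimits": {  # cpuset subsystem
        "student": {"cpuset.cpus": "0-1", "tasks": {1, 2, 3, 4, 5}},
        "teacher": {"cpuset.cpus": "0-3", "tasks": {6, 7, 8}},
    },
}


def limits_for(task):
    """Collect the parameters that apply to a task across all hierarchies."""
    limits = {}
    for mount_point, groups in HIERARCHIES.items():
        for name, group in groups.items():
            if task in group["tasks"]:
                for key, value in group.items():
                    if key != "tasks":
                        limits[key] = (value, f"{mount_point}/{name}")
    return limits


if __name__ == "__main__":
    print(limits_for(3))
    print(limits_for(7))
```

Task 3 ends up with the student limits from both hierarchies, and task 7 with the teacher limits.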

Parameters: Examples

• cpuset subsystem

• cpuset.cpus: defines the set of CPUs that the tasks in the cgroup are allowed to execute on

• echo "0-2" > /cgroup/cpuset/lab2/cpuset.cpus

• memory subsystem

• memory.limit_in_bytes: sets the maximum amount of user memory

• echo 1G > /cgroup/memory/lab1/memory.limit_in_bytes
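
To make the two value formats concrete, the sketch below expands a CPU list such as "0-2" and converts a size such as "1G" to bytes. It assumes that the k, m and g suffixes are powers of 1024, and the parser names are made up for this example. It does not write to any cgroup file.

```python
def parse_cpu_list(text):
    """Expand a cpuset list such as "0-2,4" into [0, 1, 2, 4]."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def parse_size(text):
    """Convert a size such as "1G" to bytes; k, m and g are powers of 1024."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip()
    suffix = text[-1].lower()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)


if __name__ == "__main__":
    print(parse_cpu_list("0-2"), parse_size("1G"))
```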

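Relationships Between Subsystems, Hierarchies, Control Groups and Tasks

Rule 1: a single hierarchy can have one or more subsystems attached to it.

Rule 2: a subsystem cannot be attached to more than one hierarchy if one of those hierarchies already has a different subsystem attached.

Rule 3: when a new hierarchy is created, all tasks on the system start in its root cgroup; a task can be in several cgroups at once, but only in one cgroup per hierarchy.

Rule 4: a forked child starts in the same cgroups as its parent and can be moved independently afterwards.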
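
A toy model can help keep these relationships apart. The sketch below encodes them as described in the Red Hat Resource Management Guide: a hierarchy may carry several subsystems, a task sits in exactly one cgroup per hierarchy and starts in the root cgroup, and a forked child inherits its parent's cgroups. The subsystem rule is deliberately simplified to "a subsystem belongs to at most one hierarchy". The class and method names are invented for this illustration and are not kernel interfaces.

```python
class CgroupModel:
    """Toy model of hierarchies, subsystems and task membership."""

    def __init__(self):
        self.hierarchies = {}  # hierarchy name -> set of attached subsystems
        self.membership = {}   # task -> {hierarchy name: cgroup path}

    def create_hierarchy(self, name, subsystems):
        subsystems = set(subsystems)
        for other, attached in self.hierarchies.items():
            if attached & subsystems:  # simplified subsystem rule
                raise ValueError(f"{attached & subsystems} already on {other}")
        self.hierarchies[name] = subsystems  # one or more subsystems allowed
        for groups in self.membership.values():
            groups[name] = "/"  # every existing task starts in the root cgroup

    def add_task(self, task):
        self.membership[task] = {name: "/" for name in self.hierarchies}

    def move(self, task, hierarchy, cgroup):
        # Assigning replaces the old entry: one cgroup per hierarchy.
        self.membership[task][hierarchy] = cgroup

    def fork(self, parent, child):
        # The child starts with a copy of the parent's membership.
        self.membership[child] = dict(self.membership[parent])


if __name__ == "__main__":
    model = CgroupModel()
    model.add_task(1)
    model.create_hierarchy("cpu_mem", {"cpu", "memory"})
    model.move(1, "cpu_mem", "/lab1")
    model.fork(1, 2)
    print(model.membership)
```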

Current subsystems used by Docker

• cpuset – controls access to individual CPUs and memory nodes by a cgroup

• cpu – schedules CPU access to cgroups

• cpuacct – reports CPU resource usage by a cgroup

• memory – controls access to memory resources and reports memory resource usage by a cgroup

• devices – controls access to devices by a cgroup, e.g., GPUs

• freezer – suspends and resumes tasks in a cgroup

• blkio – tracks I/O ownership, allowing control of access to block I/O resources
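
Which subsystems actually apply to a process, and in which cgroup, can be read from /proc/[pid]/cgroup, where each line has the form hierarchy-ID:subsystem-list:cgroup-path. A minimal parsing sketch follows. The function names are hypothetical, and the /docker/abc path in the usage example is made up.

```python
def parse_proc_cgroup(text):
    """Map each subsystem to its cgroup path from /proc/[pid]/cgroup content."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, subsystems, path = line.split(":", 2)
        for subsystem in subsystems.split(","):
            # An empty name is the unified (cgroup v2) hierarchy.
            result[subsystem] = path
    return result


def current_cgroups(path="/proc/self/cgroup"):
    """Subsystem to cgroup path mapping for the calling process."""
    try:
        with open(path) as handle:
            return parse_proc_cgroup(handle.read())
    except OSError:
        return {}


if __name__ == "__main__":
    sample = "4:cpu,cpuacct:/docker/abc\n3:memory:/docker/abc\n"
    print(parse_proc_cgroup(sample))
    print(current_cgroups())
```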

cgroups hooks

task_struct
    ……
    css_set *cgroups
    list_head cg_list
    ……

css_set
    ……
    hlist_node hlist
    list_head tasks
    list_head cg_links
    cgroup_subsys_state *subsys[]
    ……

cg_cgroup_link
    list_head cgrp_link_list
    cgroup *cgrp
    list_head cg_link_list
    css_set *cg

cgroup
    ……
    cgroup_subsys_state *subsys[]
    list_head css_sets
    cgroupfs_root *root
    ……

cgroup_subsys_state
    ……
    cgroup *cgroup
    ……

cgroupfs_root
    ……
    int hierarchy_id
    list_head root_list
    list_head subsys_list
    ……

cgroup_subsys
    func create
    func destroy
    func attach
    func fork
    func exit
    ……
    int subsys_id
    cgroupfs_root *root
    list_head sibling
    ……

css_set_table
    ……
    css_set_hash()

cpuset
    ……

freezer
    ……

blkio_cgroup
    ……

……
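
One way to read the diagram: a task does not point at its cgroups directly but at a css_set, and tasks that sit in the same combination of cgroups share one css_set found through a hash table. The sketch below models only that sharing. It is a simplified stand-in written for illustration, not the kernel's implementation; for example, it never frees an empty css_set. The class name is invented.

```python
class CssSetTable:
    """Toy css_set_table: one shared css_set per combination of cgroups."""

    def __init__(self):
        self.sets = {}      # hashable key -> css_set
        self.task_css = {}  # task -> css_set (the role of task_struct.cgroups)

    def attach(self, task, cgroups):
        """Put a task into the given {hierarchy: cgroup} combination."""
        old = self.task_css.get(task)
        if old is not None:
            old["tasks"].discard(task)
        key = frozenset(cgroups.items())  # plays the role of css_set_hash()
        css_set = self.sets.setdefault(
            key, {"cgroups": dict(cgroups), "tasks": set()}
        )
        css_set["tasks"].add(task)
        self.task_css[task] = css_set
        return css_set


if __name__ == "__main__":
    table = CssSetTable()
    a = table.attach(1, {"memlimits": "/student", "cpulimits": "/student"})
    b = table.attach(2, {"memlimits": "/student", "cpulimits": "/student"})
    print(a is b, sorted(a["tasks"]))
```

Sharing the css_set keeps fork cheap in this model: a child only needs a reference to its parent's css_set instead of one link per hierarchy.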

References

• http://lwn.net/Articles/531114/

• https://www.kernel.org/doc/Documentation/cgroups

• https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/

Thanks!